Only consider relevant DATATYPEs (NMR SPECTRUM / FID) when reading JCAMP-DX #120

JLVarjo · 2020-06-18T07:53:46Z

JCAMP-DX files quite often contain all kinds of embedded metadata. This PR changes the parser to consider only the relevant DATATYPE sections, namely NMR SPECTRUM and NMR FID. Error messaging is also improved for cases where no data is found at all.

In addition, this PR fixes a bug in the recognition of the JCAMP pseudodigit format.
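For readers unfamiliar with the pseudodigit format: JCAMP-DX compressed data lines replace the leading digit of a number with a letter or symbol (SQZ, DIF, and DUP forms). The following is only a rough sketch of how such lines could be recognized. It is not the fix made in this PR, and the names `PSEUDO_DIGITS` and `has_pseudodigits` are hypothetical.

```python
# Character tables from the JCAMP-DX compression scheme (ASDF forms).
SQZ_DIGITS = "@ABCDEFGHIabcdefghi"   # squeezed: 0, +1..+9, -1..-9
DIF_DIGITS = "%JKLMNOPQRjklmnopqr"   # difference: 0, +1..+9, -1..-9
DUP_DIGITS = "STUVWXYZs"             # duplicate counts 1..9

PSEUDO_DIGITS = set(SQZ_DIGITS + DIF_DIGITS + DUP_DIGITS)


def has_pseudodigits(line):
    """Return True if a data line contains any pseudodigit character.

    Hypothetical helper for illustration; not part of nmrglue.
    """
    return any(ch in PSEUDO_DIGITS for ch in line)
```

Note that a naive check like this also flags the `E` of an exponent such as `1.0E5`, so a real parser needs additional context to tell plain numbers from compressed ones.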

kaustubhmote · 2020-06-23T06:27:06Z

@JLVarjo, This looks good!

I have only one small suggestion. As it stands, only NMRSPECTRUM and NMRFID are considered valid. I think some of the other datatypes, such as NMRPEAKASSIGNMENTS, might also be useful. Although the parsing for these datatypes is currently not ideal, one can still get the data and then parse these datasets manually.

Ideally, one can make an allowed_datatype list and just check whether the datatype is in this list before adding it. Something like:

allowed_datatype = ["NMRSPECTRUM", "NMRFID", "NMRPEAKTABLE", "NMRPEAKASSIGNMENTS", "LINK"]
if datatype.strip().upper().replace(" ", "") in allowed_datatype:
    ...

This way, if there are other datatypes that are considered valid later, we could just add them to this list. What do you think?
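A self-contained version of that check could look roughly like this. Only the list contents and the normalization come from the snippet above; the function names are hypothetical and not part of nmrglue.

```python
# Datatypes treated as valid; extend this list as new ones are supported.
ALLOWED_DATATYPES = [
    "NMRSPECTRUM", "NMRFID", "NMRPEAKTABLE", "NMRPEAKASSIGNMENTS", "LINK",
]


def normalize_datatype(datatype):
    """Strip outer whitespace, upper-case, and drop inner spaces."""
    return datatype.strip().upper().replace(" ", "")


def is_allowed_datatype(datatype):
    """Return True if the normalized datatype is in the allowed list."""
    return normalize_datatype(datatype) in ALLOWED_DATATYPES
```

With this, both `"NMR SPECTRUM"` and `" nmr fid "` pass the check, while an unrelated value such as `"INFRARED SPECTRUM"` does not.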

JLVarjo · 2020-06-23T06:49:03Z

The actual problem behind this PR was that different DATATYPE sections sometimes have the same data tags, which messed things up when they all ended up in one dict. That is why I decided to just drop the irrelevant ones here. Your idea to store the other data as well is feasible, though. What if I make _readrawdic return a dict of dicts instead, i.e. the key to the outer dict is the DATATYPE? The actual parser could then continue with the spectrum/fid, but all the other data would still be returned to the user.
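To illustrate the clash and the dict-of-dicts idea, here is a minimal sketch. It assumes a simplified input (a flat sequence of key/value pairs in which a DATATYPE entry starts a new section), which is not how _readrawdic actually reads a file; `group_by_datatype` is a hypothetical name.

```python
def group_by_datatype(records):
    """Group (key, value) pairs into one sub-dict per DATATYPE section.

    Assumption for this sketch: a 'DATATYPE' pair opens a new section and
    all following pairs belong to it.
    """
    sections = {}
    current = None
    for key, value in records:
        if key == "DATATYPE":
            current = value.strip().upper().replace(" ", "")
            sections.setdefault(current, {})
        elif current is not None:
            sections[current][key] = value
    return sections


records = [
    ("DATATYPE", "LINK"), ("TITLE", "link block"),
    ("DATATYPE", "NMR SPECTRUM"), ("TITLE", "1H spectrum"),
]

flat = dict(records)                  # the second TITLE overwrites the first
nested = group_by_datatype(records)   # each section keeps its own TITLE
```

In the flat dict only one TITLE survives, whereas `nested["LINK"]["TITLE"]` and `nested["NMRSPECTRUM"]["TITLE"]` are both retained.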

kaustubhmote · 2020-06-23T08:35:25Z

I see. In that case, what do you think about returning a dict of dicts, but also unpacking NMRFID and NMRSPECTRUM parts so that the default behaviour is not changed?

JLVarjo · 2020-06-23T09:06:58Z

I assume you want the main read() function to return only a (dic, data) tuple, like all the other parsers? So how about the following dic structure: the base level contains the data entries of the main datatype (NMRSPECTRUM or NMRFID) as it does currently, and all the data from other datatypes go into sub-dicts, for example with keys such as _datatype=LINK_<n>, _datatype=NMRPEAKASSIGNMENTS_<n> etc., where n is a running index of the section in the file, in case there are multiple similar ones?

kaustubhmote · 2020-06-23T10:09:06Z

Yes, that sounds perfect. This way, none of the existing code needs to change.
Would it be possible to have this as a list instead of a running index? For example, could this be refactored so that the other datatypes are accessed by dic["LINK"][0], dic["LINK"][1] and so on?

JLVarjo · 2020-06-23T10:59:17Z

Even better! Will make the changes soon.

JLVarjo · 2020-06-24T12:45:56Z

Changed the behavior as discussed. I think we have to use some prefix on the subdict keys, or they may clash with the DATATYPE tag of the actual data section itself. Chose to use _datatype_ here.
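The agreed structure can be sketched as follows. This is an illustration under stated assumptions, not the code merged in this PR: the input format, the name `build_dic`, and the exact key spelling (`_datatype_` followed by the normalized datatype name) are assumptions.

```python
MAIN_DATATYPES = ("NMRSPECTRUM", "NMRFID")


def build_dic(sections):
    """Build a dic from (datatype, content_dict) pairs given in file order.

    Entries of the main datatype stay at the base level; every other
    section is appended to a list under a prefixed key.
    """
    dic = {}
    for datatype, content in sections:
        name = datatype.strip().upper().replace(" ", "")
        if name in MAIN_DATATYPES:
            dic.update(content)
        else:
            dic.setdefault("_datatype_" + name, []).append(content)
    return dic


sections = [
    ("LINK", {"TITLE": "first link"}),
    ("NMR SPECTRUM", {"DATATYPE": "NMR SPECTRUM", "TITLE": "1H spectrum"}),
    ("LINK", {"TITLE": "second link"}),
]
dic = build_dic(sections)
```

Here `dic["TITLE"]` belongs to the spectrum, and the two LINK sections are reachable as `dic["_datatype_LINK"][0]` and `dic["_datatype_LINK"][1]`. The prefix keeps these keys from colliding with the DATATYPE tag of the main section.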

There is also a dummy test case to check that the dictionary structure is correct, please find it attached here:
dicstructure.zip

kaustubhmote · 2020-06-24T15:39:54Z

@JLVarjo This looks good to me. I am merging this now. Thanks, especially for the additional test. I think it really helps understand what the function does.

I have an additional suggestion here. I think the function _readrawdic is probably ripe for refactoring into separate reading and parsing functions. Currently, it is quite long and does quite a lot, and it could become hard to maintain as we add new functionality. This is not urgent of course, but something to keep in mind.
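One possible shape for such a split, as a rough sketch only: the function names are hypothetical, and the parsing shown here handles just plain `##KEY=value` lines, ignoring comments, multi-line values, and data tables.

```python
def read_lines(filename):
    """Reading step: return the file's lines without interpreting them."""
    with open(filename, "r") as f:
        return f.read().splitlines()


def parse_lines(lines):
    """Parsing step: collect '##KEY=value' lines into a dict of lists."""
    dic = {}
    for line in lines:
        if line.startswith("##") and "=" in line:
            key, value = line[2:].split("=", 1)
            dic.setdefault(key.strip(), []).append(value.strip())
    return dic
```

Keeping file access separate from interpretation means the parsing step can be tested on in-memory lists of lines, without needing sample files on disk.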

JLVarjo · 2020-06-25T05:24:44Z

Yes, can agree with you on that. Thanks for the pull!

JLVarjo added 2 commits June 18, 2020 10:39

Bug fix to JCAMP pseudodigit recognition

9c33573

Only consider relevant DATATYPEs (NMR SPECTRUM / FID) when reading JCAMP-DX

ff546cd

kaustubhmote mentioned this pull request Jun 19, 2020

Test datasets: Alternative needed for the Google-Code archive #87

Open

Store information of additional datatypes in subdicts

1c64f02

kaustubhmote merged commit 5834d55 into jjhelmus:master Jun 24, 2020

JLVarjo deleted the jcampdx-bugfixes branch June 25, 2020 05:25

JLVarjo mentioned this pull request Feb 20, 2023

Fix JCAMP-DX block reading #191

Merged
