Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only consider relevant DATATYPEs (NMR SPECTRUM / FID) when reading JCAMP-DX #120

Merged
merged 3 commits into from
Jun 24, 2020

Conversation

JLVarjo
Copy link
Contributor

@JLVarjo JLVarjo commented Jun 18, 2020

Quite often there's all kinds of metadata embedded in JCAMP-DX files. This PR changes the parser behavior to consider only the relevant DATATYPE sections, namely NMR SPECTRUM and NMR FID. Error messaging is also improved for cases where no data is found after all.

In addition, this PR fixes a bug in recognition of JCAMP pseudodigit format.

@kaustubhmote
Copy link
Collaborator

@JLVarjo, This looks good!

I have only one small suggestion. It seems like as it stands currently, only NMRSPECTRUM and NMRFID are considered valid. I think some of the other parameters such as NMRPEAKASSIGMMENTS might also be useful. Although the parsing for these datatypes is not currently not ideal, one can still get the data then manually parse these datasets.

Ideally, one can make a allowed_datatype list and just check whether the datatype is in this list before adding it. Something like:

allowed_datatype = ["NMRSPECTRUM", "NMRFID", "NMRPEAKTABLE", "NMRPEAKASSIGNMENTS" , "LINK"]
if datatype.strip().upper().replace(" ", "") in allowed_datatype:
...

This way, if there are other datatypes that are considered valid later, we could just add them to this list. What do you think?

@JLVarjo
Copy link
Contributor Author

JLVarjo commented Jun 23, 2020

The actual problem behind this PR was that sometimes the different DATATYPE sections have the same data tags, which messed up things if they end up to one dict. Which is why I decided here to just dump the irrelevant ones. Your idea to store other data as well is feasible though. What if I make _readrawdic to return a dict of dicts instead, i.e. the key to outer dict is the DATATYPE? The actual parser could then continue with the spectrum/fid but all the other data is still returned to user.

@kaustubhmote
Copy link
Collaborator

I see. In that case, what do you think about returning a dict of dicts, but also unpacking NMRFID and NMRSPECTRUM parts so that the default behaviour is not changed?

@JLVarjo
Copy link
Contributor Author

JLVarjo commented Jun 23, 2020

I assume you want the main read() function to return only (dic, data) tuple as all other parsers? So how about the following dic structure: the base level contains the data entries of the main datatype (NMRSPECTRUM or NMRFID) as currently, and all the data from other datatypes are in sub-dicts, for example with keys such as _datatype=LINK_<n>, _datatype=NMRPEAKASSIGNMENTS_<n> etc., in which n is a running index to the section in the file, in the case of multiple similar ones?

@kaustubhmote
Copy link
Collaborator

Yes, that sounds perfect. This way, none of the existing codes need to change.
Will it be possible to have this as a list instead of running index? For example, could this be refactored so that the other datatypes are accessed by dic["LINK"][0], dic["LINK"][1] and so on?

@JLVarjo
Copy link
Contributor Author

JLVarjo commented Jun 23, 2020

Even better! Will make the changes soon.

@JLVarjo
Copy link
Contributor Author

JLVarjo commented Jun 24, 2020

Changed the behavior as discussed. I think we have to use some prefix on the subdict keys, or they may clash with the DATATYPE tag of the actual data section itself. Chose to use _datatype_ here.

There is also a dummy test case to check that the dictionary structure is correct, please find it attached here:
dicstructure.zip

@kaustubhmote
Copy link
Collaborator

@JLVarjo This looks good to me. I am merging this now. Thanks, especially for the additional test. I think it really helps understand what the function does.

I have an additional suggestion here. I think the function _readrawdic is probably ripe for a refactoring into separate reading and a parsing function. Currently, I think it is quite long and seems quite a lot, and can potentially become hard to maintain as we add new functionality. This is not urgent of course, but something to keep in mind.

@kaustubhmote kaustubhmote merged commit 5834d55 into jjhelmus:master Jun 24, 2020
@JLVarjo
Copy link
Contributor Author

JLVarjo commented Jun 25, 2020

Yes, can agree with you on that. Thanks for the pull!

@JLVarjo JLVarjo deleted the jcampdx-bugfixes branch June 25, 2020 05:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants