[WIP] Implementing gemmi-based mmcif reader (with easy extension to PDB/PDBx and mmJSON) #4712

Draft
marinegor wants to merge 27 commits into base: develop

Contributor

@marinegor marinegor commented Sep 20, 2024

Fixes #2367 and also extends #4303

Changes made in this Pull Request:

  • uses the gemmi library to parse mmCIF files
  • adds two classes for that: MMCIFReader(base.SingleFrameReaderBase) and MMCIFParser(TopologyReaderBase)
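
To make the first bullet concrete, here is a minimal sketch of reading coordinates through gemmi. It is not the code in this PR: the function name first_model_coordinates is hypothetical, and it assumes gemmi is installed and that only the first model is wanted.

```python
def first_model_coordinates(path):
    """Return [x, y, z] for every atom in the first model of a structure file."""
    # Imported lazily so that defining this function does not require gemmi.
    import gemmi

    # With default arguments gemmi infers the format from the file name.
    structure = gemmi.read_structure(path)
    model = structure[0]  # first model only (an assumption of this sketch)
    return [
        [atom.pos.x, atom.pos.y, atom.pos.z]
        for chain in model
        for residue in chain
        for atom in residue
    ]
```

A single-frame reader could wrap a list like this into its one timestep; how that is wired into SingleFrameReaderBase is left out here.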

As a bonus, this implementation could potentially read any of the gemmi-supported formats:

  • mmCIF (PDBx/mmCIF),
  • PDB (with popular extensions),
  • mmJSON

With slight modifications, this would also allow reading mmCIF files with multiple models sharing the same topology, as well as more feature-rich parsing of PDBs (the same code can be used unchanged to parse altlocs, charges, etc., from all of these formats).

However, I'm slightly lost on what needs to be done next for this PR to be merged, so I'm asking if someone could help me navigate here (tagging @richardjgowers as the author of the original PDBx implementation, #4303).

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

Developers certificate of origin


📚 Documentation preview 📚: https://mdanalysis--4712.org.readthedocs.build/en/4712/

pep8speaks commented Sep 20, 2024

Hello @marinegor! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 28:80: E501 line too long (84 > 79 characters)
Line 41:80: E501 line too long (85 > 79 characters)
Line 42:80: E501 line too long (93 > 79 characters)
Line 61:80: E501 line too long (104 > 79 characters)
Line 65:80: E501 line too long (87 > 79 characters)
Line 67:80: E501 line too long (107 > 79 characters)

Line 2:24: W291 trailing whitespace
Line 60:80: E501 line too long (111 > 79 characters)
Line 64:80: E501 line too long (88 > 79 characters)
Line 68:80: E501 line too long (123 > 79 characters)
Line 78:80: E501 line too long (122 > 79 characters)
Line 102:80: E501 line too long (108 > 79 characters)
Line 109:80: E501 line too long (80 > 79 characters)
Line 124:80: E501 line too long (91 > 79 characters)
Line 168:80: E501 line too long (81 > 79 characters)
Line 169:80: E501 line too long (126 > 79 characters)
Line 179:80: E501 line too long (125 > 79 characters)
Line 218:80: E501 line too long (126 > 79 characters)
Line 236:80: E501 line too long (140 > 79 characters)
Line 275:80: E501 line too long (87 > 79 characters)
Line 286:80: E501 line too long (90 > 79 characters)

Line 49:80: E501 line too long (80 > 79 characters)
Line 50:80: E501 line too long (84 > 79 characters)

Line 336:26: W292 no newline at end of file

Line 48:80: E501 line too long (103 > 79 characters)
Line 81:80: E501 line too long (80 > 79 characters)
Line 97:80: E501 line too long (86 > 79 characters)
Line 271:80: E501 line too long (90 > 79 characters)
Line 339:80: E501 line too long (104 > 79 characters)
Line 386:80: E501 line too long (83 > 79 characters)
Line 435:80: E501 line too long (80 > 79 characters)
Line 462:80: E501 line too long (80 > 79 characters)
Line 480:80: E501 line too long (80 > 79 characters)
Line 492:80: E501 line too long (80 > 79 characters)
Line 493:80: E501 line too long (80 > 79 characters)
Line 496:80: E501 line too long (83 > 79 characters)
Line 497:80: E501 line too long (86 > 79 characters)
Line 545:80: E501 line too long (82 > 79 characters)
Line 546:80: E501 line too long (82 > 79 characters)
Line 548:80: E501 line too long (88 > 79 characters)
Line 550:80: E501 line too long (88 > 79 characters)
Line 551:80: E501 line too long (81 > 79 characters)
Line 776:80: E501 line too long (81 > 79 characters)
Line 777:80: E501 line too long (87 > 79 characters)
Line 778:80: E501 line too long (84 > 79 characters)
Line 779:80: E501 line too long (85 > 79 characters)
Line 780:80: E501 line too long (83 > 79 characters)

Line 37:80: E501 line too long (88 > 79 characters)
Line 49:80: E501 line too long (84 > 79 characters)

Comment last updated at 2024-10-02 17:52:18 UTC

github-actions bot commented Sep 20, 2024

Linter Bot Results:

Hi @marinegor! Thanks for making this PR. We linted your code and found the following:

Some issues were found with the formatting of your code.

Code Location Outcome
main package ⚠️ Possible failure
testsuite ⚠️ Possible failure

Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/11148966346/job/30986736623


Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!

Member

@richardjgowers richardjgowers left a comment

Looks good so far, will require a small test file to check reader/parser halves.

pass


class MMCIFWriter(base.WriterBase):
Member

I wouldn't include this at this stage, Writer is optional

from .base import TopologyReaderBase


def _into_idx(arr: list[int]) -> list[int]:
Member

Document what this does, ideally with an example
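
As an illustration of the kind of docstring example being asked for, here is a guess at what such a helper might do, inferred only from its signature and from its use on res.seqid.num in the diff. The name into_idx_sketch and the behaviour (numbering runs of equal consecutive values from zero) are assumptions, not the PR's actual implementation.

```python
from itertools import groupby


def into_idx_sketch(arr):
    """Map each run of equal consecutive values to a zero-based run index.

    Hypothetical stand-in for _into_idx; the real helper may differ.
    """
    return [
        run_index
        for run_index, (_key, run) in enumerate(groupby(arr))
        for _item in run
    ]


print(into_idx_sketch([1, 1, 5, 5, 7, 3, 3]))  # [0, 0, 1, 1, 2, 3, 3]
```

One caveat of this approach: two neighbouring residues that happen to share a sequence number (for example across a chain boundary) would collapse into one index, so a real implementation might need to group on more than the number alone.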

record_types, # res.het_flag
tempfactors, # at.b_iso
residx, # _into_idx(res.seqid.num)
) = map( # this is probably not pretty, but it's efficient -- one loop over the mmcif
Member

Are all the fields here guaranteed in a valid pdbx? One benefit to working column by column is that you can do optional columns

Contributor Author

Do you have an example of a PDBx in mind, or like a test set for them? I've never actually worked with the format, since in RCSB afaik we have only pdb or mmcif

Member

PDBx is mmCIF. The download links here will give you an example file: https://www.rcsb.org/structure/4ake (we use 4ake elsewhere in the testsuite). In my experience, the PDB and mmCIF versions of the same entry sometimes aren't completely identical, so I wouldn't worry about trying to align the PDB and PDBx tests.
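
The column-by-column idea raised in this thread can be sketched without any parser at all. The example below uses a plain dict as a toy stand-in for an atom_site table; the helper name column_or_default is hypothetical, and treating pdbx_formal_charge as the missing column is just an assumption for the demonstration.

```python
def column_or_default(table, name, n_rows, default):
    """Return table[name] if the column exists, else n_rows copies of default."""
    values = table.get(name)
    if values is None:
        return [default] * n_rows
    if len(values) != n_rows:
        raise ValueError(f"column {name!r} has {len(values)} rows, expected {n_rows}")
    return list(values)


# Toy table: values are strings, as they would be in a text file.
atom_site = {
    "label_atom_id": ["N", "CA", "C"],
    "B_iso_or_equiv": ["12.5", "11.9", "13.1"],
    # no "pdbx_formal_charge" column in this toy table
}

n_atoms = len(atom_site["label_atom_id"])
tempfactors = [
    float(v) for v in column_or_default(atom_site, "B_iso_or_equiv", n_atoms, "0.0")
]
charges = [
    int(v) for v in column_or_default(atom_site, "pdbx_formal_charge", n_atoms, "0")
]
```

Each attribute is handled on its own, so a missing column only affects that attribute instead of breaking a single loop that expects every field on every atom.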

np.array,
list(
zip(
*[
Member

I'm struggling to follow the logic here, a comment breaking down what this double nested loop iteration into a zip is doing would be nice
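
For readers trying to follow the pattern under discussion, here is a stripped-down sketch of "one pass, then zip". The Residue class and atom helper are stand-ins invented for this example (only the attribute names het_flag, b_iso and seqid.num come from the diff comments), and plain lists are used instead of np.array to keep it dependency-free.

```python
from types import SimpleNamespace


class Residue(list):
    """Stand-in for a residue: a list of atoms plus two attributes."""

    def __init__(self, seqnum, het_flag, atoms):
        super().__init__(atoms)
        self.seqid = SimpleNamespace(num=seqnum)
        self.het_flag = het_flag


def atom(name, b_iso):
    return SimpleNamespace(name=name, b_iso=b_iso)


# One model -> one chain -> two residues (toy data, not from a real file).
model = [
    [
        Residue(1, "A", [atom("N", 10.0), atom("CA", 11.0)]),
        Residue(2, "H", [atom("O", 30.0)]),
    ]
]

# Step 1: a single pass over the hierarchy builds one tuple (row) per atom.
rows = [
    (at.name, res.het_flag, at.b_iso, res.seqid.num)
    for chain in model
    for res in chain
    for at in res
]

# Step 2: zip(*rows) transposes rows into columns, one per attribute.
names, record_types, tempfactors, resnums = map(list, zip(*rows))
```

Note that the unpacking in step 2 fails on an empty structure, because zip(*[]) yields nothing to unpack; that edge case would need its own handling.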

@@ -78,6 +78,7 @@ extra_formats = [
"pytng>=0.2.3",
"gsd>3.0.0",
"rdkit>=2020.03.1",
"gemmi", # for mmcif format
Member

This will probably be optional, so other imports will have to respect that too
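
A common way to respect an optional dependency is to guard the import and fail only when the feature is actually used. The sketch below shows that general pattern; the names HAS_GEMMI and require_gemmi are hypothetical and not taken from this PR.

```python
try:
    import gemmi
except ImportError:
    gemmi = None  # the package stays importable without gemmi

HAS_GEMMI = gemmi is not None


def require_gemmi():
    """Return the gemmi module, or raise a helpful error if it is missing."""
    if not HAS_GEMMI:
        raise ImportError(
            "Reading mmCIF files needs the optional 'gemmi' package; "
            "install it, for example with `pip install gemmi`."
        )
    return gemmi
```

With this shape, the reader and parser modules can be imported unconditionally, and tests that need gemmi can be skipped based on the flag.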

Development

Successfully merging this pull request may close these issues.

PDBx/mmCIF Reader/Topology Reader
3 participants