
0003-scope-of-pymatgen-io-addons.md #3

Merged
merged 7 commits into main
Nov 14, 2023

Conversation

rkingsbury
Collaborator

Clarifying the intended scope of pymatgen-io-xxxx addon packages.

This is a question I've wanted to discuss for a while, and one I think is very important for defining what our future ecosystem looks like. I decided to raise it now in particular because it came up recently in the context of LAMMPS development (see materialsproject/pymatgen#2754).

@JaGeo
Member

JaGeo commented Nov 28, 2022

Thanks for raising this, @rkingsbury. I think this is relevant for all pymatgen.io developers.

@shyuep
Member

shyuep commented Nov 28, 2022

  1. I do not understand what "pymatgen dependency bloat" has to do with the discussion. If the add-on depends on pymatgen anyway, that requires ALL pymatgen dependencies. Whether or not pymatgen has dependency bloat (questionable at best), it affects all downstream packages whether an add-on architecture is used or not.
  2. I already clarified the scope of what should or should not be an add-on in my response to the LAMMPS question. You have ignored all those considerations (especially the most important one).
  3. I reiterate that the add-on architecture is meant to delegate less used implementations to external developers, not fragment the core functionality itself. As a pymatgen user, I expect VASP IO to be part and parcel of the initial install.

The overriding "bad" for options 1 and 2 is simply that some core functionality sits in a separate code base for which there is no visibility and less stringent code quality control. In particular, unit testing is frequently patchy. In fact, some developers who have duplicated code in the past have admitted that "we reimplemented X functionality in Y code because we do not want to have to deal with unittesting". I am fine with this happening for something like pymatgen-io-simulation_code_used_by_1%_of_users but not for pymatgen-io-vasp.

@rkingsbury
Collaborator Author

Just a point of clarification: addon packages would not necessarily have to be maintained by people outside the core MP team. On the contrary, if we were to consider moving core codes like VASP, Q-Chem, or LAMMPS into addons, then I would strongly advocate that those packages be maintained by the MP team / Foundation members and subject to the same CI and testing standards currently in place for pymatgen.

  1. I do not understand what "pymatgen dependency bloat" has to do with the discussion. If the add-on depends on pymatgen anyway, that requires ALL pymatgen dependencies. Whether or not pymatgen has dependency bloat (questionable at best), it affects all downstream packages whether an add-on architecture is used or not.

I didn't mean that in a pejorative way; however, installing pymatgen (especially from GitHub) now requires more than 1 GB of space due to the many test files needed to test the many different codes. This has caused problems for me on multiple occasions. If we could separate, e.g., VASP- or Q-Chem-specific test files and code into addon packages, that would make installation a lot easier and reduce the time it takes to complete CI testing for changes in core pymatgen.

  2. I already clarified the scope of what should or should not be an add-on in my response to the LAMMPS question. You have ignored all those considerations (especially the most important one).

I'm not trying to ignore the considerations you articulated in that discussion; they are important and I understand where you're coming from. Your group uses LAMMPS a lot so it makes complete sense that the LAMMPS IO remains in a package that you maintain. But keeping it in a separate package (that you maintain) could have benefits for users that use pymatgen in different ways.

  3. I reiterate that the add-on architecture is meant to delegate less used implementations to external developers, not fragment the core functionality itself. As a pymatgen user, I expect VASP IO to be part and parcel of the initial install.

The overriding "bad" for options 1 and 2 is simply that some core functionality sits in a separate code base for which there is no visibility and less stringent code quality control. In particular, unit testing is frequently patchy. In fact, some developers who have duplicated code in the past have admitted that "we reimplemented X functionality in Y code because we do not want to have to deal with unittesting". I am fine with this happening for something like pymatgen-io-simulation_code_used_by_1%_of_users but not for pymatgen-io-vasp.

Yes I agree that the quality of some pymatgen-io-xxx packages will not be on par with pymatgen itself. But is there any reason we can't maintain the addon packages for the codes we rely on? I don't envision some external developer maintaining pymatgen-io-vasp for example. I'd imagine the MP team would do that, subject to the same linting and CI standards we already have. It simply makes installation and packaging more flexible for other use cases.

In any case, if the intent all along was for the addon architecture to "delegate less used implementations to external developers" then I think we could be more explicit about that in the docs, because I had the impression that this was partially intended to reduce the installation size and dependency count of pymatgen over time.

@janosh
Member

janosh commented May 1, 2023

Ported here from Slack for added context and longevity (per @tschaume's suggestion):


@rkingsbury [...] thoughts on a few agenda items:

  2. Re: addon packages, I think it’s very important that we articulate a clear vision for what should and should not be an addon package. We could parallel the decision we made on document models and say 1) any code that is used in production calculations by MP stays in pymatgen and 2) any other code IO becomes an addon package. Personally I think there are strong arguments to make IO for most codes into addon packages (as I said in the PR) but whatever way we go, criteria for what goes where are the most important thing.
  3. Addon packages may also be important for navigating compatibility. I could be wrong, but I feel it is going to be difficult to preserve backward compatibility as more and more workflows migrate to the exciting atomate2 infrastructure. Having a single version of pymatgen that still works with atomate/fireworks + atomate2 and new document models may not be feasible. In that scenario, could we use addon packages to override the legacy pymatgen code? To use VASP as an example - if you want to use atomate/FW, use what comes with pymatgen. If you want to use atomate2, install pymatgen-io-vasp. This compatibility thing may need to turn into a separate discussion but I wanted to raise it here.

@janosh Erring on the side of addon packages for new IO seems sensible. pymatgen already feels a bit monolithic; errors in one part of the code can hold back releases with fixes and features for other parts.

@janosh 3. could probably be implemented with import hooks, but is maybe a bit too much magic? I can foresee a new class of issues arising from people not knowing parts of pymatgen are being shadowed when installing an addon and getting broken code as a result.

@mkhorton well put, this is what I meant by 3. being dicey

@shyuep
Member

shyuep commented May 1, 2023

@janosh and @rkingsbury I have already stated my opinion on pymatgen-io-vasp. Of course, I cannot prevent you from actually implementing a pymatgen-io-vasp that overrides the version in pymatgen. But doing so will be without my support and I will not give any consideration to those efforts in future developments. The question of backward compatibility is a red herring. We have implemented plenty of non-backward-compatible changes before. It just has to be managed.

All the talk about the file size of pymatgen is nonsense. The PyPI wheel of pymatgen is 10 MB. Users don't install the GitHub version.

@janosh
Member

janosh commented May 1, 2023

I have already stated my opinion on pymatgen-io-vasp. Of course, I cannot prevent you from actually implementing a pymatgen-io-vasp that overrides the version in pymatgen.

I should perhaps have emphasized the "new" in "new IO addons". Initially, they often require a high release cadence and have more bugs than we'd like in core pymatgen, hence addons seem like a good place to start them. If they're namespace packages, it should be fairly easy to integrate them into pymatgen upstream once they mature and long-time users care to do it. I wasn't thinking of a new VASP IO package there.
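For readers unfamiliar with the mechanism, here is a minimal, self-contained sketch of how PEP 420 implicit namespace packages let a separately installed addon slot into an upstream namespace. The names `pmgns` and `mycode` are hypothetical stand-ins for `pymatgen` and a real addon; this is an illustration, not pymatgen's actual layout:

```shell
set -e
root=$(mktemp -d)

# An addon distribution ships only its leaf package. Crucially, there is
# NO __init__.py at the "pmgns" or "pmgns/io" levels, so any number of
# separately installed distributions can merge under the same namespace.
mkdir -p "$root/pmgns/io/mycode"
cat > "$root/pmgns/io/mycode/__init__.py" <<'EOF'
def parse(path):
    return f"parsed {path}"
EOF

# Python resolves pmgns.io.mycode by stitching the namespace together
# from every matching directory on sys.path.
out=$(PYTHONPATH="$root" python3 -c \
    "from pmgns.io.mycode import parse; print(parse('INPUT'))")
echo "$out"
```

Because no distribution "owns" the namespace root, upstreaming a matured addon is mostly a matter of moving the leaf directory into the core package.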

@JaGeo
Member

JaGeo commented May 1, 2023

My personal preference would also be that currently very important functionalities stay in pymatgen (even though I, of course, have no real vote on this). This also makes sure that they still work after changes in any part of pymatgen. Discoverability of such add-ons would also need to be ensured otherwise (yes, we have been discussing a website for this, but it's still easier this way).
If test file size is such an issue, one could think about reminding developers to keep the files as small as possible (I think that already happens) and maybe spend some time reducing test file sizes (e.g., with gzip).

@shyuep
Member

shyuep commented May 1, 2023

@janosh Ok. Your example used VASP. I thought you were thinking of overriding pymatgen.io.vasp.

For IO of other packages not currently in pymatgen, I have no issues with them being add-ons.

@rkingsbury
Collaborator Author

@janosh Ok. Your example used VASP. I thought you were thinking of overriding pymatgen.io.vasp.

Apologies for any confusion there, @shyuep. My takeaway from our prior discussion (March) was that VASP is central to pymatgen and should stay in, in large part because many/most of the core pymatgen maintainers and developers also use VASP.

The PyPI wheel of pymatgen is 10 MB. Users don't install the GitHub version.

This is a good point (which I didn't appreciate until recently). However, as a counterpoint, developers implementing new workflows will usually need to install the GitHub version on HPC clusters in order to test their work, and that's where the large install size becomes problematic (I ran into multiple issues with this during the r2SCAN work).

For IO of other packages not currently in pymatgen, I have no issues with them being add-ons.

I agree it makes sense for IO for new codes to be in addon packages. I also think it would be beneficial to move some codes currently in pymatgen (which are not used or maintained as much by the core pymatgen team) into addons. Q-Chem comes to mind here. As far as I'm aware, it's maintained almost entirely by Sam, Evan, and a handful of others in the Persson group who do molecular calculations. If it were a separate addon package, @shyuep, @janosh, and the other pymatgen maintainers would have less work to do reviewing PRs, and the molecular team would be able to iterate faster by keeping review and merge internal. The onus would be on them to keep up with changes in upstream pymatgen.

This could also help focus specific user groups (in this example, pymatgen + Q-Chem users) into dedicated GitHub repos where their issues and discussions would not be intermingled with those that aren't relevant to their work.

We could make the transition seamless by initially making pymatgen-io-qchem a dependency of pymatgen, so that it still gets installed by default, and then phasing that out later.
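As a sketch of that transitional step (hypothetical, since no `pymatgen-io-qchem` package exists at this point), core pymatgen's packaging metadata could temporarily list the addon as a dependency and drop it in a later release:

```toml
# pyproject.toml of core pymatgen (illustrative sketch only)
[project]
name = "pymatgen"
dependencies = [
    # transitional: keeps Q-Chem IO installed by default; to be
    # removed once users have migrated to installing it explicitly
    "pymatgen-io-qchem>=0.1",
]
```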

@JaGeo
Member

JaGeo commented May 9, 2023

@rkingsbury One problem for me with developing codes was always finding out who the currently active maintainer is. With more and more add-ons, it might become really hard for other people to contribute, as you then need to rely on more and more maintainers and also have to wait for their responses. In general, however, I agree with the other points mentioned. We also decided to move part of the lobster functionalities to another code (partially because the scope was a bit beyond typical pymatgen scope, with features for ML, automated analysis, etc.).

@janosh
Member

janosh commented May 9, 2023

However as a counterpoint, developers implementing new workflows will usually need to install the GitHub version on HPC clusters in order to test their work

That! Fwiw, I agree with @rkingsbury that the current repo size is a problem. Very annoying when git clone takes minutes. Imagine someone without fiber trying to clone. They can make coffee and drink it before they can start working. 😆

@shyuep
Member

shyuep commented May 9, 2023

I think you guys are focusing on the wrong problem here. The size of the repo is primarily the result of the test files, which are huge (e.g., WAVECARs, CHGCARs, etc.). Anyway, I have no objections if someone wants to move qchem entirely to an add-on package.

@janosh
Member

janosh commented May 9, 2023

I think you guys are focusing on the wrong problem here.

True. Repo size is off topic.

@rkingsbury
Collaborator Author

rkingsbury commented May 11, 2023

I think you guys are focusing on the wrong problem here.

True. Repo size is off topic.

Well, I know that it's the test files that really blow up the repo size, and every code included in pymatgen.io adds more test files. So I was thinking one of the benefits of moving more codes into addon packages would be to reduce the core repo size.

However, looking at the file size breakdown in test_files shows that the lion's share of the space is VASP test files (i.e., the ones in the root directory). Removing the Q-Chem test files (the molecules directory) would save only about 70 MB. So since we're keeping VASP IO in the core repo, we're stuck with that problem.

2.4M	test_files/BTO_221_99_polarization
6.2M	test_files/absorption
 32M	test_files/neb_analysis
 70M	test_files/molecules
8.3M	test_files/grid_data_files
1.3M	test_files/phonopy
7.7M	test_files/nwchem
 16M	test_files/surface_tests
1.4M	test_files/abinit
 29M	test_files/path_finder
 11M	test_files/chemenv
 28M	test_files/bader
7.3M	test_files/boltztrap2
 12M	test_files/Zr_Vasprun
5.5M	test_files/relaxation
4.2M	test_files/gruneisen
 55M	test_files/boltztrap
 11M	test_files/chgden
2.6M	test_files/reproduce_eps
8.2M	test_files/nmr
 46M	test_files/cohp
829M	test_files

Since addon packages for IO are obviously not a solution to the size issue (unless/until we decide to move VASP into an addon), I'll open a separate discussion for that, as @janosh suggested.

@JaGeo
Member

JaGeo commented May 11, 2023

@rkingsbury Thanks for listing this. When we next work on the lobster modules, @naik-aakash and I will check whether we can reduce test file sizes as well. I am currently not planning to move the Lobster IO out.

@shyuep
Member

shyuep commented May 11, 2023

I don't think there is a solution that achieves the primary aim here. Developers are supposed to have the test files. They are supposed to be writing tests. Can we move the files to some other location and say developers have to go somewhere to download only the files they need? Yes of course. But that would add an additional step and discourage developers from writing and running tests.

Having developers write and run tests is non-negotiable.

Also, cloning is a one-time affair. Please do not focus on the imaginary problem of the hypothetical dev who lives on sub-100 Mbps broadband taking 1 hr to clone pymatgen for the first time. He does not subsequently have to suffer the same 1 hr wait. He can go watch a Netflix movie while he waits. I have never taken more than a few minutes to clone pymatgen, and I don't have gigabit internet.

@rkingsbury
Collaborator Author

Also, cloning is a one-time affair. Please do not focus on the imaginary problem of the hypothetical dev who lives on sub-100 Mbps broadband taking 1 hr to clone pymatgen for the first time. He does not subsequently have to suffer the same 1 hr wait. He can go watch a Netflix movie while he waits. I have never taken more than a few minutes to clone pymatgen, and I don't have gigabit internet.

Fair points. I have also found cloning fast enough on my own local systems, but on HPC clusters I've found myself having to set up new environments frequently (multiple times per quarter) and needing to re-clone each time. I've also commonly encountered slow download and unpack speeds on HPC systems specifically. It sounds like @janosh has had a similar experience, so I think it's worth discussing creative ways of reducing test file sizes at some point. But that's a discussion for another thread.

@shyuep
Member

shyuep commented May 11, 2023

@rkingsbury I am not sure why the pymatgen repo needs to be on HPCs, since you can always install the 10 MB PyPI version. But even if you want to clone, I think the sparse checkout function of git will do. See https://stackoverflow.com/questions/600079/how-do-i-clone-a-subdirectory-only-of-a-git-repository
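To make the suggestion concrete, here is a self-contained sketch of sparse checkout run against a throwaway local repo. The directory names only mirror pymatgen's layout for illustration; a real invocation would clone github.com/materialsproject/pymatgen and requires Git 2.25+:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in "upstream" repo: a small code directory plus a bulky
# test_files tree, mimicking the pymatgen layout.
git init -q upstream
cd upstream
mkdir -p pymatgen test_files
echo "print('core')" > pymatgen/core.py
head -c 1048576 /dev/zero > test_files/WAVECAR   # pretend 1 MB test artifact
git add -A
git -c user.email=dev@example.com -c user.name=dev commit -qm "init"
cd ..

# Clone, then restrict the working tree to the code directory
# (cone mode). test_files vanishes from the checkout, though the
# objects are still present in .git.
git clone -q upstream sparse
git -C sparse sparse-checkout set pymatgen
ls sparse
```

Note that sparse checkout trims the working tree, not the object database, so the biggest savings come from combining it with a shallow clone (`--depth 1 --filter=blob:none`).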

@shyuep
Member

shyuep commented May 11, 2023

I would add that, a long time ago, I did think of separating the code from the tests into separate repos, e.g., a pymatgen-dev repo which contains submodules pymatgen (the code) and pymatgen-tests (just tests and test_files). People who just want code can clone the pymatgen submodule, and devs who need everything can clone the super-repo pymatgen-dev.

In the end, the administrative overhead just doesn't seem worth it. Like I said, most people who bother with the pymatgen GitHub repo in the first place are devs, and they need the test files. Pure users can just install from PyPI or conda. There are very few cases where we have a quasi-dev who wants to clone the pymatgen source but not the test files.

@janosh
Member

janosh commented May 11, 2023

In the end, the administrative overhead just doesn't seem to be worth it.

I probably agree with that. Two things, though:

  1. The test files are a perfect fit for git lfs; that's exactly what it was designed for. Everything would still be in a single repo, it would all be versioned, and the test files wouldn't be downloaded by default. It would be a single git command to get all of them, or only the ones you need, when you need to run tests. But I really don't like git lfs; I had some bad experiences with it in the past.
  2. Not all test files are compressed. I don't think there's sufficient benefit in having plain-text files to warrant the extra file size. An easy win would be to gzip or bz2 everything.
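As a quick sanity check of point 2, plain-text simulation outputs are highly repetitive and compress very well. A sketch with a throwaway file standing in for, e.g., an OUTCAR (pymatgen's parsers can generally read gzipped files transparently via monty's `zopen`, so tests would mostly just need paths updated):

```shell
set -e
tmp=$(mktemp -d)

# Throwaway stand-in for a plain-text test file such as an OUTCAR.
yes "energy  without entropy =    -10.0" | head -n 1000 > "$tmp/OUTCAR"
plain=$(wc -c < "$tmp/OUTCAR")

gzip "$tmp/OUTCAR"            # writes OUTCAR.gz and removes the original
packed=$(wc -c < "$tmp/OUTCAR.gz")

echo "plain=${plain} gzipped=${packed}"
```

For real VASP outputs the ratio is less extreme than for this synthetic file, but text formats still typically shrink severalfold.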

@JaGeo
Member

JaGeo commented May 11, 2023

@janosh 2 sounds like a good first idea from my perspective. 1 sounds painful to communicate to (first-time) contributors (though I might be wrong).

@rkingsbury
Collaborator Author

But even if you want to clone, I think the sparse checkout function of git will do. See https://stackoverflow.com/questions/600079/how-do-i-clone-a-subdirectory-only-of-a-git-repository

This seems like a great solution, @shyuep! I was not aware of this capability. Perhaps we can say something about it in the Developer Guide that we are working on. I agree that not very many users will care, but for those doing heavy workflow development, where you need to iterate on your IO code in pymatgen and your workflow code in atomate(2) at the same time (and test on an HPC), this could be extremely useful.

  1. The test files are a perfect fit for git lfs. That's exactly what it was designed for. Everything would still be in a single repo, it would all be versioned and the test files wouldn't be downloaded by default. But it would be a single git command to get all or only the ones you need when you need to run tests.

This also sounds like a potentially great solution, though I don't know anything about it. Maybe we should ask around whether others have had better experiences?

2. Not all test files are compressed. I don't think there's sufficient benefit in having plain text files to warrant the extra file size. An easy win would be to gzip or bz2 everything.

I was wondering about this myself. I can't think of any reason not to compress all the test files.

Thanks for all the good ideas! In the interest of not straying too far off-topic, I will open a separate PR about repo size and try to capture the above discussion points in the associated decision record.

@shyuep
Member

shyuep commented May 17, 2023

@rkingsbury Added. See https://pymatgen.org/contributing.html

@janosh @JaGeo I tried git lfs a long time ago. It is a serious pain in the ass. Maybe things have improved since then. You can test and let me know...

@janosh
Member

janosh commented May 17, 2023

@janosh @JaGeo I tried git lfs a long time ago. It is a serious pain in the ass. Maybe things have improved since then. You can test and let me know...

Last time I used git lfs was ~1 year ago. It was still a PITA...

@shyuep
Member

shyuep commented May 17, 2023

@rkingsbury Note that I modified the instructions a bit from the link I sent earlier. I tested these instructions myself. The overall process takes up around 140 MB, instead of 900 MB for the full clone. Not an order of magnitude difference, but still a significant saving.

@janosh
Member

janosh commented May 17, 2023

Added. See https://pymatgen.org/contributing.html

@shyuep Nice! Thanks for adding! After compressing all test files (see tracking issue materialsproject/pymatgen#2994), I think we can call the pymatgen repo size issue solved.

# if you have Git 2.25+, you can use the sparse-checkout command.
# git sparse-checkout set --no-cone pymatgen setup.py pyproject.toml requirements.txt requirements-optional.txt

# The code below should work for any git after 2012.

I initially misread this as "run the commented-out command if you have Git 2.25+; else run the commands below the 2nd comment." Maybe just remove the 2nd comment and add "Git 2.25+ (released January 2020)" to the first comment?

@shyuep
Member

shyuep commented May 17, 2023

One thing: I am all for compressing the test files, but I doubt it will do much. Git actually compresses everything by default during transfer.

@janosh
Member

janosh commented May 17, 2023

I never understood this in detail, but I think the compression git does is over change patches, not the actual files in the repo. Your point is probably correct in that git's compression is likely even better than file compression. Still, even if clone speed is unaffected, we'd definitely save disk space.
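For what it's worth, git's object store is itself zlib-compressed on disk as well as in transfer, which can be checked with a throwaway repo (a sketch; file contents and sizes are illustrative, and `du -sb` assumes GNU coreutils):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q .

# Repetitive plain-text file standing in for a test artifact.
yes "energy without entropy = -10.0" | head -n 5000 > OUTCAR
git add OUTCAR
git -c user.email=dev@example.com -c user.name=dev commit -qm "add"
git gc -q                      # repack: loose objects become a deflated packfile

plain=$(wc -c < OUTCAR)
pack=$(du -sb .git/objects | cut -f1)
echo "worktree=${plain} packed_objects=${pack}"
```

So gzipping test files mostly saves space in the working tree (the checked-out copy), not in `.git` itself.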

@rkingsbury
Collaborator Author

@rkingsbury Note that I modified the instructions a bit from the link I sent earlier. I tested these instructions myself. The overall process takes up around 140 MB, instead of 900 MB for the full clone. Not an order of magnitude difference, but still a significant saving.

Wow, this is great! Thanks for doing this, @shyuep.

One thing: I am all for compressing the test files, but I doubt it will do much. Git actually compresses everything by default during transfer.

I never understood this in detail, but I think the compression git does is over change patches, not the actual files in the repo. Your point is probably correct in that git's compression is likely even better than file compression. Still, even if clone speed is unaffected, we'd definitely save disk space.

I am still in favor of compressing and perhaps better organizing the test files, just to save some disk space and make it easier to understand which test files go with which parts of the code. But as stated, I think this is a lot less important now that we have the option to do a sparse checkout.

@rkingsbury rkingsbury added this to the ntation milestone Jun 26, 2023
@utf
Member

utf commented Nov 14, 2023

Merging based on the discussion at the last meeting.

@utf utf merged commit 219683b into main Nov 14, 2023