0003-scope-of-pymatgen-io-addons.md #3
Conversation
Thanks for raising this, @rkingsbury. I think this is relevant for all pymatgen.io developers.
The overriding "bad" for options 1 and 2 is simply that some core functionality sits in a separate code base for which there is no visibility and less stringent code-quality control. In particular, unit testing is frequently patchy. In fact, some developers who have duplicated code in the past have admitted that "we reimplemented X functionality in Y code because we do not want to have to deal with unit testing". I am fine with this happening for something like [...]
Just a point of clarification: I don't think addon packages necessarily have to be maintained by people outside the core MP team. On the contrary, if we were to consider moving core codes like VASP, Q-Chem, or LAMMPS into addons, then I would strongly advocate that those packages be maintained by the MP team / Foundation members and subject to the same CI and testing standards currently in place for pymatgen.
I didn't mean that in a pejorative way; however, installing pymatgen (especially from GitHub) now requires more than 1 GB of space due to the many test files needed to test so many different codes. This has caused problems for me on multiple occasions. If we could separate, e.g., VASP- or Q-Chem-specific test files and code into addon packages, that would make installation a lot easier and reduce the time it takes to complete CI testing for changes in core pymatgen.
I'm not trying to ignore the considerations you articulated in that discussion; they are important and I understand where you're coming from. Your group uses LAMMPS a lot, so it makes complete sense that the LAMMPS IO remains in a package that you maintain. But keeping it in a separate package (that you maintain) could have benefits for users who use pymatgen in different ways.
Yes, I agree that the quality of some [...] In any case, if the intent all along was for the addon architecture to "delegate less-used implementations to external developers", then I think we could be more explicit about that in the docs, because I had the impression that this was partially intended to reduce the installation size and dependency count of pymatgen over time.
Ported here from Slack for added context and longevity (per @tschaume's suggestion): @rkingsbury [...] thoughts on a few agenda items: [...]
@janosh and @rkingsbury I have already stated my opinion on pymatgen-io-vasp. Of course, I cannot prevent you from actually implementing a pymatgen-io-vasp that overrides the version in pymatgen. But doing so will be without my support, and I will not give any consideration to those efforts in future developments. The question of backward compatibility is a red herring. We have implemented plenty of non-backward-compatible changes before; it just has to be managed. All the talk about the file size of pymatgen is nonsense. The PyPI wheel of pymatgen is 10 MB. Users don't install the GitHub version.
I should have perhaps emphasized the new in new IO addons. Initially, they often require a high release cadence and have more bugs than we'd like in core pymatgen.
My personal preference would also be for the currently very important functionalities to stay in pymatgen (even though I, of course, have no real vote on this). This also ensures that they still work after changes in any part of pymatgen. Discoverability of such add-ons would also need to be ensured otherwise. (Yes, we have been discussing a website for this, but it's still easier this way.)
@janosh Ok. Your example used VASP, so I thought you were thinking of overriding pymatgen.io.vasp. For IO of other packages not currently in pymatgen, I have no issue with them being add-ons.
Apologies for any confusion there, @shyuep. My takeaway from our prior discussion (in March) was that VASP is central to pymatgen and should stay in, in large part because many or most of the core pymatgen maintainers and developers also use VASP.
This is a good point (which I didn't appreciate until recently). However, as a counterpoint, developers implementing new workflows usually need to install the GitHub version on HPC clusters in order to test their work, and that's where the large install size becomes problematic (I had multiple issues with this during the r2SCAN work).
I agree it makes sense for IO for new codes to be in addon packages. I also think it would be beneficial to move some codes currently in pymatgen (which are not used or maintained as much by the core pymatgen team) into addons. Q-Chem comes to mind here. As far as I'm aware, it's maintained almost entirely by Sam, Evan, and a handful of others in the Persson group who do molecular calculations. If it were an addon package, @shyuep, @janosh, and the other pymatgen maintainers would have less work to do reviewing PRs, and the molecular team would be able to iterate faster by keeping review and merge internal. The onus would be on them to keep up with changes in upstream pymatgen. This could also help focus specific user groups (in this example, pymatgen + Q-Chem users) into dedicated GitHub repos where their issues and discussions would not be intermingled with those that aren't relevant to their work. We could make the transition seamless by initially making [...]
@rkingsbury One problem for me when developing codes has always been finding out who the currently active maintainer is. With more and more add-ons, it might become really hard for other people to contribute, as you would need to rely on more and more maintainers and also have to wait for their responses. In general, however, I agree with the other points mentioned. We also decided to move part of the lobster functionality to another code (partially because the scope was a bit beyond the typical pymatgen scope, with features for ML, automated analysis, etc.).
That! FWIW, I agree with @rkingsbury that the current repo size is a problem. Very annoying when [...]
I think you guys are focusing on the wrong problem here. The size of the repo is primarily the result of the test files, which are huge (e.g., WAVECARs, CHGCARs, etc.). Anyway, I have no objections if someone wants to move qchem entirely into an add-on package.
True. Repo size is off topic.
Well, I know that it's the test files that really blow up the repo size, and every code included in [...] However, looking at the file size breakdown in [...]
Since addon packages for IO obviously are not a solution to the size issue (unless/until we decide to move VASP into an addon), I'll open a separate discussion for that, as @janosh suggested.
@rkingsbury Thanks for listing this. When we next work on the lobster modules, @naik-aakash and I will check whether we can reduce the test file size as well. I am currently not planning to move the Lobster IO out.
I don't think there is a solution that achieves the primary aim here. Developers are supposed to have the test files. They are supposed to be writing tests. Can we move the files to some other location and tell developers they have to go somewhere to download only the files they need? Yes, of course. But that would add an extra step and discourage developers from writing and running tests, and having developers write and run tests is non-negotiable. Also, cloning is a one-time affair. Please do not focus on the imaginary problem of the hypothetical dev on sub-100 Mbps broadband taking 1 hr to clone pymatgen for the first time. He does not subsequently have to suffer the same 1 hr wait; he can go watch a Netflix movie while he waits. I have never taken more than a few minutes to clone pymatgen, and I don't have gigabit internet.
Fair points. I have also found cloning to be fast enough on my own local systems, but on HPC clusters I've found myself having to set up new environments frequently (multiple times per quarter) and needing to re-clone each time. I've also commonly encountered slow download and unpack speeds on HPC systems specifically. It sounds like @janosh has had a similar experience, so I think it's worth discussing creative ways of reducing test file sizes at some point. But that's a discussion for another thread.
@rkingsbury I am not sure why the pymatgen repo needs to be on HPCs, since you can always install the 10 MB PyPI version. But even if you want to clone, I think git's sparse-checkout function will do. See https://stackoverflow.com/questions/600079/how-do-i-clone-a-subdirectory-only-of-a-git-repository
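For anyone curious, a minimal sketch of what this might look like with Git 2.25+, assuming the large fixtures live in test_files/ at the repo root and you only want the pymatgen/ package directory (the exact steps in the contributing guide may differ):

```bash
# Clone without materializing any files yet
git clone --no-checkout https://github.com/materialsproject/pymatgen
cd pymatgen

# Restrict the working tree to the package directory,
# skipping the large test_files/ fixtures (requires Git 2.25+)
git sparse-checkout init --cone
git sparse-checkout set pymatgen

# Check out only the selected paths
git checkout master  # or whatever the default branch is
```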
I would add that, a long time ago, I did think of separating out the code from the tests into separate repos: e.g., a pymatgen-dev repo containing the submodules pymatgen (the code) and pymatgen-tests (just the tests and test files). People who just want the code could clone the pymatgen submodule, and devs who need everything would clone the super-repo pymatgen-dev. In the end, the administrative overhead just didn't seem worth it. Like I said, most people who bother with the pymatgen GitHub repo in the first place are devs, and they need the test files. Pure users can just install from PyPI or conda. There are very few cases where a quasi-dev wants to clone the pymatgen source but not the test files.
I probably agree with that. Two things, though: [...]
@janosh 2 sounds like a good first idea from my perspective. 1 sounds painful to communicate to (first-time) contributors (I might be wrong).
This seems like a great solution, @shyuep! I was not aware of this capability. Perhaps we can say something about it in the Developer Guide we are working on. I agree that not very many users will care, but for those doing heavy workflow development, where you need to iterate on your IO code in pymatgen and your workflow code in atomate(2) at the same time (and test on an HPC), this could be extremely useful.
This also sounds like a potentially great solution. I don't know anything about it, though. Maybe we should ask around to see whether others have had better experiences?
I was wondering about this myself. I can't think of any reason not to compress all the test files. Thanks for all the good ideas! In the interest of not straying too far off topic, I will open a separate PR about repo size and try to capture the above discussion points in the associated decision record.
@rkingsbury Added. See https://pymatgen.org/contributing.html @janosh @JaGeo I tried git lfs a long time ago. It was a serious pain in the ass. Maybe things have improved since then. You can test and let me know...
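If someone does test it, a rough sketch of evaluating Git LFS on the fixtures might look like the following (untested; the tracked patterns and paths are purely illustrative, and note that LFS storage on GitHub is subject to bandwidth/storage quotas):

```bash
# One-time setup per machine
git lfs install

# Track the heavy binary fixtures (illustrative patterns)
git lfs track "test_files/**/WAVECAR*" "test_files/**/CHGCAR*"
git add .gitattributes
git commit -m "Track large test fixtures with Git LFS"

# Contributors could then fetch only the fixtures they need, e.g.
git lfs pull --include "test_files/path/of/interest/*"
```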
@rkingsbury Note that I modified the instructions a bit from the link I sent earlier. I tested these instructions myself. The overall process takes up around 140 MB, instead of 900 MB for the full clone. Not an order-of-magnitude difference, but still a significant saving.
@shyuep Nice! Thanks for adding that! After compressing all the test files (see tracking issue materialsproject/pymatgen#2994), I think we can call the pymatgen repo-size issue solved.
I initially misread this as "run the commented-out command if you have Git 2.25+; else run the commands below the 2nd comment." Maybe just remove the 2nd comment and add "Git 2.25+ (released 2020-01-13)" to the first comment?
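For context, the two variants being distinguished are roughly these (a sketch; the contributing guide's exact wording may differ). On Git 2.25+ you use the git sparse-checkout subcommand as shown earlier; on older Git you fall back to the core.sparseCheckout config:

```bash
# Fallback for Git < 2.25: enable sparse checkout via config
git clone --no-checkout https://github.com/materialsproject/pymatgen
cd pymatgen
git config core.sparseCheckout true
echo "pymatgen/" >> .git/info/sparse-checkout
git checkout master  # or the repo's default branch
```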
One thing: I am all for compressing the test files, but I doubt it will do much. Git actually compresses everything by default during transfer...
I never understood this in detail, but I think the compression [...]
Wow, this is great! Thanks for doing this, @shyuep.
I am still in favor of compressing and perhaps better organizing the test files, just to save some disk space and make it easier to understand which test files go with which parts of the code. But as stated, I think this is much less important now that we have the option to do a sparse checkout.
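If we do go ahead with it, the idea could be as simple as the sketch below (assuming the fixtures live under test_files/; the file types shown are illustrative). Since pymatgen generally reads files through monty's zopen, which handles .gz transparently, most gzipped fixtures should just work; and the savings would be in the working tree, which stores files uncompressed even though git compresses objects for transfer:

```bash
# Gzip the large text-based fixtures in place (file types are illustrative)
find test_files -type f \( -name "*.json" -o -name "*.xml" -o -name "*.cif" \) \
    -exec gzip -9 {} +
```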
Merging based on the discussion at the last meeting.
Clarifying the intended scope of `pymatgen-io-xxxx` addon packages.

This is a question I've wanted to discuss for a while and one that I think is very important for defining what our future ecosystem looks like, but I decided to raise it now in particular because it came up recently in the context of LAMMPS development (see materialsproject/pymatgen#2754).