speed up bader_caller and chargemol_caller #3192

Merged
merged 23 commits into from
Aug 8, 2023

Conversation

chiang-yuan
Contributor

@chiang-yuan chiang-yuan commented Jul 29, 2023

In the original implementation, CHGCAR is copied into a temporary folder using monty.io.zopen. Since CHGCAR is usually very large, this makes calling Bader analysis from pymatgen extremely slow.

This might close #2487.

  • fix 1: use the monty.tempfile.ScratchDir context manager
  • fix 2: do not copy large files like CHGCAR into the temporary folder; read them in place instead
  • fix 3: use os.symlink to link large files for chargemol_caller
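
A minimal sketch of the idea behind fixes 1 and 2, under stated assumptions: it uses only the standard library (tempfile.TemporaryDirectory stands in for monty.tempfile.ScratchDir), and run_in_scratch and FAKE_TOOL are hypothetical names for illustration, not pymatgen's API. The tool runs with a scratch directory as its working directory, so its output lands there, while the large input is passed by absolute path and never copied.

```python
import os
import subprocess
import sys
import tempfile


def run_in_scratch(command, input_path):
    """Run `command + [input_path]` with a scratch directory as cwd.

    Hypothetical helper. Returns a dict mapping each file the command wrote
    in the scratch directory to its text content. The input is read in place.
    """
    # An absolute path stays valid even though the child runs in another cwd.
    input_path = os.path.abspath(input_path)
    with tempfile.TemporaryDirectory() as scratch:
        subprocess.run([*command, input_path], cwd=scratch, check=True)
        outputs = {}
        for name in os.listdir(scratch):
            with open(os.path.join(scratch, name)) as handle:
                outputs[name] = handle.read()
    return outputs


# Stand-in for an external analysis executable (an assumption for this sketch):
# it writes the size of its input file to "summary.txt" in its working directory.
FAKE_TOOL = [
    sys.executable,
    "-c",
    "import os, sys; open('summary.txt', 'w').write(str(os.path.getsize(sys.argv[1])))",
]
```

Calling run_in_scratch(FAKE_TOOL, "/path/to/CHGCAR") would leave the directory containing CHGCAR untouched; only the small output file is created, and it lives in the scratch directory until its content has been read.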

Checklist

  • Google format doc strings added. Check with ruff.
  • Type annotations included. Check with mypy.
  • Tests added for new features/fixes.

@chiang-yuan chiang-yuan marked this pull request as ready for review July 29, 2023 00:01
@Andrew-S-Rosen
Member

Thank you so much for doing this! It has been on my backlog for a very long time.

@Andrew-S-Rosen
Member

Andrew-S-Rosen commented Jul 29, 2023

You don't need to add it in this PR (unless you're feeling ambitious!), but it's likely that an analogous fix will solve similar issues with the chargemol_caller.

with ScratchDir("."):
    with zopen(self._chgcarpath, "rt") as f_in, open("CHGCAR", "w") as f_out:
        shutil.copyfileobj(f_in, f_out)
    with zopen(self._potcarpath, "rt") as f_in, open("POTCAR", "w") as f_out:
        shutil.copyfileobj(f_in, f_out)
    with zopen(self._aeccar0path, "rt") as f_in, open("AECCAR0", "w") as f_out:
        shutil.copyfileobj(f_in, f_out)
    with zopen(self._aeccar2path, "rt") as f_in, open("AECCAR2", "w") as f_out:
        shutil.copyfileobj(f_in, f_out)

@chiang-yuan
Contributor Author

@arosen93 thanks for pointing out chargemol as well! It actually requires little modification, so I guess I can try to solve them both.

@Andrew-S-Rosen
Member

@chiang-yuan: I would certainly be very grateful if you did!!! 🙏

@chiang-yuan
Contributor Author

Chargemol needs to see all the input files in the same folder, so the workaround here is to use symbolic links that point to the large files. I also ported the error messages from bader and chargemol back to a Python RuntimeError.

I have used pytest to run test_bader_caller.py and test_chargemol_caller.py locally on Perlmutter and they all passed. If this change cannot pass pytest in the GitHub workflow, I don't know where the problem is... Perhaps bader and chargemol are somehow not correctly installed in the pytest environment.
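
The error-porting idea can be sketched as follows. This is an illustration only, not the code merged in this PR: run_or_raise and FAILING are hypothetical names, and a small Python subprocess stands in for the bader or chargemol executable.

```python
import subprocess
import sys


def run_or_raise(command):
    """Run an external command and return its stdout.

    Hypothetical helper: if the command exits with a non-zero code, its
    captured stdout and stderr are surfaced in a Python RuntimeError.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{command[0]} exited with code {result.returncode}.\n"
            f"stdout: {result.stdout}\nstderr: {result.stderr}"
        )
    return result.stdout


# Stand-in for a failing executable (an assumption for this sketch).
FAILING = [sys.executable, "-c", "import sys; sys.stderr.write('bad input'); sys.exit(2)"]
```

Raising instead of printing means a failed run stops the calling workflow immediately, with the tool's own message attached to the traceback.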

@chiang-yuan chiang-yuan changed the title from "speed up bader_caller" to "speed up bader_caller and chargemol_caller" Jul 30, 2023
@chiang-yuan
Contributor Author

Looks like CHGCAR passed but cube failed. Maybe @janosh @arosen93 you will be interested in looking into this?

https://github.com/materialsproject/pymatgen/actions/runs/5704593631/job/15458352230?pr=3192#step:7:183

@chiang-yuan
Contributor Author

I found the problem. The code was reading compressed test files, but somehow this didn't raise an error on Perlmutter. I modified the code to handle compressed files and decompress them to a temp directory if needed. I'm not sure this is the ideal solution, since it falls back to copying data to a temp folder again (but only for compressed files and folders; uncompressed files are still read in their original location).

This will also matter when we want to compress all the test files. #2994
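
The "decompress only when needed" behavior described above can be sketched like this. It is a simplified illustration under assumptions: path_for_reading is a hypothetical helper name, only gzip is handled, and the standard library is used in place of monty.

```python
import gzip
import os
import shutil
import tempfile


def path_for_reading(path, scratch_dir):
    """Return a path to an uncompressed version of `path`.

    Hypothetical helper. Uncompressed files are returned unchanged (no copy).
    Files ending in ".gz" are decompressed into `scratch_dir`, and the
    original compressed file is left in place.
    """
    if not path.endswith(".gz"):
        return path
    target = os.path.join(scratch_dir, os.path.basename(path)[: -len(".gz")])
    # Stream the data so a large file is never held in memory at once.
    with gzip.open(path, "rb") as f_in, open(target, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return target
```

With this shape, the copying cost is paid only for compressed inputs, which matches the trade-off described in the comment.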

Member

@janosh janosh left a comment

@chiang-yuan Thanks for taking this on! Very excited about fully reviving the BaderAnalysis in pymatgen. Left a few questions.

import subprocess
import warnings
from glob import glob
from shutil import which
from tempfile import TemporaryDirectory

import numpy as np
from monty.io import zopen
from monty.tempfile import ScratchDir
Member

I actually don't understand why the files need to be copied at all? Does bader modify them in place? Seems unlikely, and even if it does, they should provide an option to disable that?

Contributor Author

They don't! So what I did here is create a temporary ScratchDir to store the output files from bader. However, bader cannot read compressed files, so at the bottom we need to decompress a file if it is compressed and store it temporarily somewhere (I just put it in the ScratchDir as well).

        os.symlink(self._aeccar0path, "./AECCAR0")
        os.symlink(self._aeccar2path, "./AECCAR2")
    except OSError as e:
        print(f"Error creating symbolic link: {e}")
Member

Using os.symlink here seems strange. Why not read the files from where they are?

Contributor Author

chargemol is different. There seem to be no input arguments to indicate where the input files (e.g. CHGCAR, AECCAR0) are stored, so we need all the input files in the ScratchDir. To avoid copying, symbolic links let us "pretend" the files are in the same directory, while they are actually just shortcuts to the files in their original location.
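
A minimal sketch of the symlink approach, assuming a POSIX-like system (creating symlinks on Windows may need extra privileges). link_inputs is a hypothetical helper name, not part of pymatgen.

```python
import os
import tempfile


def link_inputs(sources, workdir):
    """Create symlinks in `workdir` that point at the original input files.

    Hypothetical helper. `sources` maps the file name the program expects
    (e.g. "CHGCAR") to the real location of that file. No data is copied.
    """
    for expected_name, real_path in sources.items():
        # Use an absolute target so the link resolves from inside workdir.
        os.symlink(os.path.abspath(real_path), os.path.join(workdir, expected_name))
```

A program started in workdir then finds "CHGCAR" under the expected name, but every read goes to the original file, so creating the link takes roughly constant time regardless of file size.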

@@ -78,6 +78,8 @@ def test_from_path(self):
        assert np.allclose(charge, charge0)
        if os.path.exists("CHGREF"):
            os.remove("CHGREF")
        if os.path.exists(os.path.join(test_dir, "CHGREF")):
            os.remove(os.path.join(test_dir, "CHGREF"))
Member

@janosh janosh Jul 31, 2023

Would probably be best to use the PymatgenTest.tmp_path fixture so we don't have to worry about manually cleaning up test-generated files.

Contributor Author

@chiang-yuan chiang-yuan Jul 31, 2023

Is PymatgenTest.tmp_path already implemented? If yes, I need some time to look into how to refactor this part. If not, I need more time 😂

Member

Yes, right here

@pytest.fixture(autouse=True)  # make all tests run in a temporary directory accessible via self.tmp_path
def _tmp_dir(self, tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
    # https://pytest.org/en/latest/how-to/unittest.html#using-autouse-fixtures-and-accessing-other-fixtures
    monkeypatch.chdir(tmp_path)  # change to pytest-provided temporary directory
    self.tmp_path = tmp_path
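
To show what the fixture above buys a test, here is a standard-library analogue of combining tmp_path with monkeypatch.chdir. It is only an illustration of the mechanism; temporary_cwd is a hypothetical name, and real tests would use the pytest fixture rather than this helper.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temporary_cwd():
    """Change into a fresh temporary directory, then restore the old cwd.

    Any file written with a relative path inside the block lands in the
    temporary directory and is deleted when the block exits.
    """
    previous = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        os.chdir(tmp)
        try:
            yield tmp
        finally:
            os.chdir(previous)  # leave the directory before it is removed
```

Because files such as CHGREF are written relative to the working directory, a test body run this way needs no manual os.remove calls afterwards.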

Contributor Author

Thanks I am looking into it!

@@ -17,6 +17,29 @@
__date__ = "Sep 23, 2011"


def decompress_file_to_path(fin_path, fout_path=None):
Member

Isn't this quite similar to monty.io.zopen?

Contributor Author

Actually it is more similar to decompress_file in monty, but that function doesn't return the path, and it deletes the original compressed file. I created a helper function there as a fix.

Member

I assume @shyuep wouldn't mind modifying monty.shutil.decompress_file, esp. since the doc string is currently wrong: arg compression doesn't actually exist.

https://github.com/materialsvirtuallab/monty/blob/2b91561d2483f2bebfc185c4945dfae85008c439/monty/shutil.py#L98

We can make a PR to monty to have the function return the path.

Member

@chiang-yuan Let's delete this function and use the changes in materialsvirtuallab/monty#536 instead.

@shyuep Can we have a monty release whenever we do the next pymatgen release?

@janosh
Member

janosh commented Aug 7, 2023

Sorry, looks like this was waiting on me. I missed that. Just fixed the merge conflict. I'll submit the PR to monty to return filepath from decompress_file.

@chiang-yuan Any additional tests we might want to add?

@codecov-commenter

codecov-commenter commented Aug 7, 2023

Codecov Report

Patch coverage: 80.39% and project coverage change: -0.57% ⚠️

Comparison is base (1fa96f8) 74.57% compared to head (6f9b905) 74.01%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3192      +/-   ##
==========================================
- Coverage   74.57%   74.01%   -0.57%     
==========================================
  Files         230      230              
  Lines       69494    69496       +2     
  Branches    16166    16163       -3     
==========================================
- Hits        51824    51436     -388     
- Misses      14595    15019     +424     
+ Partials     3075     3041      -34     
Files Changed                               Coverage Δ
pymatgen/command_line/chargemol_caller.py   61.56% <0.00%> (-0.29%) ⬇️
pymatgen/command_line/bader_caller.py       60.61% <88.09%> (-1.12%) ⬇️
pymatgen/util/io_utils.py                   91.30% <100.00%> (+1.83%) ⬆️

... and 6 files with indirect coverage changes

@chiang-yuan
Contributor Author

chiang-yuan commented Aug 8, 2023

@janosh I made two modifications: 1) deleted decompress_file_to_path and used monty instead, 2) used the PymatgenTest.tmp_path fixture.

The monty version used in the tests is still an older one, so the test is unsuccessful.

@janosh janosh merged commit 7c70c1a into materialsproject:master Aug 8, 2023
15 of 18 checks passed
@janosh
Member

janosh commented Aug 8, 2023

Thanks @chiang-yuan! 👍 this is a big improvement. required a lot of tweaking but i'm glad we got it merged.

@mkhorton
Member

mkhorton commented Aug 8, 2023

Also offering thanks for this :)

Development

Successfully merging this pull request may close these issues.

Bader analysis via Pymatgen is extremely slow
5 participants