🐛 [BUG] Crash when using large dataset #214
Hi @peastman, please try the latest unstable version. If this still does not resolve the memory issues (and you have a reasonable amount of free memory on the system for the task), you could be running into an issue where systems with lots of vacuum can, in ways I don't fully know how to characterize, cause the ASE neighborlist we use to try to allocate enormous arrays. This can be diagnosed by recording memory usage over time to see whether it is inefficient usage overall or a single anomalous frame.
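One minimal way to record such a memory-usage trace (an illustrative sketch, not part of nequip; assumes the third-party psutil package) is to log system memory to a CSV from a second terminal while `nequip-train` runs, then plot it:

```python
import time
import psutil

# Append system memory usage to a CSV every few seconds while `nequip-train`
# runs in another terminal; plotting the result shows whether usage grows
# steadily or spikes on a single anomalous frame.
start = time.time()
with open("memory_log.csv", "w") as f:
    f.write("time_s,used_gb\n")
    while True:
        used_gb = psutil.virtual_memory().used / 1e9
        f.write(f"{time.time() - start:.1f},{used_gb:.3f}\n")
        f.flush()
        time.sleep(5.0)
```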
I have the same problem with the latest unstable version. Initially there were eight processes, all using equal amounts of CPU and memory. Their memory gradually increased with time. After about five minutes, the total memory in use reached about 29 GB. At that point those processes started disappearing and the memory use went back down to about 8 GB. They were replaced by a single process using 100% of one core. Its memory gradually increased with time. It took another five minutes or so for it to reach 32 GB (the amount of memory in my computer), at which point it exited and displayed the message "Killed".
...huh. Is your memory usage really growing that fast? You could try adding some instrumentation to see where the memory is going. You could also post the output here.
Here is the output:
Based on a quick skim through the code, it looks like the dataset always needs to hold all data in memory at once? For a sufficiently large dataset, that will eventually become impossible. Is there any support for datasets that are too large to fit in memory?
If I limit the dataset to 300,000 samples it works. At 400,000 it crashes.
OK yeah, back of the envelope that is just a lot of data: assuming all-to-all edges for 50 atoms (which is a bit of an exaggeration, but probably not by much at a 10 Å cutoff), the edge indices alone come to 50**2 edges * 2 int64/edge * 8 bytes/int64 * 400k frames ≈ 15 GB. There are two caveats to that: (1) the current dataset classes hold everything in memory, and (2) the dataset statistics are still computed over the whole dataset at once.
Fixing the second limitation is something we've been meaning to get around to but haven't been able to yet. The vast majority of the infrastructure is already there in the form of our running-statistics machinery. All that said, the simplest option (I know you already know this) would be to use a machine with more memory for the dataset processing. Once past the pre-processing, the memory required should be much less (it peaks at >2x during the step where yours fails, when it copies the individual frames into the final set of contiguous buffers). The processed dataset files should be portable across machines if you want to pre-process on a bigmem CPU node and then train on a GPU node, for example.
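A quick sanity check of that back-of-the-envelope estimate (illustrative only):

```python
# Edge indices alone, assuming (pessimistically) all-to-all connectivity
# for ~50-atom frames across 400k frames.
n_atoms = 50
n_frames = 400_000
ints_per_edge = 2           # (source, target) index pair
bytes_per_int64 = 8

total_bytes = n_atoms**2 * ints_per_edge * bytes_per_int64 * n_frames
print(f"{total_bytes / 1e9:.0f} GB")  # ~16 GB, consistent with the ~15 GB quoted above
```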
I implemented an HDF5 dataset format in TorchMD-Net as a way to support arbitrarily large datasets. The code to implement it is very simple. Paging data into memory is handled transparently and very efficiently. Would you be interested in something like that?
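A minimal sketch of that idea, assuming h5py and an illustrative file layout (dataset names like "positions" and "energies" are placeholders, not the actual TorchMD-Net format):

```python
import h5py
import torch
from torch.utils.data import Dataset

class HDF5Frames(Dataset):
    """Reads one conformation at a time from an HDF5 file instead of
    holding the whole dataset in memory. Dataset names are illustrative."""

    def __init__(self, path):
        self.path = path
        with h5py.File(path, "r") as f:
            self.n_frames = f["positions"].shape[0]
        self._file = None  # opened lazily, once per worker process

    def __len__(self):
        return self.n_frames

    def __getitem__(self, idx):
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        # Only the requested frame is read from disk; HDF5's chunk cache
        # keeps repeated nearby reads cheap.
        pos = torch.as_tensor(self._file["positions"][idx])
        energy = torch.as_tensor(self._file["energies"][idx])
        return {"pos": pos, "energy": energy}
```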
Hi @peastman, that would be great! Obviously caveat (2) still applies, but this would be a good starting point. You'd just replace the existing in-memory dataset class. If you're planning to look into this, let us know. Thanks!
I don't really understand the dataset architecture in NequIP. Are you suggesting I write a new class or modify an existing one? If the former, which class would it extend and what methods would need to be implemented? If the latter, which class? It looks to me like there are two central methods.
Hi @peastman, I'm suggesting that you write a new subclass of the existing dataset base class rather than modifying one in place. I've made an example skeleton of a custom dataset here: https://github.com/mir-group/nequip/blob/develop/examples/custom_dataset.py (make sure to pull the latest develop branch first).

Re computing the statistics incrementally: this is true, there are just a lot of subtle edge cases in doing it correctly for all datasets and statistics modes... it's been on the TODO list for a while but never high enough priority (if it ain't broke...). But our running-statistics package already handles most of the bookkeeping.
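For context, statistics like these can be accumulated in a single streaming pass with a Welford-style update; a minimal sketch (not nequip's actual running-statistics code):

```python
import torch

class RunningStats:
    """Welford-style accumulator: mean and variance in one pass over batches,
    never holding more than one batch in memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, batch: torch.Tensor):
        for x in batch.flatten().tolist():
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return (self.m2 / (self.n - 1)) ** 0.5 if self.n > 1 else 0.0

# Usage: feed batches as they stream off disk instead of loading everything.
stats = RunningStats()
for batch in (torch.randn(1000) for _ in range(100)):  # stand-in for streamed data
    stats.update(batch)
print(stats.mean, stats.std)
```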
Thanks! I've got an initial version working. At the moment the slow part is building the neighbor lists with ASE.
@peastman oh nice! Yeah, the ASE neighborlists are not that fast... you could always try to play with the neighborlist settings.
Thanks for the tip. That helps a little bit.
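One way such a speedup could be approached, sketched here under the assumption of an extxyz input and a fixed 10 Å cutoff (illustrative filenames; this is not the thread's actual code), is to parallelize the per-frame ASE neighbor-list construction across processes:

```python
from multiprocessing import Pool

import ase.io
from ase.neighborlist import neighbor_list

CUTOFF = 10.0  # Angstrom, matching the cutoff discussed above

def edges_for_frame(atoms):
    # "i", "j" are the per-edge source/target atom indices within the cutoff.
    i, j = neighbor_list("ij", atoms, CUTOFF)
    return i, j

if __name__ == "__main__":
    frames = ase.io.read("dataset.extxyz", index=":")  # illustrative filename
    with Pool() as pool:
        all_edges = pool.map(edges_for_frame, frames)
    print(f"built neighbor lists for {len(all_edges)} frames")
```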
Describe the bug
I am trying to train a model on a large dataset with over 700,000 conformations for a diverse set of molecules. I created a dataset in extxyz format as described at #89. The file is about 1.5 GB. When I run `nequip-train`, it displays the message "Processing...", shows no further output for about 20 minutes, and then exits with the message "Killed". I also tried using a subset of only about the first 200 conformations from the file, and that worked. I suspect the problem is caused by running out of memory or some other resource. Is there any supported way of handling large datasets like this?
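For reference, a subset like the 200-frame one mentioned above can be extracted with ASE; a short sketch with illustrative filenames:

```python
import ase.io

# Read only the first 200 conformations and write them back out as a
# smaller extxyz file for testing.
frames = ase.io.read("dataset.extxyz", index=":200")
ase.io.write("dataset_subset.extxyz", frames)
print(f"wrote {len(frames)} frames")
```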
To Reproduce
The dataset is much too large to attach, but if it would be helpful I can find a different way of sending it to you.
Environment (please complete the following information):
- python version (`python --version`): 3.9
- nequip version (`import nequip; nequip.__version__`): 0.5.4
- e3nn version (`import e3nn; e3nn.__version__`): 0.4.4
- pytorch version (`import torch; torch.__version__`): 1.10.0
- cuda version (`nvcc --version`):
- cuda version according to PyTorch (`import torch; torch.version.cuda`):