🐛 [BUG] Crash when using large dataset #214

peastman · 2022-05-23T22:04:25Z

Describe the bug
I am trying to train a model on a large dataset with over 700,000 conformations for a diverse set of molecules. I created a dataset in extxyz format as described at #89. The file is about 1.5 GB. When I run nequip-train, it displays the message, "Processing...", shows no further output for about 20 minutes, and then exits with the message, "Killed".

I also tried using a subset of only about the first 200 conformations from the file, and that worked. I suspect the problem is caused by running out of memory or some other resource. Is there any supported way of handling large datasets like this?

To Reproduce
The dataset is much too large to attach, but if it would be helpful I can find a different way of sending it to you.

Environment (please complete the following information):

OS: [e.g. Ubuntu, Windows] Ubuntu 18.04
python version (python --version) 3.9
python environment (commands are given for python interpreter):
- nequip version (import nequip; nequip.__version__) 0.5.4
- e3nn version (import e3nn; e3nn.__version__) 0.4.4
- pytorch version (import torch; torch.__version__) 1.10.0
(if relevant) GPU support with CUDA
- cuda Version according to nvcc (nvcc --version)
- cuda version according to PyTorch (import torch; torch.version.cuda)

The text was updated successfully, but these errors were encountered:

Linux-cpp-lisp · 2022-05-24T01:45:28Z

Hi @peastman ,

Please try the latest un-stable develop branch (at least to pre-process your dataset, once it's cached you should be able to go back and run trainings using stable 0.5.4), which introduced streaming+multiprocessing for faster and more memory efficient extxyz dataset loading and pre-processing. This should resolve your issue, for more details please see #182. The only important point is that, if running outside of a scheduler like SLURM, you may need to set the NEQUIP_NUM_TASKS environment variable to a reasonable number for your system.

If this still does not resolve the memory issues (and you have a reasonable amount of free memory on the system for the task), you could be running into an issue where systems with lots of vacuum in ways I don't fully know how to characterize can cause the ASE neighborlist, which we use, to try to allocate enormous arrays. This can be diagnosed by recording the chart of memory usage over time to see if it is an issue of inefficient usage overall or a single anomalous frame.

peastman · 2022-05-24T03:46:14Z

I have the same problem with the develop branch. It just happens a bit faster. Here is what I observed monitoring the process with top.

Initially there were eight processes, all using equal amounts of CPU and memory. Their memory gradually increased with time. After about five minutes, the total memory in use reached about 29 GB. At that point those processes started disappearing and the memory use went back down to about 8 GB.

They were replaced by a single process using 100% of one core. Its memory gradually increased with time. It took another five minutes or so for it to reach 32 GB (the amount of memory in my computer), at which point it exited and displayed the message "Killed".

Linux-cpp-lisp · 2022-05-24T05:51:53Z

...huh. Is your r_max very big or your atomic systems very dense? I'm struggling to see how 1.5GB of .xyz could get there...

You could try adding some print statements so we know where it fails at least... in dataset.py on develop I'd put ones before lines:

952
954
284

You could also breakpoint() at 953 and go look at the temporary results from the workers in the temp filesystem stored at the paths listed in datas and see how large they are.

peastman · 2022-05-24T14:52:38Z

Is your r_max very big or your atomic systems very dense?

r_max is 10 A. The very largest samples are just under 100 atoms, but most are under 50.

You could try adding some print statements so we know where it fails at least...

Here is the output:

Torch device: cuda
Processing dataset...
line 952
line 954
line 284
Killed

Based on a quick skim through the code, it looks like the dataset always needs to hold all data in memory at once? For a sufficiently large dataset, that will eventually become impossible. Is there any support for datasets that are too large to fit in memory?

peastman · 2022-05-26T00:43:42Z

If I limit the dataset to 300,000 samples it works. At 400,000 it crashes.

Linux-cpp-lisp · 2022-05-26T04:30:31Z

OK yeah, back of the envelope that is just a lot of data (assuming all-to-all edges for 50, which is a bit of an exaggeration but probably not much at 10A, gives just for edges themselves: 50**2 edges * 2 int64/edge * 8 byte/int64 * 400k frames =~ 15GB).

Only AtomicInMemoryDataset assumes that everything is in-memory... the vast majority of nequip should work just fine with a custom subclass of AtomicDataset, which is basically just a PyTorch Geometric Dataset that can do whatever it wants in the get() method, including loading samples on-demand from disk or a flexible cache.

There are two caveats to that:

I've never actually done it, so unexpected things may break
The .statistics method for computing statistics like energy stdev, average number of neighbors, etc. is currently only implemented for in-memory datasets.
The default options for network normalization all use .statistics() to compute these things automatically. If you don't use the default settings, and instead explicitly provide values (either your own estimates or computed from a smaller subset), then this should be a non-issue.

Fixing the second limitation is something we've been meaning to get around to but haven't been able to yet. The vast majority of the infrastructure is already there in the form of our pytorch_runstats library that we already use for running statistics on metrics, but the pieces have to be put together while dealing with some edge cases. Certainly let me know in case you'd be interested in contributing to or testing such efforts.

All that said, the simplest option (I know you already know this) would be to use a machine with more memory for the dataset processing. Once past the pre-processing the memory required should be much less (it peaks at >2x in the step where yours fails when it copies the individual frames into the final set of contiguous buffers)... the processed dataset files should be portable across machines if you want to pre-process on a bigmem CPU node and then train on a GPU node, for example.

peastman · 2022-05-26T04:40:52Z

I implemented a HDF5 dataset format in TorchMD-Net as a way to support arbitrarily large datasets. The code to implement it is very simple. Paging data into memory is handled transparently and very efficiently. Would you be interested in something like that?

Linux-cpp-lisp · 2022-05-26T17:00:39Z

Hi @peastman ,

That would be great! Obviously caveat (2) still applies, but this would be a good starting point. You'd just replace the Data(...) constructor with AtomicData.from_points, more or less. The only issue is that this will recompute the neighborlists on every .get(), which is a decent bit of overhead (ASE neighborlist is slow)... if HDF5 is so good at paging maybe you could still do a preprocessing and cache to an HDF file that also stores the edges? Not sure what makes the most sense.

If you're planning to look into this, CONTRIBUTING.md has some info on the black and flake8 settings for the project.

Thanks!

peastman · 2022-05-26T20:34:17Z

I don't really understand the dataset architecture in NequIP. Are you suggesting I write a new class, or modify an existing one? If the former, which class would it extend and what methods would need to be implemented? If the latter, which class?

It looks to me like the two central methods are get(), which returns an AtomicData for one sample, and statistics(), which I think it should be possible to implement in a generic way by just looping over all samples? There are lots of other methods, but they mostly seem to be incidental.

Linux-cpp-lisp · 2022-05-30T16:59:19Z

Hi @peastman ,

I'm suggesting that you write a new subclass of AtomicDataset. All datasets are expected to be subclasses of AtomicDataset, and arbitrary dataset classes with arbitrary __init__ arguments can be built using:

dataset: package.submodule.MyDataset
dataset_option1: apple
dataset_option2: spinach
...

where MyDataset has __init__ arguments option1 and option2.

I've made an example skeleton of a custom dataset here: https://github.com/mir-group/nequip/blob/develop/examples/custom_dataset.py (make sure to pull latest develop for that). Basically, yes--- if you don't want to do any cached pre-processing or automatic downloading, all you really need are get() and len().

Re statistics(),

which I think it should be possible to implement in a generic way by just looping over all samples?

this is true, there are just a lot of subtle edge cases to do it for all datasets and statistics modes correctly... it's been on the TODO list for a while but never high enough priority (if it ain't broke...). But our package pytorch_runstats (which is already a dependency) should be a good starting point for this, and as the example file says if you just need to get a very specific use working all you'd need to implement is that case.

peastman · 2022-05-31T22:25:11Z

Thanks! I've got an initial version working. At the moment I just have statistics() returning precomputed values. I'll need to look into how to implement it properly. And as you said, calling AtomicData.from_points() inside get() adds a lot of overhead. The GPU utilitization is only about 45%. So I still have things to improve, but it serves as a proof of concept.

Linux-cpp-lisp · 2022-06-01T04:47:04Z

@peastman oh nice!

Yeah, the ASE neighborlists are not that fast... you could always try to play with the dataloader_num_workers YAML setting. I've never really used it set to anything but 0 (disabled) because on our cluster it would always hang with weird IPC problems, but in theory it should resolve this issue by making the neighborlists and the epoch steps run in parallel.

peastman · 2022-06-01T15:15:13Z

Thanks for the tip. That helps a little bit. With dataloader_num_workers: 3 I can get the GPU utilization up to around 55%.

peastman added the bug Something isn't working label May 23, 2022

peastman mentioned this issue Jun 27, 2022

Created HDF5 based dataset #227

Merged

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🐛 [BUG] Crash when using large dataset #214

🐛 [BUG] Crash when using large dataset #214

peastman commented May 23, 2022

Linux-cpp-lisp commented May 24, 2022 •

edited

Loading

peastman commented May 24, 2022

Linux-cpp-lisp commented May 24, 2022 •

edited

Loading

peastman commented May 24, 2022

peastman commented May 26, 2022

Linux-cpp-lisp commented May 26, 2022

peastman commented May 26, 2022

Linux-cpp-lisp commented May 26, 2022

peastman commented May 26, 2022

Linux-cpp-lisp commented May 30, 2022 •

edited

Loading

peastman commented May 31, 2022

Linux-cpp-lisp commented Jun 1, 2022

peastman commented Jun 1, 2022

🐛 [BUG] Crash when using large dataset #214

🐛 [BUG] Crash when using large dataset #214

Comments

peastman commented May 23, 2022

Linux-cpp-lisp commented May 24, 2022 • edited Loading

peastman commented May 24, 2022

Linux-cpp-lisp commented May 24, 2022 • edited Loading

peastman commented May 24, 2022

peastman commented May 26, 2022

Linux-cpp-lisp commented May 26, 2022

peastman commented May 26, 2022

Linux-cpp-lisp commented May 26, 2022

peastman commented May 26, 2022

Linux-cpp-lisp commented May 30, 2022 • edited Loading

peastman commented May 31, 2022

Linux-cpp-lisp commented Jun 1, 2022

peastman commented Jun 1, 2022

Linux-cpp-lisp commented May 24, 2022 •

edited

Loading

Linux-cpp-lisp commented May 24, 2022 •

edited

Loading

Linux-cpp-lisp commented May 30, 2022 •

edited

Loading