Add multi-GPU dataparallel #74

Merged (64 commits) on Mar 28, 2024
Commits
abc66f8
make matscipy default neighbour list
ilyes319 Jan 6, 2023
73b234c
on the fly data loading
davkovacs Feb 8, 2023
e8b5bac
add multi-GPU dataparrallel
ilyes319 Feb 9, 2023
83733f6
threaded loading and improved speed
davkovacs Feb 9, 2023
e966331
implement statistics parsing and improve run_train
davkovacs Feb 9, 2023
7604f23
refactoring plus cleaned up structure
davkovacs Feb 13, 2023
4a5309e
small bugfix
davkovacs Feb 13, 2023
5303cb4
parse statistics from json file
davkovacs Feb 14, 2023
f3f14f2
still a little slow data loading
davkovacs Feb 14, 2023
da23dba
performance good, but no shuffling of training set during training!
davkovacs Feb 14, 2023
a290466
data loading works with shuffling
davkovacs Feb 14, 2023
b7a097a
document on-line data loading
davkovacs Feb 15, 2023
652a96e
Update README.md
davkovacs Feb 15, 2023
d619d9d
Merge pull request #73 from davkovacs/on_the_fly_dataloading
ilyes319 Feb 15, 2023
262a1fb
lower CPU memroy during preprocessing
davkovacs Feb 15, 2023
f769417
Merge branch 'on_the_fly_dataloading' of github.com:davkovacs/mace in…
davkovacs Feb 15, 2023
cde62d3
update README
davkovacs Feb 15, 2023
d662854
save jsons stats as strings
davkovacs Feb 15, 2023
3b4162b
fix test set h5 save name
davkovacs Feb 16, 2023
2dee6d1
Merge pull request #81 from davkovacs/on_the_fly_dataloading
ilyes319 Feb 16, 2023
bc8f04a
improve half periodic matscipy
ilyes319 Feb 24, 2023
5ab28a6
Merge pull request #64 from ACEsuit/52-matscipy-neighbour-list-as-def…
ilyes319 Apr 6, 2023
e552cc5
implement on the fly graph creation
ilyes319 Apr 6, 2023
5d90030
add hf5 test
ilyes319 Apr 6, 2023
511f591
fix stuff
ilyes319 Apr 6, 2023
fca0bb3
Merge pull request #99 from davkovacs/multi-GPU
ilyes319 Apr 6, 2023
56b3814
fix import
ilyes319 Apr 6, 2023
8c0eb6f
add flags
ilyes319 Apr 6, 2023
8bd47ec
fix cpu
ilyes319 Apr 6, 2023
2009b0c
Update run_train.py
ilyes319 Apr 6, 2023
0f3ed94
add correct batching
ilyes319 Apr 6, 2023
3b15205
debug
davkovacs Apr 7, 2023
717c8f4
node attrs req grad
davkovacs May 7, 2023
bb4d51e
Multi-node, multi-GPU data parallel training.
samwaltonnorwood May 22, 2023
4971d90
Only open hdf5 file when it is used and remove it from the state for …
sivonxay Jun 15, 2023
e0c94fb
remove new file creation in HDF5ChainDataset
sivonxay Jun 15, 2023
d07469a
Distributed evaluation.
samwaltonnorwood Jun 16, 2023
583b1c3
Merge pull request #117 from sivonxay/multi-GPU
ilyes319 Jun 17, 2023
643a18b
Update .gitignore
mavaylon1 Jun 28, 2023
b900ee4
Update .gitignore
mavaylon1 Jun 28, 2023
37762e9
Update preprocess_data.py
mavaylon1 Jun 28, 2023
99baf11
Update utils.py
mavaylon1 Jun 28, 2023
7876875
cpu/4
mavaylon1 Jun 28, 2023
3c13f10
compute statistics only once
ilyes319 Jul 6, 2023
a28c4fb
Merge branch 'multi-GPU' into multi-GPU
mavaylon1 Jul 17, 2023
2f54450
some reverts
mavaylon1 Jul 20, 2023
1d67e69
clean
mavaylon1 Jul 20, 2023
7befa8f
Merge pull request #105 from samwaltonnorwood/distributed
ilyes319 Jul 24, 2023
354c23d
test
mavaylon1 Jul 31, 2023
3df66cc
test cleanm
mavaylon1 Jul 31, 2023
741f9aa
Merge branch 'ACEsuit:multi-GPU' into multi-GPU
mavaylon1 Aug 1, 2023
41aeede
removal
mavaylon1 Aug 1, 2023
42ffd94
clean up
mavaylon1 Aug 1, 2023
3ce9b08
clean up
mavaylon1 Aug 1, 2023
916d759
clean up
mavaylon1 Aug 1, 2023
8e7d174
clean up
mavaylon1 Aug 1, 2023
2a74e38
args
mavaylon1 Aug 1, 2023
5da3b6a
fix
mavaylon1 Aug 8, 2023
729b1ad
parse/test
mavaylon1 Aug 8, 2023
a07a0ff
clean
mavaylon1 Aug 20, 2023
9f553f6
type hints
mavaylon1 Aug 20, 2023
5d27636
remove
mavaylon1 Sep 5, 2023
d8e236a
Merge pull request #133 from mavaylon1/multi-GPU
ilyes319 Sep 29, 2023
113b839
fix bug avg neighbors
ilyes319 Oct 10, 2023
6 changes: 6 additions & 0 deletions .gitignore
@@ -18,3 +18,9 @@ build/
.vscode/
logs/MACE_run-5.log
*.txt

# Jupyter Notebook
.ipynb_checkpoints

# DS_Store
.DS_Store
48 changes: 47 additions & 1 deletion README.md
@@ -76,7 +76,7 @@ python ./mace/scripts/run_train.py \

To give a specific validation set, use the argument `--valid_file`. To set a larger batch size for evaluating the validation set, specify `--valid_batch_size`.

To control the model's size, you need to change `--hidden_irreps`. For most applications, the recommended default model size is `--hidden_irreps='256x0e'` (meaning 256 invariant messages) or `--hidden_irreps='128x0e + 128x1o'`. If the model is not accurate enough, you can include higher-order features, e.g., `128x0e + 128x1o + 128x2e`, or increase the number of channels to `256`. It is also possible to specify the model size using the `--num_channels=128` and `--max_L=1` flags.

It is usually preferred to add the isolated atoms to the training set, rather than reading in their energies through the command line like in the example above. To label them in the training set, set `config_type=IsolatedAtom` in their info fields. If you prefer not to use or do not know the energies of the isolated atoms, you can use the option `--E0s="average"` which estimates the atomic energies using least squares regression.

@@ -105,6 +105,52 @@

You can run our [Colab tutorial](https://colab.research.google.com/drive/1D6EtMUjQPey_GkuxUAbPgld6_9ibIa-V?authuser=1#scrollTo=Z10787RE1N8T) to quickly get started with MACE.

## On-line data loading for large datasets

If you have a large dataset that might not fit into GPU memory, it is recommended to preprocess the data on a CPU and use on-line data loading to train the model. To preprocess a dataset specified as an xyz file, run the `preprocess_data.py` script. An example is given here:

```sh
mkdir processed_data
python ./mace/scripts/preprocess_data.py \
--train_file="/path/to/train_large.xyz" \
--valid_fraction=0.05 \
--test_file="/path/to/test_large.xyz" \
--atomic_numbers="[1, 6, 7, 8, 9, 15, 16, 17, 35, 53]" \
--r_max=4.5 \
--h5_prefix="processed_data/" \
--compute_statistics \
--E0s="average" \
--seed=123
```
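The preprocessed files are plain HDF5, so they can be inspected directly. Below is a minimal sketch; the `train.h5` name follows the training example later in this README, and the `config_batch_*`/`config_*` group layout is assumed from the `HDF5Dataset` reader added in this PR:

```python
import h5py

# Assumed layout: top-level groups are batches ("config_batch_0", ...),
# each holding configurations ("config_0", ...) with per-configuration
# datasets such as atomic_numbers, positions, energy, forces.
with h5py.File("processed_data/train.h5", "r") as f:
    batch = f["config_batch_0"]
    config = batch["config_0"]
    print(list(f.keys())[:3])
    print(list(batch.keys())[:3])
    print(config["atomic_numbers"][()], config["energy"][()])
```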

To see all options with a short description of each, run `python ./mace/scripts/preprocess_data.py --help`. The script will create a number of HDF5 files in the `processed_data` folder which can be used for training: one file for training, one for validation, and a separate one for each `config_type` in the test set. To train the model, use the `run_train.py` script as follows:

```sh
python ./mace/scripts/run_train.py \
--name="MACE_on_big_data" \
--num_workers=16 \
--train_file="./processed_data/train.h5" \
--valid_file="./processed_data/valid.h5" \
--test_dir="./processed_data" \
--statistics_file="./processed_data/statistics.json" \
--model="ScaleShiftMACE" \
--num_interactions=2 \
--num_channels=128 \
--max_L=1 \
--correlation=3 \
--batch_size=32 \
--valid_batch_size=32 \
--max_num_epochs=100 \
--swa \
--start_swa=60 \
--ema \
--ema_decay=0.99 \
--amsgrad \
--error_table='PerAtomMAE' \
--device=cuda \
--seed=123
```

## Weights and Biases for experiment tracking

If you would like to use MACE with Weights and Biases to log your experiments, simply install with
(diff truncated)
8 changes: 8 additions & 0 deletions mace/data/__init__.py
@@ -8,8 +8,12 @@
config_from_atoms_list,
load_from_xyz,
random_train_valid_split,
save_configurations_as_HDF5,
test_config_types,
save_dataset_as_HDF5,
save_AtomicData_to_HDF5,
)
from .hdf5_dataset import HDF5Dataset, dataset_from_sharded_hdf5

__all__ = [
"get_neighborhood",
@@ -22,4 +26,8 @@
"config_from_atoms_list",
"AtomicData",
"compute_average_E0s",
"save_dataset_as_HDF5",
"HDF5Dataset",
"save_AtomicData_to_HDF5",
"save_configurations_as_HDF5",
]
170 changes: 170 additions & 0 deletions mace/data/hdf5_dataset.py
@@ -0,0 +1,170 @@
from glob import glob

import h5py
from torch.utils.data import ChainDataset, ConcatDataset, Dataset, IterableDataset

from mace import data
from mace.data.utils import Configuration
from mace.tools.utils import AtomicNumberTable


class HDF5ChainDataset(ChainDataset):
def __init__(self, file_path, r_max, z_table, **kwargs):
super(HDF5ChainDataset, self).__init__()
self.file_path = file_path
self._file = None

self.length = len(self.file.keys())
self.r_max = r_max
self.z_table = z_table

@property
def file(self):
if self._file is None:
# If a file has not already been opened, open one here
self._file = h5py.File(self.file_path, "r")
return self._file

def __getstate__(self):
_d = dict(self.__dict__)

# An opened h5py.File cannot be pickled, so we must exclude it from the state
_d["_file"] = None
return _d

def __call__(self):
datasets = []
for i in range(self.length):
grp = self.file["config_" + str(i)]
datasets.append(
HDF5IterDataset(
iter_group=grp,
r_max=self.r_max,
z_table=self.z_table,
)
)
return ChainDataset(datasets)


class HDF5IterDataset(IterableDataset):
def __init__(self, iter_group, r_max, z_table, **kwargs):
super(HDF5IterDataset, self).__init__()
# it might be dangerous to open the file here
# move opening of file to __getitem__?
self.iter_group = iter_group
self.length = len(self.iter_group.keys())
self.r_max = r_max
self.z_table = z_table
# self.file = file
# self.length = len(h5py.File(file, 'r').keys())

def __len__(self):
return self.length

def __iter__(self):
# file = h5py.File(self.file, 'r')
# grp = file["config_" + str(index)]
grp = self.iter_group
len_subgrp = len(grp.keys())
grp_list = []
for i in range(len_subgrp):
subgrp = grp["config_" + str(i)]
config = Configuration(
atomic_numbers=subgrp["atomic_numbers"][()],
positions=subgrp["positions"][()],
energy=subgrp["energy"][()],
forces=subgrp["forces"][()],
stress=subgrp["stress"][()],
virials=subgrp["virials"][()],
dipole=subgrp["dipole"][()],
charges=subgrp["charges"][()],
weight=subgrp["weight"][()],
energy_weight=subgrp["energy_weight"][()],
forces_weight=subgrp["forces_weight"][()],
stress_weight=subgrp["stress_weight"][()],
virials_weight=subgrp["virials_weight"][()],
config_type=subgrp["config_type"][()],
pbc=subgrp["pbc"][()],
cell=subgrp["cell"][()],
)
atomic_data = data.AtomicData.from_config(
config, z_table=self.z_table, cutoff=self.r_max
)
grp_list.append(atomic_data)

return iter(grp_list)


class HDF5Dataset(Dataset):
def __init__(self, file_path, r_max, z_table, **kwargs):
super(HDF5Dataset, self).__init__()
self.file_path = file_path
self._file = None
batch_key = list(self.file.keys())[0]
self.batch_size = len(self.file[batch_key].keys())
self.length = len(self.file.keys()) * self.batch_size
self.r_max = r_max
self.z_table = z_table
try:
self.drop_last = bool(self.file.attrs["drop_last"])
except KeyError:
self.drop_last = False

@property
def file(self):
if self._file is None:
# If a file has not already been opened, open one here
self._file = h5py.File(self.file_path, "r")
return self._file

def __getstate__(self):
_d = dict(self.__dict__)

# An opened h5py.File cannot be pickled, so we must exclude it from the state
_d["_file"] = None
return _d

def __len__(self):
return self.length

def __getitem__(self, index):
# compute the index of the batch
batch_index = index // self.batch_size
config_index = index % self.batch_size
grp = self.file["config_batch_" + str(batch_index)]
subgrp = grp["config_" + str(config_index)]
config = Configuration(
atomic_numbers=subgrp["atomic_numbers"][()],
positions=subgrp["positions"][()],
energy=unpack_value(subgrp["energy"][()]),
forces=unpack_value(subgrp["forces"][()]),
stress=unpack_value(subgrp["stress"][()]),
virials=unpack_value(subgrp["virials"][()]),
dipole=unpack_value(subgrp["dipole"][()]),
charges=unpack_value(subgrp["charges"][()]),
weight=unpack_value(subgrp["weight"][()]),
energy_weight=unpack_value(subgrp["energy_weight"][()]),
forces_weight=unpack_value(subgrp["forces_weight"][()]),
stress_weight=unpack_value(subgrp["stress_weight"][()]),
virials_weight=unpack_value(subgrp["virials_weight"][()]),
config_type=unpack_value(subgrp["config_type"][()]),
pbc=unpack_value(subgrp["pbc"][()]),
cell=unpack_value(subgrp["cell"][()]),
)
atomic_data = data.AtomicData.from_config(
config, z_table=self.z_table, cutoff=self.r_max
)
return atomic_data

def dataset_from_sharded_hdf5(files: str, z_table: AtomicNumberTable, r_max: float):
    # `files` is the path to a directory of HDF5 shards, one per preprocessing worker
    files = glob(files + "/*")
datasets = []
for file in files:
datasets.append(data.HDF5Dataset(file, z_table=z_table, r_max=r_max))
full_dataset = ConcatDataset(datasets)
return full_dataset

def unpack_value(value):
value = value.decode("utf-8") if isinstance(value, bytes) else value
return None if str(value) == "None" else value
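For orientation, here is a minimal usage sketch of the classes above. The element list and file paths are hypothetical, `AtomicNumberTable` comes from `mace.tools.utils` as in the module's imports, and a plain PyTorch `DataLoader` with an identity collate stands in for MACE's own graph-aware loader:

```python
from torch.utils.data import DataLoader

from mace.data.hdf5_dataset import HDF5Dataset, dataset_from_sharded_hdf5
from mace.tools.utils import AtomicNumberTable

z_table = AtomicNumberTable([1, 6, 8])  # hypothetical element set

# A single preprocessed file: a flat index is mapped to
# (config_batch_i, config_j) groups and an AtomicData graph is built on the fly.
train_set = HDF5Dataset("processed_data/train.h5", r_max=4.5, z_table=z_table)
first_graph = train_set[0]

# For a directory of shards (one HDF5 file per preprocessing worker):
# train_set = dataset_from_sharded_hdf5("processed_data/train", z_table=z_table, r_max=4.5)

# Batching sketch only: real training would use a collate function that
# merges AtomicData graphs, not this identity stand-in.
loader = DataLoader(train_set, batch_size=4, collate_fn=lambda batch: batch)
for graphs in loader:
    break  # `graphs` is a list of 4 AtomicData objects
```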
31 changes: 19 additions & 12 deletions mace/data/neighborhood.py
@@ -1,13 +1,7 @@
-###########################################################################################
-# Neighborhood construction
-# Authors: Ilyes Batatia, Gregor Simm
-# This program is distributed under the MIT License (see MIT.md)
-###########################################################################################
-
from typing import Optional, Tuple

-import ase.neighborlist
import numpy as np
+from matscipy.neighbours import neighbour_list


def get_neighborhood(
@@ -16,24 +10,37 @@ def get_neighborhood(
    pbc: Optional[Tuple[bool, bool, bool]] = None,
    cell: Optional[np.ndarray] = None,  # [3, 3]
    true_self_interaction=False,
-) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
+) -> Tuple[np.ndarray, np.ndarray]:
    if pbc is None:
        pbc = (False, False, False)

    if cell is None or cell.any() == np.zeros((3, 3)).any():
        cell = np.identity(3, dtype=float)

    assert len(pbc) == 3 and all(isinstance(i, (bool, np.bool_)) for i in pbc)
    assert cell.shape == (3, 3)

+    pbc_x = pbc[0]
+    pbc_y = pbc[1]
+    pbc_z = pbc[2]
+    identity = np.identity(3, dtype=float)
+    max_positions = np.max(np.absolute(positions)) + 1
+    # Extend cell in non-periodic directions
+    if not pbc_x:
+        cell[:, 0] = max_positions * 5 * cutoff * identity[:, 0]
+    if not pbc_y:
+        cell[:, 1] = max_positions * 5 * cutoff * identity[:, 1]
+    if not pbc_z:
+        cell[:, 2] = max_positions * 5 * cutoff * identity[:, 2]

-    sender, receiver, unit_shifts = ase.neighborlist.primitive_neighbor_list(
+    sender, receiver, unit_shifts = neighbour_list(
        quantities="ijS",
        pbc=pbc,
        cell=cell,
        positions=positions,
        cutoff=cutoff,
-        self_interaction=True,  # we want edges from atom to itself in different periodic images
-        use_scaled_positions=False,  # positions are not scaled positions
+        # self_interaction=True,  # we want edges from atom to itself in different periodic images
+        # use_scaled_positions=False,  # positions are not scaled positions
    )

if not true_self_interaction:
(diff truncated)
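To illustrate the change, here is a minimal sketch that calls the rewritten function on a fully non-periodic configuration. The coordinates are made up, and the two-array unpacking follows the updated return annotation shown above:

```python
import numpy as np

from mace.data import get_neighborhood

# Three atoms with open boundaries; the function extends the cell internally
# so matscipy's neighbour_list can run without real periodic images.
positions = np.array(
    [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]
)
edge_index, shifts = get_neighborhood(positions=positions, cutoff=4.5)
print(edge_index.shape)  # expected (2, n_edges): sender/receiver indices
```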