Improve more on new dataset API #434

zasdfgbnm · 2020-03-22T00:27:47Z

The new suggested way to use the dataset API is:

training, validation = torchani.data.load(dspath).subtract_self_energies(energy_shifter).species_to_indices().shuffle().split(0.8, None)
training = training.collate(batch_size).cache()
validation = validation.collate(batch_size).cache()

which is cleaner than the previous version:

dataset = torchani.data.load(dspath).subtract_self_energies(energy_shifter).species_to_indices().shuffle()
size = len(dataset)
training, validation = dataset.split(int(0.8 * size), None)
training = training.collate(batch_size).cache()
validation = validation.collate(batch_size).cache()
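The difference between the two forms is that `split` now accepts a ratio instead of a precomputed count, with `None` standing for "whatever is left". A minimal sketch of how such arguments could be resolved, assuming floats are fractions and ints are absolute counts; `split_sizes` and `split` here are hypothetical helpers for illustration, not the torchani implementation:

```python
def split_sizes(total, *parts):
    """Hypothetical helper: turn split arguments into absolute sizes.

    Floats are read as fractions of `total`, ints as absolute counts,
    and a single None takes the remainder.
    """
    sizes = []
    for p in parts:
        if p is None:
            sizes.append(None)
        elif isinstance(p, float):
            sizes.append(int(p * total))
        else:
            sizes.append(p)
    if sizes.count(None) > 1:
        raise ValueError("at most one part may be None")
    used = sum(s for s in sizes if s is not None)
    if used > total:
        raise ValueError("requested parts exceed the dataset size")
    return [total - used if s is None else s for s in sizes]


def split(items, *parts):
    """Slice `items` into consecutive pieces of the resolved sizes."""
    out, start = [], 0
    for size in split_sizes(len(items), *parts):
        out.append(items[start:start + size])
        start += size
    return out


data = list(range(10))
training, validation = split(data, 0.8, None)  # ratio form
training2, validation2 = split(data, 8, None)  # count form, same result here
```

With this reading, the ratio form removes the need to call `len(dataset)` at the call site, which is what makes the new usage shorter.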

Memory usage drops from the previous 18GB to 16GB. This is a small improvement, but it is hard to match the previous API by @yueyericardo: we no longer chunk the data, so we have to spend a lot of memory storing padding.
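To see why padding costs memory, here is a rough sketch that counts the atom slots stored when every batch is padded to its largest molecule. It assumes that "chunking" means grouping molecules of similar size before batching, which is an interpretation rather than something stated above; the atom counts are made up and `padded_cells` is a hypothetical helper:

```python
def padded_cells(sizes, batch_size):
    """Total atom slots stored when each batch is padded to its largest molecule."""
    total = 0
    for i in range(0, len(sizes), batch_size):
        batch = sizes[i:i + batch_size]
        total += max(batch) * len(batch)
    return total


# Made-up atom counts per molecule, in an arbitrary (shuffled) order.
sizes = [3, 20, 5, 18, 4, 21, 6, 19]

real = sum(sizes)                                     # slots holding actual atoms
shuffled = padded_cells(sizes, batch_size=4)          # small and large mixed per batch
grouped = padded_cells(sorted(sizes), batch_size=4)   # similar sizes batched together

print(real, shuffled, grouped)
```

In this toy case the mixed batches store noticeably more slots than the grouped ones, and the excess is pure padding.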

If we don't want to use that much memory, we can remove the .cache() from the last two lines, but we now need a data loader to achieve comparable performance:

training, validation = torchani.data.load(dspath).subtract_self_energies(energy_shifter).species_to_indices().shuffle().split(0.8, None)
training = torch.utils.data.DataLoader(list(training), batch_size=batch_size, collate_fn=torchani.data.collate_fn, num_workers=64)
validation = torch.utils.data.DataLoader(list(validation), batch_size=batch_size, collate_fn=torchani.data.collate_fn, num_workers=64)

The above code uses ~9GB of RAM.
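The saving comes from not keeping every padded batch in memory at once. The sketch below illustrates that trade-off with plain Python lists; it is a stand-in, not `torch.utils.data.DataLoader` or `torchani.data.collate_fn`, and the helper names and the padding value `0` are assumptions made for illustration:

```python
def collate(samples):
    """Stand-in collate step: pad each sample to the longest one in the batch."""
    width = max(len(s) for s in samples)
    return [s + [0] * (width - len(s)) for s in samples]


def lazy_batches(samples, batch_size):
    """Yield one padded batch at a time; nothing is retained after it is consumed."""
    for i in range(0, len(samples), batch_size):
        yield collate(samples[i:i + batch_size])


samples = [[1], [2, 2, 2], [3, 3], [4, 4, 4, 4]]

# Cached: every padded batch is built once and held in memory.
cached = list(lazy_batches(samples, 2))

# Lazy: only the current padded batch exists, but it is rebuilt on every pass.
seen = 0
for batch in lazy_batches(samples, 2):
    seen += len(batch)
```

Rebuilding batches on every pass is the extra work that a data loader with worker processes is meant to hide.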

farhadrgh

LGTM

zasdfgbnm added 3 commits March 21, 2020 16:40

Improve new dataset API

5b3d892

Merge branch 'master' of github.com:aiqm/torchani into improve-more

10765ef

Improve more on new dataset API

42747a9

zasdfgbnm requested a review from farhadrgh as a code owner March 22, 2020 00:27

zasdfgbnm added 2 commits March 21, 2020 17:30

fix

19638eb

fix reentrance

9e0b288

zasdfgbnm requested review from IgnacioJPickering and yueyericardo as code owners March 22, 2020 00:51

zasdfgbnm added 5 commits March 21, 2020 18:09

Allow all intermediate state of transformation to be reentered

8b315d4

Add length inference

eeb8c09

fix

235b5e4

split by ratio

220b791

add dataloader example

7906c14

zasdfgbnm changed the title from "[WIP] Improve more on new dataset API" to "Improve more on new dataset API" Mar 22, 2020

add test for data loader

9a23a74

farhadrgh approved these changes Mar 22, 2020

farhadrgh merged commit f8edffe into master Mar 22, 2020

farhadrgh deleted the improve-more branch March 22, 2020 15:38
