On the kind of hardware I have access to:

- It's not practical to encode the entire training set at once: it takes days, and I run out of memory or disk space.
- It's not practical to train an entire epoch uninterrupted. Either I'm using my personal laptop, which I sometimes want to use for other stuff, or I'm using cheap EC2 spot instances, which can be preempted at any time.
My proposal is:
1. Divide the corpus into manageable chunks, let's say 100K positions each.
2. Build an index so that, for every training chunk, you can quickly find the SGFs containing that chunk's board positions.
3. Encode and train on one chunk at a time, saving a progress checkpoint (including model weights) after each chunk. You can even train chunk N on the GPU while you encode chunk N + 1 on the CPU (see the sketch after this list).
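
Here's a minimal sketch of what that loop could look like, just to make the resumability concrete. The names here (`encode_chunk`, `chunk_index`, the checkpoint files) are placeholders for illustration, not the actual API in my branch:

```python
# Minimal sketch of the chunk-by-chunk loop with checkpointing.
# encode_chunk and chunk_index are illustrative placeholders,
# not the real code in the branch.
import json
import os

CHECKPOINT = 'progress.json'  # records which chunks are already finished
WEIGHTS = 'weights.h5'


def load_progress():
    """Return the set of chunk ids that have already been trained on."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return set(json.load(f))
    return set()


def save_progress(done):
    with open(CHECKPOINT, 'w') as f:
        json.dump(sorted(done), f)


def train_in_chunks(model, chunk_index, encode_chunk):
    """Train one chunk at a time, checkpointing after each chunk.

    chunk_index maps chunk id -> the SGF files covering that chunk's positions.
    encode_chunk(sgf_files) returns (X, y) arrays for one chunk.
    """
    done = load_progress()
    for chunk_id in sorted(chunk_index):
        if chunk_id in done:
            continue                                # resume where we left off
        X, y = encode_chunk(chunk_index[chunk_id])  # CPU-bound encoding
        model.fit(X, y, batch_size=128, epochs=1)   # train on this chunk
        model.save_weights(WEIGHTS)                 # checkpoint model weights
        done.add(chunk_id)
        save_progress(done)
```

The point of the progress file is that a preempted spot instance (or a laptop I need back) can pick up at the next unfinished chunk instead of redoing the whole epoch.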
It's hard to tell precisely how much overhead this adds. The chunking definitely makes encoding the game records slower; my guess is about 25% slower, but I did not take careful measurements.
But! Since you can do the encoding in parallel with training the network, the overall training process is still very efficient: it's basically limited by how fast you can train the network.
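
The overlap itself doesn't need anything fancy. A rough sketch, again with placeholder `encode_chunk` and `model`, using a single worker process to encode the next chunk while the current one trains:

```python
# Rough sketch of overlapping CPU encoding with GPU training: a worker process
# encodes chunk N + 1 while the model trains on chunk N. encode_chunk and
# model are placeholders, as in the sketch above.
from concurrent.futures import ProcessPoolExecutor


def train_pipelined(model, chunk_index, encode_chunk):
    chunk_ids = sorted(chunk_index)
    if not chunk_ids:
        return
    with ProcessPoolExecutor(max_workers=1) as pool:
        # Start encoding the first chunk before entering the loop.
        pending = pool.submit(encode_chunk, chunk_index[chunk_ids[0]])
        for i, chunk_id in enumerate(chunk_ids):
            X, y = pending.result()  # wait for chunk i's encoding to finish
            if i + 1 < len(chunk_ids):
                # Kick off encoding of chunk i + 1 while we train on chunk i.
                next_id = chunk_ids[i + 1]
                pending = pool.submit(encode_chunk, chunk_index[next_id])
            model.fit(X, y, batch_size=128, epochs=1)
            model.save_weights('weights.h5')  # checkpoint after each chunk
```

Since encoding a chunk takes less time than training on it (which is why the whole thing ends up GPU-bound), the worker process generally has the next chunk ready by the time training finishes.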
On an Amazon p2.xlarge instance it took something like 30 hours to do a full training epoch on 30 million positions (that counts indexing, encoding, and training). I can't really compare to the old method directly because I never succeeded in training on a set that large.
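That works out to roughly 280 positions per second end to end (30 million positions over about 108,000 seconds), with indexing and encoding included in that time.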
I started implementing this in my https://github.com/macfergus/betago/tree/large-training-set branch. It's not perfect, but I have successfully completed larger training runs than I could do before.