
large training set management #36

Open
macfergus opened this issue Mar 9, 2017 · 2 comments

Comments

@macfergus
Collaborator

On the kind of hardware I have access to:

  1. It's not practical to encode the entire training set at once: it takes days, and I run out of memory or disk space.
  2. It's not practical to train an entire epoch uninterrupted. Either I'm using my personal laptop which I sometimes want to use for other stuff, or I'm using cheap EC2 spot instances which can be pre-empted at any time.

My proposal is:

  1. Divide a corpus into manageable chunks, let's say 100K positions.
  2. Build an index where, for every training chunk, you can quickly find the SGFs containing that chunk's board positions.
  3. Encode and train on one chunk at a time, saving a progress checkpoint (including model weights) after each chunk. You can even train chunk N on the GPU while you encode chunk N + 1 on the CPU.
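A minimal sketch of what step 3 could look like with a Keras-style model; the `chunk_index` and `encode_chunk` helpers and the checkpoint naming are hypothetical, not the exact code in the branch:

```python
import os

def train_in_chunks(model, chunk_index, encode_chunk, checkpoint_dir='checkpoints'):
    """Encode and train one chunk at a time, checkpointing after each chunk.

    chunk_index: list where entry i says which SGF records make up chunk i.
    encode_chunk: callable turning one index entry into (features, labels) arrays.
    """
    if not os.path.isdir(checkpoint_dir):
        os.makedirs(checkpoint_dir)
    for chunk_no, chunk_spec in enumerate(chunk_index):
        X, y = encode_chunk(chunk_spec)
        model.fit(X, y, batch_size=128, epochs=1)
        # Save model weights so an interrupted run can resume at the next chunk.
        model.save(os.path.join(checkpoint_dir, 'chunk_%05d.h5' % chunk_no))
```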

I started implementing this in my https://github.com/macfergus/betago/tree/large-training-set branch. It's not perfect, but I have successfully completed larger training runs than I could before.

@rocketinventor
Contributor

@macfergus Sounds interesting. How do the extra steps of dividing and indexing affect performance?

@macfergus
Collaborator Author

It's hard to tell precisely. The chunking definitely adds some overhead to encoding the game records. I would guess it's about 25% slower, but I did not take any careful measurements.

But! Since you can do the encoding in parallel with training the network, the entire training process is quite efficient: it's basically limited by how fast you can train the network.
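One way to arrange that overlap is to encode the next chunk in a worker process while the main process trains on the current one. This is just a sketch, reusing the hypothetical `encode_chunk` and `chunk_index` names from above, not the actual code in the branch:

```python
from multiprocessing import Pool

def train_pipelined(model, chunk_index, encode_chunk):
    """Encode chunk N+1 on the CPU while the GPU trains on chunk N."""
    pool = Pool(processes=1)
    pending = pool.apply_async(encode_chunk, (chunk_index[0],))
    for chunk_no in range(len(chunk_index)):
        X, y = pending.get()  # wait for the chunk encoded in the background
        if chunk_no + 1 < len(chunk_index):
            # Kick off encoding of the next chunk before training blocks this process.
            pending = pool.apply_async(encode_chunk, (chunk_index[chunk_no + 1],))
        model.fit(X, y, batch_size=128, epochs=1)
    pool.close()
    pool.join()
```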

On an Amazon p2.xlarge instance it took something like 30 hours to do a full training epoch on 30 million positions (that counts indexing, encoding, and training). I can't really compare to the old method directly because I never succeeded in training a set that large.
