On the kind of hardware I have access to:

- It's not practical to encode the entire training set at once: it takes days, and I run out of memory or disk space.
- It's not practical to train an entire epoch uninterrupted. Either I'm using my personal laptop, which I sometimes want to use for other stuff, or I'm using cheap EC2 spot instances, which can be preempted at any time.
My proposal is:
1. Divide the corpus into manageable chunks, let's say 100K positions each.
2. Build an index so that, for every training chunk, you can quickly find the SGFs containing that chunk's board positions.
3. Encode and train on one chunk at a time, saving a progress checkpoint (including model weights) after each chunk. You can even train chunk N on the GPU while you encode chunk N + 1 on the CPU (see the sketch after this list).
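
Here's a minimal sketch of what that loop could look like, just to make the resumability concrete. The names here (`encode_chunk`, `chunk_index`, the checkpoint files) are placeholders for illustration, not the actual API in my branch:

```python
# Minimal sketch of the chunk-by-chunk loop with checkpointing.
# encode_chunk and chunk_index are illustrative placeholders,
# not the real code in the branch.
import json
import os

CHECKPOINT = 'progress.json'  # records which chunks are already finished
WEIGHTS = 'weights.h5'


def load_progress():
    """Return the set of chunk ids that have already been trained on."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return set(json.load(f))
    return set()


def save_progress(done):
    with open(CHECKPOINT, 'w') as f:
        json.dump(sorted(done), f)


def train_in_chunks(model, chunk_index, encode_chunk):
    """Train one chunk at a time, checkpointing after each chunk.

    chunk_index maps chunk id -> the SGF files covering that chunk's positions.
    encode_chunk(sgf_files) returns (X, y) arrays for one chunk.
    """
    done = load_progress()
    for chunk_id in sorted(chunk_index):
        if chunk_id in done:
            continue                                # resume where we left off
        X, y = encode_chunk(chunk_index[chunk_id])  # CPU-bound encoding
        model.fit(X, y, batch_size=128, epochs=1)   # train on this chunk
        model.save_weights(WEIGHTS)                 # checkpoint model weights
        done.add(chunk_id)
        save_progress(done)
```

The point of the progress file is that a preempted spot instance (or a laptop I need back) can pick up at the next unfinished chunk instead of redoing the whole epoch.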
It's hard to tell precisely how much overhead this adds. The chunking definitely makes encoding the game records slower; my guess is about 25% slower, but I did not take careful measurements.
But! Since you can do the encoding in parallel with training the network, the overall training process is still very efficient: it's basically limited by how fast you can train the network.
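
The overlap itself doesn't need anything fancy. A rough sketch, again with placeholder `encode_chunk` and `model`, using a single worker process to encode the next chunk while the current one trains:

```python
# Rough sketch of overlapping CPU encoding with GPU training: a worker process
# encodes chunk N + 1 while the model trains on chunk N. encode_chunk and
# model are placeholders, as in the sketch above.
from concurrent.futures import ProcessPoolExecutor


def train_pipelined(model, chunk_index, encode_chunk):
    chunk_ids = sorted(chunk_index)
    if not chunk_ids:
        return
    with ProcessPoolExecutor(max_workers=1) as pool:
        # Start encoding the first chunk before entering the loop.
        pending = pool.submit(encode_chunk, chunk_index[chunk_ids[0]])
        for i, chunk_id in enumerate(chunk_ids):
            X, y = pending.result()  # wait for chunk i's encoding to finish
            if i + 1 < len(chunk_ids):
                # Kick off encoding of chunk i + 1 while we train on chunk i.
                next_id = chunk_ids[i + 1]
                pending = pool.submit(encode_chunk, chunk_index[next_id])
            model.fit(X, y, batch_size=128, epochs=1)
            model.save_weights('weights.h5')  # checkpoint after each chunk
```

Since encoding a chunk takes less time than training on it (which is why the whole thing ends up GPU-bound), the worker process generally has the next chunk ready by the time training finishes.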
On an Amazon p2.xlarge instance it took something like 30 hours to do a full training epoch on 30 million positions (that counts indexing, encoding, and training). I can't really compare to the old method directly because I never succeeded in training on a set that large.
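That works out to roughly 280 positions per second end to end (30 million positions over about 108,000 seconds), with indexing and encoding included in that time.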
I started implementing this in my https://github.com/macfergus/betago/tree/large-training-set branch. It's not perfect, but I have successfully completed larger training runs than I could do before.