Checkpoint tfds data iterator #954

Open · wants to merge 7 commits into main
Conversation

mattdonati commented:
Enables deterministic training across preemptions when using the tfds pipeline by checkpointing the data iterator.

Creates a checkpoint handler for the data iterator that implements orbax.checkpoint.CheckpointHandler, similar to https://github.com/google/grain/blob/main/grain/_src/python/checkpoint_handlers.py. The handler uses tf.train.Checkpoint to save and restore the iterator.
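
For illustration, a minimal sketch of what such a handler could look like (the class name `TfdsIteratorCheckpointHandler` and the exact save/restore signatures are assumptions — orbax's CheckpointHandler interface has changed across versions — not the code in this PR):

```python
import orbax.checkpoint as ocp
import tensorflow as tf


class TfdsIteratorCheckpointHandler(ocp.CheckpointHandler):
  """Hypothetical handler: persists a tf.data iterator via tf.train.Checkpoint."""

  def save(self, directory, item):
    # Wrap the live iterator and serialize its state (including any
    # buffered elements) under the given checkpoint directory.
    tf.train.Checkpoint(iterator=item).write(str(directory / "tfds_iterator"))

  def restore(self, directory, item):
    # Restore state into the caller's iterator in place and return it.
    tf.train.Checkpoint(iterator=item).read(str(directory / "tfds_iterator"))
    return item
```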

Makes checkpointing the data iterator optional, since this method saves large checkpoints. Adds a boolean flag to base.yml.

Async checkpointing is handled at the level of the orbax checkpoint manager.
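
A rough sketch of that wiring, reusing the hypothetical handler above (the item names, the GCS path, and the dict-of-checkpointers CheckpointManager signature are assumptions that vary across orbax versions):

```python
import orbax.checkpoint as ocp

# The train state uses an async checkpointer; the tf.train.Checkpoint-based
# iterator handler is synchronous, so it gets a plain Checkpointer.
checkpointers = {
    "items": ocp.AsyncCheckpointer(ocp.PyTreeCheckpointHandler()),
    "iter": ocp.Checkpointer(TfdsIteratorCheckpointHandler()),
}
manager = ocp.CheckpointManager(
    "gs://base_output_directory/run_name/checkpoints", checkpointers
)
# manager.save(step, {"items": train_state, "iter": data_iterator})
```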

Updates the input pipeline description to reflect the option to checkpoint the tfds iterator.

@aireenmei (Collaborator) left a comment:
Thanks for using and contributing to our repo!
I would like to understand the use case for this feature. You may have seen in our docs that the Grain pipeline is optimized for checkpointing data iterators: it is efficient and saves a very small iterator checkpoint containing only indices, so it is the recommended option for use cases that need data-iterator checkpointing. Is there any difficulty in adopting Grain for your use case?

And I tested the branch on a v4-8 with this command: `python3 MaxText/train.py MaxText/configs/base.yml steps=20 per_device_batch_size=8.0 learning_rate=3e-4 enable_checkpointing=true base_output_directory=gs://aireenmei-multipod/tfds_ckpt dataset_path=gs://maxtext_dataset tfds_iter_checkpointing=True run_name=$(date +%m%d-$H%M) checkpoint_period=10` and got this error (there is no error if tfds_iter_checkpointing=False): https://gist.github.com/aireenmei/42a4c4e0dd8caed0b7ce8182f5ca8292
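
For reference, a rough sketch of the Grain-based setup recommended above (assuming grain's exported `PyGrainCheckpointHandler`; the path and iterator name are placeholders, not code from this PR):

```python
import grain.python as grain
import orbax.checkpoint as ocp

# Grain iterators serialize to a small per-process state (essentially
# indices), so the iterator checkpoint stays tiny regardless of data size.
iter_checkpointer = ocp.Checkpointer(grain.PyGrainCheckpointHandler())
# iter_checkpointer.save("gs://my-bucket/ckpts/0/iter", my_grain_iterator)
```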

@@ -43,6 +43,8 @@ async_checkpointing: True
checkpoint_period: 10_000
# enables one replica to read the ckpt then broadcast to the rest
enable_single_replica_ckpt_restoring: False
# enable checkpointing of tfds data iterator, for fully deterministic training. saves large checkpoints.
tfds_iter_checkpointing: True
Collaborator commented:

Since it saves large checkpoints, it would be better to set the default to False:

```
tfds_iter_checkpointing: True
```
Note that determinism with preemption requires checkpointing the data iterator, and the checkpoints will be larger in size.
Collaborator commented:

Could you add more details on what contributes to the size of the checkpoint? It would be good to provide a way to estimate the size, to help users plan.
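
One way such an estimate could be framed (a back-of-the-envelope sketch; the assumption that the shuffle buffer dominates, and all the numbers, are illustrative):

```python
# The iterator checkpoint serializes every element currently buffered by
# tf.data, so the shuffle buffer usually dominates its size.
shuffle_buffer_size = 10_000        # elements held in the shuffle buffer
bytes_per_example = 2048 * 4        # e.g. 2048 int32 tokens per example
prefetch_elements = 8               # small additional prefetch buffer

est_bytes = (shuffle_buffer_size + prefetch_elements) * bytes_per_example
print(f"~{est_bytes / 1e6:.0f} MB per host")  # ≈ 82 MB per host
```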

Labels: none yet · Projects: none yet · 2 participants