[air] pyarrow.fs persistence (5/n): ray.train.Checkpoint save direction #37888
Conversation
# Save the rank of the worker that created this checkpoint.
metadata.update({CHECKPOINT_RANK_KEY: self.world_rank})
checkpoint_to_report.set_metadata(metadata)
Hmm we shouldn't set this on the actual checkpoint right?
This is the internal metadata that I want to attach to a checkpoint. We should be able to know the metrics that were reported with a checkpoint (useful for tracking top k checkpoints).
We currently have a few metadata concepts that should be cleaned up:
- The old `.metadata.pkl` file. This contained the checkpoint class information and was used for framework-specific checkpoints, but it's not needed anymore.
- The new `Checkpoint.set_metadata`. User-facing metadata.
- Internal metadata that was previously stored in the `_TrackedCheckpoint`. This is pretty much just the metrics reported with the checkpoint. This PR adds it as part of the user-facing metadata.
  - Q: Would it be better to separate 2 and 3? Ex: we can dump internal metadata in a separate `.internal_metadata.json` file.
- Internal metadata that is stored in the `.tune_metadata`. This consists of trainable state and can probably be removed in the new path.
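To make the distinction concrete, here is a minimal sketch of how (2) user-facing metadata and (3) the reported metrics this PR attaches could look with the new API; the `_reported_metrics` key is purely hypothetical, not an actual Ray convention:

```python
# Sketch only: user metadata vs. internally attached metrics on the new
# ray.train.Checkpoint. The "_reported_metrics" key is hypothetical.
import tempfile

from ray.train import Checkpoint

checkpoint_dir = tempfile.mkdtemp()  # stand-in for a real checkpoint directory
checkpoint = Checkpoint.from_directory(checkpoint_dir)

checkpoint.set_metadata({
    # (2) Opaque, user-facing metadata.
    "framework": "torch",
    # (3) Metrics reported alongside the checkpoint, folded into the same
    # metadata store by this PR (key name is illustrative).
    "_reported_metrics": {"training_iteration": 10, "loss": 0.12},
})
```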
Those metadata are available in the Result object though as metrics right? It's weird to put a couple metrics in Checkpoint, which should be only opaque user data.
Afaik, the result metadata is stored in the trial dir separately already.
For clarity, I believe the structure we want is:
- Result
  - metrics (incl. Tune internal metadata/metrics)
  - Checkpoint
    - user data
    - user metadata
Yep, the full set of results is saved in the trial dir, but I think a checkpoint should still know the set of metrics that was reported along with it. This info is currently stored in a few places:
- The `experiment-state.json` file holds onto these `_TrackedCheckpoint` objects, which hold onto "metadata", which is the metrics reported with the checkpoint. This is at the experiment-level directory, and we're trying to move away from storing data in that file.
- The `.tune_metadata` stores certain parts of state, such as the training iteration, and it sits directly within the checkpoint folder.
I'm thinking we can do:

trial_dir/
    result.json                  <-- all metrics
    checkpoint_0/
        .internal_metadata.json  <-- metrics & other metadata we populate, associated with the checkpoint
        .user_metadata.json
        user data ...
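A rough sketch of what writing the proposed per-checkpoint files could look like; the helper and file names follow the layout above and are hypothetical, not something this PR implements:

```python
# Hypothetical helper matching the proposed layout above; not part of this PR.
import json
import os


def write_checkpoint_metadata(
    checkpoint_dir: str, reported_metrics: dict, user_metadata: dict
) -> None:
    # Internal metadata: metrics & other data populated for this checkpoint.
    with open(os.path.join(checkpoint_dir, ".internal_metadata.json"), "w") as f:
        json.dump({"metrics": reported_metrics}, f)
    # User metadata: opaque data supplied via Checkpoint.set_metadata.
    with open(os.path.join(checkpoint_dir, ".user_metadata.json"), "w") as f:
        json.dump(user_metadata, f)
```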
It's redundant with result.json etc., right? In result.json, we can store "checkpoint_number": "00001" etc. to connect which checkpoint should be associated with which metric.
Anyways, we shouldn't discuss this in the scope of 2.7; it's a new feature that we can tack on later once we finish the GA APIs as specced out.
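For reference, a tiny sketch of that association, treating result.json as newline-delimited JSON with a hypothetical `checkpoint_number` field:

```python
# Hypothetical sketch: look up the metrics row associated with a checkpoint,
# assuming each result.json line stores a "checkpoint_number" field.
import json


def metrics_for_checkpoint(result_json_path: str, checkpoint_number: str) -> dict:
    with open(result_json_path) as f:
        for line in f:
            row = json.loads(line)
            if row.get("checkpoint_number") == checkpoint_number:
                return row
    raise KeyError(checkpoint_number)
```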
    results, decode_checkpoint_fn=self._backend._decode_data
)
if _use_storage_context():
    self._process_checkpoint_results(results)
Do we still need to involve the checkpoint manager to delete old checkpoints, etc?
The Train checkpoint manager is already disabled on purpose and basically doesn't do anything. Tune's checkpoint manager is the one that actually keeps track of the heap of top-K checkpoints and does deletion. I think we should just keep it this way for now, and fully avoid the Train checkpoint manager.
Here's where it will happen:
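For intuition, a minimal sketch of that kind of top-K retention (illustrative only; the names are hypothetical and this is not Ray's actual checkpoint manager):

```python
# Illustrative top-K checkpoint retention via a min-heap keyed on score.
import heapq
import shutil
from dataclasses import dataclass, field
from typing import List


@dataclass(order=True)
class _ScoredCheckpoint:
    score: float
    path: str = field(compare=False)


class TopKCheckpointManager:
    def __init__(self, k: int):
        self.k = k
        self._heap: List[_ScoredCheckpoint] = []  # worst checkpoint at the top

    def register(self, path: str, score: float) -> None:
        heapq.heappush(self._heap, _ScoredCheckpoint(score, path))
        if len(self._heap) > self.k:
            # Evict the lowest-scoring checkpoint and delete its directory.
            worst = heapq.heappop(self._heap)
            shutil.rmtree(worst.path, ignore_errors=True)
```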
if TYPE_CHECKING:
    from ray.train._checkpoint import Checkpoint
This circular dependency is:
`ray.train._internal.storage` -> `ray.train._checkpoint` -> `ray.train._internal.storage._download_fs_path`
We can solve this by moving the filesystem utils to a different file.
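For reference, the `TYPE_CHECKING` guard in the quoted diff breaks the cycle for annotations only; a minimal sketch of that pattern (the function name here is illustrative):

```python
# The import below runs only under static type checkers, never at runtime,
# so it cannot trigger the circular import.
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from ray.train._checkpoint import Checkpoint


def _download_checkpoint(checkpoint: Checkpoint, local_path: str) -> None:
    """Annotations can reference Checkpoint without a runtime import."""
    ...
```

Moving the filesystem utils out of `ray.train._internal.storage` then removes the runtime edge of the cycle entirely.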
from ray.train._internal.checkpoint_manager import (
    _CheckpointManager as _NewCheckpointManager,
)
This circular dependency is caused by:
`ray.train` -> `ray.train.trainer.TrainingIterator` -> `ray.tune` -> `train._internal.CheckpointManager` -> `ray.train.CheckpointConfig`
This will be fixed when `TrainingIterator` no longer needs to depend on `tune.TrainableUtil`. The code already has a TODO for deletion.
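Until then, a common interim workaround (illustrative only, not necessarily what this PR does) is to defer the import to call time:

```python
# Lazy import: the cyclic module is only imported when the function is called,
# after all modules have finished loading.
def _get_new_checkpoint_manager_cls():
    from ray.train._internal.checkpoint_manager import (
        _CheckpointManager as _NewCheckpointManager,
    )

    return _NewCheckpointManager
```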
One idea on simplifying the report() case for multi-worker, but looks good to me either way.
"Report (metrics, checkpoint) to the Tune session:\n" | ||
f" metrics={metrics}\n checkpoint={checkpoint}" | ||
) | ||
train.report(metrics, checkpoint=checkpoint) |
Btw, a way to make this perhaps easier to read is to not use the public `report` API for this inner reporting, but call a private `_report` that doesn't do the repeated checkpoint uploading.
However, this is a minor comment.
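A rough sketch of that suggestion (all names here are hypothetical, not the actual Ray session API):

```python
# Hypothetical: the Train worker session hands an already-persisted checkpoint
# to the Tune session without re-running the upload path in report().
class _TrainSession:
    def __init__(self, tune_session):
        self._tune_session = tune_session

    def _report(self, metrics: dict, persisted_checkpoint=None) -> None:
        # Forward Train -> Tune directly; the checkpoint was already uploaded
        # by the worker, so no second persistence pass is needed.
        self._tune_session.report(metrics, checkpoint=persisted_checkpoint)
```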
@ericl I think that makes sense -- I think this will be easier to do when we unify the 2 sessions -- just introduce a private method on the session that does the Train -> Tune communication.
Let's merge with what we have for now
Actually, hold off on that, need to rebase and up the test timeout limit.
Why are these changes needed?
This PR:
1. Uses the storage context to upload the new `ray.train.Checkpoint` (from [air] `pyarrow.fs` persistence (3/n): Introduce new `Checkpoint` API #37925) directly from the Train worker.
2. Gets checkpoint reporting to work in the save direction, simplifying the checkpoint handling logic to avoid the Train `CheckpointManager` and use a single, simplified checkpoint manager (from [air] `pyarrow.fs` persistence (4/n): Introduce a simplified checkpoint manager #37962).
3. Updates the e2e test to check for worker-uploaded checkpoints.

Follow-ups needed
1. `Trial` path resolution is still messed up (using the legacy path), causing some issues with the custom fs test case. That test case skips some assertions at the moment. See: [air] `pyarrow.fs` persistence (6/n): Fix `Trial` + `Experiment` paths to use the `StorageContext` #38057.
2. Trial restoration is explicitly disabled at the moment. See PRs 7-9.
3. Artifacts are currently being synced by the driver because the Train worker lives on the same node, which is why the test case passes. This upload should be done from the worker, and the test case should be updated to check that.
4. The `on_checkpoint` hook for `tune.Callback` takes in a `_TrackedCheckpoint`. Currently, I skip invoking the callbacks -- TBD what to expose to the user callbacks here.
5. Checkpoints cannot be ordered based on auto-filled metrics at the moment, only user-specified metrics. Ex: `CheckpointConfig(checkpoint_score_attribute="training_iteration", mode="min")`. See: https://github.com/ray-project/ray/pull/38141/files#diff-c8a0efbd48da8eaa5c945c0423d35eaf8797a31b1f82178e5c567614c130d8ebR511-R512
Related issue number

Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.