Many frameworks set their default logging directory to the current working directory.
Train/Tune changes the working directory to the trial directory, and the contents of this directory can get synced to cloud unintentionally. This can cause double uploading of checkpoints (once for the Train checkpoint, and once as an artifact in the directory).
The uploading happens from either:
Driver syncing if the trial happens to live on the head node. This can be fixed by converting the sync exclude-list into an explicit include-list instead.
Trial artifact syncing enabled by SyncConfig(sync_artifacts=True). We should either recommend in the docs that users configure the logging directory of these frameworks to point to an external directory, or add a configurable artifact exclude-list.
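To illustrate why an include-list is safer than an exclude-list here, the following is a minimal sketch, not Ray's actual syncing code. The function names, file names, and patterns are all hypothetical and chosen only to show the difference in behavior: an exclude-list lets any unanticipated framework directory through, while an include-list uploads only what is explicitly named.

```python
from fnmatch import fnmatchcase

# Hypothetical contents of a trial directory after a framework wrote its
# logs into the working directory. These names are illustrative only.
trial_dir_contents = [
    "result.json",
    "progress.csv",
    "checkpoint_000000/model.pt",
    "lightning_logs/version_0/checkpoints/epoch=0.ckpt",
    "wandb/run-1/files/output.log",
]


def sync_with_exclude_list(paths, exclude_patterns):
    """Upload everything except paths matching an exclude pattern."""
    return [
        p for p in paths
        if not any(fnmatchcase(p, pat) for pat in exclude_patterns)
    ]


def sync_with_include_list(paths, include_patterns):
    """Upload only paths matching an include pattern."""
    return [
        p for p in paths
        if any(fnmatchcase(p, pat) for pat in include_patterns)
    ]


# Exclude-list (hypothetical pattern): framework logs slip through because
# nobody listed them in advance.
excluded_mode = sync_with_exclude_list(trial_dir_contents, ["checkpoint_*"])

# Include-list (hypothetical patterns): unknown directories are never uploaded.
included_mode = sync_with_include_list(
    trial_dir_contents, ["result.json", "progress.csv"]
)
```

With the exclude-list, the `lightning_logs` and `wandb` entries remain in the upload set; with the include-list, they are dropped without having to be anticipated.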
justinvyu added the P1, triage, train, and UX labels on Oct 24, 2023.
justinvyu changed the title from "[train] Logs from frameworks (lightning_logs, wandb) in the working directory can be synced unintentionally" to "[train] Logs from frameworks (lightning_logs, wandb, transformers output_dir) in the working directory can be synced unintentionally" on Feb 16, 2024.
Workaround 1: Set the framework log directory outside the experiment directory
One workaround is to set the log directory for these frameworks to a path outside the Ray Train experiment directory. (Many of them default to the current working directory of the training worker, which is inside the experiment directory.)
Hugging Face Transformers Trainer:
TrainingArguments(output_dir="/tmp/path")
Lightning Trainer:
pl.Trainer(default_root_dir="/tmp/path")
wandb:
wandb.init(dir="/tmp/path")
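The three settings above all take a directory path, so one way to keep them consistent is a small helper that creates a per-framework directory outside the trial directory. This is a sketch using only the standard library; `external_log_dir` and the `framework_logs` folder name are hypothetical, and it assumes the system temp directory is not itself inside the Ray Train experiment directory.

```python
import tempfile
from pathlib import Path


def external_log_dir(name: str) -> str:
    """Create and return a log directory under the system temp directory.

    Hypothetical helper: the "framework_logs" folder name is arbitrary.
    Assumes the temp directory lives outside the experiment directory.
    """
    log_dir = Path(tempfile.gettempdir()) / "framework_logs" / name
    log_dir.mkdir(parents=True, exist_ok=True)
    return str(log_dir)


# One directory per framework, so their outputs do not collide.
hf_dir = external_log_dir("transformers")
lightning_dir = external_log_dir("lightning")
wandb_dir = external_log_dir("wandb")

# These paths would then be passed to the calls listed above, for example:
#   TrainingArguments(output_dir=hf_dir)
#   pl.Trainer(default_root_dir=lightning_dir)
#   wandb.init(dir=wandb_dir)
```

Because the helper is called inside the training function, each worker creates the directory on its own node, so the logs stay local to that node and are not uploaded with the trial directory.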
Workaround 2: Disable CWD change behavior
Another workaround is to run with the environment variable RAY_CHDIR_TO_TRIAL_DIR=0.
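As a sketch, the variable can be set from the driver script itself rather than from the shell. This assumes it is set before any Ray Train/Tune run is started; whether the value reaches workers on other nodes in a given cluster setup should be verified rather than taken for granted.

```python
import os

# Set this at the top of the driver script, before starting a Train/Tune run.
# "0" disables changing the working directory to the trial directory.
os.environ["RAY_CHDIR_TO_TRIAL_DIR"] = "0"

# With the change disabled, frameworks that default to the current working
# directory should write relative to the launch directory instead of the
# trial directory (assumption: workers keep the original working directory).
launch_dir = os.getcwd()
```

The trade-off is that all workers on a node then share one working directory, so relative output paths written by different trials can collide.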