fix: use TF Tensorboard writer by default [DET-3353] #857
Looks good!
@@ -174,15 +174,14 @@ def _scan_checkpoint_directory(checkpoint_dir: str) -> List[Checkpoint]:
     return list(checkpoints.values())


-def move_tf_events(root_dir: str) -> None:
+def move_tf_events(event_dir: pathlib.Path) -> None:
praise: nice clean up here
@@ -147,15 +147,15 @@ def prepare_tensorboard(
         env, env.experiment_config["checkpoint_storage"], container_path
     )
     try:
-        from determined.tensorboard.metric_writers import pytorch
+        from determined.tensorboard.metric_writers import tensorflow
non-blocking: as mentioned, just make sure that this works well in containers without TensorFlow installed.
Checked and it works 🥳
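For readers wondering how such a check can be made safe in containers without TensorFlow, here is a minimal sketch of an optional-import fallback. It is not the code from this PR: `first_importable` is a hypothetical helper, and the fallback order (TensorFlow writer first, then the PyTorch writer) is an assumption based only on the two module names visible in the diff.

```python
import importlib
from types import ModuleType
from typing import Optional, Sequence


def first_importable(candidates: Sequence[str]) -> Optional[ModuleType]:
    """Return the first module in `candidates` that imports cleanly, else None.

    Hypothetical helper for illustration; not part of the determined package.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError too, e.g. when TensorFlow is absent.
            continue
    return None


if __name__ == "__main__":
    # Assumed preference order: TF writer first, PyTorch writer as a fallback.
    writer_module = first_importable(
        [
            "determined.tensorboard.metric_writers.tensorflow",
            "determined.tensorboard.metric_writers.pytorch",
        ]
    )
    print("selected writer module:", writer_module)
```

Returning `None` instead of raising leaves it to the caller to decide whether a missing writer is fatal.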
If a dispatch suddenly disappeared from the launcher (404) without any action, monitoring was dropped without any notification of job completion. On 404, notify that the job was lost and terminate.
Description
It appears that using the PyTorch TensorBoard writer led to TF events coming out in a "strange" state.
Test Plan
Ran several experiments and observed that the file naming changed: previously there was always a file ending in 1.0, and now there is not. That file seems to have been the cause of some of the issues. Going to test this out with a container that doesn't have TF.
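One way to repeat the file-naming check from the test plan is to scan an event directory for files whose names end in the suffix in question. The sketch below is illustrative only: `files_with_suffix` is a hypothetical helper, and the directory to scan is whatever location the events are written to in a given setup.

```python
import pathlib
from typing import List


def files_with_suffix(event_dir: pathlib.Path, suffix: str) -> List[pathlib.Path]:
    """Recursively list files under `event_dir` whose names end with `suffix`.

    Hypothetical helper for manual verification; not part of this PR.
    """
    return sorted(
        path
        for path in event_dir.rglob("*")
        if path.is_file() and path.name.endswith(suffix)
    )
```

An empty result for the suffix `1.0` after a run would match the behaviour described above.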