improve TensorBoard instructions in README
ghstack-source-id: 7dc4a80cf9c32f4dca3d00bcef019d256bdf58f7
Pull Request resolved: #96
tianyu-l committed Feb 27, 2024
1 parent 254279f commit 4c03475
Showing 1 changed file with 5 additions and 5 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -32,21 +32,21 @@ run the llama debug model locally to verify the setup is correct:

 # TensorBoard

-To visualize training metrics on TensorBoard:
+To visualize TensorBoard metrics of models trained on a remote server via a local web browser:

-1. (by default) set `enable_tensorboard = true` in `torchtrain/train_configs/train_config.toml`
+1. Make sure `metrics.enable_tensorboard` option is set to true in model training (either from a .toml file or from CLI).

-2. set up SSH tunneling
+2. Set up SSH tunneling, by running the following from local CLI
 ```
 ssh -L 6006:127.0.0.1:6006 [username]@[hostname]
 ```

-3. then in the torchtrain repo
+3. Inside the SSH tunnel that logged into the remote server, go to the torchtrain repo, and start the TensorBoard backend
 ```
 tensorboard --logdir=./torchtrain/outputs/tb
 ```

-4. go to the URL it provides OR to http://localhost:6006/
+4. In the local web browser, go to the URL it provides OR to http://localhost:6006/.

 ## Multi-Node Training
 For training on ParallelCluster/Slurm type configurations, you can use the multinode_trainer.slurm file to submit your sbatch job.</br>
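
The TensorBoard steps in this change can be sketched as a small helper that assembles the two commands. This is only an illustration, not part of the commit: the function names `tunnel_command` and `tensorboard_command`, the sample user `alice`, and the host `gpu-box.example.com` are hypothetical, and passing `--port` explicitly is an assumption made so that the forwarded port and the TensorBoard port visibly match (the README relies on the default of 6006).

```python
from __future__ import annotations

import shlex


def tunnel_command(username: str, hostname: str, port: int = 6006) -> list[str]:
    """Build the SSH command run on the LOCAL machine (step 2).

    It forwards local `port` to 127.0.0.1:`port` on the remote server.
    """
    return ["ssh", "-L", f"{port}:127.0.0.1:{port}", f"{username}@{hostname}"]


def tensorboard_command(
    logdir: str = "./torchtrain/outputs/tb", port: int = 6006
) -> list[str]:
    """Build the command run on the REMOTE server, from the torchtrain repo (step 3).

    `--port` is spelled out here (an assumption, see the lead-in) so it
    matches the forwarded port.
    """
    return ["tensorboard", f"--logdir={logdir}", f"--port={port}"]


if __name__ == "__main__":
    # Hypothetical username and hostname, for illustration only.
    print(shlex.join(tunnel_command("alice", "gpu-box.example.com")))
    print(shlex.join(tensorboard_command()))
    # Step 4: open this URL in the local web browser.
    print("http://localhost:6006/")
```

The commands are built as argument lists rather than strings so they can be passed to `subprocess.run` without shell quoting concerns; `shlex.join` is used only to print them in copy-pasteable form.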