[Docs] clarify launch port
Co-authored-by: Edenzzzz <[email protected]>
Edenzzzz and Edenzzzz authored Aug 7, 2024
1 parent fe71917 commit 9179d40
Showing 2 changed files with 7 additions and 6 deletions.
7 changes: 4 additions & 3 deletions docs/source/en/basics/launch_colossalai.md
@@ -131,17 +131,18 @@ with one simple command. There are two ways you can launch multi-node jobs.

This is suitable when you only have a few nodes. Let's say I have two nodes, namely `host1` and `host2`, I can start
multi-node training with the following command. Compared to single-node training, you must specify the `master_addr`
-option, which is auto-set to localhost if running on a single node only.
+option, which is auto-set to localhost if running on a single node only. \
+Additionally, you must also ensure that all nodes share the same open ssh port, which can be specified using --ssh-port.

:::caution

-`master_addr` cannot be localhost when running on multiple nodes, it should be the hostname or IP address of a node.
+`master_addr` cannot be localhost when running on multiple nodes, it should be the **hostname or IP address** of a node.

:::

```shell
# run on these two nodes
-colossalai run --nproc_per_node 4 --host host1,host2 --master_addr host1 test.py
+colossalai run --nproc_per_node 4 --host host1,host2 --master_addr host1 test.py --ssh-port 22
```
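
As a rough illustration of the two constraints this hunk documents (a non-loopback `master_addr` and a shared SSH port), the sketch below assembles the launch command from a host list. `build_launch_cmd` is a hypothetical helper, not part of ColossalAI. It assumes the first host in the list acts as the master, and it reuses the flags and their order exactly as they appear in the diff.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ColossalAI): assemble a multi-node launch command.
# Usage: build_launch_cmd <hosts_csv> <ssh_port> <script>
build_launch_cmd() {
  local hosts="$1" ssh_port="$2" script="$3"
  # Assumption: the first host in the comma-separated list acts as the master node.
  local master="${hosts%%,*}"

  # Mirror the caution in the docs: a loopback master_addr is unreachable from other nodes.
  case "$master" in
    localhost|127.0.0.1)
      echo "error: master_addr must be a hostname or IP address, got '$master'" >&2
      return 1
      ;;
  esac

  # Flag order follows the command shown in the diff.
  echo "colossalai run --nproc_per_node 4 --host ${hosts} --master_addr ${master} ${script} --ssh-port ${ssh_port}"
}
```

Calling `build_launch_cmd host1,host2 22 test.py` prints the command rather than running it, so the result can be inspected before launching.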
- Run with `--hostfile`

6 changes: 3 additions & 3 deletions docs/source/zh-Hans/basics/launch_colossalai.md
@@ -116,17 +116,17 @@ colossalai run --nproc_per_node 4 --master_port 29505 test.py
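- Launch with `--hosts`

This approach suits cases with only a few nodes. Suppose we have two nodes, `host1` and `host2`. We can run multi-node training with the following command.
-Compared with single-node training, multi-node training requires setting `--master_addr` manually (in single-node training, `master_addr` defaults to `127.0.0.1`).
+Compared with single-node training, multi-node training requires setting `--master_addr` manually (in single-node training, `master_addr` defaults to `127.0.0.1`). You also need to make sure that every node uses the same SSH port, which can be set with --ssh-port.

:::caution

-For multi-node training, `master_addr` cannot be `localhost` or `127.0.0.1`; it should be the hostname or IP address of a node.
+For multi-node training, `master_addr` cannot be `localhost` or `127.0.0.1`; it should be the **hostname or IP address** of a node.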

:::

```shell
# train on two nodes
-colossalai run --nproc_per_node 4 --host host1,host2 --master_addr host1 test.py
+colossalai run --nproc_per_node 4 --host host1,host2 --master_addr host1 test.py --ssh-port 22
```

