Metax GPU topo-awareness support (#574)
* support metax topology-aware scheduling

  Signed-off-by: root <[email protected]>

* fix ut

  Signed-off-by: root <[email protected]>

---------

Signed-off-by: root <[email protected]>
Co-authored-by: root <[email protected]>
1 parent 3f24a36, commit b030525
Showing 29 changed files with 629 additions and 61 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
## Introduction | ||
|
||
**We now support metax-tech.com/gpu by implementing topology-aware scheduling among Metax GPUs**: | ||
|
||
When multiple GPUs are configured on a single server, the cards are near or far from one another (higher or lower bandwidth) depending on whether they are connected | ||
under the same PCIe Switch or MetaXLink. This forms a topology among all the cards on the server, as shown in the following figure: | ||
|
||
![img](../imgs/metax_topo.jpg) | ||
|
||
A user job requests a certain number of metax-tech.com/gpu resources, and Kubernetes schedules the pod to a node with enough remaining resources. gpu-device then allocates the remaining resources on that node to the job's containers according to the following criteria: | ||
1. MetaXLink takes precedence over PCIe Switch in two ways: | ||
– When both a MetaXLink connection and a PCIe Switch connection exist between two cards, the connection is treated as a MetaXLink connection. | ||
– When both MetaXLink-interconnected and PCIe Switch-interconnected resources can meet the job request, | ||
the MetaXLink-interconnected resources are allocated. | ||
|
||
2. When using `node-scheduler-policy=spread`, Metax resources are allocated under the same MetaXLink or PCIe Switch as far as possible, as the following figure shows: | ||
|
||
![img](../imgs/metax_spread.jpg) | ||
|
||
3. When using `node-scheduler-policy=binpack`, GPU resources are assigned so as to minimize damage to the MetaXLink topology, leaving the remaining resources as intact as possible, as the following figure shows: | ||
|
||
![img](../imgs/metax_binpack.jpg) | ||
|
||
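The first criterion can be illustrated with a short Python sketch. It is not part of this commit and not HAMi's or gpu-device's actual implementation: the `Gpu` fields, `link_type`, and `pick` are hypothetical names, and the sketch assumes each card can be described by an optional MetaXLink group ID and a PCIe switch ID. It covers only the "MetaXLink first" rule, not the `spread` and `binpack` policies.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Gpu:
    index: int
    metaxlink_group: Optional[int]  # hypothetical: MetaXLink domain ID, None if not linked
    pcie_switch: int                # hypothetical: PCIe switch ID


def link_type(a: Gpu, b: Gpu) -> str:
    """Classify the connection between two cards."""
    # MetaXLink wins even when the cards also share a PCIe switch.
    if a.metaxlink_group is not None and a.metaxlink_group == b.metaxlink_group:
        return "metaxlink"
    if a.pcie_switch == b.pcie_switch:
        return "pcie_switch"
    return "far"


def pick(free: List[Gpu], count: int) -> List[Gpu]:
    """Pick `count` free cards, trying MetaXLink groups before PCIe switches."""
    for key in ("metaxlink_group", "pcie_switch"):
        groups = defaultdict(list)
        for gpu in free:
            group_id = getattr(gpu, key)
            if group_id is not None:
                groups[group_id].append(gpu)
        for group_id in sorted(groups):
            if len(groups[group_id]) >= count:
                return groups[group_id][:count]
    # No single group is large enough: fall back to any free cards.
    # Returns fewer than `count` cards if not enough are free.
    return free[:count]


gpus = [Gpu(0, 0, 0), Gpu(1, 0, 0), Gpu(2, None, 1), Gpu(3, None, 1), Gpu(4, None, 1)]
two = [g.index for g in pick(gpus, 2)]    # satisfied inside the MetaXLink group
three = [g.index for g in pick(gpus, 3)]  # only the PCIe switch group is large enough
```

In this toy layout a two-card request is served from the MetaXLink pair, while a three-card request falls through to the cards that share a PCIe switch.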
## Important Notes | ||
|
||
1. Device sharing is not supported yet. | ||
|
||
2. These features have been tested on MXC500. | ||
|
||
## Prerequisites | ||
|
||
* Metax GPU extensions >= 0.8.0 | ||
* Kubernetes >= 1.23 | ||
|
||
## Enabling topology-aware scheduling | ||
|
||
* Deploy Metax GPU Extensions on Metax nodes (please consult your device provider to acquire the package and its documentation) | ||
|
||
* Deploy HAMi according to README.md | ||
|
||
## Running Metax jobs | ||
|
||
Metax GPUs can now be requested by a container | ||
using the `metax-tech.com/gpu` resource type: | ||
|
||
``` | ||
apiVersion: v1 | ||
kind: Pod | ||
metadata: | ||
  name: gpu-pod1 | ||
  annotations: | ||
    hami.io/node-scheduler-policy: "spread" # when this parameter is set to spread, the scheduler will try to find the best topology for this task. | ||
spec: | ||
  containers: | ||
    - name: ubuntu-container | ||
      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64 | ||
      imagePullPolicy: IfNotPresent | ||
      command: ["sleep","infinity"] | ||
      resources: | ||
        limits: | ||
          metax-tech.com/gpu: 1 # requesting 1 GPU | ||
``` | ||
|
||
> **NOTICE:** *You can find more examples in the [examples/metax folder](../examples/metax/)* | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,66 @@ | ||
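## Introduction | ||
|
||
**We support optimized scheduling of Metax devices based on their topology**: | ||
|
||
When multiple GPUs are configured on a single server, the cards are near or far from one another (higher or lower bandwidth) depending on whether they are connected under the same PCIe Switch or MetaXLink. | ||
Together these relationships form a topology among all the cards on the server, as shown in the following figure. | ||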
|
||
![img](../imgs/metax_topo.jpg) | ||
|
||
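A user job requests a certain number of metax-tech.com/gpu resources. Kubernetes selects a node whose remaining resources meet the request | ||
and schedules the Pod to that node. gpu-device then handles the allocation of the remaining resources on that node | ||
and assigns GPU devices to the job's containers according to the following priorities: | ||
1. MetaXLink takes precedence over PCIe Switch, in two senses: | ||
– When both a MetaXLink connection and a PCIe Switch connection exist between two cards, the connection is treated as a MetaXLink connection. | ||
– When both the MetaXLink-interconnected and the PCIe Switch-interconnected resources among the server's remaining GPUs can meet the job request, | ||
the MetaXLink-interconnected resources are allocated. | ||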
|
||
2. When a job uses `node-scheduler-policy=spread`, the allocated GPU resources are placed under the same MetaXLink or PCIe Switch as far as possible, as the following figure shows: | ||
|
||
![img](../imgs/metax_spread.jpg) | ||
|
||
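3. When using `node-scheduler-policy=binpack`, GPU resources are allocated so that the remaining resources stay as intact as possible, as the following figure shows: | ||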
|
||
![img](../imgs/metax_binpack.jpg) | ||
|
||
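## Important Notes | ||
|
||
1. Slicing of Metax devices is not supported yet; only whole cards can be requested. | ||
|
||
2. This feature has been tested on MXC500. | ||
|
||
## Prerequisites | ||
|
||
* Metax GPU extensions >= 0.8.0 | ||
* Kubernetes >= 1.23 | ||
|
||
## Enabling topology-aware scheduling for Metax devices | ||
|
||
* Deploy Metax GPU extensions (please contact your device provider to obtain them) | ||
|
||
* Deploy HAMi according to readme.md | ||
|
||
## Running Metax jobs | ||
|
||
A typical Metax job looks like this: | ||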
|
||
``` | ||
apiVersion: v1 | ||
kind: Pod | ||
metadata: | ||
  name: gpu-pod1 | ||
  annotations: | ||
    hami.io/node-scheduler-policy: "spread" # when this parameter is set to spread, the scheduler will try to find the best topology for this task. | ||
spec: | ||
  containers: | ||
    - name: ubuntu-container | ||
      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64 | ||
      imagePullPolicy: IfNotPresent | ||
      command: ["sleep","infinity"] | ||
      resources: | ||
        limits: | ||
          metax-tech.com/gpu: 1 # requesting 1 GPU | ||
``` | ||
|
||
> **NOTICE:** *You can find more examples in the [examples/metax folder](../examples/metax/)* | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
apiVersion: v1 | ||
kind: Pod | ||
metadata: | ||
  name: gpu-pod1 | ||
  annotations: | ||
    hami.io/node-scheduler-policy: "binpack" # when this parameter is set to binpack, the scheduler will try to minimize the topology loss. | ||
spec: | ||
  containers: | ||
    - name: ubuntu-container | ||
      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64 | ||
      imagePullPolicy: IfNotPresent | ||
      command: ["sleep","infinity"] | ||
      resources: | ||
        limits: | ||
          metax-tech.com/gpu: 1 # requesting 1 GPU |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
apiVersion: v1 | ||
kind: Pod | ||
metadata: | ||
  name: gpu-pod1 | ||
spec: | ||
  containers: | ||
    - name: ubuntu-container | ||
      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64 | ||
      imagePullPolicy: IfNotPresent | ||
      command: ["sleep","infinity"] | ||
      resources: | ||
        limits: | ||
          metax-tech.com/gpu: 1 # requesting 1 GPU |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
apiVersion: v1 | ||
kind: Pod | ||
metadata: | ||
  name: gpu-pod1 | ||
  annotations: | ||
    hami.io/node-scheduler-policy: "spread" # when this parameter is set to spread, the scheduler will try to find the best topology for this task. | ||
spec: | ||
  containers: | ||
    - name: ubuntu-container | ||
      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64 | ||
      imagePullPolicy: IfNotPresent | ||
      command: ["sleep","infinity"] | ||
      resources: | ||
        limits: | ||
          metax-tech.com/gpu: 1 # requesting 1 GPU |