refactor: Update nvidia-gpu-efa pattern #1966
Merged
22 commits
5717c57 Initial EFA example with two c5n.9xlarge ec2 instances (iankouls-aws)
ff540ad Add .gitignore (iankouls-aws)
2729cd0 Add placement group (iankouls-aws)
d921bdc Working EFA example with g4dn.metal (iankouls-aws)
a7ac33d Format ToC (iankouls-aws)
2957659 Expand documentation and test end-to-end (iankouls-aws)
f98ac49 Add more test log information (iankouls-aws)
4a36716 Fix ToC bookmarks (iankouls-aws)
6be0ae4 ToC title correction (iankouls-aws)
6ae441a ToC title correction (iankouls-aws)
242429d fix bookmark link & cap (iankouls-aws)
615722f Add conclusion (iankouls-aws)
bc8feb6 log formatting (iankouls-aws)
c9e220c Satisfy pre-commit checks (iankouls-aws)
8324be1 Merge branch 'aws-ia:main' into main (iankouls-aws)
df8738c Enable full EKS control plane logging (iankouls-aws)
7125f2c Merge branch 'aws-ia:main' into main (iankouls-aws)
d35ac91 Update nvidia-gpu-efa template (iankouls-aws)
51a83c3 Replace partials with actual content (iankouls-aws)
12f8ad7 Examples moved to patterns directory (iankouls-aws)
0e541bc Modified PR to reflect feedback from reviewers (iankouls-aws)
1dd924b Implement requested changes (iankouls-aws)
@@ -0,0 +1,2 @@
efa-info-test.yaml
efa-nccl-test.yaml
@@ -0,0 +1,93 @@
#!/bin/bash

export MPI_JOB_NAME=efa-info-test
export IMAGE_URI=public.ecr.aws/hpc-cloud/nccl-tests:latest
export NUM_WORKERS=2
export GPU_PER_WORKER=8
export EFA_PER_WORKER=32
export TOTAL_GPUS=$((${NUM_WORKERS}*${GPU_PER_WORKER}))

cat <<EOF >> efa-info-test.yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: ${MPI_JOB_NAME}
spec:
  runPolicy:
    cleanPodPolicy: Running
    backoffLimit: 20
  slotsPerWorker: ${GPU_PER_WORKER}
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          restartPolicy: OnFailure
          tolerations:
          - key: "nvidia.com/gpu"
            operator: "Equal"
            value: "true"
            effect: "NoSchedule"
          containers:
          - image: ${IMAGE_URI}
            name: ${MPI_JOB_NAME}-launcher
            imagePullPolicy: IfNotPresent
            env:
            - name: LD_LIBRARY_PATH
              value: "/opt/amazon/openmpi/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:/usr/local/nvidia/lib"
            - name: PATH
              value: "/opt/amazon/efa/bin:/usr/bin"
            - name: XLA_FLAGS
              value: "--xla_gpu_cuda_data_dir=/usr/local/cuda"
            - name: TF_XLA_FLAGS
              value: "--tf_xla_cpu_global_jit"
            - name: NCCL_DEBUG
              value: INFO
            command:
            - /opt/amazon/openmpi/bin/mpirun
            - --allow-run-as-root
            - --tag-output
            - -np
            - "${TOTAL_GPUS}"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - PATH
            - -x
            - LD_LIBRARY_PATH
            - -x
            - XLA_FLAGS
            - -x
            - TF_XLA_FLAGS
            - -x
            - NCCL_DEBUG=INFO
            - --mca
            - pml
            - ^cm
            - --mca
            - pml_rsh_agent=ssh
            - --oversubscribe
            - /opt/amazon/efa/bin/fi_info
            - -p
            - "efa"
            - -t
            - "FI_EP_RDM"
    Worker:
      replicas: ${NUM_WORKERS}
      template:
        spec:
          containers:
          - image: ${IMAGE_URI}
            name: ${MPI_JOB_NAME}-worker
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                nvidia.com/gpu: ${GPU_PER_WORKER}
                vpc.amazonaws.com/efa: ${EFA_PER_WORKER}
              requests:
                nvidia.com/gpu: ${GPU_PER_WORKER}
                vpc.amazonaws.com/efa: ${EFA_PER_WORKER}
EOF
These are all known values, so let's just make it a static YAML file that users can apply.
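
To illustrate the point, here is a minimal sketch (not the change that was merged) of how the script's variables resolve to fixed values. The variable names and numbers are taken from the script in this diff; the printed field labels are only for illustration.

```shell
#!/bin/bash
# Sketch only: resolve the generator script's variables once, so the
# resulting constants could be written straight into a static manifest.
NUM_WORKERS=2
GPU_PER_WORKER=8
EFA_PER_WORKER=32
TOTAL_GPUS=$((NUM_WORKERS * GPU_PER_WORKER))

# Each templated field in the MPIJob reduces to a constant.
echo "Worker replicas:        ${NUM_WORKERS}"
echo "slotsPerWorker:         ${GPU_PER_WORKER}"
echo "mpirun -np:             ${TOTAL_GPUS}"
echo "vpc.amazonaws.com/efa:  ${EFA_PER_WORKER}"
```

With these constants inlined, the exports and the heredoc would no longer be needed, and users could run `kubectl apply -f efa-info-test.yaml` against the checked-in file directly.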