diff --git a/docs/model-dev-guide/api-guides/apis-howto/_index.rst b/docs/model-dev-guide/api-guides/apis-howto/_index.rst
index d8cb117a75ef..cffd6b2f00bb 100644
--- a/docs/model-dev-guide/api-guides/apis-howto/_index.rst
+++ b/docs/model-dev-guide/api-guides/apis-howto/_index.rst
@@ -56,26 +56,10 @@ Prefer to use an Example Model?
 If you'd like to build off of an existing model that already runs on Determined, visit our
 :ref:`example-solutions` to see if the model you'd like to train is already available.
 
-******************
- AMD ROCm Support
-******************
+ROCm Support
+============
 
-.. _rocm-support:
-
-Determined provides experimental support for AMD ROCm GPUs in Kubernetes deployments. Determined
-provides prebuilt Docker images for ROCm, including the latest ROCm 6.1 version with DeepSpeed
-support for MI300x users:
-
-- `pytorch-infinityhub-dev
-  `__
-- `pytorch-infinityhub-hpc-dev
-  `__
-
-You can build these images locally based on the Dockerfiles found in the `environments repository
-`__.
-
-For more detailed information about configuration, visit the :ref:`helm-config-reference` or visit
-:ref:`rocm-known-issues` for details on current limitations and troubleshooting.
+For AMD ROCm support, visit :ref:`rocm-support`
 
 .. toctree::
    :caption: Training APIs
diff --git a/docs/model-dev-guide/create-experiment.rst b/docs/model-dev-guide/create-experiment.rst
index 31c97b521bbd..dda260200774 100644
--- a/docs/model-dev-guide/create-experiment.rst
+++ b/docs/model-dev-guide/create-experiment.rst
@@ -25,6 +25,9 @@ Launcher options include:
 
 - A command with arguments, run in the container
 
+If you're using AMD ROCm GPUs, make sure to specify ``slot_type: rocm`` in your experiment
+configuration. For more information on ROCm support, see :ref:`AMD ROCm Support `.
+
 For distributed training, separate the launcher that starts distributed workers from your training
 script, which typically runs each worker. The distributed training launcher must:
 
diff --git a/docs/model-dev-guide/prepare-container/set-environment-images.rst b/docs/model-dev-guide/prepare-container/set-environment-images.rst
index fefeba804fc3..88042c43b3b2 100644
--- a/docs/model-dev-guide/prepare-container/set-environment-images.rst
+++ b/docs/model-dev-guide/prepare-container/set-environment-images.rst
@@ -28,3 +28,6 @@ preparation needed.
 
 If you need to add additional customization to the training environment, review the
 :ref:`custom-env` page.
+
+For details on using ROCm-enabled images, including our ROCm 6.1 images with DeepSpeed support for
+MI300x users, see our :ref:`AMD ROCm Support documentation `.
diff --git a/docs/setup-cluster/_index.rst b/docs/setup-cluster/_index.rst
index fd73ca3bed3e..5f2404e6b7a4 100644
--- a/docs/setup-cluster/_index.rst
+++ b/docs/setup-cluster/_index.rst
@@ -97,3 +97,4 @@ Enable Determined to submit jobs to a Slurm cluster.
    Deploy on Kubernetes
    Deploy on Slurm/PBS
    Cluster Configuration
+   ROCm Support
diff --git a/docs/setup-cluster/k8s/helm-commands.rst b/docs/setup-cluster/k8s/helm-commands.rst
index 3f0d52a88e96..1497d6f616b0 100644
--- a/docs/setup-cluster/k8s/helm-commands.rst
+++ b/docs/setup-cluster/k8s/helm-commands.rst
@@ -35,6 +35,14 @@ To list the current installation of Determined on the Kubernetes cluster:
 
 It is recommended to have just one instance of Determined per Kubernetes cluster.
 
+****************************
+ AMD ROCm GPU Configuration
+****************************
+
+For specific configuration details related to AMD ROCm GPUs, including how to set up resource pools
+and configure experiments, see our :ref:`guide on Configuring Kubernetes for ROCm GPUs
+`.
+
 **************************************
  Get the Determined Master IP Address
 **************************************
diff --git a/docs/setup-cluster/k8s/k8s-dev-guide.rst b/docs/setup-cluster/k8s/k8s-dev-guide.rst
index 3787581a73b2..64ea1432587d 100644
--- a/docs/setup-cluster/k8s/k8s-dev-guide.rst
+++ b/docs/setup-cluster/k8s/k8s-dev-guide.rst
@@ -13,6 +13,11 @@ Kubernetes version >= 1.19 and <= 1.21. Later versions of Kubernetes may also wo
 Kubernetes manually, or you can use a managed Kubernetes service such as :ref:`GKE
 ` or :ref:`EKS `.
 
+.. note::
+
+   For information on using AMD ROCm GPUs with Determined on Kubernetes, please refer to our
+   :ref:`ROCm Support Guide `.
+
 **********************************
  Set up a Development Environment
 **********************************
diff --git a/docs/setup-cluster/rocm-support.rst b/docs/setup-cluster/rocm-support.rst
new file mode 100644
index 000000000000..13ff62a9f22c
--- /dev/null
+++ b/docs/setup-cluster/rocm-support.rst
@@ -0,0 +1,90 @@
+.. _rocm-support:
+
+##################
+ AMD ROCm Support
+##################
+
+.. contents:: Table of Contents
+   :local:
+   :depth: 2
+
+**********
+ Overview
+**********
+
+.. note::
+   ROCm support in Determined is experimental. Features and configurations may change in future releases. We recommend testing thoroughly in a non-production environment before deploying to production.
+
+Determined provides experimental support for AMD ROCm GPUs in Kubernetes deployments. Determined
+provides prebuilt Docker images for ROCm, including the latest ROCm 6.1 version with DeepSpeed
+support for MI300x users:
+
+- `pytorch-infinityhub-dev
+  `__
+- `pytorch-infinityhub-hpc-dev
+  `__
+
+You can build these images locally based on the Dockerfiles found in the `environments repository
+`__.
+
+For more detailed information about configuration, visit the :ref:`helm-config-reference` or visit
+:ref:`rocm-known-issues` for details on current limitations and troubleshooting.
+
+.. _rocm-config-k8s:
+
+**************************************
+ Configuring Kubernetes for ROCm GPUs
+**************************************
+
+To use ROCm GPUs in your Kubernetes deployment:
+
+1. Ensure your Kubernetes cluster has nodes with ROCm-capable GPUs and the necessary drivers installed.
+
+2. In your Helm chart values or Determined configuration, set the following:
+
+   .. code-block:: yaml
+
+      resourceManager:
+        defaultComputeResourcePool: rocm-pool
+
+      resourcePools:
+        - pool_name: rocm-pool
+          gpu_type: rocm
+          max_slots:
+
+3. When submitting experiments or launching tasks, specify ``slot_type: rocm`` in your experiment configuration.
+
+*********************************
+ Using ROCm Images in Experiments
+*********************************
+
+To use ROCm images in your experiments, specify the image in your experiment configuration:
+
+.. code-block:: yaml
+
+   environment:
+     image: determinedai/pytorch-infinityhub-dev:rocm6.1-pytorch2.1-deepspeed0.10.0
+
+Ensure that your experiment configuration also specifies ``slot_type: rocm`` to use ROCm GPUs.
+
+.. _rocm-known-issues:
+
+******************************
+ Known Issues and Limitations
+******************************
+
+- **Agent Deprecation**: Agent-based deployments are deprecated for ROCm support. Use Kubernetes with ROCm support for your deployments.
+
+- **HIP GPU Errors**: Launching experiments with ``slot_type: rocm`` may fail with the error
+  ``RuntimeError: No HIP GPUs are available``. Ensure compute nodes have compatible ROCm drivers and
+  libraries installed and available in default locations or added to the ``PATH`` and/or ``LD_LIBRARY_PATH``.
+
+- **Boost Filesystem Errors**: You may encounter the error ``boost::filesystem::remove: Directory
+  not empty`` during ROCm operations. A workaround is to disable per-container ``/tmp`` using bind mounts
+  in your experiment configuration or globally using the ``task_container_defaults`` section in your master configuration:
+
+  .. code:: yaml
+
+     bind_mounts:
+        - host_path: /tmp
+          container_path: /tmp
diff --git a/docs/setup-cluster/slurm/slurm-known-issues.rst b/docs/setup-cluster/slurm/slurm-known-issues.rst
index 6abb431ee55b..d85127fc3c78 100644
--- a/docs/setup-cluster/slurm/slurm-known-issues.rst
+++ b/docs/setup-cluster/slurm/slurm-known-issues.rst
@@ -391,37 +391,11 @@ Some constraints are due to differences in behavior between Docker and Singulari
    reported as completed with no error message reported. Refer to :ref:`PBS Requirements
    `.
 
-.. _rocm-known-issues:
-
 ***********************
  AMD/ROCm Known Issues
 ***********************
 
-Determined provides experimental support for :ref:`AMD ROCm GPUs ` in Kubernetes
-deployments.
-
-Deprecations and known issues:
-
-- **Agent Deprecation**: Agent-based deployments have been deprecated. Ensure that you are using
-  Kubernetes with ROCm support for your deployments.
-
-- **HIP GPU Errors**: Launching experiments with ``slot_type: rocm`` may fail with the error
-  ``RuntimeError: No HIP GPUs are available``. Ensure that the compute nodes have ROCm drivers and
-  libraries compatible with the environment image in use. These should be available in default
-  locations or added to the ``PATH`` and/or ``LD_LIBRARY_PATH`` variables in the :ref:`slurm
-  configuration `. Depending on your system setup, you may need to
-  select a different ROCm image. See :ref:`set-environment-images` for the available images.
-
-- **Boost Filesystem Errors**: You may encounter the error ``boost::filesystem::remove: Directory
-  not empty`` during ROCm operations. A potential workaround is to disable the per-container
-  ``/tmp`` by adding a bind mount in your experiment configuration or globally using the
-  ``task_container_defaults`` section in your master configuration:
-
-  .. code:: yaml
-
-     bind_mounts:
-        - host_path: /tmp
-          container_path: /tmp
+For AMD/ROCm support and known issues, visit :ref:`AMD ROCm GPUs `.
 
 ***************************************
  Determined AI Experiment Requirements