docs: Update ROCM support #9893

Merged Sep 10, 2024 (4 commits)
17 changes: 3 additions & 14 deletions docs/model-dev-guide/api-guides/apis-howto/_index.rst
@@ -56,21 +56,10 @@ Prefer to use an Example Model?
If you'd like to build off of an existing model that already runs on Determined, visit our
:ref:`example-solutions` to see if the model you'd like to train is already available.

******************
AMD ROCm Support
******************
ROCm Support
============

.. _rocm-support:

Determined has experimental support for ROCm. Determined provides a prebuilt Docker image that
includes ROCm 5.0, PyTorch 1.10 and TensorFlow 2.7:

- ``determinedai/environments:rocm-5.0-pytorch-1.10-tf-2.7-rocm-0.26.4``

Known limitations:

- Only agent-based deployments are available; Kubernetes is not yet supported.
- GPU profiling is not yet supported.
For AMD ROCm support, visit :ref:`rocm-support`.

.. toctree::
:caption: Training APIs
3 changes: 3 additions & 0 deletions docs/model-dev-guide/create-experiment.rst
@@ -25,6 +25,9 @@ Launcher options include:

- A command with arguments, run in the container

If you're using AMD ROCm GPUs, make sure to specify ``slot_type: rocm`` in your experiment
configuration. For more information on ROCm support, see :ref:`AMD ROCm Support <rocm-support>`.
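
A minimal sketch of where ``slot_type`` sits in an experiment configuration (the
``slots_per_trial`` value here is illustrative, not a default):

.. code:: yaml

   resources:
     slot_type: rocm
     slots_per_trial: 1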

For distributed training, separate the launcher that starts distributed workers from your training
script, which typically runs each worker. The distributed training launcher must:

@@ -28,3 +28,6 @@ preparation needed.

If you need to add additional customization to the training environment, review the
:ref:`custom-env` page.

For details on using ROCm-enabled images, including our ROCm 6.1 images with DeepSpeed support for
MI300x users, see our :ref:`AMD ROCm Support documentation <rocm-support>`.
1 change: 1 addition & 0 deletions docs/setup-cluster/_index.rst
@@ -97,3 +97,4 @@ Enable Determined to submit jobs to a Slurm cluster.
Deploy on Kubernetes <k8s/_index>
Deploy on Slurm/PBS <slurm/_index>
Cluster Configuration <cluster-configuration>
ROCm Support <rocm-support>
8 changes: 8 additions & 0 deletions docs/setup-cluster/k8s/helm-commands.rst
@@ -35,6 +35,14 @@ To list the current installation of Determined on the Kubernetes cluster:

It is recommended to have just one instance of Determined per Kubernetes cluster.

****************************
AMD ROCm GPU Configuration
****************************

For specific configuration details related to AMD ROCm GPUs, including how to set up resource pools
and configure experiments, see our :ref:`guide on Configuring Kubernetes for ROCm GPUs
<rocm-config-k8s>`.

**************************************
Get the Determined Master IP Address
**************************************
5 changes: 5 additions & 0 deletions docs/setup-cluster/k8s/k8s-dev-guide.rst
@@ -13,6 +13,11 @@ Kubernetes version >= 1.19 and <= 1.21. Later versions of Kubernetes may also wo
Kubernetes manually, or you can use a managed Kubernetes service such as :ref:`GKE
<setup-gke-cluster>` or :ref:`EKS <setup-eks-cluster>`.

.. note::

For information on using AMD ROCm GPUs with Determined on Kubernetes, please refer to our
:ref:`ROCm Support Guide <rocm-support>`.

**********************************
Set up a Development Environment
**********************************
90 changes: 90 additions & 0 deletions docs/setup-cluster/rocm-support.rst
@@ -0,0 +1,90 @@
.. _rocm-support:

##################
AMD ROCm Support
##################

.. contents:: Table of Contents
:local:
:depth: 2

**********
Overview
**********

.. note::
   ROCm support in Determined is experimental. Features and configurations may change in future
   releases. We recommend testing thoroughly in a non-production environment before deploying to
   production.

Determined provides experimental support for AMD ROCm GPUs in Kubernetes deployments. Determined
provides prebuilt Docker images for ROCm, including the latest ROCm 6.1 version with DeepSpeed
support for MI300x users:

- `pytorch-infinityhub-dev
<https://hub.docker.com/repository/docker/determinedai/pytorch-infinityhub-dev/tags>`__
- `pytorch-infinityhub-hpc-dev
<https://hub.docker.com/repository/docker/determinedai/pytorch-infinityhub-hpc-dev/tags>`__

You can build these images locally based on the Dockerfiles found in the `environments repository
<https://github.com/determined-ai/environments/blob/main/Dockerfile-infinityhub-pytorch>`__.

For more detailed configuration information, see the :ref:`helm-config-reference`. For current
limitations and troubleshooting, see :ref:`rocm-known-issues`.

.. _rocm-config-k8s:

**************************************
Configuring Kubernetes for ROCm GPUs
**************************************

To use ROCm GPUs in your Kubernetes deployment:

1. Ensure your Kubernetes cluster has nodes with ROCm-capable GPUs and the necessary drivers installed.

2. In your Helm chart values or Determined configuration, set the following:

.. code-block:: yaml

resourceManager:
defaultComputeResourcePool: rocm-pool

resourcePools:
- pool_name: rocm-pool
gpu_type: rocm
max_slots: <number_of_rocm_gpus>

3. When submitting experiments or launching tasks, specify ``slot_type: rocm`` in your experiment configuration.

*********************************
Using ROCm Images in Experiments
*********************************

To use ROCm images in your experiments, specify the image in your experiment configuration:

.. code-block:: yaml

environment:
image: determinedai/pytorch-infinityhub-dev:rocm6.1-pytorch2.1-deepspeed0.10.0

Ensure that your experiment configuration also specifies ``slot_type: rocm`` to use ROCm GPUs.
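
Taken together, a minimal ROCm experiment configuration might look like the following sketch (the
image tag is one of the examples above; ``slots_per_trial`` is illustrative):

.. code:: yaml

   environment:
     image: determinedai/pytorch-infinityhub-dev:rocm6.1-pytorch2.1-deepspeed0.10.0
   resources:
     slot_type: rocm
     slots_per_trial: 1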

.. _rocm-known-issues:

******************************
Known Issues and Limitations
******************************

- **Agent Deprecation**: Agent-based deployments are deprecated for ROCm support. Use Kubernetes with ROCm support for your deployments.

- **HIP GPU Errors**: Launching experiments with ``slot_type: rocm`` may fail with the error
  ``RuntimeError: No HIP GPUs are available``. Ensure compute nodes have compatible ROCm drivers
  and libraries installed, either in their default locations or added to ``PATH`` and/or
  ``LD_LIBRARY_PATH``.

- **Boost Filesystem Errors**: You may encounter the error ``boost::filesystem::remove: Directory
  not empty`` during ROCm operations. A workaround is to disable the per-container ``/tmp`` by
  adding the following bind mount in your experiment configuration, or globally using the
  ``task_container_defaults`` section in your master configuration:

  .. code:: yaml

     bind_mounts:
       - host_path: /tmp
         container_path: /tmp
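
For the ``No HIP GPUs are available`` error above, one quick sanity check is to confirm that the
ROCm install prefix is on the container's search paths. A minimal sketch, assuming the common
default prefix ``/opt/rocm`` (``ROCM_PATH`` is a conventional variable name here, not one
Determined sets):

```shell
# Fall back to the typical ROCm install prefix if ROCM_PATH is unset.
ROCM_PATH="${ROCM_PATH:-/opt/rocm}"
# Put ROCm binaries and libraries on the search paths.
export PATH="${ROCM_PATH}/bin:${PATH}"
export LD_LIBRARY_PATH="${ROCM_PATH}/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
# Inspect the result; on a working node, rocm-smi should now resolve.
echo "${LD_LIBRARY_PATH}"
```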
24 changes: 1 addition & 23 deletions docs/setup-cluster/slurm/slurm-known-issues.rst
@@ -395,29 +395,7 @@ Some constraints are due to differences in behavior between Docker and Singulari
AMD/ROCm Known Issues
***********************

- AMD/ROCm support is available only with Singularity containers. While Determined does add the
proper Podman arguments to enable ROCm GPU support, the capabilities have not yet been verified.

- Launching experiments with ``slot_type: rocm``, may fail with the error ``RuntimeError: No HIP
GPUs are available``. Ensure that the compute nodes are providing ROCm drivers and libraries
compatible with the environment image that you are using and that they are available in the
default locations, or are added to the ``path`` and/or ``ld_library_path`` variables in the
:ref:`slurm configuration <cluster-configuration-slurm>`. Depending upon your system
configuration, you may need to select a different ROCm image. See :ref:`set-environment-images`
for the images available.

- Launching experiments with ``slot_type: rocm``, may fail in the AMD/ROCm libraries with with the
error ``terminate called after throwing an instance of 'boost::filesystem::filesystem_error'
what(): boost::filesystem::remove: Directory not empty: "/tmp/miopen-...``. A potential
workaround is to disable the per-container ``/tmp`` by adding the following :ref:`bind mount
<exp-bind-mounts>` in your experiment configuration or globally by using the
``task_container_defaults`` section in your master configuration:

.. code:: yaml

bind_mounts:
- host_path: /tmp
container_path: /tmp
For AMD/ROCm support and known issues, visit :ref:`AMD ROCm GPUs <rocm-support>`.

***************************************
Determined AI Experiment Requirements