reorganize pages add a new ROCM support page
tara-det-ai committed Sep 5, 2024
1 parent 50715f9 commit 87e3862
Showing 8 changed files with 114 additions and 46 deletions.
22 changes: 3 additions & 19 deletions docs/model-dev-guide/api-guides/apis-howto/_index.rst
@@ -56,26 +56,10 @@ Prefer to use an Example Model?
If you'd like to build off of an existing model that already runs on Determined, visit our
:ref:`example-solutions` to see if the model you'd like to train is already available.

******************
AMD ROCm Support
******************
ROCm Support
============

.. _rocm-support:

Determined provides experimental support for AMD ROCm GPUs in Kubernetes deployments. Determined
provides prebuilt Docker images for ROCm, including the latest ROCm 6.1 version with DeepSpeed
support for MI300x users:

- `pytorch-infinityhub-dev
<https://hub.docker.com/repository/docker/determinedai/pytorch-infinityhub-dev/tags>`__
- `pytorch-infinityhub-hpc-dev
<https://hub.docker.com/repository/docker/determinedai/pytorch-infinityhub-hpc-dev/tags>`__

You can build these images locally based on the Dockerfiles found in the `environments repository
<https://github.com/determined-ai/environments/blob/main/Dockerfile-infinityhub-pytorch>`__.

For more detailed information about configuration, visit the :ref:`helm-config-reference` or visit
:ref:`rocm-known-issues` for details on current limitations and troubleshooting.
For AMD ROCm support, visit :ref:`rocm-support`.

.. toctree::
:caption: Training APIs
3 changes: 3 additions & 0 deletions docs/model-dev-guide/create-experiment.rst
@@ -25,6 +25,9 @@ Launcher options include:

- A command with arguments, run in the container

If you're using AMD ROCm GPUs, make sure to specify ``slot_type: rocm`` in your experiment
configuration. For more information on ROCm support, see :ref:`AMD ROCm Support <rocm-support>`.

For distributed training, separate the launcher that starts distributed workers from your training
script, which typically runs each worker. The distributed training launcher must:

@@ -28,3 +28,6 @@ preparation needed.

If you need to add additional customization to the training environment, review the
:ref:`custom-env` page.

For details on using ROCm-enabled images, including our ROCm 6.1 images with DeepSpeed support for
MI300x users, see our :ref:`AMD ROCm Support documentation <rocm-support>`.
1 change: 1 addition & 0 deletions docs/setup-cluster/_index.rst
@@ -97,3 +97,4 @@ Enable Determined to submit jobs to a Slurm cluster.
Deploy on Kubernetes <k8s/_index>
Deploy on Slurm/PBS <slurm/_index>
Cluster Configuration <cluster-configuration>
ROCm Support <rocm-support>
8 changes: 8 additions & 0 deletions docs/setup-cluster/k8s/helm-commands.rst
@@ -35,6 +35,14 @@ To list the current installation of Determined on the Kubernetes cluster:
It is recommended to have just one instance of Determined per Kubernetes cluster.

****************************
AMD ROCm GPU Configuration
****************************

For specific configuration details related to AMD ROCm GPUs, including how to set up resource pools
and configure experiments, see our :ref:`guide on Configuring Kubernetes for ROCm GPUs
<rocm-config-k8s>`.

**************************************
Get the Determined Master IP Address
**************************************
5 changes: 5 additions & 0 deletions docs/setup-cluster/k8s/k8s-dev-guide.rst
@@ -13,6 +13,11 @@ Kubernetes version >= 1.19 and <= 1.21. Later versions of Kubernetes may also wo
Kubernetes manually, or you can use a managed Kubernetes service such as :ref:`GKE
<setup-gke-cluster>` or :ref:`EKS <setup-eks-cluster>`.

.. note::

   For information on using AMD ROCm GPUs with Determined on Kubernetes, please refer to our
   :ref:`ROCm Support Guide <rocm-support>`.

**********************************
Set up a Development Environment
**********************************
90 changes: 90 additions & 0 deletions docs/setup-cluster/rocm-support.rst
@@ -0,0 +1,90 @@
.. _rocm-support:

##################
AMD ROCm Support
##################

.. contents:: Table of Contents
   :local:
   :depth: 2

**********
Overview
**********

.. note::

   ROCm support in Determined is experimental. Features and configurations may change in future
   releases. We recommend testing thoroughly in a non-production environment before deploying to
   production.

Determined provides experimental support for AMD ROCm GPUs in Kubernetes deployments. Determined
provides prebuilt Docker images for ROCm, including the latest ROCm 6.1 version with DeepSpeed
support for MI300x users:

- `pytorch-infinityhub-dev
<https://hub.docker.com/repository/docker/determinedai/pytorch-infinityhub-dev/tags>`__
- `pytorch-infinityhub-hpc-dev
<https://hub.docker.com/repository/docker/determinedai/pytorch-infinityhub-hpc-dev/tags>`__

You can build these images locally based on the Dockerfiles found in the `environments repository
<https://github.com/determined-ai/environments/blob/main/Dockerfile-infinityhub-pytorch>`__.

For configuration details, see the :ref:`helm-config-reference`. For current limitations and
troubleshooting, see :ref:`rocm-known-issues`.

.. _rocm-config-k8s:

**************************************
Configuring Kubernetes for ROCm GPUs
**************************************

To use ROCm GPUs in your Kubernetes deployment:

1. Ensure your Kubernetes cluster has nodes with ROCm-capable GPUs and the necessary drivers installed.

2. In your Helm chart values or Determined configuration, set the following:

   .. code-block:: yaml

      resourceManager:
        defaultComputeResourcePool: rocm-pool
      resourcePools:
        - pool_name: rocm-pool
          gpu_type: rocm
          max_slots: <number_of_rocm_gpus>

3. When submitting experiments or launching tasks, specify ``slot_type: rocm`` in your experiment configuration.

*********************************
Using ROCm Images in Experiments
*********************************

To use ROCm images in your experiments, specify the image in your experiment configuration:

.. code-block:: yaml

   environment:
     image: determinedai/pytorch-infinityhub-dev:rocm6.1-pytorch2.1-deepspeed0.10.0

Ensure that your experiment configuration also specifies ``slot_type: rocm`` to use ROCm GPUs.
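
As a minimal sketch, the image and slot type settings might be combined in one experiment
configuration as follows. The image tag is the one from the example above. The experiment name,
the ``rocm-pool`` pool name (borrowed from the Kubernetes example), and the placement of
``slot_type`` under ``resources`` are assumptions for illustration; verify the exact key
locations against the experiment configuration reference for your Determined version.

.. code-block:: yaml

   name: rocm-example            # hypothetical experiment name
   environment:
     image: determinedai/pytorch-infinityhub-dev:rocm6.1-pytorch2.1-deepspeed0.10.0
   resources:
     resource_pool: rocm-pool    # assumes the pool from the Kubernetes example
     slot_type: rocm             # placement under ``resources`` is an assumption
     slots_per_trial: 1          # illustrative value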

.. _rocm-known-issues:

******************************
Known Issues and Limitations
******************************

- **Agent Deprecation**: Agent-based deployments are deprecated for ROCm support. Use Kubernetes with ROCm support for your deployments.

- **HIP GPU Errors**: Launching experiments with ``slot_type: rocm`` may fail with the error
  ``RuntimeError: No HIP GPUs are available``. Ensure compute nodes have compatible ROCm drivers and
  libraries installed and available in default locations or added to the ``PATH`` and/or ``LD_LIBRARY_PATH``.

- **Boost Filesystem Errors**: You may encounter the error ``boost::filesystem::remove: Directory
  not empty`` during ROCm operations. A workaround is to disable per-container ``/tmp`` using bind mounts
  in your experiment configuration or globally using the ``task_container_defaults`` section in your master configuration:

  .. code:: yaml

     bind_mounts:
       - host_path: /tmp
         container_path: /tmp
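
For the global variant of the Boost filesystem workaround, the same bind mount would sit under
``task_container_defaults`` in the master configuration. The nesting below is a sketch based on
the description above, not a verified configuration; check it against the master configuration
reference for your version before applying it cluster-wide.

.. code-block:: yaml

   task_container_defaults:
     bind_mounts:
       # Share the host /tmp with every task container instead of a per-container /tmp.
       - host_path: /tmp
         container_path: /tmp

Because this setting applies to every task on the cluster, the per-experiment form may be the
safer choice while ROCm support remains experimental.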
28 changes: 1 addition & 27 deletions docs/setup-cluster/slurm/slurm-known-issues.rst
@@ -391,37 +391,11 @@ Some constraints are due to differences in behavior between Docker and Singulari
reported as completed with no error message reported. Refer to :ref:`PBS Requirements
<pbs-config-requirements>`.

.. _rocm-known-issues:

***********************
AMD/ROCm Known Issues
***********************

Determined provides experimental support for :ref:`AMD ROCm GPUs <rocm-support>` in Kubernetes
deployments.

Deprecations and known issues:

- **Agent Deprecation**: Agent-based deployments have been deprecated. Ensure that you are using
Kubernetes with ROCm support for your deployments.

- **HIP GPU Errors**: Launching experiments with ``slot_type: rocm`` may fail with the error
``RuntimeError: No HIP GPUs are available``. Ensure that the compute nodes have ROCm drivers and
libraries compatible with the environment image in use. These should be available in default
locations or added to the ``PATH`` and/or ``LD_LIBRARY_PATH`` variables in the :ref:`slurm
configuration <cluster-configuration-slurm>`. Depending on your system setup, you may need to
select a different ROCm image. See :ref:`set-environment-images` for the available images.

- **Boost Filesystem Errors**: You may encounter the error ``boost::filesystem::remove: Directory
not empty`` during ROCm operations. A potential workaround is to disable the per-container
``/tmp`` by adding a bind mount in your experiment configuration or globally using the
``task_container_defaults`` section in your master configuration:

  .. code:: yaml

     bind_mounts:
       - host_path: /tmp
         container_path: /tmp

For AMD/ROCm support and known issues, visit :ref:`AMD ROCm GPUs <rocm-support>`.

***************************************
Determined AI Experiment Requirements
