docs: Update ROCM support #9893

Merged Sep 10, 2024 (4 commits)
17 changes: 3 additions & 14 deletions docs/model-dev-guide/api-guides/apis-howto/_index.rst
@@ -56,21 +56,10 @@ Prefer to use an Example Model?
If you'd like to build off of an existing model that already runs on Determined, visit our
:ref:`example-solutions` to see if the model you'd like to train is already available.

******************
AMD ROCm Support
******************
ROCm Support
============

.. _rocm-support:

Determined has experimental support for ROCm. Determined provides a prebuilt Docker image that
includes ROCm 5.0, PyTorch 1.10 and TensorFlow 2.7:

- ``determinedai/environments:rocm-5.0-pytorch-1.10-tf-2.7-rocm-0.26.4``

Known limitations:

- Only agent-based deployments are available; Kubernetes is not yet supported.
- GPU profiling is not yet supported.
For AMD ROCm support, visit :ref:`rocm-support`.

.. toctree::
:caption: Training APIs
3 changes: 3 additions & 0 deletions docs/model-dev-guide/create-experiment.rst
@@ -25,6 +25,9 @@ Launcher options include:

- A command with arguments, run in the container

If you're using AMD ROCm GPUs, make sure to specify ``slot_type: rocm`` in your experiment
configuration. For more information on ROCm support, see :ref:`AMD ROCm Support <rocm-support>`.
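
A minimal sketch of where ``slot_type`` sits in an experiment configuration (the
``slots_per_trial`` value here is illustrative, not a default):

.. code:: yaml

   resources:
     slot_type: rocm
     slots_per_trial: 1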

For distributed training, separate the launcher that starts distributed workers from your training
script, which typically runs each worker. The distributed training launcher must:

@@ -28,3 +28,6 @@ preparation needed.

If you need to add additional customization to the training environment, review the
:ref:`custom-env` page.

For details on using ROCm-enabled images, including our ROCm 6.1 images with DeepSpeed support for
MI300x users, see our :ref:`AMD ROCm Support documentation <rocm-support>`.
1 change: 1 addition & 0 deletions docs/setup-cluster/_index.rst
@@ -97,3 +97,4 @@ Enable Determined to submit jobs to a Slurm cluster.
Deploy on Kubernetes <k8s/_index>
Deploy on Slurm/PBS <slurm/_index>
Cluster Configuration <cluster-configuration>
ROCm Support <rocm-support>
8 changes: 8 additions & 0 deletions docs/setup-cluster/k8s/helm-commands.rst
@@ -35,6 +35,14 @@ To list the current installation of Determined on the Kubernetes cluster:

It is recommended to have just one instance of Determined per Kubernetes cluster.

****************************
AMD ROCm GPU Configuration
****************************

For specific configuration details related to AMD ROCm GPUs, including how to set up resource pools
and configure experiments, see our :ref:`guide on Configuring Kubernetes for ROCm GPUs
<rocm-config-k8s>`.

**************************************
Get the Determined Master IP Address
**************************************
5 changes: 5 additions & 0 deletions docs/setup-cluster/k8s/k8s-dev-guide.rst
@@ -13,6 +13,11 @@ Kubernetes version >= 1.19 and <= 1.21. Later versions of Kubernetes may also wo
Kubernetes manually, or you can use a managed Kubernetes service such as :ref:`GKE
<setup-gke-cluster>` or :ref:`EKS <setup-eks-cluster>`.

.. note::

For information on using AMD ROCm GPUs with Determined on Kubernetes, please refer to our
:ref:`ROCm Support Guide <rocm-support>`.

**********************************
Set up a Development Environment
**********************************
90 changes: 90 additions & 0 deletions docs/setup-cluster/rocm-support.rst
@@ -0,0 +1,90 @@
.. _rocm-support:

##################
AMD ROCm Support
##################

.. contents:: Table of Contents
:local:
:depth: 2

**********
Overview
**********

.. note::
   ROCm support in Determined is experimental. Features and configurations may change in future
   releases. We recommend testing thoroughly in a non-production environment before deploying to
   production.

Determined provides experimental support for AMD ROCm GPUs in Kubernetes deployments. Determined
provides prebuilt Docker images for ROCm, including the latest ROCm 6.1 version with DeepSpeed
support for MI300x users:

- `pytorch-infinityhub-dev
<https://hub.docker.com/repository/docker/determinedai/pytorch-infinityhub-dev/tags>`__
- `pytorch-infinityhub-hpc-dev
<https://hub.docker.com/repository/docker/determinedai/pytorch-infinityhub-hpc-dev/tags>`__

You can build these images locally based on the Dockerfiles found in the `environments repository
<https://github.com/determined-ai/environments/blob/main/Dockerfile-infinityhub-pytorch>`__.

For more detailed configuration information, see the :ref:`helm-config-reference`. For current
limitations and troubleshooting, see :ref:`rocm-known-issues`.

.. _rocm-config-k8s:

**************************************
Configuring Kubernetes for ROCm GPUs
**************************************

To use ROCm GPUs in your Kubernetes deployment:

1. Ensure your Kubernetes cluster has nodes with ROCm-capable GPUs and the necessary drivers installed.

2. In your Helm chart values or Determined configuration, set the following:

.. code-block:: yaml

resourceManager:
defaultComputeResourcePool: rocm-pool

resourcePools:
- pool_name: rocm-pool
gpu_type: rocm
max_slots: <number_of_rocm_gpus>

3. When submitting experiments or launching tasks, specify ``slot_type: rocm`` in your experiment configuration.

*********************************
Using ROCm Images in Experiments
*********************************

To use ROCm images in your experiments, specify the image in your experiment configuration:

.. code-block:: yaml

environment:
image: determinedai/pytorch-infinityhub-dev:rocm6.1-pytorch2.1-deepspeed0.10.0

Ensure that your experiment configuration also specifies ``slot_type: rocm`` to use ROCm GPUs.
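
Taken together, a minimal ROCm experiment configuration might look like the following sketch (the
image tag is one of the examples above; ``slots_per_trial`` is illustrative):

.. code:: yaml

   environment:
     image: determinedai/pytorch-infinityhub-dev:rocm6.1-pytorch2.1-deepspeed0.10.0
   resources:
     slot_type: rocm
     slots_per_trial: 1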

.. _rocm-known-issues:

******************************
Known Issues and Limitations
******************************

- **Agent Deprecation**: Agent-based deployments are deprecated for ROCm support. Use Kubernetes with ROCm support for your deployments.

- **HIP GPU Errors**: Launching experiments with ``slot_type: rocm`` may fail with the error
  ``RuntimeError: No HIP GPUs are available``. Ensure compute nodes have compatible ROCm drivers
  and libraries installed, either in their default locations or added to ``PATH`` and/or
  ``LD_LIBRARY_PATH``.

- **Boost Filesystem Errors**: You may encounter the error ``boost::filesystem::remove: Directory
  not empty`` during ROCm operations. A workaround is to disable the per-container ``/tmp`` by
  adding the following bind mount in your experiment configuration, or globally using the
  ``task_container_defaults`` section in your master configuration:

  .. code:: yaml

     bind_mounts:
       - host_path: /tmp
         container_path: /tmp
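
For the ``No HIP GPUs are available`` error above, one quick sanity check is to confirm that the
ROCm install prefix is on the container's search paths. A minimal sketch, assuming the common
default prefix ``/opt/rocm`` (``ROCM_PATH`` is a conventional variable name here, not one
Determined sets):

```shell
# Fall back to the typical ROCm install prefix if ROCM_PATH is unset.
ROCM_PATH="${ROCM_PATH:-/opt/rocm}"
# Put ROCm binaries and libraries on the search paths.
export PATH="${ROCM_PATH}/bin:${PATH}"
export LD_LIBRARY_PATH="${ROCM_PATH}/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
# Inspect the result; on a working node, rocm-smi should now resolve.
echo "${LD_LIBRARY_PATH}"
```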
24 changes: 1 addition & 23 deletions docs/setup-cluster/slurm/slurm-known-issues.rst
@@ -395,29 +395,7 @@ Some constraints are due to differences in behavior between Docker and Singulari
AMD/ROCm Known Issues
***********************

- AMD/ROCm support is available only with Singularity containers. While Determined does add the
proper Podman arguments to enable ROCm GPU support, the capabilities have not yet been verified.

- Launching experiments with ``slot_type: rocm``, may fail with the error ``RuntimeError: No HIP
GPUs are available``. Ensure that the compute nodes are providing ROCm drivers and libraries
compatible with the environment image that you are using and that they are available in the
default locations, or are added to the ``path`` and/or ``ld_library_path`` variables in the
:ref:`slurm configuration <cluster-configuration-slurm>`. Depending upon your system
configuration, you may need to select a different ROCm image. See :ref:`set-environment-images`
for the images available.

- Launching experiments with ``slot_type: rocm``, may fail in the AMD/ROCm libraries with with the
error ``terminate called after throwing an instance of 'boost::filesystem::filesystem_error'
what(): boost::filesystem::remove: Directory not empty: "/tmp/miopen-...``. A potential
workaround is to disable the per-container ``/tmp`` by adding the following :ref:`bind mount
<exp-bind-mounts>` in your experiment configuration or globally by using the
``task_container_defaults`` section in your master configuration:

.. code:: yaml

bind_mounts:
- host_path: /tmp
container_path: /tmp
For AMD/ROCm support and known issues, visit :ref:`AMD ROCm GPUs <rocm-support>`.

***************************************
Determined AI Experiment Requirements