docs: Update ROCM support #9893

Merged
merged 4 commits into from
Sep 10, 2024
tara-det-ai
Member

@tara-det-ai tara-det-ai commented Sep 4, 2024

Update the docs to reflect the latest ROCm support, as described in the 0.36.0 release notes and earlier release notes. Make the content more readable and less repetitive, and organize it under the appropriate sections and pages.

https://hpe-aiatscale.atlassian.net/browse/DOCSMLDX-35

docs preview: https://determined-ai-docs.s3.us-west-2.amazonaws.com/previews/d97560c619e6d0ca2965fcbf25127c11/setup-cluster/rocm-support.html

Ticket

Description

Test Plan

Checklist

  • Changes have been manually QA'd
  • New features have been approved by the corresponding PM
  • User-facing API changes have the "User-facing API Change" label
  • Release notes have been added as a separate file under docs/release-notes/
    See Release Note for details.
  • Licenses have been included for new code which was copied and/or modified from any external code

@cla-bot cla-bot bot added the cla-signed label Sep 4, 2024
@determined-ci determined-ci requested a review from a team September 4, 2024 18:09
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Sep 4, 2024
netlify bot commented Sep 4, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit 7b4b376
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/66e07e210fd2480008d1fec4

@tara-det-ai tara-det-ai force-pushed the docs/Update-ROCm-support branch 2 times, most recently from 87e3862 to 3ef42d0 Compare September 5, 2024 18:41
``RuntimeError: No HIP GPUs are available``. Ensure compute nodes have compatible ROCm drivers and
libraries installed and available in default locations or added to the ``PATH`` and/or ``LD_LIBRARY_PATH``.

- **Boost Filesystem Errors**: You may encounter the error ``boost::filesystem::remove: Directory
Member

I believe this is just going to be a Slurm consideration. I'm hardly the expert here, but I might call that out unless you've heard that this can happen in Kubernetes too.

Member Author

Update from the Slack convo: K8s is supported and the agent has been deprecated.

Member Author

oh i see what you are saying
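
For the ``No HIP GPUs are available`` case in the diff excerpt quoted at the top of this thread, the "libraries available in default locations or on ``LD_LIBRARY_PATH``" check could be sketched roughly as follows. This is only an illustration, not something from the PR: the helper name `find_hip_runtime`, the library name prefix `libamdhip64.so`, and the default directory `/opt/rocm/lib` are all assumptions that may not match a given node.

```python
import os
from pathlib import Path

# Assumptions (not taken from the PR): the HIP runtime library file name
# starts with "libamdhip64.so", and /opt/rocm/lib is a typical default
# install location. Adjust both for your compute nodes.
DEFAULT_DIRS = ("/opt/rocm/lib",)
HIP_LIB_PREFIX = "libamdhip64.so"


def find_hip_runtime(env=None, default_dirs=DEFAULT_DIRS):
    """Return the directories that contain a HIP runtime library.

    LD_LIBRARY_PATH entries are searched first, then the default directories.
    """
    env = os.environ if env is None else env
    entries = [p for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    found = []
    for directory in [*entries, *default_dirs]:
        path = Path(directory)
        if not path.is_dir():
            continue  # skip stale or missing path entries
        has_lib = any(c.name.startswith(HIP_LIB_PREFIX) for c in path.iterdir())
        if has_lib and directory not in found:
            found.append(directory)
    return found


if __name__ == "__main__":
    hits = find_hip_runtime()
    if hits:
        print("HIP runtime found in:", ", ".join(hits))
    else:
        print("No HIP runtime found; check the ROCm install and LD_LIBRARY_PATH.")
```

An empty result only says the library was not found in the searched directories; it does not confirm that the installed driver version is compatible.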

Member

@mackrorysd mackrorysd left a comment

Left some comments, but otherwise I think this is good.

<https://github.com/determined-ai/environments/blob/main/Dockerfile-infinityhub-pytorch>`__.

For more detailed information about configuration, visit the :ref:`helm-config-reference` or visit
:ref:`rocm-known-issues` for details on current limitations and troubleshooting.
Member Author

note to self: known issues are on the same page

@tara-det-ai tara-det-ai merged commit 8fb9f6b into main Sep 10, 2024
75 of 90 checks passed
@tara-det-ai tara-det-ai deleted the docs/Update-ROCm-support branch September 10, 2024 17:19