docs: Update ROCM support #9893
Force-pushed from 535f221 to 87a19f4
Force-pushed from 87e3862 to 3ef42d0
``RuntimeError: No HIP GPUs are available``. Ensure compute nodes have compatible ROCm drivers and
libraries installed and available in default locations or added to the ``PATH`` and/or ``LD_LIBRARY_PATH``.

- **Boost Filesystem Errors**: You may encounter the error ``boost::filesystem::remove: Directory
I believe this is just going to be a Slurm consideration. I'm hardly the expert here, but I might call that out unless you've heard that this can happen in Kubernetes too.
the update from the slack convo: K8s is supported and agent has been deprecated
oh i see what you are saying
Left some comments, but otherwise I think this is good.
<https://github.com/determined-ai/environments/blob/main/Dockerfile-infinityhub-pytorch>`__.

For more detailed information about configuration, visit the :ref:`helm-config-reference` or visit
:ref:`rocm-known-issues` for details on current limitations and troubleshooting.
note to self: known issues are on the same page
Force-pushed from 3ef42d0 to 3d4c2e4
Update docs to latest ROCm support as mentioned in 0.36.0 release notes and prior release notes. Make content more readable and less duplicative. Organize the content under proper sections and pages.
Force-pushed from 3d4c2e4 to 6fbfdcc
https://hpe-aiatscale.atlassian.net/browse/DOCSMLDX-35
docs preview: https://determined-ai-docs.s3.us-west-2.amazonaws.com/previews/d97560c619e6d0ca2965fcbf25127c11/setup-cluster/rocm-support.html