Multiple issues related to the `runpod` backend #1133

peterschmidt85 · 2024-04-15T10:18:00Z

Uses a hardcoded Docker image name (runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04 instead of dstack's Docker image)
The runpod backend always uses the Docker image's default entrypoint. As a result, no configuration works if the Docker image's default entrypoint isn't bash or sh.
Uses hardcoded Docker args (instead of the default Docker commands used in all other backends)
Uses SSH port 22 on the container (instead of 10022 as other backends do)
Doesn't support registry_auth
If a Docker image is large, the runpod backend fails with f"Wait instance {instance_id} timeout" and proceeds to try another offer without terminating the pod that is being created. This leads to creating multiple pods instead of failing the job.
If there is an error when running the command, the runpod backend fails with

dstack/src/dstack/_internal/core/backends/runpod/compute.py", line 95, in run_job
  for port in pod["runtime"]["ports"]:
TypeError: 'NoneType' object is not iterable

and proceeds to try another offer without terminating the pod that is being created. This leads to creating multiple pods instead of failing the job.
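
The last two items share one failure mode: the pod is left running when provisioning fails. A minimal sketch of one way to handle both, assuming hypothetical `get_pod` and `terminate_pod` callables that wrap the RunPod API (these names, the `ProvisioningError` class, and the default timeouts are illustrative, not dstack's actual code). It tolerates `pod["runtime"]` being `None` while the image is still pulling, and terminates the pod before re-raising any error:

```python
import time
from typing import Callable, Optional


class ProvisioningError(Exception):
    """Hypothetical error type: the pod could not be provisioned."""


def wait_for_ports(
    get_pod: Callable[[str], Optional[dict]],   # hypothetical API wrapper
    terminate_pod: Callable[[str], None],       # hypothetical API wrapper
    pod_id: str,
    timeout: float = 300.0,                     # assumed default, in seconds
    poll_interval: float = 5.0,                 # assumed default, in seconds
) -> list:
    """Poll until the pod reports its ports; terminate the pod on any failure."""
    deadline = time.monotonic() + timeout
    try:
        while time.monotonic() < deadline:
            pod = get_pod(pod_id) or {}
            # "runtime" may be None while the image is still being pulled,
            # which is what caused the TypeError in the traceback above.
            runtime = pod.get("runtime") or {}
            ports = runtime.get("ports")
            if ports:
                return ports
            time.sleep(poll_interval)
        raise ProvisioningError(f"Wait instance {pod_id} timeout")
    except Exception:
        # Never leave a half-created pod behind before the error propagates.
        terminate_pod(pod_id)
        raise
```

With this shape, a timeout or an unexpected response cleans up the pod first, so trying another offer (or failing the job) can no longer accumulate orphaned pods.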


peterschmidt85 · 2024-04-15T12:05:48Z

@jvstme In theory, we could detect if the image has a non-default entrypoint automatically, skip the runpod backend in that case, and show a warning. Would that be easy to implement?

jvstme · 2024-04-15T12:19:23Z

@peterschmidt85, I think it shouldn't be difficult. See example of detecting the image entrypoint
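
A rough sketch of what such a check could look like, assuming the image configuration has already been fetched from the registry as a dict in the OCI image config layout (`{"config": {"Entrypoint": [...]}}`). The function name and the set of accepted shells are hypothetical, and treating an image with no entrypoint as compatible is an assumption:

```python
import posixpath

# Assumption: only these shells count as a compatible entrypoint.
SHELL_NAMES = {"bash", "sh"}


def has_incompatible_entrypoint(image_config: dict) -> bool:
    """Return True if the image sets an entrypoint other than bash or sh."""
    entrypoint = (image_config.get("config") or {}).get("Entrypoint")
    if not entrypoint:
        # No entrypoint set: assumed to be fine for the runpod backend.
        return False
    # Compare by executable name so "/bin/bash" and "bash" are treated alike.
    return posixpath.basename(entrypoint[0]) not in SHELL_NAMES
```

The backend selection code could then skip runpod offers and log a warning whenever this returns `True`.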

peterschmidt85 added the major and bug labels Apr 15, 2024

peterschmidt85 added a commit that referenced this issue Apr 15, 2024

Multiple issues related to the runpod backend #1133

5fdd0a5

TheBits mentioned this issue Apr 15, 2024

Multiple issues related to the runpod backend #1136

Merged

5 tasks

TheBits linked a pull request Apr 15, 2024 that will close this issue

Multiple issues related to the runpod backend #1136

Merged

5 tasks

TheBits closed this as completed in #1136 Apr 15, 2024

TheBits pushed a commit that referenced this issue Apr 15, 2024

Multiple issues related to the runpod backend #1133

6576d28

TheBits pushed a commit that referenced this issue Apr 15, 2024

Multiple issues related to the runpod backend #1133

d9b2870

Bihan mentioned this issue Apr 16, 2024

Add spot in runpod #1119

Merged

Bihan pushed a commit to SMTM-Capital/dstack that referenced this issue Apr 30, 2024

Multiple issues related to the runpod backend dstackai#1133

8644ebb
