Multiple issues related to the runpod backend #1133

Closed
4 of 7 tasks
peterschmidt85 opened this issue Apr 15, 2024 · 2 comments · Fixed by #1136
Labels
bug (Something isn't working), major

Comments

@peterschmidt85
Contributor

peterschmidt85 commented Apr 15, 2024

  • Uses a hardcoded Docker image name (runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04) instead of dstack's Docker image
  • Always uses the Docker image's default entrypoint, so no configuration will work if that entrypoint isn't bash or sh
  • Uses hardcoded Docker args (instead of the default Docker commands used in all other backends)
  • Uses SSH port 22 on the container (instead of 10022 as other backends do)
  • Doesn't support registry_auth
  • If a Docker image is large, the runpod backend fails with f"Wait instance {instance_id} timeout" and proceeds to try another offer without terminating the pod that is being created. This leads to creating multiple pods instead of failing the job.
  • If there is an error when running the command, the runpod backend fails with:
dstack/src/dstack/_internal/core/backends/runpod/compute.py", line 95, in run_job
  for port in pod["runtime"]["ports"]:
TypeError: 'NoneType' object is not iterable

and proceeds to try another offer without terminating the pod that is being created. This leads to creating multiple pods instead of failing the job.
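The last two items share a failure mode: the pod is left running when provisioning fails. A minimal sketch of one possible guard follows; it is not the actual fix from #1136. The names `extract_ports`, `wait_for_ports`, `get_pod`, `terminate_pod` and `PodProvisioningError` are hypothetical. Only the `pod["runtime"]["ports"]` shape and the timeout message come from the report above.

```python
import time
from typing import Any, Callable, Dict, List, Optional


class PodProvisioningError(Exception):
    """Hypothetical error type raised when a pod never becomes ready."""


def extract_ports(pod: Optional[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # pod["runtime"] can be None while the pod is starting or after a
    # failed command, so never iterate over it directly.
    runtime = (pod or {}).get("runtime") or {}
    return runtime.get("ports") or []


def wait_for_ports(
    get_pod: Callable[[str], Optional[Dict[str, Any]]],  # hypothetical API call
    terminate_pod: Callable[[str], Any],  # hypothetical API call
    pod_id: str,
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, Any]]:
    """Poll until the pod reports ports; terminate the pod on any failure."""
    deadline = time.monotonic() + timeout
    try:
        while time.monotonic() < deadline:
            ports = extract_ports(get_pod(pod_id))
            if ports:
                return ports
            sleep(interval)
        raise PodProvisioningError(f"Wait instance {pod_id} timeout")
    except Exception:
        # Clean up before the caller moves on to another offer, so a slow
        # image pull or a broken command does not leave an orphaned pod.
        terminate_pod(pod_id)
        raise
```

With this shape, a timeout or an unexpected response still surfaces as an exception, but the pod is terminated before the next offer is tried.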

@peterschmidt85
Contributor Author

peterschmidt85 commented Apr 15, 2024

@jvstme In theory, we could detect if the image has a non-default entrypoint automatically, skip the runpod backend in that case, and show a warning. Would that be easy to implement?

@jvstme
Collaborator

jvstme commented Apr 15, 2024

@peterschmidt85, I think it shouldn't be difficult. See example of detecting the image entrypoint
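The linked example is not reproduced here. As a rough sketch of the idea, assume the image's configuration JSON has already been fetched from the registry and parsed into a dict. The OCI image config stores the entrypoint under `config.Entrypoint`, and `docker image inspect` reports it under `Config.Entrypoint`. The helper names `get_entrypoint` and `is_shell_entrypoint` are hypothetical, and how to treat an image with no entrypoint is left as a policy choice.

```python
import posixpath
from typing import Any, Dict, List

# Entrypoints the runpod backend is reported to work with (see the issue text).
SHELL_NAMES = {"sh", "bash"}


def get_entrypoint(image_config: Dict[str, Any]) -> List[str]:
    """Return the image entrypoint as a list, or [] if none is set."""
    # Accept both the OCI config key ("config") and the key used by
    # `docker image inspect` output ("Config").
    cfg = image_config.get("config") or image_config.get("Config") or {}
    return list(cfg.get("Entrypoint") or [])


def is_shell_entrypoint(entrypoint: List[str]) -> bool:
    """True if the entrypoint executable is sh or bash."""
    if not entrypoint:
        # No entrypoint is not a shell; whether to still allow such images
        # is a decision this sketch leaves to the caller.
        return False
    return posixpath.basename(entrypoint[0]) in SHELL_NAMES
```

A backend could call `is_shell_entrypoint(get_entrypoint(cfg))` before choosing offers, then skip runpod and show a warning when it returns `False`.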

2 participants