[Bug] Job Sample YAML ray_v1alpha1_rayjob.yaml fails with empty node-ip-address $MY_POD_IP #805

Closed
2 tasks done
architkulkarni opened this issue Dec 7, 2022 · 1 comment · Fixed by #807
Labels
bug Something isn't working

Comments

@architkulkarni
Contributor

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

Following the documentation at https://ray-project.github.io/kuberay/guidance/rayjob/ with KubeRay master on a kind cluster, the worker pod goes into a crash loop:

Every 1.0s: kubectl get pod                     Archits-MBP.local.meter: Tue Dec  6 15:47:13 2022

NAME                                                      READY   STATUS             RESTARTS      AGE
rayjob-sample-raycluster-plpll-head-wk7pr                 1/1     Running            0             16s
rayjob-sample-raycluster-plpll-worker-small-group-djcc6   0/1     CrashLoopBackOff   3 (41s ago)   16s

Inspecting the logs, we see the traceback:

❯ kubectl logs rayjob-sample-raycluster-plpll-worker-small-group-djcc6
Traceback (most recent call last):
 File "/home/ray/anaconda3/bin/ray", line 8, in <module>
   sys.exit(main())
 File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 2596, in main
   return cli()
 File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
   return self.main(*args, **kwargs)
 File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1053, in main
   rv = self.invoke(ctx)
 File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
   return _process_result(sub_ctx.command.invoke(sub_ctx))
 File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
   return ctx.invoke(self.callback, **ctx.params)
 File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 754, in invoke
   return __callback(*args, **kwargs)
 File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py", line 852, in wrapper
   return f(*args, **kwargs)
 File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 588, in start
   node_ip_address = services.resolve_ip_for_localhost(node_ip_address)
 File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 531, in resolve_ip_for_localhost
   raise ValueError(f"Malformed address: {address}")
ValueError: Malformed address:
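
The message ends right after the colon, which is consistent with the address being an empty string. Below is a minimal sketch of how that could come about, assuming the start command is expanded by a POSIX-style shell, where an unset variable becomes an empty string. Both `expand_like_shell` and `check_address` are hypothetical helpers written for illustration; `check_address` only mimics the error message seen in the traceback and is not Ray's implementation. The IP `10.244.0.7` is made up.

```python
import re

# Matches $NAME or ${NAME}; the variable name lands in group 1 or group 2.
_VAR = re.compile(r"\$(\w+)|\$\{(\w+)\}")


def expand_like_shell(value: str, env: dict) -> str:
    """Expand $VAR references; unset variables become '' as in a POSIX shell."""
    return _VAR.sub(lambda m: env.get(m.group(1) or m.group(2), ""), value)


def check_address(address: str) -> str:
    """Hypothetical stand-in for the validation step in the traceback."""
    if not address:
        raise ValueError(f"Malformed address: {address}")
    return address


def describe(env: dict) -> str:
    """Expand the sample's node-ip-address value and report the outcome."""
    address = expand_like_shell("$MY_POD_IP", env)
    try:
        check_address(address)
    except ValueError as exc:
        return str(exc)
    return f"ok: {address}"


print(describe({}))                           # MY_POD_IP unset
print(describe({"MY_POD_IP": "10.244.0.7"}))  # MY_POD_IP set (made-up IP)
```

With the variable unset, the sketch produces a message that stops after the colon, matching the last line of the worker log.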

Reproduction script

Run kind create cluster, then follow the docs page at https://ray-project.github.io/kuberay/guidance/rayjob/

Anything else

From tracing through the Ray code, the error is happening because an empty node-ip-address was passed to ray start. There is a field node-ip-address: $MY_POD_IP in the sample Job yaml, so this environment variable must not have been set. I assume the Ray operator is supposed to set this environment variable, but I'm not sure where it gets set.
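
One way to catch this class of problem before deploying is to compare the variables referenced in the start parameters against the variables the container actually defines. The following is a rough sketch under that assumption. `undefined_env_refs` is a hypothetical helper, the dictionary stands in for the parsed start parameters from the sample YAML, and only the `node-ip-address: $MY_POD_IP` entry comes from this issue; the `block` entry is illustrative.

```python
import re

# Captures the variable name from $NAME or ${NAME}.
_VAR = re.compile(r"\$\{?(\w+)\}?")


def undefined_env_refs(start_params: dict, env_names: set) -> dict:
    """Map each start parameter to the referenced variables that are not defined."""
    missing = {}
    for key, value in start_params.items():
        refs = [name for name in _VAR.findall(str(value)) if name not in env_names]
        if refs:
            missing[key] = refs
    return missing


# Only node-ip-address is taken from the issue; the other entry is illustrative.
start_params = {"node-ip-address": "$MY_POD_IP", "block": "true"}

# No environment variables defined on the container: the reference is flagged.
print(undefined_env_refs(start_params, set()))

# MY_POD_IP defined: nothing is flagged.
print(undefined_env_refs(start_params, {"MY_POD_IP"}))
```

A check like this does not say where the variable should be set. It only reports that a parameter depends on a name nothing defines, which is the situation described above.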

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@architkulkarni architkulkarni added the bug Something isn't working label Dec 7, 2022
@DmitriGekhtman
Collaborator

DmitriGekhtman commented Dec 7, 2022

I probably caused this and suspect it's a simple fix.

DmitriGekhtman added a commit that referenced this issue Dec 7, 2022
Closes #805 which was caused by incomplete config cleanup in #761 .

Signed-off-by: Dmitri Gekhtman <[email protected]>
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this issue Sep 24, 2023
Closes ray-project#805 which was caused by incomplete config cleanup in ray-project#761 .

Signed-off-by: Dmitri Gekhtman <[email protected]>