[Bug] Serve 1k replica release test failed #23747

Closed
2 tasks done
jiaodong opened this issue Apr 6, 2022 · 3 comments
Labels
bug (Something that is supposed to be working; but isn't) · P0 (Issues that should be fixed in short order) · serve (Ray Serve Related Issue)

Comments

@jiaodong
Member

jiaodong commented Apr 6, 2022

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Serve

Issue Severity

High: It blocks me to complete my task.

What happened + What you expected to happen

The error looks legitimate to me, so I'm filing this as P0 and self-assigning it for now.

(ServeController pid=769) 2022-04-05 15:16:52,987       INFO http_state.py:113 -- Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:sGYxOM:SERVE_PROXY_ACTOR-node:172.31.92.105-0' on node 'node:172.31.92.105-0' listening on '127.0.0.1:8000'
2022-04-05 15:16:54,419 INFO api.py:827 -- Started Serve instance in namespace 'e170177d-9c49-449f-a67a-44edd4b0ef22'.
2022-04-05 15:16:54,420 INFO single_deployment_1k_noop_replica.py:116 -- Ray serve http_host: 127.0.0.1, http_port: 8000
2022-04-05 15:16:54,420 INFO single_deployment_1k_noop_replica.py:118 -- Deploying with 1000 target replicas ....

2022-04-05 15:16:54,425 INFO api.py:647 -- Updating deployment 'echo'. component=serve deployment=echo
(HTTPProxyActor pid=816) INFO:     Started server process [816]
(ServeController pid=769) 2022-04-05 15:16:54,520       INFO deployment_state.py:1211 -- Adding 1000 replicas to deployment 'echo'. component=serve deployment=echo
Traceback (most recent call last):
  File "workloads/single_deployment_1k_noop_replica.py", line 152, in <module>
    main()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "workloads/single_deployment_1k_noop_replica.py", line 119, in main
    all_endpoints = deploy_replicas(num_replicas, max_batch_size)
  File "workloads/single_deployment_1k_noop_replica.py", line 72, in deploy_replicas
    Echo.deploy()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/deployment.py", line 250, in deploy
    _blocking=_blocking,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/api.py", line 190, in check
    return f(self, *args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/api.py", line 394, in deploy
    self._wait_for_deployment_healthy(name)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/api.py", line 328, in _wait_for_deployment_healthy
    raise RuntimeError(f"Deployment {name} is UNHEALTHY: {status.message}")
RuntimeError: Deployment echo is UNHEALTHY: Failed to update deployment:
'>' not supported between instances of 'NoneType' and 'int'.
(ServeController pid=769) 2022-04-05 15:17:24,967       INFO deployment_state.py:1237 -- Removing 1000 replicas from deployment 'echo'. component=serve deployment=echo
(ServeController pid=769) 2022-04-05 15:17:29,812       INFO http_state.py:113 -- Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:sGYxOM:SERVE_PROXY_ACTOR-node:172.31.74.200-0' on node 'node:172.31.74.200-0' listening on '127.0.0.1:8000'
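
For context on the message itself: `'>' not supported between instances of 'NoneType' and 'int'` is the standard Python 3 `TypeError` raised when `None` is ordered against an integer. The snippet below is a minimal sketch of that failure mode only. The names `exceeds_limit`, `exceeds_limit_safe`, `value`, and `limit` are hypothetical and do not come from Ray Serve, and the log above does not show which comparison in the controller actually failed.

```python
from typing import Optional


def exceeds_limit(value: Optional[int], limit: int) -> bool:
    """Hypothetical helper that compares an optional value against a limit.

    There is deliberately no None check here, to reproduce the error.
    """
    return value > limit


def exceeds_limit_safe(value: Optional[int], limit: int) -> bool:
    """Same comparison, but a missing value is treated as not exceeding."""
    return value is not None and value > limit


try:
    exceeds_limit(None, 0)
except TypeError as e:
    # Capture the message so it can be compared with the one in the log.
    error_message = str(e)

print(error_message)
print(exceeds_limit_safe(None, 0))
```

Whether a guard like `exceeds_limit_safe` is the right fix depends on why the value was `None` in the first place, which this thread does not establish.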

Versions / Dependencies

nightly

Reproduction script

.

Anything else

.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@jiaodong jiaodong added bug Something that is supposed to be working; but isn't P0 Issues that should be fixed in short order serve Ray Serve Related Issue platform labels Apr 6, 2022
@jiaodong jiaodong self-assigned this Apr 6, 2022
@jiaodong
Member Author

jiaodong commented Apr 7, 2022

I tried re-running the serve_single_deployment_1k_noop_replica test with the release tool using the wheel from 21ee7e4, but ran into some import errors. I will give it another try later, once we have the rebased wheel on this commit.

Traceback (most recent call last):
  File "run_release_test.py", line 137, in main
    no_terminate=no_terminate,
  File "/Users/jiaodong/Workspace/ray/release/ray_release/glue.py", line 314, in run_release_test
    raise pipeline_exception
  File "/Users/jiaodong/Workspace/ray/release/ray_release/glue.py", line 204, in run_release_test
    command_runner.prepare_remote_env()
  File "/Users/jiaodong/Workspace/ray/release/ray_release/command_runner/sdk_runner.py", line 58, in prepare_remote_env
    ) from e
ray_release.exception.RemoteEnvSetupError: Error setting up remote environment: No module named 'ray.job_submission'

@jiaodong
Member Author

jiaodong commented Apr 7, 2022

Running the same test locally with 10 replicas passed for me, however. Next steps:

  1. Check the next day's nightly to see if the failure persists
  2. Patch on Shreya's PR and submit it to Anyscale again to reproduce

@jiaodong
Member Author

jiaodong commented Apr 7, 2022

I just retried the same test on the latest commit of Shreya's attached PR, freshly rebased on master, and it passed: https://buildkite.com/ray-project/release-tests-pr/builds/55#36193e99-96c2-4663-82d1-b160a5f925a9

So I'm closing this issue, which was triggered by nightly, as it suggests we might be seeing an unhealthy replica but failing to retrieve the correct exception message. I will watch nightly for a few more days this week and re-open this issue if the problem persists.
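
One general way a status message can end up carrying only the short exception text, as in the `UNHEALTHY` message reported earlier in this thread, is when an error is reduced to `str(e)` before it is reported. The following is a sketch of that pattern under stated assumptions, not Serve's actual controller code: `update_deployment`, `status_short`, and `status_full` are hypothetical names used purely for illustration.

```python
import traceback


def update_deployment() -> bool:
    """Hypothetical stand-in for an update step that fails."""
    target = None
    return target > 0  # raises TypeError: None compared with an int


def status_short() -> str:
    """Report only the exception text; the failing frame is lost."""
    try:
        update_deployment()
    except Exception as e:
        return f"Failed to update deployment:\n{e}"
    return "HEALTHY"


def status_full() -> str:
    """Report the formatted traceback, which includes the failing frame."""
    try:
        update_deployment()
    except Exception:
        return f"Failed to update deployment:\n{traceback.format_exc()}"
    return "HEALTHY"


print(status_short())
print(status_full())
```

The second form is longer, but it names the function and line where the comparison failed, which is the information missing from the message in this report.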
