[Bug] compatibility test for the nightly Ray image fails #1055

Merged
merged 7 commits into from
Apr 28, 2023


kevin85421
Member

@kevin85421 kevin85421 commented Apr 27, 2023

Why are these changes needed?

(Bug? Ray 2.1.0) In some cases, HTTPProxy will not be created on the head Pod after the cluster recovers from a failure.

Experiments for 73b37b1

#!/bin/bash
for i in {1..25}
do
  RAY_IMAGE=rayproject/ray:2.1.0 python3 tests/compatibility-test.py RayFTTestCase 2>&1 | tee log_$i.txt
done
  • Run RayFTTestCase 25 times on my devbox.
    • test_detached_actor never fails.

    • test_ray_serve fails 6 times.

      test_ray_serve_1.py (5 failures)
      (ServeController pid=643) INFO 2023-04-28 01:32:33,683 controller 643 http_state.py:132 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-c82ea22161ab30914ebd453cc3bd00e7e2088203b1506c26dc98116c' on node 'c82ea22161ab30914ebd453cc3bd00e7e2088203b1506c26dc98116c' listening on '127.0.0.1:8000'
      (ServeController pid=643) INFO 2023-04-28 01:32:33,716 controller 643 http_state.py:132 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-cc6e7b1793a1f6e801f67f310ef39428d54e15c3ca140fcc193795fd' on node 'cc6e7b1793a1f6e801f67f310ef39428d54e15c3ca140fcc193795fd' listening on '127.0.0.1:8000'
      (ServeController pid=643) INFO 2023-04-28 01:32:33,725 controller 643 http_state.py:132 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-438634dbcfb7ea6edd2cc3fef2feaf02ae2df4c6198e7a17a3155f72' on node '438634dbcfb7ea6edd2cc3fef2feaf02ae2df4c6198e7a17a3155f72' listening on '127.0.0.1:8000'
      (HTTPProxyActor pid=700) INFO:     Started server process [700]
      Traceback (most recent call last):
        File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/_private/api.py", line 241, in serve_start
          timeout=HTTP_PROXY_TIMEOUT,
        File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
          return getattr(ray, func.__name__)(*args, **kwargs)
        File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/api.py", line 42, in get
          return self.worker.get(vals, timeout=timeout)
        File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/worker.py", line 434, in get
          res = self._get(to_get, op_timeout)
        File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/worker.py", line 462, in _get
          raise err
      ray.exceptions.GetTimeoutError: Get timed out: some object(s) not ready.
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "samples/test_ray_serve_1.py", line 15, in <module>
          handle = serve.run(MyModelDeployment.bind(msg="Hello world!"))
        File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/api.py", line 483, in run
          http_options={"host": host, "port": port, "location": "EveryNode"},
        File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/_private/api.py", line 245, in serve_start
          f"HTTP proxies not available after {HTTP_PROXY_TIMEOUT}s."
      TimeoutError: HTTP proxies not available after 60s.
      command terminated with exit code 1
      
      test_ray_serve_2.py (1 failure)
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "/home/ray/anaconda3/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
          timeout=timeout
        File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 786, in urlopen
          method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
        File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/retry.py", line 592, in increment
          raise MaxRetryError(_pool, url, error or ResponseError(cause))
      urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=8000): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcabbf355d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "samples/test_ray_serve_2.py", line 23, in <module>
          retry_with_timeout(send_req, 180)
        File "samples/test_ray_serve_2.py", line 14, in retry_with_timeout
          raise err
        File "samples/test_ray_serve_2.py", line 9, in retry_with_timeout
          return func()
        File "samples/test_ray_serve_2.py", line 17, in send_req
          response = requests.get('http://127.0.0.1:8000', timeout=10)
        File "/home/ray/anaconda3/lib/python3.7/site-packages/requests/api.py", line 76, in get
          return request('get', url, params=params, **kwargs)
        File "/home/ray/anaconda3/lib/python3.7/site-packages/requests/api.py", line 61, in request
          return session.request(method=method, url=url, **kwargs)
        File "/home/ray/anaconda3/lib/python3.7/site-packages/requests/sessions.py", line 542, in request
          resp = self.send(prep, **send_kwargs)
        File "/home/ray/anaconda3/lib/python3.7/site-packages/requests/sessions.py", line 655, in send
          r = adapter.send(request, **kwargs)
        File "/home/ray/anaconda3/lib/python3.7/site-packages/requests/adapters.py", line 516, in send
          raise ConnectionError(e, request=request)
      requests.exceptions.ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=8000): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcabbf355d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
      command terminated with exit code 1
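
The test_ray_serve_2.py traceback above shows a retry_with_timeout(send_req, 180) helper wrapping requests.get('http://127.0.0.1:8000', timeout=10). The following is only a minimal sketch of how such a helper might behave, not the actual contents of samples/test_ray_serve_2.py. The interval_s parameter and the flaky_request stand-in are assumptions made for illustration.

```python
import time


def retry_with_timeout(func, timeout_s, interval_s=1.0):
    """Call func until it succeeds or timeout_s seconds have elapsed.

    Hypothetical reconstruction: once the deadline has passed, the last
    exception is re-raised to the caller.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return func()
        except Exception as err:
            if time.monotonic() >= deadline:
                raise err
            time.sleep(interval_s)


# Stand-in for send_req: fails twice with a connection error, then succeeds.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("connection refused")
    return "ok"


result = retry_with_timeout(flaky_request, timeout_s=5, interval_s=0.01)
print(result, attempts["count"])
```

With a helper of this shape, a proxy that never comes up on 127.0.0.1:8000 surfaces as the final ConnectionError only after the whole retry window is exhausted, which matches the shape of the traceback above.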
      

The test_ray_serve_1.py error message comes from the code at the link (the serve_start frame in ray/serve/_private/api.py in the traceback above). The cause appears to be a failure to get the HTTPProxy actors, which is similar to my observation above: "(Bug? Ray 2.1.0) In some cases, HTTPProxy will not be created on the head Pod after the cluster recovers from a failure." I will file an issue later, but the check seems to be legacy code.
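
To separate the two failure modes across the 25 log_$i.txt files written by the loop above, something like the following sketch could be used. It is an assumption about how one might tally the logs, not the procedure actually used for the experiment. The signature strings are copied from the tracebacks above; the labels and function names are hypothetical.

```python
from collections import Counter
from pathlib import Path

# Failure signatures copied from the tracebacks in this PR description.
# The labels on the left are made up for this sketch.
SIGNATURES = {
    "serve_start_timeout": "TimeoutError: HTTP proxies not available after",
    "connection_refused": "requests.exceptions.ConnectionError",
}


def classify_log(text):
    """Return the label of the first matching signature, else 'no_known_failure'."""
    for label, needle in SIGNATURES.items():
        if needle in text:
            return label
    return "no_known_failure"


def tally_logs(log_dir="."):
    """Count failure signatures across log_*.txt files in log_dir."""
    counts = Counter()
    for path in sorted(Path(log_dir).glob("log_*.txt")):
        counts[classify_log(path.read_text(errors="replace"))] += 1
    return counts


if __name__ == "__main__":
    print(dict(tally_logs()))
```

Each log is assigned to the first signature that matches, so a run that hit both errors would be counted only under serve_start_timeout.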

Related issue number

Closes #1053
#848
Closes ray-project/ray#34799

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 kevin85421 changed the title [WIP] [Bug] compatibility test for the nightly Ray image fails Apr 27, 2023
@kevin85421 kevin85421 marked this pull request as ready for review April 27, 2023 16:58
@kevin85421
Member Author

cc @architkulkarni

test_ray_serve has become much flakier since #1036 because:

(1) In most cases, test_detached_actor and test_ray_serve schedule their tasks / actors on the head Pod. However, this is not what we really want to test.

(2) Without #1036, all Pods in the cluster crash (the head Pod crashes once, and each worker Pod crashes twice). With #1036, only the head Pod crashes.

This PR reduces the flakiness, but the test is still flaky. See the section "Experiments for 73b37b1" for more details. In GitHub Actions, it passed twice consecutively without any retry. That's why I think this PR is ready for review.
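
As a rough, back-of-the-envelope reading of the devbox numbers (6 failing runs out of 25), the chance of two consecutive passes can be estimated as below. This assumes that runs are independent and that the devbox failure rate carries over to GitHub Actions; neither assumption is verified here.

```python
from fractions import Fraction

runs = 25      # RayFTTestCase runs on the devbox
failures = 6   # runs in which test_ray_serve failed

pass_rate = Fraction(runs - failures, runs)  # 19/25
two_in_a_row = pass_rate ** 2                # 361/625

print(f"single-run pass rate: {float(pass_rate):.2f}")      # 0.76
print(f"two consecutive passes: {float(two_in_a_row):.2f}")  # 0.58
```

Under those assumptions, two clean runs in a row would be expected a little more than half the time, which is consistent with the test still being flaky.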

@architkulkarni architkulkarni self-assigned this Apr 28, 2023
Contributor

@architkulkarni architkulkarni left a comment


Looks good to me! If I understand correctly, this PR takes the test from consistently failing to merely flaky. Feel free to keep the original flakiness issue open, or open a new issue for the remaining flakiness.

@kevin85421 kevin85421 merged commit 2b136c9 into ray-project:master Apr 28, 2023
@kevin85421 kevin85421 mentioned this pull request Apr 28, 2023
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…#1055)

compatibility test for the nightly Ray image fails