Widening range of grpcio versions allowed. #28623

Merged
merged 7 commits into from
Sep 25, 2022

Conversation

cadedaniel
Member

@cadedaniel cadedaniel commented Sep 19, 2022

See #27299 for context.

script.py

Testing with the following code:

#!/usr/bin/env python3
import ray
ray.init()


@ray.remote
def task(argument):
    import grpc
    import platform
    print(argument, grpc.__version__, platform.python_version())

ray.get(task.remote('Hello world'))

python/grpcio versions tested on above script

grpcio==1.43.0 python==3.10.4
grpcio==1.44.0 python==3.10.4
grpcio==1.45.0 python==3.10.4
grpcio==1.46.0 python==3.10.4
grpcio==1.47.0 python==3.10.4
grpcio==1.48.1 python==3.10.4
grpcio==1.43.0 python==3.6.13
grpcio==1.44.0 python==3.6.13
grpcio==1.45.0 python==3.6.13
grpcio==1.46.0 python==3.6.13
grpcio==1.47.0 python==3.6.13
grpcio==1.48.1 python==3.6.13
grpcio==1.43.0 python==3.7.13
grpcio==1.44.0 python==3.7.13
grpcio==1.45.0 python==3.7.13
grpcio==1.46.0 python==3.7.13
grpcio==1.47.0 python==3.7.13
grpcio==1.48.1 python==3.7.13
grpcio==1.43.0 python==3.8.13
grpcio==1.44.0 python==3.8.13
grpcio==1.45.0 python==3.8.13
grpcio==1.46.0 python==3.8.13
grpcio==1.47.0 python==3.8.13
grpcio==1.48.1 python==3.8.13
grpcio==1.43.0 python==3.9.13
grpcio==1.44.0 python==3.9.13
grpcio==1.45.0 python==3.9.13
grpcio==1.46.0 python==3.9.13
grpcio==1.47.0 python==3.9.13
grpcio==1.48.1 python==3.9.13

test code

#!/usr/bin/env bash

set -e

source /home/ray/anaconda3/etc/profile.d/conda.sh

pr_commit="0193a19226c29c9988760114d67f6ea9af99f9e7"

ray_wheels="$(aws s3 ls s3://ray-ci-artifact-pr-public/$pr_commit/tmp/artifacts/.whl/ | grep -v 'cpp' | awk '{print $4}')"
grpcio_versions="1.43 1.44 1.45 1.46 1.47 1.48.1"

for ray_wheel in $ray_wheels; do

    # Derive the Python version from the wheel's python tag (e.g. cp310 -> 3.10).
    conda_create_cmd=$(echo "$ray_wheel" | sed 's/-/ /g' | awk '{print $3}' | sed 's/cp//g' | sed 's/^3/3./' | sed 's/^/conda create -n temp python=/')
    $conda_create_cmd --yes
    conda activate temp
    
    for grpcio_version in $grpcio_versions; do
        printf "Uninstalling\n"
        pip uninstall grpcio -y
        pip uninstall ray -y

        printf "Installing grpcio_version $grpcio_version\n"
        pip install grpcio==$grpcio_version

        printf "Installing Ray wheel $ray_wheel\n"
        pip install "https://ray-ci-artifact-pr-public.s3.us-west-2.amazonaws.com/$pr_commit/tmp/artifacts/.whl/$ray_wheel"

        printf "Running script on grpcio_version $grpcio_version\n"
        ./script.py
    done
done
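
The `sed`/`awk` pipeline in the script above derives the conda Python version from the wheel's python tag. A rough Python equivalent is sketched below. The function name and the example wheel filenames are hypothetical, and the sketch assumes the wheel name has no optional build tag, so the python tag is the third dash-separated field (the same assumption the shell pipeline makes).

```python
def python_version_from_wheel(wheel_name: str) -> str:
    """Map a CPython wheel filename to a dotted Python version, e.g. cp310 -> 3.10."""
    # Assumed layout: {dist}-{version}-{python tag}-{abi tag}-{platform}.whl
    python_tag = wheel_name.split("-")[2]
    if not python_tag.startswith("cp") or len(python_tag) < 4:
        raise ValueError(f"unexpected python tag: {python_tag!r}")
    digits = python_tag[2:]
    # The first digit is the major version; the remaining digits are the minor version.
    return f"{digits[0]}.{digits[1:]}"


if __name__ == "__main__":
    # Hypothetical filenames, for illustration only.
    for name in (
        "ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl",
        "ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl",
    ):
        print(name, "->", python_version_from_wheel(name))
```

Inserting the dot only after the leading major digit avoids mangling tags whose minor version also contains a `3`.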

@cadedaniel
Member Author

@jjyao what is the best way to test the setup.py installation process? I need to test python<3.10 and python>=3.10; are conda envs the best way?

@cadedaniel cadedaniel self-assigned this Sep 19, 2022
@cadedaniel cadedaniel linked an issue Sep 19, 2022 that may be closed by this pull request
@cadedaniel cadedaniel added the core Issues that should be addressed in Ray Core label Sep 19, 2022
@jjyao
Collaborator

jjyao commented Sep 19, 2022

what is the best way to test the setup.py installation process? I need to test python<3.10 and python>=3.10; are conda envs the best way?

Yea, I would use conda for that.

@cadedaniel cadedaniel marked this pull request as ready for review September 20, 2022 00:00
Contributor

@richardliaw richardliaw left a comment

stamp

@jjyao
Collaborator

jjyao commented Sep 20, 2022

Could you pull the master head? Want to make sure failed tests are unrelated.

@cadedaniel
Member Author

Notes on windows tests:

  • The test ::test_scheduling_class_depth[ray_start_regular0] failed on Windows three times, so it is unlikely to be a flaky timeout.
  • tests:test_scheduling_asan looks like it's taking longer. On the third attempt of Windows 5/6 it fails due to a timeout; on the second attempt it passes but takes longer than expected.

Contributor

@scv119 scv119 left a comment

hmm, shouldn't we still set the upper bound? For core dependencies, we should only approve a version once it's proven innocent.

@scv119 scv119 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 21, 2022
@cadedaniel
Copy link
Member Author

hmm, shouldn't we still set the upper bound? For core dependencies, we should only approve a version once it's proven innocent.

cc @jjyao

@jjyao
Collaborator

jjyao commented Sep 21, 2022

hmm, shouldn't we still set the upper bound? For core dependencies, we should only approve a version once it's proven innocent.

@scv119 This is not the strategy we are following, due to the nature of Ray as a library (if you look at setup.py, we don't have upper bounds for most of our (core) dependencies). Ray shouldn't prevent users from using the latest versions of dependencies that come out after Ray's release; they may want a newer version for some reason, such as a bug fix.

@scv119
Contributor

scv119 commented Sep 21, 2022

hmm, this (setting an upper bound) has been the practical strategy we have been following for grpcio (and other core dependencies), right? Empirically, for grpcio, not setting an upper bound has caused serious problems (i.e. Ray hangs), compared to the issues where users are not able to upgrade to the latest grpcio.

@jjyao
Collaborator

jjyao commented Sep 21, 2022

grpcio doesn't have an upper bound by default either. I think what we did in the past was force an upper bound when we discovered an issue and lift it once the issue was fixed (this is what we normally do for other core dependencies as well).

The strategy I mentioned is a general strategy we apply to all of our dependencies. Whether we want to make an exception for grpcio is a separate story but I don't think it should apply to all dependencies.

Personally I still feel we shouldn't put an upper bound (we may put an upper bound on the major version number but not the minor version number, if they are following semantic versioning).

@richardliaw may have more insights on this.

@scv119
Contributor

scv119 commented Sep 21, 2022

  • Not all Ray dependencies have the same impact radius. grpcio is a critical dependency for Ray and Ray core, i.e. if it breaks, Ray won't work AT ALL.
  • grpcio has a track record of breaking things, with releases published and then yanked: https://pypi.org/project/grpcio/#history
  • So we are really weighing the risk that the next grpcio release breaks Ray against the risk that some users can't use Ray with the latest grpcio.

That said, I'm not comfortable following the existing practice of not setting a grpcio upper bound without some other protection mechanism.

  • The baseline is setting an upper bound for grpcio, and only allowing a new version once it's proven innocent.
  • We might brainstorm other protection mechanisms, such as working closely with the grpcio team, or even vendoring our own grpcio.

I think the safest way is to set the upper bound to the version we know is working, and figure out a less strict protection mechanism.

@jjyao
Collaborator

jjyao commented Sep 22, 2022

Given that we haven't reached an agreement on whether to put a cap or not, let's add a cap to merge this PR, since it's still strictly better than Ray 2.0, and we can discuss a long-term solution later.
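
To make the trade-off in this discussion concrete, here is a minimal sketch of how a capped range differs from an uncapped one. The bounds (`LOWER`, `CAP`) and the helper names are illustrative assumptions taken from the versions tested in this PR, not the values that ended up in setup.py. The only fact taken from this thread is that 1.48.0 was yanked.

```python
from typing import Iterable, Optional, Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Turn '1.48.1' into (1, 48, 1); plain numeric releases only."""
    parts = [int(part) for part in version.split(".")]
    # Pad so that '1.43' and '1.43.0' compare as equal.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def allowed(
    version: str,
    lower: str,
    upper: Optional[str] = None,
    excluded: Iterable[str] = (),
) -> bool:
    """Return True if `version` satisfies lower <= version <= upper and is not excluded."""
    parsed = parse(version)
    if parsed < parse(lower):
        return False
    if upper is not None and parsed > parse(upper):
        return False
    return parsed not in {parse(item) for item in excluded}


# Illustrative values only (assumptions, not Ray's actual constraints).
LOWER = "1.43.0"
CAP = "1.48.1"
YANKED = ("1.48.0",)

if __name__ == "__main__":
    for candidate in ("1.47.0", "1.48.0", "1.48.1", "1.49.0"):
        capped = allowed(candidate, LOWER, upper=CAP, excluded=YANKED)
        uncapped = allowed(candidate, LOWER, excluded=YANKED)
        print(f"grpcio=={candidate}: capped={capped}, uncapped={uncapped}")
```

With a cap, a future release such as the hypothetical 1.49.0 is rejected until someone raises the bound. Without a cap, it is accepted as soon as it is published, which is the risk being weighed here.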

@cadedaniel
Member Author

cadedaniel commented Sep 22, 2022

Sounds good.

There are more non-flaky Windows tests failing; taking note here to see if they repeat:

Windows [1/6] (first try):
* //python/ray/tests:test_multiprocessing (timeout)

Windows [1/6] (second try):
* //python/ray/tests:test_multiprocessing (timeout)
* //python/ray/tests:test_asyncio (timeout)

Windows [5/6] (first try):
* //python/ray/tests:test_queue (timeout)

Windows [5/6] (second try):
* //python/ray/tests:test_queue (timeout)
* //python/ray/tests:test_runtime_env_working_dir_3 (failure for something to be GC'd after 20 seconds)

Running things manually on a windows machine:

//python/ray/tests:test_multiprocessing test_task_to_actor_assignment fails on Windows
//python/ray/tests:test_multiprocessing test_callbacks might hang
//python/ray/tests:test_queue passes

This reverts commit 1dd22fbd02b95d4c2941178675d136dd8f2b85fa.

Signed-off-by: Cade Daniel <[email protected]>
@cadedaniel
Member Author

cadedaniel commented Sep 22, 2022

//python/ray/tests:test_multiprocessing test_callbacks hangs for both grpcio==1.43.0 and 1.48.1 so I assume it has to do with my environment?

Is there a way I can replicate the BK environment and iterate faster? It is taking forever to wait for BK

same error message for both grpc versions

(base) C:\Users\Administrator\Downloads\ray-master\ray-master>pytest python\ray\tests\test_multiprocessing.py -k "test_callbacks" -s -v
================================================================= test session starts ==================================================================
platform win32 -- Python 3.7.6, pytest-7.1.3, pluggy-0.13.1 -- c:\programdata\anaconda3\python.exe
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('C:\\Users\\Administrator\\Downloads\\ray-master\\ray-master\\.hypothesis\\examples')
rootdir: C:\Users\Administrator\Downloads\ray-master\ray-master\python
plugins: anyio-3.6.1, hypothesis-5.4.1, arraydiff-0.3, astropy-header-0.1.2, asyncio-0.19.0, doctestplus-0.5.0, openfiles-0.4.0, remotedata-0.3.2
asyncio: mode=strict
collected 20 items / 19 deselected / 1 selected

python\ray\tests\test_multiprocessing.py::test_callbacks starting 4 processes using ray pool
Usage stats collection is enabled. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2022-09-22 19:17:11,055 INFO worker.py:1515 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
callback_queue.get
(PoolActor pid=7892) 2022-09-22 19:17:20,118    ERROR serialization.py:354 -- No module named 'test_multiprocessing'
(PoolActor pid=7892) Traceback (most recent call last):
(PoolActor pid=7892)   File "c:\programdata\anaconda3\lib\site-packages\ray\_private\serialization.py", line 352, in deserialize_objects
(PoolActor pid=7892)     obj = self._deserialize_object(data, metadata, object_ref)
(PoolActor pid=7892)   File "c:\programdata\anaconda3\lib\site-packages\ray\_private\serialization.py", line 241, in _deserialize_object
(PoolActor pid=7892)     return self._deserialize_msgpack_data(data, metadata_fields)
(PoolActor pid=7892)   File "c:\programdata\anaconda3\lib\site-packages\ray\_private\serialization.py", line 196, in _deserialize_msgpack_data
(PoolActor pid=7892)     python_objects = self._deserialize_pickle5_data(pickle5_data)
(PoolActor pid=7892)   File "c:\programdata\anaconda3\lib\site-packages\ray\_private\serialization.py", line 186, in _deserialize_pickle5_data
(PoolActor pid=7892)     obj = pickle.loads(in_band)
(PoolActor pid=7892) ModuleNotFoundError: No module named 'test_multiprocessing'
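
One plausible reading of the traceback above (an assumption, not something confirmed in this thread) is that the callback was pickled by reference to its defining module, and the worker process could not import `test_multiprocessing`. The sketch below reproduces that general mechanism with the standard `pickle` module only. The module name `fake_test_module` and the function names are hypothetical stand-ins, and Ray's own serialization path may differ in its details.

```python
import pickle
import sys
import types

MODULE_NAME = "fake_test_module"  # hypothetical stand-in for test_multiprocessing


def unpickle_without_module() -> str:
    """Pickle a module-level function, hide its module, then try to unpickle it."""
    # Build a throwaway module containing one function.
    module = types.ModuleType(MODULE_NAME)
    exec("def callback(result):\n    return result\n", module.__dict__)
    sys.modules[MODULE_NAME] = module

    # pickle stores only the module name and function name, not the code.
    payload = pickle.dumps(module.callback)

    # Simulate a process where the defining module is not importable.
    del sys.modules[MODULE_NAME]
    try:
        pickle.loads(payload)
    except ModuleNotFoundError as error:
        return str(error)
    return "unpickled successfully"


if __name__ == "__main__":
    print(unpickle_without_module())
```

If this is what is happening, the failure would depend on how the test module is made importable in the worker's environment rather than on the grpcio version, which would fit the error appearing with both 1.43.0 and 1.48.1.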

@h-vetinari

Thanks for working on this! Just to understand - besides the missing timeout in the tests, there were no other changes necessary to support grpc 1.48?

@cadedaniel
Member Author

Thanks for working on this! Just to understand - besides the missing timeout in the tests, there were no other changes necessary to support grpc 1.48?

Yep! I believe we prohibited this version because it was causing the hang; since 1.48.0 was yanked and 1.48.1 fixes the issue, we can simply allow it.

@cadedaniel
Member Author

cadedaniel commented Sep 23, 2022

I'm increasing the timeouts on all of the tests that I've seen become flaky on Windows because of this change.

test_reference_counting
test_multiprocessing
test_asyncio
test_queue
test_runtime_env_working_dir_3

Signed-off-by: Cade Daniel <[email protected]>
@cadedaniel
Member Author

I have a lint fix, but I'm waiting for the Windows builds to succeed before I push.

@jjyao
Collaborator

jjyao commented Sep 25, 2022

windows tests passed

@scv119 scv119 merged commit 58255c7 into ray-project:master Sep 25, 2022
@cadedaniel cadedaniel deleted the widen-grpcio-range branch September 25, 2022 18:37
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
* Widening range of grpcio versions allowed.

Signed-off-by: Cade Daniel <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Successfully merging this pull request may close these issues.

[Core] investigate why Ray hangs with grpcio==1.48.0
5 participants