[core] Make ray able to connect to redis without pip redis. #25875

fishbone · 2022-06-17T00:24:18Z

Why are these changes needed?

Right now, only the C++ layer in Ray connects to Redis, so we shouldn't need pip redis to connect to a Redis db.

The blocker is that we currently do some sharding in Redis. This feature is not actually used and the shard count is always 1, so to keep things simple this PR just disables it.

A test is added to make sure we can start Ray with a Redis db without pip redis.
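As a rough illustration of what "without pip redis" means for a test environment, here is a minimal sketch of a guard that checks the `redis` package is not importable. The helper names (`is_module_installed`, `assert_minimal_install`) are hypothetical and are not part of Ray or of this PR; the actual test in the PR runs `test_basic` under `RAY_MINIMAL=1`.

```python
import importlib.util


def is_module_installed(name: str) -> bool:
    """Return True if a top-level module `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


def assert_minimal_install(forbidden=("redis",)) -> None:
    """Hypothetical guard: fail if any forbidden package is importable."""
    present = [m for m in forbidden if is_module_installed(m)]
    if present:
        raise AssertionError(f"minimal install should not include: {present}")
```

A test could call `assert_minimal_install()` before starting Ray against an external Redis, so a regression that re-adds the dependency is caught early.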

Related issue number

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

fishbone · 2022-06-17T00:24:53Z

@wilsonwang371 Once we get this PR in, we don't need pip redis in KubeRay.

fishbone · 2022-06-17T00:26:00Z

I'll remove enable_sharding_conn_ in next PR.

mwtian · 2022-06-17T23:54:36Z

ci/ci.sh

@@ -751,14 +751,24 @@ test_minimal() {
  # shellcheck disable=SC2086
  bazel test --test_output=streamed --config=ci --test_env=RAY_MINIMAL=1 ${BAZEL_EXPORT_OPTIONS} python/ray/tests/test_basic
  # shellcheck disable=SC2086
+  bazel test --test_output=streamed --config=ci --test_env=RAY_MINIMAL=1 --test_env=TEST_EXTERNAL_REDIS=1 ${BAZEL_EXPORT_OPTIONS} python/ray/tests/test_basic


Is external Redis worth a separate test function and build step?

I think it's probably OK, because Redis is also something we should support in the minimal install. We only run the basic test, which doesn't take much time.

mwtian · 2022-06-17T23:56:29Z

python/ray/_private/services.py

-    assert len(cur_config_list) == 12
-    cur_config_list[8:] = ["pubsub", "134217728", "134217728", "60"]
-    redis_client.config_set("client-output-buffer-limit", " ".join(cur_config_list))
-    # Put a time stamp in Redis to indicate when it was started.


Why not keep setting "redis_start_time"?

Redis is provided by the user, not Ray, so it doesn't make sense for Ray to write anything like this into the db.
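For context on what the removed lines did, here is a small sketch of the buffer-limit rewrite as a pure string transformation. It assumes `cur_config_list` came from splitting the `client-output-buffer-limit` value returned by Redis, which is not shown in the hunk; the function name and the example input are illustrative only.

```python
def raise_pubsub_buffer_limit(current: str,
                              hard: int = 134217728,
                              soft: int = 134217728,
                              seconds: int = 60) -> str:
    """Rewrite only the pubsub entry of a client-output-buffer-limit value."""
    tokens = current.split()
    # Three client classes, four tokens each: <class> <hard> <soft> <seconds>.
    assert len(tokens) == 12
    tokens[8:] = ["pubsub", str(hard), str(soft), str(seconds)]
    return " ".join(tokens)


# Example input only; the real value depends on the Redis server config.
example = "normal 0 0 0 slave 268435456 67108864 60 pubsub 33554432 8388608 60"
print(raise_pubsub_buffer_limit(example))
# normal 0 0 0 slave 268435456 67108864 60 pubsub 134217728 134217728 60
```

With a user-provided Redis, this kind of server-side tuning becomes the operator's responsibility, which matches the reasoning above for dropping the write.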

mwtian · 2022-06-18T00:00:15Z

python/ray/node.py

@@ -1086,9 +1044,7 @@ def start_head_processes(self):
            # Redis, when external Redis address is specified.
            # TODO(mwtian): after GCS bootstrapping is default and stable,


This TODO can be removed now.

mwtian · 2022-06-18T00:04:12Z

python/ray/node.py

@@ -96,10 +96,6 @@ def __init__(
            if len(external_redis) == 1:
                external_redis.append(external_redis[0])
            [primary_redis_ip, port] = external_redis[0].split(":")
-            ray._private.services.wait_for_redis_to_start(


Why do we remove wait_for_redis_to_start() here and in other places?

The user should provide a Redis cluster and make sure it's up, so this logic doesn't fit here.

We have retries in GCS and raylet, so it should be OK.

What are the remaining usages of wait_for_redis_to_start(), and can they be moved to a test-only util or removed as well?

Feel like it is still good to have some health checking before it starts. Does Redis have any API for that?

+1 to @rkooo567. I didn't find the right way to phrase it earlier, but I think giving a good error message if Redis is unavailable is important. Having a dedicated "wait and check Redis connection" logic seems the easiest way to achieve that. Maybe we don't need all the "wait for Redis connection" calls though.

Since all the logic related to the storage has moved to GCS, we probably should just pass that information from GCS to the driver.

I don't see any benefit of detecting it here in the python script especially given that it'll need the user to pip install redis to use this.
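On the question of whether Redis has an API for health checking: Redis supports a `PING` command, which a healthy server answers with `+PONG`. Below is a minimal sketch of such a check over a raw socket, so it does not need pip redis. This is not what the PR does (the thread settles on handling this in GCS); the function names are hypothetical, and the sketch ignores authentication and TLS, where the server would reply with an error instead of `+PONG`.

```python
import socket


def encode_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    parts = [b"*" + str(len(args)).encode() + b"\r\n"]
    for arg in args:
        data = arg.encode()
        parts.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(parts)


def is_pong(reply: bytes) -> bool:
    """Return True if the reply is the simple string PONG."""
    return reply.startswith(b"+PONG")


def redis_is_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Send PING and report whether the server answered with PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(encode_command("PING"))
            return is_pong(sock.recv(64))
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False
```

A caller could use `redis_is_reachable("127.0.0.1", 6379)` to produce an early, readable error, at the cost of duplicating connection logic that already lives in the C++ layer.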

fishbone · 2022-06-21T01:12:16Z

Any other comments? @rkooo567 @scv119 @mwtian

mwtian · 2022-06-21T02:20:12Z

python/ray/node.py

@@ -96,10 +96,6 @@ def __init__(
            if len(external_redis) == 1:
                external_redis.append(external_redis[0])
            [primary_redis_ip, port] = external_redis[0].split(":")
-            ray._private.services.wait_for_redis_to_start(


What are the remaining usages of wait_for_redis_to_start(), and can they be moved to a test-only util or removed as well?

rkooo567 · 2022-06-21T15:07:10Z

python/ray/node.py

@@ -96,10 +96,6 @@ def __init__(
            if len(external_redis) == 1:
                external_redis.append(external_redis[0])
            [primary_redis_ip, port] = external_redis[0].split(":")
-            ray._private.services.wait_for_redis_to_start(


Feel like it is still good to have some health checking before it starts. Does Redis have any API for that?

rkooo567 · 2022-06-21T15:07:47Z

python/ray/_private/services.py

-    redis_client.config_set("client-output-buffer-limit", " ".join(cur_config_list))
-    # Put a time stamp in Redis to indicate when it was started.
-    redis_client.set("redis_start_time", time.time())
+    # Construct the command to start the Redis server.


Do we need this method at all? Why don't we move it to test utils?

Yes, we should. Let me update the PR.

fishbone · 2022-06-21T18:22:37Z

@rkooo567 @mwtian I agree with what you said. At a high level, I think what we should do is surface useful messages when something is not working.

But this kind of message shouldn't live in the Python layer. Ray doesn't depend on redis, so this would require the user to pip install redis. User images are sometimes rebuilt, which means they would have to update their environment just to use this Redis storage. That is very inconvenient given that this storage is supported by Ray directly right now (not through a plugin mode).
I'm also not comfortable putting it into the core worker, given that the core worker (cpp) doesn't depend on Redis either.

@rkooo567 Right now, if Redis fails to start, GCS will crash and print an error like "redis failed to start". For a case like this (component crash), do we have a way to expose this message to the user via the observability features you built?

fishbone · 2022-06-22T05:33:58Z

Discussed it with @rkooo567.
I'll do the following:

- Add a log in the driver about failing to connect to GCS, and ask the user to check the GCS log for details.
- Print a detailed message when GCS fails to reach Redis.

Besides this, I'll also check whether it's possible to push a message from GCS to the core worker before GCS crashes (stretch goal) for better observability.
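The driver-side part of this plan could look roughly like the sketch below: retry the connection a few times, then fail with a message that points the user at the GCS log. Everything here is an assumption for illustration. `connect_with_hint`, its parameters, and the wording of the hint are hypothetical and do not reflect the actual Ray driver code.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

DEFAULT_HINT = (
    "Check the GCS log for details, for example whether GCS could reach Redis."
)


def connect_with_hint(connect: Callable[[], T],
                      attempts: int = 5,
                      delay_s: float = 1.0,
                      hint: str = DEFAULT_HINT) -> T:
    """Call `connect` up to `attempts` times, then raise with a log hint."""
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_s)
    raise RuntimeError(
        f"Failed to connect to GCS after {attempts} attempts. {hint}"
    ) from last_error
```

Keeping the hint in the driver and the detailed Redis error in GCS means the Python layer never has to talk to Redis itself.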

Signed-off-by: Yi Cheng <[email protected]>

fishbone assigned scv119 and mwtian Jun 17, 2022

fishbone assigned ericl and rkooo567 Jun 17, 2022

ericl removed their assignment Jun 17, 2022

mwtian reviewed Jun 18, 2022

mwtian approved these changes Jun 21, 2022

rkooo567 reviewed Jun 21, 2022

fishbone added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 21, 2022

fishbone added the do-not-merge Do not merge this PR! label Jun 22, 2022

fix

39f1eb3

Signed-off-by: Yi Cheng <[email protected]>

fishbone force-pushed the cleanup-redis branch from 5bcc85d to 39f1eb3 Compare July 24, 2022 04:53

fishbone removed the do-not-merge Do not merge this PR! label Jul 24, 2022

fishbone merged commit 0c16619 into ray-project:master Jul 24, 2022

scv119 mentioned this pull request Jul 25, 2022

[Core][Nightly] datasets_ingest_train_infer failing #26966

Closed

scv119 mentioned this pull request Jul 26, 2022

[AIR] Significant data reading regression in Ray cluster from xgboost 100GB test #26995

Closed
