
[Bug][RLlib] Gym environment registration does not work when using Ray Client and ray.init #21734

Closed
1 of 2 tasks
jbedorf opened this issue Jan 20, 2022 · 17 comments · Fixed by #24058
Labels
bug (Something that is supposed to be working; but isn't), P0 (Issues that should be fixed in short order), rllib (RLlib related issues), tune (Tune-related issues)

Comments

@jbedorf
Contributor

jbedorf commented Jan 20, 2022

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

RLlib

What happened + What you expected to happen

When using RLlib with Ray Client, you receive the error below when connecting via: ray.init(f"ray://127.0.0.1:10001")
whereas everything works when connecting via: export RAY_ADDRESS="ray://127.0.0.1:10001"

In particular, the error only occurs when using the default Gym-registered environment strings. With a custom environment registration, the code runs as expected.

So:

  • gym-string + ray.init -> error
  • gym-string + RAY_ADDRESS -> works
  • self-registration + ray.init -> works
  • self-registration + RAY_ADDRESS -> works
2022-01-20 03:24:32,339 INFO trainer.py:2054 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
Traceback (most recent call last):
  File "rllib4.py", line 28, in <module>
    trainer = PPOTrainer(config=config)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 728, in __init__
    super().__init__(config, logger_creator, remote_checkpoint_dir,
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/trainable.py", line 122, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 754, in setup
    self.env_creator = _global_registry.get(ENV_CREATOR, env)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/registry.py", line 168, in get
    return pickle.loads(value)
EOFError: Ran out of input
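
For context on the final frame: pickle.loads raises exactly this EOFError when it is handed empty bytes, which appears to be what the registry receives over Ray Client (see the minimal reproduction further down in this thread). A standalone sketch:

import pickle

try:
    pickle.loads(b"")  # empty payload, like the value the registry gets back over Ray Client
except EOFError as err:
    print(err)  # prints: Ran out of input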

Versions / Dependencies

Ray 1.10.0-py38 Docker image with TensorFlow installed.

>>> ray.__commit__
'1583379dce891e96e9721bb958e80d485753aed7'
>>> ray.__version__
'1.10.0'

Reproduction script

# Import the RL algorithm (Trainer) we would like to use.
import ray

ray.init(f"ray://127.0.0.1:10001")  # Comment out to make this work.

from ray.rllib.agents.ppo import PPOTrainer
from ray.tune.registry import register_env
from gym.envs.classic_control.cartpole import CartPoleEnv

def env_creator(config):
    return CartPoleEnv()

register_env("my_env", env_creator)


# Configure the algorithm.
config = {
    # Environment (RLlib understands openAI gym registered strings).
    "env" : "CartPole-v1",  # <-- Fails
    #"env" : "my_env",  # <-- Works
    "num_workers": 2,
    "framework": "tf"
}

trainer = PPOTrainer(config=config)
for _ in range(3):
    print(trainer.train())


Anything else

Happens always.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@jbedorf jbedorf added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 20, 2022
@jbedorf jbedorf changed the title [Bug] Gym environment registration does not work when using Ray Client and ray.init [Bug][RLlib] Gym environment registration does not work when using Ray Client and ray.init Jan 20, 2022
@xwjiang2010
Contributor

@ericl
Hey Eric, I have a fix to correct this specific behavior, but I want to check with you: what is the expected behavior of the GCS client when a key does not exist? Should it return None (not empty bytes)?

@xwjiang2010 xwjiang2010 added rllib RLlib related issues tune Tune-related issues and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 24, 2022
@xwjiang2010
Contributor

@mwtian See above. Can you help clarify the behavior of the GCS client, or point me to someone who can?

@mwtian
Member

mwtian commented Jan 27, 2022

If this is about the GCS KV client (for get/put etc.), @iycheng will be the most knowledgeable. Thanks for making the fix, and feel free to assign both of us to the PR!

@mwtian
Member

mwtian commented Jan 27, 2022

For ray.experimental.internal_kv._internal_kv_get() on a non-existent key, returning None seems right.
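
A minimal sketch of the contract being discussed (the None-on-miss behavior is the expectation voiced here, not something confirmed for every code path):

from ray.experimental.internal_kv import _internal_kv_get

# Assumes ray.init(...) has already been called so the internal KV is initialized.
value = _internal_kv_get("some-key-that-was-never-put")
# Expected: value is None, so callers can detect a missing key with `if value is None`.
# Observed over Ray Client (see the reproduction below): value == b'', which slips past that check.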

@avnishn
Member

avnishn commented Feb 7, 2022

I'm unable to reproduce this bug. @xwjiang2010, did you produce a fix for this, and can this issue be closed?

@xwjiang2010
Contributor

@mwtian Thanks for the response.
In that case, I will close my PR and reassign it to you :)

Minimal reproduction:

In [1]: import ray

In [2]: ray.init(f"ray://127.0.0.1:10001")  # Comment out to make this work.
Out[2]: ClientContext(dashboard_url=None, python_version='3.7.11', ray_version='2.0.0.dev0', ray_commit='{{RAY_COMMIT_SHA}}', protocol_version='2021-12-07', _num_clients=1, _context_to_restore=<ray.util.client._ClientContext object at 0x7f8be02ed610>)

In [3]: from ray.experimental.internal_kv import _internal_kv_initialized, \
   ...:     _internal_kv_get, _internal_kv_put

In [4]: _internal_kv_initialized()
Out[4]: True

In [5]: value = _internal_kv_get("bla")

In [6]: value
Out[6]: b''

In [7]:

@mwtian
Member

mwtian commented Feb 15, 2022

@xwjiang2010 , just to make sure, Out[6]: b'' is unexpected, and it should be None instead?

@iycheng, do you want to take a look?

@xwjiang2010
Contributor

@mwtian that's my assumption about the GCS client protocol. Maybe @iycheng can clarify?

@mwtian mwtian assigned fishbone and unassigned mwtian Mar 2, 2022
@jovany-wang
Contributor

@mwtian @iycheng Do you have any update on this? It seems we have hit the same issue in our application.

@jovany-wang
Contributor

This is a P0 issue from our side. @ericl CC

@jovany-wang jovany-wang added the P0 Issues that should be fixed in short order label Apr 19, 2022
@mwtian
Member

mwtian commented Apr 19, 2022

@jovany-wang just to confirm, you are receiving empty bytes when calling _internal_kv_get() on a non-existent key via Ray client, but None is returned when not using Ray client, right?

@jovany-wang
Contributor

@mwtian I believe it's exactly the same issue, judging from my stack trace:

---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
/tmp/ipykernel_4689/1080049057.py in <module>
     47 
     48 ray.client('100.88.148.29:38159').connect()
---> 49 main()

/tmp/ipykernel_4689/1080049057.py in main()
     33 
     34     # Create our RLlib Trainer.
---> 35     trainer = PPOTrainer(config=config)
     36 
     37     # Run it for n training iterations. A training iteration includes

~/.local/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py in __init__(self, config, env, logger_creator)
    121 
    122         def __init__(self, config=None, env=None, logger_creator=None):
--> 123             Trainer.__init__(self, config, env, logger_creator)
    124 
    125         def _init(self, config: TrainerConfigDict,

~/.local/lib/python3.7/site-packages/ray/rllib/agents/trainer.py in __init__(self, config, env, logger_creator)
    546             logger_creator = default_logger_creator
    547 
--> 548         super().__init__(config, logger_creator)
    549 
    550     @classmethod

~/.local/lib/python3.7/site-packages/ray/tune/trainable.py in __init__(self, config, logger_creator)
     96 
     97         start_time = time.time()
---> 98         self.setup(copy.deepcopy(self.config))
     99         setup_time = time.time() - start_time
    100         if setup_time > SETUP_TIME_THRESHOLD:

~/.local/lib/python3.7/site-packages/ray/rllib/agents/trainer.py in setup(self, config)
    640             # An already registered env.
    641             if _global_registry.contains(ENV_CREATOR, env):
--> 642                 self.env_creator = _global_registry.get(ENV_CREATOR, env)
    643             # A class specifier.
    644             elif "." in env:

~/.local/lib/python3.7/site-packages/ray/tune/registry.py in get(self, category, key)
    138                     "Registry value for {}/{} doesn't exist.".format(
    139                         category, key))
--> 140             return pickle.loads(value)
    141         else:
    142             return pickle.loads(self._to_flush[(category, key)])

EOFError: Ran out of input

@jovany-wang
Contributor

@mwtian FYI, we are using 1.4 or 1.2; I believe _internal_kv_get is not used there.

@jovany-wang
Contributor

jovany-wang commented Apr 19, 2022

@mwtian FYI, we are using 1.4 or 1.2; I believe _internal_kv_get is not used there.

Sorry, it still uses _internal_kv_get:

    def get(self, category, key):
        if _internal_kv_initialized():
            value = _internal_kv_get(_make_key(category, key))
            if value is None:
                raise ValueError(
                    "Registry value for {}/{} doesn't exist.".format(
                        category, key))
            return pickle.loads(value)
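
A purely illustrative guard (hypothetical; not necessarily what #24058 does) that would treat the empty-bytes response from the Ray Client path the same as a missing key:

    def get(self, category, key):
        if _internal_kv_initialized():
            value = _internal_kv_get(_make_key(category, key))
            if not value:  # hypothetical change: catches both None and b''
                raise ValueError(
                    "Registry value for {}/{} doesn't exist.".format(
                        category, key))
            return pickle.loads(value)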

@mwtian
Member

mwtian commented Apr 19, 2022

Will try to take a look tomorrow. Btw, the fix is very unlikely to be backported.

@jovany-wang
Contributor

@mwtian Do we have any update?

@mwtian
Member

mwtian commented Apr 20, 2022

Let's see if #24058 can fix the issue.
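
Until that lands, the matrix in the original report suggests a workaround when connecting through ray.init("ray://..."): register the environment under a custom name instead of relying on the default Gym string. A minimal sketch reusing the names from the reproduction script above:

import ray
from ray.rllib.agents.ppo import PPOTrainer
from ray.tune.registry import register_env
from gym.envs.classic_control.cartpole import CartPoleEnv

ray.init("ray://127.0.0.1:10001")

# Per the matrix above, a custom registration works even over Ray Client.
register_env("my_env", lambda env_config: CartPoleEnv())

trainer = PPOTrainer(config={"env": "my_env", "num_workers": 2, "framework": "tf"})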
