
[Bug][RLlib] Gym environment registration does not work when using Ray Client and ray.init #21734

Closed
1 of 2 tasks
jbedorf opened this issue Jan 20, 2022 · 17 comments · Fixed by #24058
Labels
bug (Something that is supposed to be working; but isn't), P0 (Issues that should be fixed in short order), rllib (RLlib related issues), tune (Tune-related issues)

Comments

@jbedorf
Contributor

jbedorf commented Jan 20, 2022

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

RLlib

What happened + What you expected to happen

When using RLlib with Ray Client, you receive the error below when connecting via: ray.init(f"ray://127.0.0.1:10001")
whereas everything works when connecting via: export RAY_ADDRESS="ray://127.0.0.1:10001"

In particular, the error only occurs when using the default Gym-registered environment strings. With a custom environment registration, the code runs as expected.

So:

  • gym-string + ray.init -> error
  • gym-string + RAY_ADDRESS -> works
  • self-registration + ray.init -> works
  • self-registration + RAY_ADDRESS -> works
2022-01-20 03:24:32,339 INFO trainer.py:2054 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
Traceback (most recent call last):
  File "rllib4.py", line 28, in <module>
    trainer = PPOTrainer(config=config)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 728, in __init__
    super().__init__(config, logger_creator, remote_checkpoint_dir,
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/trainable.py", line 122, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 754, in setup
    self.env_creator = _global_registry.get(ENV_CREATOR, env)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/registry.py", line 168, in get
    return pickle.loads(value)
EOFError: Ran out of input
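
For context on the final frame: pickle.loads raises exactly this EOFError when it is handed empty bytes, which appears to be what the registry receives over Ray Client (see the minimal reproduction further down in this thread). A standalone sketch:

import pickle

try:
    pickle.loads(b"")  # empty payload, like the value the registry gets back over Ray Client
except EOFError as err:
    print(err)  # prints: Ran out of input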

Versions / Dependencies

Ray 1.10.0-py38 Docker image with TensorFlow installed.

>>> ray.__commit__
'1583379dce891e96e9721bb958e80d485753aed7'
>>> ray.__version__
'1.10.0'

Reproduction script

# Import the RL algorithm (Trainer) we would like to use.
import ray

ray.init(f"ray://127.0.0.1:10001")  # Comment out to make this work.

from ray.rllib.agents.ppo import PPOTrainer
from ray.tune.registry import register_env
from gym.envs.classic_control.cartpole import CartPoleEnv

def env_creator(config):
    return CartPoleEnv()

register_env("my_env", env_creator)


# Configure the algorithm.
config = {
    # Environment (RLlib understands openAI gym registered strings).
    "env" : "CartPole-v1",  # <-- Fails
    #"env" : "my_env",  # <-- Works
    "num_workers": 2,
    "framework": "tf"
}

trainer = PPOTrainer(config=config)
for _ in range(3):
    print(trainer.train())


Anything else

Happens always.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@jbedorf jbedorf added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 20, 2022
@jbedorf jbedorf changed the title [Bug] Gym environment registration does not work when using Ray Client and ray.init [Bug][RLlib] Gym environment registration does not work when using Ray Client and ray.init Jan 20, 2022
@xwjiang2010
Contributor

@ericl
Hey Eric, I have a fix to correct this specific behavior, but I want to check with you: what is the expected behavior of the GCS client when a key does not exist? Should it return None (not empty bytes)?

@xwjiang2010 xwjiang2010 added rllib RLlib related issues tune Tune-related issues and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 24, 2022
@xwjiang2010
Contributor

@mwtian See above. Can you help clarify the behavior of the GCS client, or point me to someone who can?

@mwtian
Member

mwtian commented Jan 27, 2022

If this is about the GCS KV client (for get/put etc.), @iycheng will be the most knowledgeable. Thanks for making the fix, and feel free to assign both of us to the PR!

@mwtian
Member

mwtian commented Jan 27, 2022

For ray.experimental.internal_kv._internal_kv_get() on a non-existent key, returning None seems right.
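
A minimal sketch of the contract being discussed (the None-on-miss behavior is the expectation voiced here, not something confirmed for every code path):

from ray.experimental.internal_kv import _internal_kv_get

# Assumes ray.init(...) has already been called so the internal KV is initialized.
value = _internal_kv_get("some-key-that-was-never-put")
# Expected: value is None, so callers can detect a missing key with `if value is None`.
# Observed over Ray Client (see the reproduction below): value == b'', which slips past that check.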

@avnishn
Member

avnishn commented Feb 7, 2022

I'm unable to reproduce this bug. @xwjiang2010, did you produce a fix for this, and can this issue be closed?

@xwjiang2010
Contributor

@mwtian Thanks for the response.
In that case, I will close my PR and reassign it to you :)

Minimal reproduction:

In [1]: import ray

In [2]: ray.init(f"ray://127.0.0.1:10001")  # Comment out to make this work.
Out[2]: ClientContext(dashboard_url=None, python_version='3.7.11', ray_version='2.0.0.dev0', ray_commit='{{RAY_COMMIT_SHA}}', protocol_version='2021-12-07', _num_clients=1, _context_to_restore=<ray.util.client._ClientContext object at 0x7f8be02ed610>)

In [3]: from ray.experimental.internal_kv import _internal_kv_initialized, \
   ...:     _internal_kv_get, _internal_kv_put

In [4]: _internal_kv_initialized()
Out[4]: True

In [5]: value = _internal_kv_get("bla")

In [6]: value
Out[6]: b''

In [7]:

@mwtian
Member

mwtian commented Feb 15, 2022

@xwjiang2010 , just to make sure, Out[6]: b'' is unexpected, and it should be None instead?

@iycheng, do you want to take a look?

@xwjiang2010
Contributor

@mwtian that's my assumption about the GCS client protocol. Maybe @iycheng can clarify?

@mwtian mwtian assigned fishbone and unassigned mwtian Mar 2, 2022
@jovany-wang
Contributor

@mwtian @iycheng Do you have any update on this? It seems we have hit the same issue in our application.

@jovany-wang
Contributor

This is a P0 issue from our side. @ericl CC

@jovany-wang jovany-wang added the P0 Issues that should be fixed in short order label Apr 19, 2022
@mwtian
Member

mwtian commented Apr 19, 2022

@jovany-wang just to confirm, you are receiving empty bytes when calling _internal_kv_get() on a non-existent key via Ray client, but None is returned when not using Ray client, right?

@jovany-wang
Contributor

@mwtian I believe it's exactly the same issue, judging from my stack trace:

---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
/tmp/ipykernel_4689/1080049057.py in <module>
     47 
     48 ray.client('100.88.148.29:38159').connect()
---> 49 main()

/tmp/ipykernel_4689/1080049057.py in main()
     33 
     34     # Create our RLlib Trainer.
---> 35     trainer = PPOTrainer(config=config)
     36 
     37     # Run it for n training iterations. A training iteration includes

~/.local/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py in __init__(self, config, env, logger_creator)
    121 
    122         def __init__(self, config=None, env=None, logger_creator=None):
--> 123             Trainer.__init__(self, config, env, logger_creator)
    124 
    125         def _init(self, config: TrainerConfigDict,

~/.local/lib/python3.7/site-packages/ray/rllib/agents/trainer.py in __init__(self, config, env, logger_creator)
    546             logger_creator = default_logger_creator
    547 
--> 548         super().__init__(config, logger_creator)
    549 
    550     @classmethod

~/.local/lib/python3.7/site-packages/ray/tune/trainable.py in __init__(self, config, logger_creator)
     96 
     97         start_time = time.time()
---> 98         self.setup(copy.deepcopy(self.config))
     99         setup_time = time.time() - start_time
    100         if setup_time > SETUP_TIME_THRESHOLD:

~/.local/lib/python3.7/site-packages/ray/rllib/agents/trainer.py in setup(self, config)
    640             # An already registered env.
    641             if _global_registry.contains(ENV_CREATOR, env):
--> 642                 self.env_creator = _global_registry.get(ENV_CREATOR, env)
    643             # A class specifier.
    644             elif "." in env:

~/.local/lib/python3.7/site-packages/ray/tune/registry.py in get(self, category, key)
    138                     "Registry value for {}/{} doesn't exist.".format(
    139                         category, key))
--> 140             return pickle.loads(value)
    141         else:
    142             return pickle.loads(self._to_flush[(category, key)])

EOFError: Ran out of input

@jovany-wang
Contributor

@mwtian FYI, we are using 1.4 or 1.2; I believe _internal_kv_get is not used there.

@jovany-wang
Contributor

jovany-wang commented Apr 19, 2022

@mwtian FYI, we are using 1.4 or 1.2; I believe _internal_kv_get is not used there.

Sorry, it still uses _internal_kv_get:

    def get(self, category, key):
        if _internal_kv_initialized():
            value = _internal_kv_get(_make_key(category, key))
            if value is None:
                raise ValueError(
                    "Registry value for {}/{} doesn't exist.".format(
                        category, key))
            return pickle.loads(value)
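
A purely illustrative guard (hypothetical; not necessarily what #24058 does) that would treat the empty-bytes response from the Ray Client path the same as a missing key:

    def get(self, category, key):
        if _internal_kv_initialized():
            value = _internal_kv_get(_make_key(category, key))
            if not value:  # hypothetical change: catches both None and b''
                raise ValueError(
                    "Registry value for {}/{} doesn't exist.".format(
                        category, key))
            return pickle.loads(value)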

@mwtian
Member

mwtian commented Apr 19, 2022

Will try to take a look tomorrow. Btw, the fix is very unlikely to be backported.

@jovany-wang
Contributor

@mwtian Do we have any update?

@mwtian
Member

mwtian commented Apr 20, 2022

Let's see if #24058 can fix the issue.
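
Until that lands, the matrix in the original report suggests a workaround when connecting through ray.init("ray://..."): register the environment under a custom name instead of relying on the default Gym string. A minimal sketch reusing the names from the reproduction script above:

import ray
from ray.rllib.agents.ppo import PPOTrainer
from ray.tune.registry import register_env
from gym.envs.classic_control.cartpole import CartPoleEnv

ray.init("ray://127.0.0.1:10001")

# Per the matrix above, a custom registration works even over Ray Client.
register_env("my_env", lambda env_config: CartPoleEnv())

trainer = PPOTrainer(config={"env": "my_env", "num_workers": 2, "framework": "tf"})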
