[rllib][gcs][placementgroups] instability issues running tune/rllib #18003

Closed
2 tasks
AmeerHajAli opened this issue Aug 22, 2021 · 7 comments
Assignees
Labels
bug Something that is supposed to be working; but isn't rllib RLlib related issues triage Needs triage (eg: priority, bug/not-bug, and owning component)

Comments

@AmeerHajAli
Contributor

AmeerHajAli commented Aug 22, 2021

When I run RLlib on Ray 1.5.2:

  1. The resource demands persist after the application finishes. For example, I still see the following resource demands from the scheduler for a few minutes, even after the job prints (pid=191) 2021-08-22 10:45:21,492 INFO tune.py:550 -- Total run time: 1095.71 seconds (1094.69 seconds for the tuning loop). :
Demands:
 {'CPU_group_8eb7d5e8a4ed413432db93d0b79b3e67': 1.0}: 96+ pending tasks/actors
 {'GPU_group_16cd93bbf7607454e10fb4e3334f5da6': 0.001, 'GPU_group_0_16cd93bbf7607454e10fb4e3334f5da6': 0.001}: 1+ pending tasks/actors
 {'GPU_group_1431a0326b37900afe3595513b2e1818': 0.001, 'GPU_group_0_1431a0326b37900afe3595513b2e1818': 0.001}: 1+ pending tasks/actors
 {'CPU': 1.0, 'GPU': 1.0} * 1, {'CPU': 1.0} * 128 (PACK): 1+ pending placement groups
  2. RLlib prints very verbose resource information:
(pid=191) == Status ==
(pid=191) Memory usage on this node: 6.1/31.4 GiB
(pid=191) Using FIFO scheduling algorithm.
(pid=191) Resources requested: 0/296 CPUs, 0/8 GPUs, 0.0/787.44 GiB heap, 0.0/338.81 GiB objects (0.0/1.0 CPU_group_15_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_2_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_0_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_4_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_6_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 GPU_group_0_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 GPU_group_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_12_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_13_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_10_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_1_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_7_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_9_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_3_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_11_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_8_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_14_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_5_8c84f56bef40324a35f6e63418c2a54d, 0.0/129.0 CPU_group_8c84f56bef40324a35f6e63418c2a54d, 0.0/8.0 accelerator_type:T4, 0.0/1.0 CPU_group_116_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_127_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_117_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_119_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_121_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_113_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_124_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_123_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_118_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_115_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_126_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_120_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_114_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_125_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_122_8c84f56bef40324a35f6e63418c2a54d, 
0.0/1.0 CPU_group_112_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_128_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_83_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_85_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_94_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_87_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_90_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_84_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_88_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_82_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_89_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_91_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_92_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_86_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_80_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_93_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_81_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_95_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_100_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_97_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_103_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_108_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_98_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_104_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_111_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_102_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_96_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_99_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_110_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_101_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_106_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_109_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_105_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_107_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_43_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_33_8c84f56bef40324a35f6e63418c2a54d, 
0.0/1.0 CPU_group_36_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_32_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_34_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_35_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_37_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_40_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_39_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_42_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_45_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_44_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_41_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_46_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_47_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_38_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_21_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_18_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_28_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_16_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_19_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_25_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_20_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_27_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_17_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_24_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_22_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_26_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_23_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_30_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_31_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_29_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_71_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_72_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_76_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_68_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_79_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_78_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 
CPU_group_70_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_69_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_67_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_65_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_64_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_75_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_66_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_74_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_73_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_77_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_52_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_48_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_63_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_56_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_54_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_62_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_55_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_59_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_51_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_53_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_57_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_58_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_50_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_49_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_61_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_60_8c84f56bef40324a35f6e63418c2a54d)
  3. RLlib sometimes requests a lot of resources, and if the cluster cannot scale up to accommodate the request, it ends up adding nodes, removing them for being idle, and hanging forever (e.g., it requests resources that would need 200 nodes, but the cluster can scale only to 10 nodes, so it keeps adding 10 nodes and removing them while the trials say “pending”).

  4. I think we should have e2e tests of RLlib with GPUs. These might already exist, but for some reason I am not able to run, for example (the cluster keeps adding and removing nodes as in issue 3): ANYSCALE_DEBUG=1 RAY_ADDRESS=anyscale://timeout_fix_cluster_final2_aws?cluster_env=riot:5 rllib train -f ../ray/rllib/tuned_examples/compact-regression-test.yaml or ANYSCALE_DEBUG=1 RAY_ADDRESS=anyscale://timeout_fix_cluster_final2_aws?cluster_env=riot:5 rllib train -f ../ray/rllib/tuned_examples/impala/atari-impala-large.yaml

  5. When I run ANYSCALE_DEBUG=1 RAY_ADDRESS=anyscale://timeout_fix_cluster_final2_aws?cluster_env=riot:5 rllib train -f ../ray/rllib/tuned_examples/compact-regression-test.yaml I get a lot of:

A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Task ID: ffffffffffffffffa70b3f9b10676c460808312e01000000 Worker ID: 1d806191d3304d0dbcc5fabedf3eefd9e6f12694227b34ae602c0203 Node ID: 3d02b42b39be8dbcd291b2611f9c36841f00f38e98c599c55ecfe827 Worker IP address: 192.168.75.4 Worker port: 10059 Worker PID: 446844
(pid=237) 2021-08-22 13:08:42,288	ERROR trial_runner.py:773 -- Trial APEX_BreakoutNoFrameskip-v4_95b82_00015: Error processing event.
(pid=237) Traceback (most recent call last):
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 739, in _process_trial
(pid=237)     results = self.trial_executor.fetch_result(trial)
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 729, in fetch_result
(pid=237)     result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
(pid=237)     return func(*args, **kwargs)
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1564, in get
(pid=237)     raise value.as_instanceof_cause()
(pid=237) ray.exceptions.RayTaskError: ray::APEX.train_buffered() (pid=220341, ip=192.168.75.4)
(pid=237)   File "python/ray/_raylet.pyx", line 534, in ray._raylet.execute_task
(pid=237)   File "python/ray/_raylet.pyx", line 484, in ray._raylet.execute_task.function_executor
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/function_manager.py", line 563, in actor_method_executor
(pid=237)     return method(__ray_actor, *args, **kwargs)
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 178, in train_buffered
(pid=237)     result = self.train()
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 640, in train
(pid=237)     raise e
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 629, in train
(pid=237)     result = Trainable.train(self)
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 237, in train
(pid=237)     result = self.step()
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 170, in step
(pid=237)     res = next(self.train_exec_impl)
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 756, in __next__
(pid=237)     return next(self.built_iterator)
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
(pid=237)     for item in it:
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
(pid=237)     for item in it:
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
(pid=237)     for item in it:
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
(pid=237)     for item in it:
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
(pid=237)     for item in it:
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
(pid=237)     for item in it:
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 1075, in build_union
(pid=237)     item = next(it)
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 756, in __next__
(pid=237)     return next(self.built_iterator)
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
(pid=237)     for item in it:
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
(pid=237)     for item in it:
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
(pid=237)     for item in it:
(pid=237)   [Previous line repeated 1 more time]
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
(pid=237)     for item in it:
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 551, in base_iterator
(pid=237)     batch = ray.get(obj_ref)
(pid=237)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
(pid=237)     return func(*args, **kwargs)
(pid=237) ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

CC @wuisawesome
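
To make the scale-up loop described above concrete (a request that would need 200 nodes on a cluster that can only reach 10), here is a minimal sketch of an up-front feasibility check. It is not Ray's scheduler or autoscaler logic: the helper names and the cluster limit are hypothetical, and it only compares aggregate totals, so it ignores how bundles pack onto individual nodes.

```python
from typing import Dict, List

Bundle = Dict[str, float]


def total_demand(bundles: List[Bundle]) -> Bundle:
    """Sum the resources requested across all bundles of one placement group."""
    total: Bundle = {}
    for bundle in bundles:
        for resource, amount in bundle.items():
            total[resource] = total.get(resource, 0.0) + amount
    return total


def unfulfillable_resources(bundles: List[Bundle], max_capacity: Bundle) -> Bundle:
    """Return the shortfall per resource; an empty dict means the totals could fit."""
    shortfall: Bundle = {}
    for resource, needed in total_demand(bundles).items():
        available = max_capacity.get(resource, 0.0)
        if needed > available:
            shortfall[resource] = needed - available
    return shortfall


# The pending placement group from the demand summary above:
# one {'CPU': 1, 'GPU': 1} bundle plus 128 {'CPU': 1} bundles.
bundles = [{"CPU": 1.0, "GPU": 1.0}] + [{"CPU": 1.0}] * 128

# Hypothetical cluster limit (an assumption, not taken from the report):
# 10 nodes with 8 CPUs and 1 GPU each.
max_capacity = {"CPU": 80.0, "GPU": 10.0}

shortfall = unfulfillable_resources(bundles, max_capacity)
if shortfall:
    print(f"Request can never be satisfied; short by {shortfall}")
```

A check of this kind would let the job fail fast with an error instead of leaving trials pending while nodes are added and removed.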

What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS):

Reproduction (REQUIRED)

Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):

If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
@AmeerHajAli AmeerHajAli added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Aug 22, 2021
@AmeerHajAli AmeerHajAli changed the title [rllib][gcs] inconsistencies in resource demands [rllib][gcs][placementgroups] instability issues running tune/rllib Aug 22, 2021
@krfricke
Contributor

Thank you so much for discovering these. I'll investigate further and might create child issues to track these individually after finding out the cause.

@AmeerHajAli
Contributor Author

@krfricke, after running compact-regression-test.yaml, I am also getting:

Traceback (most recent call last):
  File "/Users/ameerhajali/anaconda3/envs/ray/bin/rllib", line 8, in <module>
    sys.exit(cli())
  File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/rllib/scripts.py", line 34, in cli
    train.run(options, train_parser)
  File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/rllib/train.py", line 255, in run
    concurrent=True)
  File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/tune/tune.py", line 624, in run_experiments
    _remote=False))
  File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 81, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
  File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/util/client/worker.py", line 225, in get
    res = self._get(obj_ref, op_timeout)
  File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/util/client/worker.py", line 244, in _get
    err = cloudpickle.loads(data.error)
ModuleNotFoundError: No module named 'tblib'

@krfricke
Contributor

  1. I will look into this today
  2. Should be fixed in latest master
  3. [placement groups/autoscaler] unfulfillable requests should raise an error #18018
  4. see 3.
  5. Might be unrelated to RLlib, but I'll look into this today
  6. This is probably not related to RLlib either, as it occurs during general error handling (pickling an exception). Still, it's unclear why there is a dependency mismatch here. If I see it, I'll try to figure out what's going on. It might be helpful to provide your environment information (pip freeze -l) here
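
For the dependency mismatch in item 6, the traceback above fails inside cloudpickle.loads(data.error) on the client side with "No module named 'tblib'". A minimal sketch of a check that can be run in the client environment is shown below; the helper name missing_modules is hypothetical, and it only tests whether a module can be located locally, not whether the cluster has the same packages.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# tblib is the module named in the ModuleNotFoundError above; cloudpickle is
# the module that was unpickling the error in that traceback.
missing = missing_modules(["tblib", "cloudpickle"])
if missing:
    print("Not importable on this client:", ", ".join(missing))
```

Running the same check on the cluster side and comparing the two results would show which environment is missing the package.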

@krfricke
Contributor

I can't repro 5 and 6. Does this come up immediately? (It ran for ~1.5 hours without any problems). If it still comes up for you, can you post some local environment information (Python version and pip freeze -l)?

@AmeerHajAli
Contributor Author

(ray) ~/Desktop> pip freeze -l
aiobotocore==1.2.2
aiodataloader==0.2.0
aiofiles==0.5.0
aiohttp==3.7.4.post0
aiohttp-cors==0.7.0
aiohttp-middlewares==1.1.0
aioitertools==0.7.1
aiojobs==0.3.0
aiopg==1.2.0
aioredis==1.3.1
alabaster==0.7.12
alchemy-mock==0.4.3
alembic==1.5.2
aniso8601==7.0.0
anyio==2.2.0
anyscale==0.4.18
apipkg==1.5
appdirs==1.4.4
appnope==0.1.0
argon2==0.1.10
argon2-cffi==20.1.0
asgiref==3.3.1
astroid==2.5.6
async-exit-stack==1.0.1
async-generator==1.10
async-timeout==3.0.1
asyncache==0.1.1
asyncpg==0.21.0
asynctest==0.13.0
attrs==20.3.0
aws==0.2.5
aws-sam-translator==1.28.1
aws-xray-sdk==2.6.0
awscli==1.19.62
awspricing==2.0.3
Babel==2.9.0
backcall==0.2.0
backoff==1.10.0
bcrypt==3.1.7
beautifulsoup4==4.9.1
black==19.10b0
bleach==3.1.5
blessings==1.7
blis==0.7.4
boto==2.49.0
boto3==1.16.52
botocore==1.19.52
cachetools==4.2.0
caffeinate==0.1.0
catalogue==1.0.0
certifi==2020.12.5
cffi==1.14.4
cfgv==3.2.0
cfn-lint==0.39.0
chardet==3.0.4
click==7.1.2
cliff==3.6.0
cloudpickle==1.6.0
cmaes==0.7.1
cmd2==1.5.0
cmdstanpy==0.9.68
colorama==0.4.4
coloredlogs==15.0
colorful==0.5.4
colorlog==4.7.2
colorthief==0.2.1
commonmark==0.8.1
conda-pack==0.6.0
ConfigArgParse==1.4
convertdate==2.3.2
coverage==5.3.1
cryptography==3.3.1
cycler==0.10.0
cymem==2.0.5
Cython==0.29
dask==2021.4.0
databases==0.4.2
dataclasses==0.6
decorator==4.4.2
defusedxml==0.6.0
Deprecated==1.2.12
distlib==0.3.1
dm-tree==0.1.6
dnspython==2.1.0
docker==4.4.1
docspec==0.2.1
docspec-python==0.2.0
docutils==0.14
ecdsa==0.14.1
email-validator==1.1.2
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
entrypoints==0.3
ephem==3.7.7.1
execnet==1.7.1
expiringdict==1.1.4
fabric==2.5.0
fastapi==0.59.0
filelock==3.0.12
flake8==3.8.4
flake8-alfred==1.1.1
flake8-import-order==0.18.1
flake8-polyfill==1.0.2
flake8-quotes==3.2.0
Flask==1.1.4
Flask-BasicAuth==0.2.0
Flask-Cors==3.0.10
flask-pytest==0.0.5
Flask-RESTful==0.3.8
flatbuffers==1.12
freezegun==1.1.0
fsspec==2021.6.1
future==0.18.2
gensim==3.8.3
gevent==21.1.2
geventhttpclient==1.4.4
gitdb==4.0.5
GitPython==3.1.3
google==3.0.0
google-api-core==1.25.0
google-api-python-client==1.12.8
google-auth==1.24.0
google-auth-httplib2==0.0.4
google-auth-oauthlib==0.4.2
google-cloud==0.34.0
google-cloud-billing==1.1.0
google-cloud-core==1.5.0
google-cloud-iam==2.0.0
google-cloud-resource-manager==0.30.3
googleapis-common-protos==1.52.0
gpustat==0.6.0
graphene==2.1.8
graphql-core==2.3.2
graphql-relay==2.0.1
greenlet==1.0.0
grimp==1.2.3
grpc-google-iam-v1==0.12.3
grpc-stubs==1.24.3
grpcio==1.35.0
grpcio-tools==1.35.0
gym==0.18.0
h11==0.9.0
hijri-converter==2.1.1
hiredis==2.0.0
holidays==0.11.1
httplib2==0.18.1
httptools==0.1.1
humanfriendly==9.1
hurry.filesize==0.9
identify==2.2.4
idna==2.10
imagesize==1.2.0
import-linter==1.2.1
importlib-metadata==4.0.1
iniconfig==1.1.1
invoke==1.4.1
ipykernel==5.3.4
ipython==7.17.0
ipython-genutils==0.2.0
iso8601==0.1.14
isort==5.8.0
itsdangerous==1.1.0
jedi==0.17.2
Jinja2==2.11.2
jmespath==0.10.0
joblib==1.0.0
json5==0.9.5
jsondiff==1.2.0
jsonpatch==1.28
jsonpickle==1.4.1
jsonpointer==2.0
jsonschema==3.2.0
junit-xml==1.9
jupyter-client==6.1.6
jupyter-core==4.6.3
jupyter-packaging==0.7.12
jupyter-server==1.4.1
jupyterlab==3.0.12
jupyterlab-server==2.3.0
kiwisolver==1.3.1
kopf==1.32.1
korean-lunar-calendar==0.2.1
kubernetes==17.17.0
kubernetes-asyncio==12.0.1
launchdarkly-server-sdk==6.13.1
lazy-object-proxy==1.6.0
libcst==0.3.16
libhoney==1.9.0
locket==0.2.1
locust==1.4.3
LunarCalendar==0.0.9
lz4==3.1.3
Mako==1.1.4
MarkupSafe==1.1.1
matplotlib==3.3.4
mccabe==0.6.1
mistune==0.8.4
mock==1.0.1
modin==0.10.0
more-itertools==8.7.0
moto==1.3.16
msgpack==1.0.2
multidict==5.1.0
murmurhash==1.0.5
mypy==0.790
mypy-extensions==0.4.3
nbclassic==0.2.6
nbconvert==5.6.1
nbformat==5.0.7
networkx==2.5.1
nltk==3.6.2
nodeenv==1.6.0
notebook==6.0.3
npm==0.1.1
nr.collections==0.0.1
nr.databind.core==0.0.22
nr.databind.json==0.0.14
nr.fs==1.6.3
nr.interface==0.0.5
nr.metaclass==0.0.6
nr.parsing.date==0.6.1
nr.pylang.utils==0.0.4
nr.stream==0.0.5
nr.utils.re==0.1.1
numpy==1.19.5
nvidia-ml-py3==7.352.0
oauth2client==3.0.0
oauthlib==3.1.0
onelogin==2.0.2
opencensus==0.7.12
opencensus-context==0.1.2
opencv-python-headless==4.3.0.36
opentelemetry-api==1.4.1
opentelemetry-exporter-otlp==0.17b0
opentelemetry-exporter-otlp-proto-grpc==1.4.1
opentelemetry-ext-asgi==0.11b0
opentelemetry-ext-asyncpg==0.11b0
opentelemetry-ext-botocore==0.11b0
opentelemetry-ext-honeycomb==0.5b0
opentelemetry-instrumentation==0.23b2
opentelemetry-instrumentation-asgi==0.17b0
opentelemetry-instrumentation-asyncpg==0.17b0
opentelemetry-instrumentation-botocore==0.17b0
opentelemetry-instrumentation-sqlalchemy==0.17b0
opentelemetry-instrumentation-starlette==0.17b0
opentelemetry-proto==1.4.1
opentelemetry-sdk==1.4.1
opentelemetry-semantic-conventions==0.23b2
optional-django==0.1.0
optuna==2.5.0
orjson==3.4.7
packaging==20.8
pandas==1.2.4
pandoc==1.0.2
pandocfilters==1.4.2
paramiko==2.7.1
parso==0.7.1
partd==1.1.0
pathspec==0.8.1
pbr==5.5.1
pep8-naming==0.11.1
pexpect==4.8.0
pickle5==0.0.11
pickleshare==0.7.5
Pillow==7.2.0
pip-tools==5.5.0
plac==1.1.3
plotly==4.14.3
pluggy==0.13.1
ply==3.11
postgres==3.0.0
pre-commit==2.12.1
preshed==3.0.5
prettytable==0.7.2
prometheus-client==0.10.1
promise==2.3
prompt-toolkit==3.0.6
prophet==1.0.1
proto-plus==1.13.0
protobuf==3.15.3
psutil==5.8.0
psycopg2-binary==2.8.6
psycopg2-pool==1.1
ptyprocess==0.6.0
py==1.10.0
py-spy==0.3.5
pyaml==20.4.0
pyarrow==3.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybase62==0.4.3
pycodestyle==2.6.0
pycparser==2.20
pydantic==1.8.1
pydata-sphinx-theme==0.4.3
pydoc-markdown==3.13.0
pydocstyle==5.0.2
pyflakes==2.2.0
PyGithub==1.55
pyglet==1.5.0
Pygments==2.3.1
PyJWT==2.1.0
pylama==7.7.1
pylint==2.8.2
PyMeeus==0.5.11
PyNaCl==1.4.0
pynput==1.7.3
pyobjc-core==7.3
pyobjc-framework-Cocoa==7.3
pyobjc-framework-Quartz==7.3
pyparsing==2.4.7
pyperclip==1.8.1
pyRFC3339==1.1
pyrsistent==0.17.3
pystan==2.19.1.1
pytest==6.2.1
pytest-aiohttp==0.3.0
pytest-asyncio==0.14.0
pytest-azurepipelines==0.8.0
pytest-cov==2.11.1
pytest-flask==1.0.0
pytest-forked==1.3.0
pytest-timeout==1.4.2
pytest-tornado==0.8.1
pytest-xdist==2.2.0
python-dateutil==2.8.1
python-editor==1.0.4
python-engineio==3.14.2
python-jose==3.2.0
python-json-logger==2.0.1
python-multipart==0.0.5
python-socketio==4.6.0
python3-wget==0.0.2b1
pytz==2020.5
PyYAML==5.4.1
pyzmq==19.0.2
ray==1.5.2
readthedocs-sphinx-ext==1.0.4
recommonmark==0.5.0
redis==3.5.0
regex==2021.4.4
requests==2.25.1
requests-oauthlib==1.3.0
responses==0.12.0
retrying==1.3.3
rsa==4.7
Rx==1.6.1
s3fs==2021.6.1
s3transfer==0.3.7
sacremoses==0.0.43
scalesec-gcp-workload-identity==1.0.7
scikit-learn==0.23.2
scikit-optimize==0.8.1
scipy==1.5.4
semver==2.13.0
Send2Trash==1.5.0
sentencepiece==0.1.95
sentry-sdk==1.1.0
setuptools-git==1.2
six==1.15.0
sklearn==0.0
smart-open==5.1.0
smmap==3.0.4
sniffio==1.2.0
snowballstemmer==2.0.0
soupsieve==2.0.1
spacy==2.3.5
Sphinx==3.0.4
sphinx-book-theme==0.0.39
sphinx-click==2.5.0
sphinx-copybutton==0.3.1
sphinx-gallery==0.8.2
sphinx-jsonschema==1.16.7
sphinx-tabs==2.0.1
sphinx-version-warning==1.1.2
sphinxcontrib-applehelp==1.0.2
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==1.0.3
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.4
sphinxcontrib-websupport==1.2.4
sphinxcontrib.yt==0.2.2
sphinxemoji==0.1.8
SQLAlchemy==1.4.0b1
sqlalchemy-stubs==0.4
srsly==1.0.5
sshpubkeys==3.1.0
starlette==0.13.4
statsd==3.3.0
stevedore==3.3.0
svgwrite==1.4.1
tabulate==0.8.7
tensorboardX==2.1
terminado==0.8.3
testfixtures==6.15.0
testpath==0.4.4
texthero==1.0.9
thinc==7.4.5
threadpoolctl==2.1.0
tokenizers==0.8.1rc2
toml==0.10.2
toolz==0.11.1
torch==1.7.1
torchvision==0.8.2
tornado==6.1
tqdm==4.56.0
traitlets==4.3.3
transformers==3.1.0
tune-sklearn==0.2.1
typed-ast==1.4.2
typer==0.3.2
typing-extensions==3.10.0.0
typing-inspect==0.6.0
ujson==3.2.0
Unidecode==1.2.0
uritemplate==3.0.1
urllib3==1.26.2
uvicorn==0.11.8
uvloop==0.14.0
virtualenv==20.4.4
vulture==2.3
wasabi==0.8.2
watchdog==1.0.2
wcwidth==0.1.9
webencodings==0.5.1
websocket-client==0.57.0
websockets==8.1
Werkzeug==1.0.1
wordcloud==1.8.1
wrapt==1.12.1
xgboost==1.4.2
xgboost-ray==0.1.1
xmltodict==0.12.0
yapf==0.23.0
yarl==1.6.3
yaspin==1.0.0
zipp==3.4.1
zope.event==4.5.0
zope.interface==5.3.0

python 3.7
I think it is straightforward to repro if you run against a session in the product with the default cluster compute.

@AmeerHajAli
Contributor Author

CC @wuisawesome, I think the placement groups are potentially leaking or not being cleaned up appropriately.
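
One way to look for such leftovers after a job finishes is to scan the placement group table for entries that are not marked as removed. The sketch below assumes the table is a dict keyed by placement group ID whose values carry a "state" field, which is the shape ray.util.placement_group_table() is assumed to return; that shape should be checked against the installed Ray version. The helper name and the sample states are made up for illustration.

```python
from typing import Dict, List


def lingering_placement_groups(table: Dict[str, dict]) -> List[str]:
    """Return IDs of placement groups whose state is anything other than REMOVED."""
    return [
        pg_id
        for pg_id, info in table.items()
        if info.get("state") != "REMOVED"
    ]


# Hand-written sample shaped like the assumed table. The IDs are taken from the
# resource names in the logs above; the states are invented for illustration.
sample_table = {
    "8c84f56bef40324a35f6e63418c2a54d": {"state": "REMOVED", "strategy": "PACK"},
    "8eb7d5e8a4ed413432db93d0b79b3e67": {"state": "CREATED", "strategy": "PACK"},
    "16cd93bbf7607454e10fb4e3334f5da6": {"state": "PENDING", "strategy": "PACK"},
}

for pg_id in lingering_placement_groups(sample_table):
    print(f"placement group {pg_id} is still {sample_table[pg_id]['state']}")
```

On a live cluster, the sample dict would be replaced by the real table, fetched once the tuning loop reports that it has finished.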

@richardliaw richardliaw added the rllib RLlib related issues label Oct 5, 2021
@rkooo567
Contributor

rkooo567 commented Nov 2, 2021

I believe this should be fixed in master. Please reopen if you see the issue again.

@rkooo567 rkooo567 closed this as completed Nov 2, 2021
4 participants