[tune] [rllib] Tune trials slowing down and reporting incorrect time #12352

Closed
roireshef opened this issue Nov 24, 2020 · 28 comments
Labels
bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks), tune (Tune-related issues)

Comments

@roireshef
Contributor

What is the problem?

I'm running Tune for hyperparameter search with single-node trials that should last 200M steps each, all running simultaneously. My Ray cluster is a private cluster setup. All of the trials start at a rate of ~500 steps/sec, and after two to three hours they all start slowing down dramatically.

Since they all start at the same time (±1 min), I expected Tune to show a similar elapsed time for all of them (in the summary table it prints periodically), but the reported time varies greatly between trials.

Ray version and other system information (Python version, TensorFlow version, OS):
Ray 0.7.2 (I'm pretty sure it reproduces in 1.0.1 as well)
Docker 19.03.11
PyTorch 1.5.1
Ubuntu 16.04 / SUSE Linux Enterprise Server 12 SP4

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
I'm running a FLOW-based environment with confidential logic I can't publish. I'm attaching logs instead.

If we cannot run your script, we cannot fix your issue.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
@roireshef roireshef added the bug and triage labels Nov 24, 2020
@roireshef
Contributor Author

roireshef commented Nov 24, 2020

TensorBoard charts show the sampling rate gradually slowing down at some point (the chart shows all trials in that same experiment):
[screenshot: slowed_down_training_num_steps]

I'm attaching the top output for one node in the cluster (sumo is an external process spawned by the environment - it's a traffic simulator the environment's code interacts with). Note how the memory consumption of the RLlib workers went up for processes that have existed for more than a few hours, while the new (possibly recovered) process has lower memory consumption. Also, some workers exhibit relatively high CPU usage while others are stuck at 0%.
[screenshot: slowing_down_training]

Finally, I'm also attaching the raylet logs at that same machine:
session_2020-11-23_07-52-28_704870_1.tar.gz

I'm currently debugging under the assumption this is a memory issue, but:

  1. I've seen in another issue that @rkooo567 suggested using the lru-evict flag, but I failed to apply it to ray start - is it supported in 0.7.2? Also, AFAIK this is a workaround that isn't guaranteed to be stable, and the better practice anyway is to find the root cause and fix it. So I wonder how that can be done - is there any best practice or tool for that? (See the monitoring sketch after this list.)
  2. It might be worth printing warnings to the console when memory is exhausted, so detection is immediate.
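For item 1, here is a small watcher I could run on a node to track RSS growth of the Ray/RLlib worker processes over time. This is purely illustrative - psutil and the process-name matching below are assumptions of mine, not anything Ray provides:

```python
import time
import psutil

def log_ray_worker_memory(poll_s=60):
    """Print the RSS of processes that look like Ray workers, once per poll."""
    while True:
        for proc in psutil.process_iter(["pid", "cmdline", "memory_info"], ad_value=None):
            cmd = " ".join(proc.info["cmdline"] or [])
            mem = proc.info["memory_info"]
            # Ray workers usually show up as "ray::<ActorName>" or default_worker.py
            if mem and ("ray::" in cmd or "default_worker.py" in cmd):
                print(f"pid={proc.info['pid']} rss={mem.rss / 1e9:.2f}GB cmd={cmd[:60]}")
        time.sleep(poll_s)
```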

@roireshef
Contributor Author

@ericl , @richardliaw - I think I have some interesting insights about this issue. I ran two experiments with everything configured exactly the same (both hardware and software), except that one experiment (referred to below as "stress_test") was given enough nodes to run all trials simultaneously (48 nodes x 32 cores) while the other one (referred to as "debug") was given only 10 nodes (though the experiment still has 48 trials).

Now, see how the trials accumulate steps - the number of steps sampled/trained is shown in the next charts for trials from both experiments. The debug trials have a steady throughput, accumulating steps linearly over time, while the stress_test ones flatten out over time (their steps-per-second rate gradually decreases):
[screenshot: debug_vs_stress_test]

The only difference between the two is the number of nodes in the cluster that Tune has to manage simultaneously. The linear lines (debug trials) have occasional drops due to crashes and recoveries I'm aware of, which are fixed in my newer environment code, so please ignore them.

Another interesting finding: when running top on two nodes (one from each experiment), the one that belongs to stress_test shows ~2GB of memory used by every Ray worker process, with quite a lot of them at 0% CPU utilization, whereas the debug node's processes have stabilized at ~1GB or ~1.5GB (depending on num_envs_per_worker, which is tested at 3 and 5, respectively) with all processes utilizing the CPU well.

@richardliaw
Contributor

@roireshef do you have output logs from the tune trial? I'm wondering if this is some artifact of checkpointing.

FYI, we don't have long-term support for older versions like 0.7.2

@roireshef
Contributor Author

roireshef commented Nov 27, 2020

Hi @richardliaw,

@roireshef do you have output logs from the tune trial? I'm wondering if this is some artifact of checkpointing.

I'm pretty sure it isn't related to checkpointing: while I've set checkpointing to happen every 50 iterations, you can see that the training speed degrades very smoothly over time. Can you use the logs in #12352 (comment)?
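(For context, a minimal sketch of how that 50-iteration checkpointing is typically configured in Tune - the trainer name and env below are placeholders, not my actual setup:)

```python
from ray import tune

tune.run(
    "A3C",
    config={"env": "YourEnv-v0"},  # placeholder config
    checkpoint_freq=50,            # write a checkpoint every 50 training iterations
)
```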

FYI, we don't have long-term support for older versions like 0.7.2

I understand. I tried migrating from 0.7.2 to several versions (0.8.6, 0.8.7, 1.0.1, 1.1.0), and each one had at least one issue that prevented me from migrating successfully. I've been waiting for those to be fixed for some time, but had to continue my actual research; I will try again in about a month. That said, I'm pretty sure this specific issue reproduces on newer versions as well. I recently provided @ericl with reproduction instructions in #11239 (comment) - it seems like my environment is heavier and more memory-demanding than the ones often used, although here specifically the correlation with the number of nodes used by Tune points to a Ray issue rather than to the environment implementation.

@richardliaw
Contributor

I'm pretty sure it isn't related to checkpointing because while I've set checkpoint to happen every 50 iterations,

I'm referring to an internal Tune checkpointing mechanism. Unfortunately, the logs posted do not capture stdout/stderr of the main Python process. Can you please provide the stdout/stderr of that process? If you plan to rerun this, it'd be great to set ray.init(logging_level="DEBUG") too.
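A minimal sketch of that logging change (assuming the driver script connects to an already-running cluster):

```python
import logging
import ray

# Connect the driver to the existing cluster with verbose logging enabled.
ray.init(address="auto", logging_level=logging.DEBUG)
```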

@ericl
Contributor

ericl commented Nov 28, 2020

This kind of reminds me of #4211, though that issue should already be solved. It does sound to me like a problem in the Tune event loop rather than Ray, though.

@ericl
Contributor

ericl commented Nov 28, 2020

@richardliaw running a similar workload on 50 nodes, I see warnings like this:

2020-11-28 03:12:20,616 WARNING util.py:143 -- The on_step_begin operation took 0.8207757472991943 seconds to complete, which may be a performance bottleneck.
2020-11-28 03:12:22,304 WARNING util.py:143 -- The on_step_begin operation took 0.8349165916442871 seconds to complete, which may be a performance bottleneck.
2020-11-28 03:12:23,944 WARNING util.py:143 -- The on_step_begin operation took 0.805654764175415 seconds to complete, which may be a performance bottleneck.

This could be an O(n^2) kind of issue: as the number of nodes grows, the number of steps per second that must be processed increases linearly. At the same time, the amount of work per step (calling ray.cluster_resources() etc. in on_step_begin) also increases linearly. At some point the n^2 factor overwhelms Tune.
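To make that scaling argument concrete, a rough back-of-the-envelope sketch (the constants below are purely illustrative, not measurements from this issue):

```python
# Illustrative only: driver-side work per wall-clock second grows ~n^2, because
# both the result arrival rate and the per-result handling cost grow with the
# number of nodes/trials.
def driver_work_per_second(nodes, results_per_sec_per_trial=0.3,
                           handling_cost_per_node=0.02):
    results_per_sec = nodes * results_per_sec_per_trial   # ~linear in n
    cost_per_result = handling_cost_per_node * nodes      # ~linear in n
    return results_per_sec * cost_per_result               # ~n^2 overall

print(driver_work_per_second(3))    # ~0.05 s of processing per second -> keeps up
print(driver_work_per_second(50))   # ~15 s of processing per second -> backlog
```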

@roireshef I think the large number of concurrent trials is causing Tune to become backlogged processing trial results. One way of mitigating this is to increase the reporting period of the trials. In RLlib you can set min_iter_time_s: 90 to report results at most once per 90 seconds per trial. Can you try this out to see if it helps?
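A minimal sketch of what that change could look like (the trainer name, env, and worker count are placeholders, not the exact setup from this issue):

```python
from ray import tune

tune.run(
    "A3C",
    config={
        "env": "YourEnv-v0",      # placeholder environment
        "num_workers": 31,        # e.g. one full 32-core node per trial
        "min_iter_time_s": 90,    # report results at most once every ~90 s per trial
    },
)
```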

@ericl ericl added the P1 and tune labels and removed the triage label Nov 28, 2020
@roireshef
Contributor Author

roireshef commented Nov 29, 2020

@ericl I'll do that soon and report back, thanks.
Edit: I ran two experiments more or less at the same time, one on a 3(x32c)-node cluster and the other on a 36(x32c)-node cluster. Results of the smaller one show:

Number of trials: 3/3 (3 RUNNING)
+-------------------------------------------------------+----------+--------------------+---------------------------+--------+------------------+---------+----------+----------------------+----------------------+--------------------+
| Trial name                                            | status   | loc                |   rollout_fragment_length |   iter |   total time (s) |      ts |   reward |   episode_reward_max |   episode_reward_min |   episode_len_mean |
|-------------------------------------------------------+----------+--------------------+---------------------------+--------+------------------+---------+----------+----------------------+----------------------+--------------------|
| A3C_NegotiationMergeHandoffFullLateralEnv_42e58_00000 | RUNNING  | 10.120.252.248:68  |                       100 |     67 |          **8280.11** | **4370702** | 0.730443 |              1.35521 |            -1.07226  |            21.3256 |
| A3C_NegotiationMergeHandoffFullLateralEnv_42e58_00001 | RUNNING  | 10.120.253.5:68    |                       200 |     63 |          **8291.89** | **3115123** | 0.550906 |              1.4126  |            -0.749485 |            20.5144 |
| A3C_NegotiationMergeHandoffFullLateralEnv_42e58_00002 | RUNNING  | 10.120.252.247:162 |                       300 |     59 |          **8268.29** | **3077160** | 0.485682 |              1.47101 |            -0.692797 |            29.0559 |
+-------------------------------------------------------+----------+--------------------+---------------------------+--------+------------------+---------+----------+----------------------+----------------------+--------------------+

While the summary table for the larger cluster shows:

Number of trials: 36/36 (36 RUNNING)
+-------------------------------------------------------+----------+---------------------+-----------------------------------------------+---------------------------------------------------------+-----------------------+---------------------------+--------+------------------+---------+----------+----------------------+----------------------+--------------------+
| Trial name                                            | status   | loc                 |   env_config/REWARD_FUNCTION_PARAMS/JERK_COEF |   env_config/REWARD_FUNCTION_PARAMS/KNOT_JERK_DIFF_COEF |   num_envs_per_worker |   rollout_fragment_length |   iter |   total time (s) |      ts |   reward |   episode_reward_max |   episode_reward_min |   episode_len_mean |
|-------------------------------------------------------+----------+---------------------+-----------------------------------------------+---------------------------------------------------------+-----------------------+---------------------------+--------+------------------+---------+----------+----------------------+----------------------+--------------------|
| A3C_NegotiationMergeHandoffFullLateralEnv_e9458_00000 | RUNNING  | 10.120.253.39:68    |                                          0.1  |                                                    0.1  |                     1 |                       100 |     22 |          **2547.62** | **1417880** | 0.703202 |              1.36162 |          -0.140709   |           24.2091  |
| A3C_NegotiationMergeHandoffFullLateralEnv_e9458_00001 | RUNNING  | 10.120.252.217:68   |                                          0.01 |                                                    0.1  |                     1 |                       100 |     24 |          **2648.8**  | **1488045** | 0.786167 |              1.37889 |           0.00586435 |           25.7088  |
| A3C_NegotiationMergeHandoffFullLateralEnv_e9458_00002 | RUNNING  | 10.120.253.61:1771  |                                          0.1  |                                                    0.01 |                     1 |                       100 |     24 |          **2558.12** | **1448053** | 0.800076 |              1.39519 |          -0.127671   |           27.2306  |
| A3C_NegotiationMergeHandoffFullLateralEnv_e9458_00003 | RUNNING  | 10.120.252.243:68   |                                          0.01 |                                                    0.01 |                     1 |                       100 |     21 |          **2535.21** | **1528173** | 0.873466 |              1.56099 |           0.147064   |           31.104   |
| A3C_NegotiationMergeHandoffFullLateralEnv_e9458_00004 | RUNNING  | 10.120.253.26:1164  |                                          0.1  |                                                    0.1  |                     3 |                       100 |     21 |          **2476.43** | **1195145** | 0.558624 |              1.2837  |          -0.228631   |           18.7178  |
| A3C_NegotiationMergeHandoffFullLateralEnv_e9458_00005 | RUNNING  | 10.120.253.61:68    |                                          0.01 |                                                    0.1  |                     3 |                       100 |     23 |          **2604.92** | **1288328** | 0.585579 |              1.46326 |          -1.13464    |           28.5279  |
| A3C_NegotiationMergeHandoffFullLateralEnv_e9458_00006 | RUNNING  | 10.120.252.182:1186 |                                          0.1  |                                                    0.01 |                     3 |                       100 |     21 |          **2474.23** | **1216406** | 0.63641  |              1.48117 |           0.211203   |           25.1225  |
| A3C_NegotiationMergeHandoffFullLateralEnv_e9458_00007 | RUNNING  | 10.120.252.240:68   |                                          0.01 |                                                    0.01 |                     3 |                       100 |     20 |          **2550.96** | **1325959** | 0.815871 |              1.60022 |           0.136913   |           28.6285  |
| A3C_NegotiationMergeHandoffFullLateralEnv_e9458_00008 | RUNNING  | 10.120.253.53:72    |                                          0.1  |                                                    0.1  |                     5 |                       100 |     20 |          **2517.84** | **1123780** | 0.490222 |              1.3201  |          -1.39565    |           10.8477  |
| A3C_NegotiationMergeHandoffFullLateralEnv_e9458_00009 | RUNNING  | 10.120.252.202:68   |                                          0.01 |                                                    0.1  |                     5 |                       100 |     20 |          **2546.36** | **1158058** | 0.576395 |              1.44916 |          -1.08705    |           24.7925  |
| A3C_NegotiationMergeHandoffFullLateralEnv_e9458_00010 | RUNNING  | 10.120.252.229:68   |                                          0.1  |                                                    0.01 |                     5 |                       100 |     20 |          **2508.71** | **1144760** | 0.507061 |              1.38953 |          -0.118062   |           10.1495  |
| A3C_NegotiationMergeHandoffFullLateralEnv_e9458_00011 | RUNNING  | 10.120.253.25:68    |                                          0.01 |                                                    0.01 |                     5 |                       100 |     21 |          **2517.59** | **1180637** | 0.712453 |              1.60088 |           0.186469   |           27.8747  |
| A3C_NegotiationMergeHandoffFullLateralEnv_e9458_00012 | RUNNING  | 10.120.253.6:68     |                                          0.1  |                                                    0.1  |                     1 |                       200 |     21 |          **2512.58** | **1273095** | 0.684557 |              1.31557 |          -0.280583   |           25.5985  |
| A3C_NegotiationMergeHandoffFullLateralEnv_e9458_00013 | RUNNING  | 10.120.252.213:205  |                                          0.01 |                                                    0.1  |                     1 |                       200 |     21 |          **2526.43** | **1354669** | 0.729874 |              1.31967 |          -0.344612   |           23.2483  |
| A3C_NegotiationMergeHandoffFullLateralEnv_e9458_00014 | RUNNING  | 10.120.253.58:68    |                                          0.1  |                                                    0.01 |                     1 |                       200 |     20 |          **2473.29** | **1353609** | 0.738936 |              1.44908 |          -0.206731   |           25.2905  |
| A3C_NegotiationMergeHandoffFullLateralEnv_e9458_00015 | RUNNING  | 10.120.253.6:72     |                                          0.01 |                                                    0.01 |                     1 |                       200 |     24 |          **2520.88** | **1394423** | 0.863334 |              1.51696 |           0.265222   |           28.039   |
| A3C_NegotiationMergeHandoffFullLateralEnv_e9458_00016 | RUNNING  | 10.120.253.31:1185  |                                          0.1  |                                                    0.1  |                     3 |                       200 |     24 |          **2666.26** |  **986375** | 0.503936 |              1.33032 |          -0.986918   |           21.1258  |
| A3C_NegotiationMergeHandoffFullLateralEnv_e9458_00017 | RUNNING  | 10.120.253.34:68    |                                          0.01 |                                                    0.1  |                     3 |                       200 |     23 |          **2685.1**  | **1165452** | 0.564914 |              1.42336 |          -0.276867   |           27.8179  |
| A3C_NegotiationMergeHandoffFullLateralEnv_e9458_00018 | RUNNING  | 10.120.253.25:72    |                                          0.1  |                                                    0.01 |                     3 |                       200 |     21 |          **2566.15** | **1101520** | 0.482917 |              1.14707 |           0.169248   |            7.73379 |
| A3C_NegotiationMergeHandoffFullLateralEnv_e9458_00019 | RUNNING  | 10.120.252.192:87   |                                          0.01 |                                                    0.01 |                     3 |                       200 |     25 |          **2850.1**  | **1367602** | 0.660461 |              1.37393 |           0.216757   |           18.0209  |
+-------------------------------------------------------+----------+---------------------+-----------------------------------------------+---------------------------------------------------------+-----------------------+---------------------------+--------+------------------+---------+----------+----------------------+----------------------+--------------------+
... 16 more trials not shown (16 RUNNING)

I emphasized the time and steps columns. The progress rate for the trials in the 36-node cluster is ~30% of the smaller cluster's trials, for both time and steps, so the problem remains. I ran top on one node from each cluster: while the smaller cluster's node shows the RolloutWorker processes consuming ~80% CPU most of the time, the larger cluster's node already shows RolloutWorker processes at 0% CPU - see the following snapshot:
[screenshot: Screenshot from 2020-11-29 23-31-03]

I'll wait a few more hours and see if the steps/hour demonstrate the same behavior as in #12352 (comment), and I will also provide the logs again. This time I'm running with the latest nightly wheel, so the logs will be more detailed (although I forgot to set the logging level to DEBUG this time - sorry @richardliaw - but I think they do include the main process outputs as well).

@ericl - I'm not sure I understand what exactly is O(n^2) here (n being the number of nodes in the cluster).
If the runtime complexity grows quadratically with the number of nodes in the cluster, that basically means there will always be a low limit on how many nodes we can use in a Tune experiment. Can it be avoided?

@roireshef
Contributor Author

roireshef commented Nov 30, 2020

[screenshot: Screenshot from 2020-11-30 21-11-26]

Yes, it's clear. The 36-node experiment just stops making any progress after less than a full hour of training. The 3 trials that did make progress are the ones that belong to the 3-node cluster...

@richardliaw, attached:

  1. Console output from the main Python process: main_cleaned.tar.gz
  2. Temp logs from one of the machines in the clusters: logs.tar.gz

@roireshef
Contributor Author

@ericl , @richardliaw - Whichever machine I connected to in order to pull its temporary logs, I ended up with the same files. I was under the impression that each node stores its own temporary logs. Is it the case that all nodes communicate temporary logs to one another? @ericl, is this why you said it's an O(n^2) thing? Can this be avoided in any way?

@richardliaw
Contributor

Taking a look!

@richardliaw
Contributor

@roireshef how did you generate the plot in #12352 (comment)? Is it at all related to the logs shown in the console output from the main Python process?

In the logs, I'm seeing that every process is making progress and incrementing the "timesteps_total" value. If you look closely, every trial at training_iteration = 30 has passed 1e7 timesteps.

What am I missing?

Also, for /tmp/ray/session_latest, what is most useful to me is just the (raylet, gcs_server) logs on the head node.

@roireshef
Contributor Author

roireshef commented Dec 1, 2020

@roireshef how did you generate the plot in #12352 (comment) ? Are they at all related to the logs show in the Console Output from main python process?

Yes, the chart in that comment is produced by TensorBoard, overlaying this 36-node experiment with another 3-node experiment as a baseline (the consistent linear progress is made by the 3 trials of the 3-node experiment).

In the logs, I'm seeing that every process is making progress and incrementing the "timesteps_total" value. If you look closely, every trial at training_iteration = 30 has passed 1e7 timesteps.

What am I missing?

@richardliaw - That's very strange. Where exactly do you see 1e7 timesteps? (1e7 is also a bad number - it should have been on the order of 4e7-5e7 timesteps at that point, after more than 24h of training.) Are you sure we are looking at the same files? (The logs archive in #12352 (comment) doesn't have a session_latest soft-link at all; please confirm you are reading files from this folder in the archive: session_2020-11-29_14-05-25_874447_1.)

The main_cleaned.tar.gz file ends with the following lines:
timers:
apply_grad_throughput: 264323.924
apply_grad_time_ms: 3.842
grad_wait_time_ms: 1.624
update_time_ms: 1.171
timestamp: 1606764482
timesteps_since_restore: 0
timesteps_total: 1926095 <------- see that
training_iteration: 65
trial_id: e9458_00023

and the last Tune summary table clearly shows 4000-5000 [sec] in the total time (s) column (even though 24h have passed since the start of the experiment - see the date-time reported in some lines, and compare the first summary table to the last one in that file), and the timesteps column has values ranging from 1e6 to 2e6...

BTW, what I see in main_cleaned.tar.gz is that all trials have reported only 60-80 iterations (not 30) - keep in mind this is after 24h of training... ~3 iterations/hour is very far from what you'd expect when min_iter_time_s=120sec and timesteps_per_iteration=10000.
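Rough arithmetic on the figures above (using only the numbers quoted in this comment):

```python
# With min_iter_time_s=120, an iteration should take at least ~120 s, so a
# healthy trial could report up to this many iterations in 24 h:
print(24 * 3600 / 120)   # 720.0
# Observing only ~70 iterations in 24 h means each iteration actually took:
print(24 * 3600 / 70)    # ~1234 s, i.e. roughly 10x the configured minimum
```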

@roireshef
Contributor Author

I guess this is pending #12360 now...

@richardliaw
Contributor

richardliaw commented Dec 6, 2020

Ah, I meant 1e6, sorry! Let me take a look again later today.

@edoakes
Contributor

edoakes commented Dec 17, 2020

Hey @roireshef I see the issue you linked is resolved now, can we close this one?

@roireshef
Contributor Author

roireshef commented Dec 17, 2020

Hi @edoakes, thanks for following up. I launched larger-scale experiments (~30 nodes each) a few hours ago and I'll report the results tomorrow. So far it looks good, but just to make sure, I'd like to let it run for ~1 day or so.

@roireshef
Contributor Author

I'm closing this one. I still feel that this doesn't scale the way I expected, but I'll have to run some more tests to see whether it's worth re-opening. Thanks for your help!

@roireshef
Contributor Author

roireshef commented Jan 2, 2021

This issue isn't resolved, but rather mitigated to some degree. Previously I was experiencing the slowdowns when scheduling 4 or more trials (each assigned to a full, single 32-core node in the cluster) after 5-6 hours of training. Now it seems that for 4-8 nodes the slowdowns are reduced to the point where they aren't a real issue anymore (although they still exist to some lesser degree), but the slowdowns when scheduling 20-40 nodes are definitely there - they just start later (at around 25 hours from the start of training) - and they pretty much make Tune impractical for hyperparameter tuning at that scale.

I'm attaching some charts for evidence. Please let me know if you need anything else for debugging.

@roireshef roireshef reopened this Jan 2, 2021
@roireshef
Contributor Author

roireshef commented Jan 2, 2021

[screenshot: Screenshot from 2021-01-02 13-40-55]

As you can see above, if the initial rate of sampling/training had been maintained throughout the whole session, the trials would have reached 300M steps (my stopping criterion) in ~150 hours (6 days). In reality, the number of steps sampled/trained by that time is ~50M (1/6 of the expected steps), and to reach 300M it would take another ~1250 hours (52 days).

[screenshot: Screenshot from 2021-01-02 13-43-12]

@ericl ericl added the P0 label and removed the P1 label Jan 3, 2021
@ericl ericl pinned this issue Jan 3, 2021
@roireshef
Contributor Author

roireshef commented Jan 5, 2021

@ericl - if you need anything from my end (logs, etc.) please LMK.
Otherwise, I would like to terminate those trials to free the resources assigned to them soon.

And thanks again!

@ericl ericl unpinned this issue Jan 11, 2021
@ericl
Contributor

ericl commented Feb 16, 2021

@richardliaw is this fixed now?

@richardliaw
Contributor

I was planning to leave this open until @sven1977 merges placement groups support for RLlib.

@richardliaw richardliaw added the P1 label and removed the P0 label Feb 17, 2021
@roireshef
Contributor Author

roireshef commented Feb 24, 2021


I was planning to leave this open until @sven1977 merges placement groups support for RLlib.

Any plans for when this will happen? It would be much appreciated if we could run RLlib/Tune at scale; I'm currently limited to 10 nodes per Tune experiment.

@roireshef
Contributor Author

@sven1977 @ericl @richardliaw ?

@ericl
Contributor

ericl commented May 1, 2021

Placement groups are now on by default in the latest release. Should this be closed?
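For reference, a minimal sketch of how a trial's resources can now be expressed as a placement group (assuming a recent Ray where tune.PlacementGroupFactory is exposed; the trainer name, env, and bundle sizes are illustrative placeholders, not this issue's exact config):

```python
from ray import tune

# One bundle for the trainer process plus one bundle per rollout worker, so the
# scheduler reserves a trial's CPUs atomically instead of piecemeal.
pgf = tune.PlacementGroupFactory([{"CPU": 1}] + [{"CPU": 1}] * 31)

tune.run(
    "A3C",
    config={"env": "YourEnv-v0", "num_workers": 31},  # placeholders
    resources_per_trial=pgf,
)
```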

@roireshef
Contributor Author

roireshef commented May 1, 2021

Sounds good, thanks. I must say I haven't migrated to the latest release yet.

@ericl - is the latest release well benchmarked with respect to this issue? Were you able to spawn a few tens of trials with a single Tune call and maintain consistent throughput over a day or two of training?

@richardliaw
Contributor

This should be addressed; I just created a new issue to track the remaining work item.
