[tune] [rllib] Tune trials slowing down and reporting incorrect time #12352
The TensorBoard charts show the sampling rate gradually slowing down at some point (the charts show all trials of that same experiment). I'm attaching the top output for one node in the cluster (sumo is an external process spawned by the environment - it's a traffic simulator the environment's code interacts with). Note how the memory consumption of the RLlib workers went up for processes that have existed for more than a few hours, while the new (possibly recovered) process has lower memory consumption. Also, some workers exhibit relatively high CPU usage while others are stuck at 0%. Finally, I'm also attaching the raylet logs from that same machine. I'm currently debugging under the assumption that this is a memory issue, but:
@ericl , @richardliaw - I think I have an interesting insight about this issue. I ran two experiments with everything configured exactly the same (both hardware and software), except that one experiment (referred to as "stress_test") was given enough nodes to run all trials simultaneously (48 nodes x 32 cores), while the other one (referred to as "debug") was given only 10 nodes (but the experiment still has 48 trials).

Now, see how the trials accumulate steps - the number of steps sampled/trained is shown in the next charts for trials from both experiments. The debug trials have a steady bandwidth, accumulating steps linearly over time, while the stress_test ones are flattening out (their steps-per-time rate is gradually decreasing). The only difference between the two is the number of nodes in the cluster that Tune has to manage simultaneously. The linear lines (debug trials) have occasional drops due to crashes and recoveries I'm aware of, which are fixed in my newer environment code, so please ignore them.

Another interesting finding: when running top on two nodes (one from each experiment), the node that belongs to stress_test shows ~2GB memory used for all Ray worker processes, with quite a lot of them at 0% CPU utilization, whereas the debug node's processes have stabilized at ~1GB or ~1.5GB (depending on num_envs_per_worker, which is tested at 3 and 5, respectively), with all processes frequently utilizing CPU well.
@roireshef do you have output logs from the Tune trial? I'm wondering if this is some artifact of checkpointing. FYI, we don't have long-term support for older versions like 0.7.2.
Hi @richardliaw,
I'm pretty sure it isn't related to checkpointing: while I've set checkpointing to happen every 50 iterations, you can see the training speed degrades very smoothly over time. Can you use the logs in #12352 (comment)?
I understand. I tried migrating from 0.7.2 to several versions (0.8.6, 0.8.7, 1.0.1, 1.1.0), and each one had at least one issue that prevented me from migrating successfully. I've been waiting for those to be fixed for some time, but had to continue my actual research; I will try again in about a month from today. That said, I'm pretty sure this specific issue reproduces on newer versions as well. I recently provided @ericl with reproduction instructions in #11239 (comment) - it seems my environment is heavier and more memory-demanding than the ones often used, although here specifically the correlation with the number of nodes used by Tune points to a Ray issue rather than the environment implementation.
I'm referring to an internal Tune checkpointing mechanism. Unfortunately, the logs posted do not capture stdout/stderr of the main Python process. Can you please provide stdout/stderr of the Python process? If you plan to rerun this, it'd be great to set
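(Since the exact setting is cut off above, here is only a rough sketch of one way to capture the driver output and raise the log level on a rerun - the specific flags below are assumptions on my part, not the cut-off recommendation:)

```python
import logging

import ray
from ray import tune

# Assumption: DEBUG-level Ray logs plus verbose Tune output are what is being
# asked for. The driver's own stdout/stderr can be captured with a shell
# redirect such as `python train.py > driver.log 2>&1`.
logging.basicConfig(level=logging.DEBUG)
ray.init(logging_level=logging.DEBUG)


def my_trainable(config):
    # Stand-in for the real (confidential) training function.
    tune.report(score=0)


tune.run(my_trainable, verbose=2)  # verbose=2 prints status updates and trial results
```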
This kind of reminds me of #4211, though that issue should be solved. It does sound to me like some problem in the Tune event loop and not Ray though.
@richardliaw running a similar workload on 50 nodes, I see warnings like this:

2020-11-28 03:12:20,616 WARNING util.py:143 -- The

This could be an O(n^2) kind of issue: as the number of nodes grows, the number of steps per second that must be processed increases. At the same time, the amount of work per step (calling ray.cluster_resources() etc. in on_step_begin) also increases linearly. At some point the n^2 factor overwhelms Tune.

@roireshef I think the large number of concurrent trials is causing Tune to become backlogged processing trial results. One way of fixing this is to increase the reporting period of the trials. In RLlib you can set
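(For concreteness, a minimal sketch of what increasing the reporting period could look like in an RLlib config - `min_iter_time_s` and `timesteps_per_iteration` are the keys referenced later in this thread; the algorithm, env, and exact values are placeholders:)

```python
from ray import tune

# Larger values mean each trial reports a result to the Tune driver less often,
# so fewer results per second need to be processed as the trial count grows.
config = {
    "env": "CartPole-v0",              # placeholder; the real env is confidential
    "num_workers": 31,                 # one full 32-core node per trial, as above
    "min_iter_time_s": 120,            # at most ~one result every 2 minutes
    "timesteps_per_iteration": 10000,  # and at least 10k env steps per result
}

tune.run("PPO", config=config, num_samples=48)  # 48 concurrent trials, as above
```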
@ericl I'll do that soon and report back, thanks.
While the summary table for the larger cluster shows:
I emphasized the time and steps columns. The progress rate for the trials in the 36-node cluster is ~30% of the smaller cluster's trials for both time and steps, so the problem remains. I ran top on one node from each cluster: while the smaller cluster's node shows RolloutWorker consuming ~80% CPU most of the time, the larger cluster's node already shows RolloutWorker consuming 0% CPU - see the following snapshot:

I'll wait a few more hours and see if the steps/hour demonstrate the same behavior as in #12352 (comment), and I will also provide the logs again. This time I'm running with the latest nightly wheel, meaning the logs will be more detailed (although I forgot to set the logging level to DEBUG this time - sorry @richardliaw - but I think they do include the main process outputs as well).

@ericl - I'm not sure I understand what exactly is O(n^2) here...? (n being the number of nodes in the cluster)
Yes, it's clear. The 36-node experiment just stops making any progress after less than a full hour of training. The 3 trials that did make progress are the ones that belong to the 3-node cluster... @richardliaw, attached:
@ericl , @richardliaw - Whichever machine I connected to and tried to pull the temporary logs out of, I ended up with the same files. I was under the impression each node stores its own temporary logs. Is it the case that all nodes communicate temporary logs with one another? @ericl is this why you said it's an O(n^2) thing? Can this be avoided in any way?
Taking a look!
@roireshef how did you generate the plot in #12352 (comment)? Is it at all related to the logs shown in the console output from the main Python process? In the logs, I'm seeing that every process is making progress and incrementing the "timesteps_total" value. If you look closely, every trial at training_iteration = 30 has passed 1e7 timesteps. What am I missing? Also, for
Yes, the chart in that comment is produced by TensorBoard, overlaying this 36-node experiment with another 3-node experiment as a baseline (the consistent linear progress is made by the 3 trials in the 3-node experiment).
@richardliaw - That's very strange. Where exactly do you see 1e7 timesteps? (Although 1e7 is also a bad number - it should have been on the scale of 4e7-5e7 timesteps at that point, after more than 24h of training.) Are you sure we are looking at the same files? (The logs archive in #12352 (comment) doesn't have a session_latest soft-link at all; please confirm that you are reading files from this folder in the archive: session_2020-11-29_14-05-25_874447_1.)

The main_cleaned.tar.gz file ends with the following lines: and the last Tune summary table clearly shows 4000-5000 [sec] in the total time (s) column (although 24h have passed since the start of the experiment - see the date-time reported in some lines, and compare the first summary table to the last one in that file), and the timesteps column has values in the range 1e6-2e6...

BTW, what I see in main_cleaned.tar.gz is that all trials have reported only 60-80 iterations (not 30) - please consider this is after 24h of training... 3 iterations/hour is very far from what you'd expect when min_iter_time_s=120sec and timesteps_per_iteration=10000.
I guess this is pending #12360 now...
Ah, I meant 1e6, sorry! Let me take a look again later today.
Hey @roireshef, I see the issue you linked is resolved now; can we close this one?
Hi @edoakes, thanks for following up. I just ran larger-scale experiments (~30 nodes each) a few hours ago; I'll report the results tomorrow. So far it looks good, but just to make sure, I'd like to let it run for ~1 day or so.
I'm closing this one, although I feel that this doesn't scale the way I expected. I'll have to do some more tests and see if this is worth re-opening. Thanks for your help!
This issue isn't resolved, but rather minimized to some degree. Previously I experienced the slowdowns after 5-6 hours of training when scheduling 4 or more trials (each one assigned to a full, single 32-core node in the cluster). Now it seems that for 4-8 nodes the slowdowns are minimized to the point that they aren't a real issue anymore (although they still exist to some lower degree), but the slowdowns when scheduling 20-40 nodes are definitely there - just postponed to start later (at around 25 hours from the start of training) - and they pretty much make Tune impractical for hyperparameter tuning at that scale. I'm attaching some charts as evidence. Please let me know if you need anything else for debugging.
As you can see above, if the rate of sampling/training had been maintained throughout the whole session, the trials would have reached 300M steps (my stopping criterion) in ~150 hours (~6 days). In reality, the number of steps sampled/trained at that time is ~50M (1/6 of the expected steps), and to reach 300M it would take another 1250 hours (~52 days).
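(Just to make the arithmetic above explicit, using only the numbers from this comment:)

```python
expected_total = 300e6       # stopping criterion, in env steps
expected_hours = 150         # projected duration at the initial rate
actual_steps = 50e6          # actually reached after ~150 hours
extra_hours = 1250           # projected time to finish the remaining steps

initial_rate = expected_total / expected_hours             # ~2.0M steps/hour
tail_rate = (expected_total - actual_steps) / extra_hours  # ~0.2M steps/hour
print(f"slowdown: ~{initial_rate / tail_rate:.0f}x")       # roughly a 10x slowdown
```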
@ericl - if you need anything from my end (logs, etc.) please LMK. And thanks again!
@richardliaw is this fixed now?
I was planning to leave this open until @sven1977 merges placement groups support for RLlib.
Any plans for when this will happen? It would be much appreciated if we could run RLlib/Tune at scale. I'm currently limited to 10 nodes per Tune experiment.
Placement groups are now on by default in the latest release. Should this be closed?
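(For later readers, a rough sketch of what a placement group reservation looks like at the Ray core level; Tune/RLlib construct these automatically when placement groups are enabled, and the bundle layout below is only illustrative of the one-trial-per-32-core-node setup discussed above:)

```python
import ray
from ray.util.placement_group import placement_group

ray.init(address="auto")

# One CPU bundle per process in a trial (1 trainer + 31 rollout workers).
# STRICT_PACK keeps the whole trial on a single node, so its resources are
# reserved atomically instead of workers from different trials competing
# for cores on the same machine.
bundles = [{"CPU": 1}] * 32
pg = placement_group(bundles, strategy="STRICT_PACK")
ray.get(pg.ready())  # blocks until the cluster can satisfy the reservation
```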
Sounds good, thanks. I haven't migrated to the latest release yet, I must say. @ericl - is the latest release well benchmarked with respect to this issue? Were you able to spawn a few tens of trials with a single Tune call and maintain consistent bandwidth throughout a day or two of training?
This should be addressed; I just created a new issue to track the remaining work item.
What is the problem?
I'm running Tune for hyperparameter search with single-node trials that should last 200M steps each and that run simultaneously. My Ray cluster is a private cluster setup. All of the trials start at a rate of ~500 steps/sec, and after two to three hours they all start slowing down dramatically.
Since they all start at the same time (+-1 min), I expected Tune to show a similar elapsed time for all of them (in the summary table it prints occasionally), but the time varies greatly between trials.
Ray version and other system information (Python version, TensorFlow version, OS):
Ray 0.7.2 (I'm pretty sure it reproduced in 1.0.1 as well)
Docker 19.03.11
PyTorch 1.5.1
Ubuntu 16.04 / SUSE Linux Enterprise Server 12 SP4
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
I'm running FLOW-based environment code with confidential logic I can't publish. I'm attaching logs instead.
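Since the actual script can't be shared, here is a rough, hypothetical sketch of the overall shape of the workload (the algorithm and env name are placeholders, not the confidential setup; the numbers come from the description above):

```python
import ray
from ray import tune

ray.init(address="auto")  # connect to the existing private cluster

# Each trial occupies a full 32-core node (1 trainer + 31 rollout workers),
# runs until 200M environment steps, and many such trials run concurrently.
tune.run(
    "PPO",                                  # placeholder algorithm
    num_samples=48,                         # one trial per node in the large runs
    stop={"timesteps_total": 200_000_000},
    checkpoint_freq=50,                     # checkpoint every 50 iterations
    config={
        "env": "CartPole-v0",               # placeholder for the FLOW-based traffic env
        "num_workers": 31,
        "num_envs_per_worker": 3,           # 3 and 5 were both tested
    },
)
```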
If we cannot run your script, we cannot fix your issue.