[tune] [rllib] Tune trials slowing down and reporting incorrect time #12352
The TensorBoard charts show the sampling rate gradually slowing down at some point (the charts show all trials of that same experiment). I'm attaching the top output for one node in the cluster (sumo is an external process spawned by the environment - it's a traffic simulator the environment's code interacts with). Note how the memory consumption of the RLlib workers went up for processes that have existed for more than a few hours, while the new (possibly recovered) process has lower memory consumption. Also, some workers exhibit relatively high CPU usage while others are stuck at 0%. Finally, I'm also attaching the raylet logs from that same machine. I'm currently debugging under the assumption that this is a memory issue, but:
@ericl , @richardliaw - I think I have an interesting insight about this issue. I ran two experiments with everything configured exactly the same (both hardware and software), except that one experiment (referred to as "stress_test") was given enough nodes to run all trials simultaneously (48 nodes x 32 cores), while the other one (referred to as "debug") was given only 10 nodes (but the experiment still has 48 trials).

Now, see how the trials accumulate steps - the number of steps sampled/trained is shown in the next charts for trials from both experiments. The debug trials have a steady bandwidth, accumulating steps linearly over time, while the stress_test ones are flattening out (their steps-per-time rate is gradually decreasing). The only difference between the two is the number of nodes in the cluster that Tune has to manage simultaneously. The linear lines (debug trials) have occasional drops due to crashes and recoveries I'm aware of, which are fixed in my newer environment code, so please ignore them.

Another interesting finding: when running top on two nodes (one from each experiment), the node that belongs to stress_test shows ~2GB memory used for all Ray worker processes, with quite a lot of them at 0% CPU utilization, whereas the debug node's processes have stabilized at ~1GB or ~1.5GB (depending on num_envs_per_worker, which is tested at 3 and 5, respectively), with all processes frequently utilizing CPU well.
@roireshef do you have output logs from the Tune trial? I'm wondering if this is some artifact of checkpointing. FYI, we don't have long-term support for older versions like 0.7.2.
Hi @richardliaw,
I'm pretty sure it isn't related to checkpointing: while I've set checkpointing to happen every 50 iterations, you can see the training speed degrades very smoothly over time. Can you use the logs in #12352 (comment)?
I understand. I tried migrating from 0.7.2 to several versions (0.8.6, 0.8.7, 1.0.1, 1.1.0), and each one had at least one issue that prevented me from migrating successfully. I've been waiting for those to be fixed for some time, but had to continue my actual research; I will try again in about a month from today. That said, I'm pretty sure this specific issue reproduces on newer versions as well. I recently provided @ericl with reproduction instructions in #11239 (comment) - it seems my environment is heavier and more memory-demanding than the ones often used, although here specifically the correlation with the number of nodes used by Tune points to a Ray issue rather than the environment implementation.
I'm referring to an internal Tune checkpointing mechanism. Unfortunately, the logs posted do not capture stdout/stderr of the main Python process. Can you please provide stdout/stderr of the Python process? If you plan to rerun this, it'd be great to set
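(Since the exact setting is cut off above, here is only a rough sketch of one way to capture the driver output and raise the log level on a rerun - the specific flags below are assumptions on my part, not the cut-off recommendation:)

```python
import logging

import ray
from ray import tune

# Assumption: DEBUG-level Ray logs plus verbose Tune output are what is being
# asked for. The driver's own stdout/stderr can be captured with a shell
# redirect such as `python train.py > driver.log 2>&1`.
logging.basicConfig(level=logging.DEBUG)
ray.init(logging_level=logging.DEBUG)


def my_trainable(config):
    # Stand-in for the real (confidential) training function.
    tune.report(score=0)


tune.run(my_trainable, verbose=2)  # verbose=2 prints status updates and trial results
```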
This kind of reminds me of #4211, though that issue should be solved. It does sound to me like some problem in the Tune event loop and not Ray though.
@richardliaw running a similar workload on 50 nodes, I see warnings like this:

2020-11-28 03:12:20,616 WARNING util.py:143 -- The

This could be an O(n^2) kind of issue: as the number of nodes grows, the number of steps per second that must be processed increases. At the same time, the amount of work per step (calling ray.cluster_resources() etc. in on_step_begin) also increases linearly. At some point the n^2 factor overwhelms Tune.

@roireshef I think the large number of concurrent trials is causing Tune to become backlogged processing trial results. One way of fixing this is to increase the reporting period of the trials. In RLlib you can set
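(For concreteness, a minimal sketch of what increasing the reporting period could look like in an RLlib config - `min_iter_time_s` and `timesteps_per_iteration` are the keys referenced later in this thread; the algorithm, env, and exact values are placeholders:)

```python
from ray import tune

# Larger values mean each trial reports a result to the Tune driver less often,
# so fewer results per second need to be processed as the trial count grows.
config = {
    "env": "CartPole-v0",              # placeholder; the real env is confidential
    "num_workers": 31,                 # one full 32-core node per trial, as above
    "min_iter_time_s": 120,            # at most ~one result every 2 minutes
    "timesteps_per_iteration": 10000,  # and at least 10k env steps per result
}

tune.run("PPO", config=config, num_samples=48)  # 48 concurrent trials, as above
```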
@ericl I'll do that soon and report back, thanks.
While the summary table for the larger cluster shows:
I emphasized the time and steps columns. The progress rate for the trials in the 36-node cluster is ~30% of the smaller cluster's trials for both time and steps, so the problem remains. I ran top on one node from each cluster: while the smaller cluster's node shows RolloutWorker consuming ~80% CPU most of the time, the larger cluster's node already shows RolloutWorker consuming 0% CPU - see the following snapshot:

I'll wait a few more hours and see if the steps/hour demonstrate the same behavior as in #12352 (comment), and I will also provide the logs again. This time I'm running with the latest nightly wheel, meaning the logs will be more detailed (although I forgot to set the logging level to DEBUG this time - sorry @richardliaw - but I think they do include the main process outputs as well).

@ericl - I'm not sure I understand what exactly is O(n^2) here...? (n being the number of nodes in the cluster)
Yes, it's clear. The 36-node experiment just stops making any progress after less than a full hour of training. The 3 trials that did make progress are the ones that belong to the 3-node cluster... @richardliaw, attached:
@ericl , @richardliaw - Whichever machine I connected to and tried to pull the temporary logs out of, I ended up with the same files. I was under the impression each node stores its own temporary logs. Is it the case that all nodes communicate temporary logs with one another? @ericl is this why you said it's an O(n^2) thing? Can this be avoided in any way?
Taking a look!
@roireshef how did you generate the plot in #12352 (comment)? Is it at all related to the logs shown in the console output from the main Python process? In the logs, I'm seeing that every process is making progress and incrementing the "timesteps_total" value. If you look closely, every trial at training_iteration = 30 has passed 1e7 timesteps. What am I missing? Also, for
Yes, the chart in that comment is produced by TensorBoard, overlaying this 36-node experiment with another 3-node experiment as a baseline (the consistent linear progress is made by the 3 trials in the 3-node experiment).
@richardliaw - That's very strange. Where exactly do you see 1e7 timesteps? (Although 1e7 is also a bad number - it should have been on the scale of 4e7-5e7 timesteps at that point, after more than 24h of training.) Are you sure we are looking at the same files? (The logs archive in #12352 (comment) doesn't have a session_latest soft-link at all; please confirm that you are reading files from this folder in the archive: session_2020-11-29_14-05-25_874447_1.)

The main_cleaned.tar.gz file ends with the following lines: and the last Tune summary table clearly shows 4000-5000 [sec] in the total time (s) column (although 24h have passed since the start of the experiment - see the date-time reported in some lines, and compare the first summary table to the last one in that file), and the timesteps column has values in the range 1e6-2e6...

BTW, what I see in main_cleaned.tar.gz is that all trials have reported only 60-80 iterations (not 30) - please consider this is after 24h of training... 3 iterations/hour is very far from what you'd expect when min_iter_time_s=120sec and timesteps_per_iteration=10000.
I guess this is pending #12360 now...
Ah, I meant 1e6, sorry! Let me take a look again later today.
Hey @roireshef, I see the issue you linked is resolved now; can we close this one?
Hi @edoakes, thanks for following up. I just ran larger-scale experiments (~30 nodes each) a few hours ago; I'll report the results tomorrow. So far it looks good, but just to make sure, I'd like to let it run for ~1 day or so.
I'm closing this one, although I feel that this doesn't scale the way I expected. I'll have to do some more tests and see if this is worth re-opening. Thanks for your help!
This issue isn't resolved, but rather minimized to some degree. Previously I experienced the slowdowns after 5-6 hours of training when scheduling 4 or more trials (each one assigned to a full, single 32-core node in the cluster). Now it seems that for 4-8 nodes the slowdowns are minimized to the point that they aren't a real issue anymore (although they still exist to some lower degree), but the slowdowns when scheduling 20-40 nodes are definitely there - just postponed to start later (at around 25 hours from the start of training) - and they pretty much make Tune impractical for hyperparameter tuning at that scale. I'm attaching some charts as evidence. Please let me know if you need anything else for debugging.
As you can see above, if the rate of sampling/training had been maintained throughout the whole session, the trials would have reached 300M steps (my stopping criterion) in ~150 hours (~6 days). In reality, the number of steps sampled/trained at that time is ~50M (1/6 of the expected steps), and to reach 300M it would take another 1250 hours (~52 days).
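(Just to make the arithmetic above explicit, using only the numbers from this comment:)

```python
expected_total = 300e6       # stopping criterion, in env steps
expected_hours = 150         # projected duration at the initial rate
actual_steps = 50e6          # actually reached after ~150 hours
extra_hours = 1250           # projected time to finish the remaining steps

initial_rate = expected_total / expected_hours             # ~2.0M steps/hour
tail_rate = (expected_total - actual_steps) / extra_hours  # ~0.2M steps/hour
print(f"slowdown: ~{initial_rate / tail_rate:.0f}x")       # roughly a 10x slowdown
```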
@ericl - if you need anything from my end (logs, etc.) please LMK. And thanks again!
@richardliaw is this fixed now?
I was planning to leave this open until @sven1977 merges placement groups support for RLlib.
Any plans for when this will happen? It would be much appreciated if we could run RLlib/Tune at scale. I'm currently limited to 10 nodes per Tune experiment.
Placement groups are now on by default in the latest release. Should this be closed?
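(For later readers, a rough sketch of what a placement group reservation looks like at the Ray core level; Tune/RLlib construct these automatically when placement groups are enabled, and the bundle layout below is only illustrative of the one-trial-per-32-core-node setup discussed above:)

```python
import ray
from ray.util.placement_group import placement_group

ray.init(address="auto")

# One CPU bundle per process in a trial (1 trainer + 31 rollout workers).
# STRICT_PACK keeps the whole trial on a single node, so its resources are
# reserved atomically instead of workers from different trials competing
# for cores on the same machine.
bundles = [{"CPU": 1}] * 32
pg = placement_group(bundles, strategy="STRICT_PACK")
ray.get(pg.ready())  # blocks until the cluster can satisfy the reservation
```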
Sounds good, thanks. I haven't migrated to the latest release yet, I must say. @ericl - is the latest release well benchmarked with respect to this issue? Were you able to spawn a few tens of trials with a single Tune call and maintain consistent bandwidth throughout a day or two of training?
This should be addressed; I just created a new issue to track the remaining work item.
What is the problem?
I'm running Tune for hyperparameter search with single-node trials that should last 200M steps each and that run simultaneously. My Ray cluster is a private cluster setup. All of the trials start at a rate of ~500 steps/sec, and after two to three hours they all start slowing down dramatically.
Since they all start at the same time (+-1 min), I expected Tune to show a similar elapsed time for all of them (in the summary table it prints occasionally), but the time varies greatly between trials.
Ray version and other system information (Python version, TensorFlow version, OS):
Ray 0.7.2 (I'm pretty sure it reproduced in 1.0.1 as well)
Docker 19.03.11
PyTorch 1.5.1
Ubuntu 16.04 / SUSE Linux Enterprise Server 12 SP4
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
I'm running FLOW-based environment code with confidential logic I can't publish. I'm attaching logs instead.
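Since the actual script can't be shared, here is a rough, hypothetical sketch of the overall shape of the workload (the algorithm and env name are placeholders, not the confidential setup; the numbers come from the description above):

```python
import ray
from ray import tune

ray.init(address="auto")  # connect to the existing private cluster

# Each trial occupies a full 32-core node (1 trainer + 31 rollout workers),
# runs until 200M environment steps, and many such trials run concurrently.
tune.run(
    "PPO",                                  # placeholder algorithm
    num_samples=48,                         # one trial per node in the large runs
    stop={"timesteps_total": 200_000_000},
    checkpoint_freq=50,                     # checkpoint every 50 iterations
    config={
        "env": "CartPole-v0",               # placeholder for the FLOW-based traffic env
        "num_workers": 31,
        "num_envs_per_worker": 3,           # 3 and 5 were both tested
    },
)
```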
If we cannot run your script, we cannot fix your issue.