amogkam changed the title from "[ml] Trainer failure stack traces are too long" to "[ml] Trainer failure stack traces are too long and hide the original error" on Mar 22, 2022
Sounds good. Looking at it now. One piece of low-hanging fruit is #23475.
So basically we have:
runner thread (where it actually happens) --> main thread --> Tune driver (wrapped as a RayTaskError) --> Write to disk (error.txt) --> load up when creating ResultGrid
Amog's suggestion applies to the first transition. The rest of the pipeline would also need to be updated.
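To make the chain above concrete, here is a minimal sketch of how each hop can bury the original error under another wrapper, and how the root cause can be recovered by walking the exception chain. It only models in-process hops: `WrappedTaskError`, `runner`, `main_thread`, `driver`, and `root_cause` are hypothetical names for illustration, not Ray APIs, and the sketch does not cover the hops that cross a process boundary or go through `error.txt`, where the chain is no longer a live Python object.

```python
class WrappedTaskError(Exception):
    """Hypothetical stand-in for a wrapper error added by an outer layer."""


def root_cause(exc: BaseException) -> BaseException:
    """Follow __cause__/__context__ links down to the innermost exception."""
    seen = {id(exc)}
    while True:
        nxt = exc.__cause__ or exc.__context__
        # Stop at the end of the chain, or if the chain loops back on itself.
        if nxt is None or id(nxt) in seen:
            return exc
        seen.add(id(nxt))
        exc = nxt


def runner():
    # The user's training function fails with the "real" error.
    return undefined_name  # noqa: F821 (deliberate NameError)


def main_thread():
    try:
        runner()
    except Exception as e:
        # First hop: the error is wrapped instead of re-raised as-is.
        raise WrappedTaskError("task failed") from e


def driver():
    try:
        main_thread()
    except Exception as e:
        # Second hop: another layer, another wrapper.
        raise RuntimeError("trial failed") from e


def demo() -> BaseException:
    try:
        driver()
    except Exception as e:
        return root_cause(e)
    raise AssertionError("driver() was expected to fail")
```

Calling `demo()` returns the original `NameError` rather than either wrapper, which is the error a user would want to see first.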
Running this script
results in this stack trace
The actual error (`NameError`) is repeated multiple times, and it is only shown as a log message rather than raised as the actual error.
We should improve this by:
1. Deduplicating the `NameError` stack traces by only showing them once.
2. Raising the original error instead of a `TuneError`/`TrainingFailedError`.
One way to implement 2 is for Tune to use Ray Train's thread utility, which lets you raise errors from the child thread in the main thread: https://github.com/ray-project/ray/blob/master/python/ray/train/utils.py#L88-L102
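For readers unfamiliar with the pattern, the following is a minimal sketch of a thread that captures an exception raised in the child thread and re-raises it in the caller on `join()`. It is not the Ray Train utility linked above; `PropagatingThread` and `run_in_thread` are hypothetical names, and the real helper may differ in its details.

```python
import threading


class PropagatingThread(threading.Thread):
    """Hypothetical helper: re-raises the child's exception on join()."""

    def __init__(self, target, *args, **kwargs):
        super().__init__(daemon=True)
        self._fn = target
        self._fn_args = args
        self._fn_kwargs = kwargs
        self.exc = None
        self.ret = None

    def run(self):
        try:
            self.ret = self._fn(*self._fn_args, **self._fn_kwargs)
        except BaseException as e:
            # Store the exception instead of letting the thread swallow it.
            self.exc = e

    def join(self, timeout=None):
        super().join(timeout)
        if self.exc is not None:
            # Re-raise the original exception object in the calling thread,
            # so its type and traceback are preserved.
            raise self.exc
        return self.ret


def run_in_thread(fn, *args, **kwargs):
    """Run fn in a child thread and surface its result or its error."""
    thread = PropagatingThread(fn, *args, **kwargs)
    thread.start()
    return thread.join()
```

With this approach the caller sees the `NameError` itself, so no wrapper error is needed at this hop. The later hops in the pipeline would still need their own handling.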