-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[air] reduce unnecessary stacktrace #23475
Conversation
Nice! Do you have a before/after of this? |
Before: described in the original ticket
|
this part is printed multiple times as well:
But I suspect this is somewhere in ray core code. |
@amogkam |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome!!! Looks really good from my end! But will wait for other folks' review for final approval here.
python/ray/tune/result_grid.py
Outdated
@@ -58,9 +59,10 @@ def __getitem__(self, i) -> Result: | |||
|
|||
@staticmethod | |||
def _populate_exception(trial: Trial) -> Optional[TuneError]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return type is no longer just TuneError right?
Btw I think the assumption we are making here is that all exceptions are serializable. Should we fall back to the original approach if this is not the case? |
python/ray/ml/tests/test_trainer.py
Outdated
@@ -154,7 +155,7 @@ def fail(self): | |||
raise ValueError | |||
|
|||
trainer = DummyTrainer(fail) | |||
with pytest.raises(TrainingFailedError): | |||
with pytest.raises(RayTaskError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with pytest.raises(RayTaskError): | |
with pytest.raises(ValueError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need this to be ValueError? I thought you mentioned RayTaskError is ok as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to ValueError.
python/ray/tune/error.py
Outdated
@@ -8,3 +11,10 @@ class AbortTrialExecution(TuneError): | |||
"""Error that indicates a trial should not be retried.""" | |||
|
|||
pass | |||
|
|||
|
|||
def get_tb_from_exception(e: Exception): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def get_tb_from_exception(e: Exception): | |
def get_tb_from_exception(e: Exception) -> str: |
Thanks @xwjiang2010! I'm not too familiar with this code, but a couple points
|
Yeah, I was also thinking about that. Consolidating to one would be nice.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good, I have a few questions.
Also, can we test this code with unpickable exceptions in a function trainable, a class trainable, and in a callback (so driver side)?
Example for unpickable exception:
import pickle
class UnpickableException(RuntimeError):
def __init__(self, message: str):
super(UnpickableException, self).__init__(message)
def __reduce__(self):
raise TypeError("Can't pickle this exception")
exception_blob = None
try:
raise UnpickableException("Fail")
except Exception as e:
exception_blob = pickle.dumps(e)
print("Exception blob:", exception_blob)
Hmmm, we only pickle the RayTaskError: And if somehow trainable throws non-serializable errors,
This means RayTaskError must be serializable itself. So as long as we stick with RayTaskError, we should be good. |
python/ray/tune/trial_runner.py
Outdated
|
||
def get_trial(self, tid): | ||
trial = [t for t in self._trials if t.trial_id == tid] | ||
return trial[0] if trial else None | ||
|
||
def get_trials(self): | ||
"""Returns the list of trials managed by this TrialRunner. | ||
"""Returns the list of triaget_tb_from_exceptionls managed by this TrialRunner. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit this var name seems off
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah good catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, minor nit
Why are these changes needed?
There are a few changes:
_report_thread_runner_error
in main thread. So we could spare this raise in runner thread.Plus some cleanups to facilitate propagating exception in runner and executor code.
Final stacktrace looks like:
In Tune, we are capturing
traceback.format_exc
at the time the exception is caught and just pass the string around. This PR slightly changes that only in the case of when RayTaskError is raised, and we pass that object around.It may be worthwhile to settle down on a practice of error handling in Tune in general.
I am also curious to learn how other ray library does that and any good lessons to learn.
In particular, we should watch out for memory leaking in exception handling. Not sure if it is still a problem in python 3, but here are some articles I came across for reference
https://cosmicpercolator.com/2016/01/13/exception-leaks-in-python-2-and-3/
Related issue number
#23405
Checks
scripts/format.sh
to lint the changes in this PR.