
[air] reduce unnecessary stacktrace #23475

Merged
11 commits merged on Mar 31, 2022

Conversation

xwjiang2010 (Contributor) commented Mar 24, 2022

Why are these changes needed?

There are a few changes:

  1. Between the runner thread and the main thread: the same stack trace is re-raised by _report_thread_runner_error in the main thread, so we can spare this raise in the runner thread.
  2. Between the function runner and the Tune driver: do not wrap RayTaskError in TuneError.
  3. Within the Tune driver code: introduce a per-trial error.pkl for errored trials and use it to populate ResultGrid (see the sketch below).

Plus some cleanups to facilitate propagating exceptions in the runner and executor code.
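
As a rough illustration of item 3 only (a minimal sketch under my own assumptions, with a hypothetical helper name and file layout rather than the PR's actual code), the idea is to persist the caught exception object alongside the existing human-readable error file so the original error type survives:

# Hedged sketch, not the real Tune internals: save_trial_error is hypothetical.
import os
import pickle
import traceback


def save_trial_error(trial_logdir: str, exc: Exception) -> None:
    # error.txt stays the user-facing, plain-text record (the traceback string).
    tb_str = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    with open(os.path.join(trial_logdir, "error.txt"), "w") as f:
        f.write(tb_str)
    # error.pkl additionally keeps the exception object itself, so the original
    # error type (e.g. the NameError inside a RayTaskError) can be surfaced later.
    with open(os.path.join(trial_logdir, "error.pkl"), "wb") as f:
        pickle.dump(exc, f)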

The final stack trace looks like:

(TrainTrainable pid=94824) 2022-03-25 09:22:40,653	ERROR function_runner.py:280 -- Runner Thread raised error.
(TrainTrainable pid=94824) Traceback (most recent call last):
(TrainTrainable pid=94824)   File "/Users/xwjiang/ray/python/ray/tune/function_runner.py", line 271, in run
(TrainTrainable pid=94824)     self._entrypoint()
(TrainTrainable pid=94824)   File "/Users/xwjiang/ray/python/ray/tune/function_runner.py", line 346, in entrypoint
(TrainTrainable pid=94824)     self._status_reporter.get_checkpoint(),
(TrainTrainable pid=94824)   File "/Users/xwjiang/ray/python/ray/util/tracing/tracing_helper.py", line 462, in _resume_span
(TrainTrainable pid=94824)     return method(self, *_args, **_kwargs)
(TrainTrainable pid=94824)   File "/Users/xwjiang/ray/python/ray/ml/trainer.py", line 295, in _trainable_func
(TrainTrainable pid=94824)     super()._trainable_func(self._merged_config, reporter, checkpoint_dir)
(TrainTrainable pid=94824)   File "/Users/xwjiang/ray/python/ray/tune/function_runner.py", line 633, in _trainable_func
(TrainTrainable pid=94824)     output = fn()
(TrainTrainable pid=94824)   File "/Users/xwjiang/ray/python/ray/ml/trainer.py", line 279, in train_func
(TrainTrainable pid=94824)     trainer.training_loop()
(TrainTrainable pid=94824)   File "/Users/xwjiang/ray/python/ray/ml/train/data_parallel_trainer.py", line 298, in training_loop
(TrainTrainable pid=94824)     for results in training_iterator:
(TrainTrainable pid=94824)   File "/Users/xwjiang/ray/python/ray/train/trainer.py", line 720, in __next__
(TrainTrainable pid=94824)     self._finish_training
(TrainTrainable pid=94824)   File "/Users/xwjiang/ray/python/ray/train/trainer.py", line 687, in _run_with_error_handling
(TrainTrainable pid=94824)     return func()
(TrainTrainable pid=94824)   File "/Users/xwjiang/ray/python/ray/train/trainer.py", line 791, in _finish_training
(TrainTrainable pid=94824)     return self._backend_executor.finish_training()
(TrainTrainable pid=94824)   File "/Users/xwjiang/ray/python/ray/train/backend.py", line 544, in finish_training
(TrainTrainable pid=94824)     results = self.get_with_failure_handling(futures)
(TrainTrainable pid=94824)   File "/Users/xwjiang/ray/python/ray/train/backend.py", line 563, in get_with_failure_handling
(TrainTrainable pid=94824)     success, failed_worker_indexes = check_for_failure(remote_values)
(TrainTrainable pid=94824)   File "/Users/xwjiang/ray/python/ray/train/utils.py", line 53, in check_for_failure
(TrainTrainable pid=94824)     ray.get(object_ref)
(TrainTrainable pid=94824)   File "/Users/xwjiang/ray/python/ray/_private/client_mode_hook.py", line 105, in wrapper
(TrainTrainable pid=94824)     return func(*args, **kwargs)
(TrainTrainable pid=94824)   File "/Users/xwjiang/ray/python/ray/worker.py", line 1809, in get
(TrainTrainable pid=94824)     raise value.as_instanceof_cause()
(TrainTrainable pid=94824) ray.exceptions.RayTaskError(NameError): ray::BaseWorkerMixin._BaseWorkerMixin__execute() (pid=94840, ip=127.0.0.1, repr=<ray.train.worker_group.BaseWorkerMixin object at 0x7ff72073c510>)
(TrainTrainable pid=94824)   File "/Users/xwjiang/ray/python/ray/train/worker_group.py", line 26, in __execute
(TrainTrainable pid=94824)     return func(*args, **kwargs)
(TrainTrainable pid=94824)   File "/Users/xwjiang/ray/python/ray/train/backend.py", line 535, in end_training
(TrainTrainable pid=94824)     output = session.finish()
(TrainTrainable pid=94824)   File "/Users/xwjiang/ray/python/ray/train/session.py", line 117, in finish
(TrainTrainable pid=94824)     func_output = self.training_thread.join()
(TrainTrainable pid=94824)   File "/Users/xwjiang/ray/python/ray/train/utils.py", line 101, in join
(TrainTrainable pid=94824)     raise self.exc
(TrainTrainable pid=94824)   File "/Users/xwjiang/ray/python/ray/train/utils.py", line 94, in run
(TrainTrainable pid=94824)     self.ret = self._target(*self._args, **self._kwargs)
(TrainTrainable pid=94824)   File "tryout_error.py", line 4, in train_loop_for_worker
(TrainTrainable pid=94824)     undefined_variable
(TrainTrainable pid=94824) NameError: name 'undefined_variable' is not defined
2022-03-25 09:22:40,786	ERROR trial_runner.py:886 -- Trial train_func_c7e0f_00000: Error processing event.
NoneType: None
Result for train_func_c7e0f_00000:
  date: 2022-03-25_09-22-38
  experiment_id: 5422f71bbd964595841e8ed5ef57c30f
  hostname: xw
  node_ip: 127.0.0.1
  pid: 94824
  timestamp: 1648225358
  trial_id: c7e0f_00000

== Status ==
Current time: 2022-03-25 09:22:40 (running for 00:00:04.92)
Memory usage on this node: 34.0/64.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/25.63 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/xwjiang/ray_results/train_func_2022-03-25_09-22-35
Number of trials: 1/1 (1 ERROR)
+------------------------+----------+-----------------+
| Trial name             | status   | loc             |
|------------------------+----------+-----------------|
| train_func_c7e0f_00000 | ERROR    | 127.0.0.1:94824 |
+------------------------+----------+-----------------+
Number of errored trials: 1
+------------------------+--------------+------------------------------------------------------------------------------------------------------------------+
| Trial name             |   # failures | error file                                                                                                       |
|------------------------+--------------+------------------------------------------------------------------------------------------------------------------|
| train_func_c7e0f_00000 |            1 | /Users/xwjiang/ray_results/train_func_2022-03-25_09-22-35/train_func_c7e0f_00000_0_2022-03-25_09-22-36/error.txt |
+------------------------+--------------+------------------------------------------------------------------------------------------------------------------+

2022-03-25 09:22:40,900	ERROR tune.py:698 -- Trials did not complete: [train_func_c7e0f_00000]
2022-03-25 09:22:40,900	INFO tune.py:703 -- Total run time: 5.19 seconds (4.92 seconds for the tuning loop).
Traceback (most recent call last):
  File "tryout_error.py", line 7, in <module>
    result = trainer.fit()
  File "/Users/xwjiang/ray/python/ray/ml/trainer.py", line 255, in fit
    raise result.error
types.RayTaskError(NameError): ray::TrainTrainable.train() (pid=94824, ip=127.0.0.1, repr=train_func)
  File "/Users/xwjiang/ray/python/ray/tune/trainable.py", line 349, in train
    result = self.step()
  File "/Users/xwjiang/ray/python/ray/tune/function_runner.py", line 398, in step
    self._report_thread_runner_error(block=True)
  File "/Users/xwjiang/ray/python/ray/tune/function_runner.py", line 562, in _report_thread_runner_error
    raise e
  File "/Users/xwjiang/ray/python/ray/tune/function_runner.py", line 271, in run
    self._entrypoint()
  File "/Users/xwjiang/ray/python/ray/tune/function_runner.py", line 346, in entrypoint
    self._status_reporter.get_checkpoint(),
  File "/Users/xwjiang/ray/python/ray/ml/trainer.py", line 295, in _trainable_func
    super()._trainable_func(self._merged_config, reporter, checkpoint_dir)
  File "/Users/xwjiang/ray/python/ray/tune/function_runner.py", line 633, in _trainable_func
    output = fn()
  File "/Users/xwjiang/ray/python/ray/ml/trainer.py", line 279, in train_func
    trainer.training_loop()
  File "/Users/xwjiang/ray/python/ray/ml/train/data_parallel_trainer.py", line 298, in training_loop
    for results in training_iterator:
  File "/Users/xwjiang/ray/python/ray/train/trainer.py", line 720, in __next__
    self._finish_training
  File "/Users/xwjiang/ray/python/ray/train/trainer.py", line 687, in _run_with_error_handling
    return func()
  File "/Users/xwjiang/ray/python/ray/train/trainer.py", line 791, in _finish_training
    return self._backend_executor.finish_training()
  File "/Users/xwjiang/ray/python/ray/train/backend.py", line 544, in finish_training
    results = self.get_with_failure_handling(futures)
  File "/Users/xwjiang/ray/python/ray/train/backend.py", line 563, in get_with_failure_handling
    success, failed_worker_indexes = check_for_failure(remote_values)
  File "/Users/xwjiang/ray/python/ray/train/utils.py", line 53, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(NameError): ray::BaseWorkerMixin._BaseWorkerMixin__execute() (pid=94840, ip=127.0.0.1, repr=<ray.train.worker_group.BaseWorkerMixin object at 0x7ff72073c510>)
  File "/Users/xwjiang/ray/python/ray/train/worker_group.py", line 26, in __execute
    return func(*args, **kwargs)
  File "/Users/xwjiang/ray/python/ray/train/backend.py", line 535, in end_training
    output = session.finish()
  File "/Users/xwjiang/ray/python/ray/train/session.py", line 117, in finish
    func_output = self.training_thread.join()
  File "/Users/xwjiang/ray/python/ray/train/utils.py", line 101, in join
    raise self.exc
  File "/Users/xwjiang/ray/python/ray/train/utils.py", line 94, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "tryout_error.py", line 4, in train_loop_for_worker
    undefined_variable
NameError: name 'undefined_variable' is not defined

In Tune, we capture traceback.format_exc at the time an exception is caught and just pass the string around. This PR slightly changes that: only in the case where a RayTaskError is raised, we pass that exception object around instead.
It may be worthwhile to settle on a general practice for error handling in Tune.
I am also curious to learn how the other Ray libraries handle this and whether there are good lessons to borrow.

In particular, we should watch out for memory leaks in exception handling. I am not sure whether this is still a problem in Python 3, but here is an article I came across for reference:
https://cosmicpercolator.com/2016/01/13/exception-leaks-in-python-2-and-3/
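
For reference, here is a small self-contained illustration of the concern (my own example, not taken from the article or from this PR): a stored exception keeps its traceback frames, and therefore their locals, alive until the frames are cleared or the exception is dropped.

import traceback


def leaky():
    big = bytearray(10 ** 7)  # stays reachable via the frame held by the traceback
    raise ValueError("boom")


stored = None
try:
    leaky()
except ValueError as e:
    stored = e  # keeping the exception also keeps e.__traceback__ and its frames

# Mitigations: keep only the formatted string, or clear the frames explicitly.
tb_str = "".join(traceback.format_exception(type(stored), stored, stored.__traceback__))
traceback.clear_frames(stored.__traceback__)  # releases the frames' locals (e.g. `big`)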

Related issue number

#23405

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@xwjiang2010 xwjiang2010 changed the title [air] reduce unnecessary stacktrace. [air] reduce unnecessary stacktrace - part 1. Mar 24, 2022
amogkam (Contributor) commented Mar 24, 2022

Nice! Do you have a before/after of this?

xwjiang2010 (Contributor, Author):

Before: described in the original ticket
After:

(TrainTrainable pid=45284) 2022-03-24 12:33:02,963	ERROR function_runner.py:281 -- Runner Thread raised error.
(TrainTrainable pid=45284) Traceback (most recent call last):
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/tune/function_runner.py", line 272, in run
(TrainTrainable pid=45284)     self._entrypoint()
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/tune/function_runner.py", line 350, in entrypoint
(TrainTrainable pid=45284)     self._status_reporter.get_checkpoint(),
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/util/tracing/tracing_helper.py", line 462, in _resume_span
(TrainTrainable pid=45284)     return method(self, *_args, **_kwargs)
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/ml/trainer.py", line 295, in _trainable_func
(TrainTrainable pid=45284)     super()._trainable_func(self._merged_config, reporter, checkpoint_dir)
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/tune/function_runner.py", line 639, in _trainable_func
(TrainTrainable pid=45284)     output = fn()
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/ml/trainer.py", line 279, in train_func
(TrainTrainable pid=45284)     trainer.training_loop()
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/ml/train/data_parallel_trainer.py", line 298, in training_loop
(TrainTrainable pid=45284)     for results in training_iterator:
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/train/trainer.py", line 720, in __next__
(TrainTrainable pid=45284)     self._finish_training
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/train/trainer.py", line 687, in _run_with_error_handling
(TrainTrainable pid=45284)     return func()
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/train/trainer.py", line 791, in _finish_training
(TrainTrainable pid=45284)     return self._backend_executor.finish_training()
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/train/backend.py", line 544, in finish_training
(TrainTrainable pid=45284)     results = self.get_with_failure_handling(futures)
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/train/backend.py", line 563, in get_with_failure_handling
(TrainTrainable pid=45284)     success, failed_worker_indexes = check_for_failure(remote_values)
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/train/utils.py", line 53, in check_for_failure
(TrainTrainable pid=45284)     ray.get(object_ref)
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/_private/client_mode_hook.py", line 105, in wrapper
(TrainTrainable pid=45284)     return func(*args, **kwargs)
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/worker.py", line 1809, in get
(TrainTrainable pid=45284)     raise value.as_instanceof_cause()
(TrainTrainable pid=45284) ray.exceptions.RayTaskError(NameError): ray::BaseWorkerMixin._BaseWorkerMixin__execute() (pid=45300, ip=127.0.0.1, repr=<ray.train.worker_group.BaseWorkerMixin object at 0x7fdfd84bea50>)
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/train/worker_group.py", line 26, in __execute
(TrainTrainable pid=45284)     return func(*args, **kwargs)
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/train/backend.py", line 535, in end_training
(TrainTrainable pid=45284)     output = session.finish()
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/train/session.py", line 117, in finish
(TrainTrainable pid=45284)     func_output = self.training_thread.join()
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/train/utils.py", line 101, in join
(TrainTrainable pid=45284)     raise self.exc
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/train/utils.py", line 94, in run
(TrainTrainable pid=45284)     self.ret = self._target(*self._args, **self._kwargs)
(TrainTrainable pid=45284)   File "tryout_error.py", line 4, in train_loop_for_worker
(TrainTrainable pid=45284)     undefined_variable
(TrainTrainable pid=45284) NameError: name 'undefined_variable' is not defined
(TrainTrainable pid=45284) 2022-03-24 12:33:02,966	ERROR worker.py:93 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::BaseWorkerMixin._BaseWorkerMixin__execute() (pid=45301, ip=127.0.0.1, repr=<ray.train.worker_group.BaseWorkerMixin object at 0x7fed18276b50>)
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/train/worker_group.py", line 26, in __execute
(TrainTrainable pid=45284)     return func(*args, **kwargs)
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/train/backend.py", line 535, in end_training
(TrainTrainable pid=45284)     output = session.finish()
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/train/session.py", line 117, in finish
(TrainTrainable pid=45284)     func_output = self.training_thread.join()
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/train/utils.py", line 101, in join
(TrainTrainable pid=45284)     raise self.exc
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/train/utils.py", line 94, in run
(TrainTrainable pid=45284)     self.ret = self._target(*self._args, **self._kwargs)
(TrainTrainable pid=45284)   File "tryout_error.py", line 4, in train_loop_for_worker
(TrainTrainable pid=45284)     undefined_variable
(TrainTrainable pid=45284) NameError: name 'undefined_variable' is not defined
2022-03-24 12:33:03,079	ERROR trial_runner.py:878 -- Trial train_func_3517d_00000: Error processing event.
NoneType: None
Result for train_func_3517d_00000:
  date: 2022-03-24_12-33-00
  experiment_id: 2eb2048c695140bfaffddcdcecb3622d
  hostname: xw
  node_ip: 127.0.0.1
  pid: 45284
  timestamp: 1648150380
  trial_id: 3517d_00000

== Status ==
Current time: 2022-03-24 12:33:03 (running for 00:00:05.91)
Memory usage on this node: 34.5/64.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/25.01 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/xwjiang/ray_results/train_func_2022-03-24_12-32-57
Number of trials: 1/1 (1 ERROR)
+------------------------+----------+-----------------+
| Trial name             | status   | loc             |
|------------------------+----------+-----------------|
| train_func_3517d_00000 | ERROR    | 127.0.0.1:45284 |
+------------------------+----------+-----------------+
Number of errored trials: 1
+------------------------+--------------+------------------------------------------------------------------------------------------------------------------+
| Trial name             |   # failures | error file                                                                                                       |
|------------------------+--------------+------------------------------------------------------------------------------------------------------------------|
| train_func_3517d_00000 |            1 | /Users/xwjiang/ray_results/train_func_2022-03-24_12-32-57/train_func_3517d_00000_0_2022-03-24_12-32-57/error.txt |
+------------------------+--------------+------------------------------------------------------------------------------------------------------------------+

== Status ==
Current time: 2022-03-24 12:33:03 (running for 00:00:05.91)
Memory usage on this node: 34.5/64.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/25.01 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/xwjiang/ray_results/train_func_2022-03-24_12-32-57
Number of trials: 1/1 (1 ERROR)
+------------------------+----------+-----------------+
| Trial name             | status   | loc             |
|------------------------+----------+-----------------|
| train_func_3517d_00000 | ERROR    | 127.0.0.1:45284 |
+------------------------+----------+-----------------+
Number of errored trials: 1
+------------------------+--------------+------------------------------------------------------------------------------------------------------------------+
| Trial name             |   # failures | error file                                                                                                       |
|------------------------+--------------+------------------------------------------------------------------------------------------------------------------|
| train_func_3517d_00000 |            1 | /Users/xwjiang/ray_results/train_func_2022-03-24_12-32-57/train_func_3517d_00000_0_2022-03-24_12-32-57/error.txt |
+------------------------+--------------+------------------------------------------------------------------------------------------------------------------+

2022-03-24 12:33:03,192	ERROR tune.py:698 -- Trials did not complete: [train_func_3517d_00000]
2022-03-24 12:33:03,192	INFO tune.py:703 -- Total run time: 6.17 seconds (5.91 seconds for the tuning loop).
Traceback (most recent call last):
  File "/Users/xwjiang/ray/python/ray/ml/trainer.py", line 255, in fit
    raise result.error
ray.tune.error.TuneError: Failure # 1 (occurred at 2022-03-24_12-33-03)
Traceback (most recent call last):
  File "/Users/xwjiang/ray/python/ray/tune/ray_trial_executor.py", line 907, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "/Users/xwjiang/ray/python/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/Users/xwjiang/ray/python/ray/worker.py", line 1809, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::TrainTrainable.train() (pid=45284, ip=127.0.0.1, repr=train_func)
  File "/Users/xwjiang/ray/python/ray/tune/trainable.py", line 349, in train
    result = self.step()
  File "/Users/xwjiang/ray/python/ray/tune/function_runner.py", line 402, in step
    self._report_thread_runner_error(block=True)
  File "/Users/xwjiang/ray/python/ray/tune/function_runner.py", line 567, in _report_thread_runner_error
    ("Trial raised an exception. Traceback:\n{}".format(err_tb_str))
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::TrainTrainable.train() (pid=45284, ip=127.0.0.1, repr=train_func)
  File "/Users/xwjiang/ray/python/ray/tune/function_runner.py", line 272, in run
    self._entrypoint()
  File "/Users/xwjiang/ray/python/ray/tune/function_runner.py", line 350, in entrypoint
    self._status_reporter.get_checkpoint(),
  File "/Users/xwjiang/ray/python/ray/ml/trainer.py", line 295, in _trainable_func
    super()._trainable_func(self._merged_config, reporter, checkpoint_dir)
  File "/Users/xwjiang/ray/python/ray/tune/function_runner.py", line 639, in _trainable_func
    output = fn()
  File "/Users/xwjiang/ray/python/ray/ml/trainer.py", line 279, in train_func
    trainer.training_loop()
  File "/Users/xwjiang/ray/python/ray/ml/train/data_parallel_trainer.py", line 298, in training_loop
    for results in training_iterator:
  File "/Users/xwjiang/ray/python/ray/train/trainer.py", line 720, in __next__
    self._finish_training
  File "/Users/xwjiang/ray/python/ray/train/trainer.py", line 687, in _run_with_error_handling
    return func()
  File "/Users/xwjiang/ray/python/ray/train/trainer.py", line 791, in _finish_training
    return self._backend_executor.finish_training()
  File "/Users/xwjiang/ray/python/ray/train/backend.py", line 544, in finish_training
    results = self.get_with_failure_handling(futures)
  File "/Users/xwjiang/ray/python/ray/train/backend.py", line 563, in get_with_failure_handling
    success, failed_worker_indexes = check_for_failure(remote_values)
  File "/Users/xwjiang/ray/python/ray/train/utils.py", line 53, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(NameError): ray::BaseWorkerMixin._BaseWorkerMixin__execute() (pid=45300, ip=127.0.0.1, repr=<ray.train.worker_group.BaseWorkerMixin object at 0x7fdfd84bea50>)
  File "/Users/xwjiang/ray/python/ray/train/worker_group.py", line 26, in __execute
    return func(*args, **kwargs)
  File "/Users/xwjiang/ray/python/ray/train/backend.py", line 535, in end_training
    output = session.finish()
  File "/Users/xwjiang/ray/python/ray/train/session.py", line 117, in finish
    func_output = self.training_thread.join()
  File "/Users/xwjiang/ray/python/ray/train/utils.py", line 101, in join
    raise self.exc
  File "/Users/xwjiang/ray/python/ray/train/utils.py", line 94, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "tryout_error.py", line 4, in train_loop_for_worker
    undefined_variable
NameError: name 'undefined_variable' is not defined



During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tryout_error.py", line 7, in <module>
    result = trainer.fit()
  File "/Users/xwjiang/ray/python/ray/ml/trainer.py", line 257, in fit
    raise TrainingFailedError
ray.ml.trainer.TrainingFailedError

xwjiang2010 (Contributor, Author):

This part is printed multiple times as well:

(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/train/worker_group.py", line 26, in __execute
(TrainTrainable pid=45284)     return func(*args, **kwargs)
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/train/backend.py", line 535, in end_training
(TrainTrainable pid=45284)     output = session.finish()
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/train/session.py", line 117, in finish
(TrainTrainable pid=45284)     func_output = self.training_thread.join()
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/train/utils.py", line 101, in join
(TrainTrainable pid=45284)     raise self.exc
(TrainTrainable pid=45284)   File "/Users/xwjiang/ray/python/ray/train/utils.py", line 94, in run
(TrainTrainable pid=45284)     self.ret = self._target(*self._args, **self._kwargs)
(TrainTrainable pid=45284)   File "tryout_error.py", line 4, in train_loop_for_worker
(TrainTrainable pid=45284)     undefined_variable
(TrainTrainable pid=45284) NameError: name 'undefined_variable' is not defined

But I suspect this happens somewhere in Ray Core code.

@xwjiang2010 xwjiang2010 changed the title [air] reduce unnecessary stacktrace - part 1. [air] reduce unnecessary stacktrace Mar 25, 2022
xwjiang2010 (Contributor, Author):

@amogkam
I updated the PR description with the final side-by-side comparison.
It looks a lot cleaner to me, but I would like to see if you have any feedback.

@amogkam amogkam left a comment


This is awesome!!! Looks really good from my end, but I'll wait for other folks' review for final approval here.

@@ -58,9 +59,10 @@ def __getitem__(self, i) -> Result:

@staticmethod
def _populate_exception(trial: Trial) -> Optional[TuneError]:

The return type is no longer just TuneError, right?

amogkam (Contributor) commented Mar 25, 2022

Btw, I think the assumption we are making here is that all exceptions are serializable.

Should we fall back to the original approach if that is not the case?
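
One way such a fallback could look (a sketch under my own assumptions, with a hypothetical helper name, not code from this PR): try to pickle the exception, and if that fails keep only the formatted traceback string, as before.

import pickle
import traceback
from typing import Optional, Tuple


def capture_error(exc: Exception) -> Tuple[Optional[bytes], str]:
    # Always keep the formatted traceback string (the pre-PR behavior).
    tb_str = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    try:
        # Additionally keep the pickled exception so the original type can be re-raised.
        return pickle.dumps(exc), tb_str
    except Exception:  # e.g. a custom __reduce__ that refuses to pickle
        return None, tb_str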

@@ -154,7 +155,7 @@ def fail(self):
raise ValueError

trainer = DummyTrainer(fail)
with pytest.raises(TrainingFailedError):
with pytest.raises(RayTaskError):

Suggested change
with pytest.raises(RayTaskError):
with pytest.raises(ValueError):

xwjiang2010 (Contributor, Author):

Do you need this to be ValueError? I thought you mentioned RayTaskError is ok as well?

xwjiang2010 (Contributor, Author):

Changed to ValueError.
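
For context on why either assertion can pass: as far as I understand, ray.exceptions.RayTaskError.as_instanceof_cause() (visible in the traces above as raise value.as_instanceof_cause()) raises an object whose class derives from both RayTaskError and the cause's type, so isinstance checks against either succeed. A rough stand-alone illustration of that idea, using stub classes rather than Ray's actual internals:

class RayTaskErrorStub(Exception):
    """Stand-in for ray.exceptions.RayTaskError in this illustration."""


def as_instanceof_cause_stub(cause: Exception) -> Exception:
    # Build a class that subclasses both the stub and the cause's type.
    cls = type(
        "RayTaskError({})".format(type(cause).__name__),
        (RayTaskErrorStub, type(cause)),
        {},
    )
    return cls(str(cause))


err = as_instanceof_cause_stub(ValueError("boom"))
assert isinstance(err, RayTaskErrorStub)
assert isinstance(err, ValueError)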

@@ -8,3 +11,10 @@ class AbortTrialExecution(TuneError):
"""Error that indicates a trial should not be retried."""

pass


def get_tb_from_exception(e: Exception):

Suggested change
def get_tb_from_exception(e: Exception):
def get_tb_from_exception(e: Exception) -> str:

python/ray/tune/trial.py (resolved)
amogkam (Contributor) commented Mar 28, 2022

Thanks @xwjiang2010!

I'm not too familiar with this code, but a couple of points:

  1. Do we need to pass around both the exception and the error message string as separate arguments? Could we just pass around the exception and then extract the error message from it as needed? I think this might help simplify the code.
  2. Do we need both pickled_error_file and error_file?

xwjiang2010 (Contributor, Author):

  • Do we need to pass around both the exception and the error message string as separate arguments? Could we just pass around the exception and then extract the error message from it as needed? I think this might help simplify the code.

Yeah, I was also thinking about that. Consolidating to one would be nice.
But I am not sure about passing the Exception around naively. There are articles about exceptions holding on to their stack trace frames and ending up with memory leaks. I was wondering if this is the reason we call traceback.format_exc every time an exception is caught. @krfricke @Yard1 any thoughts here?

  • Do we need both pickled_error_file and error_file?
    error_file is the user-facing API, so we should not change it. pickled_error_file is for loading the RayTaskError specifically, so that we can surface the original error type (a rough sketch of the loading side follows below).
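
A rough sketch of how the loading side could prefer one file over the other (hypothetical helper; not the body of the PR's actual _populate_exception):

import os
import pickle
from typing import Optional


def populate_exception_sketch(trial_logdir: str) -> Optional[Exception]:
    pkl_path = os.path.join(trial_logdir, "error.pkl")
    txt_path = os.path.join(trial_logdir, "error.txt")
    if os.path.exists(pkl_path):
        # Pickled RayTaskError: surfaces the original error type in ResultGrid.
        with open(pkl_path, "rb") as f:
            return pickle.load(f)
    if os.path.exists(txt_path):
        # Fall back to the user-facing text file; only the message text survives.
        with open(txt_path) as f:
            return RuntimeError(f.read())  # stand-in for Tune's own error type
    return None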


@krfricke krfricke left a comment


Generally looks good, I have a few questions.

Also, can we test this code with unpicklable exceptions in a function trainable, a class trainable, and in a callback (i.e., driver-side)?

Example of an unpicklable exception:

import pickle


class UnpickableException(RuntimeError):
    def __init__(self, message: str):
        super(UnpickableException, self).__init__(message)

    def __reduce__(self):
        raise TypeError("Can't pickle this exception")


exception_blob = None
try:
    raise UnpickableException("Fail")
except Exception as e:
    exception_blob = pickle.dumps(e)

print("Exception blob:", exception_blob)

python/tryout_horovod.py (outdated, resolved)
python/ray/tune/trial.py (outdated, resolved)
python/ray/tune/result_grid.py (outdated, resolved)
python/ray/tune/result_grid.py (outdated, resolved)
python/ray/tune/ray_trial_executor.py (outdated, resolved)
xwjiang2010 (Contributor, Author):

Hmmm, we only pickle the RayTaskError.

And if the trainable somehow throws a non-serializable error, RayTaskError itself will have an issue. See the following stack trace:

(TrainTrainable pid=72034) During handling of the above exception, another exception occurred:
(TrainTrainable pid=72034)
(TrainTrainable pid=72034) Traceback (most recent call last):
(TrainTrainable pid=72034)   File "python/ray/_raylet.pyx", line 797, in ray._raylet.task_execution_handler
(TrainTrainable pid=72034)   File "python/ray/_raylet.pyx", line 616, in ray._raylet.execute_task
(TrainTrainable pid=72034)   File "python/ray/_raylet.pyx", line 752, in ray._raylet.execute_task
(TrainTrainable pid=72034)   File "python/ray/_raylet.pyx", line 2015, in ray._raylet.CoreWorker.store_task_outputs
(TrainTrainable pid=72034)   File "/Users/xwjiang/ray/python/ray/serialization.py", line 413, in serialize
(TrainTrainable pid=72034)     return self._serialize_to_msgpack(value)
(TrainTrainable pid=72034)   File "/Users/xwjiang/ray/python/ray/serialization.py", line 368, in _serialize_to_msgpack
(TrainTrainable pid=72034)     value = value.to_bytes()
(TrainTrainable pid=72034)   File "/Users/xwjiang/ray/python/ray/exceptions.py", line 24, in to_bytes
(TrainTrainable pid=72034)     serialized_exception=pickle.dumps(self),
(TrainTrainable pid=72034)   File "/Users/xwjiang/ray/python/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
(TrainTrainable pid=72034)     cp.dump(obj)
(TrainTrainable pid=72034)   File "/Users/xwjiang/ray/python/ray/cloudpickle/cloudpickle_fast.py", line 620, in dump
(TrainTrainable pid=72034)     return Pickler.dump(self, obj)
(TrainTrainable pid=72034)   File "/Users/xwjiang/ray/python/ray/tune/tests/test_tuner.py", line 183, in __reduce__
(TrainTrainable pid=72034)     raise TypeError("Can't pickle this exception")
(TrainTrainable pid=72034) TypeError: Can't pickle this exception
(TrainTrainable pid=72034)
(TrainTrainable pid=72034) During handling of the above exception, another exception occurred:
(TrainTrainable pid=72034)
(TrainTrainable pid=72034) Traceback (most recent call last):
(TrainTrainable pid=72034)   File "python/ray/_raylet.pyx", line 826, in ray._raylet.task_execution_handler
(TrainTrainable pid=72034) SystemExit

This means RayTaskError itself must be serializable. So as long as we stick with pickling RayTaskError, we should be good.


def get_trial(self, tid):
trial = [t for t in self._trials if t.trial_id == tid]
return trial[0] if trial else None

def get_trials(self):
"""Returns the list of trials managed by this TrialRunner.
"""Returns the list of triaget_tb_from_exceptionls managed by this TrialRunner.

Nit: this variable name seems off.

xwjiang2010 (Contributor, Author):

ah good catch!


@krfricke krfricke left a comment


LGTM, minor nit

@krfricke krfricke merged commit 378b669 into ray-project:master Mar 31, 2022
@richardliaw richardliaw added this to the Ray AIR milestone Apr 8, 2022
@xwjiang2010 xwjiang2010 deleted the stacktrace branch July 26, 2023 19:54