[ml] Trainer failure stack traces are too long and hide the original error #23405

Closed
amogkam opened this issue Mar 22, 2022 · 2 comments
Labels: P1 Issue that should be fixed within a few weeks


amogkam (Contributor) commented Mar 22, 2022

Running this script

from ray.ml.train.integrations.torch import TorchTrainer

def train_loop_for_worker():
    undefined_variable  # intentionally undefined: triggers a NameError in each worker

trainer = TorchTrainer(scaling_config={"num_workers": 2}, train_loop_per_worker=train_loop_for_worker)
result = trainer.fit()

results in this stack trace

(TrainTrainable pid=96138) 2022-03-22 10:41:28,079	ERROR function_runner.py:281 -- Runner Thread raised error.
(TrainTrainable pid=96138) Traceback (most recent call last):
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/tune/function_runner.py", line 272, in run
(TrainTrainable pid=96138)     self._entrypoint()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/tune/function_runner.py", line 348, in entrypoint
(TrainTrainable pid=96138)     return self._trainable_func(
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/util/tracing/tracing_helper.py", line 462, in _resume_span
(TrainTrainable pid=96138)     return method(self, *_args, **_kwargs)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/ml/trainer.py", line 306, in _trainable_func
(TrainTrainable pid=96138)     super()._trainable_func(self._merged_config, reporter, checkpoint_dir)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/tune/function_runner.py", line 640, in _trainable_func
(TrainTrainable pid=96138)     output = fn()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/ml/trainer.py", line 290, in train_func
(TrainTrainable pid=96138)     trainer.training_loop()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/ml/train/data_parallel_trainer.py", line 295, in training_loop
(TrainTrainable pid=96138)     for results in training_iterator:
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/trainer.py", line 719, in __next__
(TrainTrainable pid=96138)     self._final_results = self._run_with_error_handling(
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/trainer.py", line 687, in _run_with_error_handling
(TrainTrainable pid=96138)     return func()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/trainer.py", line 791, in _finish_training
(TrainTrainable pid=96138)     return self._backend_executor.finish_training()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/backend.py", line 544, in finish_training
(TrainTrainable pid=96138)     results = self.get_with_failure_handling(futures)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/backend.py", line 563, in get_with_failure_handling
(TrainTrainable pid=96138)     success, failed_worker_indexes = check_for_failure(remote_values)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/utils.py", line 53, in check_for_failure
(TrainTrainable pid=96138)     ray.get(object_ref)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/_private/client_mode_hook.py", line 105, in wrapper
(TrainTrainable pid=96138)     return func(*args, **kwargs)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/worker.py", line 1799, in get
(TrainTrainable pid=96138)     raise value.as_instanceof_cause()
(TrainTrainable pid=96138) ray.exceptions.RayTaskError(NameError): ray::BaseWorkerMixin._BaseWorkerMixin__execute() (pid=96143, ip=127.0.0.1, repr=<ray.train.worker_group.BaseWorkerMixin object at 0x190b21670>)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/worker_group.py", line 26, in __execute
(TrainTrainable pid=96138)     return func(*args, **kwargs)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/backend.py", line 535, in end_training
(TrainTrainable pid=96138)     output = session.finish()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/session.py", line 117, in finish
(TrainTrainable pid=96138)     func_output = self.training_thread.join()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/utils.py", line 101, in join
(TrainTrainable pid=96138)     raise self.exc
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/utils.py", line 94, in run
(TrainTrainable pid=96138)     self.ret = self._target(*self._args, **self._kwargs)
(TrainTrainable pid=96138)   File "air-error.py", line 4, in train_loop_for_worker
(TrainTrainable pid=96138)     undefined_variable
(TrainTrainable pid=96138) NameError: name 'undefined_variable' is not defined
(TrainTrainable pid=96138) Exception in thread Thread-3:
(TrainTrainable pid=96138) Traceback (most recent call last):
(TrainTrainable pid=96138)   File "/usr/local/Cellar/[email protected]/3.8.6_2/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 932, in _bootstrap_inner
(TrainTrainable pid=96138)     self.run()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/tune/function_runner.py", line 298, in run
(TrainTrainable pid=96138)     raise e
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/tune/function_runner.py", line 272, in run
(TrainTrainable pid=96138)     self._entrypoint()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/tune/function_runner.py", line 348, in entrypoint
(TrainTrainable pid=96138)     return self._trainable_func(
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/util/tracing/tracing_helper.py", line 462, in _resume_span
(TrainTrainable pid=96138)     return method(self, *_args, **_kwargs)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/ml/trainer.py", line 306, in _trainable_func
(TrainTrainable pid=96138)     super()._trainable_func(self._merged_config, reporter, checkpoint_dir)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/tune/function_runner.py", line 640, in _trainable_func
(TrainTrainable pid=96138)     output = fn()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/ml/trainer.py", line 290, in train_func
(TrainTrainable pid=96138)     trainer.training_loop()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/ml/train/data_parallel_trainer.py", line 295, in training_loop
(TrainTrainable pid=96138)     for results in training_iterator:
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/trainer.py", line 719, in __next__
(TrainTrainable pid=96138)     self._final_results = self._run_with_error_handling(
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/trainer.py", line 687, in _run_with_error_handling
(TrainTrainable pid=96138)     return func()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/trainer.py", line 791, in _finish_training
(TrainTrainable pid=96138)     return self._backend_executor.finish_training()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/backend.py", line 544, in finish_training
(TrainTrainable pid=96138)     results = self.get_with_failure_handling(futures)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/backend.py", line 563, in get_with_failure_handling
(TrainTrainable pid=96138)     success, failed_worker_indexes = check_for_failure(remote_values)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/utils.py", line 53, in check_for_failure
(TrainTrainable pid=96138)     ray.get(object_ref)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/_private/client_mode_hook.py", line 105, in wrapper
(TrainTrainable pid=96138)     return func(*args, **kwargs)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/worker.py", line 1799, in get
(TrainTrainable pid=96138)     raise value.as_instanceof_cause()
(TrainTrainable pid=96138) ray.exceptions.RayTaskError(NameError): ray::BaseWorkerMixin._BaseWorkerMixin__execute() (pid=96143, ip=127.0.0.1, repr=<ray.train.worker_group.BaseWorkerMixin object at 0x190b21670>)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/worker_group.py", line 26, in __execute
(TrainTrainable pid=96138)     return func(*args, **kwargs)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/backend.py", line 535, in end_training
(TrainTrainable pid=96138)     output = session.finish()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/session.py", line 117, in finish
(TrainTrainable pid=96138)     func_output = self.training_thread.join()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/utils.py", line 101, in join
(TrainTrainable pid=96138)     raise self.exc
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/utils.py", line 94, in run
(TrainTrainable pid=96138)     self.ret = self._target(*self._args, **self._kwargs)
(TrainTrainable pid=96138)   File "air-error.py", line 4, in train_loop_for_worker
(TrainTrainable pid=96138)     undefined_variable
(TrainTrainable pid=96138) NameError: name 'undefined_variable' is not defined
(TrainTrainable pid=96138) 2022-03-22 10:41:28,088	ERROR worker.py:91 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::BaseWorkerMixin._BaseWorkerMixin__execute() (pid=96142, ip=127.0.0.1, repr=<ray.train.worker_group.BaseWorkerMixin object at 0x1876886a0>)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/worker_group.py", line 26, in __execute
(TrainTrainable pid=96138)     return func(*args, **kwargs)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/backend.py", line 535, in end_training
(TrainTrainable pid=96138)     output = session.finish()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/session.py", line 117, in finish
(TrainTrainable pid=96138)     func_output = self.training_thread.join()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/utils.py", line 101, in join
(TrainTrainable pid=96138)     raise self.exc
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/utils.py", line 94, in run
(TrainTrainable pid=96138)     self.ret = self._target(*self._args, **self._kwargs)
(TrainTrainable pid=96138)   File "air-error.py", line 4, in train_loop_for_worker
(TrainTrainable pid=96138)     undefined_variable
(TrainTrainable pid=96138) NameError: name 'undefined_variable' is not defined
2022-03-22 10:41:28,280	ERROR trial_runner.py:876 -- Trial train_func_49e4d_00000: Error processing event.
NoneType: None
Result for train_func_49e4d_00000:
  date: 2022-03-22_10-41-25
  experiment_id: 101e6f85fac944099fa8d346645e11fa
  hostname: MacBook-Pro-5.attlocal.net
  node_ip: 127.0.0.1
  pid: 96138
  timestamp: 1647970885
  trial_id: 49e4d_00000

== Status ==
Current time: 2022-03-22 10:41:28 (running for 00:00:05.80)
Memory usage on this node: 10.9/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/3.49 GiB heap, 0.0/1.74 GiB objects
Result logdir: /Users/amog/ray_results/train_func_2022-03-22_10-41-22
Number of trials: 1/1 (1 ERROR)
+------------------------+----------+-----------------+
| Trial name             | status   | loc             |
|------------------------+----------+-----------------|
| train_func_49e4d_00000 | ERROR    | 127.0.0.1:96138 |
+------------------------+----------+-----------------+
Number of errored trials: 1
+------------------------+--------------+---------------------------------------------------------------------------------------------------------------+
| Trial name             |   # failures | error file                                                                                                    |
|------------------------+--------------+---------------------------------------------------------------------------------------------------------------|
| train_func_49e4d_00000 |            1 | /Users/amog/ray_results/train_func_2022-03-22_10-41-22/train_func_49e4d_00000_0_2022-03-22_10-41-22/error.txt |
+------------------------+--------------+---------------------------------------------------------------------------------------------------------------+

== Status ==
Current time: 2022-03-22 10:41:28 (running for 00:00:05.80)
Memory usage on this node: 10.9/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/3.49 GiB heap, 0.0/1.74 GiB objects
Result logdir: /Users/amog/ray_results/train_func_2022-03-22_10-41-22
Number of trials: 1/1 (1 ERROR)
+------------------------+----------+-----------------+
| Trial name             | status   | loc             |
|------------------------+----------+-----------------|
| train_func_49e4d_00000 | ERROR    | 127.0.0.1:96138 |
+------------------------+----------+-----------------+
Number of errored trials: 1
+------------------------+--------------+---------------------------------------------------------------------------------------------------------------+
| Trial name             |   # failures | error file                                                                                                    |
|------------------------+--------------+---------------------------------------------------------------------------------------------------------------|
| train_func_49e4d_00000 |            1 | /Users/amog/ray_results/train_func_2022-03-22_10-41-22/train_func_49e4d_00000_0_2022-03-22_10-41-22/error.txt |
+------------------------+--------------+---------------------------------------------------------------------------------------------------------------+

Traceback (most recent call last):
  File "/Users/amog/dev/ray/python/ray/ml/trainer.py", line 254, in fit
    analysis = tune.run(run_or_experiment=trainable)
  File "/Users/amog/dev/ray/python/ray/tune/tune.py", line 695, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [train_func_49e4d_00000])

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "air-error.py", line 7, in <module>
    result = trainer.fit()
  File "/Users/amog/dev/ray/python/ray/ml/trainer.py", line 256, in fit
    raise TrainingFailedError
ray.ml.trainer.TrainingFailedError

The actual error (the NameError) is repeated multiple times in the output, but only as log messages; the exception that is ultimately raised to the user is an opaque TrainingFailedError rather than the original error.

We should improve this by:

  1. Deduplicating the repeated NameError stack traces so they are only printed once.
  2. Raising the actual error instead of an arbitrary TuneError/TrainingFailedError.

One way to implement (2) is for Tune to use Ray Train's thread utility, which lets an exception raised in the child thread be re-raised in the main thread: https://github.com/ray-project/ray/blob/master/python/ray/train/utils.py#L88-L102
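
For illustration, here is a minimal sketch of that pattern: a thread subclass that captures an exception from its target and re-raises it when the parent calls join(). The name PropagatingThread is illustrative rather than the actual Ray utility, and the use of the Thread._target/_args/_kwargs internals mirrors what the linked utility does.

import threading

# Minimal sketch (illustrative, not the actual Ray API): a thread whose
# join() re-raises any exception from the target in the joining thread.
class PropagatingThread(threading.Thread):
    def run(self):
        self.exc = None
        self.ret = None
        try:
            # _target/_args/_kwargs are the standard Thread internals,
            # the same ones the linked Ray utility reads.
            self.ret = self._target(*self._args, **self._kwargs)
        except BaseException as e:
            self.exc = e

    def join(self, timeout=None):
        super().join(timeout)
        if self.exc:
            raise self.exc
        return self.ret

def train_loop_for_worker():
    undefined_variable  # same intentional NameError as in the repro above

t = PropagatingThread(target=train_loop_for_worker)
t.start()
try:
    t.join()  # the worker's exception is re-raised here, in the main thread
except NameError as e:
    print("caught in main thread:", e)

If Tune's function runner joined its runner thread through something like this, the trial failure raised to the driver could carry the original NameError instead of only logging it.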

amogkam changed the title from "[ml] Trainer failure stack traces are too long" to "[ml] Trainer failure stack traces are too long and hide the original error" on Mar 22, 2022
richardliaw added the P1 (Issue that should be fixed within a few weeks) label on Mar 23, 2022
richardliaw (Contributor) commented:

Seems like this issue:

  1. Is not blocking a critical user journey.
  2. Is persistent.
  3. Is hard to work around.

Thus labeling it P1.

xwjiang2010 (Contributor) commented:

Sounds good, looking at it now. One piece of low-hanging fruit is #23475.

So basically the error currently flows through:
runner thread (where it actually happens) --> main thread --> Tune driver (wrapped as a RayTaskError) --> written to disk (error.txt) --> loaded back when creating the ResultGrid

Amog's suggestion applies to the first transition; the rest of this pipeline would also need to be updated so the original error survives each hop (see the sketch below).
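
As a hedged illustration of the "wrapped as a RayTaskError" hop: a failed Ray task arrives at the caller as a RayTaskError that also subclasses the original exception type, so driver-side code can detect and re-raise the user's actual error rather than a generic one. The remote function below is just a stand-in for the training worker, not the Trainer code path itself.

import ray
from ray.exceptions import RayTaskError

ray.init(num_cpus=1)

@ray.remote
def train_loop_for_worker():
    undefined_variable  # same intentional NameError as in the repro

try:
    ray.get(train_loop_for_worker.remote())
except RayTaskError as e:
    # The exception Ray raises is an instance of both RayTaskError and the
    # original exception type, so the user's NameError is recoverable here.
    print("is RayTaskError:", isinstance(e, RayTaskError))  # True
    print("is also NameError:", isinstance(e, NameError))   # True

The later hops (error.txt on disk and the ResultGrid) would presumably need an equivalent mechanism, e.g. keeping the original exception alongside the traceback text rather than only its string form.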
