[ml] Trainer failure stack traces are too long and hide the original error #23405

Closed
amogkam opened this issue Mar 22, 2022 · 2 comments
Labels: P1 Issue that should be fixed within a few weeks


amogkam (Contributor) commented Mar 22, 2022

Running this script

from ray.ml.train.integrations.torch import TorchTrainer

def train_loop_for_worker():
    undefined_variable  # intentionally undefined: triggers a NameError in each worker

trainer = TorchTrainer(scaling_config={"num_workers": 2}, train_loop_per_worker=train_loop_for_worker)
result = trainer.fit()

results in this stack trace

(TrainTrainable pid=96138) 2022-03-22 10:41:28,079	ERROR function_runner.py:281 -- Runner Thread raised error.
(TrainTrainable pid=96138) Traceback (most recent call last):
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/tune/function_runner.py", line 272, in run
(TrainTrainable pid=96138)     self._entrypoint()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/tune/function_runner.py", line 348, in entrypoint
(TrainTrainable pid=96138)     return self._trainable_func(
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/util/tracing/tracing_helper.py", line 462, in _resume_span
(TrainTrainable pid=96138)     return method(self, *_args, **_kwargs)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/ml/trainer.py", line 306, in _trainable_func
(TrainTrainable pid=96138)     super()._trainable_func(self._merged_config, reporter, checkpoint_dir)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/tune/function_runner.py", line 640, in _trainable_func
(TrainTrainable pid=96138)     output = fn()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/ml/trainer.py", line 290, in train_func
(TrainTrainable pid=96138)     trainer.training_loop()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/ml/train/data_parallel_trainer.py", line 295, in training_loop
(TrainTrainable pid=96138)     for results in training_iterator:
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/trainer.py", line 719, in __next__
(TrainTrainable pid=96138)     self._final_results = self._run_with_error_handling(
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/trainer.py", line 687, in _run_with_error_handling
(TrainTrainable pid=96138)     return func()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/trainer.py", line 791, in _finish_training
(TrainTrainable pid=96138)     return self._backend_executor.finish_training()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/backend.py", line 544, in finish_training
(TrainTrainable pid=96138)     results = self.get_with_failure_handling(futures)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/backend.py", line 563, in get_with_failure_handling
(TrainTrainable pid=96138)     success, failed_worker_indexes = check_for_failure(remote_values)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/utils.py", line 53, in check_for_failure
(TrainTrainable pid=96138)     ray.get(object_ref)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/_private/client_mode_hook.py", line 105, in wrapper
(TrainTrainable pid=96138)     return func(*args, **kwargs)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/worker.py", line 1799, in get
(TrainTrainable pid=96138)     raise value.as_instanceof_cause()
(TrainTrainable pid=96138) ray.exceptions.RayTaskError(NameError): ray::BaseWorkerMixin._BaseWorkerMixin__execute() (pid=96143, ip=127.0.0.1, repr=<ray.train.worker_group.BaseWorkerMixin object at 0x190b21670>)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/worker_group.py", line 26, in __execute
(TrainTrainable pid=96138)     return func(*args, **kwargs)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/backend.py", line 535, in end_training
(TrainTrainable pid=96138)     output = session.finish()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/session.py", line 117, in finish
(TrainTrainable pid=96138)     func_output = self.training_thread.join()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/utils.py", line 101, in join
(TrainTrainable pid=96138)     raise self.exc
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/utils.py", line 94, in run
(TrainTrainable pid=96138)     self.ret = self._target(*self._args, **self._kwargs)
(TrainTrainable pid=96138)   File "air-error.py", line 4, in train_loop_for_worker
(TrainTrainable pid=96138)     undefined_variable
(TrainTrainable pid=96138) NameError: name 'undefined_variable' is not defined
(TrainTrainable pid=96138) Exception in thread Thread-3:
(TrainTrainable pid=96138) Traceback (most recent call last):
(TrainTrainable pid=96138)   File "/usr/local/Cellar/[email protected]/3.8.6_2/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 932, in _bootstrap_inner
(TrainTrainable pid=96138)     self.run()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/tune/function_runner.py", line 298, in run
(TrainTrainable pid=96138)     raise e
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/tune/function_runner.py", line 272, in run
(TrainTrainable pid=96138)     self._entrypoint()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/tune/function_runner.py", line 348, in entrypoint
(TrainTrainable pid=96138)     return self._trainable_func(
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/util/tracing/tracing_helper.py", line 462, in _resume_span
(TrainTrainable pid=96138)     return method(self, *_args, **_kwargs)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/ml/trainer.py", line 306, in _trainable_func
(TrainTrainable pid=96138)     super()._trainable_func(self._merged_config, reporter, checkpoint_dir)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/tune/function_runner.py", line 640, in _trainable_func
(TrainTrainable pid=96138)     output = fn()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/ml/trainer.py", line 290, in train_func
(TrainTrainable pid=96138)     trainer.training_loop()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/ml/train/data_parallel_trainer.py", line 295, in training_loop
(TrainTrainable pid=96138)     for results in training_iterator:
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/trainer.py", line 719, in __next__
(TrainTrainable pid=96138)     self._final_results = self._run_with_error_handling(
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/trainer.py", line 687, in _run_with_error_handling
(TrainTrainable pid=96138)     return func()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/trainer.py", line 791, in _finish_training
(TrainTrainable pid=96138)     return self._backend_executor.finish_training()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/backend.py", line 544, in finish_training
(TrainTrainable pid=96138)     results = self.get_with_failure_handling(futures)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/backend.py", line 563, in get_with_failure_handling
(TrainTrainable pid=96138)     success, failed_worker_indexes = check_for_failure(remote_values)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/utils.py", line 53, in check_for_failure
(TrainTrainable pid=96138)     ray.get(object_ref)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/_private/client_mode_hook.py", line 105, in wrapper
(TrainTrainable pid=96138)     return func(*args, **kwargs)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/worker.py", line 1799, in get
(TrainTrainable pid=96138)     raise value.as_instanceof_cause()
(TrainTrainable pid=96138) ray.exceptions.RayTaskError(NameError): ray::BaseWorkerMixin._BaseWorkerMixin__execute() (pid=96143, ip=127.0.0.1, repr=<ray.train.worker_group.BaseWorkerMixin object at 0x190b21670>)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/worker_group.py", line 26, in __execute
(TrainTrainable pid=96138)     return func(*args, **kwargs)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/backend.py", line 535, in end_training
(TrainTrainable pid=96138)     output = session.finish()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/session.py", line 117, in finish
(TrainTrainable pid=96138)     func_output = self.training_thread.join()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/utils.py", line 101, in join
(TrainTrainable pid=96138)     raise self.exc
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/utils.py", line 94, in run
(TrainTrainable pid=96138)     self.ret = self._target(*self._args, **self._kwargs)
(TrainTrainable pid=96138)   File "air-error.py", line 4, in train_loop_for_worker
(TrainTrainable pid=96138)     undefined_variable
(TrainTrainable pid=96138) NameError: name 'undefined_variable' is not defined
(TrainTrainable pid=96138) 2022-03-22 10:41:28,088	ERROR worker.py:91 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::BaseWorkerMixin._BaseWorkerMixin__execute() (pid=96142, ip=127.0.0.1, repr=<ray.train.worker_group.BaseWorkerMixin object at 0x1876886a0>)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/worker_group.py", line 26, in __execute
(TrainTrainable pid=96138)     return func(*args, **kwargs)
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/backend.py", line 535, in end_training
(TrainTrainable pid=96138)     output = session.finish()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/session.py", line 117, in finish
(TrainTrainable pid=96138)     func_output = self.training_thread.join()
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/utils.py", line 101, in join
(TrainTrainable pid=96138)     raise self.exc
(TrainTrainable pid=96138)   File "/Users/amog/dev/ray/python/ray/train/utils.py", line 94, in run
(TrainTrainable pid=96138)     self.ret = self._target(*self._args, **self._kwargs)
(TrainTrainable pid=96138)   File "air-error.py", line 4, in train_loop_for_worker
(TrainTrainable pid=96138)     undefined_variable
(TrainTrainable pid=96138) NameError: name 'undefined_variable' is not defined
2022-03-22 10:41:28,280	ERROR trial_runner.py:876 -- Trial train_func_49e4d_00000: Error processing event.
NoneType: None
Result for train_func_49e4d_00000:
  date: 2022-03-22_10-41-25
  experiment_id: 101e6f85fac944099fa8d346645e11fa
  hostname: MacBook-Pro-5.attlocal.net
  node_ip: 127.0.0.1
  pid: 96138
  timestamp: 1647970885
  trial_id: 49e4d_00000

== Status ==
Current time: 2022-03-22 10:41:28 (running for 00:00:05.80)
Memory usage on this node: 10.9/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/3.49 GiB heap, 0.0/1.74 GiB objects
Result logdir: /Users/amog/ray_results/train_func_2022-03-22_10-41-22
Number of trials: 1/1 (1 ERROR)
+------------------------+----------+-----------------+
| Trial name             | status   | loc             |
|------------------------+----------+-----------------|
| train_func_49e4d_00000 | ERROR    | 127.0.0.1:96138 |
+------------------------+----------+-----------------+
Number of errored trials: 1
+------------------------+--------------+---------------------------------------------------------------------------------------------------------------+
| Trial name             |   # failures | error file                                                                                                    |
|------------------------+--------------+---------------------------------------------------------------------------------------------------------------|
| train_func_49e4d_00000 |            1 | /Users/amog/ray_results/train_func_2022-03-22_10-41-22/train_func_49e4d_00000_0_2022-03-22_10-41-22/error.txt |
+------------------------+--------------+---------------------------------------------------------------------------------------------------------------+

== Status ==
Current time: 2022-03-22 10:41:28 (running for 00:00:05.80)
Memory usage on this node: 10.9/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/3.49 GiB heap, 0.0/1.74 GiB objects
Result logdir: /Users/amog/ray_results/train_func_2022-03-22_10-41-22
Number of trials: 1/1 (1 ERROR)
+------------------------+----------+-----------------+
| Trial name             | status   | loc             |
|------------------------+----------+-----------------|
| train_func_49e4d_00000 | ERROR    | 127.0.0.1:96138 |
+------------------------+----------+-----------------+
Number of errored trials: 1
+------------------------+--------------+---------------------------------------------------------------------------------------------------------------+
| Trial name             |   # failures | error file                                                                                                    |
|------------------------+--------------+---------------------------------------------------------------------------------------------------------------|
| train_func_49e4d_00000 |            1 | /Users/amog/ray_results/train_func_2022-03-22_10-41-22/train_func_49e4d_00000_0_2022-03-22_10-41-22/error.txt |
+------------------------+--------------+---------------------------------------------------------------------------------------------------------------+

Traceback (most recent call last):
  File "/Users/amog/dev/ray/python/ray/ml/trainer.py", line 254, in fit
    analysis = tune.run(run_or_experiment=trainable)
  File "/Users/amog/dev/ray/python/ray/tune/tune.py", line 695, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [train_func_49e4d_00000])

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "air-error.py", line 7, in <module>
    result = trainer.fit()
  File "/Users/amog/dev/ray/python/ray/ml/trainer.py", line 256, in fit
    raise TrainingFailedError
ray.ml.trainer.TrainingFailedError

The actual error (the NameError) is repeated multiple times in the output, but only as log messages; the exception that is ultimately raised to the user is an opaque TrainingFailedError rather than the original error.

We should improve this by:

  1. Deduplicating the repeated NameError stack traces so they are only printed once.
  2. Raising the actual error instead of an arbitrary TuneError/TrainingFailedError.

One way to implement (2) is for Tune to use Ray Train's thread utility, which lets an exception raised in the child thread be re-raised in the main thread: https://github.com/ray-project/ray/blob/master/python/ray/train/utils.py#L88-L102
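
For illustration, here is a minimal sketch of that pattern: a thread subclass that captures an exception from its target and re-raises it when the parent calls join(). The name PropagatingThread is illustrative rather than the actual Ray utility, and the use of the Thread._target/_args/_kwargs internals mirrors what the linked utility does.

import threading

# Minimal sketch (illustrative, not the actual Ray API): a thread whose
# join() re-raises any exception from the target in the joining thread.
class PropagatingThread(threading.Thread):
    def run(self):
        self.exc = None
        self.ret = None
        try:
            # _target/_args/_kwargs are the standard Thread internals,
            # the same ones the linked Ray utility reads.
            self.ret = self._target(*self._args, **self._kwargs)
        except BaseException as e:
            self.exc = e

    def join(self, timeout=None):
        super().join(timeout)
        if self.exc:
            raise self.exc
        return self.ret

def train_loop_for_worker():
    undefined_variable  # same intentional NameError as in the repro above

t = PropagatingThread(target=train_loop_for_worker)
t.start()
try:
    t.join()  # the worker's exception is re-raised here, in the main thread
except NameError as e:
    print("caught in main thread:", e)

If Tune's function runner joined its runner thread through something like this, the trial failure raised to the driver could carry the original NameError instead of only logging it.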

amogkam changed the title from "[ml] Trainer failure stack traces are too long" to "[ml] Trainer failure stack traces are too long and hide the original error" on Mar 22, 2022
richardliaw added the P1 (Issue that should be fixed within a few weeks) label on Mar 23, 2022
richardliaw (Contributor) commented:

Seems like this issue:

  1. Is not blocking a critical user journey.
  2. Is persistent.
  3. Is hard to work around.

Thus labeling it P1.

xwjiang2010 (Contributor) commented:

Sounds good, looking at it now. One piece of low-hanging fruit is #23475.

So basically the error currently flows through:
runner thread (where it actually happens) --> main thread --> Tune driver (wrapped as a RayTaskError) --> written to disk (error.txt) --> loaded back when creating the ResultGrid

Amog's suggestion applies to the first transition; the rest of this pipeline would also need to be updated so the original error survives each hop (see the sketch below).
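
As a hedged illustration of the "wrapped as a RayTaskError" hop: a failed Ray task arrives at the caller as a RayTaskError that also subclasses the original exception type, so driver-side code can detect and re-raise the user's actual error rather than a generic one. The remote function below is just a stand-in for the training worker, not the Trainer code path itself.

import ray
from ray.exceptions import RayTaskError

ray.init(num_cpus=1)

@ray.remote
def train_loop_for_worker():
    undefined_variable  # same intentional NameError as in the repro

try:
    ray.get(train_loop_for_worker.remote())
except RayTaskError as e:
    # The exception Ray raises is an instance of both RayTaskError and the
    # original exception type, so the user's NameError is recoverable here.
    print("is RayTaskError:", isinstance(e, RayTaskError))  # True
    print("is also NameError:", isinstance(e, NameError))   # True

The later hops (error.txt on disk and the ResultGrid) would presumably need an equivalent mechanism, e.g. keeping the original exception alongside the traceback text rather than only its string form.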
