[Train] TorchTrainer does not free all GPUs on shutdown #32725
Comments
@scv119 @matthewdeng @cadedaniel Any ideas here?
The repro script will help say definitively what the issue is. @MahdiNazemi could you say how you launched the job? E.g. was it from the head node directly, submitted via Ray Jobs, or via Ray Client?
Potentially a duplicate of #32210. It would be great to have a repro script; also, can you check whether the Ray worker process exited when the GPU leaks?
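One minimal way to run that check on the node, assuming the standard `nvidia-smi` and `pgrep` command-line tools are available (the exact commands are not specified in this thread), is to compare the PIDs that hold GPU memory against any Ray worker processes that are still alive:

```python
import subprocess

# PIDs that currently hold memory on each GPU, as reported by nvidia-smi.
print(subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,gpu_uuid,used_memory", "--format=csv"],
    capture_output=True, text=True,
).stdout)

# Ray worker processes that are still running; task/actor workers show up as "ray::<name>".
print(subprocess.run(
    ["pgrep", "-af", "ray::"],
    capture_output=True, text=True,
).stdout)
```

If the leaked allocation belongs to a PID that no longer appears in the process list, the memory is being held by something other than a live Ray worker.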
@cadedaniel, I'm using Ray within a large repo, so I have to trim the code substantially before I can provide a script. But to answer your question, I run Ray from the head node directly. Here is some information that may be helpful:

```python
num_samples = 350
metric = "accuracy"
mode = "max"

search_alg = HyperOptSearch(
    space=space, metric=metric, mode=mode, n_initial_points=20
)

scheduler = ASHAv2(
    time_attr="training_iteration",
    metric=metric,
    mode=mode,
    max_t=120,
    grace_period=1,
    reduction_factor=35,
)
```

```python
num_gpus = len([int(s) for s in args.gpus.split(",")])

if args.parallel == "DDP":
    trainer = TorchTrainer(
        train_loop_per_worker=partial(run_worker_helper, args),
        torch_config=TorchConfig(backend="nccl"),
        scaling_config=ScalingConfig(
            trainer_resources={"CPU": 1},
            num_workers=num_gpus,
            use_gpu=True,
            resources_per_worker={"CPU": args.workers},
        ),
    )

tuner = tune.Tuner(
    trainable=trainer,
    param_space=param_space,
    tune_config=tune_config,
    run_config=run_config,
)

result = tuner.fit()
```

Please let me know if you need additional information.
@scv119, could you please let me know how to do that? Should I try to find the worker process related to that trial and check its status in the dashboard?
One way to check this is to run …
There are no running worker processes when the issue occurs. For the problematic trial, …

The memory usage for all processes is around 38G during training. As you can see in the latter call to …
@MahdiNazemi can you take a look at #31451 (comment) and see if this is applicable to your script? Specifically, do you use a …
@matthewdeng, yes, I usually set …
Ah okay, I think that's likely it - please try with …
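The exact setting suggested here is not shown above, so the following is only a hypothetical illustration: assuming the linked discussion (#31451) concerns PyTorch DataLoader worker subprocesses, disabling them would look roughly like this.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in dataset; the real training code would use its own data.
dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))

# num_workers=0 keeps data loading in the training worker process itself, so no
# loader subprocesses are spawned that could outlive a paused or terminated trial.
loader = DataLoader(dataset, batch_size=32, num_workers=0)
```

The trade-off is slower epochs, which matches the observation in the next comment.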
But … The experiment is running, but because each epoch is taking a lot longer, it will take some time before I can report back results.
The team is looking into properly terminating subprocesses, but more investigation is needed to understand how to do so. Though based on your original observations and the discussion in the other thread, I am wondering if there is a particular codepath in the trial pausing flow that is (sometimes) causing non-graceful termination. @Yard1 do you know? Something like what's controlled by …
Great! I terminated the experiment with …

Update 1: …
Is it possible to share your …
@justinvyu, sure!

```python
def run_worker_helper(args, config):
    if not isinstance(config, dict):
        raise ValueError(
            f"Input 'config' is not a dict, received {type(config)}"
        )

    args.tune_config = config
    hyperopt_to_ray(config)
    checkpoint_to_args(config, args)

    rank = session.get_local_rank()
    world_size = session.get_local_world_size()
    run_worker(rank, world_size, args)
```

where … Here is the gist of run_worker:

```python
def run_worker(rank, world_size, args):
    process_group_params = dict(rank=rank, world_size=world_size)
    app = ClassifierCompressorSampleApp(
        args,
        script_dir=os.path.dirname(__file__),
        process_group_params=process_group_params,
    )
    app.run_training_loop()
    if args.tune == "":
        app.test()
    dist.destroy_process_group()
```
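Not something shown in the thread, but a defensive variant of the gist above (a sketch only, keeping the same user-defined names such as ClassifierCompressorSampleApp) would move the cleanup into a finally block so the process group is destroyed even if the training loop raises:

```python
import os
import torch.distributed as dist

def run_worker(rank, world_size, args):
    # Sketch: same structure as the gist above, but dist.destroy_process_group()
    # also runs when app.run_training_loop() or app.test() raises.
    process_group_params = dict(rank=rank, world_size=world_size)
    app = ClassifierCompressorSampleApp(  # user-defined class from the snippet above
        args,
        script_dir=os.path.dirname(__file__),
        process_group_params=process_group_params,
    )
    try:
        app.run_training_loop()
        if args.tune == "":
            app.test()
    finally:
        if dist.is_initialized():
            dist.destroy_process_group()
```

This does not help when a worker process is killed outright, but it avoids leaving NCCL state behind after an ordinary in-process failure.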
I'm experiencing the same issue. |
I am also experiencing the same issue. One of the GPUs is hanging at the end of training. |
@olivierr42 @vsokolovskii do you have a more recent repro script that we can run here? |
What happened + What you expected to happen
I have set up an experiment where I use a `TorchTrainer` (to enable DDP with eight GPUs) with the `ASHAv2` scheduler. Each trial is allocated all eight GPUs available on the node. The `grace_period=1`, so each trial is run for just one epoch before it is preempted by another PENDING trial.

After a few trials are run until the end of the first milestone, the trainer fails to clear the memory of only one of the GPUs, which causes a CUDA out-of-memory error for the next trial.
This error shows up at different times when I rerun the experiment: in one run, the memory was cleared correctly for the first five trials but not for the sixth; in another run, the issue occurred on the tenth trial.
To mitigate the issue, I added a `wait_for_gpu()` call at the beginning of my worker function. However, the GPU whose memory is not freed prints the following lines before the program is terminated: …

The other seven GPUs don't suffer from this issue.
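For reference, a minimal sketch of where that call sits, assuming Ray's `ray.tune.utils.wait_for_gpu` utility (which needs the GPUtil package) and the worker function shown earlier in the thread:

```python
from ray.tune.utils import wait_for_gpu

def run_worker_helper(args, config):
    # Block at the start of the worker until the assigned GPU's memory
    # utilization drops below target_util; raises a RuntimeError after
    # `retry` failed attempts spaced `delay_s` seconds apart.
    wait_for_gpu(target_util=0.01, retry=20, delay_s=5)
    ...  # rest of the worker function shown earlier in the thread
```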
I reran the experiment a few times, both with and without the `wait_for_gpu()` call, and experienced the same behavior every time.

Versions / Dependencies
Ray 2.2.0
Python 3.10.8
PyTorch 1.13.1
Ubuntu 22.04
Reproduction script
Will provide the script ASAP.
Issue Severity
High: It blocks me from completing my task.