[Train] TorchTrainer does not free all GPUs on shutdown #32725
Comments
@scv119 @matthewdeng @cadedaniel Any ideas here?
The repro script will help say definitively what the issue is. @MahdiNazemi could you say how you launched the job? E.g. was it from the head node directly, submitted via Ray Jobs, or via Ray Client?
Potentially a duplicate of #32210. It would be great to have a repro script; also, can you check whether the Ray worker process exited when the GPU leaks?
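One minimal way to run that check on the node, assuming the standard `nvidia-smi` and `pgrep` command-line tools are available (the exact commands are not specified in this thread), is to compare the PIDs that hold GPU memory against any Ray worker processes that are still alive:

```python
import subprocess

# PIDs that currently hold memory on each GPU, as reported by nvidia-smi.
print(subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,gpu_uuid,used_memory", "--format=csv"],
    capture_output=True, text=True,
).stdout)

# Ray worker processes that are still running; task/actor workers show up as "ray::<name>".
print(subprocess.run(
    ["pgrep", "-af", "ray::"],
    capture_output=True, text=True,
).stdout)
```

If the leaked allocation belongs to a PID that no longer appears in the process list, the memory is being held by something other than a live Ray worker.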
@cadedaniel, I'm using Ray within a large repo, so I have to trim the code substantially before I can provide a script. But to answer your question, I run Ray from the head node directly. Here is some information that may be helpful:

```python
num_samples = 350
metric = "accuracy"
mode = "max"

search_alg = HyperOptSearch(
    space=space, metric=metric, mode=mode, n_initial_points=20
)

scheduler = ASHAv2(
    time_attr="training_iteration",
    metric=metric,
    mode=mode,
    max_t=120,
    grace_period=1,
    reduction_factor=35,
)
```

```python
num_gpus = len([int(s) for s in args.gpus.split(",")])

if args.parallel == "DDP":
    trainer = TorchTrainer(
        train_loop_per_worker=partial(run_worker_helper, args),
        torch_config=TorchConfig(backend="nccl"),
        scaling_config=ScalingConfig(
            trainer_resources={"CPU": 1},
            num_workers=num_gpus,
            use_gpu=True,
            resources_per_worker={"CPU": args.workers},
        ),
    )

tuner = tune.Tuner(
    trainable=trainer,
    param_space=param_space,
    tune_config=tune_config,
    run_config=run_config,
)

result = tuner.fit()
```

Please let me know if you need additional information.
@scv119, could you please let me know how to do that? Should I try to find the worker process related to that trial and check its status in the dashboard?
One way to check this is to run …
There are no running worker processes when the issue occurs. For the problematic trial, …

The memory usage for all processes is around 38G during training. As you can see in the latter call to …
@MahdiNazemi can you take a look at #31451 (comment) and see if this is applicable to your script? Specifically, do you use a …
@matthewdeng, yes, I usually set …
Ah okay, I think that's likely it - please try with …
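The exact setting suggested here is not shown above, so the following is only a hypothetical illustration: assuming the linked discussion (#31451) concerns PyTorch DataLoader worker subprocesses, disabling them would look roughly like this.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in dataset; the real training code would use its own data.
dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))

# num_workers=0 keeps data loading in the training worker process itself, so no
# loader subprocesses are spawned that could outlive a paused or terminated trial.
loader = DataLoader(dataset, batch_size=32, num_workers=0)
```

The trade-off is slower epochs, which matches the observation in the next comment.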
But … The experiment is running, but because each epoch is taking a lot longer, it will take some time before I can report back results.
The team is looking into properly terminating subprocesses, but more investigation is needed to understand how to do so. Though based on your original observations and the discussion in the other thread, I am wondering if there is a particular codepath in the trial pausing flow that is (sometimes) causing non-graceful termination. @Yard1 do you know? Something like what's controlled by …
Great! I terminated the experiment with …

Update 1: …
Is it possible to share your …
@justinvyu, sure!

```python
def run_worker_helper(args, config):
    if not isinstance(config, dict):
        raise ValueError(
            f"Input 'config' is not a dict, received {type(config)}"
        )

    args.tune_config = config
    hyperopt_to_ray(config)
    checkpoint_to_args(config, args)

    rank = session.get_local_rank()
    world_size = session.get_local_world_size()
    run_worker(rank, world_size, args)
```

where … Here is the gist of run_worker:

```python
def run_worker(rank, world_size, args):
    process_group_params = dict(rank=rank, world_size=world_size)
    app = ClassifierCompressorSampleApp(
        args,
        script_dir=os.path.dirname(__file__),
        process_group_params=process_group_params,
    )
    app.run_training_loop()
    if args.tune == "":
        app.test()
    dist.destroy_process_group()
```
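Not something shown in the thread, but a defensive variant of the gist above (a sketch only, keeping the same user-defined names such as ClassifierCompressorSampleApp) would move the cleanup into a finally block so the process group is destroyed even if the training loop raises:

```python
import os
import torch.distributed as dist

def run_worker(rank, world_size, args):
    # Sketch: same structure as the gist above, but dist.destroy_process_group()
    # also runs when app.run_training_loop() or app.test() raises.
    process_group_params = dict(rank=rank, world_size=world_size)
    app = ClassifierCompressorSampleApp(  # user-defined class from the snippet above
        args,
        script_dir=os.path.dirname(__file__),
        process_group_params=process_group_params,
    )
    try:
        app.run_training_loop()
        if args.tune == "":
            app.test()
    finally:
        if dist.is_initialized():
            dist.destroy_process_group()
```

This does not help when a worker process is killed outright, but it avoids leaving NCCL state behind after an ordinary in-process failure.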
I'm experiencing the same issue. |
I am also experiencing the same issue. One of the GPUs is hanging at the end of training. |
@olivierr42 @vsokolovskii do you have a more recent repro script that we can run here? |
What happened + What you expected to happen
I have set up an experiment where I use a `TorchTrainer` (to enable DDP with eight GPUs) with the `ASHAv2` scheduler. Each trial is allocated all eight GPUs available on the node. The `grace_period=1`, so each trial is run for just one epoch before it is preempted by another PENDING trial.

After a few trials are run until the end of the first milestone, the trainer fails to clear the memory of only one of the GPUs, which causes a CUDA out-of-memory error for the next trial.
This error shows up at different times when I rerun the experiment: in one run, the memory was cleared correctly for the first five trials but not for the sixth; in another run, the issue occurred on the tenth trial.
To mitigate the issue, I added a `wait_for_gpu()` call at the beginning of my worker function. However, the GPU whose memory is not freed prints the following lines before the program is terminated: …

The other seven GPUs don't suffer from this issue.
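For reference, a minimal sketch of where that call sits, assuming Ray's `ray.tune.utils.wait_for_gpu` utility (which needs the GPUtil package) and the worker function shown earlier in the thread:

```python
from ray.tune.utils import wait_for_gpu

def run_worker_helper(args, config):
    # Block at the start of the worker until the assigned GPU's memory
    # utilization drops below target_util; raises a RuntimeError after
    # `retry` failed attempts spaced `delay_s` seconds apart.
    wait_for_gpu(target_util=0.01, retry=20, delay_s=5)
    ...  # rest of the worker function shown earlier in the thread
```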
I reran the experiment a few times, both with and without the `wait_for_gpu()` call, and experienced the same behavior every time.

Versions / Dependencies
Ray 2.2.0
Python 3.10.8
PyTorch 1.13.1
Ubuntu 22.04
Reproduction script
Will provide the script ASAP.
Issue Severity
High: It blocks me from completing my task.