Enabling data-parallel multi-GPU training #1188

marcromeyn · 2023-07-06T12:30:58Z

This PR enables data-parallel multi-GPU training and adds auto-initialization of a Model.

It also introduces singlegpu and multigpu pytest markers so that the GPU CI GitHub Actions workflow can be split into two jobs: one for the 1GPU runner and one for the multi-GPU 2GPU runner.
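As a rough illustration of how such markers are typically wired up (a sketch only, not the code from this PR), the snippet below registers the two marker names and applies them to placeholder tests. The test names and marker descriptions are made up for the example; only the marker names singlegpu and multigpu come from the description above.

```python
import pytest


def pytest_configure(config):
    # Usually placed in conftest.py. Registering the markers avoids
    # unknown-marker warnings and keeps `--strict-markers` happy.
    # The description strings are illustrative assumptions.
    config.addinivalue_line("markers", "singlegpu: test runs on the 1GPU runner")
    config.addinivalue_line("markers", "multigpu: test runs on the 2GPU runner")


@pytest.mark.singlegpu
def test_runs_on_one_gpu():
    # Hypothetical placeholder body.
    assert True


@pytest.mark.multigpu
def test_runs_on_two_gpus():
    # Hypothetical placeholder body.
    assert True
```

Each CI job could then select its own subset with `pytest -m singlegpu` or `pytest -m multigpu`.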

Follow-up: The test in tests/integration is not complete. Lightning launches separate processes under the hood with the correct environment variables, such as LOCAL_RANK, but pytest stays in the main process and tests only the LOCAL_RANK=0 case. A follow-up should add a proper test that ensures the dataloader works correctly when, for example, global_rank > 0.
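One possible direction for that follow-up, sketched under heavy assumptions: patch LOCAL_RANK inside the pytest process so that rank-dependent code is exercised for every rank, not only rank 0. The functions current_rank, shard_for_rank, and shards_for_all_ranks are hypothetical stand-ins, and the strided slicing is only an illustration; none of this is the actual Merlin dataloader logic, and it does not replace a real multi-process test.

```python
import os
from unittest import mock


def current_rank():
    # Hypothetical helper: read the rank the way a launched worker might.
    return int(os.environ.get("LOCAL_RANK", "0"))


def shard_for_rank(items, world_size):
    # Hypothetical stand-in for rank-dependent data selection:
    # each rank takes every `world_size`-th item, offset by its rank.
    return items[current_rank()::world_size]


def shards_for_all_ranks(items, world_size):
    # Simulate each rank in turn by temporarily patching LOCAL_RANK,
    # so ranks above 0 are covered without spawning processes.
    shards = []
    for rank in range(world_size):
        with mock.patch.dict(os.environ, {"LOCAL_RANK": str(rank)}):
            shards.append(shard_for_rank(items, world_size))
    return shards
```

A test built on this idea could assert that the shards are disjoint and together cover the whole input.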

github-actions · 2023-07-06T12:43:54Z

Documentation preview

https://nvidia-merlin.github.io/models/review/pr-1188

marcromeyn added 3 commits July 5, 2023 13:41

First pass over multi-GPU

e00799d

Multi-gpu test passes now locally (without metric calculations)

c972549

Introducing MultiLoader & add auto-initialization of Model

ba430d5

marcromeyn self-assigned this Jul 6, 2023

marcromeyn added the enhancement and area/pytorch labels Jul 6, 2023

Merge branch 'main' into torch/multi-gpu

3829127

marcromeyn and others added 12 commits July 6, 2023 15:04

Enable multi-GPU with metric-calculations

fdd6adb

Remove un-used to method in ModelOutput

c1d97b7

Merge branch 'torch/multi-gpu' of github.com:NVIDIA-Merlin/models int…

12418c8

…o torch/multi-gpu

fix test for cpu

c317f88

use multigpu marker

1645bb9

Merge branch 'main' into torch/multi-gpu

0365bed

automatically reparition if repartition is not provided

2d1112e

test rank

8f308e9

Merge branch 'main' into torch/multi-gpu

1628557

Add comment for follow up tasks

49c2674

lint

3e5ac66

fix test for cpu

4e80c77

edknv marked this pull request as ready for review July 10, 2023 12:37

edknv approved these changes Jul 10, 2023

edknv mentioned this pull request Jul 10, 2023

[WIP] Split gpu ci workflow into single-gpu and multi-gpu #1190

Closed

edknv merged commit 145e592 into main Jul 10, 2023

edknv deleted the torch/multi-gpu branch July 10, 2023 12:48
