[Train] Decouple device-related modules and add Huawei NPU support to Ray Train #44086

Merged
merged 38 commits into from
Sep 3, 2024
Changes from 35 commits
ebcccca
Introduce TorchDeviceManager to ray TrainSession and support NPU in R…
liuxsh9 Jun 7, 2024
e0b8117
Add higher abstract class to decouple device manager with torch.
liuxsh9 Jun 7, 2024
de34e81
fix
liuxsh9 Jun 7, 2024
e3ebd13
Merge branch 'master' into train-support-npu
liuxsh9 Jun 29, 2024
c80ec32
fix get_current_stream()
liuxsh9 Jun 29, 2024
ddee918
refine code
liuxsh9 Jul 3, 2024
d0a4e73
fix lint
liuxsh9 Jul 4, 2024
6583bd4
Merge branch 'master' into train-support-npu
liuxsh9 Jul 4, 2024
a4839d2
Enable share npu visible devices for local process in ddp.
liuxsh9 Jul 17, 2024
798304a
Merge branch 'master' into train-support-npu
liuxsh9 Jul 17, 2024
82eecc7
fix lint
liuxsh9 Jul 17, 2024
c88e29e
change the order of init device mananger and set env to enable huggin…
liuxsh9 Jul 17, 2024
9b1ada5
Merge branch 'master' into train-support-npu
liuxsh9 Jul 18, 2024
13e4914
Refactor code based on the comment and feedback.
liuxsh9 Jul 27, 2024
4665b22
Add unit tests for torch device mananger and npu accelerator ids sharing
liuxsh9 Jul 27, 2024
d51d5d0
Merge branch 'master' into train-support-npu
liuxsh9 Jul 27, 2024
2f7b9c6
add gpu-only tags for unit tests
liuxsh9 Jul 27, 2024
81ba9a4
remove resources_per_worker field in backend
liuxsh9 Jul 27, 2024
79f7d43
Edit error message
liuxsh9 Jul 27, 2024
15e99c6
refine code
liuxsh9 Jul 29, 2024
16927f2
Trigger a runtime error when npu is allocated but torch npu is not av…
liuxsh9 Jul 29, 2024
0e81c8e
revert hpu get device logic.
liuxsh9 Jul 29, 2024
dee4745
Refine the code based on the feedback from review.
liuxsh9 Aug 7, 2024
e6a2f4a
Introduce `CPUTorchDeviceManager` and change the value of DEFAULT_TO…
liuxsh9 Aug 7, 2024
9c5a296
delete fall back logic in CUDATorchDevicaManager
liuxsh9 Aug 7, 2024
68bc4e1
Merge branch 'master' into train-support-npu
liuxsh9 Aug 7, 2024
16094f1
fix
liuxsh9 Aug 7, 2024
82f27b1
refine code
liuxsh9 Aug 8, 2024
7973ddf
Refine code based on the review feedback.
liuxsh9 Aug 15, 2024
b83d1ff
fix
liuxsh9 Aug 15, 2024
199e572
Merge branch 'master' into train-support-npu
liuxsh9 Aug 15, 2024
1c0d115
Fix device manager logic in local environment.
liuxsh9 Aug 16, 2024
2c4c5cc
fix
liuxsh9 Aug 16, 2024
d8a0900
fix typo and remove unnecessary operation for npu.
liuxsh9 Aug 16, 2024
93a9bcf
move implementation into subclass
liuxsh9 Aug 19, 2024
761ac5c
Update code based on the review feedback.
liuxsh9 Aug 23, 2024
657b1f9
Merge branch 'master' into train-support-npu
matthewdeng Aug 27, 2024
a5549cc
fix lint
liuxsh9 Aug 28, 2024
3 changes: 3 additions & 0 deletions python/ray/_private/ray_constants.py
@@ -423,10 +423,13 @@ def env_set_by_user(key):
CUDA_VISIBLE_DEVICES_ENV_VAR = "CUDA_VISIBLE_DEVICES"
NEURON_RT_VISIBLE_CORES_ENV_VAR = "NEURON_RT_VISIBLE_CORES"
TPU_VISIBLE_CHIPS_ENV_VAR = "TPU_VISIBLE_CHIPS"
NPU_RT_VISIBLE_DEVICES_ENV_VAR = "ASCEND_RT_VISIBLE_DEVICES"

NEURON_CORES = "neuron_cores"
GPU = "GPU"
TPU = "TPU"
NPU = "NPU"
HPU = "HPU"


RAY_WORKER_NICENESS = "RAY_worker_niceness"
108 changes: 108 additions & 0 deletions python/ray/air/_internal/device_manager/__init__.py
Contributor

Can we remove this high level DeviceManager abstraction and just keep the TorchDeviceManager. No need for the extra abstraction for now.

Contributor

Also, move this entire folder to ray/train/torch/_internal instead.

Contributor Author

> Can we remove this high level DeviceManager abstraction and just keep the TorchDeviceManager. No need for the extra abstraction for now.

OK, just introduce the TorchDeviceManager for now.

Contributor Author

> Also, move this entire folder to ray/train/torch/_internal instead.

We are trying to extend the DeviceManager in ray.air to support more third-party devices, and the plan is to not only use it for Ray Train, but also include RLlib and others. So it seems more reasonable to maintain it within air. What do you think?

Contributor

Let's just keep it in Train for now.

Contributor Author

@liuxsh9 liuxsh9 Aug 8, 2024

Currently, Train relies on get_devices in AIR, so it's natural for us to implement the DeviceManager in AIR so that it returns the correct device to Train. If we move the DeviceManager to Train, it would create a circular dependency where Train calls AIR's get_devices, which in turn calls back into Train's DeviceManager. Would you mind elaborating on your thoughts about this part?

Contributor

@matthewdeng WDYT?

Contributor

Let's just put it in ray.air for now. Can restructure the package in the future if needed.

@@ -0,0 +1,108 @@
import logging
import threading
from typing import Optional, Type

import ray
import ray._private.ray_constants as ray_constants
from ray.air._internal.device_manager.cpu import CPUTorchDeviceManager
from ray.air._internal.device_manager.hpu import HPUTorchDeviceManager
from ray.air._internal.device_manager.npu import NPUTorchDeviceManager
from ray.air._internal.device_manager.nvidia_gpu import CUDATorchDeviceManager
from ray.air._internal.device_manager.torch_device_manager import TorchDeviceManager

logger = logging.getLogger(__name__)


DEFAULT_TORCH_DEVICE_MANAGER_CLS = CPUTorchDeviceManager


SUPPORTED_ACCELERATOR_TORCH_DEVICE_MANAGER = {
    ray_constants.GPU: CUDATorchDeviceManager,
    ray_constants.HPU: HPUTorchDeviceManager,
    ray_constants.NPU: NPUTorchDeviceManager,
}


def register_custom_torch_dist_backend(backend: Optional[str] = None) -> None:
    if backend == "hccl":
        # The name for the communication backend of Habana and torch-npu is the same.
        HPUTorchDeviceManager.register_custom_torch_dist_backend()

        NPUTorchDeviceManager.register_custom_torch_dist_backend()


def get_torch_device_manager_cls_by_resources(
Contributor

Any reason not to define these functions in torch_device_manager.py instead?

Contributor Author

Currently, these methods depend on the various XPUDeviceManager implementations, and each XPUDeviceManager depends on TorchDeviceManager in torch_device_manager.py.

If we put these methods into torch_device_manager.py, it may lead to circular import issues.

We followed the design of AcceleratorManager, which puts the higher-level API functions in __init__.py.

My understanding may be limited; could you please provide more information?

    resources: Optional[dict],
) -> Type[TorchDeviceManager]:
    existing_device_manager = None

    # input resources may be None
    if not resources:
        return DEFAULT_TORCH_DEVICE_MANAGER_CLS

    # select correct accelerator type from resources
    for resource_type, resource_value in resources.items():
        device_manager = SUPPORTED_ACCELERATOR_TORCH_DEVICE_MANAGER.get(
            resource_type, None
        )
        if resource_value and device_manager:
            # An error will raise when multiple accelerators are specified.
            if existing_device_manager:
                raise RuntimeError(
                    "Unable to determine the appropriate DeviceManager "
                    f"for the specified resources {resources}."
                )
            else:
                existing_device_manager = device_manager

    return existing_device_manager or DEFAULT_TORCH_DEVICE_MANAGER_CLS


def get_torch_device_manager_cls_by_device_type(device_type: str):
    if device_type.lower() == ray_constants.GPU.lower() or device_type == "cuda":
        return CUDATorchDeviceManager
    elif device_type.lower() == ray_constants.NPU.lower():
        return NPUTorchDeviceManager
    elif device_type.lower() == ray_constants.HPU.lower():
        return HPUTorchDeviceManager
    elif device_type.lower() == "cpu":
        return CPUTorchDeviceManager

    raise RuntimeError(f"Device type {device_type} cannot be recognized.")


_torch_device_manager = None
_torch_device_manager_lock = threading.Lock()
Comment on lines +34 to +35
Contributor

Why do we need a global variable to track this? Can this not just be tracked as an instance variable by the caller?

Contributor

One issue with global variables is that we want these functions to be called by the individual workers, so they aren't pointing to these references.

Contributor Author

This addresses a previous reviewer's feedback. Additionally, we will check for _torch_device_manager within the worker and initialize it if it doesn't exist. Does this meet the users' requirements?
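
The pattern discussed in this thread, a module-level instance that is created lazily under a lock, can be sketched in isolation. This is a minimal illustration, not the PR's code: `get_manager`, `_manager`, and the `factory` argument are hypothetical names. Because module globals live per process, each worker process would build its own instance the first time it calls the getter.

```python
import threading
from typing import Callable, Optional

# Hypothetical module-level state; each process gets its own copy.
_manager: Optional[object] = None
_manager_lock = threading.Lock()


def get_manager(factory: Callable[[], object]) -> object:
    """Return the per-process instance, creating it on first use."""
    global _manager
    with _manager_lock:
        # Only the first caller in this process runs the factory.
        if _manager is None:
            _manager = factory()
        return _manager


created = []


def make_manager() -> object:
    created.append("built")
    return object()


first = get_manager(make_manager)
second = get_manager(make_manager)
print(first is second, len(created))
```

The reviewer's alternative, keeping the instance as an attribute on the caller, avoids the global entirely at the cost of threading the object through every call site.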



def get_torch_device_manager(device_type: Optional[str] = None) -> TorchDeviceManager:
Contributor

IMO we should consider removing this function. It's convenient to have a single function like this, but for all the usages it seems we want one explicit path (either with or without a device), so just calling that explicit logic directly is easier to follow.

Contributor Author

Good suggestion, this has been modified to have two clear function entry points

    if device_type:
        # Specify the device type to retrieve the device manager directly,
        # rather than relying on the remote environment to determine it.
        return get_torch_device_manager_cls_by_device_type(device_type)()

    with _torch_device_manager_lock:
        if not _torch_device_manager:
            init_torch_device_manager()

    return _torch_device_manager


def init_torch_device_manager() -> None:
    global _torch_device_manager

    resources = ray.get_runtime_context().get_accelerator_ids()

    _torch_device_manager = get_torch_device_manager_cls_by_resources(resources)()


__all__ = [
    "TorchDeviceManager",
    "CPUTorchDeviceManager",
    "CUDATorchDeviceManager",
    "HPUTorchDeviceManager",
    "NPUTorchDeviceManager",
    "register_custom_torch_dist_backend",
    "get_torch_device_manager",
    "init_torch_device_manager",
]
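
The resource-based selection in this file can be exercised without Ray or torch installed. The sketch below is a simplified stand-in, not the PR's module: the empty classes only mimic the real managers, and `SUPPORTED` and `pick_manager_cls` are hypothetical names. It shows the three outcomes: fall back to CPU, pick the single accelerator present, or fail when more than one accelerator type is assigned.

```python
from typing import Optional, Type


class TorchDeviceManager:
    """Stand-in for the real base class (assumption for this sketch)."""


class CPUTorchDeviceManager(TorchDeviceManager):
    pass


class CUDATorchDeviceManager(TorchDeviceManager):
    pass


class NPUTorchDeviceManager(TorchDeviceManager):
    pass


# Hypothetical registry keyed by Ray resource name.
SUPPORTED = {"GPU": CUDATorchDeviceManager, "NPU": NPUTorchDeviceManager}


def pick_manager_cls(resources: Optional[dict]) -> Type[TorchDeviceManager]:
    if not resources:
        return CPUTorchDeviceManager
    chosen = None
    for resource_type, resource_value in resources.items():
        candidate = SUPPORTED.get(resource_type)
        # Skip unknown resource types and accelerators with no IDs assigned.
        if resource_value and candidate:
            if chosen:
                raise RuntimeError(
                    f"Ambiguous accelerator resources: {resources}."
                )
            chosen = candidate
    return chosen or CPUTorchDeviceManager


print(pick_manager_cls(None).__name__)
print(pick_manager_cls({"GPU": ["0"], "NPU": []}).__name__)
```

An empty ID list counts as "not assigned", so a worker whose accelerator ID mapping lists every accelerator type but populates only one still resolves cleanly.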
30 changes: 30 additions & 0 deletions python/ray/air/_internal/device_manager/cpu.py
@@ -0,0 +1,30 @@
from contextlib import contextmanager
from typing import List

import torch

from ray.air._internal.device_manager.torch_device_manager import TorchDeviceManager


class CPUTorchDeviceManager(TorchDeviceManager):
    """CPU device manager"""

    def is_available(self) -> bool:
        return True

    def get_devices(self) -> List[torch.device]:
        """Gets the correct torch device list configured for this process."""
        return [torch.device("cpu")]

    def supports_stream(self) -> bool:
        """Check whether the device type supports creating a stream."""
        return False

    def get_stream_context(self, stream):
        """Return an empty context manager for CPU."""

        @contextmanager
        def default_context_manager():
            yield

        return default_context_manager()
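
For comparison, the hand-written no-op context manager above behaves the same as `contextlib.nullcontext()` from the standard library. The sketch below places the two side by side; the function names are hypothetical and this is not a change proposed in the PR.

```python
from contextlib import contextmanager, nullcontext


def stream_context_handwritten(stream):
    """Mirrors the nested @contextmanager approach used in the diff."""

    @contextmanager
    def default_context_manager():
        yield

    return default_context_manager()


def stream_context_stdlib(stream):
    """Same behavior via the standard library helper."""
    return nullcontext()


results = []
for make_context in (stream_context_handwritten, stream_context_stdlib):
    # Both contexts do nothing on enter and exit and bind None to `value`.
    with make_context(None) as value:
        results.append(value)

print(results)
```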
50 changes: 50 additions & 0 deletions python/ray/air/_internal/device_manager/hpu.py
@@ -0,0 +1,50 @@
from contextlib import contextmanager
from typing import List, Union

import torch

from ray._private.accelerators.hpu import HPU_PACKAGE_AVAILABLE
from ray.air._internal.device_manager.torch_device_manager import TorchDeviceManager

if HPU_PACKAGE_AVAILABLE:
    import habana_frameworks.torch.hpu as torch_hpu


class HPUTorchDeviceManager(TorchDeviceManager):
Member

Also invite @harborn @kira-lin to review the HPU device manager.

    """HPU device manager"""

    @staticmethod
    def register_custom_torch_dist_backend():
        if HPU_PACKAGE_AVAILABLE:
            import habana_frameworks.torch.core  # noqa: F401
            import habana_frameworks.torch.distributed.hccl  # noqa: F401

    def is_available(self) -> bool:
        if not HPU_PACKAGE_AVAILABLE:
            return False

        return torch_hpu.is_available()

    def get_devices(self) -> List[torch.device]:
        if not self.is_available():
            raise RuntimeError(
                "Using HPUTorchDeviceManager but torch hpu is not available."
            )

        return [torch.device("hpu")]

    def set_device(self, device: Union[torch.device, int, str, None]):
        torch_hpu.set_device(device)

    def supports_stream(self) -> bool:
        """Check whether the device type supports creating a stream."""
        return False

    def get_stream_context(self, stream):
        """Get HPU stream context manager, empty so far."""

        @contextmanager
        def default_context_manager():
            yield

        return default_context_manager()
105 changes: 105 additions & 0 deletions python/ray/air/_internal/device_manager/npu.py
@@ -0,0 +1,105 @@
import os
from importlib.util import find_spec
from typing import List, Union

import torch

import ray
import ray._private.ray_constants as ray_constants
from ray.air._internal.device_manager.torch_device_manager import TorchDeviceManager


def is_package_present(package_name: str) -> bool:
    try:
        return find_spec(package_name) is not None
    except ModuleNotFoundError:
        return False


NPU_TORCH_PACKAGE_AVAILABLE = is_package_present("torch_npu")


if NPU_TORCH_PACKAGE_AVAILABLE:
    import torch_npu  # noqa: F401


class NPUTorchDeviceManager(TorchDeviceManager):
    """Ascend NPU device manager"""

    @staticmethod
    def register_custom_torch_dist_backend():
        if NPU_TORCH_PACKAGE_AVAILABLE:
            import torch_npu  # noqa: F401, F811

    def is_available(self) -> bool:
        if not NPU_TORCH_PACKAGE_AVAILABLE:
            return False

        return torch.npu.is_available()

    def get_devices(self) -> List[torch.device]:
        """Gets the correct torch device list configured for this process.

        Returns a list of torch NPU devices allocated for the current worker.
        If no NPUs are assigned, then it returns a list with the 0th NPU device.
        """
        if NPU_TORCH_PACKAGE_AVAILABLE and torch.npu.is_available():
            npu_ids = [
                str(id)
                for id in ray.get_runtime_context().get_accelerator_ids()[
                    ray_constants.NPU
                ]
            ]

            device_ids = []

            if len(npu_ids) > 0:
                npu_visible_str = os.environ.get(
                    ray_constants.NPU_RT_VISIBLE_DEVICES_ENV_VAR, ""
                )
                if npu_visible_str and npu_visible_str != "NoDevFiles":
                    npu_visible_list = npu_visible_str.split(",")
                else:
                    npu_visible_list = []

                for npu_id in npu_ids:
                    try:
                        device_ids.append(npu_visible_list.index(npu_id))
                    except ValueError:
                        raise RuntimeError(
                            "ASCEND_RT_VISIBLE_DEVICES set incorrectly. "
                            f"Got {npu_visible_str}, expected to include {npu_id}. "
                            "Did you override the `ASCEND_RT_VISIBLE_DEVICES` "
                            "environment variable?"
                        )
            else:
                # If called on the driver or outside of Ray Train, return the
                # 0th device.
                device_ids.append(0)

            devices = [torch.device(f"npu:{device_id}") for device_id in device_ids]
        else:
            raise RuntimeError(
                "Using NPUTorchDeviceManager but torch npu is not available."
            )

        return devices

    def set_device(self, device: Union[torch.device, int]):
        torch.npu.set_device(device)

    def supports_stream(self) -> bool:
        """Check whether the device type supports creating a stream."""
        return True

    def create_stream(self, device):
        """Create a stream on NPU device"""
        return torch.npu.Stream(device)

    def get_stream_context(self, stream):
        """Get a torch.stream context on NPU device"""
        return torch.npu.stream(stream)

    def get_current_stream(self):
        """Get current stream for NPU device"""
        return torch.npu.current_stream()
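
The core of `get_devices` is a translation from the physical accelerator IDs Ray assigned to the logical indices torch sees once the visible-devices variable has been applied. The sketch below isolates that step as a pure function so it can be run without torch or Ray; `to_logical_indices` is a hypothetical helper, not part of the PR. Note that `list.index` signals a missing element with `ValueError`, so that is the exception caught here.

```python
from typing import List


def to_logical_indices(
    assigned_ids: List[str], visible_str: str, env_name: str
) -> List[int]:
    """Map assigned physical device IDs to positions in the visible list."""
    if visible_str and visible_str != "NoDevFiles":
        visible = visible_str.split(",")
    else:
        visible = []

    indices = []
    for device_id in assigned_ids:
        try:
            indices.append(visible.index(device_id))
        except ValueError:
            raise RuntimeError(
                f"{env_name} set incorrectly. "
                f"Got {visible_str!r}, expected to include {device_id}."
            ) from None
    return indices


# Physical devices 4 and 6 sit at positions 1 and 2 of the visible list.
print(to_logical_indices(["4", "6"], "2,4,6", "ASCEND_RT_VISIBLE_DEVICES"))
```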
79 changes: 79 additions & 0 deletions python/ray/air/_internal/device_manager/nvidia_gpu.py
@@ -0,0 +1,79 @@
import os
from typing import List, Union

import torch

import ray
from ray.air._internal.device_manager.torch_device_manager import TorchDeviceManager


class CUDATorchDeviceManager(TorchDeviceManager):
    """CUDA device manager"""

    def is_available(self) -> bool:
        return torch.cuda.is_available()

    def get_devices(self) -> List[torch.device]:
        """Gets the correct torch device list configured for this process.

        Returns a list of torch CUDA devices allocated for the current worker.
        If no GPUs are assigned, then it returns a list with the 0th CUDA device.

        Assumes that `CUDA_VISIBLE_DEVICES` is set and is a
        superset of the `ray.get_gpu_ids()`.
        """

        # GPU IDs are assigned by Ray after you specify "use_gpu"
        # GPU `ray.get_gpu_ids()` may return ints or may return strings.
        # We should always convert to strings.
        gpu_ids = [str(id) for id in ray.get_gpu_ids()]

        device_ids = []

        if len(gpu_ids) > 0:
            cuda_visible_str = os.environ.get("CUDA_VISIBLE_DEVICES", "")
            if cuda_visible_str and cuda_visible_str != "NoDevFiles":
                cuda_visible_list = cuda_visible_str.split(",")
            else:
                cuda_visible_list = []

            # By default, there should only be one GPU ID if `use_gpu=True`.
            # If there are multiple GPUs, return a list of devices.
            # If using fractional GPUs, these IDs are not guaranteed
            # to be unique across different processes.
            for gpu_id in gpu_ids:
                try:
                    device_ids.append(cuda_visible_list.index(gpu_id))
                except ValueError:
                    raise RuntimeError(
                        "CUDA_VISIBLE_DEVICES set incorrectly. "
                        f"Got {cuda_visible_str}, expected to include {gpu_id}. "
                        "Did you override the `CUDA_VISIBLE_DEVICES` environment"
                        " variable? If not, please help file an issue on Github."
                    )

        else:
            # If called on the driver or outside of Ray Train, return the
            # 0th device.
            device_ids.append(0)

        return [torch.device(f"cuda:{device_id}") for device_id in device_ids]

    def set_device(self, device: Union[torch.device, int, str, None]):
        torch.cuda.set_device(device)

    def supports_stream(self) -> bool:
        """Check whether the device type supports creating a stream."""
        return True

    def create_stream(self, device: torch.device) -> torch.cuda.Stream:
        """Create a stream on cuda device"""
        return torch.cuda.Stream(device)

    def get_stream_context(self, stream):
        """Get a stream context for cuda device"""
        return torch.cuda.stream(stream)

    def get_current_stream(self) -> torch.cuda.Stream:
        """Get current stream for cuda device"""
        return torch.cuda.current_stream()
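
The stream-related methods let calling code stay device-agnostic: ask `supports_stream()` first, create a stream only when the answer is yes, and always enter `get_stream_context(...)`. The callers are not part of this diff, so the sketch below is an assumption about how they might use the interface. It runs without torch by using two fake managers, and `run_on_device` is a hypothetical helper.

```python
from contextlib import nullcontext


class FakeStreamlessManager:
    """Mimics the CPU/HPU managers: no stream support, no-op context."""

    def supports_stream(self) -> bool:
        return False

    def get_stream_context(self, stream):
        return nullcontext()


class FakeStreamManager:
    """Mimics the CUDA/NPU managers; records stream creation, no real device."""

    def __init__(self):
        self.created = []

    def supports_stream(self) -> bool:
        return True

    def create_stream(self, device):
        self.created.append(device)
        return f"stream-on-{device}"

    def get_stream_context(self, stream):
        return nullcontext(stream)


def run_on_device(manager, device, work):
    # Only create a stream when the device type supports one.
    stream = manager.create_stream(device) if manager.supports_stream() else None
    with manager.get_stream_context(stream):
        return work()


streaming = FakeStreamManager()
print(run_on_device(FakeStreamlessManager(), "cpu", lambda: 42))
print(run_on_device(streaming, "dev0", lambda: 42), streaming.created)
```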