Multi node check #234

Merged: 16 commits, Jul 10, 2024
@@ -47,25 +47,22 @@ jobs:
steps:
- uses: actions/setup-python@v4
with:
python-version: '3.10'
python-version: '3.11'

- name: Check out repository code
uses: actions/checkout@v3

- name: dependencies
run: |
if [[ ! -d "~/.cargo/bin" ]]; then
wget --no-check-certificate --secure-protocol=TLSv1_2 -qO- https://sh.rustup.rs | sh -s -- -y
fi
export PATH="~/.cargo/bin:${PATH}"
python -m pip install -U pip
python -m pip install -U poetry

- name: install
run: |
pip install pytest
poetry lock --no-update
pip install -e .
poetry install --with dev
source $(poetry env info -p)/bin/activate
pip install psycopg2-binary

- name: tests
env:
@@ -74,4 +71,4 @@ jobs:
POSTGRES_DB: milabench
POSTGRES_HOST: localhost
POSTGRES_PORT: 5432
run: pytest tests/integration
run: poetry run pytest tests/integration
@@ -1,4 +1,4 @@
name: tests
name: run

on:
# Runs every sunday
@@ -55,7 +55,7 @@ jobs:
- uses: conda-incubator/setup-miniconda@v2
with:
auto-activate-base: false
python-version: 3.10
python-version: 3.11
miniconda-version: "latest"
activate-environment: test

67 changes: 67 additions & 0 deletions .github/workflows/tests_unit.yml
@@ -0,0 +1,67 @@
name: unit

on:
push:

# Runs for pull requests
pull_request:
branches:
- master

# Runs on publish
release:
types:
[published]

# Allow manual triggers
workflow_dispatch:


jobs:
tests:
runs-on: ubuntu-latest

# Cancel previous jobs if a new version was pushed
concurrency:
group: "${{ github.ref }}-${{ matrix.arch }}"
cancel-in-progress: true

steps:
- uses: actions/checkout@v3

- uses: actions/setup-python@v5
with:
python-version: '3.11'

- name: dependencies
run: |
pip install -U pip
pip install poetry
poetry env use python3.11
source $(poetry env info -p)/bin/activate
#
# poetry does not work when installing those !?
#
pip install antlr4-python3-runtime==4.9.3
pip install -e .
pip install -e benchmate
#
#
#
poetry install --with dev

- name: tests
run: |
source $(poetry env info -p)/bin/activate
coverage run --source=milabench -m pytest --ignore=tests/integration tests/
coverage report -m
coverage xml

- name: Upload coverage to Codecov
uses: codecov/codecov-action@v3
with:
file: ./coverage.xml
flags: unittests
env_vars: PLATFORM,PYTHON
name: codecov-umbrella
fail_ci_if_error: false
2 changes: 1 addition & 1 deletion README.md
@@ -83,7 +83,7 @@ The benchmark suite has been validated on the following configurations:

| Python version | GPU | Configuration file |
| - | - | - |
| 3.9.12 (conda) | 2 node x 8xNVIDIA A100 80GB | config/standard.yaml |
| 3.11 (conda) | 2 node x 8xNVIDIA A100 80GB | config/standard.yaml |
| 3.9.12 (conda) | 8x NVIDIA RTX8000 48GB | config/standard.yaml |
| 3.9.16 (conda) | 2x NVIDIA K80 | config/ci.yaml |
| 3.9.16 (conda) | 2x AMD MI100 | config/ci.yaml |
10 changes: 0 additions & 10 deletions benchmarks/accelerate_opt/main.py
@@ -126,16 +126,6 @@ class CustomInitProcessGroupKwargs(InitProcessGroupKwargs):
world_size=int(os.environ["WORLD_SIZE"]),
)

# Accelerator SUCK, it is impossible to make it use hccl
# We can bypass Accelerator logic by initializing the group ourselves
if acc.device_type == "hpu":
acc.init_process_group(
init_method=f"tcp://{MASTER_ADDR}:{MASTER_PORT}",
timeout=timedelta(seconds=60),
rank=int(os.environ["RANK"]),
world_size=int(os.environ["WORLD_SIZE"]),
)

accelerator = Accelerator(kwargs_handlers=[init_process_group_kwargs])
else:
accelerator = Accelerator()
27 changes: 22 additions & 5 deletions benchmate/benchmate/datagen.py
@@ -7,6 +7,7 @@
from collections import defaultdict
from pathlib import Path

import torchcompat.core as acc
import torch
from tqdm import tqdm

@@ -79,24 +80,40 @@ def generate_sets(root, sets, shape):
json.dump(sets, fp)


def device_count():
try:
return acc.device_count()
except:
# fall back to a single device when no accelerator backend is available
return 1

def generate_fakeimagenet():
# config = json.loads(os.environ["MILABENCH_CONFIG"])

parser = argparse.ArgumentParser()
parser.add_argument("--batch-size", default=512, type=int)
parser.add_argument("--batch-count", default=60, type=int)
parser.add_argument("--device-count", default=device_count(), type=int)
parser.add_argument("--device", default=None, type=str)
parser.add_argument("--image-size", default=[3, 384, 384], type=int, nargs="+")
parser.add_argument("--val", default=0.1, type=float, nargs="+")
parser.add_argument("--test", default=0.1, type=float, nargs="+")

args, _ = parser.parse_known_args()

if overrides := os.getenv("MILABENCH_TESTING_PREPARE"):
bs, bc = overrides.split(",")
args.batch_size, args.batch_count = int(bs), int(bc)

data_directory = os.environ["MILABENCH_DIR_DATA"]
dest = os.path.join(data_directory, "FakeImageNet")

dest = os.path.join(data_directory, f"FakeImageNet")
print(f"Generating fake data into {dest}...")

total_images = args.batch_size * args.batch_count
total_images = args.batch_size * args.batch_count * args.device_count
size_spec = {
"train": total_images,
"val": int(total_images * args.val),
"test": int(total_images * args.test),
f"train": total_images,
f"val": int(total_images * args.val),
f"test": int(total_images * args.test),
}

generate_sets(dest, size_spec, args.image_size)
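
The sizing change above makes the generated dataset scale with the number of accelerators, so every device can draw its full --batch-count worth of batches without reusing images. A minimal sketch of the resulting split sizes, using the parser defaults from the diff and an assumed 8-device node (the numbers are illustrative only, not part of the PR):

# Sketch: mirrors the sizing logic of generate_fakeimagenet() above.
# batch_size/batch_count are the parser defaults; device_count=8 is an assumption.
batch_size, batch_count, device_count = 512, 60, 8
val_ratio, test_ratio = 0.1, 0.1

total_images = batch_size * batch_count * device_count  # 245760
size_spec = {
    "train": total_images,                   # 245760 images
    "val": int(total_images * val_ratio),    # 24576 images
    "test": int(total_images * test_ratio),  # 24576 images
}
print(size_spec)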
17 changes: 17 additions & 0 deletions benchmate/benchmate/dataset.py
@@ -1,5 +1,9 @@
import os
from collections import defaultdict
import math

import torch
from torch.utils.data.distributed import DistributedSampler


def no_transform(args):
@@ -48,3 +52,16 @@ def __getitem__(self, item):

def __len__(self):
return len(self.clip)


class ExclusiveSetSampler(DistributedSampler):
def __init__(self, dataset, num_sets: int, set_id: int, shuffle: bool = True, seed: int = 0, drop_last: bool = False) -> None:
super().__init__(
dataset,
num_replicas=num_sets,
rank=set_id,
shuffle=shuffle,
seed=seed,
drop_last=drop_last
)
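
ExclusiveSetSampler is a thin wrapper over torch's DistributedSampler: num_sets plays the role of the world size and set_id the rank, so each set id iterates over a disjoint slice of the dataset. A hedged usage sketch, assuming the package layout shown in the diff (the dataset and loader below are illustrative, not part of the PR):

# Illustrative only: partition one dataset into 4 disjoint subsets
# and build a loader over subset 0.
import torch
from torch.utils.data import DataLoader, TensorDataset

from benchmate.dataset import ExclusiveSetSampler

dataset = TensorDataset(torch.arange(100))
sampler = ExclusiveSetSampler(dataset, num_sets=4, set_id=0, shuffle=False)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

# Each set_id in range(4) sees a distinct ~25-sample slice, exactly like
# DistributedSampler with num_replicas=4 and rank=set_id.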

27 changes: 17 additions & 10 deletions benchmate/benchmate/warden.py
@@ -233,12 +233,12 @@ def __exit__(self, *args):
def destroy(*processes, step=1, timeout=30):
processes = list(processes)

def kill(proc, signal):
def kill(proc, sig):
try:
if getattr(proc, "did_setsid", False):
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
os.killpg(os.getpgid(proc.pid), sig)
else:
os.kill(proc.pid, signal.SIGTERM)
os.kill(proc.pid, sig)
except ProcessLookupError:
pass

@@ -249,11 +249,9 @@ def kill(proc, signal):
elapsed = 0
def wait(proc):
nonlocal elapsed

while (ret := proc.poll()) is None and elapsed < timeout:
time.sleep(step)
elapsed += step

return ret is None

k = 0
@@ -280,25 +278,34 @@ def wait(proc):


@contextmanager
def process_cleaner():
def process_cleaner(timeout=30):
"""Delay signal handling until all the processes have been killed"""

with Protected():
def kill_everything(processes):
def _():
destroy(*processes, timeout=timeout)

return _

with Protected() as signalhandler:
with GPUProcessWarden() as warden: # => SIGTERM all processes using GPUs
processes = []
try: # NOTE: we have not waited much between both signals

# when a signal is received kill the known processes first
# then handle the signal
signalhandler.stop = kill_everything(processes)

try: # NOTE: we have not waited much between both signals
warden.kill() # => SIGKILL all processes using GPUs

yield processes # => Run milabench, spawning processes for the benches

finally:
warden.terminate() # => SIGTERM all processes using GPUs

destroy(*processes) # => SIGTERM+SIGKILL milabench processes
destroy(*processes, timeout=timeout) # => SIGTERM+SIGKILL milabench processes

# destroy waited 30s

# warden.__exit__ # => SIGKILL all processes still using GPUs
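
With the new timeout parameter and the signalhandler.stop callback, process_cleaner tears down the tracked benchmark processes before a deferred signal is handled, not just on normal exit. A minimal usage sketch under the API shown in the diff (the sleep subprocess is a stand-in for a real bench, not part of the PR):

# Sketch only: register a child process with the cleaner so it is
# SIGTERM'd, then SIGKILL'd after the timeout, on exit or on a signal.
import subprocess

from benchmate.warden import process_cleaner

with process_cleaner(timeout=30) as processes:
    proc = subprocess.Popen(["sleep", "120"])
    processes.append(proc)
    # ... run the benchmark; destroy() reaps everything in `processes`
    # when the block exits or a signal is delivered.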


4 changes: 2 additions & 2 deletions config/slurm.yaml
@@ -6,12 +6,12 @@ multi-node-full:
# DGX run: 2 nodes x 8 A100 80Go SXM4
- --partition=staff-idt
- -w cn-d[003-004]
- --ntasks=1
- --ntasks=2
- --gpus-per-task=a100l:8
- --exclusive
- --nodes=2
- --cpus-per-task=128
- --time=1:30:00
- --time=2:00:00
- --ntasks-per-node=1
- --mem=0

6 changes: 3 additions & 3 deletions milabench/_version.py
@@ -1,5 +1,5 @@
"""This file is generated, do not modify"""

__tag__ = "v0.1.0-32-ge9e52501"
__commit__ = "e9e52501ad92d2ee2dac97e66f601a0458404986"
__date__ = "2024-06-26 02:37:50 -0400"
__tag__ = "v0.1.0-24-gdd7f3888"
__commit__ = "dd7f3888ac0524b3b587e415d1de0e2019cd751f"
__date__ = "2024-07-03 16:07:47 -0400"