Add test coverage & add mock test for full milabench runs (#234)
* Remove numpy 2.0.0

* Try to bypass HPU hangs

* Ensure that torch metadata gathering cannot fail

* Add unit test run

* Update tests

* Remove duplicated code

* Add dry command regression tests

* Do a mock run where everything except run is executed

* Add coverage

* Run mock milabench on a per bench basis to identify issues more easily

* Make sure install failure triggers test failure

* Retrieve exception raised inside asyncio

* ensure deterministic order

* Update Python to 3.11

---------

Co-authored-by: pierre.delaunay <[email protected]>
Co-authored-by: Pierre Delaunay <[email protected]>
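
One of the bullets above mentions retrieving exceptions raised inside asyncio. As a hedged sketch of the general pattern (hypothetical code, not milabench's actual implementation): `asyncio.gather(..., return_exceptions=True)` returns task exceptions as values, so a runner can collect and report them instead of losing them when a task is garbage collected.

```python
import asyncio

async def failing_bench():
    # stand-in for a benchmark coroutine that crashes
    raise RuntimeError("benchmark crashed")

def run_and_collect_errors():
    async def main():
        # return_exceptions=True surfaces exceptions as results
        # instead of propagating the first one and dropping the rest
        return await asyncio.gather(failing_bench(), return_exceptions=True)

    results = asyncio.run(main())
    return [r for r in results if isinstance(r, BaseException)]
```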
3 people authored Jul 10, 2024
1 parent d669feb commit 127d18c
Showing 46 changed files with 3,030 additions and 1,168 deletions.
@@ -47,25 +47,22 @@ jobs:
steps:
- uses: actions/setup-python@v4
with:
python-version: '3.10'
python-version: '3.11'

- name: Check out repository code
uses: actions/checkout@v3

- name: dependencies
run: |
if [[ ! -d "~/.cargo/bin" ]]; then
wget --no-check-certificate --secure-protocol=TLSv1_2 -qO- https://sh.rustup.rs | sh -s -- -y
fi
export PATH="~/.cargo/bin:${PATH}"
python -m pip install -U pip
python -m pip install -U poetry
- name: install
run: |
pip install pytest
poetry lock --no-update
pip install -e .
poetry install --with dev
source $(poetry env info -p)/bin/activate
pip install psycopg2-binary
- name: tests
env:
@@ -74,4 +71,4 @@ jobs:
POSTGRES_DB: milabench
POSTGRES_HOST: localhost
POSTGRES_PORT: 5432
run: pytest tests/integration
run: poetry run pytest tests/integration
@@ -1,4 +1,4 @@
name: tests
name: run

on:
# Runs every sunday
@@ -55,7 +55,7 @@ jobs:
- uses: conda-incubator/setup-miniconda@v2
with:
auto-activate-base: false
python-version: 3.10
python-version: 3.11
miniconda-version: "latest"
activate-environment: test

67 changes: 67 additions & 0 deletions .github/workflows/tests_unit.yml
@@ -0,0 +1,67 @@
name: unit

on:
push:

# Runs for pull requests
pull_request:
branches:
- master

# Runs on publish
release:
types:
[published]

# Allow manual triggers
workflow_dispatch:


jobs:
tests:
runs-on: ubuntu-latest

# Cancel previous jobs if a new version was pushed
concurrency:
group: "${{ github.ref }}-${{ matrix.arch }}"
cancel-in-progress: true

steps:
- uses: actions/checkout@v3

- uses: actions/setup-python@v5
with:
python-version: '3.11'

- name: dependencies
run: |
pip install -U pip
pip install poetry
poetry env use python3.11
source $(poetry env info -p)/bin/activate
#
# poetry does not work when installing those !?
#
pip install antlr4-python3-runtime==4.9.3
pip install -e .
pip install -e benchmate
#
#
#
poetry install --with dev
- name: tests
run: |
source $(poetry env info -p)/bin/activate
coverage run --source=milabench -m pytest --ignore=tests/integration tests/
coverage report -m
coverage xml
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v3
with:
file: ./coverage.xml
flags: unittests
env_vars: PLATFORM,PYTHON
name: codecov-umbrella
fail_ci_if_error: false
2 changes: 1 addition & 1 deletion README.md
@@ -83,7 +83,7 @@ The benchmark suite has been validated on the following configurations:

| Python version | GPU | Configuration file |
| - | - | - |
| 3.9.12 (conda) | 2 node x 8xNVIDIA A100 80GB | config/standard.yaml |
| 3.11 (conda) | 2 node x 8xNVIDIA A100 80GB | config/standard.yaml |
| 3.9.12 (conda) | 8x NVIDIA RTX8000 48GB | config/standard.yaml |
| 3.9.16 (conda) | 2x NVIDIA K80 | config/ci.yaml |
| 3.9.16 (conda) | 2x AMD MI100 | config/ci.yaml |
10 changes: 0 additions & 10 deletions benchmarks/accelerate_opt/main.py
@@ -126,16 +126,6 @@ class CustomInitProcessGroupKwargs(InitProcessGroupKwargs):
world_size=int(os.environ["WORLD_SIZE"]),
)

# Accelerator SUCK, it is impossible to make it use hccl
# We can bypass Accelerator logic by initializing the group ourselves
if acc.device_type == "hpu":
acc.init_process_group(
init_method=f"tcp://{MASTER_ADDR}:{MASTER_PORT}",
timeout=timedelta(seconds=60),
rank=int(os.environ["RANK"]),
world_size=int(os.environ["WORLD_SIZE"]),
)

accelerator = Accelerator(kwargs_handlers=[init_process_group_kwargs])
else:
accelerator = Accelerator()
27 changes: 22 additions & 5 deletions benchmate/benchmate/datagen.py
Expand Up @@ -7,6 +7,7 @@
from collections import defaultdict
from pathlib import Path

import torchcompat.core as acc
import torch
from tqdm import tqdm

@@ -79,24 +80,40 @@ def generate_sets(root, sets, shape):
json.dump(sets, fp)


def device_count():
try:
return acc.device_count()
except:
return 1

def generate_fakeimagenet():
# config = json.loads(os.environ["MILABENCH_CONFIG"])

parser = argparse.ArgumentParser()
parser.add_argument("--batch-size", default=512, type=int)
parser.add_argument("--batch-count", default=60, type=int)
parser.add_argument("--device-count", default=device_count(), type=int)
parser.add_argument("--device", default=None, type=str)
parser.add_argument("--image-size", default=[3, 384, 384], type=int, nargs="+")
parser.add_argument("--val", default=0.1, type=float, nargs="+")
parser.add_argument("--test", default=0.1, type=float, nargs="+")

args, _ = parser.parse_known_args()

if overrides := os.getenv("MILABENCH_TESTING_PREPARE"):
bs, bc = overrides.split(",")
args.batch_size, args.batch_count = int(bs), int(bc)

data_directory = os.environ["MILABENCH_DIR_DATA"]
dest = os.path.join(data_directory, "FakeImageNet")

dest = os.path.join(data_directory, f"FakeImageNet")
print(f"Generating fake data into {dest}...")

total_images = args.batch_size * args.batch_count
total_images = args.batch_size * args.batch_count * args.device_count
size_spec = {
"train": total_images,
"val": int(total_images * args.val),
"test": int(total_images * args.test),
f"train": total_images,
f"val": int(total_images * args.val),
f"test": int(total_images * args.test),
}

generate_sets(dest, size_spec, args.image_size)
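
The sizing logic in the hunk above now scales the fake dataset with the device count, so every device gets `batch_count` full batches. A small sketch of the same arithmetic (`fakeimagenet_sizes` is a hypothetical helper name, not part of benchmate):

```python
def fakeimagenet_sizes(batch_size=512, batch_count=60, device_count=1,
                       val=0.1, test=0.1):
    # train size scales with the number of devices;
    # val/test are fixed fractions of the train set
    total = batch_size * batch_count * device_count
    return {"train": total, "val": int(total * val), "test": int(total * test)}
```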
17 changes: 17 additions & 0 deletions benchmate/benchmate/dataset.py
@@ -1,5 +1,9 @@
import os
from collections import defaultdict
import math

import torch
from torch.utils.data.distributed import DistributedSampler


def no_transform(args):
@@ -48,3 +52,16 @@ def __getitem__(self, item):

def __len__(self):
return len(self.clip)


class ExclusiveSetSampler(DistributedSampler):
def __init__(self, dataset, num_sets: int, set_id: int, shuffle: bool = True, seed: int = 0, drop_last: bool = False) -> None:
super().__init__(
dataset,
num_replicas=num_sets,
rank=set_id,
shuffle=shuffle,
seed=seed,
drop_last=drop_last
)
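
`ExclusiveSetSampler` reuses `DistributedSampler`'s partitioning so each set gets a disjoint slice of the dataset. A pure-Python sketch of that partitioning with shuffling and `drop_last` off (`exclusive_set_indices` is a hypothetical name for illustration):

```python
import math

def exclusive_set_indices(dataset_len, num_sets, set_id):
    # DistributedSampler pads the index list so it divides evenly,
    # then each "rank" (here: set_id) takes a strided slice
    total = math.ceil(dataset_len / num_sets) * num_sets
    indices = list(range(dataset_len))
    indices += indices[: total - dataset_len]  # pad by repeating from the front
    return indices[set_id::num_sets]
```

Aside from the padding duplicates, the sets are mutually exclusive and together cover the whole dataset.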

27 changes: 17 additions & 10 deletions benchmate/benchmate/warden.py
@@ -233,12 +233,12 @@ def __exit__(self, *args):
def destroy(*processes, step=1, timeout=30):
processes = list(processes)

def kill(proc, signal):
def kill(proc, sig):
try:
if getattr(proc, "did_setsid", False):
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
os.killpg(os.getpgid(proc.pid), sig)
else:
os.kill(proc.pid, signal.SIGTERM)
os.kill(proc.pid, sig)
except ProcessLookupError:
pass

@@ -249,11 +249,9 @@ def kill(proc, signal):
elapsed = 0
def wait(proc):
nonlocal elapsed

while (ret := proc.poll()) is None and elapsed < timeout:
time.sleep(step)
elapsed += step

return ret is None

k = 0
@@ -280,25 +278,34 @@ def wait(proc):


@contextmanager
def process_cleaner():
def process_cleaner(timeout=30):
"""Delay signal handling until all the processes have been killed"""

with Protected():
def kill_everything(processes):
def _():
destroy(*processes, timeout=timeout)

return _

with Protected() as signalhandler:
with GPUProcessWarden() as warden: # => SIGTERM all processes using GPUs
processes = []
try: # NOTE: we have not waited much between both signals

# when a signal is received kill the known processes first
# then handle the signal
signalhandler.stop = kill_everything(processes)

try: # NOTE: we have not waited much between both signals
warden.kill() # => SIGKILL all processes using GPUs

yield processes # => Run milabench, spawning processes for the benches

finally:
warden.terminate() # => SIGTERM all processes using GPUs

destroy(*processes) # => SIGTERM+SIGKILL milabench processes
destroy(*processes, timeout=timeout) # => SIGTERM+SIGKILL milabench processes

# destroy waited 30s

# warden.__exit__ # => SIGKILL all processes still using GPUs
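
The `process_cleaner` hunk above defers signal handling until the spawned processes have been killed. A minimal sketch of that defer-then-redeliver pattern (a simplified stand-in for the `Protected` handler, POSIX only, not milabench's actual code):

```python
import signal

class DelayedInterrupt:
    """Capture SIGINT/SIGTERM inside the block; handle them on exit."""

    def __init__(self):
        self.pending = None
        self.stop = lambda: None  # hook: e.g. kill known child processes first

    def _handler(self, signum, frame):
        self.pending = signum  # remember the signal, act later

    def __enter__(self):
        self._old = {s: signal.signal(s, self._handler)
                     for s in (signal.SIGINT, signal.SIGTERM)}
        return self

    def __exit__(self, *exc):
        for s, h in self._old.items():
            signal.signal(s, h)  # restore the original handlers
        if self.pending is not None:
            self.stop()                        # clean up first
            signal.raise_signal(self.pending)  # then deliver the signal
```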


4 changes: 2 additions & 2 deletions config/slurm.yaml
@@ -6,12 +6,12 @@ multi-node-full:
# DGX run: 2 nodes x 8 A100 80Go SXM4
- --partition=staff-idt
- -w cn-d[003-004]
- --ntasks=1
- --ntasks=2
- --gpus-per-task=a100l:8
- --exclusive
- --nodes=2
- --cpus-per-task=128
- --time=1:30:00
- --time=2:00:00
- --ntasks-per-node=1
- --mem=0

6 changes: 3 additions & 3 deletions milabench/_version.py
@@ -1,5 +1,5 @@
"""This file is generated, do not modify"""

__tag__ = "v0.1.0-32-ge9e52501"
__commit__ = "e9e52501ad92d2ee2dac97e66f601a0458404986"
__date__ = "2024-06-26 02:37:50 -0400"
__tag__ = "v0.1.0-24-gdd7f3888"
__commit__ = "dd7f3888ac0524b3b587e415d1de0e2019cd751f"
__date__ = "2024-07-03 16:07:47 -0400"
