[CI][GCI/3] Add variations attribute to create tests in multiple cluster environment #33718

Merged
merged 85 commits into from
Mar 31, 2023
Changes from 84 commits
Commits
57d69f8
Fix 'Observed wheel commit () is not expected' issue (https://github.…
can-anyscale Mar 18, 2023
3184500
Merge branch 'not-expected-wheel-commit'
can-anyscale Mar 18, 2023
d085526
Merge branch 'ray-project:master' into master
can-anyscale Mar 20, 2023
5c17ef9
Improve wheel commit validation error message
can-anyscale Mar 20, 2023
98212de
Merge branch 'ray-project:master' into master
can-anyscale Mar 23, 2023
c4638a6
Setup dependencies and crendential for GCE in buildkite
can-anyscale Mar 23, 2023
eb1b6a2
Add google-cloud-storage package to requirements
can-anyscale Mar 23, 2023
048545e
Merge branch 'ray-project:master' into gce-support-01
can-anyscale Mar 24, 2023
d02abc5
Add new lines to some files
can-anyscale Mar 24, 2023
3e78d18
Merge branch 'ray-project:master' into gce-support-01
can-anyscale Mar 25, 2023
1fb1997
Support for gs:// in anyscale job runner
can-anyscale Mar 26, 2023
7db91f5
Correct adding gce tests
can-anyscale Mar 26, 2023
a37234f
Support test definition with multiple flavors
can-anyscale Mar 26, 2023
fbfcc92
Use not in to check key in dict
can-anyscale Mar 26, 2023
9cd8412
Debugging
can-anyscale Mar 26, 2023
6a8e36b
Debugging
can-anyscale Mar 26, 2023
22792cc
Debugging 2
can-anyscale Mar 26, 2023
accd686
Debugging 03
can-anyscale Mar 26, 2023
235e877
Remove temoprary logs
can-anyscale Mar 27, 2023
f57b95b
-s
can-anyscale Mar 27, 2023
3c98173
Update flavors
can-anyscale Mar 27, 2023
dc325a3
Only initialize gs client on gs host
can-anyscale Mar 26, 2023
65d0577
Lint
can-anyscale Mar 27, 2023
4121ec4
Update image for Sematic integration (#33469)
augray Mar 26, 2023
170ec1c
[RLlib] fix preprocessor test (#33719)
ArturNiederfahrenhorst Mar 26, 2023
a0c8f1e
[RLlib] APPO TF with RLModule and Learner API (#33310)
avnishn Mar 26, 2023
d9e00cd
[Java] Prepend ":job_id:<jobid>" to java-worker-<jobid>-<pid>.log to …
jiafuzha Mar 27, 2023
c019896
[serve] Fix serve HA test (#33699)
zcin Mar 27, 2023
99eaefa
Revert "[tune-telemetry] Tag searcher and scheduler types. (#33561)" …
rkooo567 Mar 27, 2023
8450787
[tune] add data to CI test dependencies (#33729)
matthewdeng Mar 27, 2023
b314f31
[Test] Fix test event test timeout (#33704)
rkooo567 Mar 27, 2023
5ae9abc
[RLlib] Fixed a typo in multi-agent definition using RLModules in tes…
kouroshHakha Mar 27, 2023
090b579
[RLlib][RLModule] Disabled rl_module in one of the subtests in test_c…
kouroshHakha Mar 27, 2023
e1e36cb
[RLlib][RLModule] Disabled RLModule in Two trainer workflow example (…
kouroshHakha Mar 27, 2023
d7e87cf
[RLlib][RLModule] Disabled RLModule API on cartpole_ppo_fake_gpu test…
kouroshHakha Mar 27, 2023
eef5240
[Datasets] [Operator Fusion - 3/N] Add operator fusion benchmark. (#3…
clarkzinzow Mar 27, 2023
6b45157
[runtime env] Close schema after loading and continue on error (#33535)
jamesclark-Zapata Mar 27, 2023
e5b6f78
[Jobs] Fix race condition on submitting multiple jobs with the same i…
architkulkarni Mar 27, 2023
b5f1c3c
Retry REDIS_REPLY_ERROR for RedisClient::GetNextJobID (#33733)
jjyao Mar 27, 2023
971a9c8
Revert "Revert "[tune-telemetry] Tag searcher and scheduler types. (#…
xwjiang2010 Mar 27, 2023
3e8e902
Deprecate RuntimeContext.get (#33734)
jjyao Mar 27, 2023
90c60da
[Serve] Fix the serve.batch api doc (#33588)
sihanwang41 Mar 27, 2023
b8596e8
[infra] increase Build timeout (#33756)
clarng Mar 27, 2023
c1aac60
[RLlib][RLModule] Use forward_exploration() inside the unit-test for …
kouroshHakha Mar 27, 2023
ed9d773
[data] [streaming] Dataset.cache() doesn't work properly for streamin…
ericl Mar 27, 2023
561fa53
[Test] Fix the failing workflow test_dataset after streaming executor…
rkooo567 Mar 27, 2023
d62dfe8
[Test] Fix out of disk error (#33732)
rkooo567 Mar 27, 2023
1041e81
[Data] Repurpose streaming CI to bulk CI(#33478)
jianoaix Mar 27, 2023
f6e0028
[Serve] Enable serve metrics lib working in ray actor (#33717)
sihanwang41 Mar 27, 2023
074396a
[RLlib] Fix: Recovered eval worker should use eval-config's policy_ma…
sven1977 Mar 27, 2023
3a98259
[Data] Don't automatically move batches to device if `collate_fn` is …
amogkam Mar 27, 2023
d487977
@aslonnie's comments
can-anyscale Mar 28, 2023
d894a97
@aslonnie's comments
can-anyscale Mar 28, 2023
8053b3b
Run linter
can-anyscale Mar 28, 2023
34ff5d1
Run linters
can-anyscale Mar 28, 2023
abad535
Change ray to 2.3.1 to work around the #ir-glorious-shape
can-anyscale Mar 28, 2023
15ec7f7
Revert to normal ray image
can-anyscale Mar 28, 2023
8173e4e
Fix delete_fn
can-anyscale Mar 28, 2023
1355867
Merge branch 'master' into gce03
can-anyscale Mar 28, 2023
a0001d5
[CI][GCI/2] Add the ability for AnyscaleJobRunner to run on GCE host …
can-anyscale Mar 28, 2023
f21b22f
Run lint
can-anyscale Mar 28, 2023
4d13b7b
[RLlib] Fix APEX-DQN deprecated `add_batch` call (replace with `add`)…
avnishn Mar 28, 2023
cc1268d
Setup dependencies and crendential for GCE in buildkite
can-anyscale Mar 23, 2023
aa3453e
Add google-cloud-storage package to requirements
can-anyscale Mar 23, 2023
15965b5
Support for gs:// in anyscale job runner
can-anyscale Mar 26, 2023
f0ea4c4
Correct adding gce tests
can-anyscale Mar 26, 2023
38ce312
[RLlib] APPO TF with RLModule and Learner API (#33310)
avnishn Mar 26, 2023
81472b7
[data] [streaming] Dataset.cache() doesn't work properly for streamin…
ericl Mar 27, 2023
66c2de8
@aslonnie's comments
can-anyscale Mar 28, 2023
41f7636
Run linter
can-anyscale Mar 28, 2023
e87a6ce
Run linters
can-anyscale Mar 28, 2023
56447ae
Merge branch 'ray-project:master' into gce03
can-anyscale Mar 28, 2023
c68097d
-s
can-anyscale Mar 28, 2023
b352ffc
Fix some tests
can-anyscale Mar 28, 2023
cdebd35
Add unit tests for test definition parser
can-anyscale Mar 29, 2023
30ae2e1
Fix lints
can-anyscale Mar 29, 2023
32f69f8
@aslonnie's comments
can-anyscale Mar 29, 2023
c34cf96
Check that parse_test_definition throws exception on empty variations
can-anyscale Mar 29, 2023
b49a078
Remove the constant test definition in test.
can-anyscale Mar 29, 2023
db0be11
[CI] Logic to create test variations in release test configs (#33920)
can-anyscale Mar 30, 2023
c5c3c66
The cluster environment name does not allow the character '.', so fix…
can-anyscale Mar 30, 2023
5fd05d9
Merge branch 'can-gce-01-test-variations' into gce03
can-anyscale Mar 30, 2023
58e72c3
Remove import copy from test file
can-anyscale Mar 30, 2023
b123bd2
Address @krfricke's comments. I keep the __suffix__ as it is based on…
can-anyscale Mar 31, 2023
690dd26
Fix lint. Thanks @krfricke for catching!
can-anyscale Mar 31, 2023
2 changes: 1 addition & 1 deletion release/ray_release/cluster_manager/cluster_manager.py
@@ -62,7 +62,7 @@ def set_cluster_env(self, cluster_env: Dict[str, Any]):

         self.cluster_env_name = (
             f"{self.project_name}_{self.project_id[4:8]}"
-            f"__env__{self.test_name}__"
+            f"__env__{self.test_name.replace('.', '_')}__"
             f"{dict_hash(self.cluster_env)}"
         )

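
The one-line change above follows from the new naming scheme: expanded test names contain a dot (for example `tune_cloud_no_sync_down.gce`), and, per the commit message in this PR, the cluster environment name does not allow the character '.'. A minimal sketch of the same sanitization is shown here. The helper `build_cluster_env_name` and the `short_hash` stand-in for `dict_hash` are hypothetical names made up for illustration, and the project values are invented.

```python
import hashlib
import json
from typing import Any, Dict


def short_hash(data: Dict[str, Any]) -> str:
    # Stand-in for ray_release's dict_hash (assumption: any stable digest
    # of the cluster env is good enough for this illustration).
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def build_cluster_env_name(
    project_name: str,
    project_id: str,
    test_name: str,
    cluster_env: Dict[str, Any],
) -> str:
    # Hypothetical helper mirroring the f-string in set_cluster_env.
    # Dots in the test name are replaced so the result contains no '.'.
    safe_test_name = test_name.replace(".", "_")
    return (
        f"{project_name}_{project_id[4:8]}"
        f"__env__{safe_test_name}__"
        f"{short_hash(cluster_env)}"
    )


name = build_cluster_env_name(
    "release",                      # invented project name
    "prj_abcd1234",                 # invented project id
    "tune_cloud_no_sync_down.gce",  # expanded test name with a dot
    {"base_image": "example"},      # invented cluster env
)
print(name)
```

Only the test name is sanitized in this sketch, matching the diff; a project name containing a dot would still pass through unchanged.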
54 changes: 50 additions & 4 deletions release/ray_release/config.py
@@ -1,3 +1,4 @@
+import copy
 import json
 import os
 import re
@@ -12,6 +13,16 @@


 class Test(dict):
+    """A class that represents a test to run on Buildkite."""
     pass
+
+
+class TestDefinition(dict):
+    """
+    A class that represents the definition of a test, such as its name, group, etc.
+    Compared to the Test class, it has additional fields, for example variations,
+    which can be used to define several variations of a test.
+    """
+    pass


@@ -47,10 +58,45 @@ def read_and_validate_release_test_collection(
 ) -> List[Test]:
     """Read and validate test collection from config file"""
     with open(config_file, "rt") as fp:
-        test_config = yaml.safe_load(fp)
-
-    validate_release_test_collection(test_config, schema_file=schema_file)
-    return test_config
+        tests = parse_test_definition(yaml.safe_load(fp))
+
+    validate_release_test_collection(tests, schema_file=schema_file)
+    return tests
+
+def _test_definition_invariant(
+    test_definition: TestDefinition,
+    invariant: bool,
+    message: str,
+) -> None:
+    if invariant:
+        return
+    raise ReleaseTestConfigError(
+        f'{test_definition["name"]} has invalid definition: {message}',
+    )
+
+def parse_test_definition(test_definitions: List[TestDefinition]) -> List[Test]:
+    tests = []
+    for test_definition in test_definitions:
+        if "variations" not in test_definition:
+            tests.append(test_definition)
+            continue
+        variations = test_definition.pop("variations")
+        _test_definition_invariant(
+            test_definition,
+            variations,
+            'variations field cannot be empty in a test definition',
+        )
+        for variation in variations:
+            _test_definition_invariant(
+                test_definition,
+                '__suffix__' in variation,
+                'missing __suffix__ field in a variation'
+            )
+            test = copy.deepcopy(test_definition)
+            test["name"] = f'{test["name"]}.{variation.pop("__suffix__")}'
+            test.update(variation)
+            tests.append(test)
+    return tests


 def load_schema_file(path: Optional[str] = None) -> Dict:
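
To make the expansion rule in `parse_test_definition` concrete, here is a minimal standalone sketch of the per-definition loop. It is an illustration under stated assumptions, not the code from this PR: `expand_variations` is a hypothetical name, `ValueError` stands in for `ReleaseTestConfigError`, and the sketch deep-copies its input first so the caller's dict is left untouched (the PR code pops `variations` and `__suffix__` from the input in place).

```python
import copy
from typing import Any, Dict, List


def expand_variations(definition: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Expand one test definition into one test per variation (sketch)."""
    # Assumption: work on a copy so the caller's definition is not mutated.
    definition = copy.deepcopy(definition)
    if "variations" not in definition:
        return [definition]
    variations = definition.pop("variations")
    if not variations:
        # ValueError is a stand-in for ReleaseTestConfigError.
        raise ValueError(f'{definition["name"]}: variations field cannot be empty')
    tests = []
    for variation in variations:
        if "__suffix__" not in variation:
            raise ValueError(f'{definition["name"]}: missing __suffix__ field')
        test = copy.deepcopy(definition)
        test["name"] = f'{test["name"]}.{variation.pop("__suffix__")}'
        # dict.update is a shallow merge: a nested dict in the variation
        # replaces the whole nested dict inherited from the base definition.
        test.update(variation)
        tests.append(test)
    return tests


definition = {
    "name": "sample_test",
    "cluster": {"cluster_compute": "compute.yaml", "autosuspend_mins": 60},
    "variations": [
        {"__suffix__": "aws"},
        {"__suffix__": "gce", "cluster": {"cluster_compute": "compute_gce.yaml"}},
    ],
}
aws_test, gce_test = expand_variations(definition)
print(aws_test["name"], aws_test["cluster"])
print(gce_test["name"], gce_test["cluster"])
```

Because the merge is shallow, a variation that overrides `cluster` or `run` has to restate every key it still needs; in this sketch the `gce` test no longer carries `autosuspend_mins`.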
52 changes: 52 additions & 0 deletions release/ray_release/tests/test_config.py
@@ -1,14 +1,18 @@
import os
import sys
import yaml
import pytest

from ray_release.config import (
    read_and_validate_release_test_collection,
    Test,
    TestDefinition,
    validate_cluster_compute,
    load_schema_file,
    parse_test_definition,
    validate_test,
)
from ray_release.exception import ReleaseTestConfigError

TEST_COLLECTION_FILE = os.path.join(
    os.path.dirname(__file__), "..", "..", "release_tests.yaml"
@@ -39,6 +43,54 @@
    }
)

"""
Unit test for the ray_release.config.parse_test_definition function. In particular,
we check that the code correctly parses a test definition that has the 'variations'
field.
"""
def test_parse_test_definition():
    test_definitions = yaml.safe_load('''
        - name: sample_test
          working_dir: sample_dir
          frequency: nightly
          team: sample
          cluster:
            cluster_env: env.yaml
            cluster_compute: compute.yaml
          run:
            timeout: 100
            script: python script.py
          variations:
            - __suffix__: aws
            - __suffix__: gce
              cluster:
                cluster_env: env_gce.yaml
                cluster_compute: compute_gce.yaml
    ''')
    # Check that parsing returns two tests, one for each variation (aws and gce). Check
    # that both tests are valid, and their fields are populated correctly
    tests = parse_test_definition(test_definitions)
    aws_test = tests[0]
    gce_test = tests[1]
    schema = load_schema_file()
    assert not validate_test(aws_test, schema)
    assert not validate_test(gce_test, schema)
    assert aws_test["name"] == "sample_test.aws"
    assert gce_test["cluster"]["cluster_compute"] == "compute_gce.yaml"
    invalid_test_definition = test_definitions[0]
    # Intentionally make the test definition invalid by creating an empty 'variations'
    # field. Check that the parser raises an exception at runtime
    invalid_test_definition['variations'] = []
    with pytest.raises(ReleaseTestConfigError):
        parse_test_definition([invalid_test_definition])
    # Intentionally make the test definition invalid by leaving the __suffix__ field
    # out of one 'variations' entry. Check that the parser raises an exception at
    # runtime
    invalid_test_definition['variations'] = [
        {'__suffix__': 'aws'},
        {}
    ]
    with pytest.raises(ReleaseTestConfigError):
        parse_test_definition([invalid_test_definition])

def test_schema_validation():
    test = VALID_TEST.copy()
109 changes: 29 additions & 80 deletions release/release_tests.yaml
@@ -1238,12 +1238,9 @@
#######################
# Tune cloud tests
#######################
- name: tune_cloud_aws_no_sync_down
- name: tune_cloud_no_sync_down
  group: Tune cloud tests
  working_dir: tune_tests/cloud_tests

  stable: true

  frequency: nightly
  team: ml

@@ -1258,38 +1255,19 @@
    wait_for_nodes:
      num_nodes: 4

  variations:
    - __suffix__: aws
    - __suffix__: gce
Comment on lines +1259 to +1260

Contributor:
Can we just use suffix instead? I think it should be fine because we disallow this in the base definitions anyway, so we don't need to care about scope here.

Collaborator Author:
I'm happy with either one, @aslonnie?

Collaborator:
Maybe _suffix if you want it to be shorter? I would like to have something that indicates that this is a special field, one that will be eaten during the process. It is nice that it has a check now; I still think it is better for it to look a bit different from a normal field, for readability purposes.

      env: gce
      cluster:
        cluster_env: app_config.yaml
        cluster_compute: tpl_gce_4x8.yaml

  alert: tune_tests

- name: tune_cloud_gce_no_sync_down
  group: Tune cloud tests
  working_dir: tune_tests/cloud_tests

  stable: true

  frequency: nightly
  team: ml
  env: gce

  cluster:
    cluster_env: app_config.yaml
    cluster_compute: tpl_gce_4x8.yaml
    autosuspend_mins: 60

  run:
    timeout: 600
    script: python workloads/run_cloud_test.py no_sync_down --cpus-per-trial 8
    wait_for_nodes:
      num_nodes: 4

  alert: tune_tests

- name: tune_cloud_aws_ssh_sync
- name: tune_cloud_ssh_sync
  group: Tune cloud tests
  working_dir: tune_tests/cloud_tests

  stable: true

  frequency: nightly
  team: ml

@@ -1300,42 +1278,22 @@
  run:
    timeout: 600
    script: python workloads/run_cloud_test.py ssh_sync

    wait_for_nodes:
      num_nodes: 4

- name: tune_cloud_gce_ssh_sync
  group: Tune cloud tests
  working_dir: tune_tests/cloud_tests

  stable: true
  frequency: nightly
  team: ml
  env: gce

  cluster:
    cluster_env: app_config.yaml
    cluster_compute: tpl_gce_4x8.yaml
    autosuspend_mins: 60

  run:
    timeout: 600
    script: python workloads/run_cloud_test.py ssh_sync --cpus-per-trial 8

    wait_for_nodes:
      num_nodes: 4

  alert: tune_tests

  variations:
    - __suffix__: aws
    - __suffix__: gce
      env: gce
      cluster:
        cluster_env: app_config.yaml
        cluster_compute: tpl_gce_4x8.yaml

  alert: tune_tests

- name: tune_cloud_aws_durable_upload
- name: tune_cloud_durable_upload
  group: Tune cloud tests
  working_dir: tune_tests/cloud_tests

  stable: true

  frequency: nightly
  team: ml

@@ -1350,27 +1308,18 @@
    wait_for_nodes:
      num_nodes: 4


  alert: tune_tests

- name: tune_cloud_gce_durable_upload
  group: Tune cloud tests
  working_dir: tune_tests/cloud_tests
  stable: true
  frequency: nightly
  team: ml
  env: gce

  cluster:
    cluster_env: app_config.yaml
    cluster_compute: tpl_gce_4x8.yaml
    autosuspend_mins: 60

  run:
    timeout: 600
    script: python workloads/run_cloud_test.py durable_upload --cpus-per-trial 8 --bucket gs://tune-cloud-tests/durable_upload
    wait_for_nodes:
      num_nodes: 4
  variations:
    - __suffix__: aws
    - __suffix__: gce
      env: gce
      cluster:
        cluster_env: app_config.yaml
        cluster_compute: tpl_gce_4x8.yaml
      run:
        timeout: 600
        script: python workloads/run_cloud_test.py durable_upload --bucket gs://tune-cloud-tests/durable_upload
        wait_for_nodes:
          num_nodes: 4

  alert: tune_tests

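
The three consolidated definitions in this file each carry an `aws` and a `gce` variation, so under the `<name>.<__suffix__>` naming rule from `parse_test_definition` they should expand into six tests. A small sketch of the resulting names, assuming nothing beyond that naming rule (the variable names are made up for illustration):

```python
# Base names and suffixes taken from the release_tests.yaml diff in this PR.
base_names = [
    "tune_cloud_no_sync_down",
    "tune_cloud_ssh_sync",
    "tune_cloud_durable_upload",
]
suffixes = ["aws", "gce"]

# Each definition yields one test per variation, named "<name>.<suffix>".
expanded_names = [f"{name}.{suffix}" for name in base_names for suffix in suffixes]
for test_name in expanded_names:
    print(test_name)
```

Note that these names differ from the old ones (for example `tune_cloud_aws_no_sync_down`), so anything keyed on the previous test names would need to account for the rename.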
4 changes: 2 additions & 2 deletions release/tune_tests/cloud_tests/tpl_gce_4x8.yaml
@@ -7,11 +7,11 @@ max_workers: 3

 head_node_type:
     name: head_node
-    instance_type: n2-standard-8
+    instance_type: n2-standard-2

 worker_node_types:
     - name: worker_node
-      instance_type: n2-standard-8
+      instance_type: n2-standard-2
       min_workers: 3
       max_workers: 3
       use_spot: false