Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ci] Move to new hierarchical docker structure + pipeline #28641

Merged
merged 65 commits into from
Sep 22, 2022
Merged

Conversation

krfricke
Copy link
Contributor

Why are these changes needed?

This PR moves our buildkite pipeline to a new hierarchical structure and will be used with the new buildkite pipeline.

When merging this PR, the old behavior will still work, i.e. the old pipeline is still in place.

After merging this PR, we can build the base images for the master branch, and then switch the CI pipelines to use the new build structure.

Once this switch has been done, the following files will be removed:

  • ./buildkite/pipeline.yml - this has been split into pipeline.test.yml and pipeline.build.yml
  • ./buildkite/Dockerfile - this has been moved (and split) to ./ci/docker/
  • ./buildkite/Dockerfile.gpu - this has been moved (and split) to ./ci/docker/

The new structure is as follow:

  • ./ci/docker contains hierarchical docker files that will be built by the pipeline.
  • Dockerfile.base_test contains common dependencies
  • Dockerfile.base_build inherits from it and adds build-specific dependencies, e.g. llvm, nvm, java
  • Dockerfile.base_ml inherits from base_test and adds ML dependencies, e.g. torch, tensorflow
  • Dockerfile.base_gpu depends on a cuda image and otherwise has the same contents as base_test and base_ml combined

In each build, we do the following

  • Dockerfile.build is built on top of Dockerfile.base_build. Dependencies are re-installed, which is mostly a no-op (except if they changed from when the base image was built)
  • Dockerfile.test is built on top of Dockerfile.base_test, and the extracted Ray installation fromDockerfile.build is injected
  • The same is true respectively for ml and gpu.

The pipelines have been split, and a new attribute NO_WHEELS_REQUIRED is added, identifying tests that can be early-started. Early start means that the last available branch image is used and the current code revision is checked out upon it.

See https://github.com/ray-project/buildkite-ci-pipelines/ for the pipeline logic.

Additionally, this PR identified two CI regressions that haven't been caught previously, namely the minimal install tests that didn't properly install the respective Python versions, and some runtime environment tests that don't work with later Ray versions. These should be addressed separately and I'll create issues for them once this PR is merged.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Kai Fricke and others added 30 commits September 15, 2022 09:45
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
…tests (#28535)

* [core/ci] Disallow protobuf 3.19.5 (#28504)

This leads to hangs in Ray client (e.g. test_dataclient_disconnect)

Signed-off-by: Kai Fricke <[email protected]>

* [tune] Fix trial checkpoint syncing after recovery from other node (#28470)

On restore from a different IP, the SyncerCallback currently still tries to sync from a stale node IP, because `trial.last_result` has not been updated, yet. Instead, the syncer callback should keep its own map of trials to IPs, and only act on this.

Signed-off-by: Kai Fricke <[email protected]>

* [air] minor example fix. (#28379)

Signed-off-by: xwjiang2010 <[email protected]>

* [cleanup] Remove memory unit conversion (#28396)

The internal memory unit was switched back to bytes years ago, there's no point in keeping confusing conversion code around anymore.

Recommendation: Review #28394 first, since this is stacked on top of it.

Co-authored-by: Alex <[email protected]>

* [RLlib] Sync policy specs from local_worker_for_synching while recovering rollout/eval workers. (#28422)

* Cast rewards as tf.float32 to fix error in DQN in tf2 (#28384)

* Cast rewards as tf.float32 to fix error in DQN in tf2

Signed-off-by: mgerstgrasser <[email protected]>

* Add test case for DQN with integer rewards

Signed-off-by: mgerstgrasser <[email protected]>

Signed-off-by: mgerstgrasser <[email protected]>

* [doc] [Datasets] Improve docstring and doctest for read_parquet (#28488)

This addresses some of the issues brought up in #28484

* [ci] Increase timeout on test_metrics (#28508)

10 milliseconds is ambitious for the CI to do anything.

Co-authored-by: Alex <[email protected]>

* [air/tune] Catch empty hyperopt search space, raise better Tuner error message (#28503)

* Add imports to object-spilling.rst Python code (#28507)

* Add imports to object-spilling.rst Python code

Also adjust a couple descriptions, retaining the same general information

Signed-off-by: Jake <[email protected]>

* fix doc build / keep note formatting

Signed-off-by: Philipp Moritz <[email protected]>

* another tiny fix

Signed-off-by: Philipp Moritz <[email protected]>

Signed-off-by: Jake <[email protected]>
Signed-off-by: Philipp Moritz <[email protected]>
Co-authored-by: Philipp Moritz <[email protected]>

* [AIR] Make PathPartitionScheme a dataclass (#28390)

Signed-off-by: Balaji Veeramani <[email protected]>

* [Telemetry][Kuberentes] Distinguish Kubernetes deployment stacks (#28490)

Right now, Ray telemetry indicates the majority of Ray's CPU hour usage comes from Ray running within a Kubernetes cluster. However, we have no data on what method is used to deploy Ray on Kubernetes.

This PR enables Ray telemetry to distinguish between three methods of deploying Ray on Kubernetes:

KubeRay >= 0.4.0
Legacy Ray Operator with Ray >= 2.1.0
All other methods
The strategy is to have the operators inject an env variable into the Ray container's environment.
The variable identifies the deployment method.

This PR also modifies the legacy Ray operator to inject the relevant env variable.
A follow-up KubeRay PR changes the KubeRay operator to do the same thing: ray-project/kuberay#562

Signed-off-by: Dmitri Gekhtman <[email protected]>

* [autoscaler][observability] Experimental verbose mode (#28392)

This PR introduces a super secret hidden verbose mode for ray status, which we can keep hidden while collecting feedback before going through the process of officially declaring it part of the public API.

Example output

======== Autoscaler status: 2020-12-28 01:02:03 ========
GCS request time: 3.141500s
Node Provider non_terminated_nodes time: 1.618000s

Node status
--------------------------------------------------------
Healthy:
 2 p3.2xlarge
 20 m4.4xlarge
Pending:
 m4.4xlarge, 2 launching
 1.2.3.4: m4.4xlarge, waiting-for-ssh
 1.2.3.5: m4.4xlarge, waiting-for-ssh
Recent failures:
 p3.2xlarge: RayletUnexpectedlyDied (ip: 1.2.3.6)

Resources
--------------------------------------------------------
Total Usage:
 1/2 AcceleratorType:V100
 530.0/544.0 CPU
 2/2 GPU
 2.00/8.000 GiB memory
 3.14/16.000 GiB object_store_memory

Total Demands:
 {'CPU': 1}: 150+ pending tasks/actors
 {'CPU': 4} * 5 (PACK): 420+ pending placement groups
 {'CPU': 16}: 100+ from request_resources()

Node: 192.168.1.1
 Usage:
  0.1/1 AcceleratorType:V100
  5.0/20.0 CPU
  0.7/1 GPU
  1.00/4.000 GiB memory
  3.14/4.000 GiB object_store_memory

Node: 192.168.1.2
 Usage:
  0.9/1 AcceleratorType:V100
  15.0/20.0 CPU
  0.3/1 GPU
  1.00/12.000 GiB memory
  0.00/4.000 GiB object_store_memory

Co-authored-by: Alex <[email protected]>

* [doc/tune] fix tune stopper attribute name (#28517)

* [doc] Fix tune stopper doctests (#28531)

* [air] Use self-hosted mirror for CIFAR10 dataset (#28480)

The CIFAR10 website host has been unreliable in the past. This PR injects our own mirror into our CI packages for testing.

Signed-off-by: Kai Fricke <[email protected]>

* draft

Signed-off-by: Artur Niederfahrenhorst <[email protected]>

Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: xwjiang2010 <[email protected]>
Signed-off-by: mgerstgrasser <[email protected]>
Signed-off-by: Jake <[email protected]>
Signed-off-by: Philipp Moritz <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Dmitri Gekhtman <[email protected]>
Signed-off-by: Artur Niederfahrenhorst <[email protected]>
Co-authored-by: Kai Fricke <[email protected]>
Co-authored-by: xwjiang2010 <[email protected]>
Co-authored-by: Alex Wu <[email protected]>
Co-authored-by: Alex <[email protected]>
Co-authored-by: Jun Gong <[email protected]>
Co-authored-by: mgerstgrasser <[email protected]>
Co-authored-by: Philipp Moritz <[email protected]>
Co-authored-by: Jake <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
Co-authored-by: Dmitri Gekhtman <[email protected]>
Co-authored-by: Árpád Rózsás <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
var
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Artur Niederfahrenhorst <[email protected]>
Signed-off-by: Artur Niederfahrenhorst <[email protected]>
This reverts commit 881c76b.

Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Run
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Copy link
Contributor

@simon-mo simon-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great overall, three small nits:

  • We should still run the tests in 3.6 right? Did you end up making a special build for 3.6? If we still support 3.6, we should run all libraries test in it. (instead of bumping the base image python version)
  • Please add some readmes in several places (.buildkite, ci/, and the pipelines repo: https://github.com/ray-project/buildkite-ci-pipelines/tree/main/ray_ci)
  • Dockerfile.xyz naming seems to confuse GitHub and code editor, can we use xzy.Dockerfile for the new files.

Kai Fricke added 2 commits September 22, 2022 07:52
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
@krfricke
Copy link
Contributor Author

Thanks for the review!

  • I've renamed the dockerfiles
  • I've added READMEs in ci/docker and .buildkite

For the 3.6 tests, we've been running over 75% of the tests on 3.7 already. My suggestion here is to go ahead with this change and use #28373 to introduce 2 or 3 build jobs that ensure 3.6 compatibility for specific tests.

I'll update #28373 after this has been merged.

@krfricke krfricke merged commit ee2a8da into master Sep 22, 2022
@krfricke krfricke deleted the ci/docker branch September 22, 2022 10:58
krfricke added a commit that referenced this pull request Sep 27, 2022
After #28641 and switching the pipeline, we should delete the unneeded files to avoid confusion when adding new test jobs.

Signed-off-by: Kai Fricke <[email protected]>
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
)

After ray-project#28641 and switching the pipeline, we should delete the unneeded files to avoid confusion when adding new test jobs.

Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants