Test CI changes #15201

vyasr · 2024-03-01T00:25:09Z

Description

This PR makes a number of changes in the hopes of alleviating the heavy load that we've been frequently observing in our CI queues of late. The changes are based on the following thoughts, and based on observations from PR #15194 (the most recent PR with passing CI) as the "other PR" below:

Builds are cheaper than tests because they don't require GPU nodes, but they're still not completely free.
Every PR still needs to build the full matrix to produce usable artifacts (we could reconsider this but was a more drastic step than I felt was necessary in this PR).
Both wheels and devcontainers are much faster to build than conda packages due to the substantial overhead of conda-build. devcontainer builds are slower than wheel builds, but they provide more complete coverage (devcontainers build C++ tests as well as all Python packages).
- For PRs that don't change C++ code, devcontainer builds should effectively be fully cached and complete very fast (about 10 mins on the first run of this PR). For PRs that break builds, devcontainers will show that just as quickly as other builds. The main downside is in cases where builds are not cached but also succeed, in which case devcontainer builds can take significantly longer (~50 mins on the other PR) and will prevent other build jobs from starting.
Wheel tests are much faster than conda tests because of various conda overheads.
- In the other PR the fastest full wheel test run on x86 took ~15 while the fastest conda test run took ~30).
There is currently significant overlap between wheel and conda tests, but we rarely see failures in one without seeing them in the other unless they are due to something like a conda packaging/solver issue.
Doc changes are infrequent but doc builds are expensive because they require GPU nodes
cudf_kafka and custreamz are essentially static

Based on those observations, and in conjunction with rapidsai/shared-workflows#184 (used in this PR), this PR makes the following changes:

devcontainer builds now require style checks.
devcontainer builds now gate all other builds.
Wheel tests now gate conda Python tests.
Reduce CI runs shared-workflows#184 also reduces the number of conda tests in PRs (nightlies are unchanged).
The conda tests of dask_cudf, cudf_kafka, and custreamz remain independent of the conda tests of cudf, but they are now gated by the dask_cudf wheel tests.
doc builds now require tests to pass.

The net result should be that while running the full CI workflow successfully for a single PR is now slower, CI failures are still observed almost as fast in almost all cases (exceptions include doc builds, python tests that only fail in conda, and build issues that only arise in conda-build or wheel builds) and overall CI resource usage by jobs that fail at any stage is substantially reduced.

I'm happy to discuss all of these changes, and I think implementing even a subset of them will make a meaningful impact on our CI turnaround time. Once we're happy with this and churn on other RAPIDS-wide CI changes reduces, we can port these changes to other repos as well.

Contributes to rapidsai/build-planning#5.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

jakirkham · 2024-03-01T02:16:34Z

In the other PR the fastest full wheel test run on x86 took ~15 while the fastest conda test run took ~30).

Think we should add units here

Could we get a better sense of what we are measuring here? What are the steps in the Conda process and how long they take?

For example if we are creating a lot of environments for different tests, that will take longer. We could create one environment once and do all the testing there. This comes with its own tradeoffs in terms of how contained/minimal the testing environment is, but this is a consideration

Anyways all this to say think we need more clarity on what we are measuring and what question it we are trying to answer before proposing a solution

vyasr · 2024-03-01T02:51:34Z

Think we should add units here

Could you clarify what you mean by units? Is that just referring to what you ask below about breaking the runtimes up, or something else?

Could we get a better sense of what we are measuring here? What are the steps in the Conda process and how long they take?

Sure that's a good question. Let's look at this job from the other PR:

23:41:17 - Make the environment
23:43:20 - Download artifacts
23:45:26 - Some trivial steps, then installing the downloaded artifacts
23:46:52 - Start running tests
00:00:50 - Start running benchmarks
00:06:22 - Start running the benchmarks with pandas
00:11:10 - End

Unfortunately there isn't equivalent data in the current wheel jobs for comparison. However, it's immediately evident that it's not an apples-to-apples comparison:

The wheel jobs aren't running either of the benchmark steps, which add about 11 mins here.
The actual test run takes about 14 minutes.
I would have liked to use a cuml PR for comparison after Bradley merged Install test dependencies at the same time as cuml packages. cuml#5781 since that would presumably reduce or even remove the extra 1.5 minutes used to install the downloaded artifacts, but unfortunately with cuml the wheel and conda tests use different levels of parallelization (something else we can probably improve there), again making the comparison unfair.

If we assume that combining the installation of cudf/libcudf into the initial solve drops that time to 0, then we have the runs taking about 18 minutes. That brings us near parity with the wheel jobs, which suggests that it doesn't matter too much which ones go first for tests (conda-build overhead is much clearer in build jobs since there wheel vs conda builds do essentially the same amount of work otherwise). However if even that 1-2 minute advantage for wheels is consistent, it gives us a reason to choose those going first (assuming that we're OK with the sequencing).

For example if we are creating a lot of environments for different tests, that will take longer. We could create one environment once and do all the testing there. This comes with its own tradeoffs in terms of how contained/minimal the testing environment is, but this is a consideration

We only create one environment, but the one improvement we could make is the one we've already done in cuml where we install the built artifacts during the initial solve instead of after the fact.

Anyways all this to say think we need more clarity on what we are measuring and what question it we are trying to answer before proposing a solution

Big picture what we want to solve is to reduce the amount of CI blockage. This PR tries to address that in two ways:

Reducing the number of redundant jobs (that's in rapidsai/shared_workflows#184). Whether we do that by reducing conda or wheel jobs should be determined by which is faster. The numbers above do suggest
Serializing more jobs to reduce the number of concurrent failures.

bdice · 2024-03-01T16:51:42Z

Weighing in on each proposed change:

✅ devcontainer builds now require style checks.

This is fine. It was probably an oversight originally.

❌ devcontainer builds now gate all other builds.
❌ Wheel tests now gate conda Python tests.

I want to strongly optimize for the straight-path (no queue) time. Introducing new gating between jobs will serialize tasks that we should execute in parallel if resources allow. The most important reason for this is that fixing CI-blocking issues will be dramatically slower if we introduce more gates between jobs.

⚠️ Reduce CI runs shared-workflows#184 also reduces the number of conda tests in PRs (nightlies are unchanged).

This reduced matrix must cover one job with CUDA 11 and one job with CUDA 12. Conda packaging across CUDA 11 and CUDA 12 is not at all the same, and the coverage should alert us of general conda packaging issues.

Proposal here: https://github.com/rapidsai/shared-workflows/pull/184/files#r1509255538

❌ The conda tests of dask_cudf, cudf_kafka, and custreamz remain independent of the conda tests of cudf, but they are now gated by the dask_cudf wheel tests.
❌ doc builds now require tests to pass.

As above, optimize for the straight-line path. Do not serialize jobs that can run in parallel.

bdice · 2024-03-01T16:54:13Z

Generally, I think rapidsai/shared-workflows#184 will be enough to have a huge impact on our CI turnaround time (and reduce queueing) without needing to introduce more gates between jobs. I am a fan of rapidsai/shared-workflows#184 (with the modification I proposed to cover CUDA 11/12) and would support moving forward on that first, before we try to reorder or gate jobs.

bdice · 2024-03-01T16:58:47Z

A potential improvement to reduce GPU time that I would support is fail-fast: true in more of our test jobs. Currently we only do that for wheels. (More accurately, it defaults to true and we just don't define fail-fast: false for our wheel test jobs.)

I think a lot of our test failures are correlated, i.e. a failure on one conda test job is likely to have a failure on another conda test job, too.

If a job fails-fast due to a "sibling" job having a flaky network connection (or whatever unexpected issues), it's generally no harder (and doesn't consume much more GPU time) to rerun the whole set of failed jobs.

bdice · 2024-03-01T17:07:16Z

Regarding the single-step conda solves for cuml, I don't think it saves a great deal of time. It's just a boost for correctness (knowing you'll get compatible packages and not force the solver into a downgrade that it can't resolve).

I looked at some recent CI logs. The time spent on the second step (installing libcuml/cuml) was typically quite small. The majority of the initial conda setup time is spent in downloading packages, it seems. It has to pull down multiple GB of data, which can take ~5 minutes compared to the conda solve times which are each less than 30 seconds. It's hard to do apples-to-apples comparisons because the test startup time is dominated by network activity.

vyasr · 2024-03-01T17:25:12Z

pip jobs will always be cheaper than conda jobs because wheels don't have to set up an entire set of libraries the way that conda does. We could improve this by caching (I plan to test that out next) but even then the conda cache will be much larger than the pip cache so the same applies, if reduced. As a result I don't know if apples-to-apples comparison is really what we should be optimizing for because we can safely indicate that the nature of how conda works always necessitates some unavoidable setup overhead.

I want to strongly optimize for the straight-path (no queue) time. Introducing new gating between jobs will serialize tasks that we > should execute in parallel if resources allow. The most important reason for this is that fixing CI-blocking issues will be dramatically slower if we introduce more gates between jobs.

Optimizing exclusively for the no queue time doesn't make much sense because in practice we never have no queue. Probably twice a week we end up in a situation where someone posts a Slack query asking why their job hasn't started for X hours and it's because the overall queue is large (and if it's being asked about on Slack by a couple of people it's certainly happening far more frequently when nobody comments). This frequently happens with CI-blocking issues too, and those also provide a great counterpoint because the backlog right after a CI-blocking issue is resolved is massive and therefore we don't get much parallelism in practice when everyone reruns all their PRs. Having workflows fail faster on early jobs would significantly improve that.

There's a balance to be struck here though. I think a good compromise would be to maintain parallelism across all build (read: CPU-only node) jobs while still serializing more of the jobs that run on GPU nodes. WDYT?

would support moving forward on that first, before we try to reorder or gate jobs.

Totally fine merging the shared workflows changes first and then coming back to this PR after a few weeks depending on what we observe.

I think a lot of our test failures are correlated, i.e. a failure on one conda test job is likely to have a failure on another conda test job, too.

I agree. The issue is that our failures are also frequently correlated across jobs within the same workflow: doc builds, conda Python tests and wheel tests all tend to fail together, as do conda "other" python tests and dask-cudf wheel tests. I don't think there's a way to fail-fast between different jobs.

bdice · 2024-03-11T15:00:19Z

We can closing this testing PR now that rapidsai/shared-workflows#184 is merged.

vyasr added 3 commits March 1, 2024 00:18

Serialize more runs to reduce overlapping failures

d07e604

Point at the new branch

1f3dc55

Also wait for devcontainer to run wheel builds

cb9e57a

vyasr added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Mar 1, 2024

vyasr self-assigned this Mar 1, 2024

vyasr added 2 commits March 1, 2024 19:07

test

0fafbe2

test

23d05ff

bdice mentioned this pull request Mar 1, 2024

Explore ways to maximize coverage while minimizing cost of the CI test matrix rapidsai/build-planning#5

Closed

Parallelize again

b0391fc

bdice mentioned this pull request Mar 1, 2024

Consider consolidating conda solves in CI tests rapidsai/build-planning#22

Open

18 tasks

bdice closed this Mar 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Test CI changes #15201

Test CI changes #15201

vyasr commented Mar 1, 2024 •

edited

Loading

jakirkham commented Mar 1, 2024

vyasr commented Mar 1, 2024 •

edited

Loading

bdice commented Mar 1, 2024

bdice commented Mar 1, 2024

bdice commented Mar 1, 2024 •

edited

Loading

bdice commented Mar 1, 2024 •

edited

Loading

vyasr commented Mar 1, 2024 •

edited

Loading

bdice commented Mar 11, 2024

Test CI changes #15201

Test CI changes #15201

Conversation

vyasr commented Mar 1, 2024 • edited Loading

Description

Checklist

jakirkham commented Mar 1, 2024

vyasr commented Mar 1, 2024 • edited Loading

bdice commented Mar 1, 2024

bdice commented Mar 1, 2024

bdice commented Mar 1, 2024 • edited Loading

bdice commented Mar 1, 2024 • edited Loading

vyasr commented Mar 1, 2024 • edited Loading

bdice commented Mar 11, 2024

vyasr commented Mar 1, 2024 •

edited

Loading

vyasr commented Mar 1, 2024 •

edited

Loading

bdice commented Mar 1, 2024 •

edited

Loading

bdice commented Mar 1, 2024 •

edited

Loading

vyasr commented Mar 1, 2024 •

edited

Loading