Test CI changes #15201
Conversation
I think we should add units here. Could we get a better sense of what we are measuring? What are the steps in the conda process, and how long does each take? For example, if we are creating a lot of environments for different tests, that will take longer. We could create one environment once and do all the testing there. This comes with its own tradeoffs in terms of how contained/minimal the testing environment is, but it is worth considering. All this to say, I think we need more clarity on what we are measuring and what question we are trying to answer before proposing a solution.
Could you clarify what you mean by units? Is that just referring to what you ask below about breaking the runtimes up, or something else?
Sure that's a good question. Let's look at this job from the other PR:
Unfortunately there isn't equivalent data in the current wheel jobs for comparison. However, it's immediately evident that it's not an apples-to-apples comparison:
If we assume that combining the installation of cudf/libcudf into the initial solve drops that time to 0, then we have the runs taking about 18 minutes. That brings us near parity with the wheel jobs, which suggests that it doesn't matter too much which ones go first for tests (conda-build overhead is much clearer in build jobs since there wheel vs conda builds do essentially the same amount of work otherwise). However if even that 1-2 minute advantage for wheels is consistent, it gives us a reason to choose those going first (assuming that we're OK with the sequencing).
We only create one environment, but the one improvement we could make is the one we've already done in cuml where we install the built artifacts during the initial solve instead of after the fact.
Big picture what we want to solve is to reduce the amount of CI blockage. This PR tries to address that in two ways:
Weighing in on each proposed change:
This is fine. It was probably an oversight originally.
I want to strongly optimize for the straight-path (no queue) time. Introducing new gating between jobs will serialize tasks that we should execute in parallel if resources allow. The most important reason for this is that fixing CI-blocking issues will be dramatically slower if we introduce more gates between jobs.
This reduced matrix must cover one job with CUDA 11 and one job with CUDA 12. Conda packaging across CUDA 11 and CUDA 12 is not at all the same, and the coverage should alert us of general conda packaging issues. Proposal here: https://github.com/rapidsai/shared-workflows/pull/184/files#r1509255538
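The coverage requirement above can be sketched as a small check. This is only an illustration: the matrix entries, field names, and helper functions are hypothetical and are not taken from the shared-workflows configuration.

```python
from typing import Dict, List

# Hypothetical full test matrix; the entries are illustrative only and do not
# reflect the real shared-workflows configuration.
FULL_MATRIX: List[Dict[str, str]] = [
    {"arch": "amd64", "py": "3.9", "cuda": "11.8.0"},
    {"arch": "amd64", "py": "3.10", "cuda": "12.0.1"},
    {"arch": "arm64", "py": "3.10", "cuda": "11.8.0"},
    {"arch": "amd64", "py": "3.11", "cuda": "12.2.2"},
]


def cuda_major(entry: Dict[str, str]) -> str:
    """Return the CUDA major version of a matrix entry, e.g. '11'."""
    return entry["cuda"].split(".")[0]


def reduce_matrix(matrix: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first entry seen for each CUDA major version."""
    chosen: Dict[str, Dict[str, str]] = {}
    for entry in matrix:
        chosen.setdefault(cuda_major(entry), entry)
    return list(chosen.values())


def covers_required_majors(matrix, required=("11", "12")) -> bool:
    """True if every required CUDA major version appears in the matrix."""
    return set(required) <= {cuda_major(entry) for entry in matrix}


reduced = reduce_matrix(FULL_MATRIX)
```

A check like `covers_required_majors(reduced)` could guard a reduced matrix against silently dropping one of the two CUDA major versions.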
As above, optimize for the straight-line path. Do not serialize jobs that can run in parallel.
Generally, I think rapidsai/shared-workflows#184 will be enough to have a huge impact on our CI turnaround time (and reduce queueing) without needing to introduce more gates between jobs. I am a fan of rapidsai/shared-workflows#184 (with the modification I proposed to cover CUDA 11/12) and would support moving forward on that first, before we try to reorder or gate jobs.
A potential improvement to reduce GPU time that I would support is failing fast among sibling jobs. I think a lot of our test failures are correlated, i.e. a failure on one conda test job is likely to come with a failure on another conda test job, too. If a job fails fast because a "sibling" job had a flaky network connection (or some other unexpected issue), it's generally no harder (and doesn't consume much more GPU time) to rerun the whole set of failed jobs.
Regarding the single-step conda solves for cuml, I don't think it saves a great deal of time. It's just a boost for correctness (knowing you'll get compatible packages and not force the solver into a downgrade that it can't resolve). I looked at some recent CI logs. The time spent on the second step (installing libcuml/cuml) was typically quite small. The majority of the initial conda setup time is spent in downloading packages, it seems. It has to pull down multiple GB of data, which can take ~5 minutes compared to the conda solve times which are each less than 30 seconds. It's hard to do apples-to-apples comparisons because the test startup time is dominated by network activity.
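As a rough worked example of the numbers in this comment, the sketch below bounds how much of the setup time the solves can account for. It assumes, for illustration only, that setup consists of just the package download plus two solves; the variable names are hypothetical.

```python
# Figures quoted in the comment above; everything else is an assumption.
download_seconds = 5 * 60   # "~5 minutes" spent downloading packages
solve_seconds_max = 30      # each conda solve takes "less than 30 seconds"
num_solves = 2              # assumed: the initial solve plus the second install step

# Upper bound on total solve time, and on its share of the whole setup.
solve_total_max = num_solves * solve_seconds_max
setup_total = download_seconds + solve_total_max
solve_share_max = solve_total_max / setup_total

# Merging the two solves into one saves at most a single solve.
single_solve_saving_max = solve_seconds_max / setup_total

print(f"solves: at most {solve_share_max:.0%} of setup")
print(f"merging solves saves at most {single_solve_saving_max:.0%}")
```

Under these assumptions the solves are a small fraction of setup, which is consistent with the point that a single-step solve is mainly a correctness improvement.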
pip jobs will always be cheaper than conda jobs because wheels don't have to set up an entire set of libraries the way that conda does. We could improve this by caching (I plan to test that out next) but even then the conda cache will be much larger than the pip cache so the same applies, if reduced. As a result I don't know if apples-to-apples comparison is really what we should be optimizing for because we can safely indicate that the nature of how conda works always necessitates some unavoidable setup overhead.
Optimizing exclusively for the no queue time doesn't make much sense because in practice we never have no queue. Probably twice a week we end up in a situation where someone posts a Slack query asking why their job hasn't started for X hours and it's because the overall queue is large (and if it's being asked about on Slack by a couple of people it's certainly happening far more frequently when nobody comments). This frequently happens with CI-blocking issues too, and those also provide a great counterpoint because the backlog right after a CI-blocking issue is resolved is massive and therefore we don't get much parallelism in practice when everyone reruns all their PRs. Having workflows fail faster on early jobs would significantly improve that. There's a balance to be struck here though. I think a good compromise would be to maintain parallelism across all build (read: CPU-only node) jobs while still serializing more of the jobs that run on GPU nodes. WDYT?
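The trade-off being debated here can be made concrete with a toy model. This is a sketch under loud assumptions: each stage is treated as a single GPU job, the durations and failure probabilities are invented for illustration, and queue time is ignored entirely.

```python
from typing import List, Tuple

# Each stage is (gpu_minutes, probability that the stage fails once it runs).
Stage = Tuple[float, float]


def expected_gpu_minutes(stages: List[Stage], gated: bool) -> float:
    """Expected GPU minutes consumed by one workflow run.

    Parallel: every stage always runs, so the cost is the plain sum.
    Gated: a stage runs only if all earlier stages passed.
    """
    total = 0.0
    p_reached = 1.0  # probability that execution reaches the current stage
    for minutes, p_fail in stages:
        total += minutes * (p_reached if gated else 1.0)
        p_reached *= 1.0 - p_fail
    return total


def success_wall_clock_minutes(stages: List[Stage], gated: bool) -> float:
    """Wall-clock time of a fully passing run with no queueing."""
    durations = [minutes for minutes, _ in stages]
    return sum(durations) if gated else max(durations)


# Hypothetical numbers: two 20-minute GPU stages; the first fails 30% of the time.
stages = [(20.0, 0.3), (20.0, 0.1)]
parallel_cost = expected_gpu_minutes(stages, gated=False)
gated_cost = expected_gpu_minutes(stages, gated=True)
```

With these made-up inputs, gating lowers the expected GPU cost while doubling the no-queue time of a passing run, which is the tension between the two positions in this thread.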
Totally fine merging the shared workflows changes first and then coming back to this PR after a few weeks depending on what we observe.
I agree. The issue is that our failures are also frequently correlated across jobs within the same workflow: doc builds, conda Python tests and wheel tests all tend to fail together, as do conda "other" python tests and dask-cudf wheel tests. I don't think there's a way to fail-fast between different jobs.
We can close this testing PR now that rapidsai/shared-workflows#184 is merged.
Description
This PR makes a number of changes in the hopes of alleviating the heavy load that we've been frequently observing in our CI queues of late. The changes are based on the following thoughts, and based on observations from PR #15194 (the most recent PR with passing CI) as the "other PR" below:
Based on those observations, and in conjunction with rapidsai/shared-workflows#184 (used in this PR), this PR makes the following changes:
The net result should be that while running the full CI workflow successfully for a single PR is now slower, CI failures are still observed almost as fast in almost all cases (exceptions include doc builds, python tests that only fail in conda, and build issues that only arise in conda-build or wheel builds) and overall CI resource usage by jobs that fail at any stage is substantially reduced.
I'm happy to discuss all of these changes, and I think implementing even a subset of them will make a meaningful impact on our CI turnaround time. Once we're happy with this and churn on other RAPIDS-wide CI changes reduces, we can port these changes to other repos as well.
Contributes to rapidsai/build-planning#5.