-
-
Notifications
You must be signed in to change notification settings - Fork 453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Limit max number of parallel ci runs for docker-based runs #36610
base: develop
Are you sure you want to change the base?
Conversation
@@ -101,6 +102,7 @@ jobs: | |||
tox_packages_factors: >- | |||
["standard"] | |||
docker_push_repository: ghcr.io/${{ github.repository }}/ | |||
max_parallel: 12 | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you picked up these numbers heuristically. But what is the reason that 12 for "standard" is less than 15 for "standard-pre"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The heuristic was that "standard-pre" is rather important as you want to be notified early that there are profound compilation issues on a certain system. "standard" is a bit less important (and results seem to be more uniform across the systems tested), also there are a few systems dropping out as their pre step failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the workflows do not fail when there are doctest failures.
It's crucial for "standard" to be completed as quickly as possible so that developers can inspect the results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also note that there is an overall walltime limit of 3 days (?) for the whole workflow to pass. So whatever job restrictions we put in, we need to make sure that this does not cause the later workflows to never run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is nothing crucial about these workflow runs. If they fail, then its just an ordinary bug that needs to be fixed. There is no need to have them finished as soon as possible. You can always remove some older/similar systems from the ci if things are then taking too long.
In my opinion it's not acceptable that other workflows are queued for 14+ hours because of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. 60 runners. Then increasing "max-parallel" to 50 is a good solution!
Also we may decrease "max-parallel" for "minimal", "maximal", "experimental" to allow rooms for other PR workflow runs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. When I added the "standard-sitepackages" run (with a limit of 10 and running parallel with "standard"), I didn't decrease the max-parallel of "standard". We can reduce it a bit, from 30 perhaps to 25. I wouldn't go much lower than that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about basing your PR on Tobias's?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've instead pushed a small change to #36616 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for checking; then the time to the release (~ 7 days) is the closer deadline
According to https://github.com/sagemath/sage/actions/runs/6598197147/usage, the overall run time is 28d, so an average number of 5 runners should be more than enough to make that deadline. With the proposed 10-15 runners, you get it in 3 to 4 days.
Anyway I like the idea. |
From the moment that a release has been pushed and the merged PRs are closed:
This is the situation until the "default-pre" and "default" jobs have completed and pushed the new Docker image. So what we should do is ensure that this new Docker image is available as soon as possible. #36430, by making "default-pre" and "default" jobs separate from "standard-pre" and "standard", already did part of this, but we can do better: Currently, because "default" is queued after the "standard-pre" jobs, it has to wait until a runner becomes available. So I would suggest the following:
Edit: That's now #36616 |
Let's close this one here in favor of #36616. |
#36616 adjusted "standard" reasonably. Decreasing max-parallel for "minimal" and others will make more rooms for other PRs. But as we have 60 runners, max-parallel 30 (current default) for them already gives enough room for other PRs. On the other hand, decreasing max-parallel for "minimal" and others will result in taking more time to finish "minimal" and others. So let's see how things work after #36616. If there needs more adjustment, then let's use this PR. Otherwise close this PR as Matthias suggested. |
I don't agree that this is enough reduction.
Sadly not true. Already during the first hours after a new release you have a long queue. Normally the conda workflow takes 2 hours, but as you see from https://github.com/sagemath/sage/actions/workflows/ci-conda.yml?page=4 and page 3 there are a lot of workflows that take 5 hours, so they are queued for 3h. Some are even in the queue for 12 hours, and that's usually the case when one of the "blocks" in the ci-linux workflow is finished so that the next block gets almost all runners. |
… decrease for `standard`, `minimal-pre` <!-- ^^^^^ Please provide a concise, informative and self-explanatory title. Don't put issue numbers in there, do this in the PR body below. For example, instead of "Fixes sagemath#1234" use "Introduce new method to calculate 1+1" --> <!-- Describe your changes here in detail --> <!-- Why is this change required? What problem does it solve? --> As proposed in sagemath#36610 (comment) <!-- If this PR resolves an open issue, please link to it here. For example "Fixes sagemath#12345". --> <!-- If your change requires a documentation PR, please link it appropriately. --> ### 📝 Checklist <!-- Put an `x` in all the boxes that apply. --> <!-- If your change requires a documentation PR, please link it appropriately --> <!-- If you're unsure about any of these, don't hesitate to ask. We're here to help! --> <!-- Feel free to remove irrelevant items. --> - [x] The title is concise, informative, and self-explanatory. - [ ] The description explains in detail what this PR is about. - [x] I have linked a relevant issue or discussion. - [ ] I have created tests covering the changes. - [ ] I have updated the documentation accordingly. ### ⌛ Dependencies <!-- List all open PRs that this PR logically depends on - sagemath#12345: short description why this is a dependency - sagemath#34567: ... --> <!-- If you're unsure about any of these, don't hesitate to ask. We're here to help! --> URL: sagemath#36616 Reported by: Matthias Köppe Reviewer(s): Kwankyu Lee
… decrease for `standard`, `minimal-pre` <!-- ^^^^^ Please provide a concise, informative and self-explanatory title. Don't put issue numbers in there, do this in the PR body below. For example, instead of "Fixes sagemath#1234" use "Introduce new method to calculate 1+1" --> <!-- Describe your changes here in detail --> <!-- Why is this change required? What problem does it solve? --> As proposed in sagemath#36610 (comment) <!-- If this PR resolves an open issue, please link to it here. For example "Fixes sagemath#12345". --> <!-- If your change requires a documentation PR, please link it appropriately. --> ### 📝 Checklist <!-- Put an `x` in all the boxes that apply. --> <!-- If your change requires a documentation PR, please link it appropriately --> <!-- If you're unsure about any of these, don't hesitate to ask. We're here to help! --> <!-- Feel free to remove irrelevant items. --> - [x] The title is concise, informative, and self-explanatory. - [ ] The description explains in detail what this PR is about. - [x] I have linked a relevant issue or discussion. - [ ] I have created tests covering the changes. - [ ] I have updated the documentation accordingly. ### ⌛ Dependencies <!-- List all open PRs that this PR logically depends on - sagemath#12345: short description why this is a dependency - sagemath#34567: ... --> <!-- If you're unsure about any of these, don't hesitate to ask. We're here to help! --> URL: sagemath#36616 Reported by: Matthias Köppe Reviewer(s): Kwankyu Lee
I don't think this PR is a serious proposal. If Tobias wants to demonstrate that after this radical cut of number of allotted runners by a factor 3 is suitable for running the portability tests, he can do so by running it in his own repo so we can have a look at the run. |
Please stop these comments. This PR is clearly not an April fools joke.
I don't want to demonstrate that this is possible, I want to reduce the completion time of PR-related workflows after a release. |
The other tests that are part of the CI Linux. |
Proposing to close this PR. As I said in #36610 (comment), this PR is not a serious proposal. |
There is nothing radical about this PR. It just gives priority to the PR workflows instead of the ci-linux, by reducing the priority of the later by roughly a factor of 2 (relative to the number of runners assigned at the point of creation of this PR). |
Maybe start with answering the following question instead of insulting me:
|
For reference: #36572 (comment) |
I fail to see how this answers my questions. |
As I explained elsewhere: labels "invalid" + "needs review" = motion to close. |
As discussed in #36726 (comment) |
That doesn't address my question at all. |
Documentation preview for this PR (built with commit 31f8802; changes) is ready! 🎉 |
@kcrisman responding to your suggestion in https://groups.google.com/g/sage-devel/c/IgBYUJl33SQ/m/ciVrZ7x0AQAJ: Confirming my vote of -1 on this PR. |
Perhaps surprisingly, here I agree more that it's reasonable to ask for a concrete number and have discussion about that before closing the ticket. See the end of my comment here: #36725 (comment) Am I correct in inferring that some of the CI limit tickets are, at their core, about a disagreement in philosophy on whether PRs or "standard" builds should get priority? I'm a little in the dark here (hence my initial reluctance to get drawn in ...). But anyway, I feel like that is all a sideshow. It seems like this should be a number that people can come to some consensus on. Can we at least agree on a hard upper and lower bound imposed by hardware/price/something? (And then maybe make a random choice based on some interesting distribution ... I'm totally not joking, randomized elections of this type are definitely proposed from time to time, and maybe it will help end the bad feelings here a tiny bit.) |
-1 from me |
Currently, the ci-linux workflow runs a lot of parallel runs which prevents other workflows (from PR) to be executed. We add a few more
max_parallel
statements to spread the stress on the ci over a longer time.📝 Checklist
⌛ Dependencies