Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit max number of parallel ci runs for docker-based runs #36610

Open
wants to merge 6 commits into
base: develop
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 9 additions & 11 deletions .github/workflows/ci-linux.yml
Original file line number Diff line number Diff line change
Expand Up @@ -64,9 +64,7 @@ jobs:
tox_packages_factors: >-
["standard"]
docker_push_repository: ghcr.io/${{ github.repository }}/
# Make sure that all "standard-pre" jobs can start simultaneously,
# so that runners are available by the time that "default" starts.
max_parallel: 50
max_parallel: 15

standard:
if: ${{ success() || failure() }}
Expand All @@ -85,9 +83,7 @@ jobs:
tox_packages_factors: >-
["standard"]
docker_push_repository: ghcr.io/${{ github.repository }}/
# Reduce from 30 to 20 because it runs in parallel with 'standard-sitepackages'
# and 'minimal-pre' below
max_parallel: 20
max_parallel: 12

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you picked up these numbers heuristically. But what is the reason that 12 for "standard" is less than 15 for "standard-pre"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The heuristic was that "standard-pre" is rather important as you want to be notified early that there are profound compilation issues on a certain system. "standard" is a bit less important (and results seem to be more uniform across the systems tested), also there are a few systems dropping out as their pre step failed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the workflows do not fail when there are doctest failures.
It's crucial for "standard" to be completed as quickly as possible so that developers can inspect the results.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that there is an overall walltime limit of 3 days (?) for the whole workflow to pass. So whatever job restrictions we put in, we need to make sure that this does not cause the later workflows to never run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is nothing crucial about these workflow runs. If they fail, then its just an ordinary bug that needs to be fixed. There is no need to have them finished as soon as possible. You can always remove some older/similar systems from the ci if things are then taking too long.

In my opinion it's not acceptable that other workflows are queued for 14+ hours because of it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. 60 runners. Then increasing "max-parallel" to 50 is a good solution!

Also we may decrease "max-parallel" for "minimal", "maximal", "experimental" to allow rooms for other PR workflow runs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. When I added the "standard-sitepackages" run (with a limit of 10 and running parallel with "standard"), I didn't decrease the max-parallel of "standard". We can reduce it a bit, from 30 perhaps to 25. I wouldn't go much lower than that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about basing your PR on Tobias's?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've instead pushed a small change to #36616 (comment)

Copy link
Contributor Author

@tobiasdiez tobiasdiez Nov 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking; then the time to the release (~ 7 days) is the closer deadline

According to https://github.com/sagemath/sage/actions/runs/6598197147/usage, the overall run time is 28d, so an average number of 5 runners should be more than enough to make that deadline. With the proposed 10-15 runners, you get it in 3 to 4 days.

standard-sitepackages:
if: ${{ success() || failure() }}
Expand Down Expand Up @@ -149,9 +145,7 @@ jobs:
tox_packages_factors: >-
["minimal"]
docker_push_repository: ghcr.io/${{ github.repository }}/
# Reduced from 30 because it may run in parallel with 'standard' and 'standard-sitepackages' above.
# Calibrated for clogging the job pipeline until the "default" job has finished.
max_parallel: 24
max_parallel: 10

minimal:
if: ${{ success() || failure() }}
Expand All @@ -170,7 +164,7 @@ jobs:
tox_packages_factors: >-
["minimal"]
docker_push_repository: ghcr.io/${{ github.repository }}/
max_parallel: 24
max_parallel: 10

maximal-pre:
if: ${{ success() || failure() }}
Expand All @@ -185,6 +179,7 @@ jobs:
tox_packages_factors: >-
["maximal"]
docker_push_repository: ghcr.io/${{ github.repository }}/
max_parallel: 10

optional-0-o:
if: ${{ success() || failure() }}
Expand All @@ -200,7 +195,7 @@ jobs:
docker_targets: "with-targets-optional"
# [0-9a-o] excludes _, in particular package _develop
targets_optional: '$(echo $(export PATH=build/bin:$PATH && sage-package list :optional: --has-file "spkg-install.in|spkg-install|requirements.txt" --no-file "huge|has_nonfree_dependencies" | grep -v sagemath_doc | grep ^[0-9a-o]))'

max_parallel: 10

optional-p-z:
if: ${{ success() || failure() }}
Expand All @@ -216,6 +211,7 @@ jobs:
docker_targets: "with-targets-optional"
# [0-9a-o] excludes _, in particular package _develop
targets_optional: '$(echo $(export PATH=build/bin:$PATH && sage-package list :optional: --has-file "spkg-install.in|spkg-install|requirements.txt" --no-file "huge|has_nonfree_dependencies" | grep -v sagemath_doc | grep ^[p-z]))'
max_parallel: 10

experimental-0-o:
if: ${{ success() || failure() }}
Expand All @@ -230,6 +226,7 @@ jobs:
["maximal"]
docker_targets: "with-targets-optional"
targets_optional: '$(echo $(export PATH=build/bin:$PATH && sage-package list :experimental: --has-file "spkg-install.in|spkg-install|requirements.txt" --no-file "huge|has_nonfree_dependencies" | grep -v sagemath_doc | grep ^[0-9a-o]))'
max_parallel: 10

experimental-p-z:
if: ${{ success() || failure() }}
Expand All @@ -244,3 +241,4 @@ jobs:
["maximal"]
docker_targets: "with-targets-optional"
targets_optional: '$(echo $(export PATH=build/bin:$PATH && sage-package list :experimental: --has-file "spkg-install.in|spkg-install|requirements.txt" --no-file "huge|has_nonfree_dependencies" | grep -v sagemath_doc | grep ^[p-z]))'
max_parallel: 10
Loading