Unexpected delays building small plugin on linux agent #3916
Comments
This is related to the artifact caching proxies indeed. I don't know yet if all of them are concerned, but I'm replaying jenkinsci/xshell-plugin#143 to check. I've started analysing the Datadog logs but didn't find anything suspect yet; I'll continue tomorrow. While we're identifying the root cause and fixing it, I could globally disable ACP on ci.jenkins.io. WDYT?
I would rather leave the ACP enabled for at least another day so that you can evaluate it. We're still in the first month after completing the bandwidth reduction project, and I would like to keep bandwidth use lower even at the expense of another day of slow builds.
Ran some other tests: DO seems to be the only provider concerned; build times from repo.aws.jenkins.io or repo.azure.jenkins.io are fine.
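For reference, a minimal sketch of how such a per-provider comparison can be run from a shell, assuming each ACP instance serves Maven Central under a `/public/` path (the path and the sample artifact are assumptions, not taken from this thread):

```bash
for host in repo.do.jenkins.io repo.aws.jenkins.io repo.azure.jenkins.io; do
  # Time a single small artifact download through each proxy; a slow
  # provider should stand out immediately compared to the others.
  printf '%s: ' "$host"
  curl -s -o /dev/null -w '%{time_total}s\n' \
    "https://$host/public/junit/junit/4.13.2/junit-4.13.2.pom"
done
```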
Proposal: let's prepare the Kubernetes 1.27 migration, starting with DO.
As we have more urgent issues to deal with, temporarily disabling DigitalOcean agents on ci.jenkins.io as a workaround: jenkins-infra/jenkins-infra#3266
Blocked by #3948
As the DigitalOcean cluster has been upgraded to Kubernetes 1.27 (cf. #3948 (comment)), here is the PR to restore DigitalOcean agents on ci.jenkins.io: jenkins-infra/jenkins-infra#3316
Rerunning jenkinsci/xshell-plugin#146 three times, I got the "Scanning for projects..." step executed in between 4 minutes 20 seconds and 4 minutes 40 seconds with ACP on DigitalOcean agents. Does that look acceptable, @MarkEWaite?
I fail to see how an outdated Kubernetes cluster (now upgraded to version 1.27) was the cause of this issue. https://ci.jenkins.io/job/Tools/job/bom/job/PR-2991/1/consoleFull on Digital Ocean has been sitting here for 20 minutes:
The same took 12 minutes in https://ci.jenkins.io/job/Tools/job/bom/job/PR-2993/1/consoleFull on AWS:
Neither the AWS nor the Digital Ocean build times look acceptable to me. Local builds with a populated Maven cache do not take dozens of minutes.
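As a point of comparison, a minimal sketch of that local measurement, assuming the dependencies are already in the local repository (offline mode keeps Maven from touching any remote or proxy at all):

```bash
# Build entirely from the populated local cache; any remaining minutes
# are the build itself, not artifact transfer.
time mvn --offline clean verify
```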
DigitalOcean tends to deprecate their Kubernetes versions really quickly to stay on the latest releases. That implies a lot of churn in the CSI drivers: each time a given version enters the last 3 months of its life (e.g. December, January, February) we see an increase in I/O latency on persistent volumes. As the ACP seemed slow only on Digital Ocean, we hypothesized it could be a cause, without spending much more time on the analysis.
As you underline, we were mistaken. All ACP instances are slow after a certain amount of time, and we still can't find out why. I'm starting to believe that this whole ACP component has been a waste of effort given the number of problems it has caused. I wonder whether we should start thinking about moving it to dedicated VMs to have better control over all the elements, as managed Kubernetes is a waste of resources for running such a service.
Part of the problem is surely the nondeterminism added by #3969. I would expect the resolution of that issue to make performance more consistent, either in a positive direction (better performance caused by exclusive use of ACP) or in a negative direction (worse performance caused by exclusive use of ACP).
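For context on what "exclusive use of ACP" means at the Maven level, here is a minimal sketch, assuming the proxies are consumed as Maven mirrors under a `/public/` path (the URL and mirror id are assumptions): a `mirrorOf` of `*` forces every resolution through the proxy, while a narrower pattern lets some requests bypass it and reintroduces variability.

```bash
# Write a settings.xml that routes all artifact resolution through one ACP.
cat > ~/.m2/settings.xml <<'EOF'
<settings>
  <mirrors>
    <mirror>
      <id>azure-acp</id>
      <!-- '*' = exclusive: nothing falls back to Maven Central directly -->
      <mirrorOf>*</mirrorOf>
      <url>https://repo.azure.jenkins.io/public/</url>
    </mirror>
  </mirrors>
</settings>
EOF
```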
The system is functioning correctly, which is a great start. But yes, performance is still an issue, as it often is with the initial release of a major architectural change. Such changes are usually followed up with performance improvements once correctness has been achieved.
From https://ci.jenkins.io/job/Tools/job/bom/job/PR-3062/2/consoleFull on Digital Ocean, a 13-minute delay downloading artifacts:
Thanks for the data @basil. We are focusing first (if it makes sense) on #3969, and we'll see the impact on these builds.
Triggered a new bom build on master following #3969 (comment). We might need a second one, once the Maven Central artifacts are loaded into the ACP cache, before seeing a real effect.
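One hedged way to check whether a given artifact is already warm in the cache is to inspect the proxy's response headers; the exact header name depends on the proxy implementation and is an assumption here, as is the `/public/` path:

```bash
# Look for a cache status header (e.g. HIT/MISS) on a HEAD request;
# adjust the grep pattern to whatever the ACP actually returns.
curl -sI 'https://repo.do.jenkins.io/public/junit/junit/4.13.2/junit-4.13.2.pom' \
  | grep -i cache
```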
I watched some plugin builds today, and performance was back to the level I remember it being a few months ago. |
The 3 xshell plugin builds that I ran today took less than 90 seconds in their "Build" stage on Linux. The build from 6 days ago took 13 minutes. Performance has improved very nicely. |
Service(s)
ci.jenkins.io
Summary
An xshell plugin pull request had unexpected 2-minute and 8-minute delays during its Linux build.
Reproduction steps
There is a significant delay between the 11:45:13 entry that ends the output of `--show-version` and the 11:47:31 entry that starts the `help:evaluate` step. That may be time when dependencies are being transferred from the artifact caching proxy on DigitalOcean to the agent. I am reasonably confident this specific job was run on DigitalOcean because the log includes an entry:
There is a significant delay between the 11:47:34 start of scanning for projects and the 11:55:32 start of the next step.
When I run those steps locally, the delays do not happen. My local build with no Maven cache completes in less than 4 minutes with the commands:
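The original commands were not preserved in this extract; as a hypothetical reconstruction, a typical Jenkins plugin build from an empty local cache looks roughly like this (the expression evaluated by `help:evaluate` is an assumption):

```bash
# Start from an empty local Maven repository to force all downloads.
rm -rf ~/.m2/repository
# Steps matching the console log: print the Maven version, evaluate an
# expression, then run the full build while dependencies are fetched.
mvn --show-version help:evaluate -Dexpression=project.version -q -DforceStdout
time mvn clean verify
```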