
Unexpected delays building small plugin on linux agent #3916

Closed
MarkEWaite opened this issue Jan 20, 2024 · 16 comments

@MarkEWaite

MarkEWaite commented Jan 20, 2024

Service(s)

ci.jenkins.io

Summary

An xshell plugin pull request had unexpected 2-minute and 8-minute delays during its Linux build.

Reproduction steps

  1. XShell plugin PR 142 (an update to the most recent plugin BOM version) shows a 19-minute Linux build even though its typical Linux build takes 3-4 minutes. The build log included the following entries:
11:45:13  + mvn --batch-mode --show-version --errors --no-transfer-progress --update-snapshots \
            -Dmaven.repo.local=/home/jenkins/agent/workspace/Plugins_xshell-plugin_PR-142@tmp/m2repo \
            -Dmaven.test.failure.ignore -Dspotbugs.failOnError=false \
            -Dcheckstyle.failOnViolation=false -Dcheckstyle.failsOnError=false \
            -Penable-jacoco -Dset.changelist \
            help:evaluate -Dexpression=changelist \
            -Doutput=/home/jenkins/agent/workspace/Plugins_xshell-plugin_PR-142@tmp/changelist \
            -DforkCount=1C clean install \
            -XX:ConcGCThreads=1 -XX:G1ConcRefinementThreads=4 -XX:GCDrainStackTargetSize=64 \
            -XX:InitialHeapSize=125000000 -XX:MarkStackSize=4194304 \
            -XX:MaxHeapSize=2000000000 -XX:MinHeapSize=6815736 -XX:+PrintCommandLineFlags \
            -XX:ReservedCodeCacheSize=251658240 -XX:+SegmentedCodeCache \
            -XX:+UseCompressedOops -XX:+UseG1GC
11:45:13  Apache Maven 3.9.6 (bc0240f3c744dd6b6ec2920b3cd08dcc295161ae)
11:45:13  Maven home: /usr/share/apache-maven-3.9.6
11:45:13  Java version: 21.0.1, vendor: Eclipse Adoptium, runtime: /opt/jdk-21
11:45:13  Default locale: en_US, platform encoding: UTF-8
11:45:13  OS name: "linux", version: "6.1.0-0.deb11.11-amd64", arch: "amd64", family: "unix"
11:47:31  [INFO] Error stacktraces are turned on.
11:47:34  [INFO] Setting: -Dchangelist=270.vda_0e70622168 -DscmTag=da0e7062216838d07f8f15e59f084a3a3ebd5492
11:47:34  [INFO] No information available to set -DgitHubRepo
11:47:34  [INFO] Scanning for projects...
11:55:32  [INFO]
11:55:32  [INFO] -------------------< org.jenkins-ci.plugins:xshell >--------------------

There is a significant delay between the 11:45:13 entry that ends the --show-version output and the 11:47:31 entry that starts the help:evaluate step. That may be time spent transferring dependencies from the artifact caching proxy on DigitalOcean to the agent.

I am reasonably confident this specific job was run on DigitalOcean because the log includes an entry:

INFO: using artifact caching proxy from 'do' provider.

There is a significant delay between the 11:47:34 start of scanning for projects and the 11:55:32 start of the next step.

When I run those steps locally, the delays do not happen. My local build with no Maven cache completes in less than 4 minutes with the commands:

git clone https://github.com/jenkinsci/xshell-plugin.git
cd xshell-plugin
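# build in a throwaway Maven container with an empty local repository, timing the whole run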
time docker run -it --rm --name xshell -v "$(pwd)":/usr/src/mymaven -w /usr/src/mymaven maven:3.9.6-eclipse-temurin-21 mvn clean install
MarkEWaite added the triage (Incoming issues that need review) label Jan 20, 2024
lemeurherve self-assigned this Jan 22, 2024
lemeurherve removed the triage (Incoming issues that need review) label Jan 22, 2024
@lemeurherve
Member

lemeurherve commented Jan 22, 2024

This is indeed related to the artifact caching proxies.

I don't know yet whether all of them are affected, but after replaying jenkinsci/xshell-plugin#143 with useArtifactCachingProxy: false I can confirm that the build time decreased from 17~18 minutes to 3~4 minutes.
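For reference, the proxy bypass in that replay is just a parameter of the buildPlugin() step from the jenkins-infra pipeline library. A minimal Jenkinsfile sketch (the configurations values below are illustrative, not the ones from the actual xshell Jenkinsfile):

// Minimal Jenkinsfile sketch, not the exact replayed pipeline:
// useArtifactCachingProxy: false makes the build resolve artifacts directly
// from the upstream repositories instead of the repo.{aws,azure,do}.jenkins.io proxies.
buildPlugin(
    useArtifactCachingProxy: false,
    configurations: [
        [platform: 'linux', jdk: 21], // illustrative platform/JDK pair
    ]
)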

I've started analysing the Datadog logs but haven't found anything suspicious yet; I'll continue tomorrow.

What I can do while we identify and fix the root cause is globally disable ACP on ci.jenkins.io by setting its ARTIFACT_CACHING_PROXY_AVAILABLE_PROVIDERS global env var to none.
It would increase JFrog bandwidth consumption temporarily, but it would also reduce the load induced by plugin builds.
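For reference, setting that variable by hand would look roughly like this from the script console (a minimal sketch only; the real ci.jenkins.io value is managed through the jenkins-infra configuration, not set interactively):

import hudson.slaves.EnvironmentVariablesNodeProperty
import jenkins.model.Jenkins

// Minimal script console sketch — assumes the variable is exposed as a global
// node property; the actual ci.jenkins.io setup is managed as infrastructure code.
def jenkins = Jenkins.get()
def envProps = jenkins.globalNodeProperties.getAll(EnvironmentVariablesNodeProperty)
if (envProps.isEmpty()) {
    jenkins.globalNodeProperties.add(new EnvironmentVariablesNodeProperty())
    envProps = jenkins.globalNodeProperties.getAll(EnvironmentVariablesNodeProperty)
}
// 'none' tells plugin builds that no artifact caching proxy provider is available
envProps[0].envVars.put('ARTIFACT_CACHING_PROXY_AVAILABLE_PROVIDERS', 'none')
jenkins.save()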

WDYT?

@MarkEWaite
Author

It would increase JFrog bandwidth consumption temporarily, but it would also reduce the load induced by plugin builds.

WDYT?

I would rather leave the ACP enabled for at least another day so that you can evaluate it. We're still in the first month after completing the bandwidth reduction project, and I would like to keep bandwidth use low even at the expense of another day of slow builds.

@lemeurherve
Member

Ran some other tests; DO seems to be the only affected provider. Build times from repo.aws.jenkins.io or repo.azure.jenkins.io are fine.

@dduportal
Contributor

Ran some other tests; DO seems to be the only affected provider. Build times from repo.aws.jenkins.io or repo.azure.jenkins.io are fine.

Proposal: let's prepare the Kubernetes 1.27 migration starting with DO

  • Migrating to Kubernetes 1.27 will ensure nodes are recycled and updated, along with the CSI volume drivers

@lemeurherve
Member

As we have more urgent issues to deal with, temporarily disabling DigitalOcean agents on ci.jenkins.io as a workaround: jenkins-infra/jenkins-infra#3266

dduportal modified the milestones: infra-team-sync-2024-01-23, infra-team-sync-2024-01-30 Jan 24, 2024
smerle33 modified the milestones: infra-team-sync-2024-01-30, infra-team-sync-2024-02-13 Jan 30, 2024
@dduportal
Contributor

Blocked by #3948

@lemeurherve
Member

As the DigitalOcean cluster has been upgraded to Kubernetes 1.27 (cf. #3948 (comment)), here is the PR to restore DigitalOcean agents on ci.jenkins.io: jenkins-infra/jenkins-infra#3316
I'll then rerun a job to see whether the issue is still present.

@lemeurherve
Member

lemeurherve commented Mar 6, 2024

Rerunning jenkinsci/xshell-plugin#146 three times, I've seen the "Scanning for projects..." step take between 4 minutes 20 seconds and 4 minutes 40 seconds with ACP on DigitalOcean agents.

Does that look acceptable @MarkEWaite?

@basil
Collaborator

basil commented Mar 6, 2024

As the DigitalOcean cluster has been upgraded to Kubernetes 1.27 (cf. #3948 (comment))

I fail to see how an outdated Kubernetes cluster (now upgraded to version 1.27) was the cause of this issue.

https://ci.jenkins.io/job/Tools/job/bom/job/PR-2991/1/consoleFull on Digital Ocean has been sitting here for 20 minutes:

09:42:37  [INFO] --------------------< io.jenkins.tools.bom:sample >---------------------
09:42:37  [INFO] Building sample 2873.v15f36183ee4f                                 [6/6]
09:42:37  [INFO]   from sample-plugin/pom.xml
09:42:37  [INFO] --------------------------------[ hpi ]---------------------------------
10:02:14  [INFO] 
10:02:14  [INFO] --- clean:3.3.2:clean (default-clean) @ sample ---

The same took 12 minutes in https://ci.jenkins.io/job/Tools/job/bom/job/PR-2993/1/consoleFull on AWS:

09:26:19  [INFO] --------------------< io.jenkins.tools.bom:sample >---------------------
09:26:19  [INFO] Building sample 2873.vd5f7eb_fa_d165                               [6/6]
09:26:19  [INFO]   from sample-plugin/pom.xml
09:26:19  [INFO] --------------------------------[ hpi ]---------------------------------
09:38:52  [INFO] 
09:38:52  [INFO] --- clean:3.3.2:clean (default-clean) @ sample ---

Does that look acceptable @MarkEWaite?

Neither the AWS nor the Digital Ocean build times look acceptable to me. Local builds with a populated Maven cache do not take dozens of minutes.

@dduportal
Contributor

I fail to see how an outdated Kubernetes cluster (now upgraded to version 1.27) was the cause of this issue.

DigitalOcean tends to deprecate its Kubernetes versions very quickly to stay on the latest releases. That implies a lot of churn in the CSI drivers: each time a given version enters the last 3 months of its life (e.g. December, January, February) we see an increase in I/O latency on persistent volumes.

As the ACP seemed slow only on Digital Ocean, we hypothesized it could be a cause, without spending much more time on the analysis.

Neither the AWS nor the Digital Ocean build times look acceptable to me. Local builds with a populated Maven cache do not take dozens of minutes.

As you point out, we were mistaken. All ACP instances become slow after a certain amount of time, and we still can't find out why. I'm starting to believe that this whole ACP component has been a waste of effort given the number of problems it has caused. I wonder whether we should start thinking about moving it to dedicated VMs, to have better control over all the elements, as managed Kubernetes is a waste of resources for running such a service.

@basil
Collaborator

basil commented Mar 7, 2024

Part of the problem is surely the nondeterminism added by #3969. I would expect the resolution of that issue to make performance more consistent, either in a positive direction (better performance caused by exclusive use of ACP) or in a negative direction (worse performance caused by exclusive use of ACP).

I start to believe that this whole ACP component has been a waste of efforts given the amount of problem it caused.

The system is functioning correctly, which is a great start. But yes, performance is still an issue, as it often is with the initial release of a major architectural change. Such changes are usually followed up with performance improvements once correctness has been achieved.

smerle33 removed this from the infra-team-sync-2024-03-12 milestone Mar 12, 2024
@basil
Collaborator

basil commented Mar 26, 2024

From https://ci.jenkins.io/job/Tools/job/bom/job/PR-3062/2/consoleFull on Digital Ocean, a 13-minute delay downloading artifacts:

11:15:26  [INFO] --------------------< io.jenkins.tools.bom:sample >---------------------
11:15:26  [INFO] Building sample 2935.v7b_08b_e8a_962f                              [6/6]
11:15:26  [INFO]   from sample-plugin/pom.xml
11:15:26  [INFO] --------------------------------[ hpi ]---------------------------------
11:28:30  [INFO] 
11:28:30  [INFO] --- clean:3.3.2:clean (default-clean) @ sample ---

@dduportal
Contributor

From https://ci.jenkins.io/job/Tools/job/bom/job/PR-3062/2/consoleFull on Digital Ocean, a 13-minute delay downloading artifacts:

11:15:26  [INFO] --------------------< io.jenkins.tools.bom:sample >---------------------
11:15:26  [INFO] Building sample 2935.v7b_08b_e8a_962f                              [6/6]
11:15:26  [INFO]   from sample-plugin/pom.xml
11:15:26  [INFO] --------------------------------[ hpi ]---------------------------------
11:28:30  [INFO] 
11:28:30  [INFO] --- clean:3.3.2:clean (default-clean) @ sample ---

Thanks for the data @basil. We are focusing on #3969 first (if that makes sense) and we'll see the impact on these builds.

@dduportal
Contributor

From https://ci.jenkins.io/job/Tools/job/bom/job/PR-3062/2/consoleFull on Digital Ocean, a 13-minute delay downloading artifacts:

11:15:26  [INFO] --------------------< io.jenkins.tools.bom:sample >---------------------
11:15:26  [INFO] Building sample 2935.v7b_08b_e8a_962f                              [6/6]
11:15:26  [INFO]   from sample-plugin/pom.xml
11:15:26  [INFO] --------------------------------[ hpi ]---------------------------------
11:28:30  [INFO] 
11:28:30  [INFO] --- clean:3.3.2:clean (default-clean) @ sample ---

Triggered a new bom build on master following #3969 (comment). We might need a 2nd one, once the cache is loaded in the ACP (Maven Central artifacts), before seeing a real effect.

@basil
Collaborator

basil commented Mar 28, 2024

I watched some plugin builds today, and performance was back to the level I remember it being a few months ago.

@MarkEWaite
Author

The 3 xshell plugin builds that I ran today took less than 90 seconds in their "Build" stage on Linux. The build from 6 days ago took 13 minutes. Performance has improved very nicely.
