
Unexpected delays building small plugin on linux agent #3916

Closed
MarkEWaite opened this issue Jan 20, 2024 · 16 comments

@MarkEWaite

MarkEWaite commented Jan 20, 2024

Service(s)

ci.jenkins.io

Summary

An xshell plugin pull request had unexpected 2-minute and 8-minute delays during its Linux build.

Reproduction steps

  1. XShell plugin PR 142 (an update to the most recent plugin BOM version) shows a 19-minute Linux build even though its typical Linux build takes 3-4 minutes. The build log included the following entries:
11:45:13  + mvn --batch-mode --show-version --errors --no-transfer-progress --update-snapshots \
            -Dmaven.repo.local=/home/jenkins/agent/workspace/Plugins_xshell-plugin_PR-142@tmp/m2repo \
            -Dmaven.test.failure.ignore -Dspotbugs.failOnError=false \
            -Dcheckstyle.failOnViolation=false -Dcheckstyle.failsOnError=false \
            -Penable-jacoco -Dset.changelist \
            help:evaluate -Dexpression=changelist \
            -Doutput=/home/jenkins/agent/workspace/Plugins_xshell-plugin_PR-142@tmp/changelist \
            -DforkCount=1C clean install \
            -XX:ConcGCThreads=1 -XX:G1ConcRefinementThreads=4 -XX:GCDrainStackTargetSize=64 \
            -XX:InitialHeapSize=125000000 -XX:MarkStackSize=4194304 \
            -XX:MaxHeapSize=2000000000 -XX:MinHeapSize=6815736 -XX:+PrintCommandLineFlags \
            -XX:ReservedCodeCacheSize=251658240 -XX:+SegmentedCodeCache \
            -XX:+UseCompressedOops -XX:+UseG1GC
11:45:13  Apache Maven 3.9.6 (bc0240f3c744dd6b6ec2920b3cd08dcc295161ae)
11:45:13  Maven home: /usr/share/apache-maven-3.9.6
11:45:13  Java version: 21.0.1, vendor: Eclipse Adoptium, runtime: /opt/jdk-21
11:45:13  Default locale: en_US, platform encoding: UTF-8
11:45:13  OS name: "linux", version: "6.1.0-0.deb11.11-amd64", arch: "amd64", family: "unix"
11:47:31  [INFO] Error stacktraces are turned on.
11:47:34  [INFO] Setting: -Dchangelist=270.vda_0e70622168 -DscmTag=da0e7062216838d07f8f15e59f084a3a3ebd5492
11:47:34  [INFO] No information available to set -DgitHubRepo
11:47:34  [INFO] Scanning for projects...
11:55:32  [INFO]
11:55:32  [INFO] -------------------< org.jenkins-ci.plugins:xshell >--------------------

There is a significant delay between the 11:45:13 entry that ends the --show-version output and the 11:47:31 entry that starts the help:evaluate step. That may be time spent transferring dependencies from the artifact caching proxy on DigitalOcean to the agent.

I am reasonably confident this specific job was run on DigitalOcean because the log includes an entry:

INFO: using artifact caching proxy from 'do' provider.

There is a significant delay between the 11:47:34 start of scanning for projects and the 11:55:32 start of the next step.

When I run those steps locally, the delays do not happen. My local build with no Maven cache completes in less than 4 minutes with the commands:

git clone https://github.com/jenkinsci/xshell-plugin.git
cd xshell-plugin
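# build in a throwaway Maven container with an empty local repository, timing the whole run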
time docker run -it --rm --name xshell -v "$(pwd)":/usr/src/mymaven -w /usr/src/mymaven maven:3.9.6-eclipse-temurin-21 mvn clean install
MarkEWaite added the triage (Incoming issues that need review) label Jan 20, 2024
lemeurherve self-assigned this Jan 22, 2024
lemeurherve removed the triage (Incoming issues that need review) label Jan 22, 2024
@lemeurherve
Member

lemeurherve commented Jan 22, 2024

This is indeed related to the artifact caching proxies.

I don't know yet whether all of them are affected, but after replaying jenkinsci/xshell-plugin#143 with useArtifactCachingProxy: false I can confirm that the build time decreased from 17~18 minutes to 3~4 minutes.
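For reference, the proxy bypass in that replay is just a parameter of the buildPlugin() step from the jenkins-infra pipeline library. A minimal Jenkinsfile sketch (the configurations values below are illustrative, not the ones from the actual xshell Jenkinsfile):

// Minimal Jenkinsfile sketch, not the exact replayed pipeline:
// useArtifactCachingProxy: false makes the build resolve artifacts directly
// from the upstream repositories instead of the repo.{aws,azure,do}.jenkins.io proxies.
buildPlugin(
    useArtifactCachingProxy: false,
    configurations: [
        [platform: 'linux', jdk: 21], // illustrative platform/JDK pair
    ]
)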

I've started analysing the Datadog logs but haven't found anything suspicious yet; I'll continue tomorrow.

What I can do while we identify and fix the root cause is globally disable ACP on ci.jenkins.io by setting its ARTIFACT_CACHING_PROXY_AVAILABLE_PROVIDERS global env var to none.
It would increase JFrog bandwidth consumption temporarily, but it would also reduce the load induced by plugin builds.
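For reference, setting that variable by hand would look roughly like this from the script console (a minimal sketch only; the real ci.jenkins.io value is managed through the jenkins-infra configuration, not set interactively):

import hudson.slaves.EnvironmentVariablesNodeProperty
import jenkins.model.Jenkins

// Minimal script console sketch — assumes the variable is exposed as a global
// node property; the actual ci.jenkins.io setup is managed as infrastructure code.
def jenkins = Jenkins.get()
def envProps = jenkins.globalNodeProperties.getAll(EnvironmentVariablesNodeProperty)
if (envProps.isEmpty()) {
    jenkins.globalNodeProperties.add(new EnvironmentVariablesNodeProperty())
    envProps = jenkins.globalNodeProperties.getAll(EnvironmentVariablesNodeProperty)
}
// 'none' tells plugin builds that no artifact caching proxy provider is available
envProps[0].envVars.put('ARTIFACT_CACHING_PROXY_AVAILABLE_PROVIDERS', 'none')
jenkins.save()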

WDYT?

@MarkEWaite
Author

It would increase JFrog bandwidth consumption temporarily, but it would also reduce the load induced by plugin builds.

WDYT?

I would rather leave the ACP enabled for at least another day so that you can evaluate it. We're still in the first month after completing the bandwidth reduction project, and I would like to keep bandwidth use low even at the expense of another day of slow builds.

@lemeurherve
Member

Ran some other tests; DO seems to be the only affected provider. Build times from repo.aws.jenkins.io or repo.azure.jenkins.io are fine.

@dduportal
Contributor

Ran some other tests; DO seems to be the only affected provider. Build times from repo.aws.jenkins.io or repo.azure.jenkins.io are fine.

Proposal: let's prepare the Kubernetes 1.27 migration starting with DO

  • Migrating to Kubernetes 1.27 will ensure nodes are recycled and updated, along with the CSI volume drivers

@lemeurherve
Member

As we have more urgent issues to deal with, temporarily disabling DigitalOcean agents on ci.jenkins.io as a workaround: jenkins-infra/jenkins-infra#3266

dduportal modified the milestones: infra-team-sync-2024-01-23, infra-team-sync-2024-01-30 Jan 24, 2024
smerle33 modified the milestones: infra-team-sync-2024-01-30, infra-team-sync-2024-02-13 Jan 30, 2024
@dduportal
Contributor

Blocked by #3948

@lemeurherve
Member

As the DigitalOcean cluster has been upgraded to Kubernetes 1.27 (cf. #3948 (comment)), here is the PR to restore DigitalOcean agents on ci.jenkins.io: jenkins-infra/jenkins-infra#3316
I'll then rerun a job to see whether the issue is still present.

@lemeurherve
Member

lemeurherve commented Mar 6, 2024

Rerunning jenkinsci/xshell-plugin#146 three times, I've seen the "Scanning for projects..." step take between 4 minutes 20 seconds and 4 minutes 40 seconds with ACP on DigitalOcean agents.

Does that look acceptable @MarkEWaite?

@basil
Collaborator

basil commented Mar 6, 2024

As the DigitalOcean cluster has been upgraded to Kubernetes 1.27 (cf. #3948 (comment))

I fail to see how an outdated Kubernetes cluster (now upgraded to version 1.27) was the cause of this issue.

https://ci.jenkins.io/job/Tools/job/bom/job/PR-2991/1/consoleFull on Digital Ocean has been sitting here for 20 minutes:

09:42:37  [INFO] --------------------< io.jenkins.tools.bom:sample >---------------------
09:42:37  [INFO] Building sample 2873.v15f36183ee4f                                 [6/6]
09:42:37  [INFO]   from sample-plugin/pom.xml
09:42:37  [INFO] --------------------------------[ hpi ]---------------------------------
10:02:14  [INFO] 
10:02:14  [INFO] --- clean:3.3.2:clean (default-clean) @ sample ---

The same took 12 minutes in https://ci.jenkins.io/job/Tools/job/bom/job/PR-2993/1/consoleFull on AWS:

09:26:19  [INFO] --------------------< io.jenkins.tools.bom:sample >---------------------
09:26:19  [INFO] Building sample 2873.vd5f7eb_fa_d165                               [6/6]
09:26:19  [INFO]   from sample-plugin/pom.xml
09:26:19  [INFO] --------------------------------[ hpi ]---------------------------------
09:38:52  [INFO] 
09:38:52  [INFO] --- clean:3.3.2:clean (default-clean) @ sample ---

Does that look acceptable @MarkEWaite?

Neither the AWS nor the Digital Ocean build times look acceptable to me. Local builds with a populated Maven cache do not take dozens of minutes.

@dduportal
Contributor

I fail to see how an outdated Kubernetes cluster (now upgraded to version 1.27) was the cause of this issue.

DigitalOcean tends to deprecate its Kubernetes versions very quickly to stay on the latest releases. That implies a lot of churn in the CSI drivers: each time a given version enters the last 3 months of its life (e.g. December, January, February) we see an increase in I/O latency on persistent volumes.

As the ACP seemed slow only on Digital Ocean, we hypothesized it could be a cause, without spending much more time on the analysis.

Neither the AWS nor the Digital Ocean build times look acceptable to me. Local builds with a populated Maven cache do not take dozens of minutes.

As you point out, we were mistaken. All ACP instances become slow after a certain amount of time, and we still can't find out why. I'm starting to believe that this whole ACP component has been a waste of effort given the number of problems it has caused. I wonder whether we should start thinking about moving it to dedicated VMs, to have better control over all the elements, as managed Kubernetes is a waste of resources for running such a service.

@basil
Collaborator

basil commented Mar 7, 2024

Part of the problem is surely the nondeterminism added by #3969. I would expect the resolution of that issue to make performance more consistent, either in a positive direction (better performance caused by exclusive use of ACP) or in a negative direction (worse performance caused by exclusive use of ACP).

I start to believe that this whole ACP component has been a waste of efforts given the amount of problem it caused.

The system is functioning correctly, which is a great start. But yes, performance is still an issue, as it often is with the initial release of a major architectural change. Such changes are usually followed up with performance improvements once correctness has been achieved.

smerle33 removed this from the infra-team-sync-2024-03-12 milestone Mar 12, 2024
@basil
Collaborator

basil commented Mar 26, 2024

From https://ci.jenkins.io/job/Tools/job/bom/job/PR-3062/2/consoleFull on Digital Ocean, a 13-minute delay downloading artifacts:

11:15:26  [INFO] --------------------< io.jenkins.tools.bom:sample >---------------------
11:15:26  [INFO] Building sample 2935.v7b_08b_e8a_962f                              [6/6]
11:15:26  [INFO]   from sample-plugin/pom.xml
11:15:26  [INFO] --------------------------------[ hpi ]---------------------------------
11:28:30  [INFO] 
11:28:30  [INFO] --- clean:3.3.2:clean (default-clean) @ sample ---

@dduportal
Contributor

From https://ci.jenkins.io/job/Tools/job/bom/job/PR-3062/2/consoleFull on Digital Ocean, a 13-minute delay downloading artifacts:

11:15:26  [INFO] --------------------< io.jenkins.tools.bom:sample >---------------------
11:15:26  [INFO] Building sample 2935.v7b_08b_e8a_962f                              [6/6]
11:15:26  [INFO]   from sample-plugin/pom.xml
11:15:26  [INFO] --------------------------------[ hpi ]---------------------------------
11:28:30  [INFO] 
11:28:30  [INFO] --- clean:3.3.2:clean (default-clean) @ sample ---

Thanks for the data @basil. We are focusing on #3969 first (if that makes sense) and we'll see the impact on these builds.

@dduportal
Contributor

From https://ci.jenkins.io/job/Tools/job/bom/job/PR-3062/2/consoleFull on Digital Ocean, a 13-minute delay downloading artifacts:

11:15:26  [INFO] --------------------< io.jenkins.tools.bom:sample >---------------------
11:15:26  [INFO] Building sample 2935.v7b_08b_e8a_962f                              [6/6]
11:15:26  [INFO]   from sample-plugin/pom.xml
11:15:26  [INFO] --------------------------------[ hpi ]---------------------------------
11:28:30  [INFO] 
11:28:30  [INFO] --- clean:3.3.2:clean (default-clean) @ sample ---

Triggered a new bom build on master following #3969 (comment). We might need a 2nd one, once the cache is loaded in the ACP (Maven Central artifacts), before seeing a real effect.

@basil
Collaborator

basil commented Mar 28, 2024

I watched some plugin builds today, and performance was back to the level I remember it being a few months ago.

@MarkEWaite
Author

The 3 xshell plugin builds that I ran today took less than 90 seconds in their "Build" stage on Linux. The build from 6 days ago took 13 minutes. Performance has improved very nicely.
