Agent load testing #1467

Merged
merged 37 commits into from
Nov 5, 2020

Conversation

cachedout
Contributor

@cachedout cachedout commented Nov 2, 2020

Summary

This adds support for load-testing the Java agent.

Details

This is a new pipeline which introduces the ability to provision any version of the Java agent, instrument a test application with it, and then apply load to that application in order to test the stability and performance characteristics of the instrumented application over time.

When launching the pipeline, users are presented with the following:

[Image: pipe]

As you can see, it allows the user to select the version of the agent to deploy, the JDK to use, and the duration of the test. Load is generated via Locust, and you can optionally specify how load generation should be conducted.

The tests run on bare metal, with load generation and the test application residing on separate bare-metal machines.

[Image: success]

At the conclusion of the test run, a file is produced which shows the performance of the test application as instrumented with JFR.

If you wish, you may also enable the metrics collection checkbox, which periodically collects system metrics directly from the operating system. (This is useful if you don't wish to rely on JFR or you simply want the perspective from the OS itself instead of from inside the Java process.)
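
To illustrate what periodic OS-level collection could look like, here is a minimal sketch using only the Python standard library. The function names, the sampled fields, and the interval are assumptions made for illustration; this is not the script the pipeline actually runs.

```python
import os
import shutil
import time


def sample_system_metrics(path="/"):
    """Take one snapshot of host metrics as seen by the operating system."""
    # os.getloadavg() is only available on Unix-like systems.
    load1, load5, load15 = os.getloadavg()
    disk = shutil.disk_usage(path)
    return {
        "timestamp": time.time(),
        "load_1m": load1,
        "load_5m": load5,
        "load_15m": load15,
        "disk_used_bytes": disk.used,
        "disk_free_bytes": disk.free,
    }


def collect(samples, interval_seconds):
    """Collect a fixed number of snapshots, sleeping between them."""
    snapshots = []
    for i in range(samples):
        snapshots.append(sample_system_metrics())
        if i < samples - 1:
            time.sleep(interval_seconds)
    return snapshots
```

Something like `collect(samples=60, interval_seconds=10)` would cover a ten-minute run. Because the snapshots come from the OS rather than from inside the JVM, they can be read alongside the JFR output.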

Deployment

This PR may be merged at any time, but it will not be ready for use until the Bandstand application is also deployed. (This is a separate application developed in conjunction with this one that eases the burden of orchestrating multiple bare-metal machines and handles service discovery, etc.) It is not being released as an OSS application and as such, is not linked from this PR.

Future enhancements and caveats

Presently, this relies on a dedicated machine to receive requests from the agent. Nothing is done with this data, and we would like to find a way to run these tests without requiring a dedicated APM server or perhaps making a version of APM Server which does not require Elasticsearch to run. This is a topic for a follow-up discussion.

Additionally, we would like to make the orchestration application something fully managed in the lifetime of the pipeline instead of being a persistent service. This isn't urgent, however, and can be done later.

Finally, we are interested in potentially gathering metrics from a few other sources which can be displayed with the Elastic Stack. Specifically, it would be nice to include an option to monitor the host with Metricbeat and it would also be nice to find a way to collect and display the load generation metrics. Presently, this requires just looking at the logs and can clearly be improved.

@cachedout added the automation and ci labels Nov 2, 2020
@apmmachine
Contributor

apmmachine commented Nov 2, 2020

💔 Tests Failed

Build stats

  • Build Cause: [Pull request #1467 updated]

  • Start Time: 2020-11-05T15:46:21.520+0000

  • Duration: 22 min 52 sec

Test stats 🧪

Test Results
Failed 1
Passed 787
Skipped 0
Total 788

Test errors 1

  • Name: Tests / Unit Tests / testExecutorExecute_Transaction[0] – co.elastic.apm.agent.concurrent.ExecutorInstrumentationTest

    • Age: 1 (took 0.253 sec)
    • Error Details: [pool should have all its items recycled : instance = ['' 00-00000000000000000000000000000000-0000000000000000-00 (8581267)], class = co.elastic.apm.agent.impl.transaction.Transaction] Expecting empty but was:<['' 00-00000000000000000000000000000000-0000000000000000-00 (8581267)]>
    • Error Stacktrace:
java.lang.AssertionError: 
[pool should have all its items recycled : instance = ['' 00-00000000000000000000000000000000-0000000000000000-00 (8581267)], class = co.elastic.apm.agent.impl.transaction.Transaction] 
Expecting empty but was:<['' 00-00000000000000000000000000000000-0000000000000000-00 (8581267)]>
   at co.elastic.apm.agent.objectpool.TestObjectPoolFactory.checkAllPooledObjectsHaveBeenRecycled(TestObjectPoolFactory.java:70)
   at co.elastic.apm.agent.AbstractInstrumentationTest.cleanUp(AbstractInstrumentationTest.java:92)
   at jdk.internal.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
   at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
   at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
   at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
   at org.junit.internal.runners.statements.RunAfters.invokeMethod(RunAfters.java:46)
   at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
   at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
   at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
   at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
   at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
   at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
   at org.junit.runners.Suite.runChild(Suite.java:128)
   at org.junit.runners.Suite.runChild(Suite.java:27)
   at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
   at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
   at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
   at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
   at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
   at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
   at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
   at org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:43)
   at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
   at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
   at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
   at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
   at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
   at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
   at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
   at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
   at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
   at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
   at org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:82)
   at org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:73)
   at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:220)
   at org.junit.platform.launcher.core.DefaultLauncher.lambda$execute$6(DefaultLauncher.java:188)
   at org.junit.platform.launcher.core.DefaultLauncher.withInterceptedStreams(DefaultLauncher.java:202)
   at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:181)
   at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:128)
   at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invokeAllTests(JUnitPlatformProvider.java:150)
   at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:124)
   at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
   at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
   at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
   at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)

Steps errors 1

  • Name: Shell Script (took 4 min 6 sec)
    • Description: #!/bin/bash set -euxo pipefail ./mvnw test

Log output (last 100 lines)

[2020-11-05T16:08:08.716Z] Generating /var/lib/jenkins/workspace/_java_apm-agent-java-mbp_PR-1467/src/github.com/elastic/apm-agent-java/apm-agent-plugins/apm-quartz-job-plugin/target/site/apidocs/deprecated-list.html...
[2020-11-05T16:08:08.716Z] Building index for all classes...
[2020-11-05T16:08:08.716Z] Generating /var/lib/jenkins/workspace/_java_apm-agent-java-mbp_PR-1467/src/github.com/elastic/apm-agent-java/apm-agent-plugins/apm-quartz-job-plugin/target/site/apidocs/allclasses.html...
[2020-11-05T16:08:08.717Z] Generating /var/lib/jenkins/workspace/_java_apm-agent-java-mbp_PR-1467/src/github.com/elastic/apm-agent-java/apm-agent-plugins/apm-quartz-job-plugin/target/site/apidocs/allclasses.html...
[2020-11-05T16:08:08.717Z] Generating /var/lib/jenkins/workspace/_java_apm-agent-java-mbp_PR-1467/src/github.com/elastic/apm-agent-java/apm-agent-plugins/apm-quartz-job-plugin/target/site/apidocs/index.html...
[2020-11-05T16:08:08.717Z] Generating /var/lib/jenkins/workspace/_java_apm-agent-java-mbp_PR-1467/src/github.com/elastic/apm-agent-java/apm-agent-plugins/apm-quartz-job-plugin/target/site/apidocs/help-doc.html...
[2020-11-05T16:08:08.717Z] [INFO] 
[2020-11-05T16:08:08.717Z] [INFO] -----------------< co.elastic.apm:apm-process-plugin >------------------
[2020-11-05T16:08:08.717Z] [INFO] Building co.elastic.apm:apm-process-plugin 1.18.2-SNAPSHOT       [52/80]
[2020-11-05T16:08:08.717Z] [INFO] --------------------------------[ jar ]---------------------------------
[2020-11-05T16:08:08.717Z] [INFO] 
[2020-11-05T16:08:08.717Z] [INFO] --- maven-enforcer-plugin:3.0.0-M1:enforce (enforce-java) @ apm-process-plugin ---
[2020-11-05T16:08:08.717Z] [INFO] 
[2020-11-05T16:08:08.717Z] [INFO] --- jacoco-maven-plugin:0.8.5:prepare-agent (default) @ apm-process-plugin ---
[2020-11-05T16:08:08.717Z] [INFO] argLine set to -javaagent:/var/lib/jenkins/workspace/_java_apm-agent-java-mbp_PR-1467/src/github.com/elastic/apm-agent-java/.m2/org/jacoco/org.jacoco.agent/0.8.5/org.jacoco.agent-0.8.5-runtime.jar=destfile=/var/lib/jenkins/workspace/_java_apm-agent-java-mbp_PR-1467/src/github.com/elastic/apm-agent-java/apm-agent-plugins/apm-process-plugin/target/jacoco.exec
[2020-11-05T16:08:08.717Z] [INFO] 
[2020-11-05T16:08:08.717Z] [INFO] --- license-maven-plugin:1.19:update-file-header (first) @ apm-process-plugin ---
[2020-11-05T16:08:08.717Z] [WARNING] The failOnMissingHeader has no effect if the property dryRun is not set.
[2020-11-05T16:08:08.717Z] [WARNING] The failOnNotUptodateHeader has no effect if the property dryRun is not set.
[2020-11-05T16:08:08.717Z] [INFO] adding extra resolver file:////var/lib/jenkins/workspace/_java_apm-agent-java-mbp_PR-1467/src/github.com/elastic/apm-agent-java/apm-agent-plugins/apm-process-plugin/../../licenses
[2020-11-05T16:08:08.717Z] [INFO] Will search files to update from root /var/lib/jenkins/workspace/_java_apm-agent-java-mbp_PR-1467/src/github.com/elastic/apm-agent-java/apm-agent-plugins/apm-process-plugin/src/main/java
[2020-11-05T16:08:08.717Z] [INFO] Will search files to update from root /var/lib/jenkins/workspace/_java_apm-agent-java-mbp_PR-1467/src/github.com/elastic/apm-agent-java/apm-agent-plugins/apm-process-plugin/src/test/java
[2020-11-05T16:08:08.717Z] [INFO] Scan 8 files header done in 5.21ms.
[2020-11-05T16:08:08.717Z] [INFO] All files are up-to-date.
[2020-11-05T16:08:08.717Z] [INFO] 
[2020-11-05T16:08:08.717Z] [INFO] --- maven-resources-plugin:3.1.0:resources (default-resources) @ apm-process-plugin ---
[2020-11-05T16:08:08.717Z] [INFO] Using 'UTF-8' encoding to copy filtered resources.
[2020-11-05T16:08:08.717Z] [INFO] Copying 1 resource
[2020-11-05T16:08:08.717Z] [INFO] 
[2020-11-05T16:08:08.717Z] [INFO] --- maven-compiler-plugin:3.8.1:compile (default-compile) @ apm-process-plugin ---
[2020-11-05T16:08:08.746Z] script returned exit code 143
[2020-11-05T16:08:09.414Z] Post stage
[2020-11-05T16:08:09.430Z] Recording test results
[2020-11-05T16:08:09.436Z] Post stage
[2020-11-05T16:08:09.450Z] Recording test results
[2020-11-05T16:08:09.964Z] Failed in branch Javadoc
[2020-11-05T16:08:10.560Z] Running in /var/lib/jenkins/workspace/_java_apm-agent-java-mbp_PR-1467/src/github.com/elastic/apm-agent-java
[2020-11-05T16:08:10.573Z] [INFO] Codecov: Getting branch ref...
[2020-11-05T16:08:10.613Z] Masking supported pattern matches of $GITHUB_TOKEN
[2020-11-05T16:08:10.696Z] [INFO] Codecov: Sending data...
[2020-11-05T16:08:10.972Z] Running in /var/lib/jenkins/workspace/_java_apm-agent-java-mbp_PR-1467/src/github.com/elastic/apm-agent-java
[2020-11-05T16:08:10.984Z] [INFO] Codecov: Getting branch ref...
[2020-11-05T16:08:11.017Z] Masking supported pattern matches of $GITHUB_TOKEN
[2020-11-05T16:08:11.073Z] + curl -sSLo codecov.sh https://codecov.io/bash
[2020-11-05T16:08:11.139Z] [INFO] Codecov: Sending data...
[2020-11-05T16:08:11.383Z] + bash codecov.sh
[2020-11-05T16:08:11.383Z] 
[2020-11-05T16:08:11.383Z]   _____          _
[2020-11-05T16:08:11.383Z]  / ____|        | |
[2020-11-05T16:08:11.383Z] | |     ___   __| | ___  ___ _____   __
[2020-11-05T16:08:11.383Z] | |    / _ \ / _` |/ _ \/ __/ _ \ \ / /
[2020-11-05T16:08:11.383Z] | |___| (_) | (_| |  __/ (_| (_) \ V /
[2020-11-05T16:08:11.383Z]  \_____\___/ \__,_|\___|\___\___/ \_/
[2020-11-05T16:08:11.383Z]                               Bash-20201104-3e6f2fc
[2020-11-05T16:08:11.383Z] 
[2020-11-05T16:08:11.383Z] 
[2020-11-05T16:08:11.383Z] ==> Jenkins CI detected.
[2020-11-05T16:08:11.384Z]     project root: .
[2020-11-05T16:08:11.384Z] --> token set from env
[2020-11-05T16:08:11.384Z]     Yaml not found, that's ok! Learn more at http://docs.codecov.io/docs/codecov-yaml
[2020-11-05T16:08:11.384Z] ==> Running gcov in . (disable via -X gcov)
[2020-11-05T16:08:11.384Z] ==> Python coveragepy not found
[2020-11-05T16:08:11.384Z] ==> Searching for coverage reports in:
[2020-11-05T16:08:11.384Z]     + .
[2020-11-05T16:08:11.553Z] + curl -sSLo codecov.sh https://codecov.io/bash
[2020-11-05T16:08:11.899Z] + bash codecov.sh
[2020-11-05T16:08:11.899Z] 
[2020-11-05T16:08:11.899Z]   _____          _
[2020-11-05T16:08:11.899Z]  / ____|        | |
[2020-11-05T16:08:11.899Z] | |     ___   __| | ___  ___ _____   __
[2020-11-05T16:08:11.899Z] | |    / _ \ / _` |/ _ \/ __/ _ \ \ / /
[2020-11-05T16:08:11.899Z] | |___| (_) | (_| |  __/ (_| (_) \ V /
[2020-11-05T16:08:11.899Z]  \_____\___/ \__,_|\___|\___\___/ \_/
[2020-11-05T16:08:11.899Z]                               Bash-20201104-3e6f2fc
[2020-11-05T16:08:11.899Z] 
[2020-11-05T16:08:11.899Z] 
[2020-11-05T16:08:11.899Z] ==> Jenkins CI detected.
[2020-11-05T16:08:11.899Z]     project root: .
[2020-11-05T16:08:11.899Z] --> token set from env
[2020-11-05T16:08:11.899Z]     Yaml not found, that's ok! Learn more at http://docs.codecov.io/docs/codecov-yaml
[2020-11-05T16:08:11.900Z] ==> Running gcov in . (disable via -X gcov)
[2020-11-05T16:08:11.953Z] --> No coverage report found.
[2020-11-05T16:08:11.953Z]     Please visit http://docs.codecov.io/docs/supported-languages
[2020-11-05T16:08:12.081Z] Failed in branch Smoke Tests 01
[2020-11-05T16:08:12.158Z] ==> Python coveragepy not found
[2020-11-05T16:08:12.159Z] ==> Searching for coverage reports in:
[2020-11-05T16:08:12.159Z]     + .
[2020-11-05T16:08:12.728Z] --> No coverage report found.
[2020-11-05T16:08:12.728Z]     Please visit http://docs.codecov.io/docs/supported-languages
[2020-11-05T16:08:12.940Z] Failed in branch Smoke Tests 02
[2020-11-05T16:08:13.006Z] Stage "Integration Tests" skipped due to earlier failure(s)
[2020-11-05T16:08:13.035Z] Stage "AfterRelease" skipped due to earlier failure(s)
[2020-11-05T16:08:13.047Z] Stage "AfterRelease" skipped due to earlier failure(s)
[2020-11-05T16:08:13.285Z] Running on Jenkins in /var/lib/jenkins/workspace/_java_apm-agent-java-mbp_PR-1467
[2020-11-05T16:08:13.319Z] [INFO] getVaultSecret: Getting secrets
[2020-11-05T16:08:13.430Z] Masking supported pattern matches of $VAULT_ADDR or $VAULT_ROLE_ID or $VAULT_SECRET_ID
[2020-11-05T16:08:14.081Z] + chmod 755 generate-build-data.sh
[2020-11-05T16:08:14.081Z] + ./generate-build-data.sh https://apm-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/apm-agent-java/apm-agent-java-mbp/PR-1467/ https://apm-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/apm-agent-java/apm-agent-java-mbp/PR-1467/runs/14 FAILURE 1312303
[2020-11-05T16:08:14.632Z] INFO: curl https://apm-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/apm-agent-java/apm-agent-java-mbp/PR-1467/runs/14/steps/?limit=10000 -o steps-info.json
[2020-11-05T16:08:15.182Z] INFO: curl https://apm-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/apm-agent-java/apm-agent-java-mbp/PR-1467/runs/14/tests/?status=FAILED -o tests-errors.json

Contributor

@mdelapenya mdelapenya left a comment

Just a few typos, but LGTM!

.ci/load/README.md Outdated Show resolved Hide resolved
.ci/load/README.md Outdated Show resolved Hide resolved
.ci/load/README.md Outdated Show resolved Hide resolved
Co-authored-by: Manuel de la Peña <[email protected]>
Member

@felixbarny felixbarny left a comment

Looking great! Let's try to merge this in soon so we can give it a try and iterate from there.

.ci/load/scripts/param_gen/gen_params.py Outdated Show resolved Hide resolved
.ci/load/scripts/locustfile.py Outdated Show resolved Hide resolved
.ci/load/README.md Show resolved Hide resolved
.ci/load/Jenkinsfile Outdated Show resolved Hide resolved
.ci/load/Jenkinsfile Outdated Show resolved Hide resolved
.ci/load/Jenkinsfile Outdated Show resolved Hide resolved
.ci/load/Jenkinsfile Show resolved Hide resolved
.ci/load/Jenkinsfile Outdated Show resolved Hide resolved
.ci/load/Jenkinsfile Outdated Show resolved Hide resolved
.ci/load/Jenkinsfile Outdated Show resolved Hide resolved
.ci/load/Jenkinsfile Outdated Show resolved Hide resolved
cachedout and others added 4 commits November 2, 2020 17:49
Co-authored-by: Victor Martinez <[email protected]>
Co-authored-by: Victor Martinez <[email protected]>
Co-authored-by: Victor Martinez <[email protected]>
Co-authored-by: Victor Martinez <[email protected]>
@v1v
Member

v1v commented Nov 2, 2020

Is JJBB required?

@cachedout
Contributor Author

Is JJBB required?

Yes, it is. I was going to do a follow-up PR with those changes.

@felixbarny
Member

To add to the wishlist for a future iteration, it would be great to be able to compare two runs with different agent configurations and otherwise exact same settings. The most common use case would be to determine the overhead of the agent or a particular agent feature.
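
As a rough sketch of what such a comparison could compute, assuming each run is summarized as a dictionary of numeric metrics (the metric names and the function are hypothetical, not part of this PR):

```python
def relative_overhead(baseline, candidate):
    """Return the percentage change of each shared metric from baseline to candidate."""
    overhead = {}
    # Only compare metrics that both runs actually reported.
    for name in baseline.keys() & candidate.keys():
        if baseline[name] == 0:
            continue  # percentage change is undefined against a zero baseline
        overhead[name] = (candidate[name] - baseline[name]) / baseline[name] * 100.0
    return overhead
```

A positive value means the metric grew in the candidate run. Whether that is good or bad depends on the metric: latency going up is a cost, throughput going up is not.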

@cachedout
Contributor Author

To add to the wishlist for a future iteration, it would be great to be able to compare two runs with different agent configurations and otherwise exact same settings.

@felixbarny Absolutely! I'd like to do this along with the (already requested) ability to do multiple runs for each scenario and then have it output an average of each set of runs for each scenario. (Of course it would still output individual runs as well.)
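
A minimal sketch of the averaging step, assuming each run is recorded as a (scenario, metrics) pair; the scenario names, metric names, and numbers below are made up for illustration and are not results from this pipeline:

```python
from collections import defaultdict
from statistics import mean


def average_by_scenario(runs):
    """Group runs by scenario name and average each metric across the group.

    `runs` is a list of (scenario, metrics) pairs, where metrics maps a
    metric name to a number. The individual runs are left untouched, so
    they can still be reported alongside the averages.
    """
    grouped = defaultdict(list)
    for scenario, metrics in runs:
        grouped[scenario].append(metrics)

    averages = {}
    for scenario, metric_list in grouped.items():
        # Assumes every run of a scenario reports the same metric names.
        names = metric_list[0].keys()
        averages[scenario] = {
            name: mean(m[name] for m in metric_list) for name in names
        }
    return averages


# Made-up example data: two runs of one scenario, one run of another.
runs = [
    ("baseline", {"p95_ms": 20.0, "rps": 1000.0}),
    ("baseline", {"p95_ms": 22.0, "rps": 980.0}),
    ("with-agent", {"p95_ms": 24.0, "rps": 950.0}),
]
print(average_by_scenario(runs))
```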

.ci/load/Jenkinsfile Outdated Show resolved Hide resolved
.ci/load/Jenkinsfile Outdated Show resolved Hide resolved
.ci/load/Jenkinsfile Outdated Show resolved Hide resolved
@cachedout
Contributor Author

@felixbarny I think all the review feedback has been addressed at this point so if you'd like to give this a final review and 👍 we should be able to get this in and start testing.

.ci/load/Jenkinsfile Outdated Show resolved Hide resolved
.ci/load/Jenkinsfile Outdated Show resolved Hide resolved
}
}
stage('Test application') {
agent { label 'metal' }
Member

are the workers guaranteed to be the same run-to-run? how many cpu cores do they have?

Contributor Author

@cachedout cachedout Nov 5, 2020

This is an important discussion point. Thanks for bringing it up!

The tl;dr here is that they should be the same but an absolute guarantee is challenging. For example, the ES Performance Testing team has a fleet of machines but has discovered over time that there are going to be variances even when they try hard to avoid them. SSDs in arrays fail, machines are tagged the same way by the provider but aren't physically identical, etc.

So, where does that leave us? I think the best thing to do here is to be cautious about comparing results between runs, while we continue to enhance this pipeline so that it can run a single test scenario multiple times on what we can guarantee to be the same machine(s), and maybe add some sort of comparative logic as well. (So, run scenario A and then scenario B on the same machine and output the results.)

I'm also going to file an issue in the infra repo to try to get an audit underway so we can know a bit more about what divergence we do have currently. As mentioned earlier, it shouldn't be a lot but there may be some. Additionally, we'll investigate the possibility of creating some dedicated groups of machines which we can try to ensure are as similar as we can make them instead of just assuming that they're similar, which is essentially the strategy that's in place right now.

Contributor Author

Some additional data here:

We have four workers right now, all of which vary in subtle ways. I propose that we make the following changes:

  1. Pin the stage that runs the application to the same worker every time. We can do this by using the benchmark label, which currently is only assigned to one machine. That machine has the following specs:
     • Ubuntu 18.04
     • 6 CPUs
     • 64 GB
  2. We keep the load-generation stage marked as metal, which will allow it to float between the other bare-metal machines, which have slightly varying specs. However, there's not much reason to believe that they vary enough to modify the behavior of the load-generation script, which doesn't consume a great deal of resources at present.

  3. We decide on a plan to order some additional machines which we can put into the benchmark pool, which will ensure better consistency going forward.

(I will link backward from a ticket which provides more info, so we don't link from a public repo into a private one.)

Let me know how this sounds @felixbarny and @v1v and if you give a 👍 I will make the necessary change.

Member

👍

.ci/load/Jenkinsfile Show resolved Hide resolved
@cachedout cachedout merged commit b0f3854 into elastic:master Nov 5, 2020
6 participants