Agent load testing #1467

Merged 37 commits on Nov 5, 2020
Changes from 32 commits
cd1b20b
Skeleton of perf Jenkinsfile
cachedout Oct 21, 2020
d3e93d2
load generation wait script
cachedout Oct 22, 2020
0521d52
More orch
cachedout Oct 26, 2020
6c979b9
Working PoC
cachedout Oct 27, 2020
6d725d3
Script to generate parameters for the load generation pipeline
cachedout Oct 28, 2020
a0d2574
Add provisioning for agent
cachedout Oct 28, 2020
68f3251
Add licenses for load gen scripts
cachedout Oct 28, 2020
96f624d
Dynamic Java provisioning for load testing
cachedout Oct 28, 2020
51ab260
Script for dynamic Java provisioning for load testing
cachedout Oct 28, 2020
7c6f52f
Fix terrible bug with JDK selection and add more JDKs
cachedout Oct 29, 2020
8e7cf64
Remove TODO
cachedout Oct 29, 2020
7f4b66d
Formatting and lint
cachedout Oct 29, 2020
148cb63
Add agent_config param
cachedout Oct 30, 2020
90dd8f6
Add min JDK version and custom config option
cachedout Oct 30, 2020
288e566
Allow concurrent users setting
cachedout Oct 30, 2020
6898351
Disable num_of_runs in rev1
cachedout Oct 30, 2020
049d33a
Use production orchestrator
cachedout Oct 30, 2020
72a8039
More documentation and minor changes
cachedout Oct 30, 2020
64643e6
Add bare-metal settings
cachedout Oct 30, 2020
f080229
Prep for production
cachedout Nov 2, 2020
eed86f7
Add metal tag
cachedout Nov 2, 2020
7de0045
Add metrics collection
cachedout Nov 2, 2020
4d9aa1b
Update .ci/load/README.md
cachedout Nov 2, 2020
500f89b
Update .ci/load/README.md
cachedout Nov 2, 2020
61f9670
Update .ci/load/README.md
cachedout Nov 2, 2020
e51c8e4
Update .ci/load/Jenkinsfile
cachedout Nov 2, 2020
9c6bb6f
Update .ci/load/Jenkinsfile
cachedout Nov 2, 2020
0939974
Update .ci/load/Jenkinsfile
cachedout Nov 2, 2020
e7012da
Update .ci/load/Jenkinsfile
cachedout Nov 2, 2020
547e001
Fixup of review changes
cachedout Nov 3, 2020
090a259
Set interval to fixed time
cachedout Nov 3, 2020
8329f48
constant pacing for load gen
cachedout Nov 3, 2020
b7d53c2
Fix bad merge
cachedout Nov 3, 2020
cbe6351
Bring number of jdks below 250
cachedout Nov 5, 2020
7ce0452
Increase timeout
cachedout Nov 5, 2020
b76a251
Hide server URL
cachedout Nov 5, 2020
719b585
Switch app to benchmark label
cachedout Nov 5, 2020
170 changes: 170 additions & 0 deletions .ci/load/Jenkinsfile
@@ -0,0 +1,170 @@
// For documentation on this pipeline, please see the README.md in this directory
pipeline {
agent any
environment {
REPO = 'apm-agent-java'
APP = 'spring-petclinic'
APP_BASE_DIR = "src/${APP}"
METRICS_BASE_DIR = "metrics/"
AGENT_BASE_DIR = "agent/"
ORCH_URL = 'obs-load-orch.app.elastic.co:8000'
// Set below for local development
// ORCH_URL='10.0.2.2:8000'
DEBUG_MODE = '0' // set to '0' for production
// This is a placeholder server. This will change to a dummy APM Server when it is available.
APM_SERVER_URL = 'https://2a2bd0e2806a47e5996eeeec6d22e6df.apm.eu-west-3.aws.elastic-cloud.com:443'
LOCUST_RUN_TIME = "${params.duration}"
LOCUST_USERS = "${params.concurrent_requests}"
}
options {
// Must exceed the maximum test duration (280 minutes) plus provisioning time
timeout(time: 5, unit: 'HOURS')
buildDiscarder(logRotator(numToKeepStr: '20', artifactNumToKeepStr: '20', daysToKeepStr: '30'))
timestamps()
ansiColor('xterm')
durabilityHint('PERFORMANCE_OPTIMIZED')

}
parameters {
// The following snippet is auto-generated. To update it, run the script located in .ci/load/scripts/param_gen and copy in the output
choice(choices: ['1.18.1', '1.18.0', '1.18.0.RC1', '1.17.0', '1.16.0', '1.15.0', '1.14.0', '1.13.0', '1.12.0', '1.11.0', '1.10.0', '1.9.0', '1.8.0', '1.7.0', '1.6.1', '1.6.0', '1.5.0', '1.4.0', '1.3.0', '1.2.0', '1.1.0', '1.0.1', '1.0.0', '1.0.0.RC1', '0.7.1', '0.7.0', '0.6.2', '0.6.1', '0.6.0', '0.5.1', '0.1.2', '0.1.1'], name: "apm_version", description: "APM Java Agent version")
choice(choices: ['adoptopenjdk-11+28-linux', 'adoptopenjdk-11.0.1+13-linux', 'adoptopenjdk-11.0.1+13-linux-aarch64', 'adoptopenjdk-11.0.2+7-linux', 'adoptopenjdk-11.0.2+7-linux-aarch64', 'adoptopenjdk-11.0.2+9-linux', 'adoptopenjdk-11.0.2+9-linux-aarch64', 'adoptopenjdk-11.0.3+7-linux', 'adoptopenjdk-11.0.3+7-linux-aarch64', 'adoptopenjdk-11.0.4+11-linux', 'adoptopenjdk-11.0.4+11-linux-aarch64', 'adoptopenjdk-11.0.5+10-linux', 'adoptopenjdk-11.0.6+10-linux', 'adoptopenjdk-11.0.6+10-linux-aarch64', 'adoptopenjdk-11.0.7+10-linux', 'adoptopenjdk-11.0.7+10-linux-aarch64', 'adoptopenjdk-11.0.8+10-linux', 'adoptopenjdk-11.0.8+10-linux-aarch64', 'adoptopenjdk-11.0.9+11-linux', 'adoptopenjdk-11.0.9+11-linux-aarch64', 'adoptopenjdk-12+33-linux', 'adoptopenjdk-12.0.1+12-linux', 'adoptopenjdk-12.0.1+12-linux-aarch64', 'adoptopenjdk-12.0.2+10-linux', 'adoptopenjdk-12.0.2+10-linux-aarch64', 'adoptopenjdk-13.0.1+9-linux', 'adoptopenjdk-13.0.2+8-linux', 'adoptopenjdk-13.0.2+8-linux-aarch64', 'adoptopenjdk-14.0.1+7-linux', 'adoptopenjdk-14.0.1+7-linux-aarch64', 'adoptopenjdk-14.0.2+12-linux', 'adoptopenjdk-14.0.2+12-linux-aarch64', 'adoptopenjdk-15+36-linux', 'adoptopenjdk-15+36-linux-aarch64', 'adoptopenjdk-15.0.1+9-linux', 'adoptopenjdk-15.0.1+9-linux-aarch64', 'openjdk-11+11-linux', 'openjdk-11+12-linux', 'openjdk-11+13-linux', 'openjdk-11+14-linux', 'openjdk-11+15-linux', 'openjdk-11+16-linux', 'openjdk-11+17-linux', 'openjdk-11+18-linux', 'openjdk-11+19-linux', 'openjdk-11+20-linux', 'openjdk-11+21-linux', 'openjdk-11+22-linux', 'openjdk-11+23-linux', 'openjdk-11+24-linux', 'openjdk-11+25-linux', 'openjdk-11+26-linux', 'openjdk-11+27-linux', 'openjdk-11+28-linux', 'openjdk-11+5-linux', 'openjdk-11-linux', 'openjdk-11.0.1-linux', 'openjdk-11.0.2-linux', 'openjdk-12+23-linux', 'openjdk-12+24-linux', 'openjdk-12+25-linux', 'openjdk-12+27-linux', 'openjdk-12+28-linux', 'openjdk-12+29-linux', 'openjdk-12+30-linux', 'openjdk-12+31-linux', 'openjdk-12+32-linux', 
'openjdk-12+33-linux', 'openjdk-12-linux', 'openjdk-12.0.1-linux', 'openjdk-12.0.2-linux', 'openjdk-13+14-linux', 'openjdk-13+15-linux', 'openjdk-13+16-linux', 'openjdk-13+17-linux', 'openjdk-13+18-linux', 'openjdk-13+19-linux', 'openjdk-13+20-linux', 'openjdk-13+21-linux', 'openjdk-13+22-linux', 'openjdk-13+23-linux', 'openjdk-13+24-linux', 'openjdk-13+25-linux', 'openjdk-13+26-linux', 'openjdk-13+27-linux', 'openjdk-13+28-linux', 'openjdk-13+29-linux', 'openjdk-13+30-linux', 'openjdk-13+31-linux', 'openjdk-13+32-linux', 'openjdk-13-linux', 'openjdk-13.0.1-linux', 'openjdk-13.0.2-linux', 'openjdk-14+10-linux', 'openjdk-14+11-linux', 'openjdk-14+12-linux', 'openjdk-14+13-linux', 'openjdk-14+14-linux', 'openjdk-14+15-linux', 'openjdk-14+16-linux', 'openjdk-14+17-linux', 'openjdk-14+25-linux', 'openjdk-14+26-linux', 'openjdk-14+27-linux', 'openjdk-14+28-linux', 'openjdk-14+30-linux', 'openjdk-14+31-linux', 'openjdk-14+32-linux', 'openjdk-14+33-linux', 'openjdk-14+34-linux', 'openjdk-14+9-linux', 'openjdk-14-linux', 'openjdk-14.0.1-linux', 'openjdk-14.0.2+12-linux', 'openjdk-14.0.2-linux', 'openjdk-15+10-linux', 'openjdk-15+11-linux', 'openjdk-15+12-linux', 'openjdk-15+13-linux', 'openjdk-15+14-linux', 'openjdk-15+15-linux', 'openjdk-15+16-linux', 'openjdk-15+17-linux', 'openjdk-15+18-linux', 'openjdk-15+19-linux', 'openjdk-15+20-linux', 'openjdk-15+21-linux', 'openjdk-15+22-linux', 'openjdk-15+23-linux', 'openjdk-15+24-linux', 'openjdk-15+25-linux', 'openjdk-15+26-linux', 'openjdk-15+27-linux', 'openjdk-15+28-linux', 'openjdk-15+29-linux', 'openjdk-15+30-linux', 'openjdk-15+31-linux', 'openjdk-15+32-linux', 'openjdk-15+33-linux', 'openjdk-15+34-linux', 'openjdk-15+36-linux', 'openjdk-15+4-linux', 'openjdk-15+5-linux', 'openjdk-15+6-linux', 'openjdk-15+7-linux', 'openjdk-15+8-linux', 'openjdk-15+9-linux', 'openjdk-15-linux', 'openjdk-15.0.1+9-linux', 'oracle-11+11-linux', 'oracle-11+12-linux', 'oracle-11+13-linux', 'oracle-11+14-linux', 'oracle-11+15-linux', 
'oracle-11+16-linux', 'oracle-11+17-linux', 'oracle-11+18-linux', 'oracle-11+19-linux', 'oracle-11+20-linux', 'oracle-11+21-linux', 'oracle-11+22-linux', 'oracle-11+23-linux', 'oracle-11+24-linux', 'oracle-11+25-linux', 'oracle-11+26-linux', 'oracle-11+27-linux', 'oracle-11+28-linux', 'oracle-11+5-linux', 'oracle-11.0.2+7-linux', 'oracle-11.0.2+9-linux', 'oracle-11.0.3+12-linux', 'oracle-11.0.4+10-linux', 'oracle-11.0.5+10-linux', 'oracle-11.0.6+8-linux', 'oracle-12+33-linux', 'oracle-12.0.1+12-linux', 'oracle-12.0.2+10-linux', 'oracle-13+33-linux', 'oracle-13.0.1+9-linux', 'oracle-13.0.2+8-linux', 'zulu-11.0.1-linux', 'zulu-11.0.2-linux', 'zulu-11.0.3-linux', 'zulu-11.0.4-linux', 'zulu-11.0.5-linux', 'zulu-11.0.6-linux', 'zulu-11.0.7-linux', 'zulu-11.0.8-linux', 'zulu-11.0.9-linux', 'zulu-12-linux', 'zulu-12.0.0-linux', 'zulu-12.0.1-linux', 'zulu-12.0.2-linux', 'zulu-13-linux', 'zulu-13.0.0-linux', 'zulu-13.0.1-linux', 'zulu-13.0.2-linux', 'zulu-13.0.3-linux', 'zulu-13.0.4-linux', 'zulu-13.0.5-linux', 'zulu-14-linux', 'zulu-14.0.0-linux', 'zulu-14.0.1-linux', 'zulu-14.0.2-linux', 'zulu-15.0.0-linux', 'zulu-15.0.1-linux'], name: "jvm_version", description: "JVM")
string(name: "concurrent_requests", defaultValue: "100", description: "The number of concurrent requests to test with")
string(name: "duration", defaultValue: "10m", description: "Test duration in minutes. Max: 280 minutes")
// num_of_runs currently unsupported
// string(name: "num_of_runs", defaultValue: "1", description: "Number of test runs to execute")
text(name: "agent_config", "defaultValue": "", description: "Custom APM Agent configuration. (WARNING: May echo to console. Do not supply sensitive data.)")
text(name: "locustfile", "defaultValue": "", description: "Locust load-generator plan")
booleanParam(name: "local_metrics", description: "Enable local metrics collection?", defaultValue: false)
// End script auto-generation
}

stages {
stage('Pre-flight'){
steps {
echo 'Getting authentication information from Vault'
withSecretVault(secret: 'secret/apm-team/ci/bandstand', user_var_name: 'APP_TOKEN_TYPE', pass_var_name: 'APP_TOKEN'){
// Run start.sh once and capture its output as the session token
setEnvVar('SESSION_TOKEN', sh(script: ".ci/load/scripts/start.sh", returnStdout: true).trim())
}
}
}
stage('Load test') {
parallel {
stage('Load generation') {
agent { label 'metal' }
steps {
withSecretVault(secret: 'secret/apm-team/ci/bandstand', user_var_name: 'APP_TOKEN_TYPE', pass_var_name: 'APP_TOKEN'){
echo 'Preparing load generation..'
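// Boolean.valueOf() is true only for the literal string "true"; test for non-empty text instead
whenTrue(params.locustfile?.trim() ? true : false) {
echo 'Using user-supplied plan for load-generation with Locust'
// writeFile avoids shell-quoting problems when the plan contains quotes or $
writeFile(file: '.ci/load/scripts/locustfile.py', text: params.locustfile)
}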
sh(script: ".ci/load/scripts/load_agent.sh")
}
}
}
stage('Test application') {
agent { label 'metal' }
Member

Are the workers guaranteed to be the same run-to-run? How many CPU cores do they have?

Contributor Author

@cachedout cachedout Nov 5, 2020

This is an important discussion point. Thanks for bringing it up!

The tl;dr here is that they should be the same, but an absolute guarantee is challenging. For example, the ES Performance Testing team has a fleet of machines but has discovered over time that there will be variances even when they try hard to avoid them. SSDs in arrays fail, machines are tagged the same way by the provider but aren't physically identical, etc.

So, where does that leave us? I think the best thing to do here is to be cautious about comparing results between runs, and to continue enhancing this pipeline so that we can run a single test scenario multiple times on what we can guarantee to be the same machine(s), and maybe even add some sort of comparative logic as well. (So, run scenario A and then scenario B on the same machine and output the results.)

I'm also going to file an issue in the infra repo to try to get an audit underway so we can know a bit more about what divergence we currently have. As mentioned earlier, it shouldn't be a lot, but there may be some. Additionally, we'll investigate the possibility of creating some dedicated groups of machines which we can try to ensure are as similar as we can make them, instead of just assuming that they're similar, which is essentially the strategy that's in place right now.
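
As a rough illustration of the "run scenario A and then scenario B on the same machine" idea, here is a minimal declarative-pipeline sketch. It assumes a `benchmark` label that resolves to a single machine, and the script name `run_scenario.sh` and the output file names are hypothetical; nothing like them exists in this PR.

```groovy
// Sketch only: run two scenarios back to back on one node so the results
// come from the same hardware. run_scenario.sh is a hypothetical script.
pipeline {
    agent none
    stages {
        stage('Compare scenarios') {
            // Nested sequential stages below reuse this agent and its workspace
            agent { label 'benchmark' }
            stages {
                stage('Scenario A') {
                    steps {
                        sh(script: './run_scenario.sh A > scenario_a.out')
                    }
                }
                stage('Scenario B') {
                    steps {
                        sh(script: './run_scenario.sh B > scenario_b.out')
                    }
                }
                stage('Collect results') {
                    steps {
                        // Archive both outputs so they can be compared side by side
                        archiveArtifacts(artifacts: 'scenario_*.out', allowEmptyArchive: true)
                    }
                }
            }
        }
    }
}
```

Because the parent stage owns the agent, both scenarios share one workspace and one machine. Any comparison logic would still have to be added on top of the archived files.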

Contributor Author

Some additional data here:

We have four workers right now, all of which vary in subtle ways. I propose that we make the following changes:

  1. Pin the stage that runs the application to the same worker every time. We can do this by using the benchmark label, which is currently assigned to only one machine. That machine has the following specs:
     • Ubuntu 18.04
     • 6 CPUs
     • 64 GB

  2. Keep the load-generation stage marked as metal, which will allow it to float between the other bare-metal machines, which have slightly varying specs. However, there's not much reason to believe that they vary enough to modify the behavior of the load-generation script, which doesn't consume a great deal of resources at present.

  3. Decide on a plan to order some additional machines which we can put into the benchmark pool, which will ensure better consistency going forward.

(I will link backward from a ticket which provides more info, so we don't link from a public repo into a private one.)

Let me know how this sounds, @felixbarny and @v1v. If you give a 👍, I will make the necessary change.
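
For reference, the labeling proposed above might look roughly like the following in the pipeline. This is a trimmed sketch rather than the final change: the stage names mirror the Jenkinsfile in this PR, the `echo` steps are placeholders for the real steps, and it assumes the `benchmark` label stays assigned to exactly one machine.

```groovy
// Sketch only: pin the application stage to the single 'benchmark' worker
// while the load generator floats across the other 'metal' workers.
pipeline {
    agent none
    stages {
        stage('Load test') {
            parallel {
                stage('Load generation') {
                    // May land on any bare-metal worker
                    agent { label 'metal' }
                    steps {
                        echo 'placeholder for the load-generation steps'
                    }
                }
                stage('Test application') {
                    // Always the same machine, as long as only one worker has this label
                    agent { label 'benchmark' }
                    steps {
                        echo 'placeholder for the application steps'
                    }
                }
            }
        }
    }
}
```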

Member

👍

stages{
stage('Provision Java') {
steps {
echo "Provisioning Java version: ${params.jvm_version}"
setEnvVar('JAVA_HOME', sh(script: ".ci/load/scripts/fetch_sdk.sh ${params.jvm_version}", returnStdout: true).trim())
setEnvVar('JAVACMD', "${env.JAVA_HOME}/bin/java")
setEnvVar('PATH', "${env.JAVA_HOME}/bin:$PATH")
}
}
stage ('Provision agent') {
steps {
echo 'Checking out master branch'
dir("${AGENT_BASE_DIR}") {
gitCheckout(
basedir: "apm-agent-java",
branch: 'master',
repo: "https://github.com/elastic/${REPO}.git",
credentialsId: 'f6c7695a-671e-4f4f-a331-acdce44ff9ba',
shallow: false
)
dir("apm-agent-java"){
echo 'Switching to requested version'
sh(script: "git checkout v${params.apm_version}")
echo 'Building agent'
sh(script: './mvnw clean install -DskipTests=true -Dmaven.javadoc.skip=true')
}
}
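// Boolean.valueOf() is true only for the literal string "true"; test for non-empty text instead
whenTrue(params.agent_config?.trim() ? true : false) {
echo 'Writing user-supplied agent configuration'
dir("${AGENT_BASE_DIR}") {
// writeFile avoids shell-quoting problems when the configuration contains quotes or $
writeFile(file: 'custom_config.cfg', text: params.agent_config)
}
}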
}
}
stage('Provision test application') {
steps {
echo 'Checking out test application'
gitCheckout(
basedir: "${APP_BASE_DIR}",
branch: 'main',
repo: "https://github.com/spring-projects/${APP}.git",
credentialsId: 'f6c7695a-671e-4f4f-a331-acdce44ff9ba',
shallow: false
)
}
}
stage('Provision local metrics collection') {
when {
expression {
return params.local_metrics
}
}
steps {
echo 'Enable local metric collection'
gitCheckout(
basedir: "${METRICS_BASE_DIR}",
branch: 'master',
repo: "https://github.com/pstadler/metrics.sh",
credentialsId: 'f6c7695a-671e-4f4f-a331-acdce44ff9ba',
shallow: false
)
sh(script: "touch metrics.out")
dir("${METRICS_BASE_DIR}"){
withEnv(["FILE_LOCATION=./metrics.out"]) {
sh(script: "./metrics.sh -r file &")
}
}
}
}
stage('Application load') {
steps {
echo 'Starting test application in background..'
dir("${APP_BASE_DIR}"){
// Launch app in background
withSecretVault(secret: 'secret/apm-team/ci/apm-load-test-server', user_var_name: 'APM_TOKEN_TYPE', pass_var_name: 'ELASTIC_APM_API_KEY'){
// Start with packaging things up
sh(script: "./mvnw package")
sh(script: "java -jar -javaagent:${WORKSPACE}/${AGENT_BASE_DIR}/apm-agent-java/elastic-apm-agent/target/elastic-apm-agent-${params.apm_version}.jar -Delastic.apm.server_urls=${env.APM_SERVER_URL} -Delastic.apm.secret_token=${env.ELASTIC_APM_API_KEY} -XX:+FlightRecorder -XX:StartFlightRecording=filename=flight.jfr ./target/spring-petclinic-*.jar &")
}
}
echo 'Starting bandstand client..'
// Foreground the orchestrator script for execution control
withSecretVault(secret: 'secret/apm-team/ci/bandstand', user_var_name: 'APP_TOKEN_TYPE', pass_var_name: 'APP_TOKEN'){
sh(script: ".ci/load/scripts/app.sh")
}
}
}
stage('Collecting results') {
steps {
echo "To view results, JMC is required. Get it here: https://jdk.java.net/jmc/"
archiveArtifacts(allowEmptyArchive: true,
artifacts: "${APP_BASE_DIR}/**/*.jfr,${METRICS_BASE_DIR}/**/*.out",
onlyIfSuccessful: false)
}
}
}
}
}
}
}
}