
Default APM Configurations for monitoring of kibana clusters #117492

Closed
lizozom opened this issue Nov 4, 2021 · 40 comments
Labels
discuss needs-team Issues missing a team label performance

Comments

@lizozom
Contributor

lizozom commented Nov 4, 2021

We are in the process of enabling APM monitoring on some of our own internal clusters (such as telemetry, gh-stats, ci-builds, etc.).

To clarify, enabling APM in this context means that data collected by apm-rum and apm-node is sent out to another Kibana cluster ("the monitoring cluster") where we can review the actual performance of each monitored cluster.

We want to have a well-defined, clear, and uniform set of default configurations for these clusters.

APM Agent

Post 7.16
Once #117749 is merged, we can use a minimal config and only define the unique default configs:

Full configuration

elastic.apm.active: true
// Report APM info to a central location (long term: into regional instances)
elastic.apm.serverUrl: <Regional APM cluster> OR "https://kibana-cloud-apm.apm.us-east-1.aws.found.io" 
elastic.apm.secretToken: <Regional secret> OR "JpBCcOQxN81D5yucs2"

// Labeling and discoverability
elastic.apm.globalLabels.deploymentId: <ESS Cluster UUID>
elastic.apm.globalLabels.deploymentName: <ESS Cluster Name>

// Performance fine tuning
// Avoid pinging kibana server for new configs. Get the configs from cloud user_settings_json instead.
elastic.apm.centralConfig: false 
// Send spans for only 10% of transactions
elastic.apm.transactionSampleRate: 0.1 
// Disable metrics breakdown due to performance constraints (Node: HTTP/Elasticsearch; RUM: various phases of page load and other transactions)
elastic.apm.breakdownMetrics: false
// Disable span stack traces due to performance constraints (RUM only)
elastic.apm.captureSpanStackTraces: false
// Propagate tracestate header to apm server. Requires CORS setup on APM server.
elastic.apm.propagateTracestate: true
// Optimize interval of metric reporting
elastic.apm.metricsInterval: '120s'

// Sanitize data (e.g. redact auth headers)
elastic.apm.sanitizeFieldNames: 'password,passwd,pwd,secret,*key,*token*,*session*,*credit*,*card*,*auth*,set-cookie,pw,pass,connect.sid'
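For illustration, here is a sketch (my own, not the agent's actual implementation) of how a sanitizeFieldNames list like the one above is matched: the agents document matching as case-insensitive, with `*` acting as a wildcard.

```typescript
// Sketch only — not the Elastic APM agent's real code. Illustrates the
// documented sanitizeFieldNames semantics: case-insensitive matching
// where `*` is a wildcard.
function matchesSanitizePattern(field: string, patterns: string[]): boolean {
  return patterns.some((pattern) => {
    // Escape regex metacharacters in the literal parts, then join the
    // pieces around `*` with `.*`.
    const literalParts = pattern
      .split('*')
      .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'));
    const re = new RegExp('^' + literalParts.join('.*') + '$', 'i');
    return re.test(field);
  });
}

// A field matching any pattern has its value redacted, e.g. the
// `Authorization` header matches `*auth*`.
```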

Once we make sure the defaults in Kibana are good, we will be able to use a more minimal configuration:

elastic.apm.active: true

elastic.apm.serverUrl: <Regional APM cluster>
elastic.apm.secretToken: <Regional secret>
elastic.apm.globalLabels.deploymentId: <ESS Cluster UUID>
elastic.apm.globalLabels.deploymentName: <ESS Cluster Name>

APM Server

Following a chat with @felixbarny, we should set keep_unsampled: false on the APM servers so we don't save unsampled transactions.

In 8.0 this will be the default behavior and APM agents will stop sending unsampled transactions, helping us better control the volume of ingested data.
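For reference, a hypothetical apm-server.yml fragment — only keep_unsampled was named above; the full setting path is my assumption based on the APM Server docs:

```yaml
# Hypothetical sketch — full setting path assumed from APM Server 7.x docs.
apm-server:
  sampling:
    # Drop unsampled transactions instead of indexing them
    # (becomes the default behavior in 8.0).
    keep_unsampled: false
```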

Read More

@botelastic botelastic bot added the needs-team Issues missing a team label label Nov 4, 2021
@lizozom
Contributor Author

lizozom commented Nov 4, 2021

@tylersmalley once you get that subdomain for our server url, mind updating it here? https://github.com/elastic/infra/pull/32868
@dgieselaar can you approve that these settings are correct and sufficient?

@joshdover
Contributor

I believe we need to have captureSpanStackTraces: false unless the performance and overhead of this has recently improved?

@lizozom
Contributor Author

lizozom commented Nov 4, 2021

@joshdover what does captureSpanStackTraces change that affects performance?

@joshdover
Contributor

what does captureSpanStackTraces change that affects performance?

@trentm can probably comment in more detail, but in performance testing we did earlier this year, we saw a pretty dramatic increase in overhead when this is enabled. I believe it creates stack traces for every span and sub-span which has memory, network, and CPU overhead.

@lizozom lizozom added discuss Team:APM All issues that need APM UI Team support labels Nov 4, 2021
@elasticmachine
Contributor

Pinging @elastic/apm-ui (Team:apm)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Nov 4, 2021
@vigneshshanmugam
Member

@lizozom Approving from the RUM agent side. Looks good in terms of perf and data concerns.

@lizozom
Contributor Author

lizozom commented Nov 4, 2021

@vigneshshanmugam can you help me understand the parts marked with ??? - I don't fully understand the reasoning behind:

  • elastic.apm.breakdownMetrics: false
  • elastic.apm.captureSpanStackTraces: false
  • elastic.apm.metricsInterval: '120s'

(read the docs, still not sure)

Also, what is the intent behind the environment option? Should we use it to store the name of each cluster (and have a dropdown with potentially thousands of items)? What would be your recommendation for it? (Not sure who's the mastermind behind the APM UI.)

@mshustov
Contributor

mshustov commented Nov 4, 2021

elastic.apm.breakdownMetrics: false

#78792 (comment)

elastic.apm.captureSpanStackTraces: false

#78792 (comment)

elastic.apm.metricsInterval: '120s'

It reduces the network communication overhead at the cost of higher memory consumption
#78792 (comment)

Are you going to use propagateTracestate: true as discussed today in Slack? According to https://www.elastic.co/guide/en/apm/agent/rum-js/current/configuration.html#propagate-tracestate it propagates the sampling rate value to the server: https://www.elastic.co/guide/en/apm/agent/rum-js/current/distributed-tracing-guide.html#enable-tracestate

@lizozom
Contributor Author

lizozom commented Nov 4, 2021

Thanks @mshustov I will add propagateTracestate to the config.

However, I would still love a real clarification of what each thing does. For example, breakdownMetrics is explained as: "Used to disable reporting of breakdown metrics, which records self time spent in each unique type of span." But what is the "self time"? What can those metrics be used for?

I think the existing docs require a large amount of context that I don't have ATM.

@mshustov
Contributor

mshustov commented Nov 4, 2021

For example breakdownMetrics is explained as... But what is the "self time"? What can those metrics be used for?

Yes, I had the same problem, so it took some time to dig through the APM repo, but I found this doc describing this specific setting:
https://docs.google.com/document/d/1-_LuC9zhmva0VvLgtI0KcHuLzNztPHbcM0ZdlcPUl64/edit#heading=h.2k0fyzig00jf

@trentm
Member

trentm commented Nov 4, 2021

// Report 10% of transactions
elastic.apm.transactionSampleRate: 0.1

This is a value that we can tune up and down between 0.0 and 1.0 based on a balance between lower measured load (on the traced Kibana and on the receiving kibana-cloud-apm.elastic.dev infra) and having more complete trace data.
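As a rough illustration (my sketch, not the agents' actual code), head-based sampling is essentially a per-transaction coin flip against this rate:

```typescript
// Sketch only: head-based transaction sampling. Each transaction is kept
// with probability `rate`; the `rand` parameter exists so the behavior
// can be tested deterministically.
function isSampled(rate: number, rand: () => number = Math.random): boolean {
  return rand() < rate;
}

// With transactionSampleRate: 0.1, roughly 1 in 10 transactions keeps
// its spans; the rest contribute only lightweight transaction events
// (and in 8.0, as noted earlier in the thread, are not sent at all).
```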

// ??
elastic.apm.breakdownMetrics: false

@mshustov Found the original design document.
The breakdownMetrics: false setting tells both APM agents to stop collecting/calculating breakdown metrics.
For the Node.js agent that means not getting the "Time spent by span type" chart in APM UI, e.g.:

Screen Shot 2021-11-04 at 9 28 54 AM

When we were doing perf work on the Node.js agent at the start of 2021, breakdown metrics calculation was ~2.5% of CPU time in a simulated app that did nothing but tracing, i.e. approaching worst case.

I don't know the RUM agent that well, but its https://www.elastic.co/guide/en/apm/agent/rum-js/current/breakdown-metrics-docs.html document suggests it might be more valuable data for front-end analysis. As @mshustov pointed out above, @vigneshshanmugam commented earlier that setting this false "certainly helps a lot in the RUM agent for custom transactions vs page-load transactions."

// ??
elastic.apm.captureSpanStackTraces: false

This setting is only relevant for the Node.js APM agent. When the initial perf work was being done, this setting was the largest contributor to APM perf impact. Since then, Node.js APM agent version 3.16.0 fixed a few perf issues that were likely the major contributors to this impact.

However, since then, we have also just stuck with elastic.apm.captureSpanStackTraces: false. Capturing a stack trace for every span has some unavoidable perf overhead, so keeping this false as a starting point is reasonable. The feature loss is the "Stack Trace" section of the "Span details" flyout when inspecting trace data in the APM UI:

Screen Shot 2021-11-04 at 9 50 58 AM

// ??
elastic.apm.propagateTracestate: true

This is relevant only to the RUM agent. When support for the 'tracestate' header was added to our agents, the RUM agent could not just start sending it, because doing so could break some customer usage due to CORS. In our case, we would need to make sure that https://kibana-cloud-apm.elastic.dev is configured to accept CORS requests, including the tracestate header, from any Kibana frontends using the RUM agent. https://www.elastic.co/guide/en/apm/agent/rum-js/current/distributed-tracing-guide.html#enable-tracestate discusses this. I don't have experience configuring this myself.
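To make the CORS requirement concrete, here is a hypothetical apm-server.yml fragment (setting names taken from the APM Server RUM docs; the origin value is purely illustrative):

```yaml
# Hypothetical sketch — not a vetted production config.
apm-server:
  rum:
    enabled: true
    # Origins allowed to send RUM events (illustrative value).
    allow_origins: ["https://my-kibana.example.com"]
    # tracestate must be an allowed header for propagateTracestate to
    # work cross-origin.
    allow_headers: ["tracestate"]
```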

// ??
elastic.apm.metricsInterval: '120s'

This is relevant only to the Node.js agent. It tells the agent to only gather, serialize and send metrics to APM server every 120s, rather than the default of 30s. I don't believe we have particular data that shows whether and how much perf impact this has. The majority of metrics perf impact was in the breakdown metrics calculation, and we have separately disabled that.

// Labeling and discoverability
elastic.apm.environment:

@joshdover expressed a preference for leaving this out. The APM configuration code in Kibana will set this to one of "development", "production" or "ci". There had been recent discussion on the related Telemetry issue about setting environment to the deployment name, because the APM UI has a menulist to switch between environments.

@vigneshshanmugam
Member

vigneshshanmugam commented Nov 4, 2021

@trentm Has covered most of the RUM stuff as well.

elastic.apm.breakdownMetrics: false

Breakdown metrics calculation depends entirely on the number of different types of spans a given transaction contains when it comes to the RUM agent. Kibana being heavy on the number of network requests, it's good to keep it simple and not do that. Page-load transactions are a different category, as we don't rely on the number of spans; instead we use the metrics from the browser's Navigation Timing API.

We do heavy CPU and memory benchmarks on the RUM agent; if we ever need to check any RUM-related stuff, we can always use this: https://github.com/elastic/apm-agent-rum-js/blob/master/packages/rum/test/benchmarks/BENCHMARKS.md

@mshustov
Contributor

mshustov commented Nov 5, 2021

is configured to accept CORS requests including the tracestate header from any Kibana frontends using the RUM agent.

As a side note: Kibana allows users to configure CORS.

server.cors.enabled:..
server.cors.allowCredentials:...
server.cors.allowOrigin:...

See https://www.elastic.co/guide/en/kibana/current/settings.html for reference.

Note that CORS is disabled by default.

@dgieselaar
Member

@alex-fedotyev care to comment about the configuration proposed here?

@tylersmalley
Contributor

tylersmalley commented Nov 5, 2021

I am updating the APM server in #117749 and checking that our settings match, which it looks like they do. I think we should avoid setting things that are the same as the default. With that PR, we should be able to just set:

elastic.apm.active: true
elastic.apm.globalLabels.deploymentId: <ESS Cluster UUID>
elastic.apm.globalLabels.deploymentName: <ESS Cluster Name>
elastic.apm.serverUrl: "https://kibana-cloud-apm-server.elastic.dev" (awaiting DNS update)
elastic.apm.secretToken: "JpBCcOQxN81D5yucs2"

With 7.16+ we will be able to drop serverUrl and secretToken since they're defaults.

I was going to add propagateTracestate to the default, but it made me realize we're treating both the Node and RUM agent config as one. Would it be safe to merge the types and pass the same configuration to both, or do we need additional per-agent configuration options?

@tylersmalley
Contributor

A clarification: https://kibana-cloud-apm.elastic.dev/ is the Kibana instance for us to access the data, while https://kibana-cloud-apm-server.elastic.dev is the underlying APM server we will configure to collect the data.

@lizozom lizozom changed the title APM Configurations for monitoring of kibana clusters Default APM Configurations for monitoring of kibana clusters Nov 7, 2021
@lizozom
Contributor Author

lizozom commented Nov 7, 2021

@tylersmalley I updated the issue with the smaller block for 7.16+. I saw that your PR includes the serverUrl and secretToken so I omitted those too.
Are you also the owners of the APM server we're reporting to? Could you please enable CORS there as @mshustov suggested?

@mshustov @trentm thanks for the explanations! Let me know whether you think the configs above are satisfactory.
Do you think we could run the profiling scripts again and benchmark current performance? (Not a blocker for this, but it would be good to run in parallel, so we have the data once we get to the larger, more sensitive clusters.)

@tylersmalley
Contributor

@lizozom, yeah we currently manage that cluster for now.

Regarding CORS, isn't that supposed to be set on the instance doing the collection? So, propagateTracestate would need to be specifically set on a cluster, but only if CORS is enabled? I could be mistaken here. Maybe @trentm can weigh in.

@tylersmalley
Contributor

tylersmalley commented Nov 8, 2021

@lizozom, I also checked the pre-7.16 (e.g. 7.11, 7.14, 7.15) configs, and it looks like we only need to set serverUrl, secretToken, transactionSampleRate, and captureSpanStackTraces.

Pre 7.16:

elastic.apm.active: true
// Report apm info to central location
elastic.apm.serverUrl: "https://kibana-cloud-apm.apm.us-east-1.aws.found.io" 
elastic.apm.secretToken: "JpBCcOQxN81D5yucs2"

// Labeling and discoverability
elastic.apm.globalLabels.deploymentId: <ESS Cluster UUID>
elastic.apm.globalLabels.deploymentName: <ESS Cluster Name>

// Performance fine tuning
elastic.apm.transactionSampleRate: 0.1
elastic.apm.captureSpanStackTraces: false

@mshustov
Contributor

mshustov commented Nov 8, 2021

I was going to add propagateTracestate to the default, but it made me realize we're treating both the Node and RUM agent config as one. Would it be safe to merge the types and pass the same configuration to both, or do we need additional pre-agent configuration options?

As of now, Kibana supports a merged configuration only. We can revisit this decision later; I'm going to create an issue to discuss.

Regarding CORS, isn't that supposed to be set on the instance doing the collection?

From these docs https://www.elastic.co/guide/en/apm/agent/rum-js/current/distributed-tracing-guide.html#enable-cors it seems that if the Kibana browser app is served from the same domain where it sends ajax requests, no additional CORS setup is required.

@trentm
Member

trentm commented Nov 8, 2021

See also https://github.com/elastic/telemetry/issues/1319 regarding perhaps wanting to set captureHeaders: false and/or sanitizeFieldNames: ....

@sorenlouv
Member

I've added elastic.apm.sanitizeFieldNames to the "Pre 7.16" section per https://github.com/elastic/telemetry/issues/1319

@trentm
Member

trentm commented Nov 15, 2021

My main concern (if I read the code correctly) is confusion about whether and when the CENTRALIZED_*_CONFIG values will apply. I think changing serverUrl to a non-default value (e.g. one of the regional APM clusters) could have the side effect of skipping all of these:

const CENTRALIZED_SERVICE_BASE_CONFIG: AgentConfigOptions = {
serverUrl: 'https://kibana-ci-apm.apm.us-central1.gcp.cloud.es.io',
// The secretToken below is intended to be hardcoded in this file even though
// it makes it public. This is not a security/privacy issue. Normally we'd
// instead disable the need for a secretToken in the APM Server config where
// the data is transmitted to, but due to how it's being hosted, it's easier,
// for now, to simply leave it in.
secretToken: '7YKhoXsO4MzjhXjx2c',
centralConfig: false,
metricsInterval: '30s',
captureSpanStackTraces: false,
transactionSampleRate: 1.0,
breakdownMetrics: true,
};
const CENTRALIZED_SERVICE_DIST_CONFIG: AgentConfigOptions = {
metricsInterval: '120s',
captureBody: 'off',
captureHeaders: false,
breakdownMetrics: false,
};

I'm not sure of the best place to discuss it (perhaps on #117749?), but what about moving all the CENTRALIZED_*_CONFIG APM config values (except serverUrl and secretToken) to DEFAULT_CONFIG? Then there is less logic/magic going on in APM configuration. A developer using a non-dist build of Kibana would have to explicitly turn some things like breakdownMetrics: true back on if they wanted them.
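A rough sketch of that consolidation (hypothetical — this is not the actual kbn-apm-config-loader code, and the interface is trimmed to just the keys discussed here):

```typescript
// Hypothetical sketch of the proposed consolidation — not the actual
// kbn-apm-config-loader code. Perf-sensitive values move into
// DEFAULT_CONFIG so they apply in every build flavor; only routing
// stays environment-specific.
interface AgentConfigOptions {
  serverUrl?: string;
  secretToken?: string;
  centralConfig?: boolean;
  metricsInterval?: string;
  captureSpanStackTraces?: boolean;
  transactionSampleRate?: number;
  breakdownMetrics?: boolean;
  captureBody?: 'off' | 'all';
  captureHeaders?: boolean;
}

const DEFAULT_CONFIG: AgentConfigOptions = {
  centralConfig: false,
  metricsInterval: '120s',
  captureSpanStackTraces: false,
  transactionSampleRate: 0.1,
  breakdownMetrics: false,
  captureBody: 'off',
  captureHeaders: false,
};

// Only the routing values stay in the centralized-service config; a
// developer wanting breakdownMetrics back on a dev build would opt in
// explicitly instead of relying on build-flavor magic.
const CENTRALIZED_SERVICE_BASE_CONFIG: AgentConfigOptions = {
  serverUrl: 'https://kibana-ci-apm.apm.us-central1.gcp.cloud.es.io',
  secretToken: '7YKhoXsO4MzjhXjx2c',
};

const resolved: AgentConfigOptions = {
  ...DEFAULT_CONFIG,
  ...CENTRALIZED_SERVICE_BASE_CONFIG,
};
```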

I've added elastic.apm.sanitizeFieldNames to the "Pre 7.16" section

Until Kibana is updated to a future Node.js APM agent with a fix for elastic/apm-agent-nodejs#2427, the "Post 7.16" config should also set sanitizeFieldNames for the (possibly rare) case that captureHeaders is set to true.

@lizozom
Contributor Author

lizozom commented Nov 18, 2021

@trentm thanks for the clarification.
So if I understand correctly, we should stick to using the full configs, until the defaults are set to reliable values?

@trentm
Member

trentm commented Nov 18, 2021

we should stick to using the full configs, until the defaults are set to reliable values?

Yes, I think so.

@dannycroft dannycroft added [zube]: Inbox and removed Team:APM All issues that need APM UI Team support labels Nov 29, 2021
@botelastic botelastic bot added the needs-team Issues missing a team label label Nov 29, 2021
@dannycroft dannycroft removed [zube]: Inbox needs-team Issues missing a team label labels Nov 29, 2021
@botelastic botelastic bot added the needs-team Issues missing a team label label Nov 29, 2021
@lizozom
Contributor Author

lizozom commented Dec 1, 2021

@trentm @mshustov

  • I see that the APM team removed itself as an owner of this issue. I wonder who would be the right owner for these defaults, in your opinion?
  • I wonder if these settings are good just for Kibana, or can they be used for other products trying to integrate with APM for internal monitoring?

@rudolf
Contributor

rudolf commented Dec 1, 2021

As an experiment I tried to analyze why hitting Kibana's telemetry endpoint might cause performance degradation (a scenario we've seen happen in a few SDHs). APM traces give us good detail about why a certain API might take a long time (e.g. there are many serial requests to Elasticsearch), but it's not easy to see why the event loop is blocked and the entire server is slow.

After some experiments with Søren we got some promising data from span.self_time metrics, but this requires elastic.apm.breakdownMetrics: true.

We got a recommendation to disable this setting in #78792 (comment), but that was based on the assumption that it could improve the performance of the RUM agent, which we don't use. We also seem to think that the performance impact has been lowered since our initial testing: #78792 (comment)

Finding the reason for a slow event loop is probably the number one performance diagnosis problem we're struggling to solve at the moment.

So I think we should use elastic.apm.breakdownMetrics: true in our next performance benchmark and consider making it part of the default configuration.

@mshustov
Contributor

mshustov commented Dec 1, 2021

I see that the APM team removed itself as an owner of this issue. I wonder who would be the right owner for these defaults in your opinion?

If the APM team doesn't own it, maybe Kibana Observability project can take ownership then?

I wonder if these settings are good just for Kibana? Or can they be used for other products trying to integrate with APM for internal monitoring?

Could you provide an example of other products trying to integrate with APM?
I'm not sure they are suitable for every case. First of all, different language clients can have different settings, or enabling these settings can have a different impact on runtime.

So I think we should use elastic.apm.breakdownMetrics: true in our next performance benchmark and consider making it part of the default configuration.

Another option might be to give Kibana developers access to adjust these settings. It would require some additional work from the Infra team to set up access to the APM settings only, so I'm doubtful about the feasibility of this option.

@sorenlouv
Member

sorenlouv commented Dec 1, 2021

I see that the APM team removed itself as an owner of this issue. I wonder who would be the right owner for these defaults in your opinion?

It was my impression that this was owned by the Kibana Operations team, since they are codeowners of that part of the codebase and would have the most knowledge. The APM team (UI, server, and agents) is of course ready to help with anything, and if needed we can also make these changes.

@lizozom
Contributor Author

lizozom commented Dec 1, 2021

It was my impression that this was owned by the Kibana Operations team since they are codeowners

This is an interesting question. I think there is still a level of uncertainty around the performance implications of some of these configs. If we had up-to-date benchmarks, we could probably own this.

Could you provide an example of other products trying to integrate with APM?

Currently multiple teams are integrating with APM (see the issue in the dev repo); I'm wondering if we can have some guidelines.

Another option might be to give access to Kibana developers to adjust these settings.

This requires sudo perms on prod. Do you think we could maybe have one of the team/tech leads carry that responsibility? That wouldn't require any work, just authorization from @mfinkle and someone on the cloud side.

@sorenlouv
Member

This is an interesting question. I think there is still a level of uncertainty around the performance implications of some of these configs. If we had up-to-date benchmarks, we could probably own this.

Benchmarking is owned by the APM agent devs (this specifically falls in the lap of the Node.js agent devs).

@lizozom
Contributor Author

lizozom commented Dec 5, 2021

@mshustov can Kibana own these settings? If so, which team would it be? At the moment, Kibana-core owns the integration, right?

@mshustov
Contributor

mshustov commented Dec 7, 2021

@mshustov can Kibana own these settings? If so, which team would it be?

I'd suggest that the (virtual) team working on the Kibana Observability project owns the settings for a while. AFAIK the Core team isn't actively working on the project, so it might be faster to have a dedicated team test and adjust the config. Once the default config is finalized, we can transfer ownership to the Core team. Does that sound reasonable?

@sorenlouv
Member

sorenlouv commented Dec 7, 2021

If we don't have any obvious owners, APM UI can take this on. Is this just a matter of updating the default settings in
https://github.com/elastic/kibana/blob/d0f7d7c21d2e2af244f2166f8544480cd4062c0b/packages/kbn-apm-config-loader/src/config.ts to align with the values described in this issue?

@trentm
Member

trentm commented Dec 7, 2021

@sqren Also, I think we need to deal with the possible confusion with CENTRALIZED_*_CONFIG that I mentioned above in #117492 (comment).

I'm happy to help here on establishing a reasonable base config.

@sorenlouv
Member

Also, I think dealing with possible confusion with CENTRALIZED_*_CONFIG that I mentioned above in

Yes, I remember that comment and agree 100%. I was also having a hard time following which settings would apply in which environment. Having it less dynamic would help.

@lizozom
Contributor Author

lizozom commented Dec 16, 2021

@sqren the main part is not only updating the defaults in the loader, but also establishing the values that Cloud will use to prepopulate cloud settings.

I don't think there's much technical work to be done here; it's just about signing off on these values as good enough and sufficient.

@mshustov there's no dedicated virtual team for this yet. It's being worked on as part of each team's engineering efforts at the moment. Guess we can discuss this when we have the sync (2022).

@dmlemeshko
Member

I collected performance results for the recommended APM configuration in this issue's summary, and on average the latency increase is within 5%.
I did not find in the docs how to pass elastic.apm.propagateTracestate: true via an env variable, but if I'm right it is already set by default in the main branch.

@lizozom
Contributor Author

lizozom commented Apr 13, 2022

While we might end up changing the sampling rate, so far these defaults seem to serve us well.
Closing the issue until further notice.
