
Use APM metrics to introduce low-fi data layer for space reduction #104

Closed · roncohen opened this issue Jul 1, 2019 · 12 comments

roncohen commented Jul 1, 2019

We should use the transaction timing data from metrics to introduce two layers of data fidelity in the APM UI.

We'd have a low-fi layer and a hi-fi layer.

Motivation

Today, most graphs in the APM UI query transaction documents. This works because we're sending up all transactions, even unsampled ones.

As part of #78 we also started sending up transaction timing data as a metricset. Some of the data shown in the APM UI can be calculated using this new timing data instead of the transaction documents.

This would allow users to get rid of the transaction documents early, say after 7 days, but still be able to derive value from the APM UI beyond this timeframe. Setting a separate ILM policy for transactions is already supported through a bit of manual work.
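
As an illustration of that manual work, an ILM policy for the transaction indices could look roughly like the sketch below; the policy name, ages and rollover settings are invented, and the policy would need to be referenced from the transaction index template.

```ts
// A minimal sketch, not the official setup: delete transaction indices after
// 7 days while the metricset indices stay on a longer-lived policy.
// This is the body one would PUT to `_ilm/policy/apm-transactions-short-lived`
// (policy name and settings are illustrative).
const transactionRetentionPolicy = {
  policy: {
    phases: {
      hot: {
        actions: {
          // roll over so the delete phase operates on bounded indices
          rollover: { max_size: '50gb', max_age: '1d' },
        },
      },
      delete: {
        min_age: '7d',
        actions: { delete: {} },
      },
    },
  },
};
```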

User experience on low-fi data

The idea would be that the low-fi layer is calculated from the metrics data, while all the data that requires the (unsampled) transactions will be part of the hi-fi layer.

From the new metricset, we can show:


  • transactions per minute for each transaction group and across each type (a query sketch follows after this list). The current UI shows transactions per minute per result type (2xx, 3xx, etc.). Result type is not currently in the new timing data, but we could add it as another dimension.

[screenshot: transactions per minute chart]

  • transaction list, without percentiles

[screenshot: transaction list]
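
To make the low-fi layer concrete, the transactions-per-minute chart could be derived from the metricset with an aggregation along these lines. All field names here are assumptions about the timing metricset mapping, not the shipped schema.

```ts
// Rough sketch of a low-fi "transactions per minute" query per transaction group.
// Assumed field names: `metricset.name`, `transaction.type`, `transaction.name`,
// `transaction.duration.count` — check the actual mapping before relying on them.
const tpmFromMetricsRequest = {
  size: 0,
  query: {
    bool: {
      filter: [
        { term: { 'metricset.name': 'transaction' } },
        { term: { 'transaction.type': 'request' } },
        { range: { '@timestamp': { gte: 'now-24h' } } },
      ],
    },
  },
  aggs: {
    transaction_groups: {
      terms: { field: 'transaction.name', size: 100 },
      aggs: {
        timeseries: {
          date_histogram: { field: '@timestamp', fixed_interval: '1m' },
          aggs: {
            // one metricset document represents many transactions, so sum the count
            transaction_count: { sum: { field: 'transaction.duration.count' } },
          },
        },
      },
    },
  },
};
```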

We'd be unable to show the transaction distribution chart or any samples:

[screenshot: transaction duration distribution chart]

If agents eventually support histograms as a metric, we could encode the transaction duration as a histogram and show the transaction distribution even with only the low-fi data. This shouldn't be a blocker at the moment.

Querying

To make things simple, the APM UI could always use the new metrics data to draw the things it can. We'd then fire off separate queries for the "hi-fi" data (percentiles, distribution chart, actual transaction samples etc.). If the hi-fi data is available for the given time range, the percentile lines etc. show on the graphs. If not, we only show the low-fi data.
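
A rough sketch of that flow is below; the helper functions and response shapes are invented for illustration, the point is only that the hi-fi query is optional and additive.

```ts
interface Point {
  x: number;
  y: number | null;
}

interface ChartData {
  avgLine: Point[]; // always available, derived from the metricset (low-fi)
  percentileLines?: { p95: Point[]; p99: Point[] }; // only when transaction docs exist (hi-fi)
}

// Hypothetical helpers: one queries the metricset, the other the transaction documents.
declare function queryTransactionMetrics(range: { start: number; end: number }): Promise<{ avgLine: Point[] }>;
declare function queryTransactionDocuments(range: {
  start: number;
  end: number;
}): Promise<{ hasData: boolean; p95: Point[]; p99: Point[] }>;

async function loadTransactionDurationChart(range: { start: number; end: number }): Promise<ChartData> {
  // Fire both queries in parallel; the low-fi one is expected to cover the whole range.
  const [lowFi, hiFi] = await Promise.all([
    queryTransactionMetrics(range),
    queryTransactionDocuments(range),
  ]);

  const chart: ChartData = { avgLine: lowFi.avgLine };
  if (hiFi.hasData) {
    // Only overlay the percentile lines when hi-fi data exists for the range.
    chart.percentileLines = { p95: hiFi.p95, p99: hiFi.p99 };
  }
  return chart;
}
```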

That means that if you pick a time range that has both low-fi and hi-fi data for the full time range, you'll see exactly what you see today.

If you go back in time far enough, only low-fi data is available and you won't see percentiles, the distribution chart, etc.

If you select a time range where hi-fi data only covers part of the range, the percentile lines might start in the middle of a graph. For the distribution chart in particular this is a complication, because it's not as obvious that the visualization is partial as it is on the graphs. Users will be able to deduce that fact by looking at the other graphs on the same page.

We could try to detect that the data is partial and show a note. Detection could happen by comparing the number of transaction documents we have to the count we get from the metricsets. Probably not a blocker for the first version.
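
A minimal sketch of that detection; the function name and tolerance are invented.

```ts
// Flag the hi-fi data as partial when the number of transaction documents in the
// range is clearly lower than the transaction count derived from the metricsets.
function hiFiDataIsPartial(transactionDocCount: number, metricsetTransactionCount: number): boolean {
  if (metricsetTransactionCount === 0) {
    return false; // nothing to compare against
  }
  // allow some slack for in-flight data and interval boundaries
  return transactionDocCount < metricsetTransactionCount * 0.95;
}
```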

Transaction group list

The transaction group list presents a special problem here, as it would require us to merge the low-fi and hi-fi data in the list. I don't think the merge can be done in Elasticsearch.

Due to pagination etc., we'd need to ensure that the low-fi and hi-fi queries return data for the same transaction groups, and then merge it in Kibana. We could potentially do that by sorting both lists by avg. transaction time, calculated on the metricset and transaction data respectively, and then doing the merge in Kibana. I have more thoughts on this, but we should probably do a POC to investigate the feasibility.
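
One possible shape of that merge in Kibana, with invented types and field names; a POC would still need to validate whether sorting both queries the same way keeps the paginated pages aligned.

```ts
interface TransactionGroup {
  name: string;
  avgDurationUs: number;
  transactionsPerMinute: number;
  p95DurationUs?: number; // only available from the hi-fi (transaction) data
}

function mergeTransactionGroups(lowFi: TransactionGroup[], hiFi: TransactionGroup[]): TransactionGroup[] {
  const byName = new Map<string, TransactionGroup>();
  // Low-fi rows form the base: they cover the whole time range.
  for (const group of lowFi) {
    byName.set(group.name, { ...group });
  }
  // Hi-fi rows add percentiles (and any groups the metricset query missed).
  for (const group of hiFi) {
    const existing = byName.get(group.name);
    if (existing) {
      existing.p95DurationUs = group.p95DurationUs;
    } else {
      byName.set(group.name, { ...group });
    }
  }
  return [...byName.values()].sort((a, b) => b.avgDurationUs - a.avgDurationUs);
}
```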

Rollups

Introducing the low-fi layer as described above allows users to delete transaction data and still see low-fi data. I expect that will be a significant storage reduction for users that want to keep hi-fi data for, say one week, and low-fi data for 2 months. Some users will want to keep low-fi data for much longer. For those users, applying rollups to the low-fi data to decrease time granularity will allow them to further reduce storage costs. Supporting rollups isn't something we'd need to do in the first phase.

The rollup feature includes functionality to transparently rewrite queries to search regular documents and rolled-up data at the same time, so the queries for low-fi data should mostly just work for rolled-up data. There are some improvements to rollups coming which we should probably wait for before spending more time investigating: elastic/elasticsearch#42720
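
For illustration only, a rolled-up low-fi query could be issued through the rollup search endpoint roughly like this; the index name is invented, and only a subset of aggregations (e.g. date_histogram, terms, sum/avg/min/max) is supported on rolled-up data.

```ts
// Hedged sketch: the same kind of low-fi aggregation, sent to
// `POST /<rollup-index>/_rollup_search` once the metrics have been rolled up.
const rolledUpTransactionCountRequest = {
  method: 'POST',
  path: '/apm-metrics-rollup/_rollup_search', // illustrative index name
  body: {
    size: 0,
    aggs: {
      timeseries: {
        // the interval has to be compatible with the rollup job configuration
        date_histogram: { field: '@timestamp', fixed_interval: '1h' },
        aggs: {
          transaction_count: { sum: { field: 'transaction.duration.count' } },
        },
      },
    },
  },
};
```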

Future

When elastic/elasticsearch#33214 arrives, agents could start sending up transaction duration histograms and we'll be able to move percentiles and the distribution chart into the low-fi layer. We'd be able to stop sending up unsampled transactions. The hi-fi layer will then only contain actual transaction samples.

roncohen (Author) commented Jul 3, 2019

@elastic/apm-ui thoughts?

dgieselaar (Member) commented:

@roncohen this is pretty exciting! When I was working on my own APM stuff w/ ES I was always struggling with storage vs resolution. One of the options I considered back then was to store everything as a metricset, with a resolution of 1:1. After n days, data would then be rolled up into increasingly lower resolution. I could then always query the metricset instead of the raw documents. If we could do something similar, that would help a lot, but not sure what it means for storage, agent support etc. If we would have to support both transactions and metricsets and then merge them in Kibana it's feasible, but hairy.

What happens when you try to query the rollup search with a percentile agg? Will it error out or just show no data?

roncohen (Author) commented Jul 3, 2019

If we would have to support both transactions and metricsets and then merge them in Kibana it's feasible, but hairy.

If you think about it in two layers, with the hi-fi one being optional, does that help? For example, for the transaction duration graph, the "avg" line comes from the low-fi layer and is based on the metricset documents. A separate query will calculate percentiles based on "transaction" documents. If the percentile queries return data then we "just" add two lines to the transaction duration graph.

formgeist (Contributor) commented:

Sounds like a great plan for supporting different data resolutions. Got a few questions:

  • How will it work with ML? Do the jobs have to change to use low-fidelity data by default in order to ensure that we can keep the anomaly detection intact?
  • Queries will double if we're adding hi-fi on top after the metrics have loaded. Do we include this in the view load, so we wait until we've queried both before displaying the data?
  • Are we going to open up for customization of those hi-fi layers, like percentiles, i.e. letting the user choose which percentiles to calculate and display on the charts and tables?

roncohen (Author) commented Jul 3, 2019

Great questions.

How will it work with ML? Do the jobs have to change to use low-fidelity data by default in order to ensure that we can keep the anomaly detection intact?

It would probably make sense to change them to be based on metricsets eventually, because the plan is to stop sending up unsampled transactions some day. But in the meantime it shouldn't matter; the numbers should be the same. I think we'd consider the ML data part of the low-fi layer.

Queries will double if we're adding hi-fi on top after the metrics have loaded. Do we include this in the view load, so we wait until we've queried both before displaying the data?

As a start, it's probably simplest to wait for both to return before drawing the graphs. If it's not a big difference in complexity, it's probably nicer to show data as soon as we have something and then add to it when the other query arrives.

Are we going to open up for customization of those hi-fi layers, like percentiles, i.e. letting the user choose which percentiles to calculate and display on the charts and tables?

It's an interesting idea, but I don't think we should do that for now.

formgeist (Contributor) commented:

@roncohen thanks for clarifying, makes sense

sorenlouv (Member) commented:

We've been talking about this for a while - thanks for finally getting the ball rolling @roncohen!

One aspect I didn't see mentioned is the query bar. Currently it is used to filter the UI via ES filters applied to transaction and error documents. Metric docs won't have these dimensions and will therefore render the query bar useless.
When the query bar was released it was hailed as one of the things that set us apart from competitors because it let users filter by any dimensions of their data. I don't see how it can stick around without changes that would also limit its use drastically.

roncohen (Author) commented Jul 17, 2019

That's a good point.

I have two ideas for what we could do:

  1. We'd query the hi-fi data for 99p, 95p, avg etc. and at the same time query the low-fi data, then "fill in" the avg. line with data from the low-fi layer when it arrives. If there's a filter set, the low-fi query will not return anything and the hi-fi data will just work. I suspect that combining the data sets like this in the UI could be complex, but maybe not.

  2. Improving on (1), we'd come up with a set of fields that should be included in the metrics. For example, container.id, host.name, kubernetes.pod.uid, transaction.result and perhaps a few more. Agents would need to collect a metric for each combination of these. Those fields would then always work in the filter bar. Looking at a time range where hi-fi data only goes back half way, the effect would be that the avg. line goes across the whole graph, while the 99p/95p lines start in the middle.

axw (Member) commented Jul 18, 2019

Improving on (1), we'd come up with a set of fields that should be included in the metrics. For example, container.id, host.name, kubernetes.pod.uid, transaction.result and perhaps a few more. Agents would need to collect a metric for each combination of these.

For all of those except transaction.result, the server already adds them based on the metadata sent at the start of the stream. I think we'd just need to add dimensions for transaction.name, transaction.type, and transaction.result, to reach parity with the non-sampled transaction docs we have today.
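
For illustration, a timing metricset document with those dimensions might look roughly like the sketch below; the field names and the shape of the aggregated values are assumptions, not the shipped mapping.

```ts
// Hypothetical example of an aggregated transaction timing metricset document
// with the dimensions discussed above. All values and field names are invented.
const exampleTransactionMetricsetDoc = {
  '@timestamp': '2019-07-18T10:00:00.000Z',
  metricset: { name: 'transaction' },
  // dimensions the server already adds from the stream metadata
  host: { name: 'web-01' },
  container: { id: 'abc123' },
  kubernetes: { pod: { uid: 'f0e1d2c3' } },
  // dimensions that would need to be added to reach parity with transaction docs
  transaction: {
    name: 'GET /api/orders',
    type: 'request',
    result: 'HTTP 2xx',
    // aggregated timing for this combination of dimensions and time interval
    duration: { count: 42, sum: { us: 1234567 } },
  },
};
```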

felixbarny (Member) commented:

I'm a bit worried about a cardinality increase when including the transaction.result dimension, especially as this is a user-definable field. There's no guarantee that users group the result like we do (for example, HTTP 2xx). But even with the grouping we do for the HTTP status codes, we might hit the limit of 1000 metricsets per agent pretty quickly.

roncohen (Author) commented:

@felixbarny agreed that we need to be vigilant about cardinality increase

felixbarny (Member) commented:

This has been shipped (see xpack.apm.searchAggregatedTransactions and Configure transaction metrics) and will be the default in 8.0.
