
Use APM metrics to introduce low-fi data layer for space reduction #104

Closed · roncohen opened this issue Jul 1, 2019 · 12 comments

roncohen commented Jul 1, 2019

We should use the transaction timing data from metrics to introduce two layers of data fidelity in the APM UI.

We'd have a low-fi layer and a hi-fi layer.

Motivation

Today, most graphs in the APM UI query transaction documents. This works because we're sending up all transactions, even unsampled ones.

As part of #78 we also started sending up transaction timing data as a metricset. Some of the data shown in the APM UI can be calculated using this new timing data instead of the transaction documents.

This would allow users to get rid of the transaction documents early, say after 7 days, but still be able to derive value from the APM UI beyond this timeframe. Setting a separate ILM policy for transactions is already supported through a bit of manual work.
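
As an illustration of that manual work, an ILM policy for the transaction indices could look roughly like the sketch below; the policy name, ages and rollover settings are invented, and the policy would need to be referenced from the transaction index template.

```ts
// A minimal sketch, not the official setup: delete transaction indices after
// 7 days while the metricset indices stay on a longer-lived policy.
// This is the body one would PUT to `_ilm/policy/apm-transactions-short-lived`
// (policy name and settings are illustrative).
const transactionRetentionPolicy = {
  policy: {
    phases: {
      hot: {
        actions: {
          // roll over so the delete phase operates on bounded indices
          rollover: { max_size: '50gb', max_age: '1d' },
        },
      },
      delete: {
        min_age: '7d',
        actions: { delete: {} },
      },
    },
  },
};
```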

User experience on low-fi data

The idea would be that the low-fi layer is calculated from the metrics data, while all the data that requires the (unsampled) transactions will be part of the hi-fi layer.

From the new metricset, we can show:


  • transactions per minute for each transaction group and across each type (a query sketch follows after this list). The current UI shows transactions per minute per result type (2xx, 3xx, etc.). Result type is not currently in the new timing data, but we could add it as another dimension.

[screenshot: transactions per minute chart]

  • transaction list, without percentiles

[screenshot: transaction list]
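
To make the low-fi layer concrete, the transactions-per-minute chart could be derived from the metricset with an aggregation along these lines. All field names here are assumptions about the timing metricset mapping, not the shipped schema.

```ts
// Rough sketch of a low-fi "transactions per minute" query per transaction group.
// Assumed field names: `metricset.name`, `transaction.type`, `transaction.name`,
// `transaction.duration.count` — check the actual mapping before relying on them.
const tpmFromMetricsRequest = {
  size: 0,
  query: {
    bool: {
      filter: [
        { term: { 'metricset.name': 'transaction' } },
        { term: { 'transaction.type': 'request' } },
        { range: { '@timestamp': { gte: 'now-24h' } } },
      ],
    },
  },
  aggs: {
    transaction_groups: {
      terms: { field: 'transaction.name', size: 100 },
      aggs: {
        timeseries: {
          date_histogram: { field: '@timestamp', fixed_interval: '1m' },
          aggs: {
            // one metricset document represents many transactions, so sum the count
            transaction_count: { sum: { field: 'transaction.duration.count' } },
          },
        },
      },
    },
  },
};
```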

We'd be unable to show the transaction distribution chart or any samples:

[screenshot: transaction duration distribution chart]

If agents eventually support histograms as a metric, we could encode the transaction duration as a histogram and show the transaction distribution even with only the low-fi data. This shouldn't be a blocker at the moment.

Querying

To make things simple, the APM UI could always use the new metrics data to draw the things it can. We'd then fire off separate queries for the "hi-fi" data (percentiles, distribution chart, actual transaction samples etc.). If the hi-fi data is available for the given time range, the percentile lines etc. show on the graphs. If not, we only show the low-fi data.
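
A rough sketch of that flow is below; the helper functions and response shapes are invented for illustration, the point is only that the hi-fi query is optional and additive.

```ts
interface Point {
  x: number;
  y: number | null;
}

interface ChartData {
  avgLine: Point[]; // always available, derived from the metricset (low-fi)
  percentileLines?: { p95: Point[]; p99: Point[] }; // only when transaction docs exist (hi-fi)
}

// Hypothetical helpers: one queries the metricset, the other the transaction documents.
declare function queryTransactionMetrics(range: { start: number; end: number }): Promise<{ avgLine: Point[] }>;
declare function queryTransactionDocuments(range: {
  start: number;
  end: number;
}): Promise<{ hasData: boolean; p95: Point[]; p99: Point[] }>;

async function loadTransactionDurationChart(range: { start: number; end: number }): Promise<ChartData> {
  // Fire both queries in parallel; the low-fi one is expected to cover the whole range.
  const [lowFi, hiFi] = await Promise.all([
    queryTransactionMetrics(range),
    queryTransactionDocuments(range),
  ]);

  const chart: ChartData = { avgLine: lowFi.avgLine };
  if (hiFi.hasData) {
    // Only overlay the percentile lines when hi-fi data exists for the range.
    chart.percentileLines = { p95: hiFi.p95, p99: hiFi.p99 };
  }
  return chart;
}
```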

That means that if you pick a time range that has both low-fi and hi-fi data for the full time range, you'll see exactly what you see today.

If you go back in time far enough, only low-fi data is available and you won't see percentiles, the distribution chart, etc.

If you select a time range where hi-fi data only covers part of the range, the percentile lines might start in the middle of a graph. For the distribution chart in particular this is a complication, because it's not as obvious that the visualization is partial as it is on the graphs. Users will be able to deduce that fact by looking at the other graphs on the same page.

We could try to detect that the data is partial and show a note. Detection could happen by comparing the number of transaction documents we have to the count we get from the metricsets. Probably not a blocker for the first version.
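
A minimal sketch of that detection; the function name and tolerance are invented.

```ts
// Flag the hi-fi data as partial when the number of transaction documents in the
// range is clearly lower than the transaction count derived from the metricsets.
function hiFiDataIsPartial(transactionDocCount: number, metricsetTransactionCount: number): boolean {
  if (metricsetTransactionCount === 0) {
    return false; // nothing to compare against
  }
  // allow some slack for in-flight data and interval boundaries
  return transactionDocCount < metricsetTransactionCount * 0.95;
}
```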

Transaction group list

The transaction group list presents a special problem here, as it would require us to merge the low-fi and hi-fi data in the list. I don't think the merge can be done in Elasticsearch.

Due to pagination etc., we'd need to ensure that the low-fi and hi-fi queries return data for the same transaction groups, and then merge it in Kibana. We could potentially do that by sorting both lists by avg. transaction time, calculated on the metricset and transaction data respectively, and then doing the merge in Kibana. I have more thoughts on this, but we should probably do a POC to investigate the feasibility.
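
One possible shape of that merge in Kibana, with invented types and field names; a POC would still need to validate whether sorting both queries the same way keeps the paginated pages aligned.

```ts
interface TransactionGroup {
  name: string;
  avgDurationUs: number;
  transactionsPerMinute: number;
  p95DurationUs?: number; // only available from the hi-fi (transaction) data
}

function mergeTransactionGroups(lowFi: TransactionGroup[], hiFi: TransactionGroup[]): TransactionGroup[] {
  const byName = new Map<string, TransactionGroup>();
  // Low-fi rows form the base: they cover the whole time range.
  for (const group of lowFi) {
    byName.set(group.name, { ...group });
  }
  // Hi-fi rows add percentiles (and any groups the metricset query missed).
  for (const group of hiFi) {
    const existing = byName.get(group.name);
    if (existing) {
      existing.p95DurationUs = group.p95DurationUs;
    } else {
      byName.set(group.name, { ...group });
    }
  }
  return [...byName.values()].sort((a, b) => b.avgDurationUs - a.avgDurationUs);
}
```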

Rollups

Introducing the low-fi layer as described above allows users to delete transaction data and still see low-fi data. I expect that will be a significant storage reduction for users that want to keep hi-fi data for, say one week, and low-fi data for 2 months. Some users will want to keep low-fi data for much longer. For those users, applying rollups to the low-fi data to decrease time granularity will allow them to further reduce storage costs. Supporting rollups isn't something we'd need to do in the first phase.

The rollup feature includes functionality to transparently rewrite queries to search regular documents and rolled-up data at the same time, so the queries for low-fi data should mostly just work for rolled-up data. There are some improvements to rollups coming which we should probably wait for before spending more time investigating: elastic/elasticsearch#42720
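
For illustration only, a rolled-up low-fi query could be issued through the rollup search endpoint roughly like this; the index name is invented, and only a subset of aggregations (e.g. date_histogram, terms, sum/avg/min/max) is supported on rolled-up data.

```ts
// Hedged sketch: the same kind of low-fi aggregation, sent to
// `POST /<rollup-index>/_rollup_search` once the metrics have been rolled up.
const rolledUpTransactionCountRequest = {
  method: 'POST',
  path: '/apm-metrics-rollup/_rollup_search', // illustrative index name
  body: {
    size: 0,
    aggs: {
      timeseries: {
        // the interval has to be compatible with the rollup job configuration
        date_histogram: { field: '@timestamp', fixed_interval: '1h' },
        aggs: {
          transaction_count: { sum: { field: 'transaction.duration.count' } },
        },
      },
    },
  },
};
```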

Future

When elastic/elasticsearch#33214 arrives, agents could start sending up transaction duration histograms and we'll be able to move percentiles and the distribution chart into the low-fi layer. We'd be able to stop sending up unsampled transactions. The hi-fi layer will then only contain actual transaction samples.

roncohen (Author) commented Jul 3, 2019

@elastic/apm-ui thoughts?

dgieselaar (Member) commented:

@roncohen this is pretty exciting! When I was working on my own APM stuff w/ ES I was always struggling with storage vs resolution. One of the options I considered back then was to store everything as a metricset, with a resolution of 1:1. After n days, data would then be rolled up into increasingly lower resolution. I could then always query the metricset instead of the raw documents. If we could do something similar, that would help a lot, but not sure what it means for storage, agent support etc. If we would have to support both transactions and metricsets and then merge them in Kibana it's feasible, but hairy.

What happens when you try to query the rollup search with a percentile agg? Will it error out or just show no data?

roncohen (Author) commented Jul 3, 2019

If we would have to support both transactions and metricsets and then merge them in Kibana it's feasible, but hairy.

If you think about it in two layers, with the hi-fi one being optional, does that help? For example, for the transaction duration graph, the "avg" line comes from the low-fi layer and is based on the metricset documents. A separate query will calculate percentiles based on "transaction" documents. If the percentile queries return data then we "just" add two lines to the transaction duration graph.

formgeist (Contributor) commented:

Sounds like a great plan for supporting different data resolutions. Got a few questions:

  • How will it work with ML? Do the jobs have to change to use low-fidelity data by default in order to ensure that we can keep the anomaly detection intact?
  • Queries will double if we're adding hi-fi on top after the metrics have loaded. Do we include this in the view load, so we wait until we've queried both before displaying the data?
  • Are we going to open up for customization of those hi-fi layers, like percentiles, i.e. letting the user choose which percentiles to calculate and display on the charts and tables?

roncohen (Author) commented Jul 3, 2019

Great questions.

How will it work with ML? Do the jobs have to change to use low-fidelity data by default in order to ensure that we can keep the anomaly detection intact?

It would probably make sense to change them to be based on metricsets eventually, because the plan is to stop sending up unsampled transactions some day. But in the meantime it shouldn't matter; the numbers should be the same. I think we'd consider the ML data part of the low-fi layer.

Queries will double if we're adding hi-fi on top after the metrics have loaded. Do we include this in the view load, so we wait until we've queried both before displaying the data?

As a start, it's probably simplest to wait for both to return before drawing the graphs. If it's not a big difference in complexity, it's probably nicer to show data as soon as we have something and then add to it when the other query arrives.

Are we going to open up for customization of those hi-fi layers, like percentiles, i.e. letting the user choose which percentiles to calculate and display on the charts and tables?

It's an interesting idea, but I don't think we should do that for now.

formgeist (Contributor) commented:

@roncohen thanks for clarifying, makes sense

sorenlouv (Member) commented:

We've been talking about this for a while - thanks for finally getting the ball rolling @roncohen!

One aspect I didn't see mentioned is the query bar. Currently it is used to filter the UI via ES filters applied to transaction and error documents. Metric docs won't have these dimensions and will therefore render the query bar useless.
When the query bar was released it was hailed as one of the things that set us apart from competitors because it let users filter by any dimensions of their data. I don't see how it can stick around without changes that would also limit its use drastically.

roncohen (Author) commented Jul 17, 2019

That's a good point.

I have two ideas for what we could do:

  1. We'd query the hi-fi data for 99p, 95p, avg etc. and at the same time query the low-fi data, then "fill in" the avg. line with data from the low-fi layer when it arrives. If there's a filter set, the low-fi query will not return anything and the hi-fi data will just work. I suspect that combining the data sets like this in the UI could be complex, but maybe not.

  2. Improving on (1), we'd come up with a set of fields that should be included in the metrics. For example, container.id, host.name, kubernetes.pod.uid, transaction.result and perhaps a few more. Agents would need to collect a metric for each combination of these. Those fields would then always work in the filter bar. Looking at a time range where hi-fi data only goes back half way, the effect would be that the avg. line goes across the whole graph, while the 99p/95p lines start in the middle.

axw (Member) commented Jul 18, 2019

Improving on (1), we'd come up with a set of fields that should be included in the metrics. For example, container.id, host.name, kubernetes.pod.uid, transaction.result and perhaps a few more. Agents would need to collect a metric for each combination of these.

For all of those except transaction.result, the server already adds them based on the metadata sent at the start of the stream. I think we'd just need to add dimensions for transaction.name, transaction.type, and transaction.result, to reach parity with the non-sampled transaction docs we have today.
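
For illustration, a timing metricset document with those dimensions might look roughly like the sketch below; the field names and the shape of the aggregated values are assumptions, not the shipped mapping.

```ts
// Hypothetical example of an aggregated transaction timing metricset document
// with the dimensions discussed above. All values and field names are invented.
const exampleTransactionMetricsetDoc = {
  '@timestamp': '2019-07-18T10:00:00.000Z',
  metricset: { name: 'transaction' },
  // dimensions the server already adds from the stream metadata
  host: { name: 'web-01' },
  container: { id: 'abc123' },
  kubernetes: { pod: { uid: 'f0e1d2c3' } },
  // dimensions that would need to be added to reach parity with transaction docs
  transaction: {
    name: 'GET /api/orders',
    type: 'request',
    result: 'HTTP 2xx',
    // aggregated timing for this combination of dimensions and time interval
    duration: { count: 42, sum: { us: 1234567 } },
  },
};
```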

felixbarny (Member) commented:

I'm a bit worried about a cardinality increase when including the transaction.result dimension, especially as this is a user-definable field. There's no guarantee that users group the result like we do (for example, HTTP 2xx). But even with the grouping we do for the HTTP status codes, we might hit the limit of 1000 metricsets per agent pretty quickly.

roncohen (Author) commented:

@felixbarny agreed that we need to be vigilant about cardinality increase

felixbarny (Member) commented:

This has been shipped (see xpack.apm.searchAggregatedTransactions and Configure transaction metrics) and will be the default in 8.0.
