
[Fleet] Improve data streams API efficiency #116428

Open
hop-dev opened this issue Oct 27, 2021 · 13 comments
Labels: bug (Fixes for quality problems that affect the customer experience), impact:medium (Addressing this issue will have a medium level of impact on the quality/strength of our product), Team:Fleet (Team label for Observability Data Collection Fleet team)

Comments

@hop-dev (Contributor) commented Oct 27, 2021

Kibana version:

7.15.0, 7.16.0, master

Description of the problem including expected versus actual behavior:

Originally pointed out by @joshdover here:

The data stream view can be quite slow to load when there are a lot of streams. We currently get all data streams in one request without pagination and perform an aggregation per data stream.

This issue is to look into ways of improving the performance, current options discussed:

1. Using the data stream name to extract the type, dataset and namespace instead of aggregating (see the name-parsing sketch after this list)

Currently, there is no guarantee that the constant_keyword values in the data match the data stream name. @ruflin suggested we could put in a feature request for Elasticsearch to validate the constant_keyword values against the data stream name, which would allow us to rely on this link.

However, we are now looking at adding another aggregation as part of elastic/integrations#768, so there may no longer be a big efficiency gain to be found here.

2. Introducing pagination

We could introduce pagination to limit the work we do per request; however, there would be some challenges.

3. Combine individual aggregations into one aggregation
I am not sure this is possible. We would need to find a way to use filters and sub-aggregations to get the namespace, dataset and type for each data stream in one query. I believe we would need to distinguish each data stream using a filter query, and the only way to distinguish them would be to use the very values we are querying for! (A possible single-query shape is sketched below.)
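
Regarding option 3: one possible shape for a single query, offered only as a sketch, would be to aggregate on the data_stream.* constant_keyword fields directly, since each unique (type, dataset, namespace) tuple corresponds to exactly one data stream. This assumes Elasticsearch 7.12+ (where the multi_terms aggregation exists) and the 8.x @elastic/elasticsearch JS client; the index pattern and sizes below are placeholder assumptions, not the Fleet code.

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Sketch only: one search across all backing indices instead of one
// aggregation request per data stream.
async function getDataStreamSummaries() {
  const resp = await client.search({
    index: 'logs-*-*,metrics-*-*,traces-*-*', // placeholder index pattern
    size: 0,
    aggs: {
      datastreams: {
        multi_terms: {
          // Each unique (type, dataset, namespace) tuple maps to one data stream.
          terms: [
            { field: 'data_stream.type' },
            { field: 'data_stream.dataset' },
            { field: 'data_stream.namespace' },
          ],
          size: 10000,
        },
        aggs: {
          // Last-activity timestamp per data stream, in the same request.
          last_activity: { max: { field: '@timestamp' } },
        },
      },
    },
  });

  const buckets = (resp.aggregations?.datastreams as any)?.buckets ?? [];
  return buckets.map((b: any) => ({
    type: b.key[0],
    dataset: b.key[1],
    namespace: b.key[2],
    lastActivity: b.last_activity?.value_as_string,
  }));
}
```

Whether one large aggregation like this is actually cheaper than many small ones would need benchmarking; it avoids the per-stream round trips but still fans out to every backing index.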
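
Regarding option 1: the naming scheme means the fields can in principle be derived from the name itself, with the caveat above that nothing guarantees the indexed constant_keyword values match the name. A minimal sketch, assuming the standard {type}-{dataset}-{namespace} format:

```typescript
// Sketch only: derive type/dataset/namespace from a data stream name such as
// "logs-nginx.access-default". Assumes the namespace contains no dashes; if
// both dataset and namespace contain dashes, the split is ambiguous.
interface DataStreamNameParts {
  type: string;
  dataset: string;
  namespace: string;
}

function parseDataStreamName(name: string): DataStreamNameParts | undefined {
  const first = name.indexOf('-');
  const last = name.lastIndexOf('-');
  if (first === -1 || first === last) {
    return undefined; // not in the expected {type}-{dataset}-{namespace} form
  }
  return {
    type: name.slice(0, first),
    // The dataset may itself contain dashes, so take everything between the
    // first and last separators.
    dataset: name.slice(first + 1, last),
    namespace: name.slice(last + 1),
  };
}

// parseDataStreamName('logs-nginx.access-default')
// => { type: 'logs', dataset: 'nginx.access', namespace: 'default' }
```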

Steps to reproduce:

  1. Set up Fleet & Fleet Server
  2. Create an agent policy with many integrations to create many data streams
  3. Go to /app/fleet/data-streams
  4. Note that the page can be quite slow to load
@hop-dev added the Team:Fleet label on Oct 27, 2021
@elasticmachine (Contributor)

Pinging @elastic/fleet (Team:Fleet)

@joshdover (Contributor)

@elastic/kibana-stack-management have you all solved the problem of optimizing your usage of the Data Streams stats API? I noticed that by default, stats are excluded from your Data Streams UI (you have to switch on a toggle in the top right). Curious if there's any history behind this decision and whether we should also consider excluding stats by default or removing them from the list view entirely.

@cjcenizal (Contributor)

@joshdover We haven't had an opportunity to revisit that functionality since it was first implemented. Because loading the data stream stats requires hitting a separate API (https://github.com/elastic/kibana/pull/75107/files#diff-0db7f035e2e41be22bac202848c325fabf209f626b8a934d09cce5e9e074941bR34), and I think the stats themselves might take a while to fetch, retrieving the data streams along with their stats can be slow. I recommend pinging the ES Data Management team for more detailed and up-to-date info.

@joshdover changed the title from "[Fleet] Improve data streams API efficiencey" to "[Fleet] Improve data streams API efficiency" on Jan 26, 2022
@joshdover (Contributor)

This continues to be a problem for what I expect to be most Fleet customers. In my test cluster, I have ~60 data streams with ~300 backing indices, and the request to GET /api/fleet/data_streams times out in Kibana after 2 minutes, resulting in a 502 error in Cloud, likely from the proxy layer (backend closed connection).

I don't think this is anywhere close to a large amount of data (I'm only ingesting data from ~6 integrations on 2 laptops that aren't even always in use).

@jen-huang I'm going to add this to our iteration board to look at in the next testing cycle. I think we should try to get a fix in for the 7.x series as well.

@joshdover added the bug and impact:medium labels on Jan 26, 2022
@joshdover (Contributor)

joshdover commented Jan 26, 2022

I did some digging in our production data and I'm seeing that about 2.5% of customers who attempted to use this page were affected by this bug in the last 7 days. I haven't dug further, but my guess is this affects our largest, most mature adopters of Fleet, an important segment. While the incidence rate isn't incredibly high, the implied 97.5% success rate isn't exactly a great SLA. I think prioritizing this is the right call.

@thunderwood19

@joshdover

Any update on this? I am one of the affected customers who relies heavily on Fleet. If I can help with any logs/testing, I would be more than happy to!

@joshdover (Contributor)

Hi @thunderwood19, we have this prioritized to be worked on soon but have not yet dug in further. In the meantime, I do suggest using the UI in Stack Management > Index Management > Data streams.


Related to this, in #126067 it was discovered that the user needs the manage cluster privilege in order to call the Data stream stats API. This limits the usability of this page now that we're allowing non-superusers to use Fleet.

I think this requirement gives us further reason to explore decoupling the request to the Data stream stats API from fetching the list of data streams. If we loaded the stats separately, we may be able to show the main list quicker while also providing a more progressive UI for users with lower privileges.
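
A rough sketch of what that decoupling could look like on the client side. This is a sketch under assumptions: it presumes the existing GET /api/fleet/data_streams response shape of { data_streams: [...] }, that the list call no longer embeds stats, and a /api/fleet/data_streams/stats endpoint that is hypothetical and used purely for illustration.

```typescript
// Sketch only: render the list immediately, then hydrate stats in a second,
// privilege-gated call so users without the `manage` cluster privilege still
// see the basic table.
interface DataStreamRow {
  name: string;
  sizeBytes?: number; // filled in later if the stats call succeeds
}

type HttpGet = (path: string) => Promise<any>;

async function loadDataStreamsPage(
  httpGet: HttpGet,
  render: (rows: DataStreamRow[]) => void
): Promise<void> {
  // 1. Fast call: the list of data streams (assumes no embedded stats).
  const { data_streams } = await httpGet('/api/fleet/data_streams');
  let rows: DataStreamRow[] = data_streams.map((ds: any) => ({ name: ds.name }));
  render(rows); // the table is usable before stats arrive

  // 2. Slower, privilege-gated call (hypothetical endpoint).
  try {
    const stats = await httpGet('/api/fleet/data_streams/stats');
    rows = rows.map((row) => ({ ...row, sizeBytes: stats[row.name]?.size_in_bytes }));
    render(rows);
  } catch {
    // e.g. 403 for users lacking the privilege: keep showing the basic list.
  }
}
```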

@joshdover (Contributor)

@thunderwood19 Have you had a chance to test this on 8.1? We've made some improvements and I'm no longer seeing this issue as widespread in our production data or in my personal cluster on Elastic Cloud.

@thunderwood19

> @thunderwood19 Have you had a chance to test this on 8.1? We've made some improvements and I'm no longer seeing this issue as widespread in our production data or in my personal cluster on Elastic Cloud.

Yep! I let my support know yesterday; I can see the data streams via the Fleet GUI just fine now on 8.1.0.

@joshdover (Contributor)

Fantastic to hear. @jen-huang I'm going to de-prioritize this for now.

@joshdover (Contributor)

Some improvements are being made in #130973 to switch to using the terms enum API instead of aggregations for some of the calculations, which increases the request count but should be a big improvement on overall perf.

Pagination would still be welcome to avoid the N+1 query problem we have right now.
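
For reference, the terms enum API can enumerate the values of a keyword/constant_keyword field cheaply because it reads the field's terms dictionary rather than running an aggregation over documents. A sketch of what such a call looks like with the 8.x @elastic/elasticsearch JS client; the index pattern and size are assumptions, not the actual code from #130973.

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Sketch only: list the distinct namespaces present across logs data streams
// via the terms enum API instead of a terms aggregation.
async function getLogNamespaces(): Promise<string[]> {
  const resp = await client.termsEnum({
    index: 'logs-*-*', // placeholder index pattern
    field: 'data_stream.namespace',
    size: 1000, // caps the number of returned terms
  });
  return resp.terms;
}
```

The trade-off noted above applies: one such request per field and index pattern raises the request count, but the overall cost should still be lower than the aggregation-based approach.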

@nimarezainia (Contributor)

@joshdover What remains for us to do in this regard? Should we track this for 8.5 (for Fleet scaling)?

@joshdover (Contributor)

I think we mostly need to do the pagination work at this point. I don't think it's super high priority right now though; it doesn't affect control plane scaling, mostly the data plane.
