[DISCUSS] API for selecting data sources, index aliases, and indices #64858

Closed
mattkime opened this issue Apr 30, 2020 · 29 comments
Labels
discuss Feature:Data Views Data Views code and UI - index patterns before 8.0

Comments

@mattkime
Contributor

mattkime commented Apr 30, 2020

In support of Data streams - elastic/elasticsearch#53100

tldr; We need a way to select data streams, index aliases, and indices in such a way that we show the user which entities their wildcard matches.

tl;

Initial display of indices -

Screen Shot 2020-04-29 at 8 30 04 PM

Display of matched indices -

Screen Shot 2020-04-29 at 8 30 40 PM

Currently we only display indices with at least one document. You can match an index alias but we don't indicate you match it, we just show the indices it references. Finding a document is important as we store a list of fields with the index pattern saved object. We could display an error if a wildcard matches an index without documents.

We may want to add metadata to the entities returned but currently have no defined needs. Let's make sure it's easy to add in the future.

This needs to be cross cluster aware. Currently we make two requests when listing indices in the index pattern creation UI - * and *:*. We do this because the cross cluster request is more likely to be slow or fail, so it's nice to independently error on the cross cluster request.
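A minimal sketch of handling the two requests independently, so a failed cross cluster request does not hide the local results. The `Listing` shape and the `combineListings` helper are hypothetical names made up for illustration, not existing Kibana code:

```typescript
// Outcome of one listing request; this shape is an assumption for the sketch.
type Listing =
  | { status: "ok"; names: string[] }
  | { status: "error"; message: string };

interface CombinedListing {
  names: string[]; // everything that could be listed
  warnings: string[]; // one entry per failed request
}

// Merge the local (`*`) and cross cluster (`*:*`) outcomes without letting
// a failure in one hide the results of the other.
function combineListings(local: Listing, remote: Listing): CombinedListing {
  const combined: CombinedListing = { names: [], warnings: [] };
  const outcomes: Array<[string, Listing]> = [
    ["local (*)", local],
    ["cross cluster (*:*)", remote],
  ];
  outcomes.forEach(([label, outcome]) => {
    if (outcome.status === "ok") {
      combined.names.push(...outcome.names);
    } else {
      combined.warnings.push(`${label} request failed: ${outcome.message}`);
    }
  });
  return combined;
}

// The local request succeeded, the cross cluster one timed out.
const listing = combineListings(
  { status: "ok", names: ["logs-2020-04-29"] },
  { status: "error", message: "timeout" }
);
```

The UI could render `listing.names` right away and surface `listing.warnings` separately.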

API proposal:

Request - GET _data_source/{wildcard}
Result -

{
  indices: [{ name: 'index_name'}],
  data_streams: [{name: 'data_stream_name'}],
  index_aliases: [{name: 'index_alias_name'}],
}

I'm unsure if this should be implemented in Elasticsearch or Kibana. You could duplicate the result with GET * (for indices), GET /_alias and GET /_data_streams/, although the individual APIs might be doing more work than necessary. Speed should be taken into consideration as it affects flexibility of usage. We would prefer that index patterns are quick and easy to create so as to facilitate data exploration, unlike now where it's treated as a Kibana configuration step. It's notable that GET * frequently returns large payloads detailing fields and their capabilities.
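To make the proposed result shape concrete, here is a rough sketch of the Kibana-side variant: three name lists that were already fetched are filtered by the wildcard and grouped like the proposed response. `selectDataSources` and `wildcardToRegExp` are hypothetical helpers, and only `*` is handled, so the rest of the Elasticsearch multi-target syntax is out of scope here:

```typescript
// Response shape mirroring the proposal above.
interface NamedEntity {
  name: string;
}
interface DataSourceResult {
  indices: NamedEntity[];
  data_streams: NamedEntity[];
  index_aliases: NamedEntity[];
}

// Translate a simple wildcard (only `*`) into an anchored RegExp.
function wildcardToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(`^${escaped}$`);
}

// Keep only the names the wildcard matches, grouped by entity type.
function selectDataSources(
  pattern: string,
  all: { indices: string[]; dataStreams: string[]; aliases: string[] }
): DataSourceResult {
  const re = wildcardToRegExp(pattern);
  const pick = (names: string[]): NamedEntity[] =>
    names.filter((n) => re.test(n)).map((n) => ({ name: n }));
  return {
    indices: pick(all.indices),
    data_streams: pick(all.dataStreams),
    index_aliases: pick(all.aliases),
  };
}

// Made-up names, purely for illustration.
const example = selectDataSources("logs-*", {
  indices: ["logs-2020-04-29", "metrics-1"],
  dataStreams: ["logs-nginx"],
  aliases: ["logs-current", "all-metrics"],
});
```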

@mattkime mattkime changed the title [DISCUSS] Elasticsearch api for selecting data sources, index aliases, and indices [DISCUSS] API for selecting data sources, index aliases, and indices Apr 30, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-app-arch (Team:AppArch)

@ppisljar
Member

ppisljar commented May 4, 2020

I agree something like this might be desired. If we implement this on the Kibana side, we should use an async request and stream the data back as we get it, so that the cross cluster request does not slow everything down.

If an index pattern has no documents, it can still have a mapping defined, which should allow us to populate the fields, right? So we would only need to show an error when there is no mapping defined.

@cjcenizal
Contributor

This proposal sounds great. I think this should be implemented in ES so that it can be updated with other "data sources" that get added in the future, as well as made to support all nuances of the ES search syntax (e.g. _all). This also seems like an API that would be useful to anyone consuming ES, not just Kibana.

I'd also suggest we expand the API to allow you to check if an index pattern matches a specific name or names. This will be useful when trying to figure out if an existing index or data stream is captured by an index pattern. For example:

POST _data_source/test_*/_match
{
  indices: [{ name: 'index_name'}],
  data_streams: [{name: 'test_stream'}],
  index_aliases: [{name: 'index_alias_name'}],
}

// 200
{
  indices: [],
  data_streams: [{name: 'test_stream'}],
  index_aliases: [],
}
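A rough client-side sketch of that matching behaviour, with the wildcard written as `test_*` so that it matches `test_stream`. `matchCandidates` and `wildcardToRegExp` are hypothetical helpers for illustration and only `*` is supported:

```typescript
interface NamedEntity {
  name: string;
}
interface EntityGroups {
  indices: NamedEntity[];
  data_streams: NamedEntity[];
  index_aliases: NamedEntity[];
}

// Translate a simple wildcard (only `*`) into an anchored RegExp.
function wildcardToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(`^${escaped}$`);
}

// Return only the candidates whose name is captured by the pattern.
function matchCandidates(pattern: string, candidates: EntityGroups): EntityGroups {
  const re = wildcardToRegExp(pattern);
  const keep = (entities: NamedEntity[]): NamedEntity[] =>
    entities.filter((e) => re.test(e.name));
  return {
    indices: keep(candidates.indices),
    data_streams: keep(candidates.data_streams),
    index_aliases: keep(candidates.index_aliases),
  };
}

const matched = matchCandidates("test_*", {
  indices: [{ name: "index_name" }],
  data_streams: [{ name: "test_stream" }],
  index_aliases: [{ name: "index_alias_name" }],
});
```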

@matt-davis-elastic

matt-davis-elastic commented May 4, 2020

@mattkime Thanks for raising this issue. I have a few questions for you.

You can match an index alias but we don't indicate you match it, we just show the indices it references.

Would you see data streams following a similar pattern? In my mind, I am thinking that we would want to abstract the idea of Indices behind the data stream. @martijnvg @danhermann Not sure if you agree or not. At least I can't immediately think of a use case where someone would want to include only a subset of the indices behind a data stream in the index pattern, which doesn't mean there aren't any.

@mattkime I just want to make sure I understand the workflow here for a dedicated API call.

  1. A user would go to the Index Pattern UI and start typing a pattern; this would issue a POST /{wildcard}/_search call. This call today just returns the indices which have a count greater than 1. Would this call need to return a list of indices, alias names and the associated indices, and data stream names and associated indices, or are you thinking you could call _data_source/{wildcard} here for this information instead of the _search call?

  2. A user is finished with the index pattern and selects the Next step button. Will this call _data_source/{wildcard} and get the list of fields/mappings for the indices that match the included pattern? So will you see the _data_source API replacing the _fields_for_wildcard API call?

I think this makes sense to have its own call. I think it's something that we could use in the ES UI, and the Kibana UI.

@cjcenizal what do you think? Is this something we could use in the SLM UI? For example, #65132 needs to be aware of data streams, so we could either add another API call for the snapshot policy creation to be data streams aware or we could use this same API call. There are a few differences though, as the SLM UI doesn't allow for wildcard searches, and includes indices with no data.

@mattkime
Contributor Author

mattkime commented May 5, 2020

@matt-davis-elastic

Would you see data streams following a similar pattern? In my mind, I am thinking that we would want to abstract the idea of Indices behind the data stream.

Yes, that's exactly what we want - hidden indices will not be seen (at least by default) with this API.

At least I can't immediately think of a use case where someone would want to include only a subset of the indices behind a data stream in the index pattern, which doesn't mean there aren't any.

We would default to hiding hidden indices of any type. We could allow querying them via an optional query param but I have the same understanding of the need for this feature that you do.

Would this call need to return a list of indices, alias names and the associated indices, and data stream names and associated indices, or are you thinking you could call _data_source/{wildcard} here for this information instead of the _search call?

I'm not 100% certain I understand the question but I'll take a swing anyway. I don't see a need to return a list of indices but I'm happy to entertain the idea if someone sees a need. Seems like getting that list could be a different API call.

So will you see the _data_source API replacing the _fields_for_wildcard API call?

I don't, my starting point is querying data sources and their fields should be separate apis. _fields_for_wildcard wraps the FieldCaps API which will support data streams.

IMO the SLM use case looks very similar and we should design an api that is suitable for both. We could provide a way to specify what type of data sources you'd like to query but it would be trivial to ignore unwanted results.

@matt-davis-elastic

I had a quick meeting with Matt to cover this. Here is the summary.

In the Index Patterns UI today when you search for a top-level item like an Alias then the UI will show you the Indices that match that top-level item. So for example, if I have an alias called log which has test-001 and test-002 indices behind it. Then if I search for lo* then I would see test-001 and test-002 in my results for indices that the pattern can match.

With the changes to make this UI work with data streams, we are proposing to add a new _data_source API that would return the 3 objects: indices, aliases, and data streams. The UI would then show the top-level items instead of the indices behind those top-level items. So using my previous example, if you searched for lo* you would see the alias log as an item that would match the pattern, and you wouldn't see test-001 or test-002 unless you changed the pattern to te*.
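The lo* / te* example can be sketched as follows. This is only an illustration of the proposed UI behaviour; `topLevelMatches`, `wildcardToRegExp`, and the `Entities` shape are hypothetical:

```typescript
interface Alias {
  name: string;
  indices: string[];
}
interface Entities {
  indices: string[];
  aliases: Alias[];
  dataStreams: string[];
}
interface Match {
  kind: "index" | "alias" | "data_stream";
  name: string;
}

// Translate a simple wildcard (only `*`) into an anchored RegExp.
function wildcardToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(`^${escaped}$`);
}

// List the top-level items whose own name matches, without expanding an
// alias or data stream into the indices behind it.
function topLevelMatches(pattern: string, entities: Entities): Match[] {
  const re = wildcardToRegExp(pattern);
  const matches: Match[] = [];
  entities.aliases.forEach((a) => {
    if (re.test(a.name)) matches.push({ kind: "alias", name: a.name });
  });
  entities.dataStreams.forEach((d) => {
    if (re.test(d)) matches.push({ kind: "data_stream", name: d });
  });
  entities.indices.forEach((i) => {
    if (re.test(i)) matches.push({ kind: "index", name: i });
  });
  return matches;
}

// The alias `log` points at test-001 and test-002, as in the example.
const cluster: Entities = {
  indices: ["test-001", "test-002"],
  aliases: [{ name: "log", indices: ["test-001", "test-002"] }],
  dataStreams: [],
};

const forLo = topLevelMatches("lo*", cluster); // only the alias `log`
const forTe = topLevelMatches("te*", cluster); // the two indices
```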

@mattkime let me know if I missed anything.

@danhermann

Two comments -- Given that aliases are commonly named similarly to the indices to which they point (e.g., mysql-logs for mysql-logs-2020-05-05), I think it is likely that a wildcard will match both an alias and the indices to which it points. In that case, you'd need some indication that the indices were already referenced by the alias if you wanted to display only the alias. If there's an option to include hidden indices, the same will be true for data streams.

Second, we may want to iterate on the proposed name of _data_source for the new endpoint. We considered that name for some elements of the work on data streams and decided against it for various reasons including not wanting to add the potentially confusing concept of a "data source" to ES.

@cjcenizal
Contributor

cjcenizal commented May 5, 2020

To address @danhermann's points, I propose two changes:

  1. This API should return as complete an entity definition as possible. For example, aliases would be defined the same way as the response returned by the Get alias API (which includes the names of aliased indices) and indices would be defined as by the Get index API (which includes aliases, mappings, and settings). This way users can trace indices back to their aliases, and aliases back to their indices. I believe this works for data streams too, since backing indices and data streams have references to each other.
  2. In terms of naming, this API seems to be all about testing out patterns to see what gets captured. How about we name it GET _capture_pattern/<pattern>? If that's too controversial, then we could leverage existing terminology and use GET _index_pattern/<pattern> even though the API has concerns beyond indices.

@matt-davis-elastic

matt-davis-elastic commented May 6, 2020

@cjcenizal I think that makes sense to include the indices in which category they fall under. So an index that is part of an alias only shows up in the alias section with its alias and the same with data streams. So +1 for that.

For #2, I also find _data_sources to be a little misleading. I picture data sources as things like Nginx_logs or filebeat. @cjcenizal I like your idea of explaining a bit more about what it does. I don't like capture as that feels like it's taking some action on the indices. Index pattern is close. I think it aligns with the nomenclature that Kibana uses. Aliases and data streams are a way to group indices, so I think that does fit. However, I don't know if it will collide or be confusing given what index patterns mean in Kibana. For me, just using that, I might want to do something like _index_pattern/{kibana_index_pattern_name}. What about GET _index_groups/{wildcard}? Just brainstorming. I hate naming.

@mattkime
Contributor Author

mattkime commented May 6, 2020

fwiw I have no opinion on whether data sources is the right name for this and would be happy if something else was agreed upon.

Supporting pagination in the api would be nice.

@danhermann

@cjcenizal I think that makes sense to include the indices in which category they fall under. So an index that is part of an alias only shows up in the alias section with its alias and the same with data streams. So +1 for that.

I read @cjcenizal's suggestion as including enough metadata about each index, alias, and data stream (we call them collectively "index abstractions" on the ES side which does not necessarily mean that's the best name for the endpoint) to eliminate any ambiguities about how a particular index abstraction was matched with a given wildcard. I would prefer to avoid trying to determine which section an index should go in based on how it was resolved.

I would amend the original API proposal so that each alias includes all its referenced indices, each index includes all its aliases and parent data stream (if any), and each data stream includes all its backing indices. It would look something like this:

Request - GET _endpoint_name_tbd/{wildcard}

Result -

{
  indices: [
    { 
      name: 'index1',
      aliases: ['alias1', 'alias2']
    },
    { 
      name: 'index2',
      aliases: ['alias2']
    },
    { 
      name: 'my-data-stream-000001',
      aliases: [], # could be omitted if empty
      data_stream: 'my-data-stream' # omitted if not a backing index
    }
  ],
  data_streams: [
    { 
      name: 'my-data-stream',
      backing_indices: ['my-data-stream-000001']
    }
  ],
  aliases: [
    {
      name: 'alias1',
      indices: ['index1']
    },
    {
      name: 'alias2',
      indices: ['index1', 'index2']
    }
  ]
}
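With a response shaped like this, a client could decide for itself which matched indices are already covered by a matched alias or data stream. A minimal sketch, assuming the field names above; `splitIndices` is a hypothetical helper, and `index3` is added here only to show an index that nothing else references:

```typescript
interface ResolvedIndex {
  name: string;
  aliases?: string[];
  data_stream?: string;
}
interface ResolvedAlias {
  name: string;
  indices: string[];
}
interface ResolvedDataStream {
  name: string;
  backing_indices: string[];
}
interface ResolveResponse {
  indices: ResolvedIndex[];
  data_streams: ResolvedDataStream[];
  aliases: ResolvedAlias[];
}

// Split matched indices into those reachable through a matched alias or
// data stream ("covered") and those that are not ("standalone").
function splitIndices(response: ResolveResponse): {
  covered: string[];
  standalone: string[];
} {
  const referenced: string[] = [];
  response.aliases.forEach((a) => referenced.push(...a.indices));
  response.data_streams.forEach((d) => referenced.push(...d.backing_indices));
  const names = response.indices.map((i) => i.name);
  return {
    covered: names.filter((n) => referenced.indexOf(n) !== -1),
    standalone: names.filter((n) => referenced.indexOf(n) === -1),
  };
}

const sample: ResolveResponse = {
  indices: [
    { name: "index1", aliases: ["alias1", "alias2"] },
    { name: "index2", aliases: ["alias2"] },
    { name: "my-data-stream-000001", data_stream: "my-data-stream" },
    { name: "index3" }, // not in the example above; added for illustration
  ],
  data_streams: [
    { name: "my-data-stream", backing_indices: ["my-data-stream-000001"] },
  ],
  aliases: [
    { name: "alias1", indices: ["index1"] },
    { name: "alias2", indices: ["index1", "index2"] },
  ],
};

const split = splitIndices(sample);
```

This keeps the API free of any "which section does this index belong in" logic and leaves the display decision to the caller.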

@mattkime
Contributor Author

mattkime commented May 6, 2020

@danhermann That looks great, a couple of details / requests I'd like to nail down -

  • Is it possible to support pagination?
  • Is it cross cluster aware?
  • Possible to add a hasDocuments attribute to each top level data source returned? Set to true if there are one or more documents.
  • Can there be an option for returning hidden indices? (to be clear, I think this would be rarely used)

I need to do some research on frozen indices to determine how they might relate to this api.

@danhermann

  • Is it possible to support pagination?

I'd like to defer that, if possible. It adds a lot of complexity and we already have a pretty aggressive schedule on the ES side to deliver data streams.

  • Is it cross cluster aware?

Could you give me an example of what you mean by this?

  • Possible to add a hasDocuments attribute to each top level data source returned? Set to true if there are one or more documents.

I'll check on that.

  • Can there be an option for returning hidden indices? (to be clear, I think this would be rarely used)

Yes, that's doable.

@mattkime
Contributor Author

mattkime commented May 6, 2020

Frozen index support - could frozen indices be returned marked frozen: true?

Data streams - can we get the timestamp_field?

@mattkime mattkime added the Feature:Data Views Data Views code and UI - index patterns before 8.0 label May 6, 2020
@danhermann

Frozen index support - could frozen indices be returned marked frozen: true?

Data streams - can we get the timestamp_field?

These two can be added pretty easily.

@martijnvg
Member

Possible to add a hasDocuments attribute to each top level data source returned? Set to true if there are one or more documents.

Adding this to the new API would make it much more complex. All the other attributes can be retrieved from the cluster state that exists in memory on all nodes, but whether an index is empty cannot. I chatted with @mattkime and at least for now we will leave this out. This can be added in a later iteration.

is it cross cluster aware?

Yes, we can make this new api cross cluster aware (similar to _field_caps api).

@matt-davis-elastic

@martijnvg Thanks for the comments.

@mattkime How much trouble is it going to cause for the user if we can't send back whether the data stream is empty or not? I know we can't really bank on it, but at least at first I think the main way data streams will be created will be to send data to an index pattern that matches an index template. So I would think in almost all cases there would be data in a data stream. It is possible to create one via the API, but it's not exposed in the UI anywhere and I don't anticipate users doing that really.

@mattkime
Contributor Author

mattkime commented May 12, 2020

How much trouble is it going to cause for the user if we can't send back if the data stream is empty or not?

At most it will mean an additional API call and a message to the user stating that the index pattern can't be created since none of the matching indices have a document.

That said, we're currently examining whether we can create kibana index pattern objects without having a document. It certainly would be helpful to beats and ingest teams.

@mattkime
Contributor Author

@martijnvg I'm curious if you can provide a very rough guess as to the performance of this call vs the aggregation query we currently use -

"text": "{\"size\":0,\"aggs\":{\"indices\":{\"terms\":{\"field\":\"_index\",\"size\":200}}}}"
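For readers less familiar with that query, here is the same body unescaped, plus a sketch of reading index names out of the terms aggregation buckets. `indexNamesFromAggregation` and the sample response are made up for illustration:

```typescript
// The request body from the escaped string above.
const listIndicesBody = {
  size: 0,
  aggs: { indices: { terms: { field: "_index", size: 200 } } },
};

// Only the part of the search response this sketch reads.
interface TermsBucket {
  key: string;
  doc_count: number;
}
interface IndicesAggResponse {
  aggregations: { indices: { buckets: TermsBucket[] } };
}

// Each bucket key is the name of an index holding at least one document.
function indexNamesFromAggregation(response: IndicesAggResponse): string[] {
  return response.aggregations.indices.buckets.map((b) => b.key);
}

// Hypothetical response, not captured from a real cluster.
const sampleResponse: IndicesAggResponse = {
  aggregations: {
    indices: {
      buckets: [
        { key: "test-001", doc_count: 12 },
        { key: "test-002", doc_count: 3 },
      ],
    },
  },
};

const indexNames = indexNamesFromAggregation(sampleResponse);
```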

@danhermann

@mattkime, as spec'ed here, this API will definitely be faster because it returns data only from cluster metadata which can be retrieved from any node in the cluster (excluding cross-cluster info, of course). It will not have to query every shard as the existing one does.

@cjcenizal
Contributor

@mattkime or @danhermann Are we at a place now where we can create an issue in the ES repo to track this?

@mattkime
Contributor Author

@danhermann Any ideas on how we might come up with a name for the set - data streams, indices, and index aliases?

@danhermann

I've given it a little thought, but have not come up with any great ideas, yet. It's temporarily named resolve_index_abstraction while I'm working on it now, but I don't think that's the best name for end users. "Data sources" fits from a purely semantic perspective, but we did want to avoid the potential confusion of that term in ES since it has a number of domain-specific meanings in the database world. Perhaps we should open it up to suggestions from the broader team?

@mattkime
Contributor Author

@danhermann

Perhaps we should open it up to suggestions from the broader team?

Sounds good - can you kick that off? I'm happy to help but this seems more like an ES thing than Kibana thing.

@jloleysens
Contributor

@mattkime @danhermann

I think this kind of endpoint will prove super valuable to ES UI and am super excited to see how it is coming along!

Is there an open issue for this ES side? Or somewhere we can contribute?

@danhermann

I'm planning to have a draft PR open for it within a day. I'm just working on its cross-cluster capabilities now. The request and response look like what I proposed in this comment above.

@jloleysens
Contributor

Ok, thanks for pointing that out @danhermann 👍 - glad to know that version is up-to-date.

@danhermann

The draft PR is up. Resolving names against a local cluster works, but there's a bug with remote clusters that needs to be fixed. The request and response format and all parameters and options should be stable.

@danhermann

This API has been completed on the ES side: elastic/elasticsearch#57626


10 participants