Get list of all available analyzers. Request for a new API? #5481

Open
lukas-vlcek opened this issue Dec 7, 2022 · 30 comments · May be fixed by #10296
Labels
distributed framework enhancement Enhancement or improvement to existing feature or request Search:Relevance v2.16.0 Issues and PRs related to version 2.16.0

Comments

@lukas-vlcek
Contributor

lukas-vlcek commented Dec 7, 2022

Is your feature request related to a problem? Please describe.

I am missing an option to get a list of all available analyzers.
There is already analyze API documentation, and it mentions "built-in" analyzers. But for a normal user there is no way to learn what all the options are. And even for people who are familiar with the code, the list is subject to updates/changes.
One option would be to document the list on the documentation page. But I think I would prefer it if the cluster itself could give a list of its analyzers (and also tokenizers, charfilters, normalizers).

  • There are built-in analyzers at the cluster level (in fact the list is kept at the node level, see below)
  • There are also ad-hoc/custom analyzers at the individual index level (but visibility of these should be subject to RBA rules?)

Describe the solution you'd like

As far as I understand, the list of all built-in analyzers is materialized once AnalysisModule.setupAnalyzers(plugins) is called.
I think it would be useful to extend one of the _nodes/ APIs and give it an option to return the list of all built-in analyzers (and their components: tokenizers, etc.). (It needs to be an API at the "nodes" level because AnalysisRegistry is kept per node, and I think the list of built-in analyzers can differ on each node depending on installed plugins.)

As for the list of analyzers defined at the index level I am not sure at this point. Maybe later...

Describe alternatives you've considered

The alternative is to go to the documentation (which does not have this list) or to the code (which is not an option for many people).

Additional context
n/a

@dtaivpp

dtaivpp commented Dec 7, 2022

@lukas-vlcek I couldn't agree more. What might even be interesting is if we could start to expose some of these through a generated "cluster documentation" page in Dashboards. That could show a bit of the cluster's meta information.

@lukas-vlcek lukas-vlcek changed the title Get list of all available analyzers. Request for a new _analyze API? Get list of all available analyzers. Request for a new API? Dec 18, 2022
@anasalkouz anasalkouz added Search Search query, autocomplete ...etc and removed Indexing & Search untriaged labels Dec 20, 2022
@lukas-vlcek
Contributor Author

BTW, I am looking at this and I am trying to implement a quick prototype. Feel free to assign me.

@lukas-vlcek
Contributor Author

Hi,

I prepared an experimental plugin with this functionality.

At this point I would love to get some feedback.

Below are more details about what this plugin can offer.

How does it work?

Right now, every OpenSearch node has an internal AnalysisRegistry object that can be easily injected into plugins. This object is the main interface when a client wants to get access to a specific analysis component (analyzer, tokenizer, etc.). The problem, however, is that this registry object can only return an analysis component if you know its name up front. Although the registry keeps the list of all known components internally, that list is private. The important point is that all the analysis components are lazily initialized. I think there is a good reason not to initialize an analysis component if it never gets used (and to initialize it only before first use): it saves cost, especially for components based on large dictionaries.

Internally, the analysis registry contains Maps that have the key pointing to analysis component providers. So what I ended up doing is that I used reflection to get access to those internal Maps and I pulled keySets from them. I think this should be pretty safe and should not introduce any vulnerabilities. Those Maps are initialized at the node bootstrap (which means that the keySet is not changing later).
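
The point that the names can be listed without initializing anything can be illustrated with a small Python analogy. This is only a sketch and not OpenSearch code: `LazyRegistry`, `register`, `get`, `names` and `make_provider` are hypothetical names invented for the illustration, while the actual plugin reads the private Maps of AnalysisRegistry through Java reflection.

```python
from typing import Callable, Dict, List


class LazyRegistry:
    """Hypothetical stand-in for a registry that maps names to providers."""

    def __init__(self) -> None:
        self._providers: Dict[str, Callable[[], object]] = {}
        self._instances: Dict[str, object] = {}

    def register(self, name: str, provider: Callable[[], object]) -> None:
        self._providers[name] = provider

    def get(self, name: str) -> object:
        # Build the component on first use only, then cache it.
        if name not in self._instances:
            self._instances[name] = self._providers[name]()
        return self._instances[name]

    def names(self) -> List[str]:
        # Reading the keys never calls a provider, so nothing is initialized.
        return sorted(self._providers)


built: List[str] = []  # records which providers actually ran


def make_provider(name: str) -> Callable[[], object]:
    def provider() -> object:
        built.append(name)
        return f"<{name} analyzer>"
    return provider


registry = LazyRegistry()
for component_name in ("standard", "kuromoji", "icu_analyzer"):
    registry.register(component_name, make_provider(component_name))

listed = registry.names()            # lists all three names, builds nothing
component = registry.get("kuromoji")  # only now is one component built
```

Listing the keys leaves `built` empty; only the explicit `get` call triggers a provider, which mirrors why pulling the keySets should not defeat lazy initialization.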

The content of those Maps/keySets depends on AnalysisPlugins that are found during node bootstrap (there are a few components available OOTB but most of them come from modules/plugins). This means that if you install any additional plugin the list will expand. I found that every AnalysisPlugin exposes information about which analysis components it is introducing to the system and I am using this information to provide more detailed information about available (built-in) analysis components.

Notice

Because it is implemented as a plugin, I had to use the reflection API to gain access to information that is not exposed to plugins (hence plugin-security.policy is in place and plugin installation requires confirmation). If it were implemented as a core component, I would consider making some further changes directly in OpenSearch so that reflection would not be needed.

Example

Imagine OpenSearch with the following plugins installed:

GET http://localhost:9200/_cat/plugins?v
name                     component           version
Lukass-MacBook-Pro.local analysis-icu        2.4.1
Lukass-MacBook-Pro.local analysis-kuromoji   2.4.1
Lukass-MacBook-Pro.local analysis-phonetic   2.4.1
Lukass-MacBook-Pro.local node-analyzers      1.0.0.0-rc.1

This yields the following comprehensive list of analysis components:

GET http://localhost:9200/_nodes/analyzers?pretty
{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "opensearch",
  "nodes" : {
    "SSlN30D0RUmDieqwlmp4RA" : {
      "analyzers" : [
        [
          "standard",
          "german",
          "irish",
          "pattern",
          "sorani",
          "simple",
          "hungarian",
          "norwegian",
          "dutch",
          "chinese",
          "default",
          "estonian",
          "arabic",
          "bengali",
          "english",
          "fingerprint",
          "portuguese",
          "keyword",
          "romanian",
          "french",
          "czech",
          "greek",
          "indonesian",
          "swedish",
          "spanish",
          "danish",
          "russian",
          "cjk",
          "kuromoji",
          "armenian",
          "basque",
          "italian",
          "lithuanian",
          "thai",
          "persian",
          "catalan",
          "finnish",
          "stop",
          "brazilian",
          "turkish",
          "hindi",
          "bulgarian",
          "snowball",
          "whitespace",
          "galician",
          "icu_analyzer",
          "latvian"
        ]
      ],
      "tokenizers" : [
        [
          "standard",
          "lowercase",
          "kuromoji_tokenizer",
          "pattern",
          "thai",
          "uax_url_email",
          "PathHierarchy",
          "simple_pattern_split",
          "classic",
          "path_hierarchy",
          "edgeNGram",
          "nGram",
          "letter",
          "simple_pattern",
          "ngram",
          "keyword",
          "whitespace",
          "icu_tokenizer",
          "edge_ngram",
          "char_group"
        ]
      ],
      "tokenFilters" : [
        [
          "standard",
          "uppercase",
          "decimal_digit",
          "persian_normalization",
          "bengali_normalization",
          "flatten_graph",
          "kuromoji_readingform",
          "pattern_replace",
          "kuromoji_part_of_speech",
          "scandinavian_folding",
          "stemmer_override",
          "kuromoji_baseform",
          "multiplexer",
          "trim",
          "truncate",
          "fingerprint",
          "limit",
          "czech_stem",
          "word_delimiter_graph",
          "cjk_bigram",
          "german_normalization",
          "hindi_normalization",
          "pattern_capture",
          "kstem",
          "icu_collation",
          "arabic_stem",
          "condition",
          "stop",
          "min_hash",
          "hunspell",
          "brazilian_stem",
          "keep",
          "unique",
          "snowball",
          "edge_ngram",
          "icu_transform",
          "keyword_marker",
          "word_delimiter",
          "synonym_graph",
          "ja_stop",
          "kuromoji_number",
          "keep_types",
          "french_stem",
          "arabic_normalization",
          "elision",
          "icu_normalizer",
          "porter_stem",
          "sorani_normalization",
          "icu_folding",
          "hyphenation_decompounder",
          "stemmer",
          "synonym",
          "phonetic",
          "nGram",
          "german_stem",
          "delimited_payload",
          "cjk_width",
          "lowercase",
          "serbian_normalization",
          "scandinavian_normalization",
          "length",
          "remove_duplicates",
          "reverse",
          "apostrophe",
          "russian_stem",
          "dutch_stem",
          "kuromoji_stemmer",
          "classic",
          "edgeNGram",
          "predicate_token_filter",
          "asciifolding",
          "concatenate_graph",
          "indic_normalization",
          "shingle",
          "common_grams",
          "ngram",
          "dictionary_decompounder"
        ]
      ],
      "charFilters" : [
        [
          "mapping",
          "html_strip",
          "kuromoji_iteration_mark",
          "icu_normalizer",
          "pattern_replace"
        ]
      ],
      "normalizers" : [
        [
          "lowercase"
        ]
      ],
      "plugins" : {
        "plugin" : {
          "name" : "org.opensearch.analysis.common.CommonAnalysisPlugin",
          "analyzers" : [
            [
              "arabic",
              "armenian",
              "basque",
              "bengali",
              "brazilian",
              "bulgarian",
              "catalan",
              "chinese",
              "cjk",
              "czech",
              "danish",
              "dutch",
              "english",
              "estonian",
              "fingerprint",
              "finnish",
              "french",
              "galician",
              "german",
              "greek",
              "hindi",
              "hungarian",
              "indonesian",
              "irish",
              "italian",
              "latvian",
              "lithuanian",
              "norwegian",
              "pattern",
              "persian",
              "portuguese",
              "romanian",
              "russian",
              "snowball",
              "sorani",
              "spanish",
              "swedish",
              "thai",
              "turkish"
            ]
          ],
          "tokenizers" : [
            [
              "PathHierarchy",
              "char_group",
              "classic",
              "edgeNGram",
              "edge_ngram",
              "keyword",
              "letter",
              "lowercase",
              "nGram",
              "ngram",
              "path_hierarchy",
              "pattern",
              "simple_pattern",
              "simple_pattern_split",
              "thai",
              "uax_url_email",
              "whitespace"
            ]
          ],
          "tokenFilters" : [
            [
              "apostrophe",
              "arabic_normalization",
              "arabic_stem",
              "asciifolding",
              "bengali_normalization",
              "brazilian_stem",
              "cjk_bigram",
              "cjk_width",
              "classic",
              "common_grams",
              "concatenate_graph",
              "condition",
              "czech_stem",
              "decimal_digit",
              "delimited_payload",
              "dictionary_decompounder",
              "dutch_stem",
              "edgeNGram",
              "edge_ngram",
              "elision",
              "fingerprint",
              "flatten_graph",
              "french_stem",
              "german_normalization",
              "german_stem",
              "hindi_normalization",
              "hyphenation_decompounder",
              "indic_normalization",
              "keep",
              "keep_types",
              "keyword_marker",
              "kstem",
              "length",
              "limit",
              "lowercase",
              "min_hash",
              "multiplexer",
              "nGram",
              "ngram",
              "pattern_capture",
              "pattern_replace",
              "persian_normalization",
              "porter_stem",
              "predicate_token_filter",
              "remove_duplicates",
              "reverse",
              "russian_stem",
              "scandinavian_folding",
              "scandinavian_normalization",
              "serbian_normalization",
              "snowball",
              "sorani_normalization",
              "stemmer",
              "stemmer_override",
              "synonym",
              "synonym_graph",
              "trim",
              "truncate",
              "unique",
              "uppercase",
              "word_delimiter",
              "word_delimiter_graph"
            ]
          ],
          "charFilters" : [
            [
              "html_strip",
              "mapping",
              "pattern_replace"
            ]
          ],
          "hunspellDictionaries" : [
            [ ]
          ]
        },
        "plugin" : {
          "name" : "org.opensearch.plugin.analysis.AnalysisPhoneticPlugin",
          "analyzers" : [
            [ ]
          ],
          "tokenizers" : [
            [ ]
          ],
          "tokenFilters" : [
            [
              "phonetic"
            ]
          ],
          "charFilters" : [
            [ ]
          ],
          "hunspellDictionaries" : [
            [ ]
          ]
        },
        "plugin" : {
          "name" : "org.opensearch.plugin.analysis.icu.AnalysisICUPlugin",
          "analyzers" : [
            [
              "icu_analyzer"
            ]
          ],
          "tokenizers" : [
            [
              "icu_tokenizer"
            ]
          ],
          "tokenFilters" : [
            [
              "icu_normalizer",
              "icu_folding",
              "icu_transform",
              "icu_collation"
            ]
          ],
          "charFilters" : [
            [
              "icu_normalizer"
            ]
          ],
          "hunspellDictionaries" : [
            [ ]
          ]
        },
        "plugin" : {
          "name" : "org.opensearch.plugin.analysis.kuromoji.AnalysisKuromojiPlugin",
          "analyzers" : [
            [
              "kuromoji"
            ]
          ],
          "tokenizers" : [
            [
              "kuromoji_tokenizer"
            ]
          ],
          "tokenFilters" : [
            [
              "kuromoji_baseform",
              "kuromoji_stemmer",
              "ja_stop",
              "kuromoji_number",
              "kuromoji_readingform",
              "kuromoji_part_of_speech"
            ]
          ],
          "charFilters" : [
            [
              "kuromoji_iteration_mark"
            ]
          ],
          "hunspellDictionaries" : [
            [ ]
          ]
        }
      }
    }
  }
}
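
A client could post-process a response of this shape to see which components are not available on every node. The following is a minimal sketch, assuming the rc.1 layout shown above (each category is a list wrapped in an outer list), which may change in later versions. The function names `flatten` and `uneven_components` and the two-node sample are hypothetical and are not part of the plugin.

```python
from typing import Dict, Set

# Category keys as they appear in the example response above.
CATEGORIES = ("analyzers", "tokenizers", "tokenFilters", "charFilters", "normalizers")


def flatten(node_entry: dict) -> Dict[str, Set[str]]:
    """Collect component names per category for a single node."""
    result: Dict[str, Set[str]] = {}
    for category in CATEGORIES:
        names: Set[str] = set()
        for item in node_entry.get(category, []):
            # rc.1 wraps each list in an outer list; accept a flat list too.
            if isinstance(item, list):
                names.update(item)
            else:
                names.add(item)
        result[category] = names
    return result


def uneven_components(response: dict) -> Dict[str, Set[str]]:
    """Return, per category, names present on some nodes but not on all."""
    per_node = [flatten(entry) for entry in response["nodes"].values()]
    uneven: Dict[str, Set[str]] = {}
    if not per_node:
        return uneven
    for category in CATEGORIES:
        sets = [node[category] for node in per_node]
        difference = set().union(*sets) - set.intersection(*sets)
        if difference:
            uneven[category] = difference
    return uneven


# Hypothetical two-node response; only node-a has the kuromoji plugin.
sample = {
    "nodes": {
        "node-a": {"analyzers": [["standard", "kuromoji"]], "tokenizers": [["standard"]]},
        "node-b": {"analyzers": [["standard"]], "tokenizers": [["standard"]]},
    }
}
diff = uneven_components(sample)
```

For the sample, `diff` reports `kuromoji` under `analyzers`, which is the kind of per-node difference that motivates a nodes-level API.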

What is next?

  • First, I would like to get some feedback. I think this feature can be useful (self-documenting is a perfect example). If this functionality is found useful, I am more than happy to prepare a PR as a core component (and not as a standalone plugin).

  • Second, I would like to explore how much structural information I can get about individual analysis components. For example, I would like to be able to provide more information about the internals of analyzers (which tokenizers and filters an analyzer is composed of, whether it wraps another analyzer, etc.).

HTH,
Lukáš

@dtaivpp

dtaivpp commented Jan 5, 2023

@lukas-vlcek This looks slick! I will try it out and see about getting it socialized a bit so we can have some feedback.

@lukas-vlcek
Contributor Author

FYI, if you will be testing the 1.0.0-rc.1 plugin then please be aware of some known issues and also some fixes that are not included in that release. Of course you can always build the plugin from the source...

@dtaivpp

dtaivpp commented Jan 6, 2023

I am just building it myself. Two questions:

  1. @lukas-vlcek can you present this on 1/17 https://forum.opensearch.org/t/opensearch-community-meeting-2023-0117/11891

  2. Checking my understanding here. When I create a new index template with an analyzer should that show up in this list or is it just for core analyzers?

Example:

PUT _template/twitter
{
  "index_patterns": [
    "twitter*"
  ],
  "template": {
    "settings": {
      "analysis": {
        "analyzer": {
          "text_analyzer": {
            "tokenizer": "standard",
            "filter": [ "stop" ]
          }
        }
      }
    },
    "mappings": {}
  }
}

Here I was thinking text_analyzer would show in the list but it wasn't from what I could tell. I queried as both privileged and unprivileged users.

@lukas-vlcek
Contributor Author

lukas-vlcek commented Jan 7, 2023

@dtaivpp

  1. Yes, I am glad to present this work. Feel free to include me, I am already signed up for the meeting.
  2. The plugin pulls the list of all the "built-in" analysis components, that is, the text analysis building blocks available to all users (these are defined at the node level). It is by definition a static list: it does not change during the life of the cluster (apart from a rolling upgrade of the cluster, which can bring in a new version of OpenSearch or add another AnalysisPlugin). The suggested new API is primarily meant to provide complementary information for the documentation (or, as you pointed out earlier, "generated documentation"). The custom/ad-hoc analysis components defined at the index level, on the other hand, are a different thing. They can change frequently and they live at the index level (so maybe /_index/analyzers would be a more appropriate endpoint for such information). The biggest difference is that they are not available to all users: if a user does not have the privileges to see the index, then he/she should not see its analysis components either (some sensitive info could leak this way). Other users cannot "re-use" these analyzers either; they always have to recreate them on their own indices, etc.

@andrross andrross added distributed framework and removed Search Search query, autocomplete ...etc labels Jan 24, 2023
@andrross
Member

My two cents here is that implementing this as a core component is the right way to go architecturally. Unfortunately I missed the January 17 community meeting but is there any additional feedback to incorporate here as to the structure of the API itself?

@lukas-vlcek
Contributor Author

@andrross I think the best place to provide feedback about this functionality is here, in this ticket. I am going to release a new RC version because some issues with the output format have been fixed in the meantime, and OpenSearch 2.5 has been released as well.

Yes, I agree the best way would be to integrate it directly into the core. But as a proof of concept it was easier for me to implement it as a plugin because I did not have to care much about the frequent changes happening in the main branch.

As for the output format, I remember one piece of feedback was that this information could be part of the _cat API as opposed to introducing a new REST API. I liked this idea initially, but now I do not think it would be a good fit, mostly because I cannot think of a good response format for the _cat API.

@andrross
Member

> I think the best place to provide feedback about this functionality is here, in this ticket.

Agreed! I was just asking to capture any feedback from the meeting into this ticket :)

> ...could be part of the _cat API

Yeah it does seem to be hard to model the structured/nested data in this API in the CAT format. On that front though, the large JSON response payload isn't the most human-readable format, so some sort of admin UI would seem to be a good fit here.

@lukas-vlcek
Contributor Author

@macohen If there are any questions, feel free to ping me, I am happy to help.

@macohen
Contributor

macohen commented Mar 14, 2023

@lukas-vlcek are you planning to keep going on this? I would encourage that! Mostly I brought it into the Search Applications Vertical project because there's some alignment there in other ways.

@macohen
Contributor

macohen commented Mar 15, 2023

@andrross do you think the admin UI is required to launch this? @lukas-vlcek, are you able to turn this into a core component? I agree with Andrew because some analyzers are in core already.

@lukas-vlcek
Contributor Author

@macohen Making it a core component is perfectly possible and will make implementation a little bit more transparent/clean.

@msfroh
Collaborator

msfroh commented Mar 15, 2023

I just wanted to call out a related behavior that I just learned about for pipelines and processor plugins.

The NodeInfo class has a field of type IngestInfo that keeps track of ingest processors available on every node. When a new ingest pipeline is created, the node that receives the request fetches all of the NodeInfos, and confirms that every processor used in the pipeline is available on every node (to avoid a situation where some nodes fail to run the pipeline). I just added similar logic for search pipelines, since I copied the idea from ingest pipelines.

It feels like we would have a similar situation with analyzers, where you could specify an analyzer chain for a given field in your mapping, but it would only work reliably if every component of the chain is available on every node. @lukas-vlcek, do you know off-hand how mappings accomplish that? (I'm guessing there's got to be some kind of validation to make sure that analyzer plugins are installed everywhere, right?)

I'm wondering if there might be some opportunity to make the implementations more consistent between analyzers and processor pipelines (either putting everything into NodeInfo, so it all gets returned via the /_nodes API, or we could move the pipeline processor info into a sub-API like you've done here for analyzers, making NodeInfo a little smaller).
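
The kind of check described above (every component used must be available on every node) can be sketched as a simple set comparison. This is an illustration only, not the actual NodeInfo/IngestInfo logic in OpenSearch; the function name `missing_by_node` and the sample node IDs and component names are hypothetical.

```python
from typing import Dict, Iterable, Set


def missing_by_node(
    required: Iterable[str],
    available_by_node: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Map each node ID to the required components it does not provide."""
    required_set = set(required)
    missing: Dict[str, Set[str]] = {}
    for node_id, available in available_by_node.items():
        gap = required_set - set(available)
        if gap:
            missing[node_id] = gap
    return missing


# Hypothetical cluster: node n2 lacks the ICU plugin.
available = {
    "n1": {"standard", "icu_tokenizer", "lowercase"},
    "n2": {"standard", "lowercase"},
}
problems = missing_by_node(["icu_tokenizer", "lowercase"], available)
```

An empty result would mean the chain is safe to accept; a non-empty one names the nodes where the request would have to be rejected or the plugin installed.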

@lukas-vlcek
Contributor Author

@msfroh Thanks for looking at this. I am going to look at that.

lukas-vlcek added a commit to lukas-vlcek/OpenSearch that referenced this issue Jan 10, 2024
Adding a new option for NodeInfo request to include information about available
analysis components on individual nodes.

Closes opensearch-project#5481

Signed-off-by: Lukáš Vlček <[email protected]>
@kiranprakash154
Contributor

Hi, are we on track for this to be released in 2.12 ?

@lukas-vlcek
Contributor Author

Hi @kiranprakash154, it depends on when the code freeze for 2.12 is and whether we get more reviews on this PR. I am currently finishing the documentation PR; I will push it on Monday.

lukas-vlcek added a commit to lukas-vlcek/OpenSearch that referenced this issue Jan 25, 2024
Adding a new option for NodeInfo request to include information about available
analysis components on individual nodes.

Closes opensearch-project#5481

Signed-off-by: Lukáš Vlček <[email protected]>
lukas-vlcek added a commit to lukas-vlcek/OpenSearch that referenced this issue Jan 29, 2024
Adding a new option for NodeInfo request to include information about available
analysis components on individual nodes.

Closes opensearch-project#5481

Signed-off-by: Lukáš Vlček <[email protected]>
@macohen
Contributor

macohen commented Jan 29, 2024

Code freeze for 2.12 is Feb 6th. @lukas-vlcek do you need any assistance to get this in?

lukas-vlcek added a commit to lukas-vlcek/OpenSearch that referenced this issue Feb 2, 2024
Adding a new option for NodeInfo request to include information about available
analysis components on individual nodes.

Closes opensearch-project#5481

Signed-off-by: Lukáš Vlček <[email protected]>
lukas-vlcek added a commit to lukas-vlcek/OpenSearch that referenced this issue Feb 2, 2024
Adding a new option for NodeInfo request to include information about available
analysis components on individual nodes.

Closes opensearch-project#5481

Signed-off-by: Lukáš Vlček <[email protected]>
@lukas-vlcek
Contributor Author

@macohen

lukas-vlcek added a commit to lukas-vlcek/OpenSearch that referenced this issue Feb 3, 2024
Adding a new option for NodeInfo request to include information about available
analysis components on individual nodes.

Closes opensearch-project#5481

Signed-off-by: Lukáš Vlček <[email protected]>
lukas-vlcek added a commit to lukas-vlcek/OpenSearch that referenced this issue Feb 3, 2024
Adding a new option for NodeInfo request to include information about available
analysis components on individual nodes.

Closes opensearch-project#5481

Signed-off-by: Lukáš Vlček <[email protected]>
lukas-vlcek added a commit to lukas-vlcek/OpenSearch that referenced this issue Feb 4, 2024
Adding a new option for NodeInfo request to include information about available
analysis components on individual nodes.

Closes opensearch-project#5481

Signed-off-by: Lukáš Vlček <[email protected]>
lukas-vlcek added a commit to lukas-vlcek/OpenSearch that referenced this issue Feb 4, 2024
Adding a new option for NodeInfo request to include information about available
analysis components on individual nodes.

Closes opensearch-project#5481

Signed-off-by: Lukáš Vlček <[email protected]>
lukas-vlcek added a commit to lukas-vlcek/OpenSearch that referenced this issue Feb 5, 2024
Adding a new option for NodeInfo request to include information about available
analysis components on individual nodes.

Closes opensearch-project#5481

Signed-off-by: Lukáš Vlček <[email protected]>
@bbarani bbarani added v2.13.0 Issues and PRs related to version 2.13.0 and removed v2.12.0 Issues and PRs related to version 2.12.0 labels Feb 19, 2024
@hdhalter

hdhalter commented Mar 6, 2024

@macohen - Can we please bump this up to release train 2.13?

@hdhalter

> @macohen - Can we please bump this up to release train 2.13?

Since this is still on the 2.13 roadmap, I'll move it to the 2.13 release train in the project.

@hdhalter

Hi @lukas-vlcek, we closed the doc issue for "List Analyzers Through _cat" (opensearch-project/documentation-website#5426), but I don't think there was a doc issue for this one specifically. Are we releasing this in 2.13, and will it need documentation? Thanks!

@dblock
Member

dblock commented Mar 12, 2024

#10296 is next to be merged, then we can update documentation for 2.13 accordingly.

@getsaurabh02 getsaurabh02 added v2.14.0 and removed v2.13.0 Issues and PRs related to version 2.13.0 labels Apr 10, 2024
@getsaurabh02
Member

@lukas-vlcek should I update the tag to 2.15?

@getsaurabh02 getsaurabh02 added v2.15.0 Issues and PRs related to version 2.15.0 and removed v2.14.0 labels May 6, 2024
@getsaurabh02 getsaurabh02 added v2.16.0 Issues and PRs related to version 2.16.0 and removed v2.15.0 Issues and PRs related to version 2.15.0 labels Jun 18, 2024
Projects
Status: 2.14.0 (Launched)
Status: 👀 In review
Status: Planned work items