Get list of all available analyzers. Request for a new API? #5481
@lukas-vlcek I couldn't agree more. What may even be interesting is if we could start to expose some of these through a generated "cluster documentation" page in dashboards. That could show a bit of the cluster's meta information.
BTW, I am looking at this and I am trying to implement a quick prototype. Feel free to assign me.
Hi, I prepared an experimental plugin with this functionality.
At this point I would love to get some feedback. Below are more details about what this plugin can offer.

How does it work?

Right now, every OpenSearch node has an internal analysis registry. Internally, the analysis registry contains Maps whose key sets are the names of the available analysis components. The content of those Maps/keySets depends on which plugins are installed on the node.

Notice

Because it is implemented as a plugin, I had to use the reflection API to gain access to information that is not exposed to plugins.

Example

Imagine OpenSearch with the following plugins installed:

GET http://localhost:9200/_cat/plugins?v
This yields the following comprehensive list of analysis components:

GET http://localhost:9200/_nodes/analyzers?pretty

{
"_nodes" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"cluster_name" : "opensearch",
"nodes" : {
"SSlN30D0RUmDieqwlmp4RA" : {
"analyzers" : [
[
"standard",
"german",
"irish",
"pattern",
"sorani",
"simple",
"hungarian",
"norwegian",
"dutch",
"chinese",
"default",
"estonian",
"arabic",
"bengali",
"english",
"fingerprint",
"portuguese",
"keyword",
"romanian",
"french",
"czech",
"greek",
"indonesian",
"swedish",
"spanish",
"danish",
"russian",
"cjk",
"kuromoji",
"armenian",
"basque",
"italian",
"lithuanian",
"thai",
"persian",
"catalan",
"finnish",
"stop",
"brazilian",
"turkish",
"hindi",
"bulgarian",
"snowball",
"whitespace",
"galician",
"icu_analyzer",
"latvian"
]
],
"tokenizers" : [
[
"standard",
"lowercase",
"kuromoji_tokenizer",
"pattern",
"thai",
"uax_url_email",
"PathHierarchy",
"simple_pattern_split",
"classic",
"path_hierarchy",
"edgeNGram",
"nGram",
"letter",
"simple_pattern",
"ngram",
"keyword",
"whitespace",
"icu_tokenizer",
"edge_ngram",
"char_group"
]
],
"tokenFilters" : [
[
"standard",
"uppercase",
"decimal_digit",
"persian_normalization",
"bengali_normalization",
"flatten_graph",
"kuromoji_readingform",
"pattern_replace",
"kuromoji_part_of_speech",
"scandinavian_folding",
"stemmer_override",
"kuromoji_baseform",
"multiplexer",
"trim",
"truncate",
"fingerprint",
"limit",
"czech_stem",
"word_delimiter_graph",
"cjk_bigram",
"german_normalization",
"hindi_normalization",
"pattern_capture",
"kstem",
"icu_collation",
"arabic_stem",
"condition",
"stop",
"min_hash",
"hunspell",
"brazilian_stem",
"keep",
"unique",
"snowball",
"edge_ngram",
"icu_transform",
"keyword_marker",
"word_delimiter",
"synonym_graph",
"ja_stop",
"kuromoji_number",
"keep_types",
"french_stem",
"arabic_normalization",
"elision",
"icu_normalizer",
"porter_stem",
"sorani_normalization",
"icu_folding",
"hyphenation_decompounder",
"stemmer",
"synonym",
"phonetic",
"nGram",
"german_stem",
"delimited_payload",
"cjk_width",
"lowercase",
"serbian_normalization",
"scandinavian_normalization",
"length",
"remove_duplicates",
"reverse",
"apostrophe",
"russian_stem",
"dutch_stem",
"kuromoji_stemmer",
"classic",
"edgeNGram",
"predicate_token_filter",
"asciifolding",
"concatenate_graph",
"indic_normalization",
"shingle",
"common_grams",
"ngram",
"dictionary_decompounder"
]
],
"charFilters" : [
[
"mapping",
"html_strip",
"kuromoji_iteration_mark",
"icu_normalizer",
"pattern_replace"
]
],
"normalizers" : [
[
"lowercase"
]
],
"plugins" : {
"plugin" : {
"name" : "org.opensearch.analysis.common.CommonAnalysisPlugin",
"analyzers" : [
[
"arabic",
"armenian",
"basque",
"bengali",
"brazilian",
"bulgarian",
"catalan",
"chinese",
"cjk",
"czech",
"danish",
"dutch",
"english",
"estonian",
"fingerprint",
"finnish",
"french",
"galician",
"german",
"greek",
"hindi",
"hungarian",
"indonesian",
"irish",
"italian",
"latvian",
"lithuanian",
"norwegian",
"pattern",
"persian",
"portuguese",
"romanian",
"russian",
"snowball",
"sorani",
"spanish",
"swedish",
"thai",
"turkish"
]
],
"tokenizers" : [
[
"PathHierarchy",
"char_group",
"classic",
"edgeNGram",
"edge_ngram",
"keyword",
"letter",
"lowercase",
"nGram",
"ngram",
"path_hierarchy",
"pattern",
"simple_pattern",
"simple_pattern_split",
"thai",
"uax_url_email",
"whitespace"
]
],
"tokenFilters" : [
[
"apostrophe",
"arabic_normalization",
"arabic_stem",
"asciifolding",
"bengali_normalization",
"brazilian_stem",
"cjk_bigram",
"cjk_width",
"classic",
"common_grams",
"concatenate_graph",
"condition",
"czech_stem",
"decimal_digit",
"delimited_payload",
"dictionary_decompounder",
"dutch_stem",
"edgeNGram",
"edge_ngram",
"elision",
"fingerprint",
"flatten_graph",
"french_stem",
"german_normalization",
"german_stem",
"hindi_normalization",
"hyphenation_decompounder",
"indic_normalization",
"keep",
"keep_types",
"keyword_marker",
"kstem",
"length",
"limit",
"lowercase",
"min_hash",
"multiplexer",
"nGram",
"ngram",
"pattern_capture",
"pattern_replace",
"persian_normalization",
"porter_stem",
"predicate_token_filter",
"remove_duplicates",
"reverse",
"russian_stem",
"scandinavian_folding",
"scandinavian_normalization",
"serbian_normalization",
"snowball",
"sorani_normalization",
"stemmer",
"stemmer_override",
"synonym",
"synonym_graph",
"trim",
"truncate",
"unique",
"uppercase",
"word_delimiter",
"word_delimiter_graph"
]
],
"charFilters" : [
[
"html_strip",
"mapping",
"pattern_replace"
]
],
"hunspellDictionaries" : [
[ ]
]
},
"plugin" : {
"name" : "org.opensearch.plugin.analysis.AnalysisPhoneticPlugin",
"analyzers" : [
[ ]
],
"tokenizers" : [
[ ]
],
"tokenFilters" : [
[
"phonetic"
]
],
"charFilters" : [
[ ]
],
"hunspellDictionaries" : [
[ ]
]
},
"plugin" : {
"name" : "org.opensearch.plugin.analysis.icu.AnalysisICUPlugin",
"analyzers" : [
[
"icu_analyzer"
]
],
"tokenizers" : [
[
"icu_tokenizer"
]
],
"tokenFilters" : [
[
"icu_normalizer",
"icu_folding",
"icu_transform",
"icu_collation"
]
],
"charFilters" : [
[
"icu_normalizer"
]
],
"hunspellDictionaries" : [
[ ]
]
},
"plugin" : {
"name" : "org.opensearch.plugin.analysis.kuromoji.AnalysisKuromojiPlugin",
"analyzers" : [
[
"kuromoji"
]
],
"tokenizers" : [
[
"kuromoji_tokenizer"
]
],
"tokenFilters" : [
[
"kuromoji_baseform",
"kuromoji_stemmer",
"ja_stop",
"kuromoji_number",
"kuromoji_readingform",
"kuromoji_part_of_speech"
]
],
"charFilters" : [
[
"kuromoji_iteration_mark"
]
],
"hunspellDictionaries" : [
[ ]
]
}
}
}
}
}

What is next?
HTH,
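The reflection trick mentioned in the "Notice" above can be sketched like this. Note the `Registry` class below is a made-up stand-in for illustration, not the actual OpenSearch `AnalysisRegistry`; only the general pattern (opening a private field and exposing its key set) matches what the comment describes.

```java
import java.lang.reflect.Field;
import java.util.Map;
import java.util.Set;

// Made-up stand-in for a class that keeps component maps in a private field;
// the real AnalysisRegistry internals are not part of any public plugin API.
class Registry {
    private final Map<String, Object> analyzers =
            Map.of("standard", new Object(), "keyword", new Object());
}

public class ReflectionSketch {
    @SuppressWarnings("unchecked")
    static Set<String> analyzerNames(Registry registry) {
        try {
            // A plugin cannot read the private field directly, but reflection
            // can open it up and expose just the key set (the component names).
            Field f = Registry.class.getDeclaredField("analyzers");
            f.setAccessible(true);
            return ((Map<String, Object>) f.get(registry)).keySet();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(analyzerNames(new Registry()));
    }
}
```

This also hints at why integrating into core (discussed below) is cleaner: core code can read the registry directly instead of punching through access control.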
@lukas-vlcek This looks slick! I will try it out and see about getting it socialized a bit so we can have some feedback.
FYI, if you are going to test the 1.0.0-rc.1 plugin, please be aware of some known issues and also some fixes that are not included in that release. Of course, you can always build the plugin from the source...
I am just building it myself. Two questions:

Example:

PUT _template/twitter
{
"index_patterns": [
"twitter*"
],
"template": {
"settings": {
"analysis": {
"analyzer": {
"text_analyzer": {
"tokenizer": "standard",
"filter": [ "stop" ]
}
}
}
},
"mappings": {}
}
}

Here I was thinking …
My two cents here is that implementing this as a core component is the right way to go architecturally. Unfortunately I missed the January 17 community meeting, but is there any additional feedback to incorporate here as to the structure of the API itself?
@andrross I think the best place to provide feedback about this functionality is here, in this ticket. I am going to release a new RC version because some issues with the output format have been fixed in the meantime, and OpenSearch 2.5 has been released as well. Yes, I agree the best way would be to integrate it directly into the core, but as a proof of concept it was easier for me to implement it as a plugin because I did not have to care much about frequent upstream changes. As for the output format, I remember one piece of feedback was that this information could be part of the _cat API.
Agreed! I was just asking to capture any feedback from the meeting into this ticket :)
Yeah, it does seem to be hard to model the structured/nested data in this API in the CAT format. On that front though, the large JSON response payload isn't the most human-readable format either, so some sort of admin UI would seem to be a good fit here.
@macohen If there are any questions, feel free to ping me, I am happy to help.
@lukas-vlcek are you planning to keep going on this? I would encourage that! Mostly I brought it into the Search Applications Vertical project because there's some alignment there in other ways.
@andrross do you think the admin UI is required to launch this? @lukas-vlcek, are you able to turn this into a core component? I agree with Andrew because some analyzers are in core already.
@macohen Making it a core component is perfectly possible and would make the implementation a little more transparent and clean.
I just wanted to call out a related behavior that I just learned about for pipelines and processor plugins. It feels like we would have a similar situation with analyzers, where you could specify an analyzer chain for a given field in your mapping, but it would only work reliably if every component of the chain is available on every node. @lukas-vlcek, do you know off-hand how mappings accomplish that? (I'm guessing there's got to be some kind of validation to make sure that analyzer plugins are installed everywhere, right?) I'm wondering if there might be some opportunity to make the implementations more consistent between analyzers and processor pipelines (either putting everything into …).
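The per-node check being asked about could look roughly like the following sketch. The method name, error message, and registered set are all illustrative, not the actual OpenSearch mapping-validation code; the point is simply that a mapping referencing an unknown analyzer name should fail fast on a node where the component is missing.

```java
import java.util.Set;

public class AnalyzerValidation {
    // Illustrative only: reject a mapping that references an analyzer name
    // that is not registered on this node (e.g. because its plugin is not
    // installed here, even though it may be installed on other nodes).
    static void validateAnalyzer(String requested, Set<String> registeredOnNode) {
        if (!registeredOnNode.contains(requested)) {
            throw new IllegalArgumentException(
                    "analyzer [" + requested + "] is not available on this node");
        }
    }

    public static void main(String[] args) {
        // Hypothetical node with only core analyzers plus the ICU plugin.
        Set<String> registered = Set.of("standard", "keyword", "icu_analyzer");
        validateAnalyzer("standard", registered); // passes silently
        try {
            validateAnalyzer("kuromoji", registered); // plugin not installed here
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```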
@msfroh Thanks for looking at this. I am going to look at that. |
Adding a new option for NodeInfo request to include information about available analysis components on individual nodes. Closes opensearch-project#5481 Signed-off-by: Lukáš Vlček <[email protected]>
Hi, are we on track for this to be released in 2.12?
Hi @kiranprakash154, it depends on when the code freeze for 2.12 is and whether we get more reviews on this PR. I am currently finishing the documentation PR; I will push it on Monday.
Code freeze for 2.12 is Feb 6th. @lukas-vlcek do you need any assistance to get this in?
@macohen - Can we please bump this up to release train 2.13?
Since this is still on the 2.13 roadmap, I'll move it to the 2.13 release train in the project.
Hi @lukas-vlcek, we closed the doc issue for "List Analyzers Through _cat" (opensearch-project/documentation-website#5426), but I don't think there was a doc issue for this one specifically. Are we releasing this in 2.13, and will it need documentation? Thanks!
#10296 is next to be merged, then we can update the documentation for 2.13 accordingly.
@lukas-vlcek should I update the tag to 2.15?
Is your feature request related to a problem? Please describe.
I am missing an option to get a list of all available analyzers.
There is already documentation for the analyze API, and it mentions "built-in" analyzers. But for a normal user there is no way to learn what all the options are. And even for people who are familiar with the code, the list is subject to updates/changes.
One option would be to document the list on a documentation page. But I think I would prefer it if the cluster itself could give a list of its analyzers (and also tokenizers, char filters, normalizers).
Describe the solution you'd like
As far as I understand, the list of all built-in analyzers is materialized once `AnalysisModule.setupAnalyzers(plugins)` is called. I think it would be useful to extend one of the `_nodes/` APIs and give it an option to return the list of all built-in analyzers (and their components: tokenizers, etc.). (It needs to be an API at the "nodes" level because the AnalysisRegistry is kept per node, and I think the list of built-in analyzers can differ on each node depending on installed plugins.) As for the list of analyzers defined at the index level, I am not sure at this point. Maybe later...
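Since the per-node lists can differ, a client consuming such an API would typically intersect them to find the analyzers that are safe to use cluster-wide. A minimal sketch, with invented node IDs and name lists (the nested-list response shape follows the example output earlier in this thread):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CommonAnalyzers {
    // Intersect per-node analyzer name lists: a name survives only if every
    // node reports it, i.e. the component is installed everywhere.
    static Set<String> common(Map<String, List<String>> analyzersByNode) {
        Set<String> result = null;
        for (List<String> names : analyzersByNode.values()) {
            if (result == null) {
                result = new HashSet<>(names); // first node seeds the set
            } else {
                result.retainAll(names);       // keep only shared names
            }
        }
        return result == null ? Set.of() : result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> byNode = Map.of(
                "node-1", List.of("standard", "keyword", "icu_analyzer"),
                "node-2", List.of("standard", "keyword"));
        System.out.println(common(byNode)); // analyzers present on every node
    }
}
```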
Describe alternatives you've considered
The alternative is to go to the documentation (which does not have this list) or to the code (which is not an option for many people).
Additional context
n/a