Analysis ICU Plugin #151

Closed
kimchy opened this issue Apr 27, 2010 · 1 comment
kimchy commented Apr 27, 2010

A plugin that uses ICU (http://icu-project.org/) to provide Unicode normalization, collation, and folding. The plugin includes the following analysis token filters:

ICU Normalization:

Normalizes characters as explained here: http://userguide.icu-project.org/transforms/normalization. By default it registers itself under icu_normalizer or icuNormalizer, using the default settings. An optional name parameter selects the normalization form and accepts the following values: nfc, nfkc, and nfkc_cf.

Sample setting:

curl -XPUT 'http://localhost:9200/test_1/'  -d '
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_normalizer"]
                }
            }
        }
    }
}
'
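
To pick a specific normalization form, the name parameter can be set on a custom filter definition. The following is only a sketch: the filter name my_normalizer and the analyzer name normalized are hypothetical, and it assumes that the custom filter type matches the registered name icu_normalizer and that a node with the plugin installed is listening on localhost:9200.

```shell
# Settings for an index whose analyzer applies NFKC normalization with case folding.
# "normalized" and "my_normalizer" are hypothetical names chosen for illustration.
SETTINGS='
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "normalized" : {
                    "tokenizer" : "keyword",
                    "filter" : ["my_normalizer"]
                }
            },
            "filter" : {
                "my_normalizer" : {
                    "type" : "icu_normalizer",
                    "name" : "nfkc_cf"
                }
            }
        }
    }
}
'

# Create the index; report a failure instead of silently continuing.
curl -XPUT 'http://localhost:9200/test_1/' -d "$SETTINGS" \
    || echo "index creation request failed; is the node running?" >&2
```

Replacing nfkc_cf with nfc or nfkc selects one of the other forms listed above.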

ICU Folding:

Folds Unicode characters based on UTR#30. It registers itself under the names icu_folding and icuFolding. Sample setting:

curl -XPUT 'http://localhost:9200/test_1/'  -d '
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_folding"]
                }
            }
        }
    }
}
'

ICU Collation:

Uses the collation token filter. The collation can be configured in one of two ways. The first is the rules parameter, which takes collation rules (defined here: http://www.icu-project.org/userguide/Collate_Customization.html) either expressed inline in the settings or read from a file location, which can be relative to the config location. The second is the language parameter, which selects a locale and can be further specialized by country and variant. By default the filter registers under icu_collation or icuCollation and uses the default locale.

Sample settings:

curl -XPUT 'http://localhost:9200/test_1/'  -d '
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_collation"]
                }
            }
        }
    }
}
'

Sample custom collation:

curl -XPUT 'http://localhost:9200/test_1/'  -d '
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["myCollator"]
                }
            },
            "filter" : {
                "myCollator" : {
                    "type" : "icu_collation",
                    "language" : "en"
                }
            }
        }
    }
}
'
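
The language parameter can be narrowed with country, as described above. The following is only a sketch: the filter name germanCollator and the analyzer name german_sort are hypothetical, the locale values de and DE are illustrative, and it assumes the custom filter type matches the registered name icu_collation. A variant setting could be added alongside country in the same way.

```shell
# Settings for an index whose analyzer produces German (Germany) collation keys.
# "german_sort" and "germanCollator" are hypothetical names chosen for illustration.
SETTINGS='
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "german_sort" : {
                    "tokenizer" : "keyword",
                    "filter" : ["germanCollator"]
                }
            },
            "filter" : {
                "germanCollator" : {
                    "type" : "icu_collation",
                    "language" : "de",
                    "country" : "DE"
                }
            }
        }
    }
}
'

# Create the index; report a failure instead of silently continuing.
curl -XPUT 'http://localhost:9200/test_1/' -d "$SETTINGS" \
    || echo "index creation request failed; is the node running?" >&2
```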

kimchy commented Apr 27, 2010

Analysis ICU Plugin, closed by 11e4ad9

rmuir pushed a commit to rmuir/elasticsearch that referenced this issue Nov 8, 2015
costin pushed a commit that referenced this issue Dec 6, 2022
🤖 ESQL: Merge upstream
This issue was closed.