Docs: Document how to rebuild analyzers
Adds documentation for how to rebuild all the built-in analyzers and
tests for that documentation using the mechanism added in elastic#29535.

Closes elastic#29499
nik9000 committed May 9, 2018
1 parent a3c5c5d commit bbae44c
Showing 7 changed files with 260 additions and 90 deletions.
48 changes: 19 additions & 29 deletions docs/reference/analysis/analyzers/fingerprint-analyzer.asciidoc
@@ -9,20 +9,6 @@ Input text is lowercased, normalized to remove extended characters, sorted,
deduplicated and concatenated into a single token. If a stopword list is
configured, stop words will also be removed.

[float]
=== Example output

@@ -150,32 +136,38 @@ The above example produces the following term:
[ consistent godel said sentence yes ]
---------------------------

[float]
=== Definition

The `fingerprint` analyzer consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters (in order)::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-asciifolding-tokenfilter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
* <<analysis-fingerprint-tokenfilter>>

If you need to customize the `fingerprint` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. This would recreate the built-in
`fingerprint` analyzer and you can use it as a starting point for further
customization:

[source,js]
----------------------------------------------------
PUT /fingerprint_example
{
"settings": {
"analysis": {
"filter": {
"fingerprint_stop": {
"type": "stop",
"stopwords": "_english_" <1>
}
},
"analyzer": {
"rebuilt_fingerprint": {
"tokenizer": "standard",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"fingerprint_stop",
"fingerprint"
]
}
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: fingerprint_example, first: fingerprint, second: rebuilt_fingerprint}\nendyaml\n/]
<1> The default stopwords can be overridden with the `stopwords`
or `stopwords_path` parameters.
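
As a quick sanity check (a sketch assuming the `fingerprint_example` index
above has been created), you can run the sample sentence through the rebuilt
analyzer with the `_analyze` API; it should return the same single fingerprint
term shown in the example output above:

[source,js]
----------------------------------------------------
GET /fingerprint_example/_analyze
{
  "analyzer": "rebuilt_fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------------------------------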
45 changes: 37 additions & 8 deletions docs/reference/analysis/analyzers/keyword-analyzer.asciidoc
@@ -4,14 +4,6 @@
The `keyword` analyzer is a ``noop'' analyzer which returns the entire input
string as a single token.

[float]
=== Example output

@@ -57,3 +49,40 @@ The above sentence would produce the following single term:
=== Configuration

The `keyword` analyzer is not configurable.

[float]
=== Definition

The `keyword` analyzer consists of:

Tokenizer::
* <<analysis-keyword-tokenizer,Keyword Tokenizer>>

If you need to customize the `keyword` analyzer then you need to
recreate it as a `custom` analyzer and modify it, usually by adding
token filters. Usually, you should prefer the
<<keyword, Keyword type>> when you want strings that are not split
into tokens, but just in case you need it, this would recreate
the built-in `keyword` analyzer and you can use it as a starting
point for further customization:

[source,js]
----------------------------------------------------
PUT /keyword_example
{
"settings": {
"analysis": {
"analyzer": {
"rebuilt_keyword": {
"tokenizer": "keyword",
"filter": [ <1>
]
}
}
}
}
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: keyword_example, first: keyword, second: rebuilt_keyword}\nendyaml\n/]
<1> You'd add any token filters here.
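
For instance, a sketch of the kind of customization meant by <1>, using a
hypothetical `keyword_lowercase_example` index: adding the `lowercase` token
filter keeps the whole string as a single token but lowercases it:

[source,js]
----------------------------------------------------
PUT /keyword_lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "keyword_lowercase": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
----------------------------------------------------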
61 changes: 48 additions & 13 deletions docs/reference/analysis/analyzers/pattern-analyzer.asciidoc
@@ -19,19 +19,6 @@ Read more about http://www.regular-expressions.info/catastrophic.html[pathologic
========================================


[float]
=== Example output

@@ -378,3 +365,51 @@ The regex above is easier to understand as:
[\p{L}&&[^\p{Lu}]] # then lower case
)
--------------------------------------------------

[float]
=== Definition

The `pattern` analyzer consists of:

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

If you need to customize the `pattern` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. This would recreate the built-in
`pattern` analyzer and you can use it as a starting point for further
customization:

[source,js]
----------------------------------------------------
PUT /pattern_example
{
"settings": {
"analysis": {
"tokenizer": {
"split_on_non_word": {
"type": "pattern",
"stopwords": "\\W+" <1>
}
},
"analyzer": {
"rebuilt_pattern": {
"tokenizer": "split_on_non_word",
"filter": [
"lowercase" <2>
]
}
}
}
}
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: pattern_example, first: pattern, second: rebuilt_pattern}\nendyaml\n/]
<1> The default pattern is `\W+`, which splits on non-word characters;
this is where you'd change it.
<2> You'd add other token filters after `lowercase`.
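
For example, a sketch of the change described in <1>, using a hypothetical
`pattern_csv_example` index: setting the pattern to a comma gives a simple
CSV-style analyzer:

[source,js]
----------------------------------------------------
PUT /pattern_csv_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_comma": {
          "type": "pattern",
          "pattern": ","
        }
      },
      "analyzer": {
        "csv_pattern": {
          "tokenizer": "split_on_comma",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
----------------------------------------------------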
42 changes: 34 additions & 8 deletions docs/reference/analysis/analyzers/simple-analyzer.asciidoc
@@ -4,14 +4,6 @@
The `simple` analyzer breaks text into terms whenever it encounters a
character which is not a letter. All terms are lower cased.

[float]
=== Example output

@@ -127,3 +119,37 @@ The above sentence would produce the following terms:
=== Configuration

The `simple` analyzer is not configurable.

[float]
=== Definition

The `simple` analyzer consists of:

Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>

If you need to customize the `simple` analyzer then you need to recreate
it as a `custom` analyzer and modify it, usually by adding token filters.
This would recreate the built-in `simple` analyzer and you can use it as
a starting point for further customization:

[source,js]
----------------------------------------------------
PUT /simple_example
{
"settings": {
"analysis": {
"analyzer": {
"rebuilt_simple": {
"tokenizer": "lowercase",
"filter": [ <1>
]
}
}
}
}
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: simple_example, first: simple, second: rebuilt_simple}\nendyaml\n/]
<1> You'd add any token filters here.
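
As an example of the customization meant by <1>, a sketch with a hypothetical
`simple_folding_example` index: adding the `asciifolding` token filter folds
accented characters to their ASCII equivalents:

[source,js]
----------------------------------------------------
PUT /simple_folding_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "simple_with_folding": {
          "tokenizer": "lowercase",
          "filter": [
            "asciifolding"
          ]
        }
      }
    }
  }
}
----------------------------------------------------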
54 changes: 41 additions & 13 deletions docs/reference/analysis/analyzers/standard-analyzer.asciidoc
@@ -7,19 +7,6 @@ Segmentation algorithm, as specified in
http://unicode.org/reports/tr29/[Unicode Standard Annex #29]) and works well
for most languages.

[float]
=== Example output

@@ -276,3 +263,44 @@ The above example produces the following terms:
---------------------------
[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
---------------------------

[float]
=== Definition

The `standard` analyzer consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters::
* <<analysis-standard-tokenfilter,Standard Token Filter>>
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

If you need to customize the `standard` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. This would recreate the built-in
`standard` analyzer and you can use it as a starting point:

[source,js]
----------------------------------------------------
PUT /standard_example
{
"settings": {
"analysis": {
"analyzer": {
"rebuilt_standard": {
"tokenizer": "standard",
"filter": [
"standard",
"lowercase" <1>
]
}
}
}
}
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: standard_example, first: standard, second: rebuilt_standard}\nendyaml\n/]
<1> You'd add any token filters after `lowercase`.
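
As an example of the customization meant by <1>, a sketch with a hypothetical
`standard_english_example` index: appending an English stop filter after
`lowercase` drops common English words:

[source,js]
----------------------------------------------------
PUT /standard_english_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "standard_with_stop": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "english_stop"
          ]
        }
      }
    }
  }
}
----------------------------------------------------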
58 changes: 47 additions & 11 deletions docs/reference/analysis/analyzers/stop-analyzer.asciidoc
@@ -5,17 +5,6 @@ The `stop` analyzer is the same as the <<analysis-simple-analyzer,`simple` analy
but adds support for removing stop words. It defaults to using the
`_english_` stop words.

[float]
=== Example output

@@ -239,3 +228,50 @@ The above example produces the following terms:
---------------------------
[ quick, brown, foxes, jumped, lazy, dog, s, bone ]
---------------------------

[float]
=== Definition

The `stop` analyzer consists of:

Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>

Token filters::
* <<analysis-stop-tokenfilter,Stop Token Filter>>

If you need to customize the `stop` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. This would recreate the built-in
`stop` analyzer and you can use it as a starting point for further
customization:

[source,js]
----------------------------------------------------
PUT /stop_example
{
"settings": {
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_" <1>
}
},
"analyzer": {
"rebuilt_stop": {
"tokenizer": "lowercase",
"filter": [
"english_stop" <2>
]
}
}
}
}
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: stop_example, first: stop, second: rebuilt_stop}\nendyaml\n/]
<1> The default stopwords can be overridden with the `stopwords`
or `stopwords_path` parameters.
<2> You'd add any token filters after `english_stop`.
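
To illustrate <1>, a sketch with a hypothetical `stop_custom_example` index:
the same rebuilt analyzer with an explicit stopword list in place of
`_english_`:

[source,js]
----------------------------------------------------
PUT /stop_custom_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": [ "and", "is", "the" ]
        }
      },
      "analyzer": {
        "custom_stop": {
          "tokenizer": "lowercase",
          "filter": [
            "my_stop"
          ]
        }
      }
    }
  }
}
----------------------------------------------------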