
Added lenient flag for synonym token filter #31484

Merged: 6 commits, Jul 10, 2018

Conversation

sohaibiftikhar
Contributor

Relates to #30968

@cbuescher
Member

add to whitelist


```java
@Override
public void add(CharsRef input, CharsRef output, boolean includeOrig) {
    if (!lenient || (input.length > 0 && output.length > 0)) {
```
Contributor Author

@sohaibiftikhar sohaibiftikhar Jun 21, 2018


@nik9000 I am not sure if this looks very clean but it avoids a fork of the SolrSynonymParser.

Member


> @nik I am not sure if this looks very clean but it avoids a fork of the SolrSynonymParser.

Wrong Nik, sadly.

Contributor Author


Ah corrected it. But I still got the attention of the right one ;)

@colings86 added the >enhancement and :Search Relevance/Analysis (How text is split into tokens) labels Jun 21, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@mayya-sharipova mayya-sharipova self-requested a review June 22, 2018 12:51
Additional settings are:

* `expand` (defaults to `true`).
* `lenient` (defaults to `false`). If `true` ignores exceptions while parsing the synonym configuration.

This filter tokenizes synonyms with whatever tokenizer and token filters
Contributor


Do we want to add a description of the `lenient` setting to synonym-graph-tokenfilter.asciidoc as well?

Contributor Author


Yes. I missed that in the initial commit. Thanks!


```java
public class ElasticSynonymParser extends SolrSynonymParser {

    private boolean lenient;
```
Contributor


Do we want to make the `lenient` variable `final` as well?


```java
@Override
public void add(CharsRef input, CharsRef output, boolean includeOrig) {
    if (!lenient || (input.length > 0 && output.length > 0)) {
```
Contributor


We usually write `!lenient` as `lenient == false`.

Contributor


@sohaibiftikhar I am not sure I understand this condition (`!lenient || (input.length > 0 && output.length > 0)`), can you please explain what you are trying to achieve here?

Contributor Author

@sohaibiftikhar sohaibiftikhar Jun 22, 2018


So if `lenient` is not set we should proceed as before. For the case when `lenient` is set: if there are exceptions during `analyze` we return a zero-length `CharsRef`. When these synonym mappings are added using the `add` method, we skip the ones that were left empty (`super.add` would throw exceptions otherwise). Does this make sense?
I believe that if the condition were only `input.length > 0 && output.length > 0` it would still work, because with the non-lenient setting execution never reaches this point with a zero-length input or output: an exception would already have been thrown during `analyze`. But I wanted to be sure that I only change the logic when `lenient` is set.
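A minimal sketch of the skip-empty-mapping logic described above (an illustration only, not the actual Elasticsearch class: Lucene's `CharsRef` is stood in for by plain `String`, and the synonym map by a list of rules):

```java
// Simplified sketch of the lenient add() logic described above.
import java.util.ArrayList;
import java.util.List;

class LenientSynonymSketch {
    private final boolean lenient;
    private final List<String[]> rules = new ArrayList<>();

    LenientSynonymSketch(boolean lenient) {
        this.lenient = lenient;
    }

    // Mirrors the condition from the PR: in lenient mode, silently skip
    // mappings whose input or output was reduced to zero length by the
    // analysis chain (e.g. because a stop word was removed). In non-lenient
    // mode the real code would already have thrown during analyze(), so this
    // condition leaves that path unchanged.
    void add(String input, String output) {
        if (lenient == false || (input.length() > 0 && output.length() > 0)) {
            rules.add(new String[] { input, output });
        }
    }

    int ruleCount() {
        return rules.size();
    }
}
```

With `lenient` set, adding `("foo", "baz")` keeps the rule while adding `("", "baz")` (where the input was analyzed away) is skipped.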

Contributor


@sohaibiftikhar Thanks for the explanation, it does make sense and the logic is correct, but the condition by itself is not obvious. I would either provide a more extensive explanation before the condition, or instead override the `addInternal` method of `SolrSynonymParser`, where the logic can be obvious.

Contributor Author

@sohaibiftikhar sohaibiftikhar Jun 22, 2018


I think overriding `addInternal` may not be the best idea: the method is private, so it would mean duplicating the logic (it calls some other private methods that we'd need to copy as well), unless we want to go for a fork of `SolrSynonymParser`.

I will add the explanation above the condition as you suggested.

Contributor

@mayya-sharipova mayya-sharipova left a comment


Overall, the PR is well implemented and addresses the problem. Just some minor changes are necessary.


Member

@nik9000 nik9000 left a comment


Oops! I had some comments that I forgot to click "submit" on.


```java
import java.io.IOException;

public class ElasticSynonymParser extends SolrSynonymParser {
```
Member


`ElasticsearchSynonymParser` is probably better.

```java
public ElasticSynonymParser(boolean dedup, boolean expand, boolean lenient, Analyzer analyzer) {
    super(dedup, expand, analyzer);
    this.lenient = lenient;
    logger = Loggers.getLogger(getClass(), "ElasticSynonymParser");
```
Member


A static logger should be fine here. The point of these methods is to make loggers look good when referring to a particular shard.

@nik9000
Member

nik9000 commented Jun 22, 2018

@jpountz, would you like to take a look at this one too because we'd talked about it a while back?

Contributor

@mayya-sharipova mayya-sharipova left a comment


@sohaibiftikhar Thanks! LGTM

Contributor

@jpountz jpountz left a comment


This looks good in general. Can we make the documentation more specific about the fact that only illegal synonyms are going to be ignored, not the entire rule? For instance, if you have a rule that is `foo, bar => baz` and `bar` is a stop word, then Elasticsearch will handle it as if there were a `foo => baz` rule. And could we add tests for this case?

@sohaibiftikhar
Contributor Author

@jpountz You mean in the asciidoc?

@jpountz
Contributor

jpountz commented Jun 25, 2018

Yes.

@sohaibiftikhar
Contributor Author

@nik9000 @mayya-sharipova @jpountz Looking at this just now I realised that with this PR we only support the `lenient` option for the `SolrSynonymParser`. Would we need to support it for the `WordnetSynonymParser` as well? I think that since the underlying class (`SynonymMap.Parser`) is the same, the implementation should be very similar.

@mayya-sharipova
Contributor

mayya-sharipova commented Jun 25, 2018

@sohaibiftikhar Yes, I think we should add the `lenient` flag for `WordnetSynonymParser` as well. Good job noticing this!

```java
public CharsRef analyze(String text, CharsRefBuilder reuse) throws IOException {
    try {
        return super.analyze(text, reuse);
    } catch (IllegalArgumentException ex) {
```
Contributor Author


I understand that there is code duplication in this class, but I don't know how to avoid it in this situation.
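The lenient `analyze` wrapper shown in the hunk above can be sketched in isolation (a simplified illustration: the real method delegates to Lucene's `SynonymMap.Parser#analyze` and returns a zero-length `CharsRef`, while here both the analysis chain and the return type are reduced to `String`, and the stop word is hard-coded):

```java
// Simplified sketch of the lenient analyze() wrapper discussed above.
class LenientAnalyzeSketch {
    // Stand-in for the strict analysis step, which raises
    // IllegalArgumentException when a term is completely eliminated
    // (here we pretend "bar" is a stop word).
    static String strictAnalyze(String text) {
        if ("bar".equals(text)) {
            throw new IllegalArgumentException("term: " + text + " was completely eliminated by analyzer");
        }
        return text.toLowerCase();
    }

    // Lenient wrapper: swallow the parse error and return a zero-length
    // result, which the add() override then skips.
    static String lenientAnalyze(String text) {
        try {
            return strictAnalyze(text);
        } catch (IllegalArgumentException ex) {
            return "";
        }
    }
}
```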

```
"synonym_graph" : {
    "type" : "synonym_graph",
    "lenient": true,
    "synonyms" : ["foo, bar => baz"]
```
Contributor Author


However, if it were `foo, baz => bar` or `bar, foo, baz`, no mapping would be added. Is this acceptable?

Contributor


@sohaibiftikhar Yes, I think it is acceptable, as long as we state it in the documentation.

Contributor


In the case of `foo, baz => bar` I think this is expected, but the other case is a bit unexpected to me; I would expect it to behave like `foo, baz`?

Contributor Author


In that case, the parser maps everything to the first word. Similarly, for the `WordnetSynonymParser`, if you have something like

```
s(100000001,1,'bar',v,1,0).
s(100000001,2,'foo',v,1,0).
s(100000001,3,'baz',v,1,0).
```

everything gets mapped to `bar`. But since that is a stop word, no mappings get added.

Contributor


> In that case, the parser maps everything to the first word.

This is only true if `expand` is explicitly set to `false`. By default, all combinations are generated; see this recreation for instance:

```
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_syns": {
          "type": "synonym_graph",
          "synonyms": [ "foo, bar, baz" ]
        }
      },
      "analyzer": {
        "my_syn_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "my_syns" ]
        }
      }
    }
  }
}

GET test/_analyze
{
  "analyzer": "my_syn_analyzer",
  "text": "a bar"
}

// Returns:

{
  "tokens": [
    {
      "token": "a",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "foo",
      "start_offset": 2,
      "end_offset": 5,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "baz",
      "start_offset": 2,
      "end_offset": 5,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "bar",
      "start_offset": 2,
      "end_offset": 5,
      "type": "word",
      "position": 1
    }
  ]
}
```

Contributor Author

@sohaibiftikhar sohaibiftikhar Jun 29, 2018


Sorry for the late reply; I somehow missed this.
Yes, I forgot to mention the `expand` part in the documentation. Should I add that, or remove the example altogether?

// CONSOLE
With the above request the word `bar` gets skipped, but a mapping `foo => baz` is still added. However, if the mapping
being added were `foo, baz => bar` or `bar, foo, baz`, nothing would get added to the synonym list. This is because the
target word of the mapping is itself eliminated as a stop word.
Contributor


@sohaibiftikhar I think you can keep your example as it is; it is fine. Just in the explanation after the example you can elaborate more on the `expand` option, e.g. with `bar, foo, baz` nothing will be added when `expand=false`, and `foo, baz => foo, baz` would be added if `expand=true`.

Contributor Author


Done.
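To make the `expand` discussion above concrete, here is a small sketch of how a comma rule like `foo, bar, baz` turns into mappings under the two settings (an assumption about Solr-format synonym semantics as described in this thread, not Elasticsearch's actual parser code):

```java
// Sketch of the expand semantics discussed above: for "foo, bar, baz",
// expand=true maps every term to every term, expand=false maps every term
// to the first term only.
import java.util.ArrayList;
import java.util.List;

class ExpandSketch {
    static List<String[]> rules(List<String> terms, boolean expand) {
        List<String[]> out = new ArrayList<>();
        for (String input : terms) {
            if (expand) {
                for (String output : terms) {
                    out.add(new String[] { input, output });
                }
            } else {
                // Everything maps to the first word; if that word is then
                // eliminated as a stop word under lenient, nothing survives.
                out.add(new String[] { input, terms.get(0) });
            }
        }
        return out;
    }
}
```

This is why `bar, foo, baz` with `expand=false` yields no mappings at all under `lenient` when `bar` is a stop word: every generated pair targets the eliminated first word.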

-- Also made lenient final
-- changed from !lenient to lenient == false
-- Renamed to ElasticsearchSynonymParser
-- Added explanation for ElasticsearchSynonymParser::add method
-- Changed ElasticsearchSynonymParser::logger instance to static
-- also added more documentation
@mayya-sharipova
Contributor

@sohaibiftikhar Thanks, +1 from me. @jpountz Do you have any other comments on this PR, or can we merge it?

@mayya-sharipova mayya-sharipova merged commit 88c270d into elastic:master Jul 10, 2018
@sohaibiftikhar sohaibiftikhar deleted the synonym_token_filter branch July 10, 2018 23:35
@mayya-sharipova
Contributor

mayya-sharipova commented Jul 11, 2018

@sohaibiftikhar We would like to backport these changes to 6.x as well. Would you mind creating another PR for 6.x (mostly a copy of this one)?
I could not simply backport the commit of this PR, as there are inconsistencies in the affected classes between 6.x and master.

@sohaibiftikhar
Contributor Author

sohaibiftikhar commented Jul 11, 2018

@mayya-sharipova Okay. I will create a patch by next week.

dnhatn added a commit that referenced this pull request Jul 12, 2018
* master:
  [TEST] Mute SlackMessageTests.testTemplateRender
  Docs: Explain closing the high level client
  [ML] Re-enable memory limit integration tests (#31328)
  [test] disable packaging tests for suse boxes
  Add nio transport to security plugin (#31942)
  XContentTests : Insert random fields at random positions (#30867)
  Force execution of fetch tasks (#31974)
  Fix unreachable error condition in AmazonS3Fixture (#32005)
  Tests: Fix SearchFieldsIT.testDocValueFields (#31995)
  Add Expected Reciprocal Rank metric (#31891)
  [ML] Get ForecastRequestStats doc in RestoreModelSnapshotIT (#31973)
  SQL: Add support for single parameter text manipulating functions (#31874)
  [ML] Ensure immutability of MlMetadata (#31957)
  Tests: Mute SearchFieldsIT.testDocValueFields()
  muted tests due to #31940
  Work around reported problem in eclipse (#31960)
  Move build integration tests out of :buildSrc project (#31961)
  Tests: Remove use of joda time in some tests (#31922)
  [Test] Reactive 3rd party tests on CI (#31919)
  SQL: Support for escape sequences (#31884)
  SQL: HAVING clause should accept only aggregates (#31872)
  Docs: fix typo in datehistogram (#31972)
  Switch url repository rest tests to new style requests (#31944)
  Switch reindex tests to new style requests (#31941)
  Docs: Added note about cloud service to installation and getting started
  [DOCS] Removes alternative docker pull example (#31934)
  Add Snapshots Status API to High Level Rest Client (#31515)
  ingest: date_index_name processor template resolution (#31841)
  Test: fix null failure in watcher test (#31968)
  Switch test framework to new style requests (#31939)
  Switch low level rest tests to new style Requests (#31938)
  Switch high level rest tests to new style requests (#31937)
  [ML] Mute test failing due to Java 11 date time format parsing bug (#31899)
  [TEST] Mute SlackMessageTests.testTemplateRender
  Fix assertIngestDocument wrongfully passing (#31913)
  Remove unused reference to filePermissionsCache (#31923)
  rolling upgrade should use a replica to prevent relocations while running a scroll
  HLREST: Bundle the x-pack protocol project (#31904)
  Increase logging level for testStressMaybeFlush
  Added lenient flag for synonym token filter (#31484)
  [X-Pack] Beats centralized management: security role + licensing (#30520)
  HLRest: Move xPackInfo() to xPack().info() (#31905)
  Docs: add security delete role to api call table (#31907)
  [test] port archive distribution packaging tests (#31314)
  Watcher: Slack message empty text (#31596)
  [ML] Mute failing DetectionRulesIT.testCondition() test
  Fix broken NaN check in MovingFunctions#stdDev() (#31888)
  Date: Add DateFormatters class that uses java.time (#31856)
  [ML] Switch native QA tests to a 3 node cluster (#31757)
  Change trappy float comparison (#31889)
  Fix building AD URL from domain name (#31849)
  Add opaque_id to audit logging (#31878)
  re-enable backcompat tests
  add support for is_write_index in put-alias body parsing (#31674)
  Improve release notes script (#31833)
  [DOCS] Fix broken link in painless example
  Handle missing values in painless (#30975)
  Remove the ability to index or query context suggestions without context (#31007)
  Ingest: Enable Templated Fieldnames in Rename (#31690)
  [Docs] Fix typo in the Rollup API Quick Reference (#31855)
  Ingest: Add ignore_missing option to RemoveProc (#31693)
  Add template config for Beat state to X-Pack Monitoring (#31809)
  Watcher: Add ssl.trust email account setting (#31684)
  Remove link to oss-MSI (#31844)
  Painless: Restructure Definition/Whitelist (#31879)
  HLREST: Add x-pack-info API (#31870)
mayya-sharipova pushed a commit that referenced this pull request Jul 13, 2018
* Added lenient flag for synonym-tokenfilter.

Relates to #30968

* added docs for synonym-graph-tokenfilter

-- Also made lenient final
-- changed from !lenient to lenient == false

* Changes after review (1)

-- Renamed to ElasticsearchSynonymParser
-- Added explanation for ElasticsearchSynonymParser::add method
-- Changed ElasticsearchSynonymParser::logger instance to static

* Added lenient option for WordnetSynonymParser

-- also added more documentation

* Added additional documentation

* Improved documentation

(cherry picked from commit 88c270d)
dnhatn added a commit that referenced this pull request Jul 13, 2018
* 6.x:
  Watcher: Make settings reloadable (#31746)
  [Rollup] Histo group config should support scaled_floats (#32048)
  lazy snapshot repository initialization (#31606)
  Add secure setting for watcher email password (#31620)
  Watcher: cleanup ensureWatchExists use (#31926)
  Add second level of field collapsing (#31808)
  Added lenient flag for synonym token filter (#31484) (#31970)
  Test: Fix a second case of bad watch creation
  [Rollup] Use composite's missing_bucket (#31402)
  Docs: Restyled cloud link in getting started
  Docs: Change formatting of Cloud options
  Re-instate link in StringFunctionUtils javadocs
  Correct spelling of AnalysisPlugin#requriesAnalysisSettings (#32025)
  Fix problematic chars in javadoc
  [ML] Move open job failure explanation out of root cause (#31925)
  [ML] Switch ML native QA tests to use a 3 node cluster (#32011)

7 participants