
Added lenient flag for synonym token filter #31484

Merged: 6 commits, Jul 10, 2018

Conversation

sohaibiftikhar
Contributor

Relates to #30968

@cbuescher
Member

add to whitelist


```java
@Override
public void add(CharsRef input, CharsRef output, boolean includeOrig) {
    if (!lenient || (input.length > 0 && output.length > 0)) {
```
Contributor Author

@sohaibiftikhar sohaibiftikhar Jun 21, 2018


@nik9000 I am not sure if this looks very clean but it avoids a fork of the SolrSynonymParser.

Member


> @nik I am not sure if this looks very clean but it avoids a fork of the SolrSynonymParser.

Wrong Nik, sadly.

Contributor Author


Ah corrected it. But I still got the attention of the right one ;)

@colings86 added the >enhancement and :Search Relevance/Analysis (How text is split into tokens) labels Jun 21, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@mayya-sharipova mayya-sharipova self-requested a review June 22, 2018 12:51
Additional settings are:

* `expand` (defaults to `true`).
* `lenient` (defaults to `false`). If `true` ignores exceptions while parsing the synonym configuration.

This filter tokenizes synonyms with whatever tokenizer and token filters
Contributor


Do we want to add a description of the `lenient` setting to synonym-graph-tokenfilter.asciidoc as well?

Contributor Author


Yes. I missed that in the initial commit. Thanks!


```java
public class ElasticSynonymParser extends SolrSynonymParser {

    private boolean lenient;
```
Contributor


Do we want to make the `lenient` variable `final` as well?


```java
@Override
public void add(CharsRef input, CharsRef output, boolean includeOrig) {
    if (!lenient || (input.length > 0 && output.length > 0)) {
```
Contributor


We usually write `!lenient` as `lenient == false`.

Contributor


@sohaibiftikhar I am not sure I understand this condition (`!lenient || (input.length > 0 && output.length > 0)`), can you please explain what you are trying to achieve here?

Contributor Author

@sohaibiftikhar sohaibiftikhar Jun 22, 2018


So if `lenient` is not set we should proceed as before. For the case when `lenient` is set: if there are exceptions during `analyze` we return a zero-length `CharsRef`. When these synonym mappings are added using the `add` method, we skip the ones that were left empty (`super.add` would throw exceptions otherwise). Does this make sense?
I believe that if the condition were only `input.length > 0 && output.length > 0` it would still work, because with the non-lenient setting execution never reaches this point with a zero-length input or output: an exception would already have been thrown during `analyze`. But I wanted to be sure that I only change the logic when `lenient` is set.
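A minimal sketch of the skip-empty-mapping logic described above (an illustration only, not the actual Elasticsearch class: Lucene's `CharsRef` is stood in for by plain `String`, and the synonym map by a list of rules):

```java
// Simplified sketch of the lenient add() logic described above.
import java.util.ArrayList;
import java.util.List;

class LenientSynonymSketch {
    private final boolean lenient;
    private final List<String[]> rules = new ArrayList<>();

    LenientSynonymSketch(boolean lenient) {
        this.lenient = lenient;
    }

    // Mirrors the condition from the PR: in lenient mode, silently skip
    // mappings whose input or output was reduced to zero length by the
    // analysis chain (e.g. because a stop word was removed). In non-lenient
    // mode the real code would already have thrown during analyze(), so this
    // condition leaves that path unchanged.
    void add(String input, String output) {
        if (lenient == false || (input.length() > 0 && output.length() > 0)) {
            rules.add(new String[] { input, output });
        }
    }

    int ruleCount() {
        return rules.size();
    }
}
```

With `lenient` set, adding `("foo", "baz")` keeps the rule while adding `("", "baz")` (where the input was analyzed away) is skipped.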

Contributor


@sohaibiftikhar Thanks for the explanation, it does make sense and the logic is correct, but the condition by itself is not obvious. I would either provide a more extensive explanation before the condition, or instead override the `addInternal` method of `SolrSynonymParser`, where the logic can be obvious.

Contributor Author

@sohaibiftikhar sohaibiftikhar Jun 22, 2018


I think overriding `addInternal` may not be the best idea: the method is private, so it would mean duplicating the logic (it calls some other private methods that we'd need to copy as well), unless we want to go for a fork of `SolrSynonymParser`.

I will add the explanation above the condition as you suggested.

Contributor

@mayya-sharipova mayya-sharipova left a comment


Overall, the PR is well implemented and addresses the problem. Just some minor changes are necessary.


Member

@nik9000 nik9000 left a comment


Oops! I had some comments that I forgot to click "submit" on.


```java
import java.io.IOException;

public class ElasticSynonymParser extends SolrSynonymParser {
```
Member


`ElasticsearchSynonymParser` is probably better.

```java
public ElasticSynonymParser(boolean dedup, boolean expand, boolean lenient, Analyzer analyzer) {
    super(dedup, expand, analyzer);
    this.lenient = lenient;
    logger = Loggers.getLogger(getClass(), "ElasticSynonymParser");
```
Member


A static logger should be fine here. The point of these methods is to make loggers look good when referring to a particular shard.

@nik9000
Member

nik9000 commented Jun 22, 2018

@jpountz, would you like to take a look at this one too because we'd talked about it a while back?

Contributor

@mayya-sharipova mayya-sharipova left a comment


@sohaibiftikhar Thanks! LGTM

Contributor

@jpountz jpountz left a comment


This looks good in general. Can we make the documentation more specific about the fact that only illegal synonyms are going to be ignored, not the entire rule? For instance, if you have a rule that is `foo, bar => baz` and `bar` is a stop word, then Elasticsearch will handle it as if there were a `foo => baz` rule. And could we add tests for this case?

@sohaibiftikhar
Contributor Author

@jpountz You mean in the asciidoc?

@jpountz
Contributor

jpountz commented Jun 25, 2018

Yes.

@sohaibiftikhar
Contributor Author

@nik9000 @mayya-sharipova @jpountz Looking at this just now I realised that with this PR we only support the `lenient` option for the `SolrSynonymParser`. Would we need to support it for the `WordnetSynonymParser` as well? I think that since the underlying class (`SynonymMap.Parser`) is the same, the implementation should be very similar.

@mayya-sharipova
Contributor

mayya-sharipova commented Jun 25, 2018

@sohaibiftikhar Yes, I think we should add the `lenient` flag for `WordnetSynonymParser` as well. Good job noticing this!

```java
public CharsRef analyze(String text, CharsRefBuilder reuse) throws IOException {
    try {
        return super.analyze(text, reuse);
    } catch (IllegalArgumentException ex) {
```
Contributor Author


I understand that there is code duplication in this class, but I don't know how to avoid it in this situation.
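The lenient `analyze` wrapper shown in the hunk above can be sketched in isolation (a simplified illustration: the real method delegates to Lucene's `SynonymMap.Parser#analyze` and returns a zero-length `CharsRef`, while here both the analysis chain and the return type are reduced to `String`, and the stop word is hard-coded):

```java
// Simplified sketch of the lenient analyze() wrapper discussed above.
class LenientAnalyzeSketch {
    // Stand-in for the strict analysis step, which raises
    // IllegalArgumentException when a term is completely eliminated
    // (here we pretend "bar" is a stop word).
    static String strictAnalyze(String text) {
        if ("bar".equals(text)) {
            throw new IllegalArgumentException("term: " + text + " was completely eliminated by analyzer");
        }
        return text.toLowerCase();
    }

    // Lenient wrapper: swallow the parse error and return a zero-length
    // result, which the add() override then skips.
    static String lenientAnalyze(String text) {
        try {
            return strictAnalyze(text);
        } catch (IllegalArgumentException ex) {
            return "";
        }
    }
}
```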

```
"synonym_graph" : {
    "type" : "synonym_graph",
    "lenient": true,
    "synonyms" : ["foo, bar => baz"]
```
Contributor Author


However, if it were `foo, baz => bar` or `bar, foo, baz`, no mapping would be added. Is this acceptable?

Contributor


@sohaibiftikhar Yes, I think it is acceptable, as long as we state it in the documentation.

Contributor


In the case of `foo, baz => bar` I think this is expected, but the other case is a bit unexpected to me; I would expect it to behave like `foo, baz`?

Contributor Author


In that case, the parser maps everything to the first word. Similarly, for the `WordnetSynonymParser`, if you have something like

```
s(100000001,1,'bar',v,1,0).
s(100000001,2,'foo',v,1,0).
s(100000001,3,'baz',v,1,0).
```

everything gets mapped to `bar`. But since that is a stop word, no mappings get added.

Contributor


> In that case, the parser maps everything to the first word.

This is only true if `expand` is explicitly set to `false`. By default, all combinations are generated; see this recreation for instance:

```
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_syns": {
          "type": "synonym_graph",
          "synonyms": [ "foo, bar, baz" ]
        }
      },
      "analyzer": {
        "my_syn_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "my_syns" ]
        }
      }
    }
  }
}

GET test/_analyze
{
  "analyzer": "my_syn_analyzer",
  "text": "a bar"
}

// Returns:

{
  "tokens": [
    {
      "token": "a",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "foo",
      "start_offset": 2,
      "end_offset": 5,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "baz",
      "start_offset": 2,
      "end_offset": 5,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "bar",
      "start_offset": 2,
      "end_offset": 5,
      "type": "word",
      "position": 1
    }
  ]
}
```

Contributor Author

@sohaibiftikhar sohaibiftikhar Jun 29, 2018


Sorry for the late reply; I somehow missed this.
Yes, I forgot to mention the `expand` part in the documentation. Should I add that, or remove the example altogether?

// CONSOLE
With the above request the word `bar` gets skipped, but a mapping `foo => baz` is still added. However, if the mapping
being added were `foo, baz => bar` or `bar, foo, baz`, nothing would get added to the synonym list. This is because the
target word of the mapping is itself eliminated as a stop word.
Contributor


@sohaibiftikhar I think you can keep your example as it is; it is fine. Just in the explanation after the example you can elaborate more on the `expand` option, e.g. with `bar, foo, baz` nothing will be added when `expand=false`, and `foo, baz => foo, baz` would be added if `expand=true`.

Contributor Author


Done.
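To make the `expand` discussion above concrete, here is a small sketch of how a comma rule like `foo, bar, baz` turns into mappings under the two settings (an assumption about Solr-format synonym semantics as described in this thread, not Elasticsearch's actual parser code):

```java
// Sketch of the expand semantics discussed above: for "foo, bar, baz",
// expand=true maps every term to every term, expand=false maps every term
// to the first term only.
import java.util.ArrayList;
import java.util.List;

class ExpandSketch {
    static List<String[]> rules(List<String> terms, boolean expand) {
        List<String[]> out = new ArrayList<>();
        for (String input : terms) {
            if (expand) {
                for (String output : terms) {
                    out.add(new String[] { input, output });
                }
            } else {
                // Everything maps to the first word; if that word is then
                // eliminated as a stop word under lenient, nothing survives.
                out.add(new String[] { input, terms.get(0) });
            }
        }
        return out;
    }
}
```

This is why `bar, foo, baz` with `expand=false` yields no mappings at all under `lenient` when `bar` is a stop word: every generated pair targets the eliminated first word.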

-- Also made lenient final
-- changed from !lenient to lenient == false
-- Renamed to ElasticsearchSynonymParser
-- Added explanation for ElasticsearchSynonymParser::add method
-- Changed ElasticsearchSynonymParser::logger instance to static
-- also added more documentation
@mayya-sharipova
Contributor

@sohaibiftikhar Thanks, +1 from me. @jpountz Do you have any other comments on this PR, or can we merge it?

@mayya-sharipova mayya-sharipova merged commit 88c270d into elastic:master Jul 10, 2018
@sohaibiftikhar sohaibiftikhar deleted the synonym_token_filter branch July 10, 2018 23:35
@mayya-sharipova
Contributor

mayya-sharipova commented Jul 11, 2018

@sohaibiftikhar We would like to backport these changes to 6.x as well. Would you mind creating another PR for 6.x (mostly a copy of this one)?
I could not simply backport the commit of this PR, as there are inconsistencies in the affected classes between 6.x and master.

@sohaibiftikhar
Contributor Author

sohaibiftikhar commented Jul 11, 2018

@mayya-sharipova Okay. I will create a patch by next week.

dnhatn added a commit that referenced this pull request Jul 12, 2018
* master:
  [TEST] Mute SlackMessageTests.testTemplateRender
  Docs: Explain closing the high level client
  [ML] Re-enable memory limit integration tests (#31328)
  [test] disable packaging tests for suse boxes
  Add nio transport to security plugin (#31942)
  XContentTests : Insert random fields at random positions (#30867)
  Force execution of fetch tasks (#31974)
  Fix unreachable error condition in AmazonS3Fixture (#32005)
  Tests: Fix SearchFieldsIT.testDocValueFields (#31995)
  Add Expected Reciprocal Rank metric (#31891)
  [ML] Get ForecastRequestStats doc in RestoreModelSnapshotIT (#31973)
  SQL: Add support for single parameter text manipulating functions (#31874)
  [ML] Ensure immutability of MlMetadata (#31957)
  Tests: Mute SearchFieldsIT.testDocValueFields()
  muted tests due to #31940
  Work around reported problem in eclipse (#31960)
  Move build integration tests out of :buildSrc project (#31961)
  Tests: Remove use of joda time in some tests (#31922)
  [Test] Reactive 3rd party tests on CI (#31919)
  SQL: Support for escape sequences (#31884)
  SQL: HAVING clause should accept only aggregates (#31872)
  Docs: fix typo in datehistogram (#31972)
  Switch url repository rest tests to new style requests (#31944)
  Switch reindex tests to new style requests (#31941)
  Docs: Added note about cloud service to installation and getting started
  [DOCS] Removes alternative docker pull example (#31934)
  Add Snapshots Status API to High Level Rest Client (#31515)
  ingest: date_index_name processor template resolution (#31841)
  Test: fix null failure in watcher test (#31968)
  Switch test framework to new style requests (#31939)
  Switch low level rest tests to new style Requests (#31938)
  Switch high level rest tests to new style requests (#31937)
  [ML] Mute test failing due to Java 11 date time format parsing bug (#31899)
  [TEST] Mute SlackMessageTests.testTemplateRender
  Fix assertIngestDocument wrongfully passing (#31913)
  Remove unused reference to filePermissionsCache (#31923)
  rolling upgrade should use a replica to prevent relocations while running a scroll
  HLREST: Bundle the x-pack protocol project (#31904)
  Increase logging level for testStressMaybeFlush
  Added lenient flag for synonym token filter (#31484)
  [X-Pack] Beats centralized management: security role + licensing (#30520)
  HLRest: Move xPackInfo() to xPack().info() (#31905)
  Docs: add security delete role to api call table (#31907)
  [test] port archive distribution packaging tests (#31314)
  Watcher: Slack message empty text (#31596)
  [ML] Mute failing DetectionRulesIT.testCondition() test
  Fix broken NaN check in MovingFunctions#stdDev() (#31888)
  Date: Add DateFormatters class that uses java.time (#31856)
  [ML] Switch native QA tests to a 3 node cluster (#31757)
  Change trappy float comparison (#31889)
  Fix building AD URL from domain name (#31849)
  Add opaque_id to audit logging (#31878)
  re-enable backcompat tests
  add support for is_write_index in put-alias body parsing (#31674)
  Improve release notes script (#31833)
  [DOCS] Fix broken link in painless example
  Handle missing values in painless (#30975)
  Remove the ability to index or query context suggestions without context (#31007)
  Ingest: Enable Templated Fieldnames in Rename (#31690)
  [Docs] Fix typo in the Rollup API Quick Reference (#31855)
  Ingest: Add ignore_missing option to RemoveProc (#31693)
  Add template config for Beat state to X-Pack Monitoring (#31809)
  Watcher: Add ssl.trust email account setting (#31684)
  Remove link to oss-MSI (#31844)
  Painless: Restructure Definition/Whitelist (#31879)
  HLREST: Add x-pack-info API (#31870)
mayya-sharipova pushed a commit that referenced this pull request Jul 13, 2018
* Added lenient flag for synonym-tokenfilter.

Relates to #30968

* added docs for synonym-graph-tokenfilter

-- Also made lenient final
-- changed from !lenient to lenient == false

* Changes after review (1)

-- Renamed to ElasticsearchSynonymParser
-- Added explanation for ElasticsearchSynonymParser::add method
-- Changed ElasticsearchSynonymParser::logger instance to static

* Added lenient option for WordnetSynonymParser

-- also added more documentation

* Added additional documentation

* Improved documentation

(cherry picked from commit 88c270d)
dnhatn added a commit that referenced this pull request Jul 13, 2018
* 6.x:
  Watcher: Make settings reloadable (#31746)
  [Rollup] Histo group config should support scaled_floats (#32048)
  lazy snapshot repository initialization (#31606)
  Add secure setting for watcher email password (#31620)
  Watcher: cleanup ensureWatchExists use (#31926)
  Add second level of field collapsing (#31808)
  Added lenient flag for synonym token filter (#31484) (#31970)
  Test: Fix a second case of bad watch creation
  [Rollup] Use composite's missing_bucket (#31402)
  Docs: Restyled cloud link in getting started
  Docs: Change formatting of Cloud options
  Re-instate link in StringFunctionUtils javadocs
  Correct spelling of AnalysisPlugin#requriesAnalysisSettings (#32025)
  Fix problematic chars in javadoc
  [ML] Move open job failure explanation out of root cause (#31925)
  [ML] Switch ML native QA tests to use a 3 node cluster (#32011)

7 participants