Tokens generated after token filters ignore match query operator option #25746

pmishev · 2017-07-17T10:18:45Z

Elasticsearch version: 2.4, 5.5

Plugins installed: []

JVM version: 1.8.0_131

OS version: 4.4.0-81-generic #104-Ubuntu x86_64

Description of the problem including expected versus actual behavior:

When "operator": "and" is specified in a match query, ALL tokens generated by the search analyzer should be looked for in the indexed tokens.
However, tokens generated by token filters behave differently: the query looks for ANY of those tokens.

Steps to reproduce:

PUT /test1
{
  "settings": {
    "analysis": {
      "filter": {
        "pattern_filter": {
          "type": "pattern_capture",
          "patterns": [
            "(\\p{L}+)"
          ]
        }
      },
      "analyzer": {
        "my_analyzer1": {
          "tokenizer": "uax_url_email",
          "filter": [
            "pattern_filter"
          ]
        },
        "my_analyzer2": {
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "emails": {
      "properties": {
        "email1": {
          "type": "string",
          "analyzer": "my_analyzer1"
        },
        "email2": {
          "type": "string",
          "analyzer": "my_analyzer2"
        }
      }
    }
  }
}

GET test1/emails/_validate/query?explain
{
  "query": {
    "match": {
      "email2": {
        "query": "[email protected]",
        "operator": "and"
      }
    }
  }
}

gives "explanation": "+(+email2:somebody +email2:we +email2:example.com) #ConstantScore(+ConstantScore(_type:emails))", which is correct

GET test1/emails/_validate/query?explain
{
  "query": {
    "match": {
      "email1": {
        "query": "[email protected]",
        "operator": "and"
      }
    }
  }
}

gives "explanation": "+(email1:[email protected] email1:somebody email1:we email1:example email1:com) #ConstantScore(+ConstantScore(_type:emails))" in ES 2.4

or "explanation": "+Synonym(email1:com email1:example email1:somebody email1:[email protected] email1:we) #_type:emails" in ES 5.5

I couldn't find a reason for such behaviour documented anywhere and I believe it is wrong and the correct explanation should be:
+(+email1:[email protected] +email1:somebody +email1:we +email1:example +email1:com) #ConstantScore(+ConstantScore(_type:emails))


cbuescher · 2017-07-17T11:35:25Z

@pmishev Could you elaborate on what you think is wrong with the behaviour in 5.5? Looking at your uax_url_email tokenizer and the subsequent pattern filter, the query looks okay to me:

POST /test1/_analyze
{
  "analyzer": "my_analyzer1", 
  "text" : "[email protected]"
}

{
  "tokens": [
    {
      "token": "[email protected]",
      "start_offset": 0,
      "end_offset": 23,
      "type": "<EMAIL>",
      "position": 0
    },
    {
      "token": "somebody",
      "start_offset": 0,
      "end_offset": 23,
      "type": "<EMAIL>",
      "position": 0
    },
    {
      "token": "we",
      "start_offset": 0,
      "end_offset": 23,
      "type": "<EMAIL>",
      "position": 0
    },
    {
      "token": "example",
      "start_offset": 0,
      "end_offset": 23,
      "type": "<EMAIL>",
      "position": 0
    },
    {
      "token": "com",
      "start_offset": 0,
      "end_offset": 23,
      "type": "<EMAIL>",
      "position": 0
    }
  ]
}

As documented, the Pattern Capture Token Filter emits all tokens it produces in the same position, and with the same character offsets. This is the cause for the "+Synonym" in the explanation output.
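To make the same-position behaviour concrete, here is a minimal Python sketch of what the filter does to a single input token. It is only an approximation for illustration, not the Lucene implementation: the function name `pattern_capture` and the sample address are hypothetical, and since Python's `re` module has no `\p{L}`, the class `[^\W\d_]` stands in for "letters".

```python
import re

# Approximation of the Java pattern (\p{L}+): Python's `re` lacks \p{L},
# so [^\W\d_]+ (word characters minus digits and underscore) is used instead.
LETTERS = re.compile(r"([^\W\d_]+)")


def pattern_capture(token, position, preserve_original=True):
    """Return (text, position, start_offset, end_offset) for one input token.

    Every emitted token reuses the position and offsets of the input token,
    which is the behaviour described in the filter's documentation.
    """
    start, end = 0, len(token)
    emitted = []
    if preserve_original:
        emitted.append((token, position, start, end))
    for match in LETTERS.finditer(token):
        piece = match.group(1)
        if piece != token:  # sketch-level choice: do not repeat the original text
            emitted.append((piece, position, start, end))
    return emitted


# Hypothetical address, not the one used in this issue.
tokens = pattern_capture("jane@mail.example.org", position=0)
for tok in tokens:
    print(tok)
```

All five tokens come out with position 0 and offsets 0 to 21, which is the shape of the `_analyze` response above.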

pmishev · 2017-07-17T11:49:34Z

@cbuescher, to illustrate what I mean:


PUT /test1/emails/1
{
  "email1": "[email protected]",
  "email2": "[email protected]"
}

GET test1/emails/_search
{
  "query": {
    "match": {
      "email1": {
        "query": "[email protected]",
        "operator": "and"
      }
    }
  }
}

That query will return the document.
However, it shouldn't, because one of the tokens produced during search term analysis will be blah, and no such token exists among the indexed tokens. Because I used "operator": "and", I expect to NOT get any results from the query.

jimczi · 2017-07-18T10:56:17Z

There is a note at the end of the documentation for the pattern_filter:

Note: All tokens are emitted in the same position, and with the same character offsets, so when combined with highlighting, the whole original token will be highlighted, not just the matching subset. For instance, querying the above email address for "smith" would highlight:

So the query parser thinks that all these tokens are at the same position and builds them as synonyms. I think it should be clearly stated in the docs that each token will be considered as a full replacement for the email address. Bottom line is that this is the expected behavior with this token filter.
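A simplified model of that logic, as a Python sketch: tokens are grouped by position, each group is treated as a set of synonyms (any term may match), and the operator only decides how the per-position groups are combined. The names `build_query` and `matches` and the sample addresses are hypothetical, and this is not the actual query builder code.

```python
from collections import defaultdict


def build_query(tokens, operator="or"):
    """Group (text, position) tokens by position, then combine the groups.

    Tokens sharing a position form one synonym group; the operator applies
    between groups, never inside one.
    """
    by_position = defaultdict(set)
    for text, position in tokens:
        by_position[position].add(text)
    groups = [by_position[p] for p in sorted(by_position)]
    return {"operator": operator, "groups": groups}


def matches(query, indexed_terms):
    """True if the document's indexed terms satisfy the query."""
    hits = [bool(group & indexed_terms) for group in query["groups"]]
    return all(hits) if query["operator"] == "and" else any(hits)


# Hypothetical data: terms indexed for one address, and the terms produced
# when searching for a different address on the same domain.
indexed = {"jane@mail.example.org", "jane", "mail", "example", "org"}
search_terms = ["blah@mail.example.org", "blah", "mail", "example", "org"]

# Case 1: all search tokens at position 0, as the filter emits them.
stacked = [(t, 0) for t in search_terms]
# Case 2: the same terms, but each at its own position.
spread = [(t, i) for i, t in enumerate(search_terms)]

same_position = matches(build_query(stacked, "and"), indexed)
separate_positions = matches(build_query(spread, "and"), indexed)
print(same_position, separate_positions)
```

In this model the stacked tokens collapse into a single group, so "and" has nothing to combine and the document matches on any shared term. With one token per position, "and" requires every term and the same document is rejected.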

pmishev · 2017-07-19T11:14:22Z

Thank you for clarifying that. That explains a lot.
However, that makes using token filters that generate additional tokens kind of useless for AND queries, doesn't it?

Perhaps when the and operator is used in a match query, the token order should be ignored?
But perhaps there are other scenarios where that may not be the right thing to do.

Alternatively, perhaps a pattern_capture tokenizer should be introduced that is as powerful as the filter, but would generate the tokens in different positions?

There was some confusion about the fact that tokens emitted from a Pattern Capture Token Filter are treated as synonyms when used to analyze a search query. This commit adds an explanation to the note in the docs to emphasize this behaviour. Closes elastic#25746


cbuescher added the :Search Relevance/Analysis and feedback_needed labels Jul 17, 2017

jimczi added the >docs label and removed the feedback_needed label Jul 18, 2017

cbuescher mentioned this issue Aug 18, 2017

[Docs] Clarify behaviour of Pattern Capture Token Filter during search #26278

Merged

cbuescher closed this as completed in #26278 Aug 21, 2017

javanna added the Team:Search Relevance label Jul 16, 2024
