Tokens generated after token filters ignore match query operator option #25746

Closed
pmishev opened this issue Jul 17, 2017 · 4 comments
Labels: >docs, :Search Relevance/Analysis, Team:Search Relevance

pmishev commented Jul 17, 2017

Elasticsearch version: 2.4, 5.5

Plugins installed: []

JVM version: 1.8.0_131

OS version: 4.4.0-81-generic #104-Ubuntu x86_64

Description of the problem including expected versus actual behavior:

When "operator": "and" is specified in a match query, ALL tokens generated by the search analyzer should be looked for in the indexed tokens.
However, tokens generated by token filters behave differently: the query looks for ANY of those tokens.

Steps to reproduce:

PUT /test1
{
  "settings": {
    "analysis": {
      "filter": {
        "pattern_filter": {
          "type": "pattern_capture",
          "patterns": [
            "(\\p{L}+)"
          ]
        }
      },
      "analyzer": {
        "my_analyzer1": {
          "tokenizer": "uax_url_email",
          "filter": [
            "pattern_filter"
          ]
        },
        "my_analyzer2": {
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "emails": {
      "properties": {
        "email1": {
          "type": "string",
          "analyzer": "my_analyzer1"
        },
        "email2": {
          "type": "string",
          "analyzer": "my_analyzer2"
        }
      }
    }
  }
}

GET test1/emails/_validate/query?explain
{
  "query": {
    "match": {
      "email2": {
        "query": "[email protected]",
        "operator": "and"
      }
    }
  }
}

gives "explanation": "+(+email2:somebody +email2:we +email2:example.com) #ConstantScore(+ConstantScore(_type:emails))", which is correct

GET test1/emails/_validate/query?explain
{
  "query": {
    "match": {
      "email1": {
        "query": "[email protected]",
        "operator": "and"
      }
    }
  }
}

gives "explanation": "+(email1:[email protected] email1:somebody email1:we email1:example email1:com) #ConstantScore(+ConstantScore(_type:emails))" in ES 2.4

or "explanation": "+Synonym(email1:com email1:example email1:somebody email1:[email protected] email1:we) #_type:emails" in ES 5.5

I couldn't find this behaviour documented anywhere. I believe it is wrong, and that the correct explanation should be:
+(+email1:[email protected] +email1:somebody +email1:we +email1:example +email1:com) #ConstantScore(+ConstantScore(_type:emails))

cbuescher (Member) commented

@pmishev Could you elaborate on what you think is wrong with the behaviour in 5.5? Looking at your uax_url_email tokenizer and the subsequent pattern filter, the query looks okay to me:

POST /test1/_analyze
{
  "analyzer": "my_analyzer1", 
  "text" : "[email protected]"
}

{
  "tokens": [
    {
      "token": "[email protected]",
      "start_offset": 0,
      "end_offset": 23,
      "type": "<EMAIL>",
      "position": 0
    },
    {
      "token": "somebody",
      "start_offset": 0,
      "end_offset": 23,
      "type": "<EMAIL>",
      "position": 0
    },
    {
      "token": "we",
      "start_offset": 0,
      "end_offset": 23,
      "type": "<EMAIL>",
      "position": 0
    },
    {
      "token": "example",
      "start_offset": 0,
      "end_offset": 23,
      "type": "<EMAIL>",
      "position": 0
    },
    {
      "token": "com",
      "start_offset": 0,
      "end_offset": 23,
      "type": "<EMAIL>",
      "position": 0
    }
  ]
}

As documented, the Pattern Capture Token Filter emits all tokens it produces in the same position, and with the same character offsets. This is the cause of the "+Synonym" in the explanation output.
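
To make the "same position" point concrete, here is a minimal Python sketch of how a capture-style filter stacks its output. It is a simplified model, not the Lucene implementation. The helper name `pattern_capture` and the sample address are assumptions made for illustration (the address was chosen only because it yields the same sub-tokens as the `_analyze` output above), and `[^\W\d_]+` stands in for `\p{L}+` because Python's `re` module does not support `\p{L}`.

```python
import re

# Stand-in for the "(\\p{L}+)" pattern from the issue: Python's re module has
# no \p{L}, so "word characters minus digits and underscore" approximates it.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def pattern_capture(token, position):
    """Return (term, position) pairs for one incoming token.

    The original token is kept and every capture is stacked on top of it,
    so all emitted terms share the position of the incoming token.
    """
    terms = [token]
    for match in LETTER_RUN.finditer(token):
        # Sketch-level choice: do not emit the original token a second time.
        if match.group() != token:
            terms.append(match.group())
    return [(term, position) for term in terms]


# Hypothetical address, assumed for illustration only.
tokens = pattern_capture("somebody@we.example.com", position=0)
for term, pos in tokens:
    print(pos, term)
```

Every line printed carries position 0, which is the property the query parser later interprets as "these terms are interchangeable".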

cbuescher added the :Search Relevance/Analysis and feedback_needed labels Jul 17, 2017

pmishev commented Jul 17, 2017

@cbuescher, to illustrate what I mean:


PUT /test1/emails/1
{
  "email1": "[email protected]",
  "email2": "[email protected]"
}

GET test1/emails/_search
{
  "query": {
    "match": {
      "email1": {
        "query": "[email protected]",
        "operator": "and"
      }
    }
  }
}

That query will return the document.
However, it shouldn't, because one of the tokens produced during search term analysis will be blah, and no such token exists among the indexed tokens. Because I used "operator": "and", I expect NOT to get any results from the query.


jimczi commented Jul 18, 2017

There is a note at the end of the documentation for the pattern_capture token filter:

Note: All tokens are emitted in the same position, and with the same character offsets, so when combined with highlighting, the whole original token will be highlighted, not just the matching subset. For instance, querying the above email address for "smith" would highlight:

So the query parser thinks that all these tokens are at the same position and builds them as synonyms. I think it should be clearly stated in the docs that each token will be considered a full replacement for the email address. The bottom line is that this is the expected behavior with this token filter.
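
The effect described here can be modelled in a few lines of Python. This is a deliberately simplified sketch of the idea (one clause per position, alternatives inside a clause, the operator applied only between clauses), not the actual query parser code. The function names, the sample addresses, and the token lists are all assumptions made for illustration.

```python
from collections import defaultdict


def build_clauses(tokens):
    """Group analyzed query tokens by position.

    Each position becomes one clause; terms that share a position are
    treated as alternatives (synonyms) for one another.
    """
    by_position = defaultdict(list)
    for term, position in tokens:
        by_position[position].append(term)
    return [by_position[p] for p in sorted(by_position)]


def matches(clauses, indexed_terms, operator="and"):
    """Apply the operator between clauses; inside a clause any term suffices."""
    hits = [any(term in indexed_terms for term in clause) for clause in clauses]
    return all(hits) if operator == "and" else any(hits)


# Terms assumed to be indexed for a hypothetical address.
indexed = {"somebody@we.example.com", "somebody", "we", "example", "com"}

# Query tokens stacked at one position, as a capture-style filter emits them.
stacked = [("blah@we.example.com", 0), ("blah", 0), ("we", 0),
           ("example", 0), ("com", 0)]
# The same sub-tokens spread over distinct positions, for comparison.
spread = [("blah", 0), ("we", 1), ("example", 2), ("com", 3)]

print(matches(build_clauses(stacked), indexed))  # True
print(matches(build_clauses(spread), indexed))   # False
```

With stacked tokens there is only one clause, so "and" has nothing to combine and a single shared term such as `we` is enough to match. Only when the tokens occupy different positions does "and" require each of them.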

jimczi added the >docs label and removed the feedback_needed label Jul 18, 2017

pmishev commented Jul 19, 2017

Thank you for clarifying that. That explains a lot.
However, that makes token filters that generate additional tokens kind of useless for AND queries, doesn't it?

Perhaps when the "and" operator is used in a match query, the token positions should be ignored?
But perhaps there are other scenarios where that may not be the right thing to do.

Alternatively, perhaps a pattern_capture tokenizer should be introduced that is as powerful as the filter, but would generate the tokens in different positions?
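
One possible workaround, which nobody in this thread proposed and which is offered here only as a hedged sketch, is to skip the match query for this case and build a `bool` query with one `term` clause per required token. The helper `all_terms_query` and the sample terms are hypothetical. In practice the terms might be copied from an `_analyze` response for the search text.

```python
import json


def all_terms_query(field, terms):
    """Build a search body whose bool/must clauses require every term."""
    return {
        "query": {
            "bool": {
                # One term clause per token; "must" means all of them are required.
                "must": [{"term": {field: term}} for term in terms]
            }
        }
    }


# Hypothetical terms, assumed for illustration only.
terms = ["blah", "we", "example", "com"]
body = all_terms_query("email1", terms)
print(json.dumps(body, indent=2))
```

Because `term` clauses are not analyzed, each one has to match an indexed token exactly, so a document lacking `blah` would not be returned by this body.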

cbuescher added a commit to cbuescher/elasticsearch that referenced this issue Aug 18, 2017
There was some confusion about the fact that tokens emitted from a Pattern
Capture Token Filter are treated as synonyms when used to analyze a search
query. This commit adds an explanation to the note in the docs to emphasize this
behaviour.

Closes elastic#25746
cbuescher added a commit that referenced this issue Aug 21, 2017
#26278)

There was some confusion about the fact that tokens emitted from a Pattern
Capture Token Filter are treated as synonyms when used to analyze a search
query. This commit adds an explanation to the note in the docs to emphasize this
behaviour.

Closes #25746
javanna added the Team:Search Relevance label Jul 16, 2024