Fill `position`s in NgramTokenizer
#2471
base: main
Conversation
@darashi I think removing the 0 in position seems good, but I'm not sure what would be best in terms of the spec:
`/// Position, expressed in number of tokens.`
`pub position: usize,`
It would be good to know how Lucene handles this; the docs here are a little unclear to me. Instead of What do you think? Can you check what tokens Lucene emits?
Thank you for your comment! The offset of a token in N-Gram varies depending on the N, so I agree that it's not that simple to determine it uniquely especially in cases where multiple Ns are mixed together. Even so, I thought it would be reasonable to express the position in terms of number of Unicode characters. I'm a novice with Java, but I managed to get For
and for
If I'm not mistaken, it looks like
I used the code in https://github.com/darashi/hellolucene to obtain the output mentioned above.
Thanks for checking. Unfortunately they emit
In tantivy we use byte offsets; maybe they use something else? Can you add tests for ngram with different settings, combined with phrase queries?
I am looking into ways to search text that includes non-ASCII characters. I want to find all documents that contain the query string as a substring, regardless of where the word delimiters fall.
For this purpose, I tried to use `NgramTokenizer` and `PhraseQuery`. However, it did not work as I expected: the order of the tokens seems to be ignored. I looked into the cause and found that the position was changed to all zeros in commit e75bb1d.
To be honest, I don't fully understand the reason for this change, but if it was not required for consistency with the specification or something similar, why not set the value of `position`?
So, I made a patch that fills each `position` with where the token starts, in terms of characters (not bytes).
There may be a more appropriate way to do this, and I may have missed something, but I would appreciate it if you could take a look.
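To make the idea concrete, here is a minimal standalone sketch of it. It is not the patch itself and does not use tantivy's API: `NgramToken`, `char_ngrams`, and `phrase_matches` are hypothetical names invented for this illustration. Each n-gram keeps its byte offsets and also records its start in characters as `position`; a simplified phrase check then requires the query tokens to appear at the same relative positions in the document.

```rust
/// Hypothetical token type for illustration only (not tantivy's `Token`).
#[derive(Debug, Clone, PartialEq)]
struct NgramToken {
    offset_from: usize, // start of the token, in bytes
    offset_to: usize,   // end of the token (exclusive), in bytes
    position: usize,    // start of the token, in Unicode characters
    text: String,
}

/// Emit every n-gram of `min_gram..=max_gram` characters.
/// Assumption: `position` is the character index where the n-gram starts.
fn char_ngrams(text: &str, min_gram: usize, max_gram: usize) -> Vec<NgramToken> {
    assert!(min_gram >= 1 && min_gram <= max_gram);
    // Byte offset of every character boundary, plus the end of the string.
    let mut bounds: Vec<usize> = text.char_indices().map(|(i, _)| i).collect();
    bounds.push(text.len());
    let n_chars = bounds.len() - 1;

    let mut tokens = Vec::new();
    for start in 0..n_chars {
        for n in min_gram..=max_gram {
            let end = start + n;
            if end > n_chars {
                break;
            }
            tokens.push(NgramToken {
                offset_from: bounds[start],
                offset_to: bounds[end],
                position: start,
                text: text[bounds[start]..bounds[end]].to_string(),
            });
        }
    }
    tokens
}

/// Simplified phrase check: every query token must occur in the document
/// at the same distance from the first query token as in the query.
fn phrase_matches(doc: &[NgramToken], query: &[NgramToken]) -> bool {
    let first = match query.first() {
        Some(token) => token,
        None => return false,
    };
    doc.iter().filter(|d| d.text == first.text).any(|anchor| {
        query.iter().all(|q| {
            let wanted = anchor.position + (q.position - first.position);
            doc.iter().any(|d| d.text == q.text && d.position == wanted)
        })
    })
}
```

With bigrams, the document "abxbc" yields `ab`, `bx`, `xb`, `bc` at positions 0 to 3, and the query "abc" yields `ab`, `bc` at positions 0 and 1, so the check correctly rejects it. If every position were 0, the same check would accept the query, which is the kind of order-insensitive match described above.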
As a supplement, I've used the following code to check the overall behavior: