Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gensim.similarities.Similarity merges results from shards incorrectly (LSI model) #2584

Closed
avoskresensky opened this issue Aug 28, 2019 · 5 comments · Fixed by #2720
Closed
Labels
bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills help wanted

Comments

@avoskresensky
Copy link

avoskresensky commented Aug 28, 2019

If "num_best" is used, gensim.similarities.Similarity runs the query against each of the shards (MatrixSimilarity objects) and then merges the results.

MatrixSimilarity uses matutils.full2sparse_clipped() to pick "num_best" results which sorts by the absolute value.

gensim.similarities.Similarity on the other hand, just uses heapq.nlargest (in __getitem__) to merge the results from each of the shards. So negative sims are either pushed down the list or cut off completely.

@piskvorky
Copy link
Owner

You're right. This seems an inconsistency / bug introduced by PR #811.

@avoskresensky in your use case – is abs desirable, or is the raw similarity desirable?

@piskvorky piskvorky added the bug Issue described a bug label Aug 28, 2019
@avoskresensky
Copy link
Author

If my assumption that it's the absolute values that indicate similarity is correct, then I don't care that much about the raw values. abs values are okay. That's how we ended up "fixing" this inconsistency in out current implementation: we're just abs'ing the results in our Similarity.query_shards override.

But if I'm wrong and sign of the sim values does mean some form of dissimilarity, then the whole idea of sorting while merging becomes questionable.

@piskvorky
Copy link
Owner

piskvorky commented Aug 29, 2019

Negative similarity values (close to -1) mean "opposite vectors", which is can also be interpreted as "very similar", because it indicates a strong semantic connection between the two inputs. Cossim values around 0.0 mean perpendicular = unrelated.

Whether "opposite" really means "similar" or not will depend on how you generated your vectors, and what you're using them for. That's why I ask: this is an "upstream" question, not a technical one.

Either way: can you open a PR that fixes Similarity to sort on abs values as well? We definitely want this logic consistent across the similarity classes. Cheers.

@mpenkov
Copy link
Collaborator

mpenkov commented Sep 7, 2019

@avoskresensky Ping on this. Are you able to make a PR?

@avoskresensky
Copy link
Author

Not in the next couple of weeks.

@piskvorky piskvorky added the difficulty medium Medium issue: required good gensim understanding & python skills label Oct 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills help wanted
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants