Implement Okapi BM25 variants in Gensim #3304
Conversation
Pretty nice! I'll look into this after the 4.2 release.
(force-pushed 63804ce to 34d4281)
Codecov Report
@@ Coverage Diff @@
## develop #3304 +/- ##
===========================================
- Coverage 81.43% 79.66% -1.78%
===========================================
Files 122 69 -53
Lines 21052 11875 -9177
===========================================
- Hits 17144 9460 -7684
+ Misses 3908 2415 -1493
@piskvorky Thank you. No need to look just yet, I have tried some benchmarks and the code seems to have issues both with speed and with correctness. I will let you know when the PR is ready for review; it is just a draft for the moment.
(force-pushed 4b50675 to 9ab6f52)
I have experimentally confirmed compatibility of 9ab6f52 (Okapi BM25) with the rank-bm25 library.
I have also outlined some issues with the default behavior of the
To me it makes perfect sense to control the index & query normalization via a parameter. Are you able to add such an option, @Witiko? We have to keep the defaults 100% backward compatible, though.
Not a problem, I added it to my task list.
OK. |
(force-pushed f43806d to 9ab6f52)
In f43806d, I half-implemented BM25L, but I realized that it is difficult to implement fully as a sparse vector space model. That is because in BM25L, document vectors carry the weight δ (typically set to δ = 0.5) even for terms that did not occur in the document, which eliminates sparsity. This could be implemented efficiently if
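To make the sparsity problem concrete, here is a minimal sketch of the BM25L per-term weight (the δ shift of the length-normalized term frequency); the function name and default parameter values are illustrative, not taken from this PR's code:

```python
import math  # not strictly needed here, kept for the usual log-based IDF

def bm25l_weight(tf, doc_len, avgdl, idf, k1=1.5, b=0.75, delta=0.5):
    """BM25L term weight: the shift by delta gives even absent
    terms (tf == 0) a nonzero weight, so document vectors are dense."""
    ctd = tf / (1 - b + b * doc_len / avgdl)  # length-normalized tf
    ctd_shifted = ctd + delta
    return idf * (k1 + 1) * ctd_shifted / (k1 + ctd_shifted)

# A term that never occurs in the document still gets a positive weight:
absent = bm25l_weight(tf=0, doc_len=100, avgdl=120, idf=2.0)
present = bm25l_weight(tf=3, doc_len=100, avgdl=120, idf=2.0)
print(absent > 0, present > absent)  # True True
```

Since every term of the dictionary receives at least the weight derived from δ, a sparse matrix representation of BM25L document vectors degenerates to a dense one.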
(force-pushed 53ec11f to fd283a4)
@Witiko what's the equivalent API call for
I don't think there is an equivalent API call. You can get all similarities, run argmax over them, and take the top 80.
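The "take the top 80" step above can be sketched with the standard library alone; the toy random scores stand in for the per-document similarities a gensim index would return:

```python
import heapq
import random

# Toy stand-in for the similarity scores a gensim index returns,
# one score per indexed document, in indexing order.
random.seed(0)
sims = [random.random() for _ in range(350_000)]

# Top 80 (document_index, score) pairs; heapq.nlargest makes a single
# pass over the 350K scores instead of sorting all of them.
top80 = heapq.nlargest(80, enumerate(sims), key=lambda pair: pair[1])
print(len(top80), top80[0][1] >= top80[-1][1])  # 80 True
```

A single linear pass over a few hundred thousand floats is cheap; the expensive part of retrieval is computing the similarities, not selecting the top k.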
This would have been very useful for me during the last few weeks. Sadly, there doesn't seem to be much interest in BM25 here. |
@Witiko I don't follow. You are saying that for a given query, I iterate over the whole 350K datapoints in the corpus, get the similarities, and then take the top 80? This would not scale at all. 😭
@Witiko also, the API of gensim is neither user-friendly nor convenient. It appears it just provides a
It appears to me Gensim was gold back in the day. But now it is an old, stale, and outdated library that doesn't want to move forward. It is probably time to abandon this library, and devs should look for a better alternative that provides easier API access and more functionality.
What I am saying is that it will be significantly faster than rank-bm25 at retrieval time. (This is a continued discussion from dorianbrown/rank_bm25#27 and dorianbrown/rank_bm25#25.) |
got it |
Gensim will get you similarities in the order of indexing, i.e. if you index documents 1, 2, and 3, and then perform a similarity query, you will get back similarities between the query and documents 1, 2, and 3, respectively. |
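The index-order alignment described above can be illustrated with toy scores (the documents and numbers here are made up for the example):

```python
corpus = ["doc one", "doc two", "doc three"]
sims = [0.12, 0.87, 0.45]  # toy scores, aligned with indexing order

# Because scores come back in indexing order, zipping them with the
# corpus pairs each document with its score; then rank by score.
ranked = sorted(zip(corpus, sims), key=lambda pair: pair[1], reverse=True)
print(ranked[0][0])  # doc two
```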
@Witiko you are phenomenal, thanks for all the great feedback. I have one more question: if I set
And then get the similarity like the following, would the result be sorted by most matched document?
So the question is whether the result of
@Witiko wow, awesome work. In the context of this implementation:
Sorry for the stupid question. |
This feature would be very useful for me. |
@smith-co The similarities are BM25 scores, i.e. the higher the similarity, the more similar the document is to your query. |
@ramsey-coding @smith-co I added outputs to the example code in the original post. Furthermore, I also added an example showing how you can get back the best document for a query. I hope you will find this useful. 😉 |
@ramsey-coding Sorry, I wasn't at my computer over the weekend. Yes, your understanding is correct; specifying |
I would really appreciate merging this functionality as I must now use my own custom implementation of BM25 when working with the Gensim library. |
Co-authored-by: Radim Řehůřek <[email protected]>
@piskvorky Thank you for taking the time. We can mention in the release notes that Gensim can now be used for Lucene-style information retrieval. |
Sure, it goes into the release notes without saying. I meant more like some demo, or a practical use-case (who'd use the gensim implementation and why?), or similar. A motivational part, to anchor the technical part. Maybe @dunefox @mgeletka @smith-co @nashid @ramsey-coding could help? |
@mpenkov anything missing here? Let's aim to release soon after merging, to get this feature out. Thanks. |
Sorry for the delay guys, merging. Thank you for your efforts and your patience @Witiko |
Thanks Misha! @dunefox @mgeletka @smith-co @nashid @ramsey-coding could you write a few sentences about how you use Okapi BM25, or intend to use it? Your story, your use-case, your motivation to participate in this PR. |
Hello @Witiko! |
Hello, @ditengm. You can get the BM25 weight matrix of your corpus from |
This pull request implements the `gensim.models.bm25model` module, which contains an implementation of the Okapi BM25 model and its modifications (Lucene BM25 and ATIRE) as discussed in #2592 (comment). The module acts as a replacement for the `gensim.summarization.bm25model` module deprecated and removed in Gensim 4. The module should supersede the `gensim.models.tfidfmodel` module as the baseline weighting function for information retrieval and related NLP tasks.

Most implementations of BM25, such as the rank-bm25 library, combine indexing with weighting and often forgo dictionary building for a speed improvement at indexing time (but a hefty penalty at retrieval time). To give an example, here is how a user would search for documents with rank-bm25:
As you can see, the interface is convenient, but retrieval is slow due to the lack of a dictionary. Furthermore, any advanced operations such as pruning the dictionary, applying semantic matching (e.g. SCM) and query expansion (e.g. RM3), or sharding the index are unavailable.
By contrast, the `gensim.models.bm25` module separates the three operations. To give an example, here is how a user would search for documents with the `gensim.models.bm25` module:

Tasks:

- Implement Okapi BM25, BM25L and BM25+ [1, 2], Lucene BM25 [3, 4], and ATIRE BM25 [3, 5] in `models.bm25`.
- `similarities.docsim`.
- `run_topics_and_transformations` autoexample.
- Add `normalize_queries=True, normalize_documents=True` named parameters to the `SparseMatrixSimilarity`, `DenseMatrixSimilarity`, and `SoftCosineSimilarity` classes, as discussed in Implement Okapi BM25 variants in Gensim #3304 (comment) and on the Gensim mailing list. Deprecate the `normalize` named parameter of `SoftCosineSimilarity`. Add `normalize_queries=False, normalize_documents=False` to the TF-IDF and BM25 examples.