Implement Okapi BM25 variants in Gensim #3304

Witiko · 2022-03-04T19:17:42Z

This pull request implements the gensim.models.bm25model module, which contains an implementation of the Okapi BM25 model and its modifications (Lucene BM25 and ATIRE) as discussed in #2592 (comment). The module acts as a replacement for the gensim.summarization.bm25 module, which was deprecated and removed in Gensim 4. The module should supersede the gensim.models.tfidfmodel module as the baseline weighting function for information retrieval and related NLP tasks.

Most implementations of BM25 such as the rank-bm25 library combine indexing with weighting and often forgo dictionary building for a speed improvement at indexing time (but a hefty penalty at retrieval time). To give an example, here is how a user would search for documents with rank-bm25:

>>> from rank_bm25 import BM25Okapi
>>>
>>> corpus = [["Hello", "world"], ["bar", "bar"], ["foo", "bar"]]
>>> bm25_model = BM25Okapi(corpus)
>>>
>>> query = ["Hello", "bar"]
>>> similarities = bm25_model.get_scores(query)
>>> similarities
array([0.51082562, 0.09121886, 0.0638532 ])
>>> best_document, = bm25_model.get_top_n(query, corpus, n=1)
>>> best_document
['Hello', 'world']

As you can see, the interface is convenient, but retrieval is slow due to the lack of a dictionary. Furthermore, any advanced operations such as pruning the dictionary, applying semantic matching (e.g. SCM) and query expansion (e.g. RM3), or sharding the index are unavailable.

By contrast, the gensim.models.bm25model module separates the three operations: dictionary building, weighting, and indexing. To give an example, here is how a user would search for documents with the gensim.models.bm25model module:

>>> from gensim.corpora import Dictionary
>>> from gensim.models import TfidfModel, OkapiBM25Model
>>> from gensim.similarities import SparseMatrixSimilarity
>>> import numpy as np
>>>
>>> corpus = [["Hello", "world"], ["bar", "bar"], ["foo", "bar"]]
>>> dictionary = Dictionary(corpus)
>>> bm25_model = OkapiBM25Model(dictionary=dictionary)
>>> bm25_corpus = bm25_model[list(map(dictionary.doc2bow, corpus))]
>>> bm25_index = SparseMatrixSimilarity(bm25_corpus, num_docs=len(corpus), num_terms=len(dictionary),
...                                     normalize_queries=False, normalize_documents=False)
>>>
>>> query = ["Hello", "bar"]
>>> tfidf_model = TfidfModel(dictionary=dictionary, smartirs='bnn')  # Enforce binary weighting of queries
>>> tfidf_query = tfidf_model[dictionary.doc2bow(query)]
>>>
>>> similarities = bm25_index[tfidf_query]
>>> similarities
array([0.51082563, 0.09121886, 0.0638532 ], dtype=float32)
>>> best_document = corpus[np.argmax(similarities)]
>>> best_document
['Hello', 'world']
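To make the scores in the two examples above easier to follow, here is a minimal, dependency-free sketch of Okapi BM25 scoring. It assumes the rank-bm25 conventions (k1=1.5, b=0.75, and negative IDF values floored at epsilon=0.25 times the average IDF); the function name okapi_bm25_scores is made up for illustration and is not part of either library.

```python
import math
from collections import Counter


def okapi_bm25_scores(corpus, query, k1=1.5, b=0.75, epsilon=0.25):
    """Score every tokenized document in `corpus` against the tokenized `query`."""
    num_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / num_docs

    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in corpus for term in set(doc))

    # The raw IDF is negative for terms that occur in more than half of the documents.
    idf = {t: math.log(num_docs - n + 0.5) - math.log(n + 0.5) for t, n in df.items()}

    # Assumption: negative IDFs are replaced by epsilon times the average IDF.
    floor = epsilon * sum(idf.values()) / len(idf)
    idf = {t: (v if v >= 0 else floor) for t, v in idf.items()}

    scores = []
    for doc in corpus:
        tf = Counter(doc)
        length_norm = k1 * (1 - b + b * len(doc) / avgdl)
        scores.append(sum(
            idf.get(term, 0.0) * tf[term] * (k1 + 1) / (tf[term] + length_norm)
            for term in query
        ))
    return scores


corpus = [["Hello", "world"], ["bar", "bar"], ["foo", "bar"]]
scores = okapi_bm25_scores(corpus, ["Hello", "bar"])
print([round(s, 8) for s in scores])
```

With the toy corpus, "bar" occurs in two of the three documents, so its raw IDF is negative and the floor applies; this is why documents 2 and 3 receive small but nonzero scores.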

Tasks:

Add Okapi BM25, ~~BM25L and BM25⁺~~ [1, 2], Lucene BM25 [3, 4], and ATIRE BM25 [3, 5].
Add comments and docstrings to models.bm25model.
Add comments and docstrings to similarities.docsim.
Add BM25 to the run_topics_and_transformations autoexample.
Add normalize_queries=True, normalize_documents=True named parameters to the SparseMatrixSimilarity, DenseMatrixSimilarity, and SoftCosineSimilarity classes as discussed in #3304 (comment) and on the Gensim mailing list. Deprecate the normalized named parameter of SoftCosineSimilarity. Add normalize_queries=False, normalize_documents=False to TF-IDF and BM25 examples.

piskvorky · 2022-03-04T19:27:13Z

Pretty nice! I'll look into this after the 4.2 release.

codecov · 2022-03-04T21:23:24Z

Codecov Report

Merging #3304 (f43806d) into develop (ac3bbcd) will decrease coverage by 1.77%.
The diff coverage is 95.74%.

❗ Current head f43806d differs from pull request most recent head b4843cc. Consider uploading reports for the commit b4843cc to get more accurate results

@@             Coverage Diff             @@
##           develop    #3304      +/-   ##
===========================================
- Coverage    81.43%   79.66%   -1.78%     
===========================================
  Files          122       69      -53     
  Lines        21052    11875    -9177     
===========================================
- Hits         17144     9460    -7684     
+ Misses        3908     2415    -1493

Impacted Files	Coverage Δ
gensim/models/bm25model.py	`95.74% <95.74%> (ø)`
gensim/scripts/glove2word2vec.py	`76.19% <0.00%> (-7.15%)`	⬇️
gensim/corpora/wikicorpus.py	`93.75% <0.00%> (-1.04%)`	⬇️
gensim/matutils.py	`77.23% <0.00%> (-0.90%)`	⬇️
gensim/similarities/docsim.py	`23.95% <0.00%> (-0.76%)`	⬇️
gensim/models/rpmodel.py	`89.47% <0.00%> (-0.53%)`	⬇️
gensim/models/ldamulticore.py	`90.58% <0.00%> (-0.33%)`	⬇️
gensim/utils.py	`71.86% <0.00%> (-0.12%)`	⬇️
gensim/corpora/dictionary.py	`94.17% <0.00%> (-0.09%)`	⬇️
gensim/models/hdpmodel.py	`71.27% <0.00%> (-0.08%)`	⬇️
... and 91 more


Witiko · 2022-03-05T13:06:23Z

@piskvorky Thank you. No need to look just yet, I have tried some benchmarks and the code seems to have issues both with speed and with correctness. I will let you know when the PR is ready for review; it is just a draft for the moment.

Witiko · 2022-03-07T18:32:27Z

I have experimentally confirmed compatibility of 9ab6f52 (Okapi BM25) with the rank-bm25 library.

I have also outlined some issues with the default behavior of the DenseMatrixSimilarity and SparseMatrixSimilarity indexes, which are likely to bite even experienced users and decrease the accuracy of their results with BM25, on the Gensim mailing list.

piskvorky · 2022-03-07T18:56:34Z

The *MatrixSimilarity stuff is the oldest part of Gensim, along with the LsiModel. It dates back to DML-CZ days, in ancient pre-history :) (definitely pre-github)

To me it makes perfect sense to control the index & query normalization via a parameter. Are you able to add such an option, @Witiko? We have to keep the defaults 100% backward compatible though.

Witiko · 2022-03-07T19:07:36Z

Not a problem, I added it to my task list. The SoftCosineSimilarity constructor uses a single parameter normalized that takes a two-tuple of booleans, one for queries and one for documents; let's deprecate that and have normalize_queries and normalize_documents across the SimilarityABC subclasses, seems more readable.
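To illustrate why the two flags matter for BM25, here is a small sketch of what query and document normalization do to an inner-product similarity. It is not Gensim's implementation: the helpers l2_normalize and similarity are hypothetical, only the flag names mirror the proposed parameters, and the document weights are hand-picked from the example in the original post.

```python
import math


def l2_normalize(vector):
    """Scale a sparse {term: weight} vector to unit Euclidean length."""
    norm = math.sqrt(sum(weight * weight for weight in vector.values()))
    return {term: weight / norm for term, weight in vector.items()} if norm else dict(vector)


def similarity(query, document, normalize_queries=True, normalize_documents=True):
    """Inner product of two sparse vectors, optionally after unit-normalizing each."""
    if normalize_queries:
        query = l2_normalize(query)
    if normalize_documents:
        document = l2_normalize(document)
    return sum(weight * document.get(term, 0.0) for term, weight in query.items())


# Illustrative BM25 document vectors and a binary query vector.
doc_hello_world = {"Hello": 0.51082562, "world": 0.51082562}
doc_bar_bar = {"bar": 0.09121886}
query = {"Hello": 1.0, "bar": 1.0}

# Without normalization, the inner product is the BM25 score.
raw = [similarity(query, d, False, False) for d in (doc_hello_world, doc_bar_bar)]
# With both normalizations, the result is a cosine similarity instead.
cosine = [similarity(query, d) for d in (doc_hello_world, doc_bar_bar)]
print(raw, cosine)
```

In this toy case the raw inner product ranks ["Hello", "world"] first, whereas the cosine similarity ranks ["bar", "bar"] first, because normalization discards the magnitude of the BM25 weights.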

piskvorky · 2022-03-07T21:04:40Z

OK.

Witiko · 2022-03-16T19:05:08Z

In f43806d, I halfway-implemented BM25L, but I realized that it is difficult to fully implement as a sparse vector space model. That is because in BM25L, document vectors have the weight δ (typically set to δ = 0.5) for all terms that did not occur in the document, which eliminates sparsity. This could be implemented efficiently if scipy.sparse supported a flag that would make the value of zero elements not zero but a different constant, which I doubt it does. Alternatively, we could have a special-purpose index just for BM25L, but that seems to defeat the purpose of implementing things in Gensim, which is interoperability with other vector space concepts and models. Therefore, I plan to abandon BM25L and focus on BM25⁺ next.
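The loss of sparsity can be seen in a short sketch. It assumes the BM25L term weight (k1 + 1)(c + δ) / (k1 + c + δ), with c the length-normalized term frequency, is applied to every term of the vocabulary, including absent ones, as the comment above describes. The IDF factor is left out because it does not change which cells are zero, and both function names are made up for illustration.

```python
def okapi_weight(tf, k1=1.5, norm=1.0):
    """Okapi BM25 term weight without IDF; `norm` stands for 1 - b + b * dl / avgdl."""
    return tf * (k1 + 1) / (tf + k1 * norm)


def bm25l_weight(tf, k1=1.5, norm=1.0, delta=0.5):
    """BM25L term weight without IDF, shifted by delta (assumed to apply even when tf == 0)."""
    c = tf / norm
    return (k1 + 1) * (c + delta) / (k1 + c + delta)


# Term frequencies for three documents over the vocabulary (Hello, world, bar, foo).
term_frequencies = [
    [1, 1, 0, 0],
    [0, 0, 2, 0],
    [0, 0, 1, 1],
]

okapi_nonzeros = sum(okapi_weight(tf) != 0 for row in term_frequencies for tf in row)
bm25l_nonzeros = sum(bm25l_weight(tf) != 0 for row in term_frequencies for tf in row)
print(okapi_nonzeros, bm25l_nonzeros)  # 5 nonzero cells versus all 12
```

An absent term has weight 0 under Okapi BM25 but 2.5 × 0.5 / 2.0 = 0.625 under this BM25L formulation, so every cell of the document-term matrix would have to be stored.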

ramsey-coding · 2022-08-27T19:59:42Z

@Witiko what's the equivalent API call for

bm25.get_top_n(tokenized_query, corpus, n=80)

Witiko · 2022-08-27T20:05:40Z

I don't think there is an equivalent API call. You can get all similarities, run argsort over them and take the top 80.
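A minimal sketch of selecting the top n documents from a full array of similarities, assuming the similarities come back in indexing order. The helper top_n is hypothetical and uses only the standard library, so it works on a plain list as well as on a NumPy array.

```python
import heapq


def top_n(similarities, n):
    """Return (document_index, score) pairs for the n highest scores, best first."""
    return heapq.nlargest(n, enumerate(similarities), key=lambda pair: pair[1])


corpus = [["Hello", "world"], ["bar", "bar"], ["foo", "bar"]]
similarities = [0.51082563, 0.09121886, 0.0638532]  # scores from the example in the original post

best = top_n(similarities, n=2)
# The position of a score in `similarities` is the position of the document in `corpus`.
top_documents = [corpus[index] for index, _ in best]
print(top_documents)
```

Because the scores are positional, the original documents can be recovered by indexing into the corpus, without any extra bookkeeping beyond keeping the corpus in the same order.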

dunefox · 2022-08-27T20:18:07Z

This would have been very useful for me during the last few weeks. Sadly, there doesn't seem to be much interest in BM25 here.

ramsey-coding · 2022-08-27T20:20:55Z

I don't think there is an equivalent API call. Get all similarities, run argmax over them and take the top 80.

@Witiko I don't follow. Are you saying that, for a given query, I have to iterate over all 350K data points in the corpus, get their similarities, and then take the top 80?

This would not scale at all. 😭

ramsey-coding · 2022-08-27T20:40:19Z

@Witiko also, the Gensim API is neither user-friendly nor convenient. It appears that it just provides a similarity score and does not even return the original document. Devs need to maintain an external data structure to retrieve the original document.

It appears to me that Gensim was gold back in the day, but now it is an old, stale, and outdated library that does not want to move forward.

It is probably time to abandon this library; devs should look for a better alternative that provides easier API access and more functionality.

Witiko · 2022-08-27T20:46:20Z

What I am saying is that it will be significantly faster than rank-bm25 at retrieval time. (This is a continued discussion from dorianbrown/rank_bm25#27 and dorianbrown/rank_bm25#25.)

ramsey-coding · 2022-08-27T20:48:17Z

What I am saying is that it will be significantly faster than rank-bm25.

got it

Witiko · 2022-08-27T20:48:37Z

Gensim will get you similarities in the order of indexing, i.e. if you index documents 1, 2, and 3, and then perform a similarity query, you will get back similarities between the query and documents 1, 2, and 3, respectively.

ramsey-coding · 2022-08-27T20:57:30Z

@Witiko you are phenomenal, thanks for all the great feedback.

I have one more question:

If I set num_best=80 here:

SparseMatrixSimilarity(bm25_corpus,
                       num_docs=len(corpus),
                       num_terms=len(dictionary),
                       normalize_queries=False,
                       normalize_documents=False,
                       num_best=80)  # I set num_best=80 and want to get the top 80 documents

And if I then get the similarities as follows, would the result be sorted by best-matching document?

    tfidf_model = TfidfModel(dictionary=bm25_dictionary, smartirs='bnn')  # Enforce binary weighting of queries
    tfidf_query = tfidf_model[bm25_dictionary.doc2bow(tokenized_query)]

    similarities = bm25_index[tfidf_query]
    for doc_no, score in bm25_index[tfidf_query]:
        print("original document:", test_methods_corpus[doc_no])

So the question is: would the result of bm25_index[tfidf_query] be sorted by best-matching documents or not?

smith-co · 2022-08-28T00:46:13Z

@Witiko wow, awesome work. In the context of this implementation:

similarities = bm25_index[tfidf_query]

Does a higher score mean that the document is more similar to the query?
Or does a lower score mean that the document is more similar to the query?

Sorry for the stupid question.

nashid · 2022-08-28T02:01:20Z

This feature would be very useful for me.

Witiko · 2022-08-28T09:01:02Z

@smith-co The similarities are BM25 scores, i.e. the higher the similarity, the more similar the document is to your query.

Witiko · 2022-08-28T15:35:01Z

@ramsey-coding @smith-co I added outputs to the example code in the original post. Furthermore, I also added an example showing how you can get back the best document for a query. I hope you will find this useful. 😉

Witiko · 2022-08-29T11:47:33Z

So the question is: would the result of bm25_index[tfidf_query] be sorted by best-matching documents or not?

@ramsey-coding Sorry, I wasn't at my computer over the weekend. Yes, your understanding is correct: specifying num_best=80 in SparseMatrixSimilarity(...) will cause bm25_index[tfidf_query] to produce an iterable of 80 document ids and similarities, sorted in descending order of similarity, i.e. starting with the best match.

mgeletka · 2022-08-30T07:38:26Z

I would really appreciate merging this functionality as I must now use my own custom implementation of BM25 when working with the Gensim library.


piskvorky · 2022-08-30T12:58:22Z

Code looks nice and clean, sorry for taking so long to review.

@mpenkov anything else we need before merge?

@Witiko how about post-merge? What can we do to promote this functionality (beyond including it in the Gensim gallery)?


Witiko · 2022-08-30T13:17:07Z

@piskvorky Thank you for taking the time. We can mention in the release notes that Gensim can now be used for Lucene-style information retrieval.

piskvorky · 2022-08-30T13:46:48Z

Sure, it goes without saying that this will be in the release notes.

I meant more like some demo, or a practical use-case (who'd use the gensim implementation and why?), or similar. A motivational part, to anchor the technical part.

Maybe @dunefox @mgeletka @smith-co @nashid @ramsey-coding could help?

piskvorky · 2022-09-07T14:53:36Z

@mpenkov anything missing here?

Let's aim to release soon after merging, to get this feature out. Thanks.

mpenkov · 2022-09-08T00:50:07Z

Sorry for the delay guys, merging.

Thank you for your efforts and your patience @Witiko

piskvorky · 2022-09-08T07:35:30Z

Thanks Misha!

@dunefox @mgeletka @smith-co @nashid @ramsey-coding could you write a few sentences about how you use Okapi BM25, or intend to use it?

Your story, your use-case, your motivation to participate in this PR.

ditengm · 2022-12-09T08:18:32Z

Hello @Witiko!
Can you please tell me how to get word corpus embedding? That is, the corpus weight matrix?
Thanks!

Witiko · 2022-12-09T09:56:17Z

Hello @Witiko!
Can you please tell me how to get word corpus embedding? That is, the corpus weight matrix?
Thanks!

Hello, @ditengm. You can get the BM25 weight matrix of your corpus from bm25_index.index, where bm25_index is the SparseMatrixSimilarity index from the second example in the original post. The type of bm25_index.index is scipy.sparse.csr_matrix.

Witiko force-pushed the feature/bm25 branch 2 times, most recently from 63804ce to 34d4281, on March 4, 2022 21:16

Witiko force-pushed the feature/bm25 branch from 34d4281 to 6ea5c22 on March 4, 2022 21:34

Witiko force-pushed the feature/bm25 branch 8 times, most recently from 4b50675 to 9ab6f52, on March 7, 2022 12:34

Witiko force-pushed the feature/bm25 branch 3 times, most recently from f43806d to 9ab6f52, on March 16, 2022 00:53

Witiko force-pushed the feature/bm25 branch from 1a062d1 to 9ab6f52 on March 31, 2022 15:52

Add and unit-test gensim.models.bm25model.OkapiBM25Model

869f07b

Witiko force-pushed the feature/bm25 branch 6 times, most recently from 53ec11f to fd283a4 Compare April 1, 2022 00:04

Merge branch 'develop' into feature/bm25

f3e37a6

piskvorky approved these changes Aug 30, 2022


Update gensim/models/bm25model.py

b4843cc

Co-authored-by: Radim Řehůřek <[email protected]>

mpenkov merged commit 5dbfb1e into piskvorky:develop Sep 8, 2022

suzhoum mentioned this pull request Oct 7, 2022

Semantic Search Tutorial autogluon/autogluon#2186

Merged

bogdankostic mentioned this pull request Oct 21, 2022

Add support for BM25Retriever in InMemoryDocumentStore deepset-ai/haystack#3447

Closed

Witiko mentioned this pull request Nov 11, 2022

Can't import OkapiBM25Model #3403

Closed
