
Implement Okapi BM25 variants in Gensim #3304

Merged: 9 commits, Sep 8, 2022
12 changes: 6 additions & 6 deletions docs/src/auto_examples/core/run_topics_and_transformations.ipynb

Large diffs are not rendered by default.

14 changes: 14 additions & 0 deletions docs/src/auto_examples/core/run_topics_and_transformations.py
@@ -188,6 +188,20 @@
#
# model = models.TfidfModel(corpus, normalize=True)
#
# * `Okapi Best Matching, Okapi BM25 <https://en.wikipedia.org/wiki/Okapi_BM25>`_
# expects a bag-of-words (integer values) training corpus during initialization.
# During transformation, it will take a vector and return another vector of the
# same dimensionality, except that features which were rare in the training corpus
# will have their value increased. It therefore converts integer-valued
# vectors into real-valued ones, while leaving the number of dimensions intact.
#
# Okapi BM25 is the standard ranking function used by search engines to estimate
# the relevance of documents to a given search query.
#
# .. sourcecode:: pycon
#
# model = models.OkapiBM25Model(corpus)
#
# * `Latent Semantic Indexing, LSI (or sometimes LSA) <http://en.wikipedia.org/wiki/Latent_semantic_indexing>`_
#    transforms documents from either bag-of-words or (preferably) TfIdf-weighted space into
# a latent space of a lower dimensionality. For the toy corpus above we used only
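The BM25 transformation documented above can be sketched without Gensim. The following is a minimal, self-contained illustration of BM25 weighting over a tokenized toy corpus; `bm25_weights` is a hypothetical helper written for this sketch (not part of Gensim's API), it uses the smoothed non-negative IDF and the common defaults ``k1=1.5``, ``b=0.75``, and the exact constants in Gensim's implementation may differ.

```python
import math
from collections import Counter

def bm25_weights(corpus, k1=1.5, b=0.75):
    """BM25 term weights for each document of a tokenized corpus.

    Sketch only, not Gensim's implementation. Uses the smoothed,
    non-negative IDF: ln(1 + (N - n + 0.5) / (n + 0.5)).
    """
    N = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / N
    # Document frequency of each term.
    df = Counter(term for doc in corpus for term in set(doc))
    idf = {t: math.log(1 + (N - n + 0.5) / (n + 0.5)) for t, n in df.items()}
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        # Length normalization: documents longer than average are penalized via b.
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        weights.append({t: idf[t] * f * (k1 + 1) / (f + norm) for t, f in tf.items()})
    return weights

corpus = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["eps", "user", "interface", "system"],
]
weights = bm25_weights(corpus)
# Each output vector has one real-valued entry per distinct input term, and
# "human" (rare in the corpus) outweighs "computer" (common) in the first document.
```

This mirrors the behaviour described in the tutorial text: integer bag-of-words counts go in, real-valued vectors of the same dimensionality come out, with rare terms boosted.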
@@ -1 +1 @@
-f49c3821bbacdeefdf3945d5dcb5ad01
+226db24f9e807e4bbd2a6ef280a75510
150 changes: 132 additions & 18 deletions docs/src/auto_examples/core/run_topics_and_transformations.rst

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions docs/src/auto_examples/core/sg_execution_times.rst
@@ -5,14 +5,14 @@

Computation times
=================
-**00:05.212** total execution time for **auto_examples_core** files:
+**00:01.658** total execution time for **auto_examples_core** files:

 +--------------------------------------------------------------------------------------------------------------+-----------+---------+
-| :ref:`sphx_glr_auto_examples_core_run_corpora_and_vector_spaces.py` (``run_corpora_and_vector_spaces.py``)   | 00:05.212 | 47.2 MB |
+| :ref:`sphx_glr_auto_examples_core_run_topics_and_transformations.py` (``run_topics_and_transformations.py``) | 00:01.658 | 58.1 MB |
 +--------------------------------------------------------------------------------------------------------------+-----------+---------+
 | :ref:`sphx_glr_auto_examples_core_run_core_concepts.py` (``run_core_concepts.py``)                           | 00:00.000 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------------------+-----------+---------+
-| :ref:`sphx_glr_auto_examples_core_run_similarity_queries.py` (``run_similarity_queries.py``)                 | 00:00.000 | 0.0 MB |
+| :ref:`sphx_glr_auto_examples_core_run_corpora_and_vector_spaces.py` (``run_corpora_and_vector_spaces.py``)   | 00:00.000 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------------------+-----------+---------+
-| :ref:`sphx_glr_auto_examples_core_run_topics_and_transformations.py` (``run_topics_and_transformations.py``) | 00:00.000 | 0.0 MB |
+| :ref:`sphx_glr_auto_examples_core_run_similarity_queries.py` (``run_similarity_queries.py``)                 | 00:00.000 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------------------+-----------+---------+
4 changes: 2 additions & 2 deletions docs/src/auto_examples/index.rst
@@ -220,7 +220,7 @@ Learning-oriented lessons that introduce a particular gensim feature, e.g. a mod

.. raw:: html

    <div class="sphx-glr-thumbcontainer" tooltip="Demonstrates using Gensim&#x27;s implementation of the WMD.">
    <div class="sphx-glr-thumbcontainer" tooltip="Demonstrates using Gensim&#x27;s implementation of the SCM.">

.. only:: html

@@ -237,7 +237,7 @@ Learning-oriented lessons that introduce a particular gensim feature, e.g. a mod

.. raw:: html

    <div class="sphx-glr-thumbcontainer" tooltip="Demonstrates using Gensim&#x27;s implementation of the SCM.">
    <div class="sphx-glr-thumbcontainer" tooltip="Demonstrates using Gensim&#x27;s implementation of the WMD.">

.. only:: html

14 changes: 14 additions & 0 deletions docs/src/gallery/core/run_topics_and_transformations.py
@@ -188,6 +188,20 @@
#
# model = models.TfidfModel(corpus, normalize=True)
#
# * `Okapi Best Matching, Okapi BM25 <https://en.wikipedia.org/wiki/Okapi_BM25>`_
# expects a bag-of-words (integer values) training corpus during initialization.
# During transformation, it will take a vector and return another vector of the
# same dimensionality, except that features which were rare in the training corpus
# will have their value increased. It therefore converts integer-valued
# vectors into real-valued ones, while leaving the number of dimensions intact.
#
# Okapi BM25 is the standard ranking function used by search engines to estimate
# the relevance of documents to a given search query.
#
# .. sourcecode:: pycon
#
# model = models.OkapiBM25Model(corpus)
#
# * `Latent Semantic Indexing, LSI (or sometimes LSA) <http://en.wikipedia.org/wiki/Latent_semantic_indexing>`_
#    transforms documents from either bag-of-words or (preferably) TfIdf-weighted space into
# a latent space of a lower dimensionality. For the toy corpus above we used only
1 change: 1 addition & 0 deletions gensim/models/__init__.py
@@ -9,6 +9,7 @@
from .ldamodel import LdaModel # noqa:F401
from .lsimodel import LsiModel # noqa:F401
from .tfidfmodel import TfidfModel # noqa:F401
from .bm25model import OkapiBM25Model, LuceneBM25Model, AtireBM25Model # noqa:F401
from .rpmodel import RpModel # noqa:F401
from .logentropy_model import LogEntropyModel # noqa:F401
from .word2vec import Word2Vec, FAST_VERSION # noqa:F401
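The three classes imported above correspond to different BM25 flavours, and a key difference between them is how IDF is computed. The snippet below contrasts the classic Okapi IDF, which goes negative for terms appearing in more than half of the documents, with Lucene's smoothed variant, which stays non-negative. The formulas follow the standard BM25 literature; whether Gensim's classes use exactly these constants is an assumption here, not verified against its source.

```python
import math

def okapi_idf(N, n):
    # Classic Okapi BM25 IDF: ln((N - n + 0.5) / (n + 0.5)).
    # Negative when the term appears in more than half of the N documents.
    return math.log((N - n + 0.5) / (n + 0.5))

def lucene_idf(N, n):
    # Smoothed IDF as popularized by Lucene: ln(1 + (N - n + 0.5) / (n + 0.5)).
    # Always non-negative, so matching a very common term never hurts a document.
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

N = 10                       # collection size
common, rare = 9, 1          # document frequencies of two terms
print(okapi_idf(N, common))  # negative
print(lucene_idf(N, common)) # small but positive
print(okapi_idf(N, rare))    # large and positive
```

This difference is the practical reason multiple variants exist: the classic formula can rank a document lower for containing a common query term at all, which the smoothed variants avoid.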