Vectorize word2vec.predict_output_word for speed (#3153)

* [Fix] gensim/models/word2vec.py: in method predict_output_word, changed a call to sum to numpy.sum to gain performance.
* [Feat] gensim.models.word2vec.Word2Vec.predict_output_word: added possibility for the user to input a list of word indices as parameter 'context' instead of a list of words.
* Word2Vec.predict_output_word: Changed handling of ints and strs, trying to make it more compact and versatile.
* Fixed docstring of predict_output_word.
* Simplified `predict_output_word` changes.
  * Retained the suggested `sum`->`np.sum` replacement, which has been tested to yield significant runtime gains.
  * Dropped unnecessary type/value checks that are already run when calling the `KeyedVectors.__isin__` dunder method.
  * Corrected the docstring to accurately document the supported inputs (which were already compatible prior to the PR this commit is a part of).
* Added tests for gensim.Word2Vec.predict_output_word() when context contains ints.
* Update CHANGELOG.md
* update sbt install step

Co-authored-by: Mathis <[email protected]>
Co-authored-by: Paul Andrey <[email protected]>
Co-authored-by: Mathis Demay <[email protected]>
Co-authored-by: Michael Penkov <[email protected]>
piskvorky · Jul 19, 2021 · b287fd8
1 parent a93067d
commit b287fd8
Showing 4 changed files with 18 additions and 5 deletions.
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -39,7 +39,8 @@ jobs:
       #
       - name: Update sbt
         run: |
-          echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
+          echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | sudo tee /etc/apt/sources.list.d/sbt.list
+          echo "deb https://repo.scala-sbt.org/scalasbt/debian /" | sudo tee /etc/apt/sources.list.d/sbt_old.list
           curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823" | sudo apt-key add
           sudo apt-get update -y
           sudo apt-get install -y sbt

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -19,6 +19,7 @@ Changes
 * [#3115](https://github.com/RaRe-Technologies/gensim/pull/3115): Make LSI dispatcher CLI param for number of jobs optional, by [@robguinness](https://github.com/robguinness)
 * [#3128](https://github.com/RaRe-Technologies/gensim/pull/3128): Materialize and copy the corpus passed to SoftCosineSimilarity, by [@Witiko](https://github.com/Witiko)
 * [#3131](https://github.com/RaRe-Technologies/gensim/pull/3131): Added import to Nmf docs, and to models/__init__.py, by [@properGrammar](https://github.com/properGrammar)
+* [#3153](https://github.com/RaRe-Technologies/gensim/pull/3153): Vectorize word2vec.predict_output_word for speed, by [@M-Demay](https://github.com/M-Demay)
 * [#3157](https://github.com/RaRe-Technologies/gensim/pull/3157): New KeyedVectors.vectors_for_all method for vectorizing all words in a dictionary, by [@Witiko](https://github.com/Witiko)
 * [#3163](https://github.com/RaRe-Technologies/gensim/pull/3163): Optimize word mover distance (WMD) computation, by [@flowlight0](https://github.com/flowlight0)
 * [#2965](https://github.com/RaRe-Technologies/gensim/pull/2965): Remove strip_punctuation2 alias of strip_punctuation, by [@sciatro](https://github.com/sciatro)

diff --git a/gensim/models/word2vec.py b/gensim/models/word2vec.py
@@ -1806,8 +1806,9 @@ def predict_output_word(self, context_words_list, topn=10):
 
         Parameters
         ----------
-        context_words_list : list of str
-            List of context words.
+        context_words_list : list of (str and/or int)
+            List of context words, which may be words themselves (str)
+            or their index in `self.wv.vectors` (int).
         topn : int, optional
             Return `topn` words and their probabilities.
 
@@ -1825,8 +1826,8 @@ def predict_output_word(self, context_words_list, topn=10):
 
         if not hasattr(self.wv, 'vectors') or not hasattr(self, 'syn1neg'):
             raise RuntimeError("Parameters required for predicting the output words not found.")
-
         word2_indices = [self.wv.get_index(w) for w in context_words_list if w in self.wv]
+
         if not word2_indices:
             logger.warning("All the input context words are out-of-vocabulary for the current model.")
             return None
@@ -1837,7 +1838,7 @@ def predict_output_word(self, context_words_list, topn=10):
 
         # propagate hidden -> output and take softmax to get probabilities
         prob_values = np.exp(np.dot(l1, self.syn1neg.T))
-        prob_values /= sum(prob_values)
+        prob_values /= np.sum(prob_values)
         top_indices = matutils.argsort(prob_values, topn=topn, reverse=True)
         # returning the most probable output words with their probabilities
         return [(self.wv.index_to_key[index1], prob_values[index1]) for index1 in top_indices]
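
The `sum` -> `np.sum` change in the hunk above can be illustrated in isolation. The following is a minimal sketch, not gensim code: the function names and the random `scores` array are hypothetical stand-ins for `np.dot(l1, self.syn1neg.T)`, and float64 is used here only to keep the comparison tight. It shows that both normalisations yield the same probabilities for a 1-D array; the difference is that the built-in `sum` iterates over the array element by element in Python, while `np.sum` reduces it in compiled code.

```python
import numpy as np


def softmax_builtin_sum(scores):
    """Normalise exp(scores) using Python's built-in sum (element-by-element loop)."""
    prob_values = np.exp(scores)
    prob_values /= sum(prob_values)
    return prob_values


def softmax_np_sum(scores):
    """Normalise exp(scores) using np.sum (vectorized reduction)."""
    prob_values = np.exp(scores)
    prob_values /= np.sum(prob_values)
    return prob_values


# Hypothetical stand-in for the hidden -> output scores of a 10,000-word vocabulary.
rng = np.random.default_rng(0)
scores = rng.standard_normal(10_000)

p_builtin = softmax_builtin_sum(scores)
p_numpy = softmax_np_sum(scores)

# Both variants should agree up to floating-point rounding.
print(np.allclose(p_builtin, p_numpy))
```

Timing the two functions (for example with `timeit`) on a vocabulary-sized array is one way to check the runtime gain on a given machine.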

diff --git a/gensim/test/test_word2vec.py b/gensim/test/test_word2vec.py
@@ -875,6 +875,16 @@ def test_predict_output_word(self):
         model_without_neg = word2vec.Word2Vec(sentences, min_count=1, negative=0)
         self.assertRaises(RuntimeError, model_without_neg.predict_output_word, ['system', 'human'])
 
+        # passing indices instead of words in context
+        str_context = ['system', 'human']
+        mixed_context = [model_with_neg.wv.get_index(str_context[0]), str_context[1]]
+        idx_context = [model_with_neg.wv.get_index(w) for w in str_context]
+        prediction_from_str = model_with_neg.predict_output_word(str_context, topn=5)
+        prediction_from_mixed = model_with_neg.predict_output_word(mixed_context, topn=5)
+        prediction_from_idx = model_with_neg.predict_output_word(idx_context, topn=5)
+        self.assertEqual(prediction_from_str, prediction_from_mixed)
+        self.assertEqual(prediction_from_str, prediction_from_idx)
+
     def test_load_old_model(self):
         """Test loading an old word2vec model of indeterminate version"""