New KeyedVectors.vectors_for_all method for vectorizing all words in a dictionary #3157
Conversation
As before, I appreciate the general utility of this method. Regarding the name, my first impression is that

I don't think always sorting is a good default, given that sets of word-vectors are most often pre-sorted in decreasing frequency, which proves useful when users want just a head-subset of results and often gives a noticeable cache-warmness benefit in typical lookup patterns (with all the frequent words clustered near the front). The default could instead be never-sort (let the caller's chosen ordering specify exactly the order they want), with lexical sorting as a non-default option. (Similarly, I suspect a uniqueness check could be left to the caller, with whatever oddness is created by a non-unique list of keys being survivable.) Also: given the
Thank you. In contrast to the original proposal, I moved the method from
If we only took a subset of existing word vectors, then
That is a good point. I was aiming at reproducibility, since we are enforcing uniqueness using OrderedDict:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys([3, 2, 2, 3, 2, 1]))
[3, 2, 1]
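For reference, in Python 3.7+ the built-in dict also preserves insertion order, so an equivalent order-preserving deduplication can be written without importing OrderedDict:

>>> list(dict.fromkeys([3, 2, 2, 3, 2, 1]))
[3, 2, 1]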
Since we are already doing a single pass over the entire iterable and we want to know the size of the vocabulary before we start building the
At the moment, the lack of this functionality is explicitly noted in the docstring. Since we are not only subsetting word vectors, but also possibly inferring new ones, there is no guarantee that all word vectors in the resulting

def fit(self, dictionary: Union[Iterable, Dictionary], allow_inference: bool = True, copy_vecattrs: bool = False):
    # ... (elided: the new KeyedVectors `kv` is built here for the requested words)
    if copy_vecattrs:
        for attr in self.expandos:
            try:
                val = self.get_vecattr(key, attr)
                kv.set_vecattr(key, attr, val)
            except KeyError:
                continue
    return kv
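For a sense of how the copy_vecattrs flag would be exercised once the method landed under the name vectors_for_all, here is a minimal sketch; it assumes an existing KeyedVectors instance wv whose words carry a 'count' attribute, as Word2Vec-trained vectors typically do:

# Sketch: subset an existing KeyedVectors `wv` and carry over per-word
# attributes such as frequency counts stored under the 'count' attribute.
subset = wv.vectors_for_all(['human', 'interface'], copy_vecattrs=True)
assert subset.get_vecattr('human', 'count') == wv.get_vecattr('human', 'count')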
It is true that
Force-pushed from 686945a to 9ebe808.
@piskvorky, @gojomo Following the discussion, the
Unless we want to rename the method¹, or remove the support for

¹ @gojomo suggested
² @piskvorky suggested the removal in #3157 (comment). I argue that the use of
Force-pushed from a8f9550 to e5a9a31.
@gojomo @mpenkov Thank you again for the many thoughtful suggestions raised in the reviews. I will implement them in bulk once I have drafted the demo that I promised to @piskvorky in #3146 (comment) and which I am behind schedule on.
Co-authored-by: Michael Penkov <[email protected]>
I have implemented all suggestions from the reviews. Please let me know if there are any other changes to make before merging.
gensim/test/test_fasttext.py
Outdated
vectors_for_all['an out-of-vocabulary word']
- vectors_for_all['responding']
)
self.assertGreater(greater_distance, smaller_distance)
Suggested change:
- self.assertGreater(greater_distance, smaller_distance)
+ assert greater_distance > smaller_distance
gensim/test/test_fasttext.py
Outdated
expected = self.test_model.wv['responding']
predicted = vectors_for_all['responding']
self.assertTrue(np.allclose(expected, predicted))
Suggested change:
- self.assertTrue(np.allclose(expected, predicted))
+ assert np.allclose(expected, predicted)
gensim/test/test_keyedvectors.py
Outdated
expected = self.vectors['conflict']
predicted = vectors_for_all['conflict']
self.assertTrue(np.allclose(expected, predicted))
Suggested change:
- self.assertTrue(np.allclose(expected, predicted))
+ assert np.allclose(expected, predicted)
Looks good! Thank you @Witiko!
When searching for similar word embeddings using the KeyedVectors.most_similar() method, we often have a dictionary that limits the number of words that we would like to consider and, for subword models such as FastText that enable word vector inference, also expands the number of words that we would like to consider.

This is also true in document similarity measures that use word embeddings as a source of word similarity, such as the Soft Cosine Measure. In the Soft Cosine Measure, the first step is the construction of a word similarity matrix. The word similarity matrix models a dictionary that will often be different from the vocabulary of the word embeddings. The word similarity matrix is sparse and uses the topn parameter of the KeyedVectors.most_similar() method to control how many closest words will be considered for each word. However, if the overlap between our dictionary and the vocabulary of the word embeddings is small, the KeyedVectors.most_similar() method will consistently return fewer than topn closest words from the dictionary and the matrix will be much sparser than it would otherwise have been. This leads to a lack of control and possibly weaker models.
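For concreteness, a minimal sketch (not part of this PR) of the matrix-construction step described above, using gensim's toy common_texts corpus and illustrative parameter values:

from gensim.corpora import Dictionary
from gensim.models import Word2Vec
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex
from gensim.test.utils import common_texts

model = Word2Vec(common_texts, vector_size=20, min_count=1)    # toy word embeddings
dictionary = Dictionary(common_texts)                          # the dictionary we actually care about
similarity_index = WordEmbeddingSimilarityIndex(model.wv)      # word similarities backed by most_similar()
# nonzero_limit caps how many closest words are kept per word in the sparse matrix
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, nonzero_limit=100)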
Proposed solution

The solution @gojomo and I discussed in #3146 (comment) is to have a KeyedVectors.vectors_for_all(words) method that would take an iterable of words (a dictionary) and produce a new KeyedVectors object that would contain vectors only for the requested words. In subword models such as FastText, vectors for words outside the vocabulary would be inferred. This would guarantee that all topn words retrieved by the KeyedVectors.most_similar() method originated from our dictionary.

Here is an example usage, which this PR also adds to the documentation:
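The exact listing from the PR is not reproduced in this excerpt; a minimal sketch of the intended usage, assuming a FastText model trained on gensim's toy common_texts corpus, might look as follows:

from gensim.models import FastText
from gensim.test.utils import common_texts

model = FastText(common_texts, vector_size=20, min_count=1)     # toy subword-aware model
words = ['human', 'computer', 'an out-of-vocabulary word']      # our dictionary
word_vectors = model.wv.vectors_for_all(words)                  # subset + infer missing vectors

assert 'an out-of-vocabulary word' in word_vectors              # inferred from character n-grams
print(word_vectors.most_similar('human', topn=2))               # neighbours come only from `words`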