New KeyedVectors.vectors_for_all method for vectorizing all words in a dictionary #3157
Conversation
As before, I appreciate the general utility of this method. Regarding the name, my first impression is that

I don't think always sorting is a good default, given that sets of word-vectors are most often pre-sorted in decreasing frequency, which proves useful when users want just a head-subset of results and often gives a noticeable cache-warmness benefit in typical lookup patterns (with all the frequent words clustered near the front). The default could instead be never-sort (let the caller's chosen ordering specify exactly the order they want), with lexical sorting as a non-default option. (Similarly, I suspect a uniqueness check could be left to the caller, with whatever oddness is created by a non-unique list of keys being survivable.) Also: given the
Thank you. In contrast to the original proposal, I moved the method from
If we only took a subset of existing word vectors, then
That is a good point. I was aiming at reproducibility, since we are enforcing uniqueness using OrderedDict:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys([3, 2, 2, 3, 2, 1]))
[3, 2, 1]
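For reference, in Python 3.7+ the built-in dict also preserves insertion order, so an equivalent order-preserving deduplication can be written without importing OrderedDict:

>>> list(dict.fromkeys([3, 2, 2, 3, 2, 1]))
[3, 2, 1]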
Since we are already doing a single pass over the entire iterable and we want to know the size of the vocabulary before we start building the
At the moment, the lack of this functionality is explicitly noted in the docstring. Since we are not only subsetting word vectors, but also possibly inferring new ones, there is no guarantee that all word vectors in the resulting

def fit(self, dictionary: Union[Iterable, Dictionary], allow_inference: bool = True, copy_vecattrs: bool = False):
    # ... (elided: the new KeyedVectors `kv` is built here for the requested words)
    if copy_vecattrs:
        for attr in self.expandos:
            try:
                val = self.get_vecattr(key, attr)
                kv.set_vecattr(key, attr, val)
            except KeyError:
                continue
    return kv
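For a sense of how the copy_vecattrs flag would be exercised once the method landed under the name vectors_for_all, here is a minimal sketch; it assumes an existing KeyedVectors instance wv whose words carry a 'count' attribute, as Word2Vec-trained vectors typically do:

# Sketch: subset an existing KeyedVectors `wv` and carry over per-word
# attributes such as frequency counts stored under the 'count' attribute.
subset = wv.vectors_for_all(['human', 'interface'], copy_vecattrs=True)
assert subset.get_vecattr('human', 'count') == wv.get_vecattr('human', 'count')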
It is true that
Force-pushed from 686945a to 9ebe808.
@piskvorky, @gojomo Following the discussion, the
Unless we want to rename the method¹, or remove the support for

¹ @gojomo suggested
² @piskvorky suggested the removal in #3157 (comment). I argue that the use of
Force-pushed from a8f9550 to e5a9a31.
@gojomo @mpenkov Thank you again for the many thoughtful suggestions raised in the reviews. I will implement them in bulk once I have drafted the demo that I promised to @piskvorky in #3146 (comment) and which I am behind schedule on.
Co-authored-by: Michael Penkov <[email protected]>
I have implemented all suggestions from the reviews. Please let me know if there are any other changes to make before merging.
gensim/test/test_fasttext.py
Outdated
vectors_for_all['an out-of-vocabulary word']
- vectors_for_all['responding']
)
self.assertGreater(greater_distance, smaller_distance)
Suggested change:
- self.assertGreater(greater_distance, smaller_distance)
+ assert greater_distance > smaller_distance
gensim/test/test_fasttext.py
Outdated
expected = self.test_model.wv['responding']
predicted = vectors_for_all['responding']
self.assertTrue(np.allclose(expected, predicted))
Suggested change:
- self.assertTrue(np.allclose(expected, predicted))
+ assert np.allclose(expected, predicted)
gensim/test/test_keyedvectors.py
Outdated
expected = self.vectors['conflict']
predicted = vectors_for_all['conflict']
self.assertTrue(np.allclose(expected, predicted))
Suggested change:
- self.assertTrue(np.allclose(expected, predicted))
+ assert np.allclose(expected, predicted)
Looks good! Thank you @Witiko!
When searching for similar word embeddings using the KeyedVectors.most_similar() method, we often have a dictionary that limits the number of words that we would like to consider and, for subword models such as FastText that enable word vector inference, also expands the number of words that we would like to consider.

This is also true in document similarity measures that use word embeddings as a source of word similarity, such as the Soft Cosine Measure. In the Soft Cosine Measure, the first step is the construction of a word similarity matrix. The word similarity matrix models a dictionary that will often be different from the vocabulary of the word embeddings. The word similarity matrix is sparse and uses the topn parameter of the KeyedVectors.most_similar() method to control how many closest words will be considered for each word. However, if the overlap between our dictionary and the vocabulary of the word embeddings is small, the KeyedVectors.most_similar() method will consistently return fewer than topn closest words from the dictionary and the matrix will be much sparser than it would otherwise have been. This leads to a lack of control and possibly weaker models.
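For concreteness, a minimal sketch (not part of this PR) of the matrix-construction step described above, using gensim's toy common_texts corpus and illustrative parameter values:

from gensim.corpora import Dictionary
from gensim.models import Word2Vec
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex
from gensim.test.utils import common_texts

model = Word2Vec(common_texts, vector_size=20, min_count=1)    # toy word embeddings
dictionary = Dictionary(common_texts)                          # the dictionary we actually care about
similarity_index = WordEmbeddingSimilarityIndex(model.wv)      # word similarities backed by most_similar()
# nonzero_limit caps how many closest words are kept per word in the sparse matrix
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, nonzero_limit=100)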
Proposed solution

The solution @gojomo and I discussed in #3146 (comment) is to have a KeyedVectors.vectors_for_all(words) method that would take an iterable of words (a dictionary) and produce a new KeyedVectors object that would contain vectors only for the requested words. In subword models such as FastText, vectors for words outside the vocabulary would be inferred. This would guarantee that all topn words retrieved by the KeyedVectors.most_similar() method originated from our dictionary.

Here is an example usage, which this PR also adds to the documentation:
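The exact listing from the PR is not reproduced in this excerpt; a minimal sketch of the intended usage, assuming a FastText model trained on gensim's toy common_texts corpus, might look as follows:

from gensim.models import FastText
from gensim.test.utils import common_texts

model = FastText(common_texts, vector_size=20, min_count=1)     # toy subword-aware model
words = ['human', 'computer', 'an out-of-vocabulary word']      # our dictionary
word_vectors = model.wv.vectors_for_all(words)                  # subset + infer missing vectors

assert 'an out-of-vocabulary word' in word_vectors              # inferred from character n-grams
print(word_vectors.most_similar('human', topn=2))               # neighbours come only from `words`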