Add get_sentence_vector() to FastText and get_mean_vector() to KeyedVectors #3188
    all_keys, mean = set(), []
    for key, weight in positive + negative:
        if isinstance(key, ndarray):
            mean.append(weight * key)
        else:
            mean.append(weight * self.get_vector(key, norm=True))
            if self.has_index_for(key):
                all_keys.add(self.get_index(key))

@gojomo, @piskvorky I am trying to refactor the
For existing functions, let's try to keep backward compatibility = support the same input types. For new functions, we have freedom. I have no strong preference either way. Supporting the same input types makes sense (consistency) but if it makes the code too ugly / too much work, we can simplify the function signature.
Yeah, we can do this by some simple changes in
@piskvorky, I have updated the PR.
Something to do with I restarted the jobs now, maybe that helps. EDIT: nope :(
Not sure. I've merged develop into this PR, let's see if that helps.
Resolve merge conflicts in most_similar
2acc5b0 to 330e5c2
@piskvorky, I have rebased the branch with
LGTM – @mpenkov did you figure out why some tests are failing? I'd like to get a "green light" before merging. Thanks.
Codecov Report
@@            Coverage Diff             @@
##           develop    #3188     +/-   ##
===========================================
- Coverage    81.43%   81.38%   -0.05%
===========================================
  Files          122      122
  Lines        21052    21090     +38
===========================================
+ Hits         17144    17165     +21
- Misses        3908     3925     +17
Merging. Thank you for your contribution and your patience, @rock420!
If I may make an observation: as a user of Facebook's FastText library, I was looking for the equivalent of Facebook's Here is what I mean:
Gensim's get_mean_vector() docstring (
So, when I first tried out Gensim's
That's a good point about likely confusion. However, Gensim typically lets users choose their tokenization, and even use tokens with internal whitespace if desired (including in That creates an especially tricky choice here: do we match the Facebook FastText operation, or Gensim conventions? Given the exact same name, it might've made sense to exactly mimic the string-style invocation in But given we shipped tokens-style already, probably the most helpful thing we could do would be to add a loud warning (or possibly even an error) with explanatory text whenever a plain string is provided, reminding the user to perform their own
We definitely want to accept tokens – like everywhere else in Gensim. +1 on throwing an exception when users provide a single string where a sequence of strings was expected. IIRC we already do such "input sanity checks" in other places in Gensim, so it should be straightforward to extend them to
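To make the proposal above concrete, here is a minimal sketch of the kind of input sanity check being discussed. The helper name `check_token_sequence` and its error message are hypothetical and are not taken from Gensim; the sketch only illustrates why a bare string is a problem (it is itself iterable, so it would silently be read as single-character tokens) and how a guard might reject it.

```python
def check_token_sequence(sentence):
    """Hypothetical guard: reject a bare string where a sequence of tokens is expected."""
    if isinstance(sentence, (str, bytes)):
        raise TypeError(
            "expected a sequence of tokens, got a single string; "
            "tokenize it first, for example with sentence.split()"
        )
    return list(sentence)


# A bare string is iterable, so without the guard it would be treated
# as a sequence of single-character "tokens".
chars = list("hello world")

# A pre-tokenized sentence passes through unchanged.
tokens = check_token_sequence("hello world".split())
```

Whether such a check should warn or raise is the open question in the thread; the sketch raises, which matches the "+1 on throwing an exception" comment.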
Fixes: #3015
This PR provides gensim support for the get_sentence_vector() method. It also implements get_mean_vector() under KeyedVectors as a general method to get the average word vectors from a list of keys.
Tasks to complete -
- get_mean_vector() under KeyedVectors.
- get_sentence_vector() under FastTextKeyedVectors.
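
As an illustration of the idea behind the two tasks, the following is a rough, dependency-free sketch and not the code merged in this PR. The table `TOY_VECTORS` and the functions `l2_normalize`, `mean_vector`, and `sentence_vector` are hypothetical names. Normalizing each token vector to unit length before averaging is an assumption made for the example; the PR's actual defaults may differ.

```python
import math

# Hypothetical toy embedding table; real models hold far larger arrays.
TOY_VECTORS = {
    "hello": [3.0, 4.0],
    "world": [0.0, 2.0],
}


def l2_normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector untouched."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    if not vectors:
        raise ValueError("cannot average an empty list of vectors")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def sentence_vector(tokens, table=TOY_VECTORS):
    """Average the unit-normalized vectors of the tokens (an assumed convention)."""
    return mean_vector([l2_normalize(table[token]) for token in tokens])


# Plain average of the raw vectors for a list of keys.
plain_mean = mean_vector([TOY_VECTORS[key] for key in ["hello", "world"]])

# Average after normalizing each token vector to unit length.
sent_vec = sentence_vector(["hello", "world"])
```

The two results differ because normalization removes the influence of vector length, so long vectors no longer dominate the average.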