Add ns_exponent parameter to control the negative sampling distribution for *2vec models. Fix #2090 #2093
gensim/models/doc2vec.py
Outdated
ns_exponent : float
The exponent used to smooth the cumulative distribution used for negative sampling.
1.0 leads to a sampling based on the frequency distribution, 0.0 makes items beings sampled equally,
while a negative value makes unpopular items being sampled more often than popular onces. The default value
For clarity, grammar, and to give a hint of when this could be beneficially tuned, I'd reword as:
"The exponent used to shape the negative sampling distribution. A value of 1.0 samples exactly in proportion to the frequencies, 0.0 samples all words equally, while a negative value samples low-frequency words more than high-frequency words. The popular default value of 0.75 was chosen by the original Word2Vec paper. More recently, in https://arxiv.org/abs/1804.04212, Caselles-Dupré, Lesaint, & Royo-Letelier suggest that other values may perform better for recommendation applications."
gensim/models/fasttext.py
Outdated
The exponent used to smooth the cumulative distribution used for negative sampling.
1.0 leads to a sampling based on the frequency distribution, 0.0 makes items beings sampled equally,
while a negative value makes unpopular items being sampled more often than popular onces. The default value
is empirically set to 0.75 following the original paper of Word2Vec.
Same as above.
Code looks good; I've suggested rewording the comment for clarity & to give a hint/pointer why this might be changed.
Hello, @gojomo. I've made the adjustments. Thank you for preparing a better text to document this parameter.
It seems the build failed with some kind of timeout when running the tests for Python 3.5. After all, it was passing before I simply changed the docs.
Looks like it was a spurious failure unrelated to your commits; I forced a retry and it succeeded.
LGTM, @fernandocamargoti please resolve merge conflict and I'll merge your PR
…ature/negative_sampling_distribution_parameter # Conflicts: # gensim/models/doc2vec.py # gensim/models/fasttext.py # gensim/models/word2vec.py
Done, @menshikh-iv.
@fernandocamargoti nice work, congrats on your first contribution 👍
Fixes #2090
Description:
As pointed out in the following article, the negative sampling distribution exponent, which is fixed at 0.75 in Gensim, is worth tuning, especially for applications beyond NLP. So it'd be very helpful to make it a parameter of Word2Vec instead of hard-coding it.
https://arxiv.org/abs/1804.04212
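
To make the effect of the exponent concrete, here is a minimal sketch in plain Python. It is not Gensim's implementation; it only assumes that each word's negative-sampling probability is proportional to its count raised to the exponent, as the docstring in this PR describes. The function name `negative_sampling_probs` and the toy counts are hypothetical, chosen for illustration.

```python
def negative_sampling_probs(counts, ns_exponent=0.75):
    """Return {word: probability}, with probability proportional to count ** ns_exponent.

    Illustrative only: assumes every count is positive, which is needed
    for negative exponents to be well defined.
    """
    weights = {word: count ** ns_exponent for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Hypothetical toy vocabulary with very uneven frequencies.
counts = {"the": 1000, "cat": 100, "axolotl": 10}

for exponent in (1.0, 0.75, 0.0, -0.5):
    probs = negative_sampling_probs(counts, exponent)
    print(exponent, {word: round(p, 3) for word, p in probs.items()})
```

With an exponent of 1.0 the probabilities follow the raw frequencies, 0.0 makes them uniform, and a negative value favours the rarest word, which is the range of behaviour the new parameter is meant to expose.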