Add `ns_exponent` parameter to control the negative sampling distribution for `*2vec` models. Fix #2090 #2093

fernandocamargoai · 2018-06-15T19:36:17Z

Fixes #2090

Description:

As pointed out in the following article, the negative sampling distribution exponent, which is fixed at 0.75 in Gensim, is worth tuning, especially for applications beyond NLP. So it would be very helpful to make it a parameter of Word2Vec instead of hard-coding it.

https://arxiv.org/abs/1804.04212

gojomo · 2018-06-18T18:02:04Z

gensim/models/doc2vec.py

+        ns_exponent : float
+            The exponent used to smooth the cumulative distribution used for negative sampling.
+            1.0 leads to a sampling based on the frequency distribution, 0.0 makes items beings sampled equally,
+            while a negative value makes unpopular items being sampled more often than popular onces. The default value


For clarity, grammar, and to give a hint of when this could be beneficially tuned, I'd reword as:

"The exponent used to shape the negative sampling distribution. A value of 1.0 samples exactly in proportion to the frequencies, 0.0 samples all words equally, while a negative value samples low-frequency words more than high-frequency words. The popular default value of 0.75 was chosen by the original Word2Vec paper. More recently, in https://arxiv.org/abs/1804.04212, Caselles-Dupré, Lesaint, & Royo-Letelier suggest that other values may perform better for recommendation applications."
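To make the effect of the exponent concrete, here is a minimal sketch, assuming the sampling probability of each word is proportional to its count raised to `ns_exponent`. It is not gensim's actual implementation; the helper `negative_sampling_probs` and the toy counts are hypothetical and exist only for illustration.

```python
def negative_sampling_probs(counts, ns_exponent=0.75):
    """Return {word: probability}, with probability proportional to count ** ns_exponent.

    Assumes every count is positive; a zero count combined with a negative
    exponent would raise ZeroDivisionError.
    """
    weights = {word: count ** ns_exponent for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Hypothetical toy frequencies, chosen only for illustration.
toy_counts = {"the": 1000, "cat": 100, "axolotl": 1}

# 1.0 follows the raw frequencies, 0.0 is uniform,
# and a negative value favors the rare words.
for exponent in (1.0, 0.75, 0.0, -0.5):
    probs = negative_sampling_probs(toy_counts, exponent)
    print(exponent, {word: round(p, 4) for word, p in probs.items()})
```

Running the loop shows the probability mass shifting from the frequent word toward the rare one as the exponent decreases, which is the behavior the reworded docstring describes.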

gojomo · 2018-06-18T18:02:18Z

gensim/models/fasttext.py

+            The exponent used to smooth the cumulative distribution used for negative sampling.
+            1.0 leads to a sampling based on the frequency distribution, 0.0 makes items beings sampled equally,
+            while a negative value makes unpopular items being sampled more often than popular onces. The default value
+            is empirically set to 0.75 following the original paper of Word2Vec.


Same as above.

gojomo · 2018-06-18T18:03:09Z

Code looks good; I've suggested rewording the comment for clarity & to give a hint/pointer why this might be changed.

fernandocamargoai · 2018-06-18T20:10:12Z

Hello, @gojomo. I've made the adjustments. Thank you for preparing a better text to document this parameter.

fernandocamargoai · 2018-06-19T11:39:12Z

It seems the build failed with some kind of timeout while running the tests for Python 3.5. It was passing before, and my latest commit only changed the docs.

gojomo · 2018-06-19T16:17:53Z

Looks like it was a spurious failure unrelated to your commits; I forced a retry and it succeeded.

menshikh-iv · 2018-06-21T04:12:07Z

LGTM. @fernandocamargoti, please resolve the merge conflict and I'll merge your PR.

fernandocamargoai · 2018-06-21T16:55:01Z

Done, @menshikh-iv.

menshikh-iv · 2018-06-22T01:02:37Z

@fernandocamargoti nice work, congrats on your first contribution 👍

fernandocamargoai added 2 commits June 15, 2018 16:32

Adding ns_exponent parameter to control the negative sampling distrib…

35047b9

…ution.

Fixed a code style problem.

5d45235

fernandocamargoai mentioned this pull request Jun 18, 2018

What change in word2vec.py from gensim origin code ? anonymous-authors-recsys/w2v_reco_hyperparameters_matter#1

Open

gojomo reviewed Jun 18, 2018

gojomo requested review from piskvorky and menshikh-iv June 18, 2018 18:03

Updated the documentation of the ns_exponent parameter.

b860c50

piskvorky approved these changes Jun 19, 2018

menshikh-iv approved these changes Jun 21, 2018

Merge branch 'develop' of github.com:RaRe-Technologies/gensim into fe…

4c72455

…ature/negative_sampling_distribution_parameter # Conflicts: # gensim/models/doc2vec.py # gensim/models/fasttext.py # gensim/models/word2vec.py

menshikh-iv changed the title ~~Adding ns_exponent parameter to control the negative sampling distribution~~ Adding ns_exponent parameter to control the negative sampling distribution for *2vec models. Fix #2090 Jun 22, 2018

menshikh-iv changed the title ~~Adding ns_exponent parameter to control the negative sampling distribution for *2vec models. Fix #2090~~ Add ns_exponent parameter to control the negative sampling distribution for *2vec models. Fix #2090 Jun 22, 2018

menshikh-iv merged commit 76d194b into piskvorky:develop Jun 22, 2018

fernandocamargoai deleted the feature/negative_sampling_distribution_parameter branch June 22, 2018 13:15

gojomo mentioned this pull request Oct 20, 2020

Faster evaluation metrics (baked into the library?) #2986

Closed
