-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cbow_mean default changed from 0 to 1. #538
Conversation
It's a change in welcome direction, but I'd prefer this to be a pull request that fixes the defaults fully, across variables, as discussed in #534 . Piece-meal updates will only make the switch more painful and confusing. |
OK. I will try to return to this in a few days. Should all Gensim's word2vec variables be aligned with Mikolov's tool, or only in the main()? |
Thanks @akutuzov ! I think we should keep the current names in the code, just change their default values. But the |
Hey @akutuzov do you have any updates? Planning a release and would be good to include this. |
Unfortunately not yet. |
17 Jan is the tentative date |
I see. Then I hope to finish this by 15 Jan. |
Conflicts: CHANGELOG.txt
I changed the default values of all hyperparameters to match those of Mikolov's wordvec. The only exception is default hierarchical softmax (word2vec has default negative sampling of 5 samples). If I change Gensim defaults to negative sampling, it immediately breaks a number of tests: particularly, word2vec training (which seems to rely on syn1 matrix) and scoring (which is not implemented for negative sampling at all). Thus, I left default hierarchical softmax as is. I also changed the defaults in (main) and will implement CLI arguments processing mimicking those of word2vec in the next couple of days. Not quite sure why checks are failed on this PR, and what conflicts it mentions. The code passes all the tests here. |
Conflicts: CHANGELOG.txt gensim/models/word2vec.py
… default vector size is 100, not 200).
@tmylk, I am finished with this. Gensim word2vec defaults are now consistent with Mikolov's C tool (except for hierarchical softmax, see above), and the (main) now accepts the same command line arguments as the C tool. |
Thanks for the contribution! FYI, the default size is 100 before and after, so that's not specifically the cause for the test failure you saw. Rather, that's been an ornery test, more likely to fail on some systems (including the CI servers) than others. See #531 and potential improvement in PR #581. Note that merely loosening the tolerance only helps a little, leaving the potential for outliers over the new threshold – whereas I think the changes is #581 will more effectively tighten the range of results. Other comments coming in line notes. |
@@ -342,8 +342,8 @@ class Word2Vec(utils.SaveLoad): | |||
""" | |||
def __init__( | |||
self, sentences=None, size=100, alpha=0.025, window=5, min_count=5, | |||
max_vocab_size=None, sample=0, seed=1, workers=1, min_alpha=0.0001, | |||
sg=1, hs=1, negative=0, cbow_mean=0, hashfxn=hash, iter=1, null_word=0, | |||
max_vocab_size=None, sample=1e-3, seed=1, workers=12, min_alpha=0.0001, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since our multithreading doesn't parallelize as effectively as the C code (and in fact hits a point of diminishing returns even before reaching the CPU-core count), the workers
parameter is one exception where we should pick a different value. I suggest 3 as a value likely to help without causing issues.
@tmylk, seems OK for me. |
PR #538 Word2vec defaults changed + a stub for test
Merged in #593 - needed some changes to run. |
Conflicts: CHANGELOG.txt gensim/models/word2vec.py gensim/scripts/word2vec_standalone.py
Conflicts: CHANGELOG.md README.md gensim/models/word2vec.py tutorials.md
…y judgments datasets.
…1047) * Update CHANGELOG.txt * Update CHANGELOG.txt * cbow_mean default changed from 0 to 1. * Hyperparameters' default values are aligned with Mikolov's word2vec. * Fix for #538: cbow_mean default changed from 0 to 1. * Update changelog * (main) defaults aligned to Mikolov's word2vec. * word2vec (main) now mimics command-line arguments for Mikolov's word2vec. * Fix for #538 * Fix for #538 (tabs and spaces). * Fix for #538 (tests). * For #538: slightly relaxed sanity check demands (because now default vector size is 100, not 200). * Fixes as per @gojomo comments. * Test fixes due to negative sampling becoming default behavior. * Commented out tests which work for HS only. * Fix for #538. * Yet another fix. * Merging. * Fix for CBOW test. * Changelog mention of #538 * Fix for CBOW negative sampling tests. * Factoring out word2vec _main__ into gensim/scripts * Use logger instead of logging. * Made Changelog less verbose about word2vec defaults changed. * Fixes to word2vec_standalone.py as per Radim's comments. * Alpha argument. with different defaults for CBOW ans skipgram. * Release version typo fix * 'fisrt_push' * Finalizing. * Initial shippable release * Evaluation function to measure model correlation with human similarity judgments datasets. * Updating semantic similarity evaluation. * Scipy stats import * Evaluation function to measure model correlation with human similarity judgments datasets. * Remove unneccessary. * Changing the neame of the word pairs evaluation function.
Taking average from input vectors in CBOW mode gives better results than their sum (see https://docs.google.com/spreadsheets/d/1dgr513AePh4EjCUQxQyeT9i6Xnig3SOtjLJwiIVmuu4 for experiments). It is also the default setting in Mikolov's word2vec. Therefore, cbow_mean=1 should be default in Gensim as well.