Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cbow_mean default changed from 0 to 1. #538

Closed
wants to merge 50 commits into from

Conversation

akutuzov
Copy link
Contributor

Taking average from input vectors in CBOW mode gives better results than their sum (see https://docs.google.com/spreadsheets/d/1dgr513AePh4EjCUQxQyeT9i6Xnig3SOtjLJwiIVmuu4 for experiments). It is also the default setting in Mikolov's word2vec. Therefore, cbow_mean=1 should be default in Gensim as well.

@piskvorky
Copy link
Owner

It's a change in welcome direction, but I'd prefer this to be a pull request that fixes the defaults fully, across variables, as discussed in #534 .

Piece-meal updates will only make the switch more painful and confusing.

@akutuzov
Copy link
Contributor Author

OK. I will try to return to this in a few days. Should all Gensim's word2vec variables be aligned with Mikolov's tool, or only in the main()?

@piskvorky
Copy link
Owner

Thanks @akutuzov !

I think we should keep the current names in the code, just change their default values. But the main() CLI names can match the C tool, sure, why not.

@tmylk
Copy link
Contributor

tmylk commented Jan 9, 2016

Hey @akutuzov do you have any updates? Planning a release and would be good to include this.

@akutuzov
Copy link
Contributor Author

akutuzov commented Jan 9, 2016

Unfortunately not yet.
What is the planned date for the release? I could try to make it in time.

@tmylk
Copy link
Contributor

tmylk commented Jan 9, 2016

17 Jan is the tentative date

@akutuzov
Copy link
Contributor Author

akutuzov commented Jan 9, 2016

I see. Then I hope to finish this by 15 Jan.

@akutuzov
Copy link
Contributor Author

I changed the default values of all hyperparameters to match those of Mikolov's wordvec. The only exception is default hierarchical softmax (word2vec has default negative sampling of 5 samples). If I change Gensim defaults to negative sampling, it immediately breaks a number of tests: particularly, word2vec training (which seems to rely on syn1 matrix) and scoring (which is not implemented for negative sampling at all). Thus, I left default hierarchical softmax as is.

I also changed the defaults in (main) and will implement CLI arguments processing mimicking those of word2vec in the next couple of days.

Not quite sure why checks are failed on this PR, and what conflicts it mentions. The code passes all the tests here.

@akutuzov
Copy link
Contributor Author

@tmylk, I am finished with this. Gensim word2vec defaults are now consistent with Mikolov's C tool (except for hierarchical softmax, see above), and the (main) now accepts the same command line arguments as the C tool.
I had to change one value in the unit test for word2vec (model_sanity function, line 238). It tested for the word 'terrorism' to be in the first 50 synonyms of the word 'war', and this was not always the case, so I changed 50 to 60. The reason is that the default vector size is now 100 (as in Mikolov's word2vec), not 200. Thus, the resulting default models are slightly worse (but smaller).

@gojomo
Copy link
Collaborator

gojomo commented Jan 15, 2016

Thanks for the contribution!

FYI, the default size is 100 before and after, so that's not specifically the cause for the test failure you saw. Rather, that's been an ornery test, more likely to fail on some systems (including the CI servers) than others. See #531 and potential improvement in PR #581. Note that merely loosening the tolerance only helps a little, leaving the potential for outliers over the new threshold – whereas I think the changes is #581 will more effectively tighten the range of results.

Other comments coming in line notes.

@@ -342,8 +342,8 @@ class Word2Vec(utils.SaveLoad):
"""
def __init__(
self, sentences=None, size=100, alpha=0.025, window=5, min_count=5,
max_vocab_size=None, sample=0, seed=1, workers=1, min_alpha=0.0001,
sg=1, hs=1, negative=0, cbow_mean=0, hashfxn=hash, iter=1, null_word=0,
max_vocab_size=None, sample=1e-3, seed=1, workers=12, min_alpha=0.0001,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since our multithreading doesn't parallelize as effectively as the C code (and in fact hits a point of diminishing returns even before reaching the CPU-core count), the workers parameter is one exception where we should pick a different value. I suggest 3 as a value likely to help without causing issues.

@tmylk
Copy link
Contributor

tmylk commented Jan 27, 2016

Standalone script manually tested successfully with minor code changes as in #593.
Automated tests for the standalone script are still work in progress in #593 .

Plan to merge in #593 tonight - Andrey, @akutuzov, if you could review it would be great.

@akutuzov
Copy link
Contributor Author

@tmylk, seems OK for me.

tmylk added a commit that referenced this pull request Jan 28, 2016
PR #538 Word2vec defaults changed + a stub for test
@tmylk
Copy link
Contributor

tmylk commented Jan 28, 2016

Merged in #593 - needed some changes to run.
Leaving it open because automated tests for the standalone scripts still need to be added. I had one go but in Travis it couldn't find the python executable, however in Windows it was ok. If anyone has seen this before, would appreciate help.

tmylk added a commit that referenced this pull request Jan 28, 2016
@piskvorky
Copy link
Owner

Thanks a lot @akutuzov ! Great work, much appreciated :)

@tmylk you're giving subprocess a single large string. My guess is it's trying to interpret that entire string as the executable name; try giving it a list of arguments instead (break the command down into its individual parts).

@akutuzov akutuzov closed this Dec 15, 2016
tmylk pushed a commit that referenced this pull request Dec 22, 2016
…1047)

* Update CHANGELOG.txt

* Update CHANGELOG.txt

* cbow_mean default changed from 0 to 1.

* Hyperparameters' default values are aligned with Mikolov's word2vec.

* Fix for #538: cbow_mean default changed from 0 to 1.

* Update changelog

* (main) defaults aligned to Mikolov's word2vec.

* word2vec (main) now mimics command-line arguments for Mikolov's word2vec.

* Fix for #538

* Fix for #538 (tabs and spaces).

* Fix for #538 (tests).

* For #538: slightly relaxed sanity check demands (because now default vector size is 100, not 200).

* Fixes as per @gojomo comments.

* Test fixes due to negative sampling becoming default behavior.

* Commented out tests which work for HS only.

* Fix for #538.

* Yet another fix.

* Merging.

* Fix for CBOW test.

* Changelog mention of #538

* Fix for CBOW negative sampling tests.

* Factoring out word2vec _main__ into gensim/scripts

* Use logger instead of logging.

* Made Changelog less verbose about word2vec defaults changed.

* Fixes to word2vec_standalone.py as per Radim's comments.

* Alpha argument. with different defaults for CBOW ans skipgram.

* Release version typo fix

* 'fisrt_push'

* Finalizing.

* Initial shippable release

* Evaluation function to measure model correlation with human similarity judgments datasets.

* Updating semantic similarity evaluation.

* Scipy stats import

* Evaluation function to measure model correlation with human similarity judgments datasets.

* Remove unneccessary.

* Changing the neame of the word pairs evaluation function.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants