cbow_mean default changed from 0 to 1. #538

akutuzov · 2015-11-23T01:38:42Z

Taking average from input vectors in CBOW mode gives better results than their sum (see https://docs.google.com/spreadsheets/d/1dgr513AePh4EjCUQxQyeT9i6Xnig3SOtjLJwiIVmuu4 for experiments). It is also the default setting in Mikolov's word2vec. Therefore, cbow_mean=1 should be default in Gensim as well.
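As a rough illustration of what the flag toggles, here is a minimal sketch in plain Python. It is not Gensim's training code, and `cbow_hidden` is a hypothetical helper name used only for this example. It combines the context word vectors either by summing them (`cbow_mean=0`) or by averaging them (`cbow_mean=1`).

```python
def cbow_hidden(context_vectors, cbow_mean=1):
    """Combine context word vectors into a single hidden-layer vector.

    Hypothetical helper for illustration only; not part of Gensim.
    """
    dim = len(context_vectors[0])
    # Element-wise sum over all context vectors.
    summed = [sum(vec[i] for vec in context_vectors) for i in range(dim)]
    if cbow_mean:
        # Averaging keeps the result on the same scale whatever the window size.
        return [x / len(context_vectors) for x in summed]
    return summed


# Toy context window of three 2-dimensional vectors.
context = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
summed = cbow_hidden(context, cbow_mean=0)    # [9.0, 12.0]
averaged = cbow_hidden(context, cbow_mean=1)  # [3.0, 4.0]
```

With the sum, the magnitude of the hidden vector grows with the number of context words; with the mean it does not.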

piskvorky · 2015-11-23T06:26:50Z

It's a change in a welcome direction, but I'd prefer this to be a pull request that fixes the defaults fully, across variables, as discussed in #534.

Piecemeal updates will only make the switch more painful and confusing.

akutuzov · 2015-11-24T18:41:54Z

OK. I will try to return to this in a few days. Should all of Gensim's word2vec variables be aligned with Mikolov's tool, or only those in main()?

piskvorky · 2015-11-25T02:47:18Z

Thanks @akutuzov !

I think we should keep the current names in the code, just change their default values. But the main() CLI names can match the C tool, sure, why not.

tmylk · 2016-01-09T20:09:25Z

Hey @akutuzov, do you have any updates? We're planning a release and it would be good to include this.

akutuzov · 2016-01-09T22:48:30Z

Unfortunately not yet.
What is the planned date for the release? I could try to make it in time.

tmylk · 2016-01-09T22:49:27Z

17 Jan is the tentative date

akutuzov · 2016-01-09T22:52:41Z

I see. Then I hope to finish this by 15 Jan.

akutuzov · 2016-01-14T00:28:53Z

I changed the default values of all hyperparameters to match those of Mikolov's word2vec. The only exception is that hierarchical softmax remains the default (word2vec defaults to negative sampling with 5 samples). If I change the Gensim defaults to negative sampling, it immediately breaks a number of tests: particularly, word2vec training (which seems to rely on the syn1 matrix) and scoring (which is not implemented for negative sampling at all). Thus, I left the hierarchical softmax default as is.

I also changed the defaults in (main) and will implement CLI argument processing mimicking that of word2vec in the next couple of days.

Not quite sure why the checks failed on this PR, or which conflicts it mentions. The code passes all the tests here.

akutuzov · 2016-01-15T01:37:59Z

@tmylk, I am finished with this. Gensim word2vec defaults are now consistent with Mikolov's C tool (except for hierarchical softmax, see above), and the (main) now accepts the same command line arguments as the C tool.
I had to change one value in the unit test for word2vec (model_sanity function, line 238). It tested for the word 'terrorism' to be in the first 50 synonyms of the word 'war', and this was not always the case, so I changed 50 to 60. The reason is that the default vector size is now 100 (as in Mikolov's word2vec), not 200. Thus, the resulting default models are slightly worse (but smaller).

gojomo · 2016-01-15T02:40:33Z

Thanks for the contribution!

FYI, the default size is 100 before and after, so that's not specifically the cause for the test failure you saw. Rather, that's been an ornery test, more likely to fail on some systems (including the CI servers) than others. See #531 and the potential improvement in PR #581. Note that merely loosening the tolerance only helps a little, leaving the potential for outliers over the new threshold – whereas I think the changes in #581 will more effectively tighten the range of results.

Other comments coming in line notes.

gojomo · 2016-01-15T02:45:10Z

gensim/models/word2vec.py

@@ -342,8 +342,8 @@ class Word2Vec(utils.SaveLoad):
    """
    def __init__(
            self, sentences=None, size=100, alpha=0.025, window=5, min_count=5,
-            max_vocab_size=None, sample=0, seed=1, workers=1, min_alpha=0.0001,
-            sg=1, hs=1, negative=0, cbow_mean=0, hashfxn=hash, iter=1, null_word=0,
+            max_vocab_size=None, sample=1e-3, seed=1, workers=12, min_alpha=0.0001,


Since our multithreading doesn't parallelize as effectively as the C code (and in fact hits a point of diminishing returns even before reaching the CPU-core count), the workers parameter is one exception where we should pick a different value. I suggest 3 as a value likely to help without causing issues.
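One way to read this suggestion is sketched below. It is only an illustration: the cap of 3 comes from the comment above, while `pick_workers` is a hypothetical helper and the clamping to the machine's core count is an added assumption, not something proposed in this thread.

```python
import os


def pick_workers(cap=3):
    """Return a worker count no larger than `cap` or the available cores.

    Hypothetical helper for illustration only; not part of Gensim.
    """
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(cap, cores))


workers = pick_workers()
```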

tmylk · 2016-01-27T06:10:48Z

Standalone script manually tested successfully with minor code changes as in #593.
Automated tests for the standalone script are still a work in progress in #593.

Plan to merge in #593 tonight - Andrey, @akutuzov, if you could review it would be great.

akutuzov · 2016-01-27T17:57:52Z

@tmylk, seems OK to me.

tmylk · 2016-01-28T16:46:05Z

Merged in #593 - needed some changes to run.
Leaving it open because automated tests for the standalone scripts still need to be added. I had one go at it: on Travis it couldn't find the Python executable, although on Windows it was OK. If anyone has seen this before, I would appreciate help.

piskvorky · 2016-01-29T00:59:52Z

Thanks a lot @akutuzov ! Great work, much appreciated :)

@tmylk you're giving subprocess a single large string. My guess is it's trying to interpret that entire string as the executable name; try giving it a list of arguments instead (break the command down into its individual parts).
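A minimal sketch of that suggestion follows, assuming a made-up one-line command rather than the real standalone-script invocation (which is not shown in this thread). On POSIX, a single string passed without `shell=True` is treated as the program name, whereas Windows hands the whole string to the OS to parse; that may explain why the same test behaved differently on the two platforms. Swapping in `sys.executable` for the interpreter is an extra precaution added here.

```python
import shlex
import subprocess
import sys

# Hypothetical single-string command, standing in for the real test command.
command = "python -c \"print('ok')\""

# Break the string into individual arguments.
args = shlex.split(command)  # ['python', '-c', "print('ok')"]

# Use the interpreter that is running this code instead of relying on
# a 'python' executable being found on PATH.
args[0] = sys.executable

result = subprocess.run(args, capture_output=True, text=True, check=True)
output = result.stdout.strip()
```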

tmylk and others added 6 commits November 5, 2015 19:07

Merge branch 'release-0.12.3rc1'

1c63c9a

Merge branch 'release-0.12.3'

280a488

Merge branch 'release-0.12.3'

ddeb002

Update CHANGELOG.txt

f2ac3a9

Update CHANGELOG.txt

cf09e8c

cbow_mean default changed from 0 to 1.

b8b8f57

akutuzov mentioned this pull request Nov 23, 2015

match word2vec.c defaults (& option names? & command-line switches?) more closely #534

Open

akutuzov added 5 commits January 13, 2016 14:41

Hyperparameters' default values are aligned with Mikolov's word2vec.

6456cbc

Merge remote-tracking branch 'upstream/master' into develop

966a4b0

Conflicts: CHANGELOG.txt

Fix for piskvorky#538: cbow_mean default changed from 0 to 1.

d9ec7e4

Update changelog

76d2df7

(main) defaults aligned to Mikolov's word2vec.

0b6f45b

akutuzov added 6 commits January 14, 2016 19:16

Merge remote-tracking branch 'upstream/develop' into develop

7fb5f18

Conflicts: CHANGELOG.txt gensim/models/word2vec.py

word2vec (main) now mimics command-line arguments for Mikolov's word2vec.

bc7a447

Fix for piskvorky#538

e689b4f

Fix for piskvorky#538 (tabs and spaces).

a5274ab

Fix for piskvorky#538 (tests).

5c32ca8

For piskvorky#538: slightly relaxed sanity check demands (because now default vector size is 100, not 200).

ac889b3

gojomo mentioned this pull request Jan 15, 2016

threadsafe job counts; configurable batch_size #581

Merged

gojomo reviewed Jan 15, 2016

akutuzov added 4 commits January 27, 2016 17:30

Use logger instead of logging.

8a3d58b

Made Changelog less verbose about word2vec defaults changed.

c5249b9

Fixes to word2vec_standalone.py as per Radim's comments.

a40e624

Alpha argument. with different defaults for CBOW ans skipgram.

dbd0eab

tmylk added a commit that referenced this pull request Jan 28, 2016

Merge pull request #593 from piskvorky/pr_538_tests

8b26889

PR #538 Word2vec defaults changed + a stub for test

tmylk added a commit that referenced this pull request Jan 28, 2016

Shortened #538 description

518687b

tmylk and others added 16 commits January 29, 2016 18:35

resolve merge conflict in Changelog

b61287a

Merge branch 'release-0.12.4' with piskvorky#596

3ade404

Merge branch 'release-0.13.0'

9e6522e

Merge branch 'release-0.13.0'

87c4e9c

Release version typo fix

9c74b40

Merge branch 'release-0.13.0rc1'

7b30025

Merge branch 'release-0.13.0'

de79c8e

Merge branch 'release-0.13.1'

d4f9cc5

Merge remote-tracking branch 'upstream/master' into develop

e0627c6

Conflicts: CHANGELOG.txt gensim/models/word2vec.py gensim/scripts/word2vec_standalone.py

Finalizing.

b8b30c2

'fisrt_push'

f3f2a52

Initial shippable release

873f184

Merge remote-tracking branch 'upstream/develop' into develop

68a3e86

Conflicts: CHANGELOG.md README.md gensim/models/word2vec.py tutorials.md

Evaluation function to measure model correlation with human similarity judgments datasets.

498474d

Updating semantic similarity evaluation.

ce64d5a

Scipy stats import

0936971

akutuzov closed this Dec 15, 2016

akutuzov mentioned this pull request Dec 16, 2016

Evaluation of word2vec models against semantic similarity datasets #1047

Merged

pabs3 mentioned this pull request Apr 2, 2022

Enable test_word2vec_stand_alone_script by using sys.executable for python #3299

Merged
