[WIP][DNM] error-resistant train(). Fix #1052 #1139

gojomo · 2017-02-09T06:08:23Z

Changes train() to require an explicit epochs argument and an explicit corpus size (either total_examples or total_words), thus forcing users to be conscious of train()'s behavior in these regards, and (intentionally) breaking older code that might be operating under misconceptions.

Tests updated to be as explicit as needed.

To do:

Update any notebooks using train()

Also a potential related change: add an optional epoch_callback argument to train(), that could be called after each training epoch. That could completely obviate the need for users/tutorials running train() in their own loop to do extra steps (such as evaluation) before training is done.
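
As a rough illustration of the epoch_callback idea, here is a minimal sketch. It uses a toy stand-in function, not gensim's actual train(); the name `train_with_callback`, the callback signature (it receives only the epoch index), and the placeholder per-example work are all assumptions made for this example.

```python
def train_with_callback(examples, epochs, epoch_callback=None):
    """Toy stand-in for a training loop; NOT gensim's train().

    Makes `epochs` passes over `examples` and, if a callback is given,
    calls `epoch_callback(epoch)` after each pass (hypothetical signature).
    """
    examples_seen = 0
    for epoch in range(epochs):
        for _ in examples:
            examples_seen += 1  # placeholder for the real per-example update
        if epoch_callback is not None:
            epoch_callback(epoch)  # e.g. run an evaluation step here
    return examples_seen


# Record which epochs finished, instead of wrapping the trainer in an outer loop.
completed_epochs = []
total = train_with_callback(["a b", "c d", "e f"], epochs=3,
                            epoch_callback=completed_epochs.append)
```

With a hook like this, per-epoch steps such as evaluation would live in the callback, and the caller would make a single training call with an explicit epochs value.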

robotcator · 2017-03-12T11:44:06Z

@gojomo Hello, while fixing the notebooks/docs to reflect the change to "train", I got an error in notebooks/doc2vec-IMDB.ipynb:

TypeError                                 Traceback (most recent call last)
<ipython-input-2-4bb56643bf06> in <module>()
     53 
     54         for txt in txt_files:
---> 55             with open(txt, 'r', encoding='utf-8') as t:
     56                 control_chars = [chr(0x85)]
     57                 t_clean = t.read()

TypeError: 'encoding' is an invalid keyword argument for this function

I use Ubuntu 14.04, Python 2.7.6 and gensim version 1.0.1. The error occurs because 'encoding' is an invalid keyword argument for the open function in Python 2.7. Does the code need to be changed to support Python 2?
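
For reference, one portable way to avoid this TypeError is `io.open`, which accepts `encoding` on Python 2.6+ and is the same function as the built-in `open` on Python 3. The sketch below is only an illustration and not necessarily what the notebook ended up using (the thread later mentions the codecs library); the helper name `read_text` and the temporary demo file are assumptions.

```python
import io
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole text file as unicode on both Python 2 and Python 3."""
    # io.open takes an `encoding` keyword on Python 2.6+; on Python 3 it is
    # the same function as the built-in open().
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


# Small demonstration with a temporary file containing a non-ASCII character.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(demo_path, "w", encoding="utf-8") as handle:
    handle.write(u"caf\u00e9 review")
text = read_text(demo_path)
os.remove(demo_path)
```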

tmylk · 2017-03-12T13:02:50Z

The notebooks should support both Python 2 and 3, same as our source code. Unfortunately we have been lax about it before - fixing it as a part of this release would be a great contribution.

prakhar2b · 2017-03-12T13:19:14Z

@tmylk would the Python 2 and 3 compatibility library six be a good choice here?

tmylk · 2017-03-12T13:44:50Z

Yes, gensim already uses six

robotcator · 2017-03-12T13:49:14Z

@prakhar2b would you mind guiding me through this? I think this is a good opportunity for me to get familiar with the contribution process. My initial thought was to use the codecs library.

prakhar2b · 2017-03-12T14:46:37Z

@robotcator Definitely, that would be a pleasure. I suggest you post your doubts on gensim google group, the community is very active in responding. You can also mail me directly if you want - [email protected]

tmylk · 2017-03-17T00:42:30Z

Ping @robotcator for a status update

robotcator · 2017-03-17T01:21:25Z

@tmylk sorry for the late status update. I have fixed the compatibility between Python 3 and Python 2 by using the codecs library. But the code seems to run inefficiently in Python 2, and I am stuck finding a way to optimize the Python 2 code.

tmylk · 2017-03-17T01:59:50Z

Could you please update the pr with the new code?

robotcator · 2017-03-18T00:17:18Z

@tmylk sorry for the late reply. I made a silly mistake with git. I created pull request #1220. Since I am not familiar with git, I am curious why there are so many commits and comments on my pull request.

tmylk · 2017-03-20T20:37:36Z

@robotcator Could you please add tests to make sure that ValueError is indeed thrown when params are not supplied.
Also here are the notebooks that need to be changed:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/online_w2v_tutorial.ipynb
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb
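
A minimal sketch of the kind of check being requested, written against a toy stand-in rather than gensim itself. The `train` function below only mimics the argument validation described in this PR; its error messages, its return value, and the helper `raises_value_error` are assumptions for illustration, not gensim code.

```python
def train(sentences, total_examples=None, total_words=None, epochs=None):
    """Toy stand-in that only mimics the argument checks; NOT gensim code."""
    if epochs is None:
        raise ValueError("You must specify an explicit epochs count.")
    if total_examples is None and total_words is None:
        raise ValueError("You must specify either total_examples or total_words.")
    return epochs * len(sentences)  # placeholder for real training work


def raises_value_error(**kwargs):
    """Return True if calling train() with these arguments raises ValueError."""
    try:
        train([["hello", "world"]], **kwargs)
    except ValueError:
        return True
    return False


missing_everything = raises_value_error()
missing_epochs = raises_value_error(total_examples=1)
missing_size = raises_value_error(epochs=5)
fully_explicit = raises_value_error(total_examples=1, epochs=5)
```

In a real test suite the same cases would typically be expressed with `assertRaises(ValueError, ...)` against the actual model's train() method.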

robotcator · 2017-03-22T01:52:18Z

@tmylk Of course.

robotcator · 2017-03-23T13:58:01Z

@tmylk I have created PR #1237 to make sure that ValueError is thrown when params are not supplied. I don't know why there are conflicts with the base branch.

…Fix #1052. (#1237)
* fix the compatibility between python2 & 3
* require explicit corpus size, epochs for train()
* make all train() calls use explicit count, epochs
* add tests to make sure that ValueError is indeed thrown
* update test
* fix the word2vec's reset_from()
* require explicit corpus size, epochs for train()
* make all train() calls use explicit count, epochs
* fix some error
* fix test error

tmylk · 2017-04-03T12:56:43Z

@gojomo Could you please write a note for users on how to upgrade their code for this change? It is a breaking change that warrants a major release, gensim 2.0.0.

tmylk · 2017-04-03T21:05:33Z

gensim/models/word2vec.py

@@ -448,7 +448,8 @@ def __init__(
            if isinstance(sentences, GeneratorType):
                raise TypeError("You can't pass a generator as the sentences argument. Try an iterator.")
            self.build_vocab(sentences, trim_rule=trim_rule)
-            self.train(sentences)
+            self.train(sentences, total_examples=self.corpus_count, epochs=self.iter,


Hanging indent please

gojomo · 2017-04-03T21:06:53Z

The note could be:

Any direct calls to method train() of Word2Vec/Doc2Vec now require an explicit epochs parameter and explicit estimate of corpus size (usually as total_examples). See the [method documentation] for more information.

I'm in favor of generous version-increments, but if considering a "2.0.0" release to signal a break in backward compatibility, so soon after the "1.0.0" release, we may want to look at any other proposed/pending/desirable changes in names/defaults/parameters, to include as well.

Some things that come to mind:

infer_vector() defaults & parameter names
KeyedVector property/method names still retain lots of word2vec-specific details – like word where the vectors might be more general, or syn0 when it's no longer necessarily the sort of NN-layer that has that name
Doc2Vec's DocvecsArray may be mergeable with KeyedVectors, eliminating some code duplication, adding some other flexibility (at cost of complexity) to KeyedVectors, maybe making model.wv and model.dv perfectly-parallel ways to access word-vectors-by-key and doc-vectors-by-key, respectively.

gojomo · 2017-04-03T21:12:22Z

We could use a new label to mark issues/PRs where breaks with backward-compatibility are happening (or proposed), requiring sync w/ version numbering. I've created one for that purpose, and applied it here to get things started.

tmylk · 2017-04-04T15:14:52Z

Thanks for suggesting more ways to make the API and code more streamlined.

Unfortunately SemVer adoption means that (contrary to topic modelling) "Version numbers are not for humans"

Small changes in releases are better than big releases, so if we get to version 10.1.0 by the end of the year then it is ok.

gojomo · 2017-04-04T18:30:07Z

The idea of batching together sets of breaking changes, to help minimize user upgrade efforts, seems orthogonal to SemVer adoption to me. But if you'd like to move to a more rapid tempo, sure.

That style of numbering/rapid-incremental-release might also justify eventually splitting gensim into narrow single-algorithm subprojects (with some shared core dependencies). Then small changes, not affecting most users, would only trigger a release/version-increment/need-to-review-release-notes among the users of that one module.

tmylk · 2017-04-10T22:46:14Z

Merged in #1237

piskvorky · 2017-04-11T01:35:33Z

@gojomo any suggestions where those "subproject split lines" ought to be?

gojomo · 2017-04-11T22:41:40Z

@piskvorky In the very-small, very-focused packaging style of (say) node/npm, almost every module would be a different package. There'd be one or a few core or util packages, but then nearly every 'model' would be its own package. In the extreme, even Word2Vec and Doc2Vec would be different packages (with Doc2Vec depending on Word2Vec).

Still mulling over whether I'd advocate this – it's less common in the Python world than the JS world, and may result in a bewildering variety of projects each with their own version-number – but it does mesh with rigorous SemVer and rapid, high-resolution release cycles.

gojomo added 2 commits February 8, 2017 21:59

require explicit corpus size, epochs for train()

a3b9cde

make all train() calls use explicit count, epochs

0eb9a66

gojomo changed the title ~~[WIP][DNM] error-proof train()~~ [WIP][DNM] error-resistant train() Feb 9, 2017

JamesBondAgent007 approved these changes Feb 17, 2017


gojomo mentioned this pull request Mar 6, 2017

Make Word2vec multiple passes more explicit. Fix #1052 #1183

Closed

tmylk mentioned this pull request Mar 8, 2017

Make Word2vec multiple passes more explicit #1052

Closed

tmylk changed the title ~~[WIP][DNM] error-resistant train()~~ [WIP][DNM] error-resistant train(). Fix #1052 Mar 8, 2017

robotcator mentioned this pull request Mar 18, 2017

Make doc2vec imdb ipynb tutorial run in python 2 and 3 #1220

Merged

gojomo mentioned this pull request Mar 23, 2017

Loss through each iteration in skip gram #999

Closed

robotcator mentioned this pull request Mar 24, 2017

tests for word2vec's train(). Continuing #1139 #1237

Merged

tmylk reviewed Apr 3, 2017


gojomo added the breaks backward-compatibility Change breaks backward compatibility label Apr 3, 2017

tmylk closed this Apr 10, 2017
