[WIP][DNM] error-resistant train(). Fix #1052 #1139
Conversation
@gojomo Hello, while fixing the notebooks/docs to reflect the change to "train", I got an error in notebooks/doc2vec-IMDB.ipynb:
I use Ubuntu 14.04, Python 2.7.6 and gensim version 1.0.1. The error was caused by 'encoding' being an invalid keyword argument for the open function in Python 2.7. Does the code need to be changed to support Python 2? |
The notebooks should support both Python 2 and 3, same as our source code. Unfortunately we have been lax about it before - fixing it as a part of this release would be a great contribution. |
Yes, gensim already uses six |
@prakhar2b would you mind guiding me through this? I think this is a good opportunity for me to get familiar with the contribution process. My initial thought was the codecs library. |
@robotcator Definitely, that would be a pleasure. I suggest you post your doubts on gensim google group, the community is very active in responding. You can also mail me directly if you want - [email protected] |
Ping @robotcator for a status update |
@tmylk sorry for the late status update. I have fixed the compatibility between Python 2 and Python 3 by using the codecs library. But the code seems to run inefficiently in Python 2, and I am stuck finding a way to optimize the Python 2 code. |
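For reference, a minimal sketch of the Python 2/3-compatible file reading discussed above: `io.open()` (like `codecs.open()`) accepts an `encoding` keyword on both Python versions, unlike the Python 2 built-in `open()`. The file name is a placeholder, not the exact path from the notebook.

```python
import io

# Placeholder file name; the IMDB notebook reads its own preprocessed data file.
# io.open() accepts `encoding` on Python 2.7 as well as Python 3,
# whereas the Python 2 built-in open() raises TypeError for it.
with io.open('alldata-id.txt', 'r', encoding='utf-8') as f:
    for line_no, line in enumerate(f):
        tokens = line.split()
```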
Could you please update the pr with the new code? |
@robotcator Could you please add tests to make sure that ValueError is indeed thrown when params are not supplied. |
@tmylk Of course. |
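A sketch of the kind of test being requested (the corpus and model settings are placeholders, not the exact tests that landed in the PR):

```python
import unittest
from gensim.models import word2vec

# Tiny placeholder corpus; the real gensim tests use their own fixture data.
sentences = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system']]

class TestExplicitTrainParams(unittest.TestCase):
    def test_train_without_params_raises(self):
        model = word2vec.Word2Vec(min_count=1)
        model.build_vocab(sentences)
        # With the new behaviour, train() must be given the corpus size
        # (total_examples or total_words) and epochs, else it raises ValueError.
        with self.assertRaises(ValueError):
            model.train(sentences)

if __name__ == '__main__':
    unittest.main()
```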
…Fix #1052. (#1237)
* fix the compatibility between python2 & 3
* require explicit corpus size, epochs for train()
* make all train() calls use explicit count, epochs
* add tests to make sure that ValueError is indeed thrown
* update test
* fix the word2vec's reset_from()
* require explicit corpus size, epochs for train()
* make all train() calls use explicit count, epochs
* fix some error
* fix test error
@gojomo Could you please write a note for users on how to upgrade their code for this change? It will be the breaking change that warrants a major release. |
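A sketch of what such an upgrade note might show users (the corpus and variable names are placeholders; the note itself is not quoted in this thread):

```python
from gensim.models import Word2Vec

sentences = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system']]

model = Word2Vec(min_count=1)
model.build_vocab(sentences)

# Old style (now rejected): relies on train() guessing corpus size and epochs.
# model.train(sentences)

# New style: pass the corpus size and epoch count explicitly.
model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
```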
```diff
@@ -448,7 +448,8 @@ def __init__(
         if isinstance(sentences, GeneratorType):
             raise TypeError("You can't pass a generator as the sentences argument. Try an iterator.")
         self.build_vocab(sentences, trim_rule=trim_rule)
-        self.train(sentences)
+        self.train(sentences, total_examples=self.corpus_count, epochs=self.iter,
```
Hanging indent please
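For context, the requested hanging indent would look roughly like this (the continuation shown is illustrative, since the second line of the new call is truncated in the diff above):

```python
# Hanging indent: nothing after the opening parenthesis on the first line,
# continuation lines indented one level.
self.train(
    sentences, total_examples=self.corpus_count,
    epochs=self.iter)
```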
The note could be:
I'm in favor of generous version-increments, but if considering a "2.0.0" release to signal a break in backward compatibility, so soon after the "1.0.0" release, we may want to look at any other proposed/pending/desirable changes in names/defaults/parameters, to include as well. Some things that come to mind:
|
We could use a new label to mark issues/PRs where breaks with backward-compatibility are happening (or proposed), requiring sync w/ version numbering. I've created one for that purpose, and applied it here to get things started. |
Thanks for suggesting more ways to make the API and code more streamlined. Unfortunately SemVer adoption means that (contrary to topic modelling) "Version numbers are not for humans". Small changes in releases are better than big releases, so if we get to version 10.1.0 by the end of the year then it is ok. |
The idea of batching together sets of breaking changes, to help minimize user upgrade efforts, seems orthogonal to SemVer adoption to me. But if you'd like to move to a more rapid tempo, sure. That style of numbering/rapid-incremental-release might also justify eventually splitting gensim into narrow single-algorithm subprojects (with some shared core dependencies). Then small changes, not affecting most users, would only trigger a release/version-increment/need-to-review-release-notes among the users of that one module. |
Merged in #1237 |
@gojomo any suggestions where those "subproject split lines" ought to be? |
@piskvorky In the very-small, very-focused packaging style of (say) node/npm, almost every module would be a different package. There'd be one or a few core or util packages, but then nearly every 'model' would be its own package. In the extreme, even Word2Vec and Doc2Vec would be different packages (with Doc2Vec depending on Word2Vec). Still mulling over whether I'd advocate this – it's less common in the Python world than the JS world, and may result in a bewildering variety of projects each with their own version-number – but it does mesh with rigorous SemVer and rapid, high-resolution release cycles. |
Changes `train()` to require explicit `epochs` argument, and explicit size of corpus (either `total_examples` or `total_words`) – thus forcing users to be conscious of `train()`'s behavior in these regards, and (intentionally) breaking older code that might be operating under misconceptions. Tests updated to be as explicit as needed.
To do:
`train()`

Also a potential related change: add an optional `epoch_callback` argument to `train()`, that could be called after each training epoch. That could completely obviate the need for users/tutorials running `train()` in their own loop to do extra steps (such as evaluation) before training is done.
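A sketch of how the proposed `epoch_callback` might be used if it were added; both the keyword and the callback signature are hypothetical at this point, so this is not a working gensim API:

```python
from gensim.models import Word2Vec

sentences = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system']]
model = Word2Vec(min_count=1)
model.build_vocab(sentences)

def after_epoch(model, epoch):
    # Hypothetical hook: evaluate or checkpoint here instead of wrapping
    # train() in a manual per-epoch loop.
    print("finished epoch %d" % epoch)

# Hypothetical keyword argument: `epoch_callback` is only proposed in this
# thread and does not exist in gensim's train() signature.
model.train(sentences, total_examples=model.corpus_count, epochs=model.iter,
            epoch_callback=after_epoch)
```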