-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding type check for corpus_file argument #2469
Conversation
gensim/models/doc2vec.py
Outdated
|
||
# Check the type of corpus_file | ||
if not isinstance(corpus_file, string_types): | ||
raise TypeError("Parameter corpus_file of train() must be a string (path to a file).") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also show what the received parameter is instead?
Having concrete error messages helps avoid confusion.
gensim/models/doc2vec.py
Outdated
@@ -794,6 +794,11 @@ def train(self, documents=None, corpus_file=None, total_examples=None, total_wor | |||
|
|||
""" | |||
kwargs = {} | |||
|
|||
# Check the type of corpus_file | |||
if not isinstance(corpus_file, string_types): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
corpus_file
may legitimately be None, when documents
is not None. This is the reason why some of the unit tests fail. Please have a look at Travis CI.
So, if you go ahead with your proposed check, you need something like:
if not isinstance(corpus_file, string_types): | |
if corpus_file is not None and not isinstance(corpus_file, string_types): |
Also, please add a unit test that stresses your new functionality (pass in a non-string corpus_file
, expect a TypeError
raised).
Next, what is the benefit of raising TypeError here? What happens with the existing code if we do not raise TypeError? What kind of exception does the user see?
I think if we go ahead with this kind of parameter checking, we should do it properly:
- Ensure that one of
documents
orcorpus_file
is None (they cannot both be non-None) - Ensure that one of
documents
orcorpus_file
is not None (they cannot both be None) - If
documents
is not None, then it must be an iterable - If
corpus_file
is not None, then it must be a string
Finally, if we go ahead with this, we should apply this consistently everywhere, not just doc2vec (e.g. fasttext has similar issues).
My question is: is it worth it? @piskvorky
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed on the tests. I edited the PR description for context (link to the original issue).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@piskvorky @mpenkov Thanks for the detailed feedback. Working on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I'd say: while checking whether it's a string may help a bit, if there's checking, it'd make sense to ensure things like illegal or missing paths also generate meaningful error messages. It might be possible to just leverage some existing path-checking/file-existence-checking method, or wrap the failing method(s) in a handler that catches errors and shows a good-enough error message like "Problem with corpus_file
value X".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gojomo so should I check corpus_file
for being a valid file path instead of checking for string_type
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure, what covers the most cases with the most straightforward code? A path-test, with the right catches/error-message, might "catch N birds with one stone", pointing the user in the right direction to understand their parameter error with fewer conditionals/lines-of-tests. Or it might complicate things compared to the simple proper-type test. But the motivating case for this change - that users following slightly-older examples would get a confusing error message, due to a new corpus_file
parameter` – will actually already have been handled by the simple "one or the other but not both" test. So, even the string test isn't strictly required to address the motivating case. My pref would be: do the simplest thing that resolves the motivating case, for sure. If there's a easy/clear bit of extra checking that makes also makes sense, consider it as well.
# Check if both documents and corpus_file are not None | ||
if corpus_file is not None and documents is not None: | ||
raise TypeError("Instead provide value to either of corpus_file or documents parameter but not both.") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be tempted to combine these two cases with an XOR: if (corpus_file is None) ^ (documents is None):
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did it this way for readability. Also, I noticed True ^ True
is False
. If I do it the xor way, this case will get skipped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it'd actually need to be if not ((corpus_file is None) ^ (documents is None)):
- but then essentially one error message, roughly "supply one or the other but not both", would be fine for either mistake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strong -1 on this: hard to read and reason about. Please stick to simpler (and easier to maintain) constructs .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The clearest approach would be not to have a method with such delicate XOR one-but-not-the-other positional parameters - use separate methods with distinct names! But failing that, a one-line XOR, whose next line is an error message with a plain-english description of the constraint, seems pretty simple to me - just accurately reflecting the design choice already made.
gensim/models/doc2vec.py
Outdated
|
||
# Check if documents is not None and iterable but not string type | ||
if documents is not None and isinstance(documents, Iterable) and isinstance(documents, string_types): | ||
raise TypeError("Documents must be an iterable of list and not a string type.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the test of is-Iterable
and is-string_types
is likely redundant, as all strings are iterable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's also important to not go overboard with the checking. For example, you're making sure it's not a string here. Why not take that further and make sure it's not an int? A float? A file buffer? And so on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gojomo I was not sure how to go about with handling this. documents
should be an iterable
but not a string since like you said strings are iterable. I think I got it now, so instead of checking for both, checking only for string
type would do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Supplying a (path) string as documents
might sufficiently reveal itself to be an error in other ways - like training that completes almost instantly, with little to no vocabulary. Separately, I'm not sure that all objects that would work because they are effectively "iterable" will actually test positive as being isinstance()
Iterable
.
gensim/models/doc2vec.py
Outdated
# Check the type of corpus_file | ||
if corpus_file is not None and not isinstance(corpus_file, string_types): | ||
raise TypeError("""Parameter corpus_file of train() must be a | ||
string (path to an existing file) got %s instead.""" % corpus_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since None
will already fail the is-string_types
test, the check is not None
is redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of a multiline string is poor here. Your exception message contains two lines, and a lot of unnecessary spaces.
I think you also don't need to mention the function name: it'll be obvious from the stack trace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mpenkov noob question: should I resolve the conversations or the person who reviews the PR does it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense for the original reviewer to do it, once they're happy their comments have been addressed.
@mpenkov @gojomo Kindly check my latest commit. I've made the changes as suggested. For now, we test four things before training would begin (as suggested):
Explanation point 4: I check only for being iterable to handle cases when someone would pass integer, float, bool value to I think these four cases cover most plausible situations which users might encounter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice that you've made changes to doc2vec only. What about the other modules that suffer from the exact same problem, e.g. fasttext?
gensim/models/doc2vec.py
Outdated
raise TypeError("Instead provide value to either of corpus_file or documents parameter but not both.") | ||
|
||
# Check if corpus_file is string type | ||
if documents is None and not isinstance(corpus_file, string_types): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If corpus_file
must be a valid path to a file, then why not explicitly check for that, instead of checking the type?
if documents is None and not isinstance(corpus_file, string_types): | |
if documents is None and not os.path.isfile(corpus_file): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't that throw an exception for incorrect (non-string) corpus_file
? It's an interesting idea for a try-catch block though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could wrap it in a _is_valid_corpus(corpus_file) function for malformed input.
Given that this input validation will be happening in mulitple places, it's probably not a bad idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@piskvorky @mpenkov I've kept it simple. I used os.path.isfile
as suggested. It returns False
at all type of invalid inputs including non-string / float or whatever trash value a user would pass.
Also, to do this for fasttext, I think it will be better:
- To create a separate PR
- Wait until the this PR is finalised so that I can follow the same code approach there are well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm -1 on creating separate PRs for this, for several reasons:
- The changes are tiny. They span several lines of code.
- The changes will be identical for each model (e.g. doc2vec, fasttext, etc)
- PR overhead (review, merge, release, document) is not worth the effort.
- It's salami-slicing.
For your second point, sure, I think it's a good idea to wait until people come to a consensus before expanding your approach to the other models.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mpenkov Could you please review the latest changes and let me know is we are good to go or something else needs to be done ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Left you some nitpicky comments.
In the future, please consider ticking "allow commits from maintainers" when creating a PR on github. That allows the reviewers of your PR to push changes to your branch, potentially shortcutting some of the back-and-forth for minor issues like typos, formatting, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mpenkov thanks for the suggestion. I believe we are solid now on this PR. I will now make these changes in fasttext module as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please go ahead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mpenkov I've done the changes in fasttext module as well. Please have a look.
gensim/models/doc2vec.py
Outdated
|
||
# Check if both documents and corpus_file are not None | ||
if corpus_file is not None and documents is not None: | ||
raise TypeError("Instead provide value to either of corpus_file or documents parameter but not both.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency with the above.
raise TypeError("Instead provide value to either of corpus_file or documents parameter but not both.") | |
raise TypeError("Both corpus_file and documents may not be provided at the same time") |
gensim/models/doc2vec.py
Outdated
|
||
# Check if both documents and corpus_file are None | ||
if corpus_file is None and documents is None: | ||
raise TypeError("Either one of corpus_file or documents value must be provided.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to include a period. Please apply this to other messages as well.
raise TypeError("Either one of corpus_file or documents value must be provided.") | |
raise TypeError("Either one of corpus_file or documents value must be provided") |
I tried committing these nitpicks to your branch myself, but I don't have permissions.
gensim/models/doc2vec.py
Outdated
|
||
# Check if corpus_file is string type | ||
if documents is None and not os.path.isfile(corpus_file): | ||
raise TypeError("Parameter corpus_file must be a valid path to a file, got %s instead." % corpus_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise TypeError("Parameter corpus_file must be a valid path to a file, got %s instead." % corpus_file) | |
raise TypeError("Parameter corpus_file must be a valid path to a file, got %r instead." % corpus_file) |
Better than %s, because corpus_file may contain spaces.
gensim/models/fasttext.py
Outdated
@@ -901,6 +903,23 @@ def train(self, sentences=None, corpus_file=None, total_examples=None, total_wor | |||
>>> model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs) | |||
|
|||
""" | |||
|
|||
# Check if both sentences and corpus_file are None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove these comments. They add no value to the source code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
It looks like all comments have been addressed, so I'm merging this in. @saraswatmks Congrats on your first contribution to gensim 🥇 Thank you! |
Add a type check for
corpus_file
argument and raise type error if the type ofcorpus_file
is not a string. Fixes #2460.