-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] ENH: Raise warnings if vocab in single character elements in word2vec/doc2vec #705
Conversation
For each of these warnings, the goal should be to make note of the problem exactly once, at the earliest possible time. We don't want the overhead of redundant checks; the user doesn't need X million warnings for their X million text examples. So avoid doing checks for every document, during training, and instead do as soon as the problem parameter/input is seen for the 1st time. The alpha is managed by the Most of these issues apply to Word2Vec as well, and so in most cases the checks will be there (and inherited by Doc2Vec). You won't likely have to be concerned about either the |
@gojomo thanks for the help! I've updated the PR. Hope this is better. What should be a good threshold value for a "small" vocab? Also, |
@@ -509,9 +510,16 @@ def build_vocab(self, sentences, keep_raw_vocab=False, trim_rule=None, progress_ | |||
Each sentence must be a list of unicode strings. | |||
|
|||
""" | |||
for sentence in sentences: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This adds the overhead of a full pass over the corpus in the case where the user has already done things right; that's excessive. It'd be better to add the check to a pass that's already happening (eg scan_vocab()
). (And, it may even be a desirable optimization to just check the first example... this mistake is likely to be all-or-nothing, and we don't want to burden a user doing the right thing, with a tens-of-millions-of-examples corpus, with a check of every example.)
Also, this code appears to only work for the Doc2Vec case, when 'sentence' is something TaggedDocument-shaped with a 'words' field. In Word2Vec, there's no such field, the example is just the list-of-tokens. Are you testing in a Word2Vec triggering scenario?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh alright I didn't know about the all-or-nothing scenario. I've added checks in the scan_vocab
methods of Word2Vec
and Doc2Vec
wherein I'm only checking if the sentence
or document.words
is an instance of string_types
for sentence (or document) 0. So I don't think we need the check for all single characters in vocab
since, as you said, this check already covers it. Is this approach fine?
I don't have any confident idea of a good "your vocabulary is suspiciously small" threshold, so we should probably just skip that idea unless/until there's more evidence it's needed. (It was just one of a bunch of 'and/or' ideas for catching common errors.) |
@gojomo I've made the changes to the code. Could you please check? Also regarding raising the warning for a mismatch in the expected count for train() and actual, what is exactly count over here and what should it be compared against to raise the warning? |
4ec40d6
to
2eedf77
Compare
@@ -636,6 +636,12 @@ def scan_vocab(self, documents, progress_per=10000, trim_rule=None): | |||
interval_start = default_timer() - 0.00001 # guard against next sample being identical | |||
interval_count = 0 | |||
vocab = defaultdict(int) | |||
try: | |||
if isinstance(list(documents)[0].words, string_types): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many users won't be passing an indexable/all-in-memory corpus – so this will break for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for bothering you so much :(
What would be a better way to check then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the existing iteration, check the current sentence. But, once the first is checked, set a flag to prevent further checks – based on our assumption that the user either gets this "all right" or "all wrong" and that (when wrong) they don't want redundant warnings for every sentence.
vocab = defaultdict(int) | ||
for document_no, document in enumerate(documents): | ||
if not checked_string_types: | ||
if isinstance(document.words, string_types): | ||
warnings.warn("'words' should be a list of unicode strings. True string type provided.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of "True string type", why not output the actual type of document.words
?
Plus maybe even output the entire document
example, for explicitness and clarity.
Changed title to reflect that this can be an issue in Word2Vec as well (and should be checked there too). I'm in favor of |
|
Thanks for the suggestion @tmylk however I'm getting an |
@tmylk the build still seems to be failing..... |
840bcc3
to
b4195f4
Compare
@piskvorky @gojomo @tmylk I think travis is happy. I think the test failing is unrelated to my PR. Could you please review once? |
for sentence_no, sentence in enumerate(sentences): | ||
if not checked_string_types: | ||
if isinstance(sentence, string_types): | ||
logger.warn("'sentences' should be a list of list of unicode strings. %s provided.", type(sentence)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be pedantically accurate, sentences
need not be a list – just an iterable. But the important thing being checked here is whether the first item is the expected list-of-strings. So I would word this warning as: "Each sentences
item should be a list of words (usually unicode strings). First item here is instead plain %s." The warning in doc2vec could be similarly: "Each words
should be a list of words (usually unicode strings). First 'words' here is instead plain %s".) (I find that clearer than just "%s provided" which may at first glance seem like no contradiction with preceding advice.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed! Made the changes.
Other than my slight suggestion regarding warning wording, this looks good-to-go. The test failure is in gensim.test.test_glove2word2vec.TestGlove2Word2Vec and thus seems unrelated to these changes. @tmylk I suggest that test should be disabled here and in 'develop' until it can be independently made reliable. |
b4195f4
to
5bf3329
Compare
I've made the changes in the log message, rebased and squashed. |
@tmylk can this be merged? |
@dsquareindia Thanks a lot for this very needed PR! |
No problem! Thanks a ton for the help and reviews! |
Addresses #692. TODO:
load()
on an instance rather than the class.train()
and actual.@tmylk @gojomo am I going on the right track with this? Also how should the warnings be dealt with in
doc2vec_inner.c
?