[MRG] ENH: Raise warnings if vocab in single character elements in word2vec/doc2vec #705

devashishd12 · 2016-05-24T18:00:40Z

Addresses #692. TODO:

Raise warning if alpha is raised again after descending to minimum value. (Where should this be ideally done @gojomo?)
Raise warning if calling load() on an instance rather than the class.
Raise warning if mismatch in expected count for train() and actual.
Add test suite to test_doc2vec.py and test_word2vec.py

@tmylk @gojomo am I going on the right track with this? Also how should the warnings be dealt with in doc2vec_inner.c?

gojomo · 2016-05-24T22:20:13Z

For each of these warnings, the goal should be to make note of the problem exactly once, at the earliest possible time. We don't want the overhead of redundant checks; the user doesn't need X million warnings for their X million text examples. So avoid doing checks for every document during training; instead, check as soon as the problem parameter/input is seen for the first time.

The alpha is managed by the train() method, so the check would have to go there.

Most of these issues apply to Word2Vec as well, and so in most cases the checks will be there (and inherited by Doc2Vec).

You won't likely have to be concerned about either the .c files (because they're auto-generated from the cython .pyx files), or the .pyx files (as those are optimized implementations for which any problems should have been discovered earlier, in pure-python code).
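As a rough illustration of the alpha check discussed above, here is a minimal sketch of warning once when a later training call starts with a higher learning rate than the lowest one already reached. The `AlphaMonitor` class, its attribute names, and the message text are all hypothetical and are not taken from gensim; the sketch only assumes that each train() call knows its starting and ending alpha.

```python
import logging

logger = logging.getLogger("alpha_sketch")


class AlphaMonitor:
    """Hypothetical helper: warn once if the learning rate rises again."""

    def __init__(self):
        self.lowest_alpha_seen = float("inf")  # lowest alpha reached so far
        self.warned = False

    def start_training(self, start_alpha, end_alpha):
        """Call once per train() invocation, never once per document.

        Returns True if a warning was emitted by this call.
        """
        warned_now = False
        if start_alpha > self.lowest_alpha_seen and not self.warned:
            logger.warning(
                "Starting alpha %s is higher than the minimum already reached (%s).",
                start_alpha, self.lowest_alpha_seen,
            )
            self.warned = True
            warned_now = True
        # Remember the lowest value this training run will decay to.
        self.lowest_alpha_seen = min(self.lowest_alpha_seen, end_alpha)
        return warned_now
```

The check runs once per training call, so its cost does not depend on corpus size, and the `warned` flag keeps repeated calls from producing repeated warnings.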

devashishd12 · 2016-05-25T20:16:01Z

@gojomo thanks for the help! I've updated the PR. Hope this is better. What should be a good threshold value for a "small" vocab? Also, alpha monotonically decreases right? I'll change the code if that's not the case.

gojomo · 2016-05-25T22:03:21Z

gensim/models/word2vec.py

@@ -509,9 +510,16 @@ def build_vocab(self, sentences, keep_raw_vocab=False, trim_rule=None, progress_
        Each sentence must be a list of unicode strings.

        """
+        for sentence in sentences:


This adds the overhead of a full pass over the corpus in the case where the user has already done things right; that's excessive. It'd be better to add the check to a pass that's already happening (eg scan_vocab()). (And, it may even be a desirable optimization to just check the first example... this mistake is likely to be all-or-nothing, and we don't want to burden a user doing the right thing, with a tens-of-millions-of-examples corpus, with a check of every example.)

Also, this code appears to only work for the Doc2Vec case, when 'sentence' is something TaggedDocument-shaped with a 'words' field. In Word2Vec, there's no such field, the example is just the list-of-tokens. Are you testing in a Word2Vec triggering scenario?

devashishd12 · 2016-05-26

Oh, all right, I didn't know about the all-or-nothing scenario. I've added checks in the scan_vocab methods of Word2Vec and Doc2Vec that only test whether the sentence (or document.words) is an instance of string_types for the first sentence (or document). So I don't think we need the separate check for an all-single-character vocab since, as you said, this check already covers it. Is this approach fine?

gojomo · 2016-05-25T22:21:45Z

I don't have any confident idea of a good "your vocabulary is suspiciously small" threshold, so we should probably just skip that idea unless/until there's more evidence it's needed. (It was just one of a bunch of 'and/or' ideas for catching common errors.)

devashishd12 · 2016-05-27T12:53:06Z

@gojomo I've made the changes to the code. Could you please check? Also, regarding the warning for a mismatch between the expected count for train() and the actual count: what exactly is the count here, and what should it be compared against to raise the warning?

gojomo · 2016-05-27T16:58:37Z

gensim/models/doc2vec.py

@@ -636,6 +636,12 @@ def scan_vocab(self, documents, progress_per=10000, trim_rule=None):
        interval_start = default_timer() - 0.00001  # guard against next sample being identical
        interval_count = 0
        vocab = defaultdict(int)
+        try:
+            if isinstance(list(documents)[0].words, string_types):


Many users won't be passing an indexable/all-in-memory corpus – so this will break for them.
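One way this can go wrong, sketched with a one-shot generator standing in for a streamed corpus (the `stream_documents` function and its contents are made up for illustration): calling `list(...)` pulls every document into memory, and a generator is empty afterwards, so the real vocabulary scan would see nothing.

```python
def stream_documents():
    """Stand-in for a corpus streamed from disk, one tokenized document at a time."""
    for text in ("first streamed document", "second streamed document"):
        yield text.split()


corpus = stream_documents()
first = list(corpus)[0]    # materializes the entire corpus just to look at one item
remaining = list(corpus)   # the one-shot generator is now exhausted
```

A restartable iterable would survive the `list(...)` call, but it would still pay for a full extra pass and hold the whole corpus in memory at once.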

devashishd12 · 2016-05-27

Sorry for bothering you so much :(
What would be a better way to check then?

gojomo · 2016-05-28

On the existing iteration, check the current sentence. But, once the first is checked, set a flag to prevent further checks – based on our assumption that the user either gets this "all right" or "all wrong" and that (when wrong) they don't want redundant warnings for every sentence.
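A minimal sketch of that suggestion, assuming Python 3 (so plain `str` stands in for `string_types`) and a simplified counting loop. The function name `scan_vocab_sketch` and the message text are illustrative only and are not the code that was merged.

```python
import logging
from collections import defaultdict

logger = logging.getLogger("scan_vocab_sketch")


def scan_vocab_sketch(sentences):
    """Count tokens in a single pass, checking only the first item's type."""
    vocab = defaultdict(int)
    checked_string_types = False
    for sentence in sentences:
        if not checked_string_types:
            if isinstance(sentence, str):
                logger.warning(
                    "Each item should be a list of words; first item is instead plain %s.",
                    type(sentence),
                )
            checked_string_types = True  # never check again, right or wrong
        for word in sentence:
            vocab[word] += 1
    return dict(vocab)
```

Because the check rides along on the loop that already exists, it works for generators and other non-indexable corpora, and it costs one `isinstance` call in total rather than one per example.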

devashishd12 · 2016-05-31T06:15:50Z

@tmylk @gojomo I've added unit tests for the build_vocab() warning and the train() warning. Sorry for still using the warnings module. The warnings weren't displaying on the console window with logger.warn and also the testing with warnings.warn seemed more convenient. Is it fine?

piskvorky · 2016-05-31T07:50:15Z

gensim/models/doc2vec.py

        vocab = defaultdict(int)
        for document_no, document in enumerate(documents):
+            if not checked_string_types:
+                if isinstance(document.words, string_types):
+                    warnings.warn("'words' should be a list of unicode strings. True string type provided.")


Instead of "True string type", why not output the actual type of document.words?

Plus maybe even output the entire document example, for explicitness and clarity.

gojomo · 2016-05-31T15:23:25Z

Changed title to reflect that this can be an issue in Word2Vec as well (and should be checked there too).

I'm in favor of logger.warn() for consistency. That should typically result in visible console output, unless something non-typical has been done with levels/loggers.

tmylk · 2016-06-02T23:48:08Z

The testfixtures package allows testing of logging output, so we can use logger.warn instead of just warnings.
It is only a test-time dependency, so it is not an issue.
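For comparison, here is a sketch of testing a logged warning without any extra dependency. It swaps in the standard library's `unittest.TestCase.assertLogs` (available on Python 3.4 and later) in place of testfixtures, and the `check_first_item` function is a hypothetical stand-in for the check under test. It calls `logger.warning`, since `warn` is a deprecated alias on Python 3.

```python
import logging
import unittest

logger = logging.getLogger("warning_test_sketch")


def check_first_item(sentences):
    """Hypothetical stand-in for the check under test."""
    for sentence in sentences:
        if isinstance(sentence, str):
            logger.warning("First item is plain %s, not a list of words.", type(sentence))
        break  # only the first item is inspected


class TestFirstItemWarning(unittest.TestCase):
    def test_warns_on_plain_string(self):
        # assertLogs fails the test if no WARNING (or higher) record is emitted.
        with self.assertLogs(logger, level="WARNING") as captured:
            check_first_item(["a plain string", "another plain string"])
        self.assertEqual(len(captured.records), 1)
        self.assertIn("plain", captured.output[0])
```

This would not have helped on Python 2, where `assertLogs` does not exist, so it is an alternative for newer code rather than a replacement for the approach proposed here.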

devashishd12 · 2016-06-03T15:02:16Z

Thanks for the suggestion @tmylk. However, I'm getting an ImportError on import testfixtures while running nosetests; I don't get this error outside of nosetests. I've checked an answer here, but I don't think it will work in our case, since I can't remove the __init__.py, nor can I specify a particular target directory, since that wouldn't work for other users. Also, would this dependency be a problem for other users running nosetests?

devashishd12 · 2016-06-03T21:34:27Z

@tmylk the build still seems to be failing.....

devashishd12 · 2016-06-06T17:14:31Z

@piskvorky @gojomo @tmylk I think travis is happy. I think the test failing is unrelated to my PR. Could you please review once?

gojomo · 2016-06-06T20:08:40Z

gensim/models/word2vec.py

        for sentence_no, sentence in enumerate(sentences):
+            if not checked_string_types:
+                if isinstance(sentence, string_types):
+                    logger.warn("'sentences' should be a list of list of unicode strings. %s provided.", type(sentence))


To be pedantically accurate, sentences need not be a list – just an iterable. But the important thing being checked here is whether the first item is the expected list-of-strings. So I would word this warning as: "Each sentences item should be a list of words (usually unicode strings). First item here is instead plain %s." The warning in doc2vec could be similarly: "Each words should be a list of words (usually unicode strings). First 'words' here is instead plain %s". (I find that clearer than just "%s provided", which may at first glance seem like no contradiction with preceding advice.)

devashishd12 · 2016-06-06

Agreed! Made the changes.
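To make the two input shapes and the suggested wording concrete, here is a small sketch that builds the message for either a Word2Vec-style item (a list of tokens) or a Doc2Vec-style item (an object with a `words` field). The `TaggedDoc` namedtuple and the `first_item_warning` function are hypothetical stand-ins written for this illustration, and the sketch assumes Python 3's `str`.

```python
from collections import namedtuple

# Hypothetical stand-in for a Doc2Vec-style example with a 'words' field.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def first_item_warning(item):
    """Return warning text for a mis-shaped first item, or None if it looks fine."""
    if hasattr(item, "words"):
        # Doc2Vec-style example: the tokens live in item.words.
        if isinstance(item.words, str):
            return ("Each 'words' should be a list of words (usually unicode strings). "
                    "First 'words' here is instead plain %s." % type(item.words))
        return None
    # Word2Vec-style example: the item itself is the list of tokens.
    if isinstance(item, str):
        return ("Each 'sentences' item should be a list of words (usually unicode strings). "
                "First item here is instead plain %s." % type(item))
    return None
```

Including the offending type in the message tells the user what was actually received, which is the point raised earlier about the "True string type" wording.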

gojomo · 2016-06-06T20:09:44Z

Other than my slight suggestion regarding warning wording, this looks good-to-go.

The test failure is in gensim.test.test_glove2word2vec.TestGlove2Word2Vec and thus seems unrelated to these changes. @tmylk I suggest that test should be disabled here and in 'develop' until it can be independently made reliable.


devashishd12 · 2016-06-06T22:19:39Z

I've made the changes in the log message, rebased and squashed.

devashishd12 · 2016-06-07T16:04:06Z

@tmylk can this be merged?

tmylk · 2016-06-09T16:13:01Z

@dsquareindia Thanks a lot for this very needed PR!

devashishd12 · 2016-06-09T18:25:20Z

No problem! Thanks a ton for the help and reviews!

gojomo reviewed May 25, 2016

devashishd12 force-pushed the TaggedDocument_warning branch from 4ec40d6 to 2eedf77 Compare May 27, 2016 13:50

gojomo reviewed May 27, 2016

piskvorky reviewed May 31, 2016

gojomo changed the title ~~[WIP] ENH: Raise warnings if vocab in single character elements in doc2vec~~ [WIP] ENH: Raise warnings if vocab in single character elements in word2vec/doc2vec May 31, 2016

devashishd12 force-pushed the TaggedDocument_warning branch 3 times, most recently from 840bcc3 to b4195f4 Compare June 6, 2016 16:49

gojomo reviewed Jun 6, 2016

ENH: added check_input function to raise warnings in single character cases

5bf3329

devashishd12 force-pushed the TaggedDocument_warning branch from b4195f4 to 5bf3329 Compare June 6, 2016 22:18

devashishd12 mentioned this pull request Jun 7, 2016

Skip TestGlove2Word2Vec #736

Merged

devashishd12 changed the title ~~[WIP] ENH: Raise warnings if vocab in single character elements in word2vec/doc2vec~~ [MRG] ENH: Raise warnings if vocab in single character elements in word2vec/doc2vec Jun 7, 2016

tmylk merged commit 394ceab into piskvorky:develop Jun 9, 2016

devashishd12 deleted the TaggedDocument_warning branch June 9, 2016 18:25
