Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warning if vocabulary is entirely single character elements in doc2vec #692

Closed
tmylk opened this issue May 5, 2016 · 16 comments
Closed
Labels
difficulty easy Easy issue: required small fix wishlist Feature request

Comments

@tmylk
Copy link
Contributor

tmylk commented May 5, 2016

Many users on the mailing list pass strings to doc2vec instead of lists.
It results in poor results as vocab is single chars.

A warning will help detect that.

@tmylk tmylk added wishlist Feature request difficulty easy Easy issue: required small fix labels May 5, 2016
@devashishd12
Copy link
Contributor

I'll take this up 👍

@gojomo
Copy link
Collaborator

gojomo commented May 5, 2016

Yes! The test could be words is a true string type, and/or that the vocab keys are all single-characters, and/or that the vocab is "very small" (threshold TBD).

Other sanity checks based on frequent issues are also possible. A few ideas:

  • model very large compared to single-pass training-data size (risks overfitting)
  • alpha raised again after descending to min_alpha (may indicate confusion about training cycles)
  • calling load() on an instance rather than the class (often indicates not saving reference to return value)
  • mismatch in expected count for train() and actual (may indicate skipped steps when doing build_vocab() and train() separately)

@bloody76
Copy link

What about just checking the type of the argument passed ? I can imagine at the beginning of train_document_* functions something like:

if isinstance(doc_words, str) or isinstance(doc_words, unicode):
    logger.warn("You are passing strings instead of list of ids for the training of your doc2vec model. Expect to have a vocabulary composed of one-char words.")

@devashishd12
Copy link
Contributor

devashishd12 commented May 20, 2016

Just had a small doubt here: Should I be making an __init__(self, *args, **kwargs) for TaggedDocument to raise those warnings? Right I'll go ahead with what @bloody76 suggested.

@tmylk
Copy link
Contributor Author

tmylk commented Sep 18, 2016

Left to implement:

  • Raise warning if calling load() on an instance rather than the class.
  • Raise warning if mismatch in expected count for train() and actual.

@aayux
Copy link
Contributor

aayux commented Sep 25, 2016

calling load() on an instance rather than the class

Working on fixing this but I'm not fully certain of how one might go about "calling load() on instance".
Is there a previously known issue/error-report that I could refer to?

@tmylk
Copy link
Contributor Author

tmylk commented Sep 25, 2016

@Dust0x Class methods and instance methods are different in python, so you can raise warning on one of them.

@aayux
Copy link
Contributor

aayux commented Sep 26, 2016

@tmylk yep, I see that. Could something go wrong when calling load() on instance?

What kind of warning should one be looking for here? A call to load by instance seems to be working fine for me. Am I missing something here?

@tmylk
Copy link
Contributor Author

tmylk commented Sep 26, 2016

Do you see the difference in behaviour of the two methods?

@gojomo
Copy link
Collaborator

gojomo commented Sep 27, 2016

Even though calling-on-an-instance might superficially succeed, often the user doing so doesn't realize they'll be getting a newly-loaded instance back as the return value. (Rather, their mental model is that a load is happening into the supplied instance.) To make this clear, and emphasize the intended/correct way to load, it'd be best if load-on-instance throws an error with helpful message that must be corrected.

@tmylk
Copy link
Contributor Author

tmylk commented Sep 27, 2016

Could it be easier then to not inherit from utils.SaveLoad?

@gojomo
Copy link
Collaborator

gojomo commented Sep 27, 2016

Sharing save/load utilities via callouts/composition, rather than inheritance, might be a better approach. But, that'd be a pretty big change at this point (and contrast with practices elsewhere in gensim). So, it could require a lot more thought/refactoring and perhaps synchronization with a major-revision-number increment.

@nishnik
Copy link

nishnik commented Dec 13, 2016

For Raise warning if mismatch in expected count for train() and actual., correct me if I am wrong.
In word2vec.py

if total_words is None and total_examples is None:
            if self.corpus_count:
                total_examples = self.corpus_count
                logger.info("expecting %i sentences, matching count from corpus used for vocabulary survey", total_examples)
            else:
                raise ValueError("you must provide either total_words or total_examples, to enable alpha and progress calculations")

We have to add an else and do the checks.
Or please guide me through.
Thanks.

@nishnik
Copy link

nishnik commented Dec 15, 2016

ping @gojomo @tmylk

@gojomo
Copy link
Collaborator

gojomo commented Dec 17, 2016

I believe both the check-mark to-dos added by @tmylk when re-opening this issue have or are already being addressed elsewhere:

warn if load()-on-instance: #889

warn if mismatch in provided/expected example counts: aa8a9cd

Since the issue-in-the-title (warn-if-single-characters) is fixed, I'd consider this issue closed, and the other ideas spun out to elsewhere.

@tmylk
Copy link
Contributor Author

tmylk commented Jan 29, 2017

Closing as resolved.

@tmylk tmylk closed this as completed Feb 8, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
difficulty easy Easy issue: required small fix wishlist Feature request
Projects
None yet
Development

No branches or pull requests

6 participants