
Better handling of documents where number of words is less than n in n-grams #45

Closed · 2 tasks done
lmullen opened this issue Oct 16, 2015 · 0 comments
lmullen commented Oct 16, 2015

From rOpenSci onboarding:

When loading a large corpus of documents, I got the error `Error: n not less than length(words)`, which was ultimately traced to `assert_that(n < length(words))` in `tokenize_ngrams()`. This happened because some of my documents were very short, and one consisted entirely of whitespace characters. I wonder if it would make sense to check for empty documents.
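A minimal sketch of the failure mode described above, assuming the package's behavior at the time of this issue (the document value is hypothetical; the error text is quoted from the report):

```r
library(tokenizers)

# A document consisting only of whitespace tokenizes to zero words,
# so the internal check assert_that(n < length(words)) fails:
tokenize_ngrams("   ", n = 3)
#> Error: n not less than length(words)
```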

Several things to do here:

  • Error message should be more descriptive.
  • Show in the vignette how to filter out short documents (see the sketch after this list).
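A rough sketch of what such a vignette example might look like: count the words in each document with `tokenize_words()` and drop any document whose word count does not exceed `n`, matching the assertion quoted above. The corpus values here are hypothetical illustrations, not from the package.

```r
library(tokenizers)

corpus <- c(
  long  = "The quick brown fox jumps over the lazy dog",
  short = "too short",
  blank = "     "  # all whitespace: tokenizes to zero words
)

n <- 3

# Count the words in each document.
word_counts <- vapply(tokenize_words(corpus), length, integer(1))

# Keep only documents long enough to satisfy the package's
# check that n < length(words).
filtered <- corpus[word_counts > n]

tokenize_ngrams(filtered, n = n)
```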
lmullen added a commit that referenced this issue Oct 17, 2015
lmullen closed this as completed Oct 17, 2015