
Better handling of documents where number of words is less than n in n-grams #45

Closed · 2 tasks done
lmullen opened this issue Oct 16, 2015 · 0 comments
lmullen commented Oct 16, 2015

From rOpenSci onboarding:

When loading a large corpus of documents, I got the error `Error: n not less than length(words)`, which was ultimately traced to `assert_that(n < length(words))` in `tokenize_ngrams()`. This happened because some of my documents were very short, and one consisted entirely of whitespace characters. I wonder if it would make sense to check for empty documents.
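A minimal sketch of the failure mode described above, assuming the package's behavior at the time of this issue (the document value is hypothetical; the error text is quoted from the report):

```r
library(tokenizers)

# A document consisting only of whitespace tokenizes to zero words,
# so the internal check assert_that(n < length(words)) fails:
tokenize_ngrams("   ", n = 3)
#> Error: n not less than length(words)
```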

Several things to do here:

  • Error message should be more descriptive.
  • Show in the vignette how to filter out short documents (see the sketch after this list).
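A rough sketch of what such a vignette example might look like: count the words in each document with `tokenize_words()` and drop any document whose word count does not exceed `n`, matching the assertion quoted above. The corpus values here are hypothetical illustrations, not from the package.

```r
library(tokenizers)

corpus <- c(
  long  = "The quick brown fox jumps over the lazy dog",
  short = "too short",
  blank = "     "  # all whitespace: tokenizes to zero words
)

n <- 3

# Count the words in each document.
word_counts <- vapply(tokenize_words(corpus), length, integer(1))

# Keep only documents long enough to satisfy the package's
# check that n < length(words).
filtered <- corpus[word_counts > n]

tokenize_ngrams(filtered, n = n)
```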
lmullen added a commit that referenced this issue Oct 17, 2015
lmullen closed this as completed Oct 17, 2015