Character n-grams #40
Partially addressed in #45.
I could look into implementing n-gram and skip-gram iterators, similar to the util functions in NLTK (http://www.nltk.org/_modules/nltk/util.html#ngrams), for both characters and words (#2).
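For reference, a minimal Python sketch of what such iterators could look like, loosely modelled on the NLTK helpers and without any padding options. The names `ngrams` and `skipgrams` and their signatures are assumptions for illustration, not this project's API. Both accept any sequence, so a list of word tokens and a plain string (characters) work the same way.

```python
from itertools import combinations
from typing import Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def ngrams(seq: Sequence[T], n: int) -> Iterator[Tuple[T, ...]]:
    """Yield every contiguous run of n items from seq (no padding)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    for i in range(len(seq) - n + 1):
        yield tuple(seq[i:i + n])


def skipgrams(seq: Sequence[T], n: int, k: int) -> Iterator[Tuple[T, ...]]:
    """Yield n-grams whose items may skip over up to k items in total."""
    if n < 1 or k < 0:
        raise ValueError("n must be >= 1 and k must be >= 0")
    for i in range(len(seq) - n + 1):
        head = seq[i]
        # Candidates for the remaining n - 1 slots: the next n - 1 + k items.
        # Slicing truncates at the end of the sequence, so no padding is needed.
        window = seq[i + 1:i + n + k]
        for rest in combinations(window, n - 1):
            yield (head,) + tuple(rest)


tokens = ["a", "b", "c", "d"]
print(list(ngrams(tokens, 2)))
# [('a', 'b'), ('b', 'c'), ('c', 'd')]
print(list(skipgrams(tokens, 2, 1)))
# [('a', 'b'), ('a', 'c'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
```

With `k=0` the skip-gram iterator yields the same items as the plain n-gram iterator.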
Thanks @joshlk, that would be very useful! Maybe without the rightpad/leftpad options for a start? It would also be interesting to have something that works with an ngram_range parameter, as in scikit-learn's CountVectorizer, though it is not clear how that parameter would extend to skip-grams. There is also the question of how to chain tokenization and n-gram iterators (#21).
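To make the ngram_range idea concrete, here is a rough Python sketch, assuming the same convention as scikit-learn's CountVectorizer, where `ngram_range=(min_n, max_n)` has inclusive bounds. The function name `ngrams_in_range` is hypothetical, and `str.split` stands in for a real tokenizer only to show how the two steps could be chained.

```python
from typing import Iterator, Sequence, Tuple


def ngrams_in_range(
    tokens: Sequence[str], ngram_range: Tuple[int, int]
) -> Iterator[Tuple[str, ...]]:
    """Yield all n-grams with min_n <= n <= max_n (both bounds inclusive)."""
    min_n, max_n = ngram_range
    if not 1 <= min_n <= max_n:
        raise ValueError("ngram_range must satisfy 1 <= min_n <= max_n")
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


# Chaining: tokenize first, then feed the tokens to the n-gram iterator.
tokens = "the quick fox".split()
print(list(ngrams_in_range(tokens, (1, 2))))
# [('the',), ('quick',), ('fox',), ('the', 'quick'), ('quick', 'fox')]
```

This sketch materializes the tokens before producing n-grams; chaining two lazy iterators would need a sliding buffer instead of slicing.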
PR: #82. Please take a look when you get a chance.
Allowing documents to be tokenized into character n-grams would be useful.
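As an illustration of the request, a minimal Python sketch of character n-gram tokenization, where each token is a substring of length n. The name `char_ngrams` is an assumption for this example, and it makes no attempt to handle word boundaries or Unicode normalization.

```python
from typing import Iterator


def char_ngrams(text: str, n: int) -> Iterator[str]:
    """Yield every substring of length n, sliding one character at a time."""
    if n < 1:
        raise ValueError("n must be >= 1")
    for i in range(len(text) - n + 1):
        yield text[i:i + n]


print(list(char_ngrams("token", 3)))
# ['tok', 'oke', 'ken']
```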