Character n-grams #40

rth · 2019-04-29T09:59:02Z

Allowing documents to be tokenized into character n-grams would be useful.

rth · 2019-05-01T19:17:57Z

Partially addressed in #45

joshlk · 2020-06-10T08:19:54Z

I could look into implementing an n-gram and skip-gram iterator, similar to the util functions in NLTK (http://www.nltk.org/_modules/nltk/util.html#ngrams), for both characters and words (#2)?

rth · 2020-06-10T09:10:32Z

Thanks @joshlk, that would be very useful! Maybe without the rightpad/leftpad options for a start? It would also be interesting to have something that works with an ngram_range parameter, as in scikit-learn's CountVectorizer:

The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
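To make the ngram_range semantics quoted above concrete, here is a minimal pure-Python sketch for character n-grams. The function name `char_ngrams` and its signature are hypothetical, chosen only for illustration; this is not the API of this project or of scikit-learn, and it deliberately leaves out padding.

```python
from typing import Iterator, Tuple


def char_ngrams(text: str, ngram_range: Tuple[int, int] = (1, 1)) -> Iterator[str]:
    """Yield character n-grams for every n with min_n <= n <= max_n.

    Hypothetical helper for illustration only; no left/right padding.
    """
    min_n, max_n = ngram_range
    if min_n < 1 or max_n < min_n:
        raise ValueError("ngram_range must satisfy 1 <= min_n <= max_n")
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the text; if n exceeds
        # len(text) the range is empty and nothing is yielded.
        for start in range(len(text) - n + 1):
            yield text[start:start + n]


if __name__ == "__main__":
    print(list(char_ngrams("abc", (1, 2))))
```

With this sketch, an ngram_range of (1, 2) first yields all unigrams and then all bigrams. Whether n-grams should instead be grouped by start position is a design choice that the quoted description does not settle.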

Though the extension of this parameter to skip grams is not clear.

There is also the question of how to chain tokenization and n-gram iterators (#21).
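One possible way to chain the two steps is to have the n-gram iterator consume any iterable of tokens lazily. The sketch below uses Python generators; `tokenize` and `ngrams` are hypothetical names with an assumed regex-based tokenizer, not this project's API or the design discussed in #21.

```python
import re
from collections import deque
from typing import Iterable, Iterator, Tuple


def tokenize(text: str) -> Iterator[str]:
    """Hypothetical tokenizer: yields runs of word characters."""
    for match in re.finditer(r"\w+", text):
        yield match.group()


def ngrams(tokens: Iterable[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield word n-grams from any token iterable, without padding."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # A bounded deque keeps only the last n tokens, so the input
    # iterator is consumed lazily and never materialized in full.
    window = deque(maxlen=n)
    for token in tokens:
        window.append(token)
        if len(window) == n:
            yield tuple(window)


if __name__ == "__main__":
    # Chaining: the output of the tokenizer feeds the n-gram iterator.
    print(list(ngrams(tokenize("the quick brown fox"), 2)))
```

Because both stages are iterators, no intermediate token list is built; the same `ngrams` function would also accept a plain string and then produce character n-grams as tuples.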

joshlk · 2020-07-06T16:00:55Z

PR: #82

Please take a look when you get a chance

rth added the new feature label May 3, 2019
