
tokenize API #11

Open
MikeInnes opened this issue Oct 12, 2018 · 9 comments

@MikeInnes
Collaborator

The set_tokenizer API seems a bit suspect here, given that it can be replaced with

const tokenize = WordTokenizers.nltk_tokenize

and likewise for RevTok etc, without bringing in multiple packages just to define an alias :)

I also think it's generally a good idea to expose people to higher-order functions and such; people might not realise that you can, e.g., just pass a custom tokenize function into a constructor rather than setting and unsetting it globally.
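For illustration, something like the following works without any global setter (a sketch only: the Document type and its tokenizer keyword are hypothetical, not an existing API; I'm assuming nltk_word_tokenize is the exported name):

import WordTokenizers

# Alias whichever tokenizer you want; no global setter needed:
const tokenize = WordTokenizers.nltk_word_tokenize

# A corpus/document type can simply accept the tokenizer as an argument:
struct Document
    tokens::Vector{String}
end
Document(text::AbstractString; tokenizer = tokenize) = Document(tokenizer(text))

Document("Hello, world!")                     # uses the default
Document("Hello, world!"; tokenizer = split)  # swap in any function per call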

@oxinabox
Member

oxinabox commented Oct 16, 2018

Perhaps.
My original thought was that it would be good if the average user didn't have to worry about what tokenizers were available,
and could just say "tokenize this".

But the more advanced user might also want to configure it, and have it apply globally -- even into other packages.

However, the big issue I see with that is that the tokenizer is really corpus-specific.
So the idea of a settable global default is perhaps silly.
If we think about handling different languages, or even texts from Twitter vs. newspaper articles, you want a different tokenizer.
So making this settable globally might not be a good idea.

Related:
We should actually be thinking in terms of languages, like Embeddings.jl does.

I am thinking more like:

const tokenize = tokenizer(English()) # use the default English tokenizer
const tokenize = tokenizer(English(), 2) # use the second English tokenizer

and we should expose
list_tokenizers(::Language) which gives a list of suitable tokenizers.
(E.g. TokTok #5 is good for a bunch of languages, whereas Penn is only good for English.)
More generally:
we could maybe attach traits to the tokenizer functions --
traits for language, traits for reversibility --
which might be better; then one could say:

const tokenize = tokenizer(English(), Reversible() , URLsSupported())

(This should be using Languages.jl for type-based language IDs, both here and there; cf. JuliaText/Embeddings.jl#6.)
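As a rough sketch of how that lookup might be spelled (none of these functions exist yet; supports, list_tokenizers, and the registry are purely illustrative, and only the language trait is shown; I'm assuming WordTokenizers' nltk_word_tokenize / toktok_tokenize and Languages.jl's language types):

import WordTokenizers, Languages

# Registry of candidate tokenizer functions (illustrative).
const KNOWN_TOKENIZERS = Function[
    WordTokenizers.nltk_word_tokenize,
    WordTokenizers.toktok_tokenize,
]

# Language trait: which languages does a given tokenizer support?
supports(::typeof(WordTokenizers.nltk_word_tokenize), ::Languages.English) = true
supports(::typeof(WordTokenizers.toktok_tokenize), ::Languages.Language) = true  # TokTok covers many languages
supports(::Function, ::Languages.Language) = false                               # conservative fallback

list_tokenizers(lang::Languages.Language) =
    [t for t in KNOWN_TOKENIZERS if supports(t, lang)]

tokenizer(lang::Languages.Language, n::Integer = 1) = list_tokenizers(lang)[n]

const tokenize = tokenizer(Languages.English())  # first (default) English tokenizer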

@aquatiko
Contributor

aquatiko commented Apr 3, 2019

@oxinabox Would it be a good idea to add a traits function which takes any tokenizer as input and gives info about it, which could potentially be used in the tokenizer approach proposed above?

@oxinabox
Member

oxinabox commented Apr 3, 2019

I think multiple different trait functions, starting with language.
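Something along these lines, maybe (purely illustrative, nothing like this exists yet): one trait function per property, queried per tokenizer.

import WordTokenizers, Languages

# Hypothetical per-tokenizer trait queries; language coverage first, others later.
langs(::typeof(WordTokenizers.nltk_word_tokenize)) = [Languages.English()]
isreversible(::typeof(WordTokenizers.nltk_word_tokenize)) = false

langs(WordTokenizers.nltk_word_tokenize)  # => [Languages.English()]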

@MikeInnes
Collaborator Author

I really just think docstrings would be better here. It's a case of KISS until there's a clear need for any more complexity.

@oxinabox
Member

oxinabox commented Apr 3, 2019

Yeah.
The nice thing to do, though,
would be to have a default for each language.

Then the same for Embeddings.jl (which almost does this)
see JuliaText/Embeddings.jl#6

Then we could do things like:

LANG = Languages.detect_language(corpus)
tokenizer = Tokenizers.tokenizer(LANG)
words = tokenizer(corpus)
vocab = unique(words)
embtable = Embeddings.load_embeddings(LANG, vocab)

encode = onehot_encoder(length(vocab))
mdl = model(embtable.embeddings)
train!(mdl, encode.(words))

@MikeInnes
Collaborator Author

That's a good use case, although even then, wasn't #14 meant to implement something fairly general and language-agnostic? It seems better to have the same default for all languages if at all possible.

@oxinabox
Member

oxinabox commented Apr 3, 2019

#18 is fairly general and language-agnostic, and is now the default.
But it is still basically useless in a ton of languages; it is still space-centric.
Further, we don't have any tokenizer for any language yet (including English) that is better than that.

So until we do, this is not really pressing, as the answer would always be "use TokTok".

@Ayushk4
Member

Ayushk4 commented Aug 21, 2019

I think that the Tokenizer API should also be able to expose the TokenBuffer API and its various lexer functions for building custom tokenizers.

@oxinabox
Member

> I think that the Tokenizer API should also be able to expose the TokenBuffer API and its various lexer functions for building custom tokenizers.

I'm not sure how that would work.
They act at different levels.
The TokenBuffer API makes tokenizers.

The Tokenizer API specifies what should happen when you call
tokenizer(str) or split(str, (Words(), Sentences())) (IIRC).
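For reference, a minimal custom tokenizer built at the TokenBuffer level looks roughly like this (a sketch following the pattern in the WordTokenizers docs; I'm assuming the lexer helpers isdone, spaces, and character are importable like so):

using WordTokenizers: TokenBuffer, isdone, spaces, character

# A trivial whitespace tokenizer written with the TokenBuffer machinery.
function my_tokenizer(input)
    ts = TokenBuffer(input)
    while !isdone(ts)
        spaces(ts) ||     # consume whitespace, closing off the current token
        character(ts)     # otherwise append the next character to the current token
    end
    return ts.tokens
end

my_tokenizer("hello world")  # => ["hello", "world"]

The Tokenizer API question is then just about how a function like my_tokenizer gets selected and called, which is the separate level being discussed here.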
