
tokenize API #11

Open
MikeInnes opened this issue Oct 12, 2018 · 9 comments

@MikeInnes
Collaborator

The set_tokenizer API seems a bit suspect here, given that it can be replaced with

const tokenize = WordTokenizers.nltk_tokenize

and likewise for RevTok etc, without bringing in multiple packages just to define an alias :)

I also think it's generally a good idea to expose people to higher-order functions and such; people might not realise that you can, e.g., just pass a custom tokenize function into a constructor rather than setting and unsetting it globally.
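For illustration, something like the following works without any global setter (a sketch only: the Document type and its tokenizer keyword are hypothetical, not an existing API; I'm assuming nltk_word_tokenize is the exported name):

import WordTokenizers

# Alias whichever tokenizer you want; no global setter needed:
const tokenize = WordTokenizers.nltk_word_tokenize

# A corpus/document type can simply accept the tokenizer as an argument:
struct Document
    tokens::Vector{String}
end
Document(text::AbstractString; tokenizer = tokenize) = Document(tokenizer(text))

Document("Hello, world!")                     # uses the default
Document("Hello, world!"; tokenizer = split)  # swap in any function per call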

@oxinabox
Member

oxinabox commented Oct 16, 2018

Perhaps.
My original thought was that it would be good if the average user didn't have to worry about what tokenizers were available,
and could just say "tokenize this".

But the more advanced user might also want to configure it, and have it apply globally -- even into other packages.

However, the big issue I see with that is that the tokenizer is really corpus-specific.
So the idea of a settable global default is perhaps silly.
If we think about handling different languages, or even texts from Twitter vs. newspaper articles, you want a different tokenizer.
So making this settable globally might not be a good idea.

Related:
We should actually be thinking in terms of languages, like Embeddings.jl does.

I am thinking more like:

const tokenize = tokenizer(English()) # use the default English tokenizer
const tokenize = tokenizer(English(), 2) # use the second English tokenizer

and we should expose
list_tokenizers(::Language) which gives a list of suitable tokenizers.
(E.g. TokTok #5 is good for a bunch of languages, whereas Penn is only good for English.)
More generally:
we could maybe attach traits to the tokenizer functions --
traits for language, traits for reversibility --
which might be better; then one could say:

const tokenize = tokenizer(English(), Reversible() , URLsSupported())

(This should be using Languages.jl for type-based language IDs, both here and there; cf. JuliaText/Embeddings.jl#6.)
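As a rough sketch of how that lookup might be spelled (none of these functions exist yet; supports, list_tokenizers, and the registry are purely illustrative, and only the language trait is shown; I'm assuming WordTokenizers' nltk_word_tokenize / toktok_tokenize and Languages.jl's language types):

import WordTokenizers, Languages

# Registry of candidate tokenizer functions (illustrative).
const KNOWN_TOKENIZERS = Function[
    WordTokenizers.nltk_word_tokenize,
    WordTokenizers.toktok_tokenize,
]

# Language trait: which languages does a given tokenizer support?
supports(::typeof(WordTokenizers.nltk_word_tokenize), ::Languages.English) = true
supports(::typeof(WordTokenizers.toktok_tokenize), ::Languages.Language) = true  # TokTok covers many languages
supports(::Function, ::Languages.Language) = false                               # conservative fallback

list_tokenizers(lang::Languages.Language) =
    [t for t in KNOWN_TOKENIZERS if supports(t, lang)]

tokenizer(lang::Languages.Language, n::Integer = 1) = list_tokenizers(lang)[n]

const tokenize = tokenizer(Languages.English())  # first (default) English tokenizer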

@aquatiko
Contributor

aquatiko commented Apr 3, 2019

@oxinabox Would it be a good idea to add a traits function which takes any tokenizer as input and gives info about it, which could potentially be used in the tokenizer approach proposed above?

@oxinabox
Member

oxinabox commented Apr 3, 2019

I think multiple different trait functions, starting with language.
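Something along these lines, maybe (purely illustrative, nothing like this exists yet): one trait function per property, queried per tokenizer.

import WordTokenizers, Languages

# Hypothetical per-tokenizer trait queries; language coverage first, others later.
langs(::typeof(WordTokenizers.nltk_word_tokenize)) = [Languages.English()]
isreversible(::typeof(WordTokenizers.nltk_word_tokenize)) = false

langs(WordTokenizers.nltk_word_tokenize)  # => [Languages.English()]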

@MikeInnes
Collaborator Author

I really just think docstrings would be better here. It's a case of KISS until there's a clear need for any more complexity.

@oxinabox
Member

oxinabox commented Apr 3, 2019

Yeah.
The nice thing to do, though,
would be to have a default for each language.

Then the same for Embeddings.jl (which almost does this)
see JuliaText/Embeddings.jl#6

Then we could do things like:

LANG = Languages.detect_language(corpus)
tokenizer = Tokenizers.tokenizer(LANG)
words = tokenizer(corpus)
vocab = unique(words)
embtable = Embeddings.load_embeddings(LANG, vocab)

encode = onehot_encoder(length(vocab))
mdl = model(embtable.embeddings)
train!(mdl, encode.(words))

@MikeInnes
Collaborator Author

That's a good use case, although even then, wasn't #14 meant to implement something fairly general and language-agnostic? It seems better to have the same default for all languages if at all possible.

@oxinabox
Member

oxinabox commented Apr 3, 2019

#18 is fairly general and language-agnostic, and is now the default.
But it is still basically useless in a ton of languages; it is still space-centric.
Further, we don't have any tokenizer for any language yet (including English) that is better than that.

So until we do, this is not really pressing, as the answer would always be "use TokTok".

@Ayushk4
Member

Ayushk4 commented Aug 21, 2019

I think that the Tokenizer API should also be able to expose the TokenBuffer API and its various lexer functions for building custom tokenizers.

@oxinabox
Member

> I think that the Tokenizer API should also be able to expose the TokenBuffer API and its various lexer functions for building custom tokenizers.

I'm not sure how that would work.
They act at different levels.
The TokenBuffer API makes tokenizers.

The Tokenizer API specifies what should happen when you call
tokenizer(str) or split(str, (Words(), Sentences())) (IIRC).
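For reference, a minimal custom tokenizer built at the TokenBuffer level looks roughly like this (a sketch following the pattern in the WordTokenizers docs; I'm assuming the lexer helpers isdone, spaces, and character are importable like so):

using WordTokenizers: TokenBuffer, isdone, spaces, character

# A trivial whitespace tokenizer written with the TokenBuffer machinery.
function my_tokenizer(input)
    ts = TokenBuffer(input)
    while !isdone(ts)
        spaces(ts) ||     # consume whitespace, closing off the current token
        character(ts)     # otherwise append the next character to the current token
    end
    return ts.tokens
end

my_tokenizer("hello world")  # => ["hello", "world"]

The Tokenizer API question is then just about how a function like my_tokenizer gets selected and called, which is the separate level being discussed here.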
