Move models to TextModels.jl #111
@aviks I want to work on issue #117 and implement an NER and POS tagger as part of GSoC.
I would like to have your suggestions on the following:
As an end user, I'd be really interested in the ability to use state-of-the-art neural models like Stanza's as part of your TextModels.jl, whether for parsing, NER, tokenizing, or sentence segmentation.
We currently have NER and POS neural models (>90% accuracy on CoNLL'03) in TextAnalysis.jl#master.
WordTokenizers.jl has a bunch of high-speed tokenizers, a sentence segmenter, and, recently, a SentencePiece tokenizer.
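For reference, the rule-based tokenizers mentioned above are exposed through a small API; a minimal sketch (assuming WordTokenizers.jl's exported `split_sentences` and `tokenize` functions, with the default NLTK-style word tokenizer) might look like:

```julia
using WordTokenizers

text = "JuliaText is growing. WordTokenizers.jl ships several tokenizers."

# Split into sentences first, then tokenize each sentence into words.
for sent in split_sentences(text)
    println(tokenize(sent))
end
```

These are the fast rule-based routines; the statistical SentencePiece tokenizer discussed below is a separate API.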
Thanks @ayush1999, I wasn't aware that the accuracy was that good! For the particular project I'm currently working on, though, I need the best possible accuracy; or at least, I need to be able to show my higher-ups that the models are close enough to the ones we'd otherwise use (Stanza's). Do you have specific benchmarks for the models (in particular, the POS one)?

Similarly, I'd be interested in eventually being able to use neural models for tokenizing and sentence segmenting (my understanding is that the current tokenizers are rule-based). (I'd also suggest updating the docs with the benchmarks; I can help out with the doc updates if you have the benchmark info lying around somewhere.)

I'm also very interested in having access to SOTA neural dependency parsers for my current project. I'm currently looking at trying to wrap spaCy and spacy-stanza (continuing on https://github.com/jekbradbury/SpaCy.jl/blob/master/src/SpaCy.jl), but if it won't be too hard for a non-software-engineer like me to do that with your TextAnalysis framework, I'd be interested in helping to extend TextAnalysis / TextModels to do that. (But again, for my current project, it has to be models that are as good as, or not very far off from, Stanza's performance.)

Thanks once again for chiming in, and for all the hard work! I wasn't aware that your NER and POS models were already that good!
If by neural models for tokenizing you mean statistical tokenizers, those are available in WordTokenizers.jl (SentencePiece) and Transformers.jl (WordPiece and BPE). I am not aware of any neural model for tokenizing the English language.

It would be nice to have access to Stanza's models in Julia as well. A simple wrapper should be easy to make using PyCall.

I don't have benchmarks for POS, but for NER you can find them here: https://ayushk4.github.io/images/2019/ner_compare.png (NER and POS both employ the same architecture.)

If you have compute and a dataset, you can train better models on BERT from the Transformers.jl package for this downstream task.

Also, contributions are always welcome to the JuliaText packages. A dependency parser in Julia/Flux would be a great addition.
Thanks Ayush! This is really helpful.
Just a quick final question: how much work do you think it might be for me to somehow pipe the Stanza models for dependency parsing, POS tagging, etc. into the JuliaText interface? PyCall's going to be involved, of course, but I don't have a good sense right now of how much work it will be to wrap the models with the JuliaText interface (or whether it's worth it).
TBH, I am not very familiar with either Stanza or packaging with PyCall. Typically it is very easy to execute Python code in Julia via PyCall.
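To make that concrete, here is a minimal sketch of calling Stanza from Julia via PyCall. It assumes PyCall.jl is installed and its Python environment has `stanza` installed with the English models downloaded; the pipeline configuration shown (`processors="tokenize,pos"`) is just an illustrative choice, not an endorsed wrapping design.

```julia
using PyCall

# Import the Python package; pyimport fails if stanza is not
# installed in PyCall's Python environment.
stanza = pyimport("stanza")

# Build an English pipeline with tokenization and POS tagging.
nlp = stanza.Pipeline("en", processors="tokenize,pos")
doc = nlp("Julia wraps Python easily.")

# Python attribute access maps directly onto Julia property syntax.
for sent in doc.sentences, word in sent.words
    println(word.text, " => ", word.upos)
end
```

A fuller JuliaText integration would mostly be a matter of converting these `PyObject` results into the corresponding TextAnalysis/TextModels types.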
I've been trying out PyCall with spacy-stanza (https://github.com/explosion/spacy-stanza), and it looks like it will take some work to get doc slicing to work with the syntax one would expect.
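The slicing friction comes from indexing conventions: Python's `doc[0]` is 0-based `__getitem__`, which does not map onto Julia's 1-based `doc[1]` automatically. A sketch of the workaround, assuming `spacy_stanza` is installed in PyCall's Python environment and using PyCall's documented `get(o, key)` call, which invokes Python's `__getitem__` directly:

```julia
using PyCall

# spacy_stanza is the Python package from explosion/spacy-stanza;
# load_pipeline downloads/loads the Stanza-backed English pipeline.
spacy_stanza = pyimport("spacy_stanza")
nlp = spacy_stanza.load_pipeline("en")
doc = nlp("PyCall indexing needs care.")

# `get` passes the 0-based index straight through to Python,
# so this retrieves the first token, equivalent to Python's doc[0].
first_token = get(doc, 0)
println(first_token.text)
```

A proper wrapper would hide this behind Julia-style `getindex` methods so users could write `doc[1]` with 1-based indexing.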
The models have been moved to TextModels.jl.
Just documenting my intention here: adding a Flux dependency to this project seems to be a somewhat unpopular change. I plan to move the actual ML models to a new TextModels.jl package and leave this package with the basic text-processing functions.