Move models to TextModels.jl #111
@aviks I want to work on issue #117 and implement an NER and POS tagger as part of GSoC.
I would like to have your suggestions on the following:
As an end user, I'd be really interested in the ability to use state-of-the-art neural models like Stanza's as part of your TextModels.jl, whether for parsing, NER, tokenizing, or sentence segmentation.
We currently have NER and POS neural models (>90% accuracy on CoNLL'03) in TextAnalysis.jl#master.
WordTokenizers.jl has a bunch of high-speed tokenizers, a sentence segmenter, and, recently, a SentencePiece tokenizer.
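For reference, the rule-based tokenizers mentioned above are exposed through a small API; a minimal sketch (assuming WordTokenizers.jl's exported `split_sentences` and `tokenize` functions, with the default NLTK-style word tokenizer) might look like:

```julia
using WordTokenizers

text = "JuliaText is growing. WordTokenizers.jl ships several tokenizers."

# Split into sentences first, then tokenize each sentence into words.
for sent in split_sentences(text)
    println(tokenize(sent))
end
```

These are the fast rule-based routines; the statistical SentencePiece tokenizer discussed below is a separate API.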
Thanks @ayush1999, I wasn't aware that the accuracy was that good! For the particular project I'm currently working on, though, I need the best possible accuracy; or at least, I need to be able to show my higher-ups that the models are close enough to the ones we'd otherwise use (Stanza's). Do you have specific benchmarks for the models (in particular, the POS one)?

Similarly, I'd be interested in eventually being able to use neural models for tokenizing and sentence segmenting (my understanding is that the current tokenizers are rule-based). (I'd also suggest updating the docs with the benchmarks; I can help out with the doc updates if you have the benchmark info lying around somewhere.)

I'm also very interested in having access to SOTA neural dependency parsers for my current project. I'm currently looking at trying to wrap spaCy and spacy-stanza (continuing on https://github.com/jekbradbury/SpaCy.jl/blob/master/src/SpaCy.jl), but if it won't be too hard for a non-software-engineer like me to do that with your TextAnalysis framework, I'd be interested in helping to extend TextAnalysis / TextModels to do that. (But again, for my current project, it has to be models that are as good as, or not very far off from, Stanza's performance.)

Thanks once again for chiming in, and for all the hard work! I wasn't aware that your NER and POS models were already that good!
If by neural models for tokenizing you mean statistical tokenizers, those are available in WordTokenizers.jl (SentencePiece) and Transformers.jl (WordPiece and BPE). I am not aware of any neural model for tokenizing the English language.

It would be nice to have access to Stanza's models in Julia as well. A simple wrapper should be easy to make using PyCall.

I don't have benchmarks for POS, but for NER you can find them here: https://ayushk4.github.io/images/2019/ner_compare.png (NER and POS both employ the same architecture.)

If you have compute and a dataset, you can train better models on BERT from the Transformers.jl package for this downstream task.

Also, contributions are always welcome to the JuliaText packages. A dependency parser in Julia/Flux would be a great addition.
Thanks Ayush! This is really helpful.
Just a quick final question: how much work do you think it might be for me to somehow pipe the Stanza models for dependency parsing, POS tagging, etc. into the JuliaText interface? PyCall's going to be involved, of course, but I don't have a good sense right now of how much work it will be to wrap the models with the JuliaText interface (or whether it's worth it).
TBH, I am not very familiar with either Stanza or packaging with PyCall. Typically it is very easy to execute Python code in Julia via PyCall.
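To make that concrete, here is a minimal sketch of calling Stanza from Julia via PyCall. It assumes PyCall.jl is installed and its Python environment has `stanza` installed with the English models downloaded; the pipeline configuration shown (`processors="tokenize,pos"`) is just an illustrative choice, not an endorsed wrapping design.

```julia
using PyCall

# Import the Python package; pyimport fails if stanza is not
# installed in PyCall's Python environment.
stanza = pyimport("stanza")

# Build an English pipeline with tokenization and POS tagging.
nlp = stanza.Pipeline("en", processors="tokenize,pos")
doc = nlp("Julia wraps Python easily.")

# Python attribute access maps directly onto Julia property syntax.
for sent in doc.sentences, word in sent.words
    println(word.text, " => ", word.upos)
end
```

A fuller JuliaText integration would mostly be a matter of converting these `PyObject` results into the corresponding TextAnalysis/TextModels types.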
I've been trying out PyCall with spacy-stanza (https://github.com/explosion/spacy-stanza), and it looks like it will take some work to get doc slicing to work with the syntax one would expect.
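The slicing friction comes from indexing conventions: Python's `doc[0]` is 0-based `__getitem__`, which does not map onto Julia's 1-based `doc[1]` automatically. A sketch of the workaround, assuming `spacy_stanza` is installed in PyCall's Python environment and using PyCall's documented `get(o, key)` call, which invokes Python's `__getitem__` directly:

```julia
using PyCall

# spacy_stanza is the Python package from explosion/spacy-stanza;
# load_pipeline downloads/loads the Stanza-backed English pipeline.
spacy_stanza = pyimport("spacy_stanza")
nlp = spacy_stanza.load_pipeline("en")
doc = nlp("PyCall indexing needs care.")

# `get` passes the 0-based index straight through to Python,
# so this retrieves the first token, equivalent to Python's doc[0].
first_token = get(doc, 0)
println(first_token.text)
```

A proper wrapper would hide this behind Julia-style `getindex` methods so users could write `doc[1]` with 1-based indexing.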
The models have been moved to TextModels.jl.
Just documenting my intention here: adding a Flux dependency to this project seems to be a somewhat unpopular change. I plan to move the actual ML models to a new TextModels.jl package and leave this package with the basic text-processing functions.