Move models to TextModels.jl #111

Closed
aviks opened this issue Jan 4, 2019 · 11 comments

Comments

@aviks
Member

aviks commented Jan 4, 2019

Just documenting my intention here. Adding a Flux dependency to this project seems somewhat of an unpopular change. I plan to move the actual ML models to a new TextModels.jl package, and leave this package containing basic text processing functions

@Ayushk4
Member

Ayushk4 commented Mar 20, 2019

@aviks I would like to work on issue #117 and implement an NER and POS tagger as part of GSoC.
Here are my major contributions so far - WordTokenizers.jl/#13, TextAnalysis/#134 and CorpusLoaders.jl/#15

My other contributions can be found here - WordTokenizers.jl/author=Ayushk4, TextAnalysis.jl/author=Ayushk4 and CorpusLoaders.jl/author=Ayushk4

I would like to have your suggestions on the following -

  • I will need Conditional Random Fields to implement the NER and POS taggers. Should the CRF code live in JuliaText, in Flux, or somewhere else?
  • I would like to know whether there is any plan you envision for TextModels.jl.
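
For readers unfamiliar with why a CRF helps with NER and POS tagging: a linear-chain CRF scores a whole tag sequence using per-token scores plus tag-to-tag transition scores, and decoding picks the best sequence with the Viterbi algorithm. The following is a minimal sketch of that decoding step only, in plain Python. The function name `viterbi`, the tag set, and all scores are made up for illustration; in a real CRF the scores are learned, and this is not the code from TextAnalysis.jl, Flux, or any other package.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag path and its score.

    emissions:   list with one dict per token, mapping tag -> score
    transitions: dict mapping (previous_tag, current_tag) -> score;
                 pairs that are missing count as 0.0
    """
    tags = list(emissions[0])
    # Best score of any path that ends in each tag at the first token.
    best = {tag: emissions[0][tag] for tag in tags}
    backpointers = []

    for scores in emissions[1:]:
        new_best, pointers = {}, {}
        for cur in tags:
            # Choose the previous tag that maximises path score + transition.
            prev = max(tags, key=lambda p: best[p] + transitions.get((p, cur), 0.0))
            new_best[cur] = best[prev] + transitions.get((prev, cur), 0.0) + scores[cur]
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)

    # Walk the backpointers from the best final tag to recover the path.
    last = max(tags, key=best.get)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


if __name__ == "__main__":
    # Hypothetical scores for a three-token sentence.
    emissions = [
        {"O": 0.1, "B-PER": 2.0, "I-PER": 0.5},
        {"O": 1.5, "B-PER": 0.2, "I-PER": 1.2},
        {"O": 2.0, "B-PER": 0.1, "I-PER": 0.1},
    ]
    transitions = {("B-PER", "I-PER"): 1.0, ("O", "I-PER"): -10.0}
    print(viterbi(emissions, transitions))
```

In this toy input the second token on its own prefers `O`, but the `B-PER` to `I-PER` transition bonus makes the jointly best path `B-PER, I-PER, O`. That sequence-level consistency is what the CRF layer adds on top of per-token predictions.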

@ym-han

ym-han commented Jul 30, 2020

As an end user, I'd be really interested in being able to use state-of-the-art neural models like Stanza's as part of TextModels.jl, whether for parsing, NER, tokenizing, or sentence segmentation.

@Ayushk4
Member

Ayushk4 commented Jul 31, 2020

We currently have NER and POS neural models (>90% accuracy on CoNLL'03) in TextAnalysis.jl#master.

Docs - https://juliatext.github.io/TextAnalysis.jl/latest

@Ayushk4
Member

Ayushk4 commented Jul 31, 2020

WordTokenizers.jl has a bunch of high-speed tokenizers, a sentence segmenter and, more recently, a SentencePiece tokenizer.

@ym-han

ym-han commented Jul 31, 2020

> We currently have NER and POS neural models (>90% accuracy on CoNLL'03) in TextAnalysis.jl#master.
>
> Docs - https://juliatext.github.io/TextAnalysis.jl/latest

Thanks @ayush1999, I wasn't aware that the accuracy was that good! For the particular project I'm currently working on, though, I need the best possible accuracy; or at least, I need to be able to show my higher-ups that the models are close enough to the models we'd otherwise use (stanza's). Do you have specific benchmarks for the models (in particular, the POS one)? Similarly, I'd be interested in eventually being able to use neural models for tokenizing and sentence segmenting (my understanding is that the current tokenizers are rule-based). (I'd also suggest updating the docs with the benchmarks; I can help out with the doc updating if you have the benchmark info lying around somewhere.)

I'm also very interested in having access to SOTA neural dependency parsers for my current project. I'm currently looking at trying to wrap spacy and spacy-stanza (continuing on https://github.com/jekbradbury/SpaCy.jl/blob/master/src/SpaCy.jl), but if it won't be too hard for a non-software-engineer like me to do that with your TextAnalysis framework, I'd be interested in helping to extend TextAnalysis / TextModels to do that. (But again, for my current project, it has to be models that are as good as, or not very far off from, stanza's performance.)

Thanks once again for chiming in, and for all the hard work, though! I wasn't aware that your NER and POS models were already that good!

@Ayushk4
Member

Ayushk4 commented Jul 31, 2020

If by neural models for tokenizing you mean statistical tokenizers, then those are available in WordTokenizers.jl (SentencePiece) and Transformers.jl (WordPiece and BPE). I am not aware of any neural model for tokenizing the English language.
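
To illustrate what "statistical" means here, as opposed to rule-based or neural: a BPE tokenizer learns its vocabulary by repeatedly merging the most frequent adjacent symbol pair in a corpus. Below is a toy sketch of that merge-learning loop in plain Python. The function names and the word counts are made up for illustration, and this is not the implementation used in WordTokenizers.jl or Transformers.jl.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merges from a dict of word -> count."""
    # Start from single characters as the symbols of each word.
    words = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


if __name__ == "__main__":
    # Hypothetical word counts standing in for a training corpus.
    counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
    merges, segmented = learn_bpe(counts, 3)
    print(merges)
    print(segmented)
```

On this toy corpus the three learned merges are `("u", "g")`, `("u", "n")` and `("h", "ug")`, so "hug" ends up as a single symbol while "pug" stays split as "p" + "ug". Nothing here is a neural network; the segmentation comes purely from pair counts.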

It would be nice to have access to stanza's models in Julia as well. A simple wrapper could be easy to make using PyCall.

I don't have performance numbers for POS, but for NER you can find them here - https://ayushk4.github.io/images/2019/ner_compare.png
NER and POS both employ the same architecture.

If you have the compute and a dataset, then you can train better models on top of BERT from the Transformers.jl package for this downstream task.

Also, contributions are always welcome to JuliaText packages. A dependency parser in Julia with Flux would be a great addition.

@ym-han

ym-han commented Jul 31, 2020 via email

@ym-han

ym-han commented Aug 1, 2020

Just a quick final question: how much work do you think it might be for me to somehow pipe the stanza models for dependency parsing, POS tagging, etc. into the JuliaText interface? PyCall's going to be involved, of course, but I don't have a good sense right now of how much work it will be to wrap the models with the JuliaText interface (or whether it's worth it).

@Ayushk4
Member

Ayushk4 commented Aug 1, 2020

TBH, I am not very familiar with either stanza or packaging with PyCall. Typically it is very easy to execute Python code in Julia via PyCall.

@ym-han

ym-han commented Aug 1, 2020

I've been trying out PyCall with spacy-stanza (https://github.com/explosion/spacy-stanza), and it looks like it will take some work to get doc slicing to work with the syntax one would expect.

@aviks
Member Author

aviks commented Nov 2, 2020

The models have been moved to TextModels.jl.

@aviks aviks closed this as completed Nov 2, 2020