Add spaCy analyzer #527
Codecov Report
@@ Coverage Diff @@
## master #527 +/- ##
==========================================
+ Coverage 99.49% 99.50% +0.01%
==========================================
Files 80 82 +2
Lines 5340 5458 +118
==========================================
+ Hits 5313 5431 +118
Misses 27 27
It's working now (at least for the tfidf backend) but it is pretty slow: at least an order of magnitude slower than the Snowball analyzer. I think some batching is needed to make it more efficient, but that requires changes to the Analyzer API as well as to individual backends.
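A rough sketch of what batching could look like, assuming spaCy's `Language.pipe` is used instead of calling the pipeline once per text. The function name `lemmatize_batched` and its signature are hypothetical and not part of the Analyzer API in this PR; the `batch_size` default is an arbitrary choice.

```python
from typing import Iterable, Iterator, List


def lemmatize_batched(nlp, texts: Iterable[str],
                      batch_size: int = 64) -> Iterator[List[str]]:
    """Yield one list of lemmas per input text (hypothetical helper).

    `nlp` is assumed to be a loaded spaCy pipeline, i.e. an object with a
    `pipe(texts, batch_size=...)` method that yields one Doc per text.
    """
    stripped = (text.strip() for text in texts)
    # Language.pipe processes the texts in batches and yields Docs in
    # input order, which avoids the per-call overhead of nlp(text).
    for doc in nlp.pipe(stripped, batch_size=batch_size):
        yield [token.lemma_ for token in doc]
```

With a real pipeline this would be called as `list(lemmatize_batched(nlp, documents))`. The catch mentioned above still applies: backends that ask the analyzer for one text at a time would need to pass whole collections of texts for this to help.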
Kudos, SonarCloud Quality Gate passed! 0 Bugs No Coverage information |
Force-pushed: f3cf264 to ffab9ea
Rebased and force-pushed. There is a new release of spaCy (3.2.0) available; that should be tested too.
Force-pushed: f4c8ea9 to 10463f1
Rebased on current master, fixed conflicts and force-pushed. |
… image via build arg
Ready for review! Things I'm a bit unsure about:
Works with YAKE, and the spaCy analyzer also somewhat improved evaluation results compared to the Snowball analyzer (on the JYU test set, F1@5 0.1706 -> 0.1870). The Dockerfile looks good to me. (The three tries for timeouts when downloading NLTK data were originally added for builds in Drone, as there were some network problems in Drone at that time, but I think the situation has improved now.) Just one point to consider: if the spaCy model has not been loaded, a rather lengthy traceback is shown, ending
I compared this to Snowball using the Annif-tutorial yso-nlf data sets and the three backend configurations (tfidf, mllm, omikuji-parabel) used in the tutorial. Results (best score for each backend type highlighted):
Observations:
The point of these experiments was to check that the analyzer works reasonably well with those backends, not that the results are necessarily better in terms of F1 scores etc. spaCy has other advantages, especially the many languages it supports. |
I also tested svc and fasttext backends using the 20news data set in Annif-corpora. Results:
Observations:
I'd say this is good enough. I will check a few final things (including the error shown when a model doesn't exist, thanks @juhoinkinen!) and then merge this.
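One possible way to shorten the error for a missing model, as a minimal sketch: `spacy.load` raises `OSError` when the named model package is not installed, so that can be caught and replaced with a short message. The names `ModelNotInstalledError` and `load_spacy_model`, and the `loader` parameter (there only so the behaviour can be exercised without spaCy installed), are hypothetical and not taken from this PR.

```python
class ModelNotInstalledError(Exception):
    """Hypothetical error type for a missing spaCy model."""


def load_spacy_model(model_name, loader=None):
    """Load a spaCy model, turning a missing model into a short error.

    `loader` defaults to spacy.load; passing another callable is an
    assumption made here purely for illustration and testing.
    """
    if loader is None:
        import spacy  # imported lazily so spaCy stays an optional dependency
        loader = spacy.load
    try:
        return loader(model_name)
    except OSError:
        # "from None" suppresses the chained spaCy traceback, so only
        # the short message below is shown.
        raise ModelNotInstalledError(
            f"spaCy model '{model_name}' is not installed; "
            f"try: python -m spacy download {model_name}"
        ) from None
```

Whether to raise a dedicated exception or reuse an existing one from the project is a design choice left open here.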
Initial draft PR of new spaCy based (optional) analyzer.
Fixes #374
TODO items:
Test with Swedish (which doesn't have a complete pretrained model) and adapt the code as necessary. Out of scope for now.