multilingual SubjectIndex backed by CSV file #608

Merged: 7 commits into master from refactor-subjectindex-multilingual, Aug 15, 2022

@osma (Member) commented Aug 15, 2022

This PR completes (hopefully) the switch to multilingual vocabularies (#559) that started in PR #600 and continued in PRs #604 and #606. It changes the SubjectIndex so that it keeps track of labels in all available languages, not just one at a time. The index is stored on disk in a single CSV file with one label column per language, named e.g. label_en, label_fi, label_sv. This replaces the current short-lived format, which used a separate TSV file for each language (subjects.en.tsv, subjects.fi.tsv etc.). It turned out to be easier and cleaner to have just a single file containing labels in all languages. CSV is a good format for this because the columns are named in a header row, so there is some flexibility in which columns are used.

It is also possible to use this CSV format to represent multilingual vocabularies that can be loaded with the annif loadvoc command.
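
As a rough illustration of the layout described above, here is a minimal sketch of reading such a file with Python's standard csv module. It is not the code from this PR: the `uri` column name, the sample rows and the `load_labels` helper are assumptions made for the example. Only the label_<lang> column naming comes from the description.

```python
import csv
import io

# Hypothetical sample data. The "uri" column name is an assumption;
# only the label_<lang> naming is taken from the PR description.
SAMPLE_CSV = """uri,label_en,label_fi,label_sv
http://example.org/c1,cats,kissat,katter
http://example.org/c2,dogs,koirat,hundar
"""

LABEL_PREFIX = "label_"


def load_labels(csvfile):
    """Parse a multilingual subject CSV (hypothetical helper).

    Returns (languages, subjects): the language codes found in the
    header row, and a dict mapping each URI to {language: label}.
    """
    reader = csv.DictReader(csvfile)
    # The header row tells us which languages this file provides.
    languages = [
        name[len(LABEL_PREFIX):]
        for name in reader.fieldnames
        if name.startswith(LABEL_PREFIX)
    ]
    subjects = {}
    for row in reader:
        subjects[row["uri"]] = {
            lang: row[LABEL_PREFIX + lang]
            for lang in languages
            if row[LABEL_PREFIX + lang]  # skip empty labels
        }
    return languages, subjects


languages, subjects = load_labels(io.StringIO(SAMPLE_CSV))
```

Because the languages are discovered from the header row, a file with only label_en and label_fi columns would be read the same way, which is the flexibility mentioned above.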

@osma added this to the 0.59 milestone Aug 15, 2022
@osma self-assigned this Aug 15, 2022
@osma force-pushed the refactor-subjectindex-multilingual branch from 1e8b6a0 to 53343bc Aug 15, 2022 07:00
codecov bot commented Aug 15, 2022

Codecov Report

Merging #608 (d26a5db) into master (1c2e849) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #608      +/-   ##
==========================================
+ Coverage   99.55%   99.58%   +0.03%     
==========================================
  Files          86       87       +1     
  Lines        5663     5834     +171     
==========================================
+ Hits         5638     5810     +172     
+ Misses         25       24       -1     
Impacted Files Coverage Δ
tests/test_backend_http.py 100.00% <ø> (ø)
tests/test_eval.py 100.00% <ø> (ø)
annif/backend/yake.py 98.21% <100.00%> (ø)
annif/cli.py 99.63% <100.00%> (+<0.01%) ⬆️
annif/corpus/__init__.py 100.00% <100.00%> (ø)
annif/corpus/document.py 100.00% <100.00%> (ø)
annif/corpus/skos.py 100.00% <100.00%> (ø)
annif/corpus/subject.py 100.00% <100.00%> (ø)
annif/corpus/types.py 100.00% <100.00%> (ø)
annif/eval.py 100.00% <100.00%> (ø)
... and 13 more

sonarcloud bot commented Aug 15, 2022

SonarCloud Quality Gate failed.

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
11 Code Smells (rating A)

No Coverage information
5.5% Duplication

@osma marked this pull request as ready for review Aug 15, 2022 11:45
@osma requested a review from juhoinkinen Aug 15, 2022 11:45
@osma merged commit 3c50f68 into master Aug 15, 2022
@osma deleted the refactor-subjectindex-multilingual branch Aug 15, 2022 13:44