Refactor: Represent suggestion results as sparse arrays #681

Merged · osma merged 51 commits into main from issue678-refactor-suggestionresult-sparse-array on Apr 14, 2023

Conversation

osma (Member) commented on Mar 14, 2023

This PR reimplements the representation and handling of subject suggestions so that they are represented as sparse arrays that contain the suggestions for a batch of documents. Fixes #678 where the idea was originally proposed.

Old approach

In the old approach, there was an abstract class SuggestionResult with concrete subclasses VectorSuggestionResult and ListSuggestionResult (plus a LazySuggestionResult wrapper class). All of these represented the results for a single document, so operations on multiple documents (e.g. annif eval) created a large number of SuggestionResult instances, and operations such as filtering (by limit and threshold) had to be performed separately on each instance, which is inefficient. Also, VectorSuggestionResult stores 1D NumPy arrays that, for a large vocabulary such as YSO, consist mostly of zeros, which wastes memory. Finally, having two alternative representations added complexity, because results had to be converted back and forth depending on which representation was more appropriate for a given purpose.

New approach

This PR removes all the above-mentioned classes. Instead, suggestion results are represented using SciPy sparse arrays that contain the suggestions for a whole batch of documents (up to 32). The new classes in the annif.suggestion module are:

  • SuggestionResults: this represents the suggestions for a whole corpus of documents, which may comprise many batches. This is the return type for AnnifProject.suggest_corpus.
  • SuggestionBatch: this represents the suggestions for a single batch of documents and wraps a sparse array. This is the return type for AnnifProject.suggest as well as the backend method _suggest_batch for those backends that support batch operations.
  • SuggestionResult: this represents the suggestions for a single document. It is actually a single row view into the sparse array contained in a SuggestionBatch.

Like their predecessors, all of these classes are intended to be immutable (although I guess this could be better enforced in the code). So once created, they will not change; operations such as .filter() will return a new instance.
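
To make the batch-oriented design concrete, here is a minimal sketch of the idea (not the actual Annif code): one SciPy sparse array holds the scores for a whole batch, and a filter operation returns a new instance instead of mutating in place. The class name and implementation details below are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_array


class SuggestionBatchSketch:
    """Illustrative stand-in for a SuggestionBatch: rows are documents,
    columns are vocabulary subjects, values are suggestion scores."""

    def __init__(self, array: csr_array):
        self._array = array

    def filter(self, limit: int = 10, threshold: float = 0.0) -> "SuggestionBatchSketch":
        """Return a new batch keeping at most `limit` suggestions per document,
        all scoring above `threshold`; the original batch is left unchanged."""
        data, rows, cols = [], [], []
        for row_idx in range(self._array.shape[0]):
            row = self._array[row_idx:row_idx + 1].tocoo()
            keep = [(s, c) for s, c in zip(row.data, row.col) if s > threshold]
            for score, col in sorted(keep, reverse=True)[:limit]:
                data.append(score)
                rows.append(row_idx)
                cols.append(col)
        filtered = csr_array((data, (rows, cols)), shape=self._array.shape)
        return SuggestionBatchSketch(filtered)


batch = SuggestionBatchSketch(csr_array(np.array([[0.0, 0.9, 0.3], [0.7, 0.0, 0.1]])))
top1 = batch.filter(limit=1, threshold=0.2)  # new instance, one suggestion per document
```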

Notes about sparse arrays

SciPy sparse matrices have been around for a long time, but sparse arrays are relatively new (introduced in SciPy 1.8 released in February 2022). The arrays are very similar to matrices (and currently implemented as a subclass), but their API is closer to that of NumPy arrays. SciPy documentation now recommends the use of sparse arrays (instead of matrices) for new code, so I decided to follow that recommendation. However, there are a few cases in this PR where SciPy methods on sparse arrays return matrices instead of arrays, and they have to be explicitly converted to arrays.
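
For readers unfamiliar with the newer API, this small sketch shows the kind of array-style usage the PR relies on and the explicit normalisation to a plain NumPy array that is occasionally needed; exactly which operations still hand back a matrix depends on the SciPy version, so the comments are illustrative rather than a description of the Annif code.

```python
import numpy as np
from scipy.sparse import csr_array

# A batch of suggestion scores for 3 documents over a 6-subject vocabulary;
# most cells are zero, so the sparse representation only stores the nonzeros.
batch = csr_array(np.array([
    [0.0, 0.9, 0.0, 0.0, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.0, 0.0, 0.8],
]))

doc0 = batch[0:1]           # one document's suggestions, still sparse, shape (1, 6)

# Depending on the SciPy version, some operations may return a matrix type;
# np.asarray(...) / .toarray() normalises the result to a plain ndarray.
scores = np.asarray(doc0.todense()).ravel()
print(scores.nonzero()[0])  # indices of suggested subjects: [1 4]
```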

Changes to functionality

The original idea of this PR was simply to refactor without changing any functionality, but I ended up making two changes:

  1. This PR removes the LRAP evaluation metric, because the scikit-learn implementation doesn't accept sparse arrays as input; it's not much used anyway, and nDCG is a very similar ranking metric, so I don't think this is a big loss.
  2. The reimplemented annif optimize command now supports a --jobs parameter for parallel processing, just like annif eval; this fixes Parallelize optimize command #423.

Performance

I had to do extensive benchmarking and many rounds of optimizations before I was happy with the results. The representation of suggestion results is very fundamental and performance-critical; any inefficient operations will easily degrade performance. Despite its flaws, the old implementation was pretty well optimized. I have categorized the benchmark results according to the outcome as measured by wall time: obvious improvement (more than ~5%), obvious slowdown (more than ~5%), and no significant difference (everything in between).

I performed the benchmarks on a 48-CPU physical server. There is sometimes quite a bit of variation between successive runs, especially for short operations; in some cases (basic CLI commands) I repeated the operation 5 times and took the best wall time. For the most part, and unless otherwise noted, I used the Finto AI 2022-11 pretrained Finnish language models and the "kirjastonhoitaja" train, validation or test sets, depending on which was most appropriate for the operation.

Memory usage was mostly unchanged, though in a few cases (e.g. MLLM and SVC eval operations) there is a clear improvement, probably due to the use of sparse arrays instead of dense NumPy vectors, which take up more space. I have commented on memory usage in a few cases below.

Better performance

annif eval (tfidf, omikuji, fasttext, mllm, ensemble, nn_ensemble) - small improvements

| tfidf eval | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
|---|---|---|---|---|
| user time (s) | 147.37 | 136.89 | 151.87 | 141.18 |
| wall time | 2:30.42 | 2:18.74 | 0:49.10 | 0:40.49 |
| max rss (KB) | 621332 | 628936 | 409056 | 428944 |

Here I used the Annif tutorial yso-nlf data set. Clearly faster than before, though memory usage has increased a little bit (but only around 10MB) in this case.

| omikuji eval | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
|---|---|---|---|---|
| user time (s) | 62.32 | 56.21 | 66.66 | 57.51 |
| wall time | 1:04.78 | 0:59.01 | 1:01.39 | 0:53.45 |
| max rss (KB) | 6026176 | 5924520 | 6001608 | 5911072 |

A little bit faster, and saving around 100MB memory.

| fasttext eval | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
|---|---|---|---|---|
| user time (s) | 24.82 | 16.56 | 33.72 | 25.84 |
| wall time | 0:28.59 | 0:18.81 | 0:22.19 | 0:15.66 |
| max rss (KB) | 7604752 | 7566168 | 7537456 | 7536960 |

Proportionally much faster, but it was already fast, so the absolute difference is only a few seconds. This shows that the calculation of evaluation metrics has been optimized, as the classifier remains the same. Memory usage is 40MB less than before.

| mllm eval | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
|---|---|---|---|---|
| user time (s) | 138.03 | 139.61 | 146.27 | 135.36 |
| wall time | 2:17.31 | 2:19.12 | 0:53.92 | 0:45.33 |
| max rss (KB) | 417892 | 327208 | 398172 | 312432 |

Overall execution time remains about the same, but memory usage has decreased by almost 100MB.

| ensemble eval | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
|---|---|---|---|---|
| user time (s) | 190.05 | 180.34 | 215.69 | 208.61 |
| wall time | 3:15.67 | 3:06.35 | 1:51.35 | 1:45.40 |
| max rss (KB) | 13304224 | 13248516 | 13171624 | 13176440 |

Again a bit faster and uses 50MB less memory.

| nn_ensemble eval | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
|---|---|---|---|---|
| user time (s) | 209.98 | 190.73 | 219.25 | 215.77 |
| wall time | 3:32.68 | 3:07.38 | 1:53.31 | 1:48.60 |
| max rss (KB) | 13834460 | 13734268 | 13592216 | 13592288 |

Faster, 100MB less memory.

annif optimize (mllm, fasttext, omikuji, nn_ensemble) - much faster, esp. with new -j4 option

| mllm optimize | before (main) | after (PR) -j1 | after (PR) -j4 |
|---|---|---|---|
| user time (s) | 398.12 | 133.62 | 131.32 |
| wall time | 6:40.40 | 2:12.46 | 1:09.95 |
| max rss (KB) | 402928 | 326660 | 312268 |

| fasttext optimize | before (main) | after (PR) -j1 | after (PR) -j4 |
|---|---|---|---|
| user time (s) | 310.08 | 51.17 | 61.31 |
| wall time | 5:17.04 | 0:52.77 | 0:51.56 |
| max rss (KB) | 7569820 | 7565416 | 7537044 |

| omikuji optimize | before (main) | after (PR) -j1 | after (PR) -j4 |
|---|---|---|---|
| user time (s) | 351.47 | 90.24 | 90.98 |
| wall time | 5:57.19 | 1:31.88 | 1:29.20 |
| max rss (KB) | 6353648 | 5923684 | 5900328 |

| nn_ensemble optimize | before (main) | after (PR) -j1 | after (PR) -j4 |
|---|---|---|---|
| user time (s) | 459.32 | 195.40 | 214.07 |
| wall time | 7:44.59 | 3:18.16 | 2:14.56 |
| max rss (KB) | 13765600 | 13733652 | 13597472 |

All of these are 2-3x faster than before with a single job, and in some cases (mllm, nn_ensemble) using the new --jobs 4 option further improves performance. Memory usage has decreased for MLLM and Omikuji.

Reduced performance

annif suggest (mllm) with a corpus of documents - slightly slower

| mllm suggest | before (main) | after (PR) |
|---|---|---|
| user time (s) | 124.31 | 136.56 |
| wall time | 2:04.31 | 2:16.56 |
| max rss (KB) | 312952 | 321740 |

This is a little bit slower than before, probably because results have to be converted from NumPy vectors to sparse arrays which in this case just slows things down.

annif train (pav) - slightly slower

| pav train | before (main) | after (PR) |
|---|---|---|
| user time (s) | 1306.44 | 1374.53 |
| wall time | 21:53.99 | 23:03.23 |
| max rss (KB) | 13397520 | 13394764 |

I didn't investigate this in detail, but clearly there is a bit of a slowdown.

annif hyperopt (ensemble) - much longer wall time due to bad threading

| ensemble hyperopt | before (main) | after (PR) |
|---|---|---|
| user time (s) | 236.50 | 238.26 |
| wall time | 1:35.12 | 2:06.23 |
| max rss (KB) | 13558412 | 13205052 |

The user time is unchanged, but wall time has increased significantly. I did some profiling and it seems to me that the main reason for the slowdown is contention for the GIL; due to the way sparse arrays are implemented and used, eval calculations are now mostly done in Python code instead of NumPy operations, and NumPy is better at threading since it can release the GIL during some calculations. A silver lining here is that memory usage decreased by around 250MB. (See below for a suggestion on how to improve this)

No significant difference

Basic CLI operations

| annif --help | before (main) | after (PR) |
|---|---|---|
| user time (s) | 1.48 | 1.09 |
| wall time | 0:00.67 | 0:00.72 |
| max rss (KB) | 61068 | 61000 |

| annif list-projects | before (main) | after (PR) |
|---|---|---|
| user time (s) | 4.96 | 4.99 |
| wall time | 0:03.39 | 0:03.44 |
| max rss (KB) | 405000 | 405632 |

| annif list-vocabs | before (main) | after (PR) |
|---|---|---|
| user time (s) | 4.81 | 5.26 |
| wall time | 0:03.24 | 0:03.18 |
| max rss (KB) | 406336 | 405564 |

Lots of variation between runs, but there is no significant difference.

annif suggest (most backends) with a corpus of documents

| tfidf suggest | before (main) | after (PR) |
|---|---|---|
| user time (s) | 139.18 | 137.01 |
| wall time | 2:21.40 | 2:18.01 |
| max rss (KB) | 540700 | 544692 |

| fasttext suggest | before (main) | after (PR) |
|---|---|---|
| user time (s) | 15.46 | 14.66 |
| wall time | 0:17.32 | 0:17.18 |
| max rss (KB) | 7464564 | 7466736 |

| omikuji suggest | before (main) | after (PR) |
|---|---|---|
| user time (s) | 52.08 | 52.66 |
| wall time | 0:55.29 | 0:55.55 |
| max rss (KB) | 6266832 | 6268096 |

| svc suggest (ykl) | before (main) | after (PR) |
|---|---|---|
| user time (s) | 18.24 | 18.52 |
| wall time | 0:17.87 | 0:17.68 |
| max rss (KB) | 1985268 | 1985208 |

| ensemble suggest | before (main) | after (PR) |
|---|---|---|
| user time (s) | 183.46 | 180.69 |
| wall time | 3:10.45 | 3:10.64 |
| max rss (KB) | 13248652 | 13203956 |

| pav suggest | before (main) | after (PR) |
|---|---|---|
| user time (s) | 191.90 | 192.24 |
| wall time | 3:17.12 | 3:18.13 |
| max rss (KB) | 13250576 | 13214236 |

| nn_ensemble suggest | before (main) | after (PR) |
|---|---|---|
| user time (s) | 192.41 | 187.32 |
| wall time | 3:14.34 | 3:12.49 |
| max rss (KB) | 13733072 | 13735572 |
annif eval (svc, pav)

| svc eval (ykl) | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
|---|---|---|---|---|
| user time (s) | 31.32 | 31.27 | 35.23 | 33.68 |
| wall time | 0:30.85 | 0:29.71 | 0:20.64 | 0:19.18 |
| max rss (KB) | 2088652 | 1949652 | 2006592 | 1881268 |

This was done using YKL as the vocabulary and the kirjaesittelyt2021/ykl test set. Runtime is about the same, but memory usage decreased by 130MB!

| pav eval | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
|---|---|---|---|---|
| user time (s) | 189.44 | 192.63 | 232.83 | 221.67 |
| wall time | 3:16.27 | 3:18.89 | 2:00.68 | 1:49.79 |
| max rss (KB) | 13298404 | 13236048 | 13215296 | 13178672 |

A bit faster when using -j4, otherwise no big difference, though memory usage decreased by 60MB.

annif train (nn_ensemble)

| nn_ensemble train | before (main) -j4 | after (PR) -j4 |
|---|---|---|
| user time (s) | 4243.75 | 4292.64 |
| wall time | 8:41.98 | 8:43.56 |
| max rss (KB) | 15620888 | 16439540 |

Overall train time remains about the same, but memory usage has increased. I think there is room for improvement here, as noted below.

Code quality

On the whole, this PR reduces the amount of code by more than 100 lines, mainly because having a single representation for suggestion results simplifies the implementation. I've also both dropped some unit tests and added new ones; the total number of tests has grown by one.

I tried to make the new implementation as clean as possible, but there are a few places where Codebeat complains about too much complexity. In both SuggestionBatch.from_sequence and filter_suggestions the problem is that efficiently constructing a sparse array is a bit complicated and messy as it involves creating three parallel lists; I originally tried to use dok_array as an intermediate representation, but it turned out to be slow. Anyway, I'm running out of ideas on how to simplify the code further. Suggestions welcome!
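
To illustrate why the construction code ends up with three parallel lists, here is a rough sketch of the pattern (simplified and hypothetical, not the actual from_sequence implementation): scores, row indices and column indices are accumulated separately and handed to the COO-style constructor in one go, which avoids inserting entries one by one into a dok_array.

```python
from scipy.sparse import csr_array

def batch_from_sequence(suggestions_per_doc, n_subjects):
    """Hypothetical helper: `suggestions_per_doc` is a list with one entry per
    document, each a list of (subject_id, score) pairs. Returns a
    (n_docs, n_subjects) csr_array holding the whole batch."""
    data, rows, cols = [], [], []            # three parallel lists
    for row_idx, doc_suggestions in enumerate(suggestions_per_doc):
        for subject_id, score in doc_suggestions:
            if score > 0.0:                   # only keep nonzero scores
                data.append(score)
                rows.append(row_idx)
                cols.append(subject_id)
    return csr_array((data, (rows, cols)),
                     shape=(len(suggestions_per_doc), n_subjects))

batch = batch_from_sequence([[(1, 0.9), (4, 0.3)], [(3, 0.7)]], n_subjects=6)
```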

Potential future work

I'm not quite happy with how the NN ensemble handles suggestion results from other projects, both during training and suggest operations. For example, the training samples are stored in LMDB one document at a time, but now it would be easier to store them as whole batches instead, which could be more efficient. But I decided that this PR is already much too big and it would make sense to try to improve batching in the NN ensemble in a separate follow-up PR. There is already an attempt to do part of this in PR #676; that could be a possible starting point.

The threading performance of annif hyperopt was already bad, and it has now gotten worse. A solution could be to switch to process-based parallelism. This has been difficult to do with Optuna (it requires an external relational database), but the Optuna FAQ now states that it can also be done with JournalFileStorage, which sounds more promising.
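
As a rough sketch of what the Optuna side of that change could look like (based on the Optuna documentation, not on Annif code; the objective and names are placeholders), a file-backed journal storage lets several worker processes share one study without a relational database:

```python
import optuna

def objective(trial):
    # Placeholder objective; in Annif this would evaluate an ensemble
    # with the weights proposed by the trial.
    x = trial.suggest_float("x", 0.0, 1.0)
    return (x - 0.42) ** 2

# JournalFileStorage appends trials to a shared log file, so independent
# worker processes can attach to the same study and optimize in parallel.
storage = optuna.storages.JournalStorage(
    optuna.storages.JournalFileStorage("./hyperopt-journal.log")
)
study = optuna.create_study(
    study_name="ensemble-hyperopt", storage=storage, load_if_exists=True
)
study.optimize(objective, n_trials=20)
```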

osma self-assigned this on Mar 14, 2023
codecov bot commented on Mar 14, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.02% 🎉

Comparison is base (693ab21) 99.58% compared to head (3edab48) 99.61%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #681      +/-   ##
==========================================
+ Coverage   99.58%   99.61%   +0.02%     
==========================================
  Files          89       89              
  Lines        6268     6166     -102     
==========================================
- Hits         6242     6142     -100     
+ Misses         26       24       -2     
| Impacted Files | Coverage Δ |
|---|---|
| annif/util.py | 98.41% <ø> (-0.16%) ⬇️ |
| annif/backend/backend.py | 100.00% <100.00%> (ø) |
| annif/backend/dummy.py | 100.00% <100.00%> (ø) |
| annif/backend/ensemble.py | 100.00% <100.00%> (ø) |
| annif/backend/fasttext.py | 100.00% <100.00%> (ø) |
| annif/backend/http.py | 100.00% <100.00%> (ø) |
| annif/backend/mixins.py | 97.50% <100.00%> (+2.37%) ⬆️ |
| annif/backend/mllm.py | 100.00% <100.00%> (ø) |
| annif/backend/nn_ensemble.py | 100.00% <100.00%> (ø) |
| annif/backend/omikuji.py | 97.53% <100.00%> (ø) |
| ... and 27 more | |


osma force-pushed the issue678-refactor-suggestionresult-sparse-array branch 3 times, most recently from 1db8f53 to e0c871b, on March 16, 2023
osma force-pushed the issue678-refactor-suggestionresult-sparse-array branch from 4d4f9a2 to 830b3a5 on March 17, 2023
osma (Member, Author) commented on Mar 17, 2023

Rebased on current main and force-pushed, mainly because I need to consider the CLI refactoring done in #675 before making changes to CLI-related code.

osma force-pushed the issue678-refactor-suggestionresult-sparse-array branch from ec878dc to 1844ad4 on March 17, 2023
osma added this to the 0.61 milestone on Apr 5, 2023
osma force-pushed the issue678-refactor-suggestionresult-sparse-array branch from 092cdc0 to 0d70034 on April 6, 2023
osma force-pushed the issue678-refactor-suggestionresult-sparse-array branch from c828bbb to 573b64f on April 6, 2023
sonarcloud bot commented on Apr 13, 2023

SonarCloud Quality Gate failed.

  • Bugs: 0 (A)
  • Vulnerabilities: 0 (A)
  • Security Hotspots: 0 (A)
  • Code Smells: 2 (A)
  • Coverage: no coverage information
  • Duplication: 4.8%

osma marked this pull request as ready for review on April 14, 2023
osma requested a review from juhoinkinen on April 14, 2023
osma changed the title from "Refactor: SuggestionBatch representing suggestion results as sparse arrays" to "Refactor: Represent suggestion results as sparse arrays" on Apr 14, 2023
osma merged commit 415d944 into main on Apr 14, 2023
osma deleted the issue678-refactor-suggestionresult-sparse-array branch on April 14, 2023