Refactor: Represent suggestion results as sparse arrays #681

Merged · osma merged 51 commits into main from issue678-refactor-suggestionresult-sparse-array on Apr 14, 2023

Conversation

osma (Member) commented on Mar 14, 2023

This PR reimplements the representation and handling of subject suggestions so that they are represented as sparse arrays that contain the suggestions for a batch of documents. Fixes #678 where the idea was originally proposed.

Old approach

In the old approach, there was an abstract class SuggestionResult with concrete subclasses VectorSuggestionResult and ListSuggestionResult (plus a LazySuggestionResult wrapper class). All of these represented the results for a single document, so operations on multiple documents (e.g. annif eval) created a large number of SuggestionResult instances, and operations such as filtering (by limit and threshold) had to be performed separately on each instance, which is inefficient. Also, VectorSuggestionResult stores 1D NumPy arrays that, for a large vocabulary such as YSO, consist mostly of zeros, which wastes memory. Finally, having two alternative representations added complexity, because results had to be converted back and forth depending on which representation was more appropriate for a given purpose.

New approach

This PR removes all the above-mentioned classes. Instead, suggestion results are represented using SciPy sparse arrays that contain the suggestions for a whole batch of documents (up to 32). The new classes in the annif.suggestion module are:

  • SuggestionResults: this represents the suggestions for a whole corpus of documents, which may comprise many batches. This is the return type for AnnifProject.suggest_corpus.
  • SuggestionBatch: this represents the suggestions for a single batch of documents and wraps a sparse array. This is the return type for AnnifProject.suggest as well as the backend method _suggest_batch for those backends that support batch operations.
  • SuggestionResult: this represents the suggestions for a single document. It is actually a single row view into the sparse array contained in a SuggestionBatch.

Like their predecessors, all of these classes are intended to be immutable (although I guess this could be better enforced in the code). So once created, they will not change; operations such as .filter() will return a new instance.
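
To make the batch-oriented design concrete, here is a minimal sketch of the idea (not the actual Annif code): one SciPy sparse array holds the scores for a whole batch, and a filter operation returns a new instance instead of mutating in place. The class name and implementation details below are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_array


class SuggestionBatchSketch:
    """Illustrative stand-in for a SuggestionBatch: rows are documents,
    columns are vocabulary subjects, values are suggestion scores."""

    def __init__(self, array: csr_array):
        self._array = array

    def filter(self, limit: int = 10, threshold: float = 0.0) -> "SuggestionBatchSketch":
        """Return a new batch keeping at most `limit` suggestions per document,
        all scoring above `threshold`; the original batch is left unchanged."""
        data, rows, cols = [], [], []
        for row_idx in range(self._array.shape[0]):
            row = self._array[row_idx:row_idx + 1].tocoo()
            keep = [(s, c) for s, c in zip(row.data, row.col) if s > threshold]
            for score, col in sorted(keep, reverse=True)[:limit]:
                data.append(score)
                rows.append(row_idx)
                cols.append(col)
        filtered = csr_array((data, (rows, cols)), shape=self._array.shape)
        return SuggestionBatchSketch(filtered)


batch = SuggestionBatchSketch(csr_array(np.array([[0.0, 0.9, 0.3], [0.7, 0.0, 0.1]])))
top1 = batch.filter(limit=1, threshold=0.2)  # new instance, one suggestion per document
```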

Notes about sparse arrays

SciPy sparse matrices have been around for a long time, but sparse arrays are relatively new (introduced in SciPy 1.8 released in February 2022). The arrays are very similar to matrices (and currently implemented as a subclass), but their API is closer to that of NumPy arrays. SciPy documentation now recommends the use of sparse arrays (instead of matrices) for new code, so I decided to follow that recommendation. However, there are a few cases in this PR where SciPy methods on sparse arrays return matrices instead of arrays, and they have to be explicitly converted to arrays.
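
For readers unfamiliar with the newer API, this small sketch shows the kind of array-style usage the PR relies on and the explicit normalisation to a plain NumPy array that is occasionally needed; exactly which operations still hand back a matrix depends on the SciPy version, so the comments are illustrative rather than a description of the Annif code.

```python
import numpy as np
from scipy.sparse import csr_array

# A batch of suggestion scores for 3 documents over a 6-subject vocabulary;
# most cells are zero, so the sparse representation only stores the nonzeros.
batch = csr_array(np.array([
    [0.0, 0.9, 0.0, 0.0, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.0, 0.0, 0.8],
]))

doc0 = batch[0:1]           # one document's suggestions, still sparse, shape (1, 6)

# Depending on the SciPy version, some operations may return a matrix type;
# np.asarray(...) / .toarray() normalises the result to a plain ndarray.
scores = np.asarray(doc0.todense()).ravel()
print(scores.nonzero()[0])  # indices of suggested subjects: [1 4]
```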

Changes to functionality

The original idea of this PR was simply to refactor without changing any functionality, but I ended up making two changes:

  1. This PR removes the LRAP evaluation metric, because the scikit-learn implementation doesn't accept sparse arrays as input; it's not much used anyway, and nDCG is a very similar ranking metric, so I don't think this is a big loss.
  2. The reimplemented annif optimize command now supports a --jobs parameter for parallel processing, just like annif eval; this fixes Parallelize optimize command #423.

Performance

I had to do extensive benchmarking and many rounds of optimizations before I was happy with the results. The representation of suggestion results is very fundamental and performance-critical; any inefficient operations will easily degrade performance. Despite its flaws, the old implementation was pretty well optimized. I have categorized the benchmark results according to the outcome as measured by wall time: obvious improvement (more than ~5%), obvious slowdown (more than ~5%), and no significant difference (everything in between).

I performed the benchmarks on a 48-CPU physical server. There is sometimes quite a bit of variation between successive runs, especially for short operations; in some cases (basic CLI commands) I repeated the operation 5 times and took the best wall time. For the most part, and unless otherwise noted, I used the Finto AI 2022-11 pretrained Finnish language models and the "kirjastonhoitaja" train, validation or test sets, depending on which was most appropriate for the operation.

Memory usage was mostly unchanged, though in a few cases (e.g. MLLM and SVC eval operations) there is a clear improvement, probably due to the use of sparse arrays instead of dense NumPy vectors, which take up more space. I have commented on memory usage in a few cases below.

Better performance

annif eval (tfidf, omikuji, fasttext, mllm, ensemble, nn_ensemble) - small improvements

| tfidf eval | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
|---|---|---|---|---|
| user time (s) | 147.37 | 136.89 | 151.87 | 141.18 |
| wall time | 2:30.42 | 2:18.74 | 0:49.10 | 0:40.49 |
| max rss (KB) | 621332 | 628936 | 409056 | 428944 |

Here I used the Annif tutorial yso-nlf data set. Clearly faster than before, though memory usage has increased a little bit (but only around 10MB) in this case.

| omikuji eval | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
|---|---|---|---|---|
| user time (s) | 62.32 | 56.21 | 66.66 | 57.51 |
| wall time | 1:04.78 | 0:59.01 | 1:01.39 | 0:53.45 |
| max rss (KB) | 6026176 | 5924520 | 6001608 | 5911072 |

A little bit faster, and saving around 100MB memory.

| fasttext eval | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
|---|---|---|---|---|
| user time (s) | 24.82 | 16.56 | 33.72 | 25.84 |
| wall time | 0:28.59 | 0:18.81 | 0:22.19 | 0:15.66 |
| max rss (KB) | 7604752 | 7566168 | 7537456 | 7536960 |

Proportionally much faster, but it was already fast, so the absolute difference is only a few seconds. This shows that the calculation of evaluation metrics has been optimized, as the classifier remains the same. Memory usage is 40MB less than before.

| mllm eval | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
|---|---|---|---|---|
| user time (s) | 138.03 | 139.61 | 146.27 | 135.36 |
| wall time | 2:17.31 | 2:19.12 | 0:53.92 | 0:45.33 |
| max rss (KB) | 417892 | 327208 | 398172 | 312432 |

Overall execution time remains about the same, but memory usage has decreased by almost 100MB.

| ensemble eval | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
|---|---|---|---|---|
| user time (s) | 190.05 | 180.34 | 215.69 | 208.61 |
| wall time | 3:15.67 | 3:06.35 | 1:51.35 | 1:45.40 |
| max rss (KB) | 13304224 | 13248516 | 13171624 | 13176440 |

Again a bit faster and uses 50MB less memory.

| nn_ensemble eval | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
|---|---|---|---|---|
| user time (s) | 209.98 | 190.73 | 219.25 | 215.77 |
| wall time | 3:32.68 | 3:07.38 | 1:53.31 | 1:48.60 |
| max rss (KB) | 13834460 | 13734268 | 13592216 | 13592288 |

Faster, 100MB less memory.

annif optimize (mllm, fasttext, omikuji, nn_ensemble) - much faster, esp. with new -j4 option

| mllm optimize | before (main) | after (PR) -j1 | after (PR) -j4 |
|---|---|---|---|
| user time (s) | 398.12 | 133.62 | 131.32 |
| wall time | 6:40.40 | 2:12.46 | 1:09.95 |
| max rss (KB) | 402928 | 326660 | 312268 |

| fasttext optimize | before (main) | after (PR) -j1 | after (PR) -j4 |
|---|---|---|---|
| user time (s) | 310.08 | 51.17 | 61.31 |
| wall time | 5:17.04 | 0:52.77 | 0:51.56 |
| max rss (KB) | 7569820 | 7565416 | 7537044 |

| omikuji optimize | before (main) | after (PR) -j1 | after (PR) -j4 |
|---|---|---|---|
| user time (s) | 351.47 | 90.24 | 90.98 |
| wall time | 5:57.19 | 1:31.88 | 1:29.20 |
| max rss (KB) | 6353648 | 5923684 | 5900328 |

| nn_ensemble optimize | before (main) | after (PR) -j1 | after (PR) -j4 |
|---|---|---|---|
| user time (s) | 459.32 | 195.40 | 214.07 |
| wall time | 7:44.59 | 3:18.16 | 2:14.56 |
| max rss (KB) | 13765600 | 13733652 | 13597472 |

All of these are 2-3x faster than before with a single job, and in some cases (mllm, nn_ensemble) using the new --jobs 4 option further improves performance. Memory usage has decreased for MLLM and Omikuji.

Reduced performance

annif suggest (mllm) with a corpus of documents - slightly slower

| mllm suggest | before (main) | after (PR) |
|---|---|---|
| user time (s) | 124.31 | 136.56 |
| wall time | 2:04.31 | 2:16.56 |
| max rss (KB) | 312952 | 321740 |

This is a little bit slower than before, probably because results have to be converted from NumPy vectors to sparse arrays which in this case just slows things down.

annif train (pav) - slightly slower

| pav train | before (main) | after (PR) |
|---|---|---|
| user time (s) | 1306.44 | 1374.53 |
| wall time | 21:53.99 | 23:03.23 |
| max rss (KB) | 13397520 | 13394764 |

I didn't investigate this in detail, but clearly there is a bit of a slowdown.

annif hyperopt (ensemble) - much longer wall time due to bad threading

| ensemble hyperopt | before (main) | after (PR) |
|---|---|---|
| user time (s) | 236.50 | 238.26 |
| wall time | 1:35.12 | 2:06.23 |
| max rss (KB) | 13558412 | 13205052 |

The user time is unchanged, but wall time has increased significantly. I did some profiling and it seems to me that the main reason for the slowdown is contention for the GIL; due to the way sparse arrays are implemented and used, eval calculations are now mostly done in Python code instead of NumPy operations, and NumPy is better at threading since it can release the GIL during some calculations. A silver lining here is that memory usage decreased by around 250MB. (See below for a suggestion on how to improve this)

No significant difference

Basic CLI operations

| annif --help | before (main) | after (PR) |
|---|---|---|
| user time (s) | 1.48 | 1.09 |
| wall time | 0:00.67 | 0:00.72 |
| max rss (KB) | 61068 | 61000 |

| annif list-projects | before (main) | after (PR) |
|---|---|---|
| user time (s) | 4.96 | 4.99 |
| wall time | 0:03.39 | 0:03.44 |
| max rss (KB) | 405000 | 405632 |

| annif list-vocabs | before (main) | after (PR) |
|---|---|---|
| user time (s) | 4.81 | 5.26 |
| wall time | 0:03.24 | 0:03.18 |
| max rss (KB) | 406336 | 405564 |

Lots of variation between runs, but there is no significant difference.

annif suggest (most backends) with a corpus of documents

| tfidf suggest | before (main) | after (PR) |
|---|---|---|
| user time (s) | 139.18 | 137.01 |
| wall time | 2:21.40 | 2:18.01 |
| max rss (KB) | 540700 | 544692 |

| fasttext suggest | before (main) | after (PR) |
|---|---|---|
| user time (s) | 15.46 | 14.66 |
| wall time | 0:17.32 | 0:17.18 |
| max rss (KB) | 7464564 | 7466736 |

| omikuji suggest | before (main) | after (PR) |
|---|---|---|
| user time (s) | 52.08 | 52.66 |
| wall time | 0:55.29 | 0:55.55 |
| max rss (KB) | 6266832 | 6268096 |

| svc suggest (ykl) | before (main) | after (PR) |
|---|---|---|
| user time (s) | 18.24 | 18.52 |
| wall time | 0:17.87 | 0:17.68 |
| max rss (KB) | 1985268 | 1985208 |

| ensemble suggest | before (main) | after (PR) |
|---|---|---|
| user time (s) | 183.46 | 180.69 |
| wall time | 3:10.45 | 3:10.64 |
| max rss (KB) | 13248652 | 13203956 |

| pav suggest | before (main) | after (PR) |
|---|---|---|
| user time (s) | 191.90 | 192.24 |
| wall time | 3:17.12 | 3:18.13 |
| max rss (KB) | 13250576 | 13214236 |

| nn_ensemble suggest | before (main) | after (PR) |
|---|---|---|
| user time (s) | 192.41 | 187.32 |
| wall time | 3:14.34 | 3:12.49 |
| max rss (KB) | 13733072 | 13735572 |
annif eval (svc, pav)

| svc eval (ykl) | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
|---|---|---|---|---|
| user time (s) | 31.32 | 31.27 | 35.23 | 33.68 |
| wall time | 0:30.85 | 0:29.71 | 0:20.64 | 0:19.18 |
| max rss (KB) | 2088652 | 1949652 | 2006592 | 1881268 |

This was done using YKL as the vocabulary and the kirjaesittelyt2021/ykl test set. Runtime is about the same, but memory usage decreased by 130MB!

| pav eval | before (main) -j1 | after (PR) -j1 | before (main) -j4 | after (PR) -j4 |
|---|---|---|---|---|
| user time (s) | 189.44 | 192.63 | 232.83 | 221.67 |
| wall time | 3:16.27 | 3:18.89 | 2:00.68 | 1:49.79 |
| max rss (KB) | 13298404 | 13236048 | 13215296 | 13178672 |

A bit faster when using -j4, otherwise no big difference, though memory usage decreased by 60MB.

annif train (nn_ensemble)

| nn_ensemble train | before (main) -j4 | after (PR) -j4 |
|---|---|---|
| user time (s) | 4243.75 | 4292.64 |
| wall time | 8:41.98 | 8:43.56 |
| max rss (KB) | 15620888 | 16439540 |

Overall train time remains about the same, but memory usage has increased. I think there is room for improvement here, as noted below.

Code quality

On the whole, this PR reduces the amount of code by more than 100 lines, mainly because having a single representation for suggestion results simplifies the implementation. I've also both dropped some unit tests and added new ones; the total number of tests has grown by one.

I tried to make the new implementation as clean as possible, but there are a few places where Codebeat complains about too much complexity. In both SuggestionBatch.from_sequence and filter_suggestions the problem is that efficiently constructing a sparse array is a bit complicated and messy as it involves creating three parallel lists; I originally tried to use dok_array as an intermediate representation, but it turned out to be slow. Anyway, I'm running out of ideas on how to simplify the code further. Suggestions welcome!
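
To illustrate why the construction code ends up with three parallel lists, here is a rough sketch of the pattern (simplified and hypothetical, not the actual from_sequence implementation): scores, row indices and column indices are accumulated separately and handed to the COO-style constructor in one go, which avoids inserting entries one by one into a dok_array.

```python
from scipy.sparse import csr_array

def batch_from_sequence(suggestions_per_doc, n_subjects):
    """Hypothetical helper: `suggestions_per_doc` is a list with one entry per
    document, each a list of (subject_id, score) pairs. Returns a
    (n_docs, n_subjects) csr_array holding the whole batch."""
    data, rows, cols = [], [], []            # three parallel lists
    for row_idx, doc_suggestions in enumerate(suggestions_per_doc):
        for subject_id, score in doc_suggestions:
            if score > 0.0:                   # only keep nonzero scores
                data.append(score)
                rows.append(row_idx)
                cols.append(subject_id)
    return csr_array((data, (rows, cols)),
                     shape=(len(suggestions_per_doc), n_subjects))

batch = batch_from_sequence([[(1, 0.9), (4, 0.3)], [(3, 0.7)]], n_subjects=6)
```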

Potential future work

I'm not quite happy with how the NN ensemble handles suggestion results from other projects, both during training and suggest operations. For example, the training samples are stored in LMDB one document at a time, but now it would be easier to store them as whole batches instead, which could be more efficient. But I decided that this PR is already much too big and it would make sense to try to improve batching in the NN ensemble in a separate follow-up PR. There is already an attempt to do part of this in PR #676; that could be a possible starting point.

The threading performance of annif hyperopt was already bad, and it has now gotten worse. A solution could be to switch to process-based parallelism. This has been difficult to do with Optuna (it requires an external relational database), but the Optuna FAQ now states that it can also be done with JournalFileStorage, which sounds more promising.
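
As a rough sketch of what the Optuna side of that change could look like (based on the Optuna documentation, not on Annif code; the objective and names are placeholders), a file-backed journal storage lets several worker processes share one study without a relational database:

```python
import optuna

def objective(trial):
    # Placeholder objective; in Annif this would evaluate an ensemble
    # with the weights proposed by the trial.
    x = trial.suggest_float("x", 0.0, 1.0)
    return (x - 0.42) ** 2

# JournalFileStorage appends trials to a shared log file, so independent
# worker processes can attach to the same study and optimize in parallel.
storage = optuna.storages.JournalStorage(
    optuna.storages.JournalFileStorage("./hyperopt-journal.log")
)
study = optuna.create_study(
    study_name="ensemble-hyperopt", storage=storage, load_if_exists=True
)
study.optimize(objective, n_trials=20)
```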

osma self-assigned this on Mar 14, 2023
codecov bot commented on Mar 14, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.02% 🎉

Comparison is base (693ab21) 99.58% compared to head (3edab48) 99.61%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #681      +/-   ##
==========================================
+ Coverage   99.58%   99.61%   +0.02%     
==========================================
  Files          89       89              
  Lines        6268     6166     -102     
==========================================
- Hits         6242     6142     -100     
+ Misses         26       24       -2     
| Impacted Files | Coverage Δ |
|---|---|
| annif/util.py | 98.41% <ø> (-0.16%) ⬇️ |
| annif/backend/backend.py | 100.00% <100.00%> (ø) |
| annif/backend/dummy.py | 100.00% <100.00%> (ø) |
| annif/backend/ensemble.py | 100.00% <100.00%> (ø) |
| annif/backend/fasttext.py | 100.00% <100.00%> (ø) |
| annif/backend/http.py | 100.00% <100.00%> (ø) |
| annif/backend/mixins.py | 97.50% <100.00%> (+2.37%) ⬆️ |
| annif/backend/mllm.py | 100.00% <100.00%> (ø) |
| annif/backend/nn_ensemble.py | 100.00% <100.00%> (ø) |
| annif/backend/omikuji.py | 97.53% <100.00%> (ø) |
| ... and 27 more | |


osma force-pushed the issue678-refactor-suggestionresult-sparse-array branch 3 times, most recently from 1db8f53 to e0c871b, on March 16, 2023
osma force-pushed the issue678-refactor-suggestionresult-sparse-array branch from 4d4f9a2 to 830b3a5 on March 17, 2023
osma (Member, Author) commented on Mar 17, 2023

Rebased on current main and force-pushed, mainly because I need to consider the CLI refactoring done in #675 before making changes to CLI-related code.

osma force-pushed the issue678-refactor-suggestionresult-sparse-array branch from ec878dc to 1844ad4 on March 17, 2023
osma added this to the 0.61 milestone on Apr 5, 2023
osma force-pushed the issue678-refactor-suggestionresult-sparse-array branch from 092cdc0 to 0d70034 on April 6, 2023
osma force-pushed the issue678-refactor-suggestionresult-sparse-array branch from c828bbb to 573b64f on April 6, 2023
sonarcloud bot commented on Apr 13, 2023

SonarCloud Quality Gate failed.

  • Bugs: 0 (A)
  • Vulnerabilities: 0 (A)
  • Security Hotspots: 0 (A)
  • Code Smells: 2 (A)
  • Coverage: no coverage information
  • Duplication: 4.8%

osma marked this pull request as ready for review on April 14, 2023
osma requested a review from juhoinkinen on April 14, 2023
osma changed the title from "Refactor: SuggestionBatch representing suggestion results as sparse arrays" to "Refactor: Represent suggestion results as sparse arrays" on Apr 14, 2023
osma merged commit 415d944 into main on Apr 14, 2023
osma deleted the issue678-refactor-suggestionresult-sparse-array branch on April 14, 2023