Refactor: Represent suggestion results as sparse arrays #681
Codecov Report

Coverage diff against main (#681):
Coverage 99.58% → 99.61% (+0.02%)
Files 89 → 89
Lines 6268 → 6166 (−102)
Hits 6242 → 6142 (−100)
Misses 26 → 24 (−2)
This PR reimplements the representation and handling of subject suggestions so that they are stored as sparse arrays containing the suggestions for a whole batch of documents. Fixes #678, where the idea was originally proposed.
Old approach
In the old approach, there was an abstract class SuggestionResult and its concrete subclasses VectorSuggestionResult and ListSuggestionResult (as well as a LazySuggestionResult wrapper class). All of these were oriented around representing the results for a single document, so when performing operations on multiple documents (e.g. annif eval), there were lots of instances of the SuggestionResult classes, and operations such as filtering (by limit and threshold) had to be performed separately on each instance, which can be inefficient. Also, VectorSuggestionResult stores 1D NumPy arrays which, in the case of a large vocabulary such as YSO, usually contain mostly zeros, which is a waste of memory. There was also extra complexity because of the two alternative representations and the need to convert between them depending on which representation was more appropriate for a given purpose.

New approach
This PR removes all the above mentioned classes. Instead, suggestion results are represented using SciPy sparse arrays that contain the suggestions for a whole batch of documents (up to 32). The new classes in the annif.suggestion module are:

- SuggestionResults: represents the suggestions for a whole corpus of documents, which may comprise many batches. This is the return type for AnnifProject.suggest_corpus.
- SuggestionBatch: represents the suggestions for a single batch of documents and wraps a sparse array. This is the return type for AnnifProject.suggest as well as the backend method _suggest_batch for those backends that support batch operations.
- SuggestionResult: represents the suggestions for a single document. It is actually a single-row view into the sparse array contained in a SuggestionBatch.

Like their predecessors, all of these classes are intended to be immutable (although I guess this could be better enforced in the code). So once created, they will not change; operations such as .filter() will return a new instance.
Notes about sparse arrays

SciPy sparse matrices have been around for a long time, but sparse arrays are relatively new (introduced in SciPy 1.8, released in February 2022). The arrays are very similar to matrices (and currently implemented as a subclass), but their API is closer to that of NumPy arrays. SciPy documentation now recommends the use of sparse arrays (instead of matrices) for new code, so I decided to follow that recommendation. However, there are a few cases in this PR where SciPy methods on sparse arrays return matrices instead of arrays, and they have to be explicitly converted to arrays.
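As a small, self-contained illustration (a toy example, not code from this PR) of why sparse storage pays off for suggestion scores and where the explicit matrix-to-array conversion comes in:

```python
import numpy as np
from scipy.sparse import csr_array  # requires SciPy >= 1.8

# Toy suggestion batch: 3 documents x 6 subjects, mostly zero scores
scores = np.array([
    [0.0, 0.9, 0.0, 0.0, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.0, 0.0, 0.8],
])
batch = csr_array(scores)  # only the five nonzero scores are actually stored

# Depending on the SciPy version, some reductions still return a sparse
# *matrix* rather than an array; wrapping the result in csr_array() is one
# way to get back to the array interface.
row_max = csr_array(batch.max(axis=1))
```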
Changes to functionality
The original idea of this PR was to simply refactor without changing any functionality, yet I decided to make a couple of changes:

- The annif optimize command now supports a --jobs parameter for parallel processing, just like annif eval; this fixes #423 (Parallelize optimize command).

Performance
I had to do extensive benchmarking and many rounds of optimizations before I was happy with the results. The representation of suggestion results is very fundamental and performance-critical; any inefficient operations will easily degrade performance. Despite its flaws, the old implementation was pretty well optimized. I have categorized the benchmark results according to the outcome as measured by wall time: obvious improvement (more than ~5%), obvious slowdown (more than ~5%), and no significant difference (everything in between).
I performed the benchmarks on a 48-CPU physical server. There is sometimes quite a bit of variation between successive runs, especially for short operations. In some cases (basic CLI commands) I repeated the operation 5 times and took the best wall time result. For the most part, and unless otherwise noted, I used the Finto AI 2022-11 pretrained Finnish language models and the "kirjastonhoitaja" train, validation or test sets, depending on which was most appropriate for the operation.
Memory usage was mostly unchanged, though in a few cases (e.g. MLLM and SVC eval operations) there is a clear improvement, probably due to the use of sparse arrays instead of NumPy vectors which can take up more space. I have commented on memory usage in a few cases below.
Better performance
annif eval (tfidf, omikuji, fasttext, mllm, ensemble, nn_ensemble) - small improvements
- Here I used the Annif tutorial yso-nlf data set. Clearly faster than before, though memory usage has increased a little bit (but only around 10MB) in this case.
- A little bit faster, and saves around 100MB of memory.
- Much faster proportionally, but it was already fast, so the absolute difference is only a few seconds. This shows that the calculation of evaluation metrics has been optimized, as the classifier remains the same. Memory usage is 40MB less than before.
- Overall execution time remains about the same, but memory usage has decreased by almost 100MB.
- Again a bit faster, and uses 50MB less memory.
- Faster, with 100MB less memory.
annif optimize (mllm, fasttext, omikuji, nn_ensemble) - much faster, esp. with new -j4 option
All of these are 2-3x faster than before with a single job, and in some cases (mllm, nn_ensemble) using the new --jobs 4 option further improves performance. Memory usage has decreased for MLLM and Omikuji.

Reduced performance
annif suggest (mllm) with a corpus of documents - slightly slower
This is a little bit slower than before, probably because results have to be converted from NumPy vectors to sparse arrays which in this case just slows things down.
annif train (pav) - slightly slower
I didn't investigate this in detail, but clearly there is a bit of a slowdown.
annif hyperopt (ensemble) - much longer wall time due to bad threading
The user time is unchanged, but wall time has increased significantly. I did some profiling and it seems to me that the main reason for the slowdown is contention for the GIL; due to the way sparse arrays are implemented and used, eval calculations are now mostly done in Python code instead of NumPy operations, and NumPy is better at threading since it can release the GIL during some calculations. A silver lining here is that memory usage decreased by around 250MB. (See below for a suggestion on how to improve this)
No significant difference
Basic CLI operations
Lots of variation between runs, but there is no significant difference.
annif suggest (most backends) with a corpus of documents
annif eval (svc, pav)
This was done using YKL as the vocabulary and the kirjaesittelyt2021/ykl test set. Runtime is about the same, but memory usage decreased by 130MB!
A bit faster when using -j4, otherwise no big difference, though memory usage decreased by 60MB.

annif train (nn_ensemble)
Overall train time remains about the same, but memory usage has increased. I think there is room for improvement here, as noted below.
Code quality
On the whole, this PR reduces the amount of code by more than 100 lines, mainly because having a single representation for suggestion results simplifies the implementation. I've also both dropped some unit tests and added new ones; the total number of tests has grown by one.
I tried to make the new implementation as clean as possible, but there are a few places where Codebeat complains about too much complexity. In both SuggestionBatch.from_sequence and filter_suggestions, the problem is that efficiently constructing a sparse array is a bit complicated and messy, as it involves creating three parallel lists; I originally tried to use dok_array as an intermediate representation, but it turned out to be slow. Anyway, I'm running out of ideas on how to simplify the code further. Suggestions welcome!
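For context, the "three parallel lists" mentioned above are the standard COO-style triplets used to build a CSR array in SciPy. A minimal sketch of the pattern (the function name and input format are mine, not the actual SuggestionBatch.from_sequence code):

```python
from scipy.sparse import csr_array

def batch_from_suggestions(suggestions_per_document, n_subjects):
    """Build one sparse batch from per-document (subject_id, score) pairs."""
    data, rows, cols = [], [], []
    for row, suggestions in enumerate(suggestions_per_document):
        for subject_id, score in suggestions:
            if score <= 0.0:
                continue  # keep only nonzero scores
            data.append(score)
            rows.append(row)
            cols.append(subject_id)
    return csr_array(
        (data, (rows, cols)),
        shape=(len(suggestions_per_document), n_subjects),
    )
```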
Potential future work

I'm not quite happy with how the NN ensemble handles suggestion results from other projects, both during training and suggest operations. For example, the training samples are stored in LMDB one document at a time, but now it would be easier to store them as whole batches instead, which could be more efficient. But I decided that this PR is already much too big, and it would make sense to try to improve batching in the NN ensemble in a separate follow-up PR. There is already an attempt to do part of this in PR #676; that could be a possible starting point.
The threading performance of annif hyperopt was already bad, and now it got worse. The solution could be to switch to process-based multiprocessing. This has been difficult to do with Optuna (it needs an external relational database), but the Optuna FAQ now states that it could also be done with JournalFileStorage, which sounds more promising.
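For reference, a rough sketch of what process-based parallelism with Optuna's JournalStorage/JournalFileStorage (available since Optuna 3.1) could look like; the objective below is only a placeholder, not Annif's actual hyperopt code. Each worker process would run something like this against the same journal file:

```python
import optuna
from optuna.storages import JournalStorage, JournalFileStorage

# A shared journal file on disk replaces the external relational database.
storage = JournalStorage(JournalFileStorage("hyperopt-journal.log"))

def objective(trial):
    # Placeholder: in Annif this would evaluate trial-suggested ensemble
    # weights against a validation corpus and return the chosen metric.
    weight = trial.suggest_float("weight", 0.0, 1.0)
    return weight

study = optuna.create_study(
    study_name="ensemble-hyperopt",
    storage=storage,
    load_if_exists=True,
    direction="maximize",
)
study.optimize(objective, n_trials=50)
```

Running several such worker processes in parallel, instead of threads, would sidestep the GIL contention described in the hyperopt results above.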