Remove the "iterative" versions of the search algorithms #3356
Labels: `maintenance` (issues about maintenance: CI, tests, refactoring), `milli` (related to the milli workspace), `performance` (related to performance in terms of search/indexation speed or RAM/CPU/disk consumption), `v1.2.0` (PRs/issues solved in v1.2.0, released on 2023-06-05)
The ranking rules `proximity`, `sort`, and `attribute` have two different implementation strategies. The first one (`set-based`) queries milli's databases and performs set operations on roaring bitmaps to find buckets of document ids. The second one (`iterative`) iterates over each candidate document and analyses its contents in order to sort the candidates.
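To make the distinction concrete, here is a minimal sketch of the kind of set operation the `set-based` strategy relies on (using the `roaring` crate; the function and bitmap names are made up for illustration):

```rust
use roaring::RoaringBitmap;

/// Sketch: the set-based strategy computes a bucket of document ids with a
/// single bitmap intersection instead of inspecting each document's content.
/// Both bitmaps stand in for postings read from milli's databases
/// (e.g. `word_pair_proximity_docids`).
fn set_based_bucket(
    candidates: &RoaringBitmap,
    docids_for_word_pair: &RoaringBitmap,
) -> RoaringBitmap {
    candidates & docids_for_word_pair
}
```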
Currently, we switch between the `set-based` and `iterative` implementation strategies based on the number of candidate documents that need to be sorted. In the `proximity` criterion, this is done with a constant, `CANDIDATES_THRESHOLD`.
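Roughly, the switch looks like this (a sketch: the threshold value is the one quoted below, and the resolver functions are made-up stand-ins for the real implementations):

```rust
use roaring::RoaringBitmap;

/// Threshold on the number of candidates above which we switch from the
/// iterative algorithm to the set-based one (the value quoted in this issue).
const CANDIDATES_THRESHOLD: u64 = 1000;

/// Sketch of the dispatch inside the `proximity` criterion.
fn next_bucket(candidates: &RoaringBitmap) -> RoaringBitmap {
    if candidates.len() <= CANDIDATES_THRESHOLD {
        // Few candidates: iterate on each document and analyse its content.
        resolve_iteratively(candidates)
    } else {
        // Many candidates: operate on whole bitmaps with set operations.
        resolve_set_based(candidates)
    }
}

fn resolve_iteratively(c: &RoaringBitmap) -> RoaringBitmap { c.clone() } // stub
fn resolve_set_based(c: &RoaringBitmap) -> RoaringBitmap { c.clone() } // stub
```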
There are, however, a few problems with this approach:

1. The `CANDIDATES_THRESHOLD` will always be arbitrary and suboptimal depending on the kind of data that was indexed. Maybe a value of `1000` is the best choice for small documents containing just a few dozen words, but for people with documents that weigh more than 500 kB, we may opt into the iterative approach too soon and take a heavy performance penalty.
2. We have to maintain two different implementations and update them both whenever we change the behaviour of a ranking rule, which is difficult. It is also difficult to ensure that both implementations are equivalent. In fact, some ranking rules already behave differently depending on the implementation strategy that was chosen. For `proximity`, the difference only occurs in some specific cases (e.g. documents/queries with consecutive identical words), which is okay. For `attribute`, however, it appears that there is a large difference between the two implementations.
3. It is harder to benchmark search requests correctly. We might make a change in the iterative or set-based version of an algorithm and then misjudge its impact because the alternative implementation was used instead. (This is partly fixed by "Add a 'Criterion implementation strategy' parameter to Search" (milli#742); see the sketch after this list.)
4. It is also harder to detect bugs in the implementation of the ranking rules, for the same reason as in (3).
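For reference, the strategy parameter added in milli#742 lets callers force one implementation during benchmarks; below is a sketch of its usage, where the exact identifiers are assumptions based on the PR's title and may not match the merged API:

```rust
use milli::{CriterionImplementationStrategy, Index, Search};

// Sketch: force the set-based implementation so that a benchmark exercises a
// known code path instead of silently switching at the candidates threshold.
fn search_set_based(index: &Index, rtxn: &heed::RoTxn) -> milli::Result<Vec<u32>> {
    let mut search = Search::new(rtxn, index);
    search.query("the quick brown fox");
    search.criterion_implementation_strategy(CriterionImplementationStrategy::OnlySetBased);
    let result = search.execute()?;
    Ok(result.documents_ids)
}
```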
Ideally, when we refactor the search algorithms, we should aim to make the set-based strategy fast enough that it is reasonable to use even when sorting only two candidate documents. This would allow us to reduce the size of the code base and make performance/correctness problems more visible.
Additionally, if we remove the iterative versions of the `proximity` and `attribute` ranking rules, we can also remove the `docid_word_positions` database, which will reduce the size of the index.