Remove the "iterative" versions of the search algorithms #3356
Labels: `maintenance` (issues about maintenance: CI, tests, refactoring), `milli` (related to the milli workspace), `performance` (related to performance in terms of search/indexation speed or RAM/CPU/disk consumption), `v1.2.0` (PRs/issues solved in v1.2.0, released on 2023-06-05)
The ranking rules `proximity`, `sort`, and `attribute` have two different implementation strategies. The first one (`set-based`) queries milli's databases and performs set operations on roaring bitmaps to find buckets of document ids. The second one (`iterative`) iterates over each candidate document and analyses its contents in order to sort the candidates.
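To make the distinction concrete, here is a minimal sketch of the kind of set operation the `set-based` strategy relies on (using the `roaring` crate; the function and bitmap names are made up for illustration):

```rust
use roaring::RoaringBitmap;

/// Sketch: the set-based strategy computes a bucket of document ids with a
/// single bitmap intersection instead of inspecting each document's content.
/// Both bitmaps stand in for postings read from milli's databases
/// (e.g. `word_pair_proximity_docids`).
fn set_based_bucket(
    candidates: &RoaringBitmap,
    docids_for_word_pair: &RoaringBitmap,
) -> RoaringBitmap {
    candidates & docids_for_word_pair
}
```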
Currently, we switch between the `set-based` and `iterative` implementation strategies based on the number of candidate documents that need to be sorted. In the `proximity` criterion, this is done with a constant, `CANDIDATES_THRESHOLD`.
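Roughly, the switch looks like this (a sketch: the threshold value is the one quoted below, and the resolver functions are made-up stand-ins for the real implementations):

```rust
use roaring::RoaringBitmap;

/// Threshold on the number of candidates above which we switch from the
/// iterative algorithm to the set-based one (the value quoted in this issue).
const CANDIDATES_THRESHOLD: u64 = 1000;

/// Sketch of the dispatch inside the `proximity` criterion.
fn next_bucket(candidates: &RoaringBitmap) -> RoaringBitmap {
    if candidates.len() <= CANDIDATES_THRESHOLD {
        // Few candidates: iterate on each document and analyse its content.
        resolve_iteratively(candidates)
    } else {
        // Many candidates: operate on whole bitmaps with set operations.
        resolve_set_based(candidates)
    }
}

fn resolve_iteratively(c: &RoaringBitmap) -> RoaringBitmap { c.clone() } // stub
fn resolve_set_based(c: &RoaringBitmap) -> RoaringBitmap { c.clone() } // stub
```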
There are, however, a few problems with this approach:

1. The `CANDIDATES_THRESHOLD` will always be arbitrary and suboptimal depending on the kind of data that was indexed. Maybe a value of `1000` is the best choice for small documents containing just a few dozen words, but for people with documents that weigh more than 500 kB, we may opt into the iterative approach too soon and take a heavy performance penalty.
2. We have to maintain two different implementations and update them both whenever we change the behaviour of a ranking rule, which is difficult. It is also difficult to ensure that both implementations are equivalent. In fact, some ranking rules already behave differently depending on the implementation strategy that was chosen. For `proximity`, the difference only occurs in some specific cases (e.g. documents/queries with consecutive identical words), which is okay. For `attribute`, however, it appears that there is a large difference between the two implementations.
3. It is harder to benchmark search requests correctly. We might make a change in the iterative or set-based version of an algorithm and then misjudge its impact because the alternative implementation was used instead. (This is partly fixed by "Add a 'Criterion implementation strategy' parameter to Search" (milli#742); see the sketch after this list.)
4. It is also harder to detect bugs in the implementation of the ranking rules, for the same reason as in (3).
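For reference, the strategy parameter added in milli#742 lets callers force one implementation during benchmarks; below is a sketch of its usage, where the exact identifiers are assumptions based on the PR's title and may not match the merged API:

```rust
use milli::{CriterionImplementationStrategy, Index, Search};

// Sketch: force the set-based implementation so that a benchmark exercises a
// known code path instead of silently switching at the candidates threshold.
fn search_set_based(index: &Index, rtxn: &heed::RoTxn) -> milli::Result<Vec<u32>> {
    let mut search = Search::new(rtxn, index);
    search.query("the quick brown fox");
    search.criterion_implementation_strategy(CriterionImplementationStrategy::OnlySetBased);
    let result = search.execute()?;
    Ok(result.documents_ids)
}
```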
Ideally, when we refactor the search algorithms, we should aim to make the set-based strategy fast enough that it is reasonable to use even when sorting only two candidate documents. This would allow us to reduce the size of the code base and make performance/correctness problems more visible.
Additionally, if we remove the iterative versions of the `proximity` and `attribute` ranking rules, we can also remove the `docid_word_positions` database, which will reduce the size of the index.