getting performant search results whilst authorizing #279

joepio · 2022-01-07T21:11:03Z

The tantivy search API requires me to set a limit on the number of search results.
At that point, I have no idea which of these results (or how many) the user making the request is allowed to view.
So what I currently do is filter all unauthorized resources out of the results list.
This means the list can contain fewer items than the limit.

This can confuse users. Imagine searching for a document and finding 3 results when you know there are more. Where are the others?

So how can I deal with this?

Default to a higher limit

Easy, but it makes things slower, and it will not work when there are many users and many private resources.
We could still limit the number of items returned and respect the client's params here.
That would also help keep performance acceptable.
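As a rough illustration of this option, here is a minimal sketch that over-fetches from the index, drops unauthorized hits, and then cuts the list back to what the client asked for. Everything in it is hypothetical: `Hit`, `search_with_overfetch`, the `overfetch_factor` knob, and the `search` and `is_authorized` closures stand in for the real index query and rights check.

```rust
/// A search hit. `subject` is a hypothetical resource identifier.
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub subject: String,
    pub score: f32,
}

/// Over-fetch from the index, drop unauthorized hits, then truncate to the
/// limit the client asked for.
///
/// `search(limit)` is assumed to return at most `limit` hits, best first.
/// `overfetch_factor` is an assumed tuning knob (e.g. 3 means "ask for 3x").
pub fn search_with_overfetch<S, A>(
    client_limit: usize,
    overfetch_factor: usize,
    search: S,
    is_authorized: A,
) -> Vec<Hit>
where
    S: Fn(usize) -> Vec<Hit>,
    A: Fn(&Hit) -> bool,
{
    // Ask the index for more than the client wants.
    let internal_limit = client_limit.saturating_mul(overfetch_factor.max(1));

    search(internal_limit)
        .into_iter()
        // Keep only what this user may see.
        .filter(|hit| is_authorized(hit))
        // Never return more than the client requested.
        .take(client_limit)
        .collect()
}
```

The result can still come up short when most of the top hits are private, which is the weakness described above.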

Let the front-end deal with this

The client could request some number of results and, if too few come back, try again with a larger number.
Very slow, very ugly.

Use `tantivy::collector::TopDocs::and_offset`

I think we can use this to run a follow-up query (one that skips the n items already returned by the first) if we haven't got enough authorized resources yet.
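A minimal sketch of that loop, with the index hidden behind a closure so the example stays self-contained. The names `search_authorized`, `page_size`, `max_pages`, `search`, and `is_authorized` are all hypothetical; in practice the `search` closure would presumably wrap a tantivy query using `TopDocs::with_limit(page_size).and_offset(offset)`.

```rust
/// Keep requesting further pages until `limit` authorized subjects are found,
/// the index runs out of matches, or `max_pages` queries have been made.
///
/// `search(limit, offset)` is assumed to return subjects ordered best first,
/// skipping the first `offset` matches.
pub fn search_authorized<S, A>(
    limit: usize,
    page_size: usize,
    max_pages: usize,
    search: S,
    is_authorized: A,
) -> Vec<String>
where
    S: Fn(usize, usize) -> Vec<String>,
    A: Fn(&str) -> bool,
{
    let mut authorized = Vec::with_capacity(limit);
    if limit == 0 || page_size == 0 {
        return authorized;
    }

    let mut offset = 0;
    for _ in 0..max_pages {
        let page = search(page_size, offset);
        let fetched = page.len();

        for subject in page {
            if is_authorized(subject.as_str()) {
                authorized.push(subject);
                if authorized.len() == limit {
                    return authorized;
                }
            }
        }

        // A short page means the index has no more matches.
        if fetched < page_size {
            break;
        }
        // Skip everything already inspected on the next query.
        offset += fetched;
    }
    authorized
}
```

The `max_pages` cap is a design choice to bound the worst case, where a user is authorized for almost nothing and every query comes back mostly filtered out.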


joepio · 2022-01-26T19:16:12Z

I asked the guys at the Tantivy discord:

I was wondering whether Tantivy can return an Iterator as an alternative to setting a limit when creating TopDocs. Basically TopDocs::without_limit, or something. Should I use TopDocs::and_offset instead?

In my use case, I want to authorize the user before sending the result. This means I don't know in advance whether my limit is big enough. Having an Iterator would fix this problem.

Got this reply:

TopDocs is really efficient when you put a hard limit, like 10, 20, 30 docs. Let's call the number of docs you want to retrieve N and the offset offset. During collection of docs at the segment level, we have to retain at most N + offset and we store them in a heap. Once the heap is full, we can skip document with a score lower than the lowest score of docs in the heap. We can even use block-max WAND to skip blocks of documents to be faster (block max wand is for terms queries only).
From what I understand, you want to retrieve possibly all the docs, so I would implement a dedicated collector. BTW there is a very simple collector that returns the set of DocAddress that matches the query: DocSetCollector. It's not ordered by score though.
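To make the heap argument in that reply concrete, here is a simplified sketch of bounded top-N collection using only the standard library. It is not tantivy's implementation: the `top_n` and `page` functions are hypothetical, scores are plain integers (so they implement `Ord`), and block-max WAND is not modelled at all.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Return the `n` best `(score, doc_id)` pairs, best first, while never
/// holding more than `n` entries in memory.
pub fn top_n<I>(docs: I, n: usize) -> Vec<(u32, u32)>
where
    I: IntoIterator<Item = (u32, u32)>,
{
    if n == 0 {
        return Vec::new();
    }
    // `Reverse` turns the max-heap into a min-heap, so `peek` yields the
    // lowest-scoring entry currently retained.
    let mut heap: BinaryHeap<Reverse<(u32, u32)>> = BinaryHeap::with_capacity(n);

    for (score, doc_id) in docs {
        if heap.len() < n {
            heap.push(Reverse((score, doc_id)));
            continue;
        }
        // Heap is full: anything not better than the current lowest is skipped.
        let beats_lowest = heap
            .peek()
            .map_or(false, |Reverse(lowest)| (score, doc_id) > *lowest);
        if beats_lowest {
            heap.pop();
            heap.push(Reverse((score, doc_id)));
        }
    }

    let mut out: Vec<(u32, u32)> = heap.into_iter().map(|Reverse(x)| x).collect();
    out.sort_unstable_by(|a, b| b.cmp(a));
    out
}

/// Paging with an offset: the heap has to retain `n + offset` entries,
/// and the first `offset` of them are thrown away afterwards.
pub fn page<I>(docs: I, n: usize, offset: usize) -> Vec<(u32, u32)>
where
    I: IntoIterator<Item = (u32, u32)>,
{
    top_n(docs, n + offset).into_iter().skip(offset).collect()
}
```

This also shows why repeatedly increasing the offset gets more expensive: each later page keeps a larger heap and discards more of it.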

joepio · 2023-02-21T08:02:54Z

Since we now support parent scoping using `?parent=uri`, we can search inside a specific drive. That solves the largest problem in this issue.

joepio mentioned this issue Jan 7, 2022

Admin rights check #280

Open

joepio added the performance Speed improvements label Jan 7, 2022

joepio added a commit that referenced this issue Jan 7, 2022

#279 search performance limit

ce3d5d2

joepio closed this as completed Feb 21, 2023
