getting performant search results whilst authorizing #279

Closed
joepio opened this issue Jan 7, 2022 · 2 comments
Labels
performance Speed improvements

Comments

Member

joepio commented Jan 7, 2022

The tantivy search API requires me to set some limit of search results.
At that time, I have no idea which of these (or how many) results can be viewed by the user making the request.
So what I currently do is filter the unauthorized resources out of the results list.
This means that the list can contain fewer than limit items.

This can lead to a confused user. Imagine searching for "document" and finding 3 results, while you know there are more. Where are the others?

So how can I deal with this?

Default to a higher limit

Easy, but it makes every search slower, and it still breaks down when there are many users and many private resources, because most hits may then be unauthorized for any given user.
We could still cap the number of items we return and respect the client's params here.
That would also help keep performance acceptable.
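A minimal sketch of this over-fetching idea, using only the standard library. The names overfetch_and_filter, overfetch_factor, fetch and is_authorized are hypothetical: fetch stands in for a tantivy search with the given internal limit, and is_authorized for the per-resource rights check.

```rust
/// Asks the index for more hits than the client requested, drops the
/// unauthorized ones, then truncates to the client's limit.
///
/// `fetch(internal_limit)` is an assumed wrapper around the actual search.
/// `is_authorized` is an assumed permission check for the requesting user.
pub fn overfetch_and_filter<T, F, A>(
    client_limit: usize,
    overfetch_factor: usize,
    fetch: F,
    is_authorized: A,
) -> Vec<T>
where
    F: FnOnce(usize) -> Vec<T>,
    A: Fn(&T) -> bool,
{
    // Never go below the client's own limit, and avoid overflow.
    let internal_limit = client_limit.saturating_mul(overfetch_factor.max(1));

    fetch(internal_limit)
        .into_iter()
        .filter(|hit| is_authorized(hit))
        .take(client_limit)
        .collect()
}
```

The result can still be shorter than the client's limit when more than (overfetch_factor - 1) / overfetch_factor of the hits are private, which is exactly the many-private-resources case.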

Let the front-end deal with this

The client could request some amount and, if too few results come back, try again with a larger amount.
Very slow, very ugly.

Use tantivy::collector::TopDocs::and_offset

I think we can use this to run the query again, skipping the n items we already got from the first one, if we haven't collected enough authorized resources yet.
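Roughly, the loop could look like the sketch below. It is stdlib-only and not tested against tantivy: Hit, search_authorized, fetch_page, is_authorized and max_pages are hypothetical names, and fetch_page(offset, page_size) is assumed to wrap a search using TopDocs::with_limit(page_size).and_offset(offset).

```rust
/// Hypothetical stand-in for one search hit (e.g. the subject of a resource).
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub subject: String,
}

/// Collects up to `limit` authorized hits by querying page after page.
///
/// `fetch_page(offset, page_size)` is an assumed wrapper around the search.
/// `is_authorized` is an assumed rights check for the requesting user.
/// `max_pages` caps the number of queries, so one request can't scan forever.
pub fn search_authorized<F, A>(
    limit: usize,
    page_size: usize,
    max_pages: usize,
    mut fetch_page: F,
    is_authorized: A,
) -> Vec<Hit>
where
    F: FnMut(usize, usize) -> Vec<Hit>,
    A: Fn(&Hit) -> bool,
{
    let mut authorized = Vec::new();
    if page_size == 0 {
        return authorized;
    }

    let mut offset = 0;
    for _ in 0..max_pages {
        if authorized.len() >= limit {
            break;
        }
        let page = fetch_page(offset, page_size);
        let fetched = page.len();

        for hit in page {
            if authorized.len() == limit {
                break;
            }
            if is_authorized(&hit) {
                authorized.push(hit);
            }
        }

        // A short page means the index has no more matches.
        if fetched < page_size {
            break;
        }
        offset += fetched;
    }
    authorized
}
```

Each extra page is a full new query, so this only pays off if most requests are satisfied by the first page.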

@joepio joepio added the performance Speed improvements label Jan 7, 2022
joepio added a commit that referenced this issue Jan 7, 2022
Member Author

joepio commented Jan 26, 2022

I asked the guys at the Tantivy discord:

was wondering whether Tantivy can return an Iterator as an alternative to setting a Limit when creating TopDocs. Basically TopDocs::without_limit, or something. Should I use TopDocs::and_offset instead?

In my usecase, I want to authorize the user before sending the result. This means I don't know in advance if my Limit is big enough. Having an Iterator would fix this problem

Got this reply:

TopDocs is really efficient when you put a hard limit, like 10, 20, 30 docs. Let's call the number of docs you want to retrieve N and the offset offset. During collection of docs at the segment level, we have to retain at most N + offset and we store them in a heap. Once the heap is full, we can skip document with a score lower than the lowest score of docs in the heap. We can even use block-max WAND to skip blocks of documents to be faster (block max wand is for terms queries only).
From what I understand, you want to retrieve possibly all the docs, so I would implement a dedicated collector. BTW there is a very simple collector that returns the set of DocAddress that matches the query: DocSetCollector. It's not ordered by score though.
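To make the heap argument concrete for myself, here is a small stdlib-only sketch of that idea. It is not tantivy's code: top_docs is a hypothetical name, and scores are plain integers here (tantivy scores are floats) so that they can be ordered directly.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Keeps only the best `limit + offset` (score, doc_id) pairs in a min-heap,
/// then skips the first `offset` of them, highest score first.
pub fn top_docs(
    scored: impl IntoIterator<Item = (u32, u32)>,
    limit: usize,
    offset: usize,
) -> Vec<(u32, u32)> {
    let capacity = limit + offset;
    if capacity == 0 {
        return Vec::new();
    }

    // `Reverse` turns the max-heap into a min-heap: the lowest score is on top.
    let mut heap: BinaryHeap<Reverse<(u32, u32)>> = BinaryHeap::with_capacity(capacity);

    for entry in scored {
        if heap.len() < capacity {
            heap.push(Reverse(entry));
            continue;
        }
        // Once the heap is full, anything not better than the current
        // lowest entry can be skipped without touching the heap.
        let beats_lowest = heap.peek().map_or(false, |lowest| entry > lowest.0);
        if beats_lowest {
            heap.pop();
            heap.push(Reverse(entry));
        }
    }

    let mut best: Vec<(u32, u32)> = heap.into_iter().map(|Reverse(e)| e).collect();
    best.sort_unstable_by(|a, b| b.cmp(a)); // highest score first
    best.into_iter().skip(offset).collect()
}
```

Memory and work grow with limit + offset, so paging deep with and_offset gets more expensive on every page, and an unlimited result set loses the skipping benefit altogether.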

Member Author

joepio commented Feb 21, 2023

Since we now support parent scoping using ?parent=uri, we can search inside a specific drive. That solves the largest problem of this issue.

@joepio joepio closed this as completed Feb 21, 2023