Filter document from score_threshold after search in QdrantDocumentStore #1055

Open
Sprizgola opened this issue Sep 6, 2024 · 0 comments

Is your feature request related to a problem? Please describe.
While working with the QdrantHybridRetriever, I found that the score_threshold parameter is tricky to understand: it appears to be applied only to the sparse vector search and not to the dense vector search, as shown in this excerpt from the QdrantDocumentStore module:

       ...
        sparse_request = rest.SearchRequest(
            vector=rest.NamedSparseVector(
                name=SPARSE_VECTORS_NAME,
                vector=rest.SparseVector(
                    indices=query_sparse_embedding.indices,
                    values=query_sparse_embedding.values,
                ),
            ),
            filter=qdrant_filters,
            limit=top_k,
            with_payload=True,
            with_vector=return_embedding,
            score_threshold=score_threshold,
        )

        dense_request = rest.SearchRequest(
            vector=rest.NamedVector(
                name=DENSE_VECTORS_NAME,
                vector=query_embedding,
            ),
            filter=qdrant_filters,
            limit=top_k,
            with_payload=True,
            with_vector=return_embedding,
        )
       ...

The scores returned by these requests are not normalized, which makes it harder to choose a threshold.
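To illustrate why unnormalized scores make a single threshold hard to pick, here is a minimal sketch. It assumes the dense vectors are compared with cosine similarity and the sparse vectors with a dot product; the collection's actual distance settings may differ. The helper names (`cosine`, `sparse_dot`) and all vectors are made up for the example.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity: always falls in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def sparse_dot(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Dot product over shared indices: has no fixed upper bound."""
    return sum(value * b[index] for index, value in a.items() if index in b)


# Hypothetical query/document pairs, one dense and one sparse.
dense_score = cosine([0.2, 0.9, 0.1], [0.3, 0.8, 0.2])
sparse_score = sparse_dot({3: 2.5, 17: 1.8}, {3: 3.1, 17: 2.2, 42: 0.7})

# The two scores live on different scales, so one threshold value
# cannot mean the same thing for both searches.
print(f"dense={dense_score:.3f} sparse={sparse_score:.3f}")
```

Under these assumptions, a threshold that is reasonable for the sparse scores would reject every dense hit, which is why the threshold is easier to reason about when applied to a single, final score.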

Describe the solution you'd like
This problem could be solved in one of two ways:

  • Solution A:
  1. Remove the score_threshold from the sparse vector search
  2. Add an optional parameter to the reciprocal_rank_fusion function so that it can filter out all points whose fused score is lower than the threshold
  • Solution B:
  1. Remove the score_threshold from the sparse vector search
  2. Add a condition on the point score to the following list comprehension:
results = [convert_qdrant_point_to_haystack_document(point, use_sparse_embeddings=True) for point in points if point.score >= score_threshold] 
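Solution A could look roughly like the sketch below. It is a self-contained illustration, not the integration's actual code: `ScoredPoint` here is a hypothetical stand-in dataclass, `rrf_with_threshold` and its `ranking_constant` default are assumptions, and the real reciprocal_rank_fusion function may use a different constant or signature.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ScoredPoint:
    """Hypothetical stand-in for a search hit: an id plus a score."""
    id: str
    score: float


def rrf_with_threshold(
    responses: List[List[ScoredPoint]],
    limit: int = 10,
    score_threshold: Optional[float] = None,
    ranking_constant: int = 60,  # assumed value, not taken from the library
) -> List[ScoredPoint]:
    """Fuse ranked result lists with reciprocal rank fusion, then
    optionally drop points whose fused score is below the threshold."""
    fused: Dict[str, float] = {}
    for response in responses:
        # Each list contributes 1 / (constant + rank); raw scores are ignored.
        for rank, point in enumerate(response, start=1):
            fused[point.id] = fused.get(point.id, 0.0) + 1.0 / (ranking_constant + rank)

    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    if score_threshold is not None:
        ranked = [(pid, score) for pid, score in ranked if score >= score_threshold]
    return [ScoredPoint(id=pid, score=score) for pid, score in ranked[:limit]]


# Hypothetical sparse and dense results, each ordered best-first.
sparse_hits = [ScoredPoint("a", 7.3), ScoredPoint("b", 5.1)]
dense_hits = [ScoredPoint("b", 0.91), ScoredPoint("c", 0.88)]
filtered = rrf_with_threshold([sparse_hits, dense_hits], score_threshold=0.02)
```

Because the threshold is applied to the fused score, it no longer depends on the scale of either underlying search. With this sketch's constant, only a point that appears in both lists can exceed 0.02, so the threshold value would need to be chosen with the fusion formula in mind.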

Describe alternatives you've considered
Another alternative is to implement a component that filters out documents based on the score given by the retriever, like this:

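from typing import List
from haystack import Document, component

@component
class RetrieverDocumentFilter:
    # Keeps only the documents whose score is at least score_threshold;
    # documents without a score are dropped.
    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document], score_threshold: float):
        kept = [d for d in documents if d.score is not None and d.score >= score_threshold]
        return {"documents": kept}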

Although this is not a bad solution, I think it would be better to be able to perform the filtering directly in the retriever component.

@anakin87 anakin87 transferred this issue from deepset-ai/haystack Sep 6, 2024
@julian-risch julian-risch added the P3 label Sep 7, 2024