Filter document from score_threshold after search in QdrantDocumentStore #1055

Sprizgola · 2024-09-06T09:24:16Z

Is your feature request related to a problem? Please describe.
While working with the QdrantHybridRetriever, I found that the score_threshold parameter is tricky to understand: it appears to be applied only to the sparse vector search, as shown in this excerpt from the QdrantDocumentStore module:

       ...
        sparse_request = rest.SearchRequest(
            vector=rest.NamedSparseVector(
                name=SPARSE_VECTORS_NAME,
                vector=rest.SparseVector(
                    indices=query_sparse_embedding.indices,
                    values=query_sparse_embedding.values,
                ),
            ),
            filter=qdrant_filters,
            limit=top_k,
            with_payload=True,
            with_vector=return_embedding,
            score_threshold=score_threshold,
        )

        dense_request = rest.SearchRequest(
            vector=rest.NamedVector(
                name=DENSE_VECTORS_NAME,
                vector=query_embedding,
            ),
            filter=qdrant_filters,
            limit=top_k,
            with_payload=True,
            with_vector=return_embedding,
        )
       ...

The scores returned by these requests are not normalized, which makes it harder to choose a sensible threshold.

Describe the solution you'd like
A simple solution to this problem can be implemented in one of these ways:

Solution A:

Remove the score_threshold from the sparse vector search
Add an optional parameter to the reciprocal_rank_fusion function so that it can filter out all points whose score is below the threshold
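To make Solution A more concrete, here is a minimal, self-contained sketch of reciprocal rank fusion with an optional threshold. It is not the actual reciprocal_rank_fusion implementation used by the document store: the ScoredPoint dataclass, the rrf_with_threshold name, and the k constant (defaulting to 60 here) are all assumptions made for illustration. Note that the threshold would then apply to the fused score, which is on a different scale than the raw sparse or dense scores.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ScoredPoint:
    """Hypothetical stand-in for a search hit; not the qdrant_client model."""
    id: str
    score: float


def rrf_with_threshold(
    rankings: List[List[ScoredPoint]],
    k: int = 60,  # assumed smoothing constant, not taken from the library
    score_threshold: Optional[float] = None,
) -> List[ScoredPoint]:
    """Fuse ranked lists with reciprocal rank fusion, then optionally filter."""
    fused: Dict[str, float] = {}
    for ranking in rankings:
        # Each list contributes 1 / (k + rank) for every point it contains.
        for rank, point in enumerate(ranking, start=1):
            fused[point.id] = fused.get(point.id, 0.0) + 1.0 / (k + rank)

    results = [ScoredPoint(id=pid, score=s) for pid, s in fused.items()]
    results.sort(key=lambda p: p.score, reverse=True)

    # The threshold is applied to the fused score, after both searches ran.
    if score_threshold is not None:
        results = [p for p in results if p.score >= score_threshold]
    return results


# Made-up hits: "b" appears in both lists, so it gets the highest fused score.
sparse_hits = [ScoredPoint("a", 12.3), ScoredPoint("b", 9.8)]
dense_hits = [ScoredPoint("b", 0.91), ScoredPoint("c", 0.85)]

all_points = rrf_with_threshold([sparse_hits, dense_hits])
filtered = rrf_with_threshold([sparse_hits, dense_hits], score_threshold=0.02)
```

With these made-up inputs, only the point found by both searches survives a threshold of 0.02, while omitting the threshold keeps all three points.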

Solution B:

Remove the score_threshold from the sparse vector search
Add a condition on the point score to the following list comprehension:

results = [
    convert_qdrant_point_to_haystack_document(point, use_sparse_embeddings=True)
    for point in points
    if score_threshold is None or point.score >= score_threshold
]

Describe alternatives you've considered
Another alternative is to implement a component that filters out documents based on the score assigned by the retriever, like this:

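from typing import List

from haystack import Document, component

@component
class RetrieverDocumentFilter:
    """Keeps only the documents whose retriever score reaches the threshold."""

    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document], score_threshold: float):
        # Documents without a score cannot be compared, so they are dropped.
        return {"documents": [d for d in documents if d.score is not None and d.score >= score_threshold]}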

Although this is not a bad solution, I think it would be nicer to be able to perform the filtering directly in the retriever component.

anakin87 transferred this issue from deepset-ai/haystack Sep 6, 2024

anakin87 added the integration:qdrant label Sep 6, 2024

julian-risch added the P3 label Sep 7, 2024
