Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Support Radius Search in k-NN #814

Closed
Thijsvandepoll opened this issue Mar 21, 2023 · 9 comments
Closed

[FEATURE] Support Radius Search in k-NN #814

Thijsvandepoll opened this issue Mar 21, 2023 · 9 comments
Assignees
Labels
enhancement Features Introduces a new unit of functionality that satisfies a requirement v2.14.0

Comments

@Thijsvandepoll
Copy link

Is your feature request related to a problem?
I would be interested in the ability to provide a maximum radius filter to find all (approximate) nearest neighbors that lie within a certain distance from the query vector. For our use case we need to aggregate over this result. If I am not mistaken, this is currently not possible.

What solution would you like?
Something like:

GET my-knn-index-1/_search
{
  "query": {
    "knn": {
      "my_vector2": {
        "vector": [2, 3, 5, 6],
        "radius": 2.0
      }
    }
  }
}

This would allow me to aggregate over all results for the given radius.

What alternatives have you considered?
We are considering Elasticsearch as an alternative. They are working on such a feature: elastic/elasticsearch#84929

@navneet1v
Copy link
Collaborator

Assigning it to @vamshin

@navneet1v navneet1v added Features Introduces a new unit of functionality that satisfies a requirement and removed untriaged labels Mar 21, 2023
@vamshin
Copy link
Member

vamshin commented Mar 27, 2023

Please +1 if you are looking for this feature to help prioritize

@vamshin vamshin changed the title Maximum radius search [FEATURE] [FEATURE] Maximum radius search Feb 7, 2024
@junqiu-lei
Copy link
Member

With leveraging both Lucene and FAISS libraries, we can find out a way to unify the radius search API for both enginess in OpenSearch k-NN.

  • FAISS: It supports in radius searches by range_search API with distance parameter of radius for both IVF and HNSW methods.

    • Library availability in k-NN:
      • IVF: Supported
      • HNSW: It will be supported after k-NN Faiss bump PR merged
  • Lucene: Lucene recently added the feature of supporting similarity-based vector searches. It can find all vectors scoring above a resultSimilarity while traversing the HNSW graph till better-scoring nodes are available, or the best candidate is below a score of traversalSimilarity in the lowest level.

Our initial release goal is to support both Lucene and Faiss engine for this feature.

cc: @vamshin @Thijsvandepoll

@nknize
Copy link

nknize commented Feb 13, 2024

Lucene 9.10 branch was cut yesterday and the release process will begin tomorrow. Given that OpenSearch release is starting today, it's unlikely this would make 2.12 w/o a delay.

9.10 upgrade is a good reason to suggest postponing 2.12 as there's several nice updates and features coming in Lucene, but it looks like OpenSearch core's main branch still hasn't updated to any snapshot of 9.10. Perplexing why that never happened since (at minimum) the main branch should be updated to the next minor snapshot after a release of Lucene to ensure we stay on track w/ upstream releases. ¯\_(ツ)_/¯

@junqiu-lei
Copy link
Member

Thanks @nknize for sharing the timely information on Luecne 9.10!

@vamshin
Copy link
Member

vamshin commented Feb 13, 2024

thanks @nknize. Looks like we cannot make it to 2.12. If core can unblock us with 9.10 upgrade, we can start testing end to end feature and get ready for 2.13.

cc: @bbarani

@junqiu-lei
Copy link
Member

Radial search benchmark

Cluster configuration

Configuration Value
OpenSearch Version 2.x
Leader Nodes 3
Leader Node Type c6g.xlarge
Leader Node Disk Space 50
Data Nodes 3
Data Node Type r6g.2xlarge
Data Node Disk Space 100
Primary Shard Count 3
Replica Count 1
AZ us-east-1
Test client type m5.2xlarge

Cluster created by opensearch-build and opensearch-cluster-cdk

Feature branch: https://github.com/junqiu-lei/k-NN/tree/2.x-radial-traversal

Benchmark Tool

OSB

Dataset

We need update datasets to include threshold values for top-k and true neighbors for radial threshold.

For example, the top k 100 query threshold means we use the 100th nearest doc to capture the min_score threshold and max_distance threshold.

Name Dimension Doc size Query size Space type
cohere-wikipedia-22-12-en-embeddings 768 1M 10k inner product
Min score threshold Median num of true neighboors Average num of true neighboors Num query targets of having 0 neighboors
161 118 83.1888 4048
156 1186 421.8486 1661
154 2959 778.0673 941

Algorithm

Algorithm name ef construction ef search hnsw m
HNSW 256 256 16

Results

Engine type Query type Query threshold Traversal similarity Search: 50th percentile service time (ms) Search: 90th percentile service time (ms) Search: 99th percentile service time (ms) Search: 100th percentile service time (ms) Recall
Faiss knn top k 100 N/A 6.77 7.27 7.73 17.70 0.967
Faiss min_score top k 100 N/A 7.40 8.41 10.65 310.52 0.96
Lucene knn top k 100 N/A 8.06 8.86 9.99 37.61 0.908
Lucene min_score top k 100 1.00 * 161 5.23 6.14 8.87 274.34 0.379
Lucene min_score top k 100 0.99 * 161 5.60 6.16 6.77 50.85 0.571
Lucene min_score top k 100 0.98 * 161 6.35 8.14 10.48 24.65 0.72
Lucene min_score top k 100 0.97 * 161 9.90 15.41 23.86 55.58 0.823
Lucene min_score top k 100 0.96 * 161 14.64 28.14 49.51 128.18 0.88
Lucene min_score top k 100 0.95 * 161 23.22 52.55 99.79 236.20 0.921
Lucene min_score top k 100 0.94 * 161 36.97 92.82 171.99 382.07 0.948
Faiss min_score min score 161 N/A 6.24 7.33 11.11 28.53 0.99
Lucene min_score min score 161 1.00 * 161 3.18 5.09 15.16 61.89 0.52
Lucene min_score min score 161 0.99 * 161 2.99 5.51 18.46 68.40 0.60
Lucene min_score min score 161 0.98 * 161 3.03 6.79 23.34 77.45 0.68
Lucene min_score min score 161 0.97 * 161 3.22 8.91 29.82 86.82 0.74
Lucene min_score min score 161 0.96 * 161 3.53 12.03 38.38 101.77 0.81
Lucene min_score min score 161 0.95 * 161 4.42 17.01 50.14 117.18 0.85
Lucene min_score min score 161 0.94 * 161 5.69 24.61 64.10 138.50 0.89
Lucene min_score min score 161 0.93 * 161 7.50 35.11 81.63 161.29 0.92
Lucene min_score min score 161 0.90 * 161 33.00 104.88 174.07 269.08 0.97
Faiss min_score min score 156 N/A 7.76 13.58 26.83 69.97 0.98
Lucene min_score min score 156 0.96 * 156 8.63 38.67 98.23 173.30 0.86
Lucene min_score min score 156 0.95 * 156 13.02 54.16 118.65 223.08 0.90
Lucene min_score min score 156 0.94 * 156 19.94 76.63 149.45 242.02 0.93
Faiss min_score min score 154 N/A 6.89 12.13 25.59 80.94 0.98
Lucene min_score min score 154 0.96 * 154 14.31 61.68 133.41 133.41 0.88
Lucene min_score min score 154 0.95 * 154 22.56 84.50 161.62 246.44 0.92
Lucene min_score min score 154 0.94 * 154 36.86 116.68 202.46 302.71 0.94

Observations

  • FAISS shows better performance and recall compared to Lucene in radial search. It maintains faster query time and higher recall.
  • Faiss shows similar query speed and recall from radial search to knn search.
  • In Lucene with lower traversal similarity, we can see higher recall and lower query speed performed, when traversal similarity at around 0.95 of min_score, it could get a result in balanced performance n terms of query latency and accuracy. I think it's necessary to configure the traversal similarity value when use Lucene engine. We probably can use the 0.95*threshold as the default traversal similarity value in k-NN.

CC: @jmazanec15 @navneet1v @vamshin

@jmazanec15
Copy link
Member

@junqiu-lei thanks, I think the 0.95*min_score makes sense to start. We can make this configurable in future if need be.

@junqiu-lei
Copy link
Member

Closing this issue as this feature is going to release at 2.14.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Features Introduces a new unit of functionality that satisfies a requirement v2.14.0
Projects
Status: Done
Development

No branches or pull requests

6 participants