Utility Function for Vector Similarity Search #18

Closed
4 tasks done
devansh-shah-11 opened this issue Mar 13, 2024 · 1 comment · Fixed by #30
Labels: enhancement (New feature or request), sweep

devansh-shah-11 (Collaborator) commented Mar 13, 2024

Description
We need a new utility function in API/database.py that performs a vector similarity search. The function should take an embedding vector as input and return the most similar vectors from the MongoDB Atlas database, using Euclidean distance as the similarity measure.

This utility function will be used by the recognise_face() endpoint to find the most similar face in the database.

Expected Behavior

The endpoint should take n as input from the user and return the top n most similar vectors from the MongoDB database.
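
The expected behaviour can be pinned down with a small in-memory reference, which may help when checking a database query against known answers. This is a minimal sketch, not code from the repository: it assumes each record is a dict with a `vector` field, and the names `euclidean_distance` and `top_n_similar` are hypothetical.

```python
import heapq
import math
from typing import Any, Dict, List, Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_n_similar(
    query: Sequence[float],
    records: List[Dict[str, Any]],
    n: int,
) -> List[Dict[str, Any]]:
    """Return the n records closest to `query`, nearest first.

    Assumes every record stores its embedding under the key "vector".
    """
    if n <= 0:
        return []
    # nsmallest keeps only n candidates in memory and returns them sorted
    # by ascending distance, i.e. most similar first.
    return heapq.nsmallest(
        n, records, key=lambda r: euclidean_distance(query, r["vector"])
    )
```

A smaller distance means a more similar face, so results are ordered by ascending distance.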

Benefits
This feature automates finding the top n vectors most similar to a given face, which helps identify the employee.

Tasks
Explore the MongoDB vector search tutorial
Write a function to return the most similar vectors

Checklist
  • Modify API/database.py (d6366eb)
  • Running GitHub Actions for API/database.py
  • Modify API/route.py (7b8ca4e)
  • Running GitHub Actions for API/route.py

sweep-ai bot commented Mar 13, 2024

🚀 Here's the PR! #20

GitHub Actions✓

Here are the GitHub Actions logs prior to making any changes:

Sandbox logs for 91e83d1
Checking API/database.py for syntax errors...
✅ API/database.py has no syntax errors!

Sandbox passed on the latest main, so sandbox checks will be enabled for this issue.


Step 1: 🔎 Searching

I found the following snippets in your repository. I will now analyze these snippets and come up with a plan.

Some code snippets I think are relevant, in decreasing order of relevance. If a file is missing from here, you can mention its path in the ticket description.

https://github.com/devansh-shah-11/FaceRec/blob/91e83d1e0629dfb50ad9baecd37d3e4982a29f76/API/database.py#L1-L23

https://github.com/devansh-shah-11/FaceRec/blob/91e83d1e0629dfb50ad9baecd37d3e4982a29f76/API/route.py#L152-L220


Step 2: ⌨️ Coding

Modify API/database.py with contents:
• Add a new method named `find_similar_vectors` in the `Database` class. This method should accept two parameters: `embedding_vector`, which is the vector for which we want to find similar vectors, and `n`, which is the number of top similar vectors to return.
• Inside this method, use MongoDB's aggregation framework to perform the vector similarity search. The standard aggregation operators include no Euclidean distance operator, so this logic is implemented manually. One approach is to store the embedding vectors in a collection whose schema includes the vector and a unique identifier, then use an aggregation pipeline to calculate the Euclidean distance between the input vector and each stored vector, sort the results by that distance in ascending order, and limit the results to the top n entries.
• The method should return the top n most similar vectors from the MongoDB database.
• Note: This task does not rely on built-in vector similarity search. Where such a feature is available (for example, Atlas Vector Search with Euclidean similarity), the implementation should leverage that instead.
--- 
+++ 
@@ -22,3 +22,31 @@
 
     def update_one(self, collection, query, update):
         return self.db[collection].update_one(query, update)
+    def find_similar_vectors(self, collection, embedding_vector, n):
+        """
+        Find the top n most similar vectors in the database to the given embedding_vector.
+        This method uses the Euclidean distance for similarity measure.
+
+        :param collection: The MongoDB collection to search within.
+        :param embedding_vector: The embedding vector to find similar vectors for.
+        :param n: The number of top similar vectors to return.
+        :return: The top n most similar vectors from the MongoDB database.
+        """
+        pipeline = [
+            {
+                "$addFields": {
+                    "distance": {
+                        "$sqrt": {
+                            "$reduce": {
+                                "input": {"$zip": {"inputs": ["$vector", embedding_vector]}},
+                                "initialValue": 0,
+                                "in": {"$add": ["$$value", {"$pow": [{"$subtract": [{"$arrayElemAt": ["$$this", 0]}, {"$arrayElemAt": ["$$this", 1]}]}, 2]}]}
+                            }
+                        }
+                    }
+                }
+            },
+            {"$sort": {"distance": 1}},
+            {"$limit": n}
+        ]
+        return list(self.db[collection].aggregate(pipeline))
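
The note in the plan mentions leveraging a built-in feature where one exists. MongoDB Atlas provides a `$vectorSearch` aggregation stage, so an alternative pipeline might look like the sketch below. It assumes an Atlas Vector Search index has already been created on the `vector` field with `euclidean` similarity. The index name `vector_index`, the `num_candidates` default and the function name are hypothetical and do not come from the repository.

```python
from typing import Any, Dict, List, Optional, Sequence


def build_vector_search_pipeline(
    embedding_vector: Sequence[float],
    n: int,
    index_name: str = "vector_index",  # assumed index name
    path: str = "vector",              # field holding the stored embeddings
    num_candidates: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Build an aggregation pipeline that uses Atlas Vector Search.

    The similarity function (euclidean here, by assumption) is chosen when
    the index is defined, not in the query itself.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    if num_candidates is None:
        # numCandidates must be at least `limit`; oversampling tends to
        # improve recall for approximate search. Atlas also caps this
        # value, so check the documentation for the current limit.
        num_candidates = max(10 * n, 100)
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": list(embedding_vector),
                "numCandidates": num_candidates,
                "limit": n,
            }
        },
        # Inclusion projection: keep _id and the vector, add the search score.
        {"$project": {path: 1, "score": {"$meta": "vectorSearchScore"}}},
    ]
```

The pipeline would be passed to `self.db[collection].aggregate(...)` in the same way as the manual one. Note that `score` is a normalised similarity in which higher means closer, not the raw distance, so the ordering is the reverse of the `distance` field.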
  • Running GitHub Actions for API/database.py
Check API/database.py with contents:

Ran GitHub Actions for d6366ebfcc133c30f5e069c0508a89b52686ba57:

Modify API/route.py with contents:
• Add a new endpoint in the `route.py` file for the `recognise_face` functionality. This endpoint should accept an embedding vector and a parameter n from the user, and use the `find_similar_vectors` method from the `Database` class to find and return the top n most similar vectors.
• The endpoint should extract the embedding vector and the value of n from the request, call the `find_similar_vectors` method with these parameters, and return the result to the client.
• Ensure proper error handling is in place for cases where the input data is invalid or the database operation fails.
--- 
+++ 
@@ -267,3 +267,23 @@
     client.find_one_and_delete(collection, {"EmployeeCode": EmployeeCode})
 
     return {"Message": "Successfully Deleted"}
[email protected]("/recognise_face")
+async def recognise_face(embedding: List[float], n: int):
+    """
+    Recognise a face by finding the most similar face embeddings in the database.
+
+    Args:
+        embedding (List[float]): The embedding vector of the face to be recognised.
+        n (int): The number of top similar vectors to return.
+
+    Returns:
+        dict: A dictionary containing the top n most similar face embeddings.
+
+    """
+    logging.info("Recognising face")
+    try:
+        similar_faces = client.find_similar_vectors(collection, embedding, n)
+        return {"similar_faces": similar_faces}
+    except Exception as e:
+        logging.error(f"Error recognising face: {str(e)}")
+        raise HTTPException(status_code=500, detail="Internal server error")
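
The plan asks for error handling when the input data is invalid. In addition, documents returned by `aggregate` normally carry an `_id` of type `ObjectId`, which the standard `json` module cannot encode. The sketch below shows two helpers that address both points. It is illustrative only: `validate_query`, `to_json_safe` and the `expected_dim` parameter are hypothetical names, not part of the repository.

```python
import math
from typing import Any, Dict, Iterable, List, Optional, Sequence


def validate_query(
    embedding: Sequence[float],
    n: int,
    expected_dim: Optional[int] = None,
) -> None:
    """Raise ValueError if the embedding or n cannot be used for a search."""
    if isinstance(n, bool) or not isinstance(n, int) or n < 1:
        raise ValueError("n must be a positive integer")
    if len(embedding) == 0:
        raise ValueError("embedding must not be empty")
    if expected_dim is not None and len(embedding) != expected_dim:
        raise ValueError(f"embedding must have {expected_dim} dimensions")
    for value in embedding:
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not math.isfinite(value):
            raise ValueError("embedding must contain only finite numbers")


def to_json_safe(documents: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return shallow copies of the documents with `_id` converted to str."""
    safe = []
    for doc in documents:
        copy = dict(doc)
        if "_id" in copy:
            copy["_id"] = str(copy["_id"])
        safe.append(copy)
    return safe
```

In the endpoint, `validate_query` could run before the database call, with a `ValueError` mapped to a 4xx `HTTPException` rather than the generic 500, and `to_json_safe` could wrap the result before it is returned.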
  • Running GitHub Actions for API/route.py
Check API/route.py with contents:

Ran GitHub Actions for 7b8ca4e13c930240c7aef7d25b09dd19d42e82df:


Step 3: 🔁 Code Review

I have finished reviewing the code for completeness. I did not find errors for sweep/utility_function_for_vector_similarity_s_0cb05.


This is an automated message generated by Sweep AI.
