Support for plugins that implement vector indexes #216
I think the user gets to create indexes, where an index can be assigned to one or more collections (provided those collections use the same embedding model). Allowing a collection to have multiple indexes will support trying out different indexes and comparing their performance. Allowing an index to cover multiple collections is important for things like wanting to run a similarity search that mixes TILs and blog posts and tweets, despite them being held in different embedding collections. So I think there are CLI and Python methods for:
Do the existing `llm similar` and `collection.similar()` methods use an index automatically? Maybe yes, if a collection has a "default index" defined on it; otherwise no.
Also worth considering whether the trick I used here might fit into LLM somehow. The idea is that sometimes you want to combine a similarity search with other filters - e.g. run a SQL query to filter for just posts in a specific category, then find the most similar matches to a vector within that subset. It can actually be faster to run the filter first, then build a scratch index against just the ~1,000 rows that match it. Building a FAISS index across a few thousand items, for example, is fast enough that it's a better approach than trying to query the whole similarity index first. Though it's worth testing that against brute-force similarity matching too - it might turn out that once you get below ~1,000 items you should just brute-force the comparisons and not bother with an index at all.

In any case, will the LLM indexing mechanism need to solve for this kind of filtering as well? It might well be out of scope.
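Purely as an illustration of that scratch-index trick, here's a rough sketch assuming you've already run the SQL filter and have the surviving ids and decoded embedding vectors in memory (the variable names are made up, not llm's API):

```python
import numpy as np
import faiss  # pip install faiss-cpu

# The ~1,000 rows that survived the SQL filter (toy values here)
subset_ids = ["til-1", "til-2", "til-3"]
subset_vectors = [[0.1, 0.3, 0.5], [0.2, 0.1, 0.9], [0.4, 0.4, 0.1]]

matrix = np.array(subset_vectors, dtype="float32")
faiss.normalize_L2(matrix)  # normalize in place so inner product == cosine similarity

index = faiss.IndexFlatIP(matrix.shape[1])  # scratch index built just for this query
index.add(matrix)

query = np.array([[0.1, 0.3, 0.4]], dtype="float32")
faiss.normalize_L2(query)
scores, positions = index.search(query, 2)
print([(subset_ids[i], float(s)) for i, s in zip(positions[0], scores[0])])
```

Note that a flat FAISS index is itself an exact search, just with the comparisons done in optimized native code, so a sketch like this also doubles as a way to benchmark the "below ~1,000 items, just brute-force it" question.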
If this gets added then Chroma DB may be worth trying here. It's lightweight and incredibly easy to get running compared to a few others I've tried. https://docs.trychroma.com/
Ooh, Chroma does look good - looks very easy to get it to store an index on disk: https://docs.trychroma.com/usage-guide

Looks like the actual indexing is handled by ...
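As a quick sketch of what that on-disk usage looks like (assuming a recent Chroma release with the `PersistentClient` API from that usage guide; the ids and vectors here are made up):

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma-index")  # index persisted at this path
collection = client.get_or_create_collection("posts")

# In practice the ids and embeddings would come from llm's embeddings table
collection.add(
    ids=["post-1", "post-2"],
    embeddings=[[0.1, 0.2, 0.7], [0.8, 0.1, 0.1]],
)

results = collection.query(query_embeddings=[[0.1, 0.2, 0.6]], n_results=2)
print(results["ids"], results["distances"])
```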
Another option for an index could just be PyTorch with an in-memory collection of tensors. I ran some benchmarks that looked good. But when I tried a rough implementation like this it ended up slower than native Python (according to a ...):

```diff
diff --git a/llm/embeddings.py b/llm/embeddings.py
index 32d7dc8..54741ee 100644
--- a/llm/embeddings.py
+++ b/llm/embeddings.py
@@ -9,6 +9,12 @@ from sqlite_utils.db import Table
 import time
 from typing import cast, Any, Dict, Iterable, List, Optional, Tuple, Union
 
+try:
+    import torch
+    import torch.nn.functional as F
+except ImportError:
+    torch = None
+
 
 @dataclass
 class Entry:
@@ -242,6 +248,30 @@ class Collection:
         """
         import llm
 
+        if torch is not None:
+            ids_and_embeddings = [
+                (row["id"], torch.tensor(llm.decode(row["embedding"])))
+                for row in self.db.query(
+                    "select id, embedding from embeddings where collection_id = ?",
+                    [self.id],
+                )
+            ]
+            input_vector = torch.tensor(vector)
+            scores = [
+                (id, F.cosine_similarity(input_vector.unsqueeze(0), embedding.unsqueeze(0)))
+                for id, embedding in ids_and_embeddings
+            ]
+            scores.sort(key=lambda id_and_score: id_and_score[1], reverse=True)
+            return [
+                Entry(
+                    id=id,
+                    score=score.item(),
+                    content=None,
+                    metadata=None,
+                )
+                for id, score in scores[:number]
+            ]
+
         def distance_score(other_encoded):
             other_vector = llm.decode(other_encoded)
             return llm.cosine_similarity(other_vector, vector)
```

Maybe I did something wrong here though. Would be worth spending more time seeing if PyTorch against an in-memory array can speed things up.
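One possible reason that attempt was slow is that it calls `F.cosine_similarity()` once per row inside a Python list comprehension, so much of the time may be Python overhead rather than tensor math. A batched version that stacks everything into one matrix and scores it with a single matmul might be worth benchmarking instead - a rough sketch (the helper is hypothetical, not part of llm):

```python
import llm
import torch
import torch.nn.functional as F


def similar_by_vector_torch(collection, vector, number=10):
    # Load every embedding in the collection into one (n, d) tensor
    rows = list(
        collection.db.query(
            "select id, embedding from embeddings where collection_id = ?",
            [collection.id],
        )
    )
    ids = [row["id"] for row in rows]
    matrix = torch.tensor([llm.decode(row["embedding"]) for row in rows])
    query = torch.tensor(vector)

    # Cosine similarity for every row in a single matrix-vector product
    scores = F.normalize(matrix, dim=1) @ F.normalize(query, dim=0)
    top = torch.topk(scores, k=min(number, len(ids)))
    return [(ids[i], score.item()) for i, score in zip(top.indices.tolist(), top.values)]
```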
@simonw happy to help here. Here is how Chroma's roadmap aligns with your goals.
You may also enjoy reading this Chroma proposal, where we have put a lot of thought into the pipelines to support index/collection creation and access - chroma-core/chroma#1110
If someone ever needs to move data from llm to Chroma, below is a simple script to do so. It needs a little more work to productise it though. @simonw, hope it would help if you ever need to create something like ...
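For reference, a minimal version of that kind of migration might look like the sketch below. The database path, the join against llm's `collections` table, and the one-Chroma-collection-per-llm-collection mapping are all assumptions, and it hasn't been tested against large collections:

```python
import chromadb
import llm
import sqlite_utils

db = sqlite_utils.Database("embeddings.db")  # llm's embeddings database
client = chromadb.PersistentClient(path="./chroma")

rows = list(
    db.query(
        """
        select collections.name as name, embeddings.id as id, embeddings.embedding as embedding
        from embeddings join collections on embeddings.collection_id = collections.id
        """
    )
)
for name in sorted({row["name"] for row in rows}):
    subset = [row for row in rows if row["name"] == name]
    client.get_or_create_collection(name).add(
        ids=[row["id"] for row in subset],
        embeddings=[list(llm.decode(row["embedding"])) for row in subset],
    )
```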
The `llm similar` and `collection.similar()` methods currently implement the slowest brute-force approach. I want to support faster approaches for this, like sqlite-vss and FAISS and Pinecone and suchlike... but I'd like to do so through plugins.
Many vector indexes need to be rebuilt periodically, so I need an abstraction that supports that.
I added a `modified` column to the `embeddings` table in "updated timestamp to embeddings table" (#211), with the aim of supporting this feature. I want indexes to be able to scan that table to see which items have been added or modified since they last ran, then re-index just those records.
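As a sketch of what that incremental pass might look like from an index plugin's point of view - the column name here assumes the `updated` timestamp from #211 (adjust if it ends up called `modified`), and where the plugin persists its own last-run timestamp is left as an assumption:

```python
import time

import sqlite_utils

db = sqlite_utils.Database("embeddings.db")
last_indexed = 0  # the plugin would persist this somewhere between runs

changed = db.query(
    "select id, collection_id, embedding from embeddings where updated > ?",
    [last_indexed],
)
for row in changed:
    ...  # hand just these added/modified rows to the index implementation

last_indexed = int(time.time())
```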