Research ways to speed up brute force cosine similarity #246
I got Code Interpreter to run some benchmarks for me: https://chat.openai.com/share/3b4ae422-17eb-4855-a5d7-e39d3f58c1f7

Most interesting result: [plot of NumPy vs. pure-Python timings]

That's comparing this function with and without NumPy available:

```python
import time
from typing import List, Tuple

import matplotlib.pyplot as plt


def cosine_similarity_fallback(a: List[float], b: List[float]) -> float:
    # Pure-Python cosine similarity, used when NumPy is not installed
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot_product / (magnitude_a * magnitude_b)


def top_n(
    vectors: List[List[float]], n: int, skip_numpy: bool = False
) -> Tuple[List[Tuple[int, float]], float]:
    if not skip_numpy:
        try:
            import numpy as np
            numpy_available = True
        except ImportError:
            numpy_available = False
    else:
        numpy_available = False

    start_time = time.time()

    if numpy_available:
        # NumPy-based calculation: one vectorized pass over all vectors
        target_vector = np.array(vectors[0])
        all_vectors = np.array(vectors[1:])
        dot_products = np.dot(all_vectors, target_vector)
        magnitude_a = np.linalg.norm(target_vector)
        magnitude_b = np.linalg.norm(all_vectors, axis=1)
        similarities = dot_products / (magnitude_a * magnitude_b)
        top_indices = np.argsort(similarities)[::-1][:n]
        top_scores = [(int(idx), float(similarities[idx])) for idx in top_indices]
    else:
        # Fallback to pure-Python calculation, one vector at a time
        target_vector = vectors[0]
        similarities = [
            (idx, cosine_similarity_fallback(target_vector, vec))
            for idx, vec in enumerate(vectors[1:])
        ]
        top_scores = sorted(similarities, key=lambda x: x[1], reverse=True)[:n]

    elapsed_time = time.time() - start_time
    return top_scores, elapsed_time


# Sizes for benchmarking
sizes_to_test = [1000, 2000, 5000, 10000]

# Benchmark the function with and without NumPy
benchmark_results = {'numpy': [], 'python': []}
for size in sizes_to_test:
    sample_vectors = [[float(j) / (i + 1) for j in range(1, 11)] for i in range(size)]
    # With NumPy
    _, elapsed_time_numpy = top_n(sample_vectors, 10)
    benchmark_results['numpy'].append(elapsed_time_numpy)
    # Without NumPy
    _, elapsed_time_python = top_n(sample_vectors, 10, skip_numpy=True)
    benchmark_results['python'].append(elapsed_time_python)

# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(sizes_to_test, benchmark_results['numpy'], label='NumPy', marker='o')
plt.plot(sizes_to_test, benchmark_results['python'], label='Python', marker='x')
plt.xlabel('Number of Vectors')
plt.ylabel('Time (seconds)')
plt.title('Benchmarking top_n Function')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()
```
I integrated that into LLM, but I'm not quite ready to land it yet, because I worry about the difference between loading all of the vectors into memory vs. the current approach, which compares them one at a time as part of the SQL query (lines 245 to 277 in b2fc0a1).
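For context, one way to express that one-at-a-time SQL comparison is to register a Python cosine similarity function with SQLite and let the query call it per row. This is a sketch under assumed names (`embeddings` table, `id` column, `embedding` blob of packed little-endian floats), not the actual code at b2fc0a1:

```python
import sqlite3
import struct


def decode(blob: bytes) -> list:
    # Assumes embeddings are stored as packed little-endian 32-bit floats
    return list(struct.unpack('<' + 'f' * (len(blob) // 4), blob))


def cosine_similarity(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot_product / (magnitude_a * magnitude_b)


def top_n_sql(conn: sqlite3.Connection, target: list, n: int = 10):
    # Register a one-argument SQL function that scores each row against
    # the target vector, then let SQLite handle the sort and limit
    conn.create_function(
        'similarity_to_target', 1,
        lambda blob: cosine_similarity(decode(blob), target)
    )
    return conn.execute(
        'select id, similarity_to_target(embedding) as score '
        'from embeddings order by score desc limit ?',
        [n],
    ).fetchall()
```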
Leaving this here for the moment as a future research task. The big question is whether it's worth loading every embedding into memory in exchange for this performance increase. Maybe I should have a cutoff: count the members of the collection first, and only load them all into memory if there are fewer than N, where N is 10,000 or something. Could the SQL-based comparisons be sped up as well, maybe by using an aggregation function and loading vectors in batches?
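A sketch of what that cutoff could look like, with `top_n_numpy` standing in for the NumPy fast path above and `top_n_sql` for the SQL-side comparison (the threshold value, schema, and helper names are all assumptions):

```python
IN_MEMORY_CUTOFF = 10_000  # assumed threshold; the right N needs benchmarking


def similar(conn, target: list, n: int = 10):
    # Count first; only load the whole collection when it is small enough
    (count,) = conn.execute('select count(*) from embeddings').fetchone()
    if count < IN_MEMORY_CUTOFF:
        rows = conn.execute('select id, embedding from embeddings').fetchall()
        ids = [row_id for row_id, _ in rows]
        vectors = [decode(blob) for _, blob in rows]
        # NumPy fast path over the in-memory vectors
        return [(ids[idx], score) for idx, score in top_n_numpy(target, vectors, n)]
    # Too big to load comfortably: fall back to the row-at-a-time SQL query
    return top_n_sql(conn, target, n)
```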
I do this:

Lines 26 to 29 in 85e9225

Lines 264 to 278 in 85e9225

So that I can have
Maybe I don't implement this optimization in core, but instead make it available as one of the vector indexing plugins (the dumbest one of all, which simply maintains an in-memory NumPy array for each collection) as part of this:
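A minimal sketch of that dumbest-possible index, assuming the plugin only needs `add()` and `search()` methods (the interface is invented here for illustration; LLM's actual plugin hooks may differ):

```python
import numpy as np


class InMemoryIndex:
    # Hypothetical "dumbest possible" vector index: one NumPy matrix per collection
    def __init__(self):
        self.ids: list = []
        self.matrix = None  # shape: (num_vectors, dimensions)

    def add(self, id, vector):
        row = np.array(vector, dtype=np.float32)
        row /= np.linalg.norm(row)  # pre-normalize so search is just a dot product
        self.ids.append(id)
        if self.matrix is None:
            self.matrix = row[None, :]
        else:
            self.matrix = np.vstack([self.matrix, row])

    def search(self, vector, n=10):
        query = np.array(vector, dtype=np.float32)
        query /= np.linalg.norm(query)
        # Cosine similarity reduces to a matrix-vector product on unit vectors
        scores = self.matrix @ query
        top = np.argsort(scores)[::-1][:n]
        return [(self.ids[i], float(scores[i])) for i in top]
```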
I got some good results from using PyTorch for this, with the
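For reference, a sketch of how a PyTorch version might look, using `torch.nn.functional.cosine_similarity` plus `torch.topk` (my reconstruction, not the code from the embedded snippet):

```python
import torch
import torch.nn.functional as F


def top_n_torch(target: list, vectors: list, n: int = 10):
    matrix = torch.tensor(vectors, dtype=torch.float32)  # (num_vectors, dims)
    query = torch.tensor(target, dtype=torch.float32)    # (dims,)
    # Broadcast the query against every row in a single vectorized call
    scores = F.cosine_similarity(matrix, query.unsqueeze(0), dim=1)
    top = torch.topk(scores, k=min(n, len(vectors)))
    return [(int(i), float(s)) for i, s in zip(top.indices, top.values)]
```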
Might be faster to use numpy or some other trick.