Research ways to speed up brute force cosine similarity #246
I got Code Interpreter to run some benchmarks for me: https://chat.openai.com/share/3b4ae422-17eb-4855-a5d7-e39d3f58c1f7

Most interesting result: [plot of NumPy vs. pure-Python timings]

That's comparing this function with and without NumPy available:

```python
import time
from typing import List, Tuple

import matplotlib.pyplot as plt


def cosine_similarity_fallback(a: List[float], b: List[float]) -> float:
    # Pure-Python cosine similarity, used when NumPy is not installed
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot_product / (magnitude_a * magnitude_b)


def top_n(
    vectors: List[List[float]], n: int, skip_numpy: bool = False
) -> Tuple[List[Tuple[int, float]], float]:
    if not skip_numpy:
        try:
            import numpy as np
            numpy_available = True
        except ImportError:
            numpy_available = False
    else:
        numpy_available = False

    start_time = time.time()

    if numpy_available:
        # NumPy-based calculation: one vectorized pass over all vectors
        target_vector = np.array(vectors[0])
        all_vectors = np.array(vectors[1:])
        dot_products = np.dot(all_vectors, target_vector)
        magnitude_a = np.linalg.norm(target_vector)
        magnitude_b = np.linalg.norm(all_vectors, axis=1)
        similarities = dot_products / (magnitude_a * magnitude_b)
        top_indices = np.argsort(similarities)[::-1][:n]
        top_scores = [(int(idx), float(similarities[idx])) for idx in top_indices]
    else:
        # Fallback to pure-Python calculation, one vector at a time
        target_vector = vectors[0]
        similarities = [
            (idx, cosine_similarity_fallback(target_vector, vec))
            for idx, vec in enumerate(vectors[1:])
        ]
        top_scores = sorted(similarities, key=lambda x: x[1], reverse=True)[:n]

    elapsed_time = time.time() - start_time
    return top_scores, elapsed_time


# Sizes for benchmarking
sizes_to_test = [1000, 2000, 5000, 10000]

# Benchmark the function with and without NumPy
benchmark_results = {'numpy': [], 'python': []}
for size in sizes_to_test:
    sample_vectors = [[float(j) / (i + 1) for j in range(1, 11)] for i in range(size)]
    # With NumPy
    _, elapsed_time_numpy = top_n(sample_vectors, 10)
    benchmark_results['numpy'].append(elapsed_time_numpy)
    # Without NumPy
    _, elapsed_time_python = top_n(sample_vectors, 10, skip_numpy=True)
    benchmark_results['python'].append(elapsed_time_python)

# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(sizes_to_test, benchmark_results['numpy'], label='NumPy', marker='o')
plt.plot(sizes_to_test, benchmark_results['python'], label='Python', marker='x')
plt.xlabel('Number of Vectors')
plt.ylabel('Time (seconds)')
plt.title('Benchmarking top_n Function')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()
```
I integrated that into LLM, but I'm not quite ready to land it yet, because I worry about the difference between loading all of the vectors into memory vs. the current approach, which compares them one at a time as part of the SQL query (lines 245 to 277 in b2fc0a1).
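For context, one way to express that one-at-a-time SQL comparison is to register a Python cosine similarity function with SQLite and let the query call it per row. This is a sketch under assumed names (`embeddings` table, `id` column, `embedding` blob of packed little-endian floats), not the actual code at b2fc0a1:

```python
import sqlite3
import struct


def decode(blob: bytes) -> list:
    # Assumes embeddings are stored as packed little-endian 32-bit floats
    return list(struct.unpack('<' + 'f' * (len(blob) // 4), blob))


def cosine_similarity(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot_product / (magnitude_a * magnitude_b)


def top_n_sql(conn: sqlite3.Connection, target: list, n: int = 10):
    # Register a one-argument SQL function that scores each row against
    # the target vector, then let SQLite handle the sort and limit
    conn.create_function(
        'similarity_to_target', 1,
        lambda blob: cosine_similarity(decode(blob), target)
    )
    return conn.execute(
        'select id, similarity_to_target(embedding) as score '
        'from embeddings order by score desc limit ?',
        [n],
    ).fetchall()
```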
Leaving this here for the moment as a future research task. The big question is whether it's worth loading every embedding into memory in exchange for this performance increase. Maybe I should have a cutoff: count the members of the collection first, and only load them all into memory if there are fewer than N, where N is 10,000 or something. Could the SQL-based comparisons be sped up as well, maybe by using an aggregation function and loading vectors in batches?
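A sketch of what that cutoff could look like, with `top_n_numpy` standing in for the NumPy fast path above and `top_n_sql` for the SQL-side comparison (the threshold value, schema, and helper names are all assumptions):

```python
IN_MEMORY_CUTOFF = 10_000  # assumed threshold; the right N needs benchmarking


def similar(conn, target: list, n: int = 10):
    # Count first; only load the whole collection when it is small enough
    (count,) = conn.execute('select count(*) from embeddings').fetchone()
    if count < IN_MEMORY_CUTOFF:
        rows = conn.execute('select id, embedding from embeddings').fetchall()
        ids = [row_id for row_id, _ in rows]
        vectors = [decode(blob) for _, blob in rows]
        # NumPy fast path over the in-memory vectors
        return [(ids[idx], score) for idx, score in top_n_numpy(target, vectors, n)]
    # Too big to load comfortably: fall back to the row-at-a-time SQL query
    return top_n_sql(conn, target, n)
```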
I do this:

Lines 26 to 29 in 85e9225

Lines 264 to 278 in 85e9225

So that I can have
Maybe I don't implement this optimization in core, but instead make it available as one of the vector indexing plugins (the dumbest one of all, which simply maintains an in-memory NumPy array for each collection) as part of this:
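A minimal sketch of that dumbest-possible index, assuming the plugin only needs `add()` and `search()` methods (the interface is invented here for illustration; LLM's actual plugin hooks may differ):

```python
import numpy as np


class InMemoryIndex:
    # Hypothetical "dumbest possible" vector index: one NumPy matrix per collection
    def __init__(self):
        self.ids: list = []
        self.matrix = None  # shape: (num_vectors, dimensions)

    def add(self, id, vector):
        row = np.array(vector, dtype=np.float32)
        row /= np.linalg.norm(row)  # pre-normalize so search is just a dot product
        self.ids.append(id)
        if self.matrix is None:
            self.matrix = row[None, :]
        else:
            self.matrix = np.vstack([self.matrix, row])

    def search(self, vector, n=10):
        query = np.array(vector, dtype=np.float32)
        query /= np.linalg.norm(query)
        # Cosine similarity reduces to a matrix-vector product on unit vectors
        scores = self.matrix @ query
        top = np.argsort(scores)[::-1][:n]
        return [(self.ids[i], float(scores[i])) for i in top]
```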
I got some good results from using PyTorch for this, with the
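For reference, a sketch of how a PyTorch version might look, using `torch.nn.functional.cosine_similarity` plus `torch.topk` (my reconstruction, not the code from the embedded snippet):

```python
import torch
import torch.nn.functional as F


def top_n_torch(target: list, vectors: list, n: int = 10):
    matrix = torch.tensor(vectors, dtype=torch.float32)  # (num_vectors, dims)
    query = torch.tensor(target, dtype=torch.float32)    # (dims,)
    # Broadcast the query against every row in a single vectorized call
    scores = F.cosine_similarity(matrix, query.unsqueeze(0), dim=1)
    top = torch.topk(scores, k=min(n, len(vectors)))
    return [(int(i), float(s)) for i, s in zip(top.indices, top.values)]
```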
Might be faster to use numpy or some other trick.