⚡️ Speed up `_hamming_distance()` by 50% in `libs/langchain/langchain/evaluation/embedding_distance/base.py` #7

codeflash-ai · 2024-02-02T04:55:20Z

📄 `_hamming_distance()` in `libs/langchain/langchain/evaluation/embedding_distance/base.py`

📈 Performance went up by 50% (0.50x faster)

⏱️ Runtime went down from 749.61μs to 500.81μs

Explanation and details

One way to optimize this function is to replace NumPy's np.mean() with a direct calculation of the mean. The '!=' operator returns a boolean array. Summing that array counts the True values (each interpreted as 1), and dividing the count by the size of the array gives the same result as taking the mean.

Consider the rewrite below:
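A minimal sketch of what such a rewrite might look like, assuming the original body was `np.mean(a != b)`. The two function names are hypothetical stand-ins for illustration and are not taken from the PR's diff.

```python
import numpy as np


def hamming_distance_mean(a: np.ndarray, b: np.ndarray) -> float:
    """Assumed baseline: fraction of differing positions via np.mean."""
    return float(np.mean(a != b))


def hamming_distance_sum(a: np.ndarray, b: np.ndarray) -> float:
    """Sketch of the described rewrite: sum the boolean mask, divide by its size."""
    diff = a != b  # boolean array, True where the elements differ
    return float(np.sum(diff) / diff.size)


a = np.array([0, 1, 0, 1])
b = np.array([0, 0, 0, 1])
print(hamming_distance_mean(a, b), hamming_distance_sum(a, b))
```

Both variants count the differing positions and divide by the vector length, so they should agree on any pair of equal-length, non-empty vectors.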

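This new version of the function avoids the np.mean() call. Both versions are O(n) in the length of the vectors, so the asymptotic complexity is unchanged; any speedup comes from lower constant-factor overhead per call, which is consistent with the measured runtime dropping from 749.61μs to 500.81μs.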
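Since the gain is a constant-factor effect, it is worth measuring on the machine in question. The following is a rough micro-benchmark sketch using `timeit`; the function names, the vector length of 1536, and the repeat counts are all assumptions chosen for illustration, not values from the PR.

```python
import timeit

import numpy as np


def hamming_mean(a, b):
    # Assumed baseline form
    return float(np.mean(a != b))


def hamming_sum(a, b):
    # Assumed rewritten form: sum of the boolean mask over its size
    diff = a != b
    return float(np.sum(diff) / diff.size)


rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=1536)  # hypothetical embedding-sized vectors
b = rng.integers(0, 2, size=1536)


def best_time(fn, repeat=5, number=200):
    """Best per-call time in seconds over several repeats."""
    runs = timeit.repeat(lambda: fn(a, b), repeat=repeat, number=number)
    return min(runs) / number


t_mean = best_time(hamming_mean)
t_sum = best_time(hamming_sum)
print(f"np.mean: {t_mean * 1e6:.2f} us, sum/size: {t_sum * 1e6:.2f} us")
```

Results will vary with NumPy version, vector length, and hardware, so the printed ratio should be treated as indicative only.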

Correctness verification

The new optimized code was tested for correctness. The results are listed below.

✅ 0 Passed − ⚙️ Existing Unit Tests

✅ 0 Passed − 🎨 Inspired Regression Tests

✅ 11 Passed − 🌀 Generated Regression Tests

# imports
import pytest  # used for our unit tests
import numpy as np  # used for numerical operations
from langchain.evaluation.embedding_distance.base import _EmbeddingDistanceChainMixin

# unit tests

def test_equal_vectors():
    # Test with two identical vectors
    a = np.array([0, 0, 0, 0])
    b = np.array([0, 0, 0, 0])
    assert _EmbeddingDistanceChainMixin._hamming_distance(a, b) == 0

def test_completely_different_vectors():
    # Test with two vectors that are completely different
    a = np.array([0, 0, 0, 0])
    b = np.array([1, 1, 1, 1])
    assert _EmbeddingDistanceChainMixin._hamming_distance(a, b) == 1

def test_partial_differences():
    # Test with vectors that have some elements different
    a = np.array([0, 1, 0, 1])
    b = np.array([0, 0, 0, 1])
    assert _EmbeddingDistanceChainMixin._hamming_distance(a, b) == 0.25

def test_different_types():
    # Test with boolean vectors; two of the three positions differ
    a = np.array([True, False, True])
    b = np.array([True, True, False])
    assert _EmbeddingDistanceChainMixin._hamming_distance(a, b) == pytest.approx(2 / 3)

def test_empty_vectors():
    # Test with two empty vectors
    a = np.array([])
    b = np.array([])
    assert _EmbeddingDistanceChainMixin._hamming_distance(a, b) == 0

def test_single_element_vectors():
    # Test with vectors that contain only one element
    a = np.array([0])
    b = np.array([0])
    assert _EmbeddingDistanceChainMixin._hamming_distance(a, b) == 0

def test_large_vectors():
    # Test with very large vectors
    a = np.zeros(10000)
    b = np.ones(10000)
    assert _EmbeddingDistanceChainMixin._hamming_distance(a, b) == 1

def test_non_binary_vectors():
    # Test with non-binary vectors
    a = np.array([1, 2, 3])
    b = np.array([3, 2, 1])
    assert _EmbeddingDistanceChainMixin._hamming_distance(a, b) == pytest.approx(2 / 3)

def test_vectors_with_nan():
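    # Test with vectors containing np.nan; NaN compares unequal to every
    # value, so positions 1 and 2 count as differences and the result is 2/3
    a = np.array([0, np.nan, 0])
    b = np.array([0, 0, np.nan])
    assert _EmbeddingDistanceChainMixin._hamming_distance(a, b) == pytest.approx(2 / 3)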

def test_vectors_with_inf():
    # Test with vectors containing np.inf
    a = np.array([np.inf, 1, 0])
    b = np.array([np.inf, 0, 1])
    assert _EmbeddingDistanceChainMixin._hamming_distance(a, b) == pytest.approx(2 / 3)

def test_vectors_with_different_lengths():
    # Test with vectors of different lengths, should raise ValueError
    a = np.array([0, 0])
    b = np.array([0, 0, 0])
    with pytest.raises(ValueError):
        _EmbeddingDistanceChainMixin._hamming_distance(a, b)
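
For the NaN and inf cases above, it may help to see how NumPy's elementwise `!=` treats those values. This is a small standalone sketch that assumes the distance is computed from the mask `a != b`; the variable names are illustrative.

```python
import numpy as np

# NaN is unequal to every value, including another NaN
a = np.array([0.0, np.nan, 0.0])
b = np.array([0.0, 0.0, np.nan])
nan_mask = a != b
nan_distance = float(np.mean(nan_mask))

# inf compares equal to inf, so only the last two positions differ
c = np.array([np.inf, 1.0, 0.0])
d = np.array([np.inf, 0.0, 1.0])
inf_mask = c != d
inf_distance = float(np.mean(inf_mask))

print(nan_mask, nan_distance)
print(inf_mask, inf_distance)
```

Because the mask is boolean, the mean taken over it is a finite fraction even when the inputs contain NaN.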

aphexcx

Correct: yes
Performant: needs manual verification
Valuable: possibly

⚡️ Speed up _hamming_distance by 50%

f3080b8

codeflash-ai bot added the "⚡️ codeflash" label (Optimization PR opened by CodeFlash AI) Feb 2, 2024

codeflash-ai bot requested a review from aphexcx February 2, 2024 04:55

aphexcx reviewed Feb 26, 2024

Merge branch 'master' into codeflash-optimize-function-_hamming_dista…

0fc8120

…nce-2024-02-02T04.55.15
