⚡️ Speed up _hamming_distance() by 50% in libs/langchain/langchain/evaluation/embedding_distance/base.py #7

Open
wants to merge 2 commits into
base: master
from

codeflash-ai[bot]

@codeflash-ai codeflash-ai bot commented Feb 2, 2024

📄 _hamming_distance() in libs/langchain/langchain/evaluation/embedding_distance/base.py

📈 Performance went up by 50% (0.50x faster)

⏱️ Runtime went down from 749.61μs to 500.81μs
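The two headline figures agree if "50%" is read as a gain in speed rather than a cut in runtime. A quick arithmetic check, using only the two runtimes reported above (the variable names are illustrative):

```python
# Runtimes reported in the PR description, in microseconds.
before_us = 749.61
after_us = 500.81

# Speedup as a ratio: how many times faster the new version runs.
speedup = before_us / after_us

# The same change expressed as "percent faster" and as "percent less runtime".
percent_faster = (speedup - 1) * 100
percent_less_time = (1 - after_us / before_us) * 100

print(f"speedup: {speedup:.2f}x")
print(f"faster by: {percent_faster:.1f}%")
print(f"runtime reduced by: {percent_less_time:.1f}%")
```

Under this reading the new version runs at roughly 1.5 times the original speed, which corresponds to about a one-third reduction in runtime.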

Explanation and details

One way to optimize this function is to avoid NumPy's np.mean() and compute the mean directly. The '!=' operator returns a boolean array. Summing that array counts the True values (each is treated as 1), and dividing the count by the size of the array gives the same result as taking the mean.

Consider the rewrite below:

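This new version of the function avoids the call to np.mean(). Both versions are O(n) in the vector length, so the asymptotic complexity is unchanged. The reported speedup comes from a smaller constant factor, estimated here as roughly three passes over the data reduced to two.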
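The diff itself does not appear in this conversation view. As a minimal sketch of the idea described above, assuming the original body is `np.mean(a != b)` and the rewrite divides the mismatch count by the array size (both function names are hypothetical, and the code in the PR may differ):

```python
import numpy as np


def hamming_distance_mean(a: np.ndarray, b: np.ndarray) -> float:
    """Assumed baseline: fraction of differing positions via np.mean."""
    return float(np.mean(a != b))


def hamming_distance_sum(a: np.ndarray, b: np.ndarray) -> float:
    """Sketch of the described rewrite: count mismatches, divide by size."""
    mismatches = a != b  # boolean array, True where the elements differ
    return float(np.sum(mismatches) / mismatches.size)


a = np.array([0, 1, 0, 1])
b = np.array([0, 0, 0, 1])

# Both formulations should agree on the fraction of mismatched positions.
print(hamming_distance_mean(a, b), hamming_distance_sum(a, b))
```

Both functions compute the same quantity for non-empty inputs, so any difference between them is one of per-call overhead rather than of result.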

Correctness verification

The new optimized code was tested for correctness. The results are listed below.

✅ 0 Passed − ⚙️ Existing Unit Tests

✅ 0 Passed − 🎨 Inspired Regression Tests

✅ 11 Passed − 🌀 Generated Regression Tests

# imports
import pytest  # used for our unit tests
import numpy as np  # used for numerical operations
from langchain.evaluation.embedding_distance.base import _EmbeddingDistanceChainMixin

# unit tests

def test_equal_vectors():
    # Test with two identical vectors
    a = np.array([0, 0, 0, 0])
    b = np.array([0, 0, 0, 0])
    assert _EmbeddingDistanceChainMixin._hamming_distance(a, b) == 0

def test_completely_different_vectors():
    # Test with two vectors that are completely different
    a = np.array([0, 0, 0, 0])
    b = np.array([1, 1, 1, 1])
    assert _EmbeddingDistanceChainMixin._hamming_distance(a, b) == 1

def test_partial_differences():
    # Test with vectors that have some elements different
    a = np.array([0, 1, 0, 1])
    b = np.array([0, 0, 0, 1])
    assert _EmbeddingDistanceChainMixin._hamming_distance(a, b) == 0.25

def test_different_types():
    # Test with boolean vectors; two of the three positions differ
    a = np.array([True, False, True])
    b = np.array([True, True, False])
    # Compare against 2/3 directly: approx(0.6666, 0.0001) is marginally too tight
    assert _EmbeddingDistanceChainMixin._hamming_distance(a, b) == pytest.approx(2 / 3)

def test_empty_vectors():
    # Test with two empty vectors
    a = np.array([])
    b = np.array([])
    assert _EmbeddingDistanceChainMixin._hamming_distance(a, b) == 0

def test_single_element_vectors():
    # Test with vectors that contain only one element
    a = np.array([0])
    b = np.array([0])
    assert _EmbeddingDistanceChainMixin._hamming_distance(a, b) == 0

def test_large_vectors():
    # Test with very large vectors
    a = np.zeros(10000)
    b = np.ones(10000)
    assert _EmbeddingDistanceChainMixin._hamming_distance(a, b) == 1

def test_non_binary_vectors():
    # Test with non-binary vectors
    a = np.array([1, 2, 3])
    b = np.array([3, 2, 1])
    assert _EmbeddingDistanceChainMixin._hamming_distance(a, b) == pytest.approx(2 / 3)

def test_vectors_with_nan():
    # Test with vectors containing np.nan
    # NaN compares unequal to every value, so both NaN positions count as mismatches
    a = np.array([0, np.nan, 0])
    b = np.array([0, 0, np.nan])
    assert _EmbeddingDistanceChainMixin._hamming_distance(a, b) == pytest.approx(2 / 3)

def test_vectors_with_inf():
    # Test with vectors containing np.inf
    a = np.array([np.inf, 1, 0])
    b = np.array([np.inf, 0, 1])
    assert _EmbeddingDistanceChainMixin._hamming_distance(a, b) == pytest.approx(2 / 3)

def test_vectors_with_different_lengths():
    # Test with vectors of different lengths, should raise ValueError
    a = np.array([0, 0])
    b = np.array([0, 0, 0])
    with pytest.raises(ValueError):
        _EmbeddingDistanceChainMixin._hamming_distance(a, b)
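The NaN and infinity cases depend on how NumPy's elementwise `!=` treats those values, not on how the mean is computed afterwards. A small sketch of that behavior, using only the comparison operator (the variable names are illustrative):

```python
import numpy as np

# NaN compares unequal to every value, including another NaN,
# so any position holding NaN is counted as a mismatch.
a = np.array([0, np.nan, 0])
b = np.array([0, 0, np.nan])
mismatch_nan = a != b
print(mismatch_nan, np.mean(mismatch_nan))

# Infinities of the same sign compare equal, so they do not count as a mismatch.
c = np.array([np.inf, 1, 0])
d = np.array([np.inf, 0, 1])
mismatch_inf = c != d
print(mismatch_inf, np.mean(mismatch_inf))
```

In both cases the comparison yields an ordinary boolean array, so the resulting distance is a finite fraction rather than NaN.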

@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by CodeFlash AI label Feb 2, 2024
@codeflash-ai codeflash-ai bot requested a review from aphexcx February 2, 2024 04:55

@aphexcx aphexcx left a comment

Correct: yes
Performant: needs manual verification
Valuable: possibly
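One way to do the manual performance check is a small `timeit` comparison. The sketch below assumes the two candidate bodies are `np.mean(a != b)` and `np.sum(a != b) / a.size`. The function names, vector length, and repeat counts are arbitrary illustrative choices, and timings will vary with the machine and NumPy version.

```python
import timeit

import numpy as np


def via_mean(a, b):
    # Assumed baseline formulation.
    return np.mean(a != b)


def via_sum(a, b):
    # Assumed rewritten formulation.
    return np.sum(a != b) / a.size


# Fixed seed so both candidates are timed on identical inputs.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=1536)  # length chosen arbitrarily for illustration
b = rng.integers(0, 2, size=1536)


def best_time_us(func, number=2000, repeat=5):
    """Return the best observed time per call, in microseconds."""
    timings = timeit.repeat(lambda: func(a, b), number=number, repeat=repeat)
    return min(timings) / number * 1e6


for name, func in [("np.mean", via_mean), ("np.sum / size", via_sum)]:
    print(f"{name}: {best_time_us(func):.2f} us per call")
```

Taking the minimum over several repeats reduces noise from other processes. Rerunning with a few different vector lengths would show whether any gain holds beyond small inputs.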

Labels
⚡️ codeflash Optimization PR opened by CodeFlash AI