
⚡️ Speed up PandasSeries.to_spec() by 116% in src/bentoml/_internal/io_descriptors/pandas.py #8

Status: Open · wants to merge 1 commit into base: main
Conversation

codeflash-ai[bot] commented Jun 29, 2024

📄 PandasSeries.to_spec() in src/bentoml/_internal/io_descriptors/pandas.py

📈 Performance improved by 116% (2.16x faster)

⏱️ Runtime went down from 3.34 milliseconds to 1.55 milliseconds
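For context, `to_spec()` serializes the descriptor's configuration into a plain dict. A minimal usage sketch follows; the expected structure is inferred from the generated regression tests below, not taken from the PR diff:

from bentoml.io import PandasSeries

# Default descriptor: "records" orient, no dtype or shape constraints.
spec = PandasSeries().to_spec()

# Expected shape of the result (values inferred from the tests below):
# {
#     "id": "bentoml.io.PandasSeries",
#     "args": {
#         "orient": "records",
#         "dtype": "null",
#         "shape": None,
#         "enforce_dtype": False,
#         "enforce_shape": False,
#     },
# }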

Explanation and details

The optimized version keeps the same functionality but runs faster by reducing redundant operations and avoiding unnecessary dictionary operations.

Explanation of Optimizations

  1. Removed redundancy in `_convert_dtype`.

    • The condition for `None` is checked first so the function returns `"null"` immediately, skipping every other check when `value` is `None`.
    • The checks for `str` and `bool` are combined into a single branch, since both simply convert the value with `str()` and need no further processing.
    • The check for `LazyType["ext.NpNDArray"]` is placed before the `np.dtype` check, on the assumption that an ndarray should be matched before the more specific `np.dtype`.
  2. Improved readability and reduced type checks.

    • Rearranging the condition checks logically keeps the same functionality with fewer comparisons and a clearer structure.

These optimizations reduce the number of condition checks and improve the structure of the function for faster execution.
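A minimal sketch of the reordered helper described above, using plain numpy isinstance checks in place of BentoML's `LazyType` wrapper; the branch order follows the description, not the exact PR diff:

from __future__ import annotations

import logging
import typing as t

import numpy as np

logger = logging.getLogger(__name__)


def _convert_dtype(value: t.Any) -> str | dict[str, t.Any] | None:
    # 1. None is handled first so the common "no dtype" case returns immediately.
    if value is None:
        return "null"
    # 2. str and bool share one branch: both are simply stringified.
    if isinstance(value, (str, bool)):
        return str(value)
    # 3. An ndarray is matched before the more specific np.dtype check;
    #    its element dtype is what ends up in the spec.
    if isinstance(value, np.ndarray):
        return str(value.dtype)
    if isinstance(value, np.dtype):
        return str(value)
    # Per-column dtype mappings are converted recursively.
    if isinstance(value, dict):
        return {str(k): _convert_dtype(v) for k, v in value.items()}
    logger.warning("%s is not yet supported.", type(value))
    return None

With this ordering, the frequent `None` and string cases exit after a single check, while the ndarray, dtype, and dict cases keep the behavior exercised by the regression tests below.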

Correctness verification

The new optimized code was tested for correctness. The results are listed below.

🔘 (none found) − ⚙️ Existing Unit Tests

✅ 33 Passed − 🌀 Generated Regression Tests

# imports
from __future__ import annotations

import typing as t

import numpy as np
import pytest  # used for our unit tests
from bentoml._internal import external_typing as ext
from bentoml._internal.io_descriptors.base import IODescriptor
from bentoml._internal.types import LazyType
from bentoml.io import PandasSeries  # descriptor under test

# unit tests

# Basic Functionality
def test_default_initialization():
    ps = PandasSeries()
    spec = ps.to_spec()
    assert spec["id"] == "bentoml.io.PandasSeries"
    assert spec["args"]["orient"] == "records"
    assert spec["args"]["dtype"] == "null"
    assert spec["args"]["shape"] is None
    assert spec["args"]["enforce_dtype"] is False
    assert spec["args"]["enforce_shape"] is False

def test_custom_initialization():
    ps = PandasSeries(orient="split", dtype="int32", enforce_dtype=True, shape=(10,), enforce_shape=True)
    spec = ps.to_spec()
    assert spec["id"] == "bentoml.io.PandasSeries"
    assert spec["args"]["orient"] == "split"
    assert spec["args"]["dtype"] == "int32"
    assert spec["args"]["shape"] == (10,)
    assert spec["args"]["enforce_dtype"] is True
    assert spec["args"]["enforce_shape"] is True

# Handling Different Data Types for `dtype`
def test_string_dtype():
    ps = PandasSeries(dtype="float64")
    spec = ps.to_spec()
    assert spec["args"]["dtype"] == "float64"

def test_numpy_dtype():
    ps = PandasSeries(dtype=np.dtype('int32'))
    spec = ps.to_spec()
    assert spec["args"]["dtype"] == "int32"

def test_numpy_array():
    ps = PandasSeries(dtype=np.array([1, 2, 3]))
    spec = ps.to_spec()
    assert spec["args"]["dtype"] == "int64"

def test_dict_dtype():
    ps = PandasSeries(dtype={"column1": "int32", "column2": "float64"})
    spec = ps.to_spec()
    assert spec["args"]["dtype"] == {"column1": "int32", "column2": "float64"}

def test_bool_dtype():
    ps = PandasSeries(dtype=True)
    spec = ps.to_spec()
    assert spec["args"]["dtype"] == "True"

# Handling Edge Cases
def test_none_dtype():
    ps = PandasSeries(dtype=None)
    spec = ps.to_spec()
    assert spec["args"]["dtype"] == "null"

def test_unsupported_dtype():
    class CustomType:
        pass
    ps = PandasSeries(dtype=CustomType())
    spec = ps.to_spec()
    assert spec["args"]["dtype"] is None

# Shape Variations
def test_no_shape():
    ps = PandasSeries(shape=None)
    spec = ps.to_spec()
    assert spec["args"]["shape"] is None

def test_specific_shape():
    ps = PandasSeries(shape=(10,))
    spec = ps.to_spec()
    assert spec["args"]["shape"] == (10,)

def test_multi_dimensional_shape():
    ps = PandasSeries(shape=(10, 5))
    spec = ps.to_spec()
    assert spec["args"]["shape"] == (10, 5)

# Enforcement Flags
def test_enforce_dtype():
    ps = PandasSeries(enforce_dtype=True)
    spec = ps.to_spec()
    assert spec["args"]["enforce_dtype"] is True

def test_enforce_shape():
    ps = PandasSeries(enforce_shape=True)
    spec = ps.to_spec()
    assert spec["args"]["enforce_shape"] is True

# Large Scale Test Cases
def test_large_dtype_dict():
    dtype_dict = {f"column{i}": "float64" for i in range(1000)}
    ps = PandasSeries(dtype=dtype_dict)
    spec = ps.to_spec()
    assert spec["args"]["dtype"] == dtype_dict

def test_complex_nested_dict():
    nested_dict = {"level1": {"level2": {"level3": "int32"}}}
    ps = PandasSeries(dtype=nested_dict)
    spec = ps.to_spec()
    assert spec["args"]["dtype"] == nested_dict

# Rare or Unexpected Edge Cases
def test_deeply_nested_dict():
    nested_dict = {"level1": {"level2": {"level3": {"level4": "int32"}}}}
    ps = PandasSeries(dtype=nested_dict)
    spec = ps.to_spec()
    assert spec["args"]["dtype"] == nested_dict

def test_mixed_dtype_dict():
    mixed_dict = {"column1": "int32", "column2": np.dtype('float64'), "column3": {"sub_column": "bool"}}
    ps = PandasSeries(dtype=mixed_dict)
    spec = ps.to_spec()
    assert spec["args"]["dtype"] == {"column1": "int32", "column2": "float64", "column3": {"sub_column": "bool"}}

def test_custom_object_dtype():
    class CustomObject:
        pass
    ps = PandasSeries(dtype=CustomObject())
    spec = ps.to_spec()
    assert spec["args"]["dtype"] is None

def test_callable_dtype():
    ps = PandasSeries(dtype=lambda x: x)
    spec = ps.to_spec()
    assert spec["args"]["dtype"] is None

def test_zero_dimensions_shape():
    ps = PandasSeries(shape=())
    spec = ps.to_spec()
    assert spec["args"]["shape"] == ()

def test_negative_dimensions_shape():
    ps = PandasSeries(shape=(-1, 10))
    spec = ps.to_spec()
    assert spec["args"]["shape"] == (-1, 10)

def test_special_characters_in_dtype():
    ps = PandasSeries(dtype="int32\nfloat64")
    spec = ps.to_spec()
    assert spec["args"]["dtype"] == "int32\nfloat64"

def test_unicode_characters_in_dtype():
    ps = PandasSeries(dtype="int32\u2603")
    spec = ps.to_spec()
    assert spec["args"]["dtype"] == "int32\u2603"

def test_empty_string_orient():
    ps = PandasSeries(orient="")
    spec = ps.to_spec()
    assert spec["args"]["orient"] == ""

def test_non_standard_orient():
    ps = PandasSeries(orient="non_standard_orient")
    spec = ps.to_spec()
    assert spec["args"]["orient"] == "non_standard_orient"

def test_large_nested_dict():
    large_dict = {f"level{i}": {f"sub_level{i}": "int32"} for i in range(100)}
    ps = PandasSeries(dtype=large_dict)
    spec = ps.to_spec()
    assert spec["args"]["dtype"] == large_dict

def test_large_numpy_array():
    large_array = np.random.rand(1000, 1000)
    ps = PandasSeries(dtype=large_array)
    spec = ps.to_spec()
    assert spec["args"]["dtype"] == str(large_array.dtype)

# The descriptor logs a warning through the module's logger when a dtype is unsupported.
import logging

def test_logging_unsupported_dtype(caplog):
    class UnsupportedType:
        pass
    with caplog.at_level(logging.WARNING):
        ps = PandasSeries(dtype=UnsupportedType())
        ps.to_spec()
        assert "is not yet supported" in caplog.text

def test_invalid_shape_type():
    ps = PandasSeries(shape="invalid_shape")
    spec = ps.to_spec()
    assert spec["args"]["shape"] == "invalid_shape"

def test_concurrent_access():
    import threading

    ps = PandasSeries(dtype="int32")

    def access_to_spec():
        for _ in range(1000):
            ps.to_spec()

    threads = [threading.Thread(target=access_to_spec) for _ in range(10)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    spec = ps.to_spec()
    assert spec["args"]["dtype"] == "int32"

def test_mutable_default_argument():
    ps1 = PandasSeries(dtype=[])
    ps2 = PandasSeries(dtype=[])
    assert ps1.to_spec()["args"]["dtype"] == "null"
    assert ps2.to_spec()["args"]["dtype"] == "null"

def test_corrupted_dtype():
    corrupted_data = b'\x80\x03}q\x00(X\x04\x00\x00\x00dataq\x01X\x05\x00\x00\x00valueq\x02u.'
    ps = PandasSeries(dtype=corrupted_data)
    spec = ps.to_spec()
    assert spec["args"]["dtype"] is None

🔘 (none found) − ⏪ Replay Tests

codeflash-ai[bot] added the ⚡️ codeflash label and requested a review from ivillar on Jun 29, 2024
ivillar commented Jul 1, 2024

This PR modifies the existing function and also creates a new function with the same name.
