perf: add optimized BinaryViewArray comparison kernels #13839

Merged Jan 19, 2024 (1 commit)

orlp (Collaborator) commented Jan 19, 2024

This should speed up our string comparisons quite a bit, especially for exact equality, and doubly so for exact equality with a constant small string.
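
The description does not spell out how the kernels work. As a rough illustration of why exact equality, and especially equality against a short constant, can be cheap on a view-based layout, here is a minimal Python sketch. It assumes the Arrow string-view layout: each value is a 16-byte view holding a 4-byte length followed by either up to 12 inline bytes, or a 4-byte prefix plus a buffer index and offset. The helper names `make_view`, `view_eq`, and `count_eq_small` are hypothetical and are not part of Polars; the actual kernels are written in Rust and may differ.

```python
import struct

INLINE_MAX = 12  # assumed Arrow string-view rule: up to 12 bytes are stored inline


def make_view(data: bytes, buffer: bytearray) -> bytes:
    """Encode `data` as a 16-byte view, appending long payloads to `buffer`."""
    n = len(data)
    if n <= INLINE_MAX:
        # u32 length + payload, zero-padded to 12 bytes by the "12s" format.
        return struct.pack("<I12s", n, data)
    offset = len(buffer)
    buffer.extend(data)
    # u32 length + 4-byte prefix + u32 buffer index (always 0 here) + u32 offset.
    return struct.pack("<I4sII", n, data[:4], 0, offset)


def view_eq(a: bytes, buf_a: bytearray, b: bytes, buf_b: bytearray) -> bool:
    """Compare two views, touching the data buffers only when unavoidable."""
    # The first 8 bytes hold the length and the first 4 payload bytes.
    # Most unequal strings are rejected here without any buffer access.
    if a[:8] != b[:8]:
        return False
    n = struct.unpack_from("<I", a)[0]
    if n <= INLINE_MAX:
        # Inline strings: the remaining 8 bytes are the rest of the payload.
        return a[8:] == b[8:]
    off_a = struct.unpack_from("<I", a, 12)[0]
    off_b = struct.unpack_from("<I", b, 12)[0]
    return buf_a[off_a:off_a + n] == buf_b[off_b:off_b + n]


def count_eq_small(views: list, needle: bytes) -> int:
    """Count views equal to a short constant using one 16-byte compare per row."""
    assert len(needle) <= INLINE_MAX
    # Build the needle's view once; a long string can never match it because
    # its length field is larger than INLINE_MAX.
    needle_view = struct.pack("<I12s", len(needle), needle)
    return sum(v == needle_view for v in views)


buf = bytearray()
views = [make_view(w, buf) for w in [b"the", b"whale", b"the", b"circumnavigation!"]]
print(count_eq_small(views, b"the"))
```

In this sketch the constant-string case never looks at the data buffer at all, which is one plausible reason a short constant benefits the most.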

github-actions bot added labels Jan 19, 2024: performance (Performance issues or improvements), python (Related to Python Polars), rust (Related to Rust Polars)

ritchie46 (Member) left a comment

Yeah buddy!

ritchie46 merged commit 585b9f3 into pola-rs:main Jan 19, 2024
23 checks passed

orlp (Collaborator, Author) commented Jan 19, 2024

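A sample benchmark for comparisons against a constant string (equality and less-than), plus an equality comparison between two columns. The input is the words of mobydick.txt repeated 1000 times (about 200 million words):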

import polars as pl
from timeit import timeit

# Load the text and split it into whitespace-separated words.
with open("mobydick.txt") as f:
    words = f.read().split()

# Repeat the word list 1000 times to build one large string column.
df = pl.DataFrame({"words": words * 1000})
print(df)

# Equality against a constant string; each timing is the total for 10 runs.
print(timeit(lambda: df.select((pl.col.words == "the").sum()), number=10))
print(timeit(lambda: df.select((pl.col.words == "and").sum()), number=10))
print(timeit(lambda: df.select((pl.col.words == "Captain Ahab").sum()), number=10))

# Less-than comparison against a constant string.
print(timeit(lambda: df.select((pl.col.words < "the").sum()), number=10))
print(timeit(lambda: df.select((pl.col.words < "and").sum()), number=10))
print(timeit(lambda: df.select((pl.col.words < "Captain Ahab").sum()), number=10))

# Pair each word ("b") with the word that follows it ("a").
pairs = pl.DataFrame({
    "b": df["words"].slice(0, len(df) - 1),
    "a": df["words"].slice(1, len(df) - 1),
})

# Equality between two string columns.
print(timeit(lambda: pairs.select((pl.col.a == pl.col.b).sum()), number=10))

Polars 0.20.5 (seconds per 10 runs, in the order of the timeit calls above):

5.933478082995862
5.084461957914755
1.9035077500157058
8.33037916594185
7.153790874872357
5.330617999890819
3.9881368749774992

This PR (seconds per 10 runs, same order):

2.0321385839488357
1.5042283749207854
1.5023376671597362
2.973445334006101
2.181169332936406
1.5706394580192864
1.2495929999276996
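
To make the two result lists easier to compare, the per-case speedup can be computed directly from the numbers reported above. The snippet below is only a convenience sketch: the timings are copied verbatim from the two runs, and the labels are shorthand for the seven `timeit` calls in the benchmark, in order.

```python
# Shorthand labels for the seven timeit calls, in the order they were run.
labels = [
    'words == "the"',
    'words == "and"',
    'words == "Captain Ahab"',
    'words < "the"',
    'words < "and"',
    'words < "Captain Ahab"',
    "a == b (two columns)",
]

# Total seconds for 10 runs, copied from the results above.
before = [  # Polars 0.20.5
    5.933478082995862,
    5.084461957914755,
    1.9035077500157058,
    8.33037916594185,
    7.153790874872357,
    5.330617999890819,
    3.9881368749774992,
]
after = [  # this PR
    2.0321385839488357,
    1.5042283749207854,
    1.5023376671597362,
    2.973445334006101,
    2.181169332936406,
    1.5706394580192864,
    1.2495929999276996,
]

# Speedup is old time divided by new time; values above 1 mean the PR is faster.
speedups = {label: b / a for label, b, a in zip(labels, before, after)}

for label, factor in speedups.items():
    print(f"{label:<26} {factor:.2f}x")
```

Every case comes out faster with this PR; the smallest gain is for equality with "Captain Ahab", which was already the fastest case on 0.20.5.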

Labels: accepted (Ready for implementation), performance (Performance issues or improvements), python (Related to Python Polars), rust (Related to Rust Polars)