[CHORE] Refactor Binary Ops #2876
Conversation
CodSpeed Performance Report: merging #2876 will not alter performance.
Codecov Report. Attention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2876 +/- ##
==========================================
+ Coverage 78.37% 78.39% +0.02%
==========================================
Files 596 597 +1
Lines 69688 69687 -1
==========================================
+ Hits 54616 54634 +18
+ Misses 15072 15053 -19
Flags with carried forward coverage won't be shown.
```rust
#[cfg(feature = "python")]
DataType::Python => run_python_binary_operator_fn(lhs, rhs, "add"),
DataType::Utf8 => {
    Ok(cast_downcast_op!(lhs, rhs, &DataType::Utf8, Utf8Array, add)?.into_series())
```
so this was likely already an issue, but won't this cause us to do a bunch of extra work casting and downcasting when datatypes match the output type?
The cast logic should check if it already matches the target dtype.
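To illustrate the point being discussed, here is a minimal sketch of a cast helper that short-circuits when the input already has the target dtype, so same-typed operands skip the cast/downcast round-trip. All names (`DataType`, `Series`, `cast`) are simplified stand-ins, not Daft's actual API.

```rust
// Hypothetical sketch: a cast that is a no-op when the dtype already
// matches the target, so matching inputs avoid paying for a conversion.

#[derive(Clone, Debug, PartialEq)]
enum DataType {
    Int64,
    Utf8,
}

#[derive(Clone, Debug, PartialEq)]
struct Series {
    dtype: DataType,
    // payload elided for the sketch
}

impl Series {
    // Returns self untouched when the dtype already matches; only
    // performs a conversion when one is actually required.
    fn cast(self, target: &DataType) -> Series {
        if &self.dtype == target {
            return self; // no-op: nothing to convert
        }
        // ... a real conversion of the payload would happen here ...
        Series { dtype: target.clone() }
    }
}

fn main() {
    let s = Series { dtype: DataType::Utf8 };
    // Same dtype: no conversion work is done.
    let same = s.clone().cast(&DataType::Utf8);
    assert_eq!(same.dtype, DataType::Utf8);
    // Different dtype: conversion path runs.
    let converted = s.cast(&DataType::Int64);
    assert_eq!(converted.dtype, DataType::Int64);
}
```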
Cargo.toml (outdated)

```toml
@@ -82,6 +82,8 @@ parquet2 = {path = "src/parquet2"}
debug = true

[profile.dev]
debug = "line-tables-only"
opt-level = 1
```
do we really need opt-level=1 in dev? That seems like it'll greatly slow down the build times
Left that artifact in by mistake; I meant to push up the optimization level change for build macros.
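For what the author describes (optimizing build macros without slowing down regular dev builds), Cargo supports a narrower override via `[profile.dev.build-override]`, which applies only to build scripts and proc macros. A sketch of what that might look like; the exact `opt-level` value here is an assumption:

```toml
[profile.dev]
debug = "line-tables-only"

[profile.dev.build-override]
# Optimize build scripts and proc macros only; the workspace's own
# crates keep the default dev opt-level = 0, so incremental build
# times for regular code are unaffected.
opt-level = 3
```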
Same comment as @universalmind303 regarding the opt-level 1 for debug, otherwise it looks good to me!
This PR removes 1.4 million lines of LLVM code and reduces the size of `daft-core` by 44%!

This was accomplished by dropping the `SeriesBinaryOps` trait, which required every array type to implement its own binary ops, with the default implementation doing a macro expansion on both the rhs type and the output type. This caused an `O(Dtype^2)` expansion for every array type. It was done as a way to let each array define its own behavior for binary ops, but we didn't really leverage that outside of a few temporal types. For example, to implement `Timestamp + Duration` we could implement it on `TimestampArray`, but since we may also have `Duration + Timestamp`, we would also have to implement it on `DurationArray`.

The new approach is much simpler: we dispatch to the target implementation based on `(left_dtype, right_dtype)`. The numerics path pretty much stays the same, but for temporals there's only a handful of pairs to consider. I also factored out a bunch of macros into functions, especially ones that would perform the binary ops on PythonArrays.

Breakdown of llvm lines:

current main:

```
 Lines                Copies             Function name
 -----                ------             -------------
 3926120              70662              (TOTAL)
 162996 (4.2%, 4.2%)  1208 (1.7%, 1.7%)  <core::slice::iter::Iter<T> as core::iter::traits::iterator::Iterator>::fold
 135172 (3.4%, 7.6%)  1201 (1.7%, 3.4%)  <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
 127022 (3.2%, 10.8%) 1136 (1.6%, 5.0%)  alloc::vec::Vec<T,A>::extend_trusted
 82450 (2.1%, 12.9%)  34 (0.0%, 5.1%)    daft_core::series::array_impl::binary_ops::SeriesBinaryOps::equal
 82450 (2.1%, 15.0%)  34 (0.0%, 5.1%)    daft_core::series::array_impl::binary_ops::SeriesBinaryOps::gt
 82450 (2.1%, 17.1%)  34 (0.0%, 5.2%)    daft_core::series::array_impl::binary_ops::SeriesBinaryOps::gte
 82450 (2.1%, 19.2%)  34 (0.0%, 5.2%)    daft_core::series::array_impl::binary_ops::SeriesBinaryOps::lt
 82450 (2.1%, 21.3%)  34 (0.0%, 5.3%)    daft_core::series::array_impl::binary_ops::SeriesBinaryOps::lte
 82450 (2.1%, 23.4%)  34 (0.0%, 5.3%)    daft_core::series::array_impl::binary_ops::SeriesBinaryOps::not_equal
 79322 (2.0%, 25.5%)  34 (0.0%, 5.4%)    daft_core::series::array_impl::binary_ops::SeriesBinaryOps::mul
 79322 (2.0%, 27.5%)  34 (0.0%, 5.4%)    daft_core::series::array_impl::binary_ops::SeriesBinaryOps::rem
 76880 (2.0%, 29.4%)  31 (0.0%, 5.4%)    daft_core::series::array_impl::binary_ops::SeriesBinaryOps::add
 72323 (1.8%, 31.3%)  31 (0.0%, 5.5%)    daft_core::series::array_impl::binary_ops::SeriesBinaryOps::sub
 67116 (1.7%, 33.0%)  34 (0.0%, 5.5%)    daft_core::series::array_impl::binary_ops::SeriesBinaryOps::and
 67116 (1.7%, 34.7%)  34 (0.0%, 5.6%)    daft_core::series::array_impl::binary_ops::SeriesBinaryOps::or
 67116 (1.7%, 36.4%)  34 (0.0%, 5.6%)    daft_core::series::array_impl::binary_ops::SeriesBinaryOps::xor
 47428 (1.2%, 37.6%)  334 (0.5%, 6.1%)   core::slice::sort::unstable::quicksort::partition_lomuto_branchless_cyclic
```

after:

```
 Lines                Copies             Function name
 -----                ------             -------------
 2512523              73042              (TOTAL)
 136529 (5.4%, 5.4%)  1208 (1.7%, 1.7%)  <core::slice::iter::Iter<T> as core::iter::traits::iterator::Iterator>::fold
 127090 (5.1%, 10.5%) 1201 (1.6%, 3.3%)  <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
 106918 (4.3%, 14.7%) 1136 (1.6%, 4.9%)  alloc::vec::Vec<T,A>::extend_trusted
 42752 (1.7%, 16.4%)  334 (0.5%, 5.3%)   core::slice::sort::unstable::quicksort::partition_lomuto_branchless_cyclic
 39085 (1.6%, 18.0%)  1371 (1.9%, 7.2%)  core::iter::adapters::map::map_fold::{{closure}}
 37698 (1.5%, 19.5%)  383 (0.5%, 7.7%)   alloc::vec::Vec<T,A>::extend_desugared
 36300 (1.4%, 20.9%)  150 (0.2%, 7.9%)   core::slice::sort::shared::find_existing_run
 36236 (1.4%, 22.4%)  521 (0.7%, 8.6%)   core::iter::traits::iterator::Iterator::try_fold
 36018 (1.4%, 23.8%)  373 (0.5%, 9.1%)   <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::try_fold::{{closure}}
 34050 (1.4%, 25.2%)  150 (0.2%, 9.3%)   core::slice::sort::shared::smallsort::small_sort_general_with_scratch
 33082 (1.3%, 26.5%)  278 (0.4%, 9.7%)   arrow2::array::utf8::mutable::MutableUtf8Array<O>::try_from_iter
 32417 (1.3%, 27.8%)  193 (0.3%, 10.0%)  daft_core::array::ops::utf8::substr_compute_result::{{closure}}
```
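The pair-based dispatch described above can be sketched as a single `match` on the `(left_dtype, right_dtype)` tuple. This is an illustrative reconstruction, not Daft's actual code; `DataType`, `add_output_dtype`, and the listed variants are simplified assumptions. The key property is that each supported pair appears once, so `Timestamp + Duration` and `Duration + Timestamp` share one arm instead of requiring trait impls on two array types.

```rust
// Minimal sketch of dispatching a binary op on the dtype pair rather
// than implementing the op on every array type.

#[derive(Clone, Copy, Debug, PartialEq)]
enum DataType {
    Int64,
    Timestamp,
    Duration,
}

// Resolve the output dtype of `lhs + rhs`, or None if unsupported.
fn add_output_dtype(lhs: DataType, rhs: DataType) -> Option<DataType> {
    use DataType::*;
    match (lhs, rhs) {
        // Numerics path: unchanged from before.
        (Int64, Int64) => Some(Int64),
        // Temporals: only a handful of pairs, all enumerated in one place.
        // Both orderings are handled by a single arm.
        (Timestamp, Duration) | (Duration, Timestamp) => Some(Timestamp),
        (Duration, Duration) => Some(Duration),
        // Any other pair is an unsupported combination.
        _ => None,
    }
}

fn main() {
    assert_eq!(
        add_output_dtype(DataType::Timestamp, DataType::Duration),
        Some(DataType::Timestamp)
    );
    // Commutative case needs no second implementation.
    assert_eq!(
        add_output_dtype(DataType::Duration, DataType::Timestamp),
        Some(DataType::Timestamp)
    );
    assert_eq!(add_output_dtype(DataType::Int64, DataType::Timestamp), None);
}
```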