
Simplify computation of return type in broadcast #39295

Open
wants to merge 1 commit into base: master

Conversation

nalimilan (Member)

Since we rely on inference, we can use _return_type directly instead of going through complex machinery.

As suggested by @mbauman at #39185 (comment).

Since we rely on inference, we can use `_return_type` directly instead of going through complex machinery.
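
For context, here is a minimal sketch of what the change amounts to. It uses internal APIs (`broadcasted`, `instantiate`, `combine_eltypes`, `_broadcast_getindex`, `_return_type`, `promote_typejoin_union`) purely for illustration and is not the exact Base code:

```julia
using Base.Broadcast: broadcasted, instantiate, combine_eltypes, _broadcast_getindex

x = rand(10)
bc = instantiate(broadcasted(exp, x))

# Old path: map each argument to its element type, then infer `exp` over those eltypes.
ElType_old = combine_eltypes(bc.f, bc.args)

# New path (this PR): ask inference for the type of indexing the Broadcasted object itself.
ElType_new = Base.promote_typejoin_union(
    Base._return_type(_broadcast_getindex, Tuple{typeof(bc), Int}))

ElType_old === Float64 && ElType_new === Float64  # expected to agree in this simple case
```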
Comment on lines -420 to +421
g() = (a = 1; Broadcast.combine_eltypes(x -> x + a, (1.0,)))
@test @inferred(g()) === Float64
g() = (a = 1; x -> x + a)
@test @inferred(broadcast(g(), 1.0)) === 2.0
nalimilan (Member, Author)
@pabloferz @Sacha0 Since you worked on these tests (this one and the one below), could you confirm that the new ones cover the same use cases as the old ones? That wasn't completely clear to me.

Member

Regrettably, sufficient time has elapsed since I looked at these tests that I no longer have much memory of them. Sorry Milan! :)

@timholy (Sponsor Member) commented Jan 17, 2021

Does this have an impact on the inference & codegen time? The broadcast infrastructure is already a big piece of the latency for many packages, just curious whether this makes it better or worse.

@nalimilan (Member, Author)

Here's a small benchmark with x = rand(10), each time in a fresh Julia session.

Master:

julia> @time exp.(x);
  0.071312 seconds (207.54 k allocations: 12.900 MiB, 99.55% compilation time)

julia> @time exp.(x);
  0.073889 seconds (207.54 k allocations: 12.900 MiB, 99.48% compilation time)

julia> @time exp.(x);
  0.072427 seconds (207.54 k allocations: 12.900 MiB, 99.54% compilation time)

PR:

julia> @time exp.(x);
  0.075400 seconds (223.03 k allocations: 13.804 MiB, 99.46% compilation time)

julia> @time exp.(x);
  0.071174 seconds (223.03 k allocations: 13.804 MiB, 99.56% compilation time)

julia> @time exp.(x);
  0.077204 seconds (223.03 k allocations: 13.804 MiB, 99.58% compilation time)

So there are a few more allocations, and it might be a bit slower, but it's not super clear. Do you have ideas about other possible benchmarks?

@timholy (Sponsor Member) commented Jan 17, 2021

Maybe one where `f` has multiple arguments? As long as that looks good too, I'm fine with this idea.

@nalimilan (Member, Author)

Here's what I get for slightly more complex cases (still with a fresh session for each pair of commands):
Master:

julia> @time x .+ 1;
  0.062047 seconds (165.10 k allocations: 10.133 MiB, 99.52% compilation time)

julia> @time Float32.(x) .+ x .+ 1;
  0.132495 seconds (281.68 k allocations: 16.544 MiB, 99.32% compilation time)

julia> @time x .+ 1;
  0.063509 seconds (165.10 k allocations: 10.133 MiB, 99.48% compilation time)

julia> @time Float32.(x) .+ x .+ 1;
  0.137296 seconds (281.68 k allocations: 16.544 MiB, 99.36% compilation time)

julia> @time x .+ 1;
  0.061300 seconds (165.10 k allocations: 10.133 MiB, 99.34% compilation time)

julia> @time Float32.(x) .+ x .+ 1;
  0.136219 seconds (281.68 k allocations: 16.544 MiB, 99.40% compilation time)

PR:

julia> @time x .+ 1;
  0.065680 seconds (180.26 k allocations: 11.009 MiB, 99.58% compilation time)

julia> @time Float32.(x) .+ x .+ 1;
  0.134870 seconds (296.66 k allocations: 17.251 MiB, 99.40% compilation time)

julia> @time x .+ 1;
  0.065040 seconds (180.26 k allocations: 11.009 MiB, 99.35% compilation time)

julia> @time Float32.(x) .+ x .+ 1;
  0.141318 seconds (296.66 k allocations: 17.251 MiB, 99.39% compilation time)

julia> @time x .+ 1;
  0.068847 seconds (180.26 k allocations: 11.009 MiB, 99.54% compilation time)

julia> @time Float32.(x) .+ x .+ 1;
  0.135004 seconds (296.66 k allocations: 17.251 MiB, 99.32% compilation time)

So still a slight increase in allocations.

But I've found a more serious problem: the CI failure is due to combine_eltypes being used at

entrytypeC = Base.Broadcast.combine_eltypes(f, (A, Bs...))

and
entrytypeC = Base.Broadcast.combine_eltypes(f, (A, Bs...))

I'm not sure we can actually get rid of these without reinventing most of combine_eltypes. What do you think? BTW, I'm surprised that combine_eltypes is used to determine the type of the result (even when it's not empty), as it relies on inference.

@vtjnash (Sponsor Member) commented Jan 18, 2021

SparseArrays may have some legacy issues with the way it forms the eltype. I think this PR seems reasonable. It should take similar time, since we're about to infer into the methods (for the runtime code path) anyway.

@@ -901,7 +888,8 @@ copy(bc::Broadcasted{<:Union{Nothing,Unknown}}) =
const NonleafHandlingStyles = Union{DefaultArrayStyle,ArrayConflict}

@inline function copy(bc::Broadcasted{Style}) where {Style}
ElType = combine_eltypes(bc.f, bc.args)
ElType = promote_typejoin_union(Base._return_type(_broadcast_getindex,
Tuple{typeof(bc), Int}))
@mbauman (Sponsor Member) commented Jan 20, 2021

I think this needs to be:

Suggested change
Tuple{typeof(bc), Int}))
Tuple{typeof(bc), ndims(bc) == 1 ? eltype(axes(bc)[1]) : CartesianIndex{ndims(bc)}})
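
To illustrate the concern with a minimal (purely illustrative) example: for a multidimensional broadcast, iterating the Broadcasted object goes through CartesianIndex rather than Int, so inferring against `Tuple{typeof(bc), Int}` may not reflect the index type actually used at runtime:

```julia
using Base.Broadcast: broadcasted, instantiate

bc = instantiate(broadcasted(+, rand(2, 3), 1))
eltype(eachindex(bc))  # CartesianIndex{2} for this 2-d broadcast, not Int
```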

Sponsor Member

Or is it Base._return_type(iterate, Base._return_type(eachindex, Tuple{typeof(bc)})) ?

Sponsor Member

Oops, dropped a function. I meant:

index_type(bc) = iterate(eachindex(bc))[1]
Base._return_type(index_type, Tuple{typeof(bc)})

Sponsor Member

Putting that all together:

Suggested change
Tuple{typeof(bc), Int}))
_broadcast_getindex_eltype(bc) = _broadcast_getindex(bc, iterate(eachindex(bc))[1])
ElType = promote_typejoin_union(Base._return_type(_broadcast_getindex_eltype, Tuple{typeof(bc)}))
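
A self-contained sketch of that combined suggestion (the helper name `index_eltype_sketch` is made up here; `broadcasted`, `instantiate`, and `_broadcast_getindex` are internal helpers used only for illustration):

```julia
using Base.Broadcast: broadcasted, instantiate, _broadcast_getindex

# Infer the element type from indexing the Broadcasted object at its first index.
index_eltype_sketch(bc) = _broadcast_getindex(bc, iterate(eachindex(bc))[1])

bc = instantiate(broadcasted(+, rand(2, 3), 1))
ElType = Base.promote_typejoin_union(
    Base._return_type(index_eltype_sketch, Tuple{typeof(bc)}))
# Expected to be Float64 here, provided inference sees through `iterate`.
```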

Sponsor Member

I'm not entirely sure this is better than the existing code, which does pretty much the same calls but bases them on calling eltype instead of inference, which at least has different tradeoffs, for better or worse 🤔
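
For concreteness, a rough sketch of the two flavors being weighed (not the exact Base code; `_broadcast_getindex_eltype` and the other names are internal helpers used only for illustration):

```julia
using Base.Broadcast: broadcasted, instantiate, _broadcast_getindex, _broadcast_getindex_eltype

bc = instantiate(broadcasted(+, rand(2, 3), 1))

# eltype-based: collect the argument element types via eltype, then infer `f` over them
# (roughly what combine_eltypes does).
arg_eltypes = Tuple{map(_broadcast_getindex_eltype, bc.args)...}
ElType_via_eltype = Base.promote_typejoin_union(Base._return_type(bc.f, arg_eltypes))

# inference-based: infer the whole indexing call on the Broadcasted object (the approach above).
ElType_via_inference = Base.promote_typejoin_union(
    Base._return_type(_broadcast_getindex, Tuple{typeof(bc), CartesianIndex{2}}))
```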

@vtjnash (Sponsor Member) commented Apr 19, 2021

Do we want to try to proceed with this PR / design (inferring iterate), or keep the current one (call eltype)?

@mbauman (Sponsor Member) commented Jan 20, 2021

Do you have ideas about other possible benchmarks?

The tests themselves lend themselves fairly nicely to compile-time benchmarking, e.g. `time julia test/broadcast.jl` or some such.
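
As a rough sketch of that kind of measurement (run in a fresh session on master and on this branch and compare wall time; most of it is compilation):

```julia
# Roughly equivalent to `time julia test/broadcast.jl` from a source checkout.
@time Base.runtests("broadcast")
```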

@kshyatt added the broadcast (Applying a function over a collection) and compiler:inference (Type inference) labels on Feb 6, 2021
@StefanKarpinski (Sponsor Member)

Bump?

@vtjnash (Sponsor Member) commented Aug 18, 2021

Note the currently open question of whether this is actually better or worse (#39295 (comment))

@N5N3 (Member) commented Dec 29, 2021

Can we wake this up? My local benchmark shows that this PR reduces the time cost of Base.runtests("broadcast") by about 10%–13%.

@nalimilan (Member, Author)

Can we wake this up? My local benchmark shows that this PR reduces the time cost of Base.runtests("broadcast") by about 10%–13%.

The time it takes to run tests isn't usually a very interesting benchmark as tests are a very atypical coding pattern. Do you have evidence that this PR improves performance (or compile times) on real use cases? This isn't to say that I'm opposed to merging it.

@N5N3 (Member) commented Dec 29, 2021

Well, I have no further evidence; I just followed @mbauman's advice above to benchmark, and found that the time cost and memory usage were reduced after a similar commit. IIRC, we also used the test suite itself to benchmark the codegen improvement from avoiding always inlining. Maybe we need a package whose TTFP is dominated by broadcast?

@vtjnash (Sponsor Member) commented Jan 7, 2022

The PR is currently wrong, though I have a suggestion above to fix it. The remaining question, as before, is whether we want this design change (#39295 (comment)).
