fusion of nested f.(args) calls into a single broadcast call #17300

stevengj · 2016-07-06T18:34:19Z

As discussed in #16285 and suggested by @yuyichao, this PR implements fusion of nested f.(args) calls into a single broadcast loop. For example, sin.(cos.(x)) is transformed by the parser into broadcast(x -> sin(cos(x)), x).

Because this is a syntactic guarantee, there is no need to worry about side effects or function purity, so it is very different from loop fusion viewed as a compiler optimization.
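A minimal sketch of the equivalence described above, assuming a Julia version in which dot-call fusion is available (the syntax shown also runs on Julia 1.x). The names `x`, `fused`, and `manual` are illustrative and do not come from the PR.

```julia
x = [0.0, 0.5, 1.0]

# Nested dot calls: a single loop over x and a single result array.
fused = sin.(cos.(x))

# The hand-written form that the fused expression is equivalent to.
manual = broadcast(v -> sin(cos(v)), x)

@assert fused == manual
```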

To do:

- Tests, lots of tests!
- Documentation
- Fix handling of splatting (after "fix f.(args...) with splatting" #17307 is merged)
  - more splatting tests
- Needs "broadcasting over scalars should produce a scalar" #17318 or similar to correct func.(literals...).
- Fix handling of keyword arguments.

Note that operators like .+ are still handled separately, but the long-term plan in #16285 is to treat dot operators as .( calls.

stevengj · 2016-07-06T18:36:02Z

Note that this is still an early preview. I literally committed the instant sin.(cos.(x)) seemed to work. All of my other testing so far has been at the Scheme level (calling individual sub-functions of expand-fuse-broadcast on various symbolic expressions to make sure they are sensible). That being said, I think it is feature-complete, modulo bugs.

stevengj · 2016-07-06T18:54:36Z

There are also some potential optimizations that I haven't implemented. For example, if you do sin.(atan.(x,1)), that is currently transformed into the equivalent of broadcast((x,y) -> sin(atan(x,y)), x, 1). In principle, the parser can detect the literal constant 1 and omit the second broadcast argument, giving broadcast(x -> sin(atan(x,1)), x) instead.

Are there any other arguments besides numeric literals that can be "inlined" in this way?
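To make the proposed optimization concrete, here is a small sketch comparing the two-argument and the "inlined literal" forms. It assumes Julia 1.x, where the two-argument arctangent is spelled `atan(y, x)`. The variable names are hypothetical.

```julia
x = [0.0, 1.0, 2.0]

# Current transformation: the literal 1 is passed as a second broadcast argument.
two_args = broadcast((a, b) -> sin(atan(a, b)), x, 1)

# Proposed optimization: the literal is inlined into the fused function.
inlined = broadcast(a -> sin(atan(a, 1)), x)

# Both should agree with the dot-call spelling.
dotted = sin.(atan.(x, 1))

@assert two_args == inlined == dotted
```

The results are identical in both forms. The inlined version only saves passing (and broadcasting over) a scalar argument.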

stevengj · 2016-07-06T20:33:23Z

Performance is looking good:

f(x) = x + 1
f1(x) = broadcast(f, broadcast(f, broadcast(f, x)))
f2(x) = [f(f(f(x))) for x in x]
f3(x) = f.(f.(f.(x)))
@time f1(x)
@time f2(x)
@time f3(x)

gives (after the usual repetitions):

  0.145307 seconds (66 allocations: 228.884 MB, 55.97% gc time)
  0.031480 seconds (2 allocations: 76.294 MB, 21.96% gc time)
  0.028810 seconds (22 allocations: 76.295 MB, 21.56% gc time)

i.e. it is doing about as well as manual loop fusion.
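The snippet above does not show how `x` was created. The following self-contained variant is a sketch only: the array size is an assumption (the roughly 76 MB per result array reported above would correspond to about 10^7 `Float64` values, but a smaller array is used here to keep the run short).

```julia
f(x) = x + 1
f1(x) = broadcast(f, broadcast(f, broadcast(f, x)))  # three loops, three arrays
f2(x) = [f(f(f(x))) for x in x]                      # manual fusion via a comprehension
f3(x) = f.(f.(f.(x)))                                # fused dot calls

x = rand(10^6)        # assumed input; the original size is not shown

f1(x); f2(x); f3(x)   # warm up so that compilation is not timed

@time f1(x)
@time f2(x)
@time f3(x)
```

All three variants compute the same values; only the number of loops and temporary arrays differs.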

nalimilan · 2016-07-06T21:19:12Z

Wait... If you add that now, what feature are we going to introduce next to justify another release after 0.5? ;-)

Keno · 2016-07-06T21:20:43Z

It's not clear that we should do this now. 0.5 is \epsilon close to release, this is untested and mostly a performance question (people can write this syntax, it'll get faster in the next release).

stevengj · 2016-07-06T21:58:22Z

@Keno, that's true. On the other hand, this feature is relatively non-disruptive because the f.(args...) syntax itself is new in 0.5, and there is some argument in favor of settling on the semantics now.

stevengj · 2016-07-07T02:06:06Z

Will need a rebase and some fixes after #17307.

martinholters · 2016-07-07T07:10:15Z

people can write this syntax, it'll get faster in the next release

...and it will change the result when applied to non-pure functions. So either merge this PR for 0.5, or put a big warning in the docs.
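A small sketch of how fusion becomes observable with non-pure functions, assuming Julia 1.x semantics. The functions `f` and `g` and the `calls` log are hypothetical; they record the order in which calls happen.

```julia
calls = String[]   # hypothetical log of the call order

g(x) = (push!(calls, "g($x)"); x + 1)
f(x) = (push!(calls, "f($x)"); 2x)

# Unfused: every g call runs before any f call.
unfused = broadcast(f, broadcast(g, [1, 2]))
unfused_order = copy(calls)

empty!(calls)

# Fused: g and f alternate element by element.
fused = f.(g.([1, 2]))
fused_order = copy(calls)

@assert unfused == fused            # same values
@assert unfused_order != fused_order  # different side-effect order
```

The computed values agree, but any code that depends on the order of side effects sees a difference between the two forms.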

ivarne · 2016-07-07T07:36:18Z

Are there any other arguments besides numeric literals that can be "inlined" in this way?

Calls to @pure functions? Non-numeric immutable literals (e.g. strings)?

martinholters · 2016-07-07T09:21:11Z

Not meaning to ruin anyone's day, but with the promote_op fallback, there can be nasty surprises:

julia> log.(log.(log.([1000]))) # if this
1-element Array{Float64,1}:
 0.658889

julia> broadcast(x -> log(log(log(x))), [1000]) # is rewritten to this
ERROR: DomainError:
 in nan_dom_err at ./math.jl:0 [inlined]
 in log(::Float64) at ./math.jl:140
 in (::##1#2)(::Int64) at ./REPL[15]:1
 in promote_op at ./number.jl:0 [inlined]
 in promote_eltype_op at ./abstractarray.jl:0 [inlined]
 in broadcast(::Function, ::Array{Int64,1}) at ./broadcast.jl:197
 in eval(::Module, ::Any) at ./boot.jl:234
 in macro expansion at ./REPL.jl:92 [inlined]
 in (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:46

This is not meant as an argument against this PR though, but rather as motivation to rethink that fallback...
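The failure in the traceback above comes from evaluating the fused function on a probe value rather than on the data. The sketch below reproduces that probe by hand. It assumes Julia 1.x, where, to my knowledge, `broadcast` determines the element type through inference instead of calling the function on `one(T)`, so the fused call itself succeeds.

```julia
# Probe value: one(Int) == 1. Then log(1) == 0.0, log(0.0) == -Inf,
# and log(-Inf) is outside the domain of the real logarithm.
probe_failed = try
    log(log(log(one(Int))))
    false
catch err
    err isa DomainError
end

# The actual data stays inside the domain.
result = broadcast(x -> log(log(log(x))), [1000])

@assert probe_failed
@assert isapprox(result[1], 0.658889; atol=1e-5)
```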

nalimilan · 2016-07-07T09:38:58Z

This is not meant as an argument against this PR though, but rather as motivation to rethink that fallback...

Damn fallback. We really need to use type inference instead of actually calling the function on one(T)...

andyferris · 2016-07-07T12:51:17Z

+100 for this PR

Note that operators like .+ are still handled separately, but the long-term plan in #16285 is to treat dot operators as .( calls.

@stevengj I'm not sure of their current status (I only checked a couple weeks ago), but can we at minimum have a default (fallback) definition of .+ etc in terms of broadcast in v0.5? This would significantly reduce the number of functions needed for new container types.

It's not clear that we should do this now. 0.5 is \epsilon close to release, this is untested and mostly a performance question (people can write this syntax, it'll get faster in the next release).

@Keno - it may be a performance question, but this change is totally kick-ass. When I first switched to Julia after reading about its speed improvements over MATLAB, I was disappointed to find that the same memory mistakes MATLAB made for the kind of work I was doing at the time (large, memory-limited linear algebra problems) also existed in Julia. Of course, there was Devectorize.jl and other tricks, but for a newbie it was difficult to know what to do.

My personal opinion was always that reducing the memory overhead for Julia's core, ex-MATLAB audience should be one of the priorities. On the other hand, you definitely make valid points about rushed features! (In fact, I think changing the parsing of .+ etc for 0.5 is more important in terms of presenting a uniform interface with the performance improvements here coming later).

andreasnoack · 2016-07-07T13:03:41Z

@martinholters Maybe a reason not to throw so much:

julia> import NaNMath

julia> broadcast(x -> NaNMath.log(NaNMath.log(NaNMath.log(x))), [1000])
1-element Array{Float64,1}:
 0.658889

stevengj · 2016-07-07T13:09:53Z

@andreasnoack, please continue that discussion at #17314, since it is orthogonal to this PR.

tkelman · 2016-07-07T20:59:00Z

doc/manual/functions.rst

+Moreover, *nested* ``f.(args...)`` calls are *fused* into a single ``broadcast``
+loop.  For example, ``sin.(cos.(X))`` is equivalent to ``broadcast(x -> sin(cos(x)), X)``,
+similar to ``[sin(cos(x)) for x in X]``: there is only a single loop over ``X``,
+and a single array is allocated for the result.   [In contrast, ``sin(cos(X))``


I hope brackets don't have special meaning in rst

I don't think we use square brackets for parenthetical comments elsewhere in the docs?

The usual typographical convention is to use square brackets for parenthetical comments that have nested parentheses.

Huh, I haven't come across that convention. Do you have a citation for it?

In math, the usual style is to nest parens as {[(...)]}. In text, it is common to recommend square inside round parens if you must nest, but that isn't possible here because the nested parens are code and hence are constrained by Julia syntax.

http://blog.apastyle.org/apastyle/2013/05/punctuation-junction-parentheses-and-brackets.html
http://www.chicagomanualofstyle.org/16/ch12/ch12_sec026.html
http://www.chicagomanualofstyle.org/16/ch06/ch06_sec099.html

martinholters · 2016-07-08T08:08:54Z

Another subtlety: If f and g define their own promote_ops, they won't be used in f.(g.(x)), potentially leading to a surprising return type. Silly example:

julia> Number.(sin.([0 pi/2]))
1×2 Array{Number,2}:
 0.0  1.0

julia> broadcast(x -> Number(sin(x)), [0.0 pi/2])
1×2 Array{Float64,2}:
 0.0  1.0

One may argue whether the promote_op for Number is flawed, but the point here is that fusing the operation gives a different return type. The solution here probably would be to implement a promote_op for the composed function invoking promote_op for the fused functions.

stevengj · 2016-07-08T13:34:42Z

As @JeffBezanson remarked, promote_op inherently doesn't scale. @martinholters, what you're describing seems essentially equivalent to type inference, possibly combined with return-type declarations for a few functions.

stevengj · 2016-07-10T23:32:51Z

(Test failure should be fixed by #17318.)

martinholters · 2016-07-11T07:15:29Z

what you're describing seems essentially equivalent to type inference

Yes, well, but without any dependencies on the compiler's mood, so more predictable across Julia versions. But if we could get rid of promote_op completely, that would be all the better...

StefanKarpinski · 2016-07-11T23:09:50Z

I would mention that this is not just a performance issue: if we release f.(x...) syntax in 0.5 and dot fusion in 0.6, we will potentially break people's code precisely because of all of the above gotchas. If we do this now, the fused dot behavior and the unfused non-dot behavior may be surprisingly different, but code won't break.

StefanKarpinski · 2016-07-11T23:11:12Z

So if this can be made ready this week, I think we should consider doing it now.


stevengj · 2016-07-12T15:24:43Z

Okay, dot calls with keyword arguments are now handled. (They are not handled correctly on master even for non-fused dot calls.)

If the tests pass, this should be ready to merge as far as I can tell right now.
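A brief sketch of dot calls with keyword arguments, assuming Julia 1.x, where `round` accepts a `digits` keyword. The keyword is passed to each scalar call; it is not broadcast over. The variable names are illustrative.

```julia
x = [-1.234, 5.678]

# A single dot call with a keyword argument.
rounded = round.(x; digits=1)

# A nested (fused) dot call in which the inner call takes a keyword argument.
nested = abs.(round.(x; digits=1))

@assert rounded ≈ [-1.2, 5.7]
@assert nested ≈ [1.2, 5.7]
```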

JeffBezanson · 2016-07-12T19:15:37Z

Some impressive sexpr-slinging in here. :)

stevengj · 2016-07-12T19:25:38Z

It's always a questionable sign if you find yourself wishing that cadaddr and cddaddr were defined. 😉

davidanthoff · 2016-07-12T21:53:08Z

Thanks for finishing this on time, very, very cool that this made it for 0.5!!!

gabrielgellner · 2016-07-12T22:36:30Z

Does it make sense that:

x = rand(10000)
@time sin(cos(x)); #  0.000243 seconds (12 allocations: 156.625 KB)
@time sin.(cos.(x)); # 0.012385 seconds (6.85 k allocations: 386.261 KB)
@time broadcast(xval->sin(cos(xval)), x); #  0.012997 seconds (6.77 k allocations: 382.771 KB)

@time sin.(cos.(x) .+ 0) #  0.000325 seconds (50 allocations: 235.953 KB)

There seems to be a strange speed regression on my machine when using the dotted syntax if the fusion kicks in (I imagine it is caused by this PR, hence the last example, which uses .+ to get different code). But it is consistent with the broadcast call, so maybe this is how the code is meant to work? It just feels strange that adding the .+ gives such a huge speedup!

Is there a way to see what the lowered code looks like from the REPL?

stevengj · 2016-07-12T22:41:00Z

The .+ prevents fusion, because dot operators are not handled as f.(args...) calls yet. Maybe this is a global-scope thing? Try f(x) = sin.(cos.(x)).

yuyichao · 2016-07-12T22:50:26Z

Note that each line that uses the dot function call syntax generates a new anonymous function. You need to wrap it in a function to avoid benchmarking the compiler.

stevengj · 2016-07-12T22:51:15Z

Timings look good when wrapped in a function:

x = rand(10^4)
f1(x) = sin(cos(x))
f2(x) = sin.(cos.(x))
@time f1(x); @time f2(x);
...repeat a few times...
@time f1(x); @time f2(x);

  0.000407 seconds (12 allocations: 156.625 KB)
  0.000370 seconds (15 allocations: 78.500 KB)

In this case, the sin and cos functions are expensive enough that there is not a huge performance advantage to fusing the two loops, but the memory advantage is clear.
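The memory difference can also be checked with `@allocated`. This is a sketch under assumptions: it targets Julia 1.x, where `sin(cos(x))` is no longer defined for arrays, so the unfused version is written with explicit `broadcast` calls. The function names `unfused` and `fused` are hypothetical.

```julia
unfused(x) = broadcast(sin, broadcast(cos, x))  # two loops, one temporary array
fused(x) = sin.(cos.(x))                        # one loop, no temporary array

x = rand(10^4)

unfused(x); fused(x)   # warm up so that compilation is not measured

unfused_bytes = @allocated unfused(x)
fused_bytes = @allocated fused(x)

@assert unfused(x) ≈ fused(x)
@assert unfused_bytes > fused_bytes
```

Wrapping both expressions in named functions means the anonymous function generated by the dot syntax is compiled once, so only the array work is measured.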

andyferris · 2016-07-12T22:55:53Z

Nice! I have to say I'm impressed by all of this work 👍

Out of curiosity, will .+ etc. be dealt with for 0.5?

stevengj · 2016-07-12T22:57:04Z

No, I think dot operators will be left as-is for 0.5.

(In consequence, this loop fusion probably won't have a big practical impact until 0.6. But it is good to get at least a preview in 0.5.)

gabrielgellner · 2016-07-12T22:59:09Z

Ohhhh I see. It is the anonymous function call in the global scope, versus having broadcast(sin, broadcast(cos, x)) not creating the anonymous function. Kind of funny that the more elegant way of doing this will make using it in the natural way in the REPL slower ;)

Thanks so much for the clarification. I am so excited for this feature (the .( syntax, not the fusion specifically).

stevengj · 2016-07-12T23:00:31Z

Yeah, it would be nice if the compiler were smart enough to re-use identical anonymous functions.

andyferris · 2016-07-12T23:43:43Z

No, I think dot operators will be left as-is for 0.5.

That leads me to predict you'll see some packages using code like +.(A, *.(B, C)), instead of A .+ B .* C, and broadcast!((a,b,c) -> a + b*c, A, A, B, C) instead of A .+= B.*C.

It will be a super-nice syntax change, whenever it lands. :)
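For comparison, a sketch of the spelled-out in-place form next to the operator form, assuming Julia 1.x, where dot operators and `.=` take part in fusion (so the workaround is no longer needed). The arrays and names are illustrative.

```julia
A = [1.0, 2.0, 3.0]
B = [10.0, 20.0, 30.0]
C = [2.0, 2.0, 2.0]

expected = A .+ B .* C

# Spelled-out in-place update with an explicit fused function.
A1 = copy(A)
broadcast!((a, b, c) -> a + b * c, A1, A1, B, C)

# Operator spelling of the same in-place update.
A2 = copy(A)
A2 .+= B .* C

@assert A1 == A2 == expected
```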

gabrielgellner · 2016-07-12T23:48:18Z

@andyferris as @StefanKarpinski has mentioned, in another issue, it would be really great if we could find a way for Julia to slap you if you have a tendency to write too much code that way... I just hope this is all that happens and they don't also use 0-based arrays on top of such abominations.

andyferris · 2016-07-12T23:49:38Z

as @StefanKarpinski has mentioned, in another issue, it would be really great if we could find a way for Julia to slap you if you have a tendency to write too much code that way

lol :)

stevengj mentioned this pull request Jul 6, 2016

Vectorization Roadmap #16285

Closed

5 tasks

nalimilan mentioned this pull request Jul 7, 2016

fix #4883, result type of broadcast for arbitrary functions #17172

Merged

stevengj mentioned this pull request Jul 7, 2016

promote_op/broadcast should not call the function for one(T) #17314

Closed

tkelman reviewed Jul 7, 2016

stevengj mentioned this pull request Jul 9, 2016

broadcasting over scalars should produce a scalar #17318

Merged

stevengj force-pushed the dot-fusion branch 2 times, most recently from 67f8239 to 03ea7e0 Compare July 10, 2016 21:02

stevengj force-pushed the dot-fusion branch from 03ea7e0 to c03e069 Compare July 11, 2016 19:10

stevengj added 7 commits July 12, 2016 10:25

partial fix for loop fusion and splatting (but only handles args... if they are the last argument in the fusion)

0c4b67e

compress fused broadcast args by eliminating literals and some (pure) duplicates

d8f4b60

simplify/fix dot broadcasting of anonymous functions

4637b3c

work around #17318

8caf3fd

more fusion tests (splatting, literals)

05bc2fa

correct description of #17314

4367fc6

undo #17318 workaround, add more constant-folding tests

61d2d39

stevengj force-pushed the dot-fusion branch from 77b9ecf to 61d2d39 Compare July 12, 2016 14:25

handle dot calls with keyword arguments

fb8f1e1

JeffBezanson changed the title ~~WIP: fusion of nested f.(args) calls into a single broadcast call~~ fusion of nested f.(args) calls into a single broadcast call Jul 12, 2016

JeffBezanson merged commit 8fdaf91 into master Jul 12, 2016

stevengj deleted the dot-fusion branch July 12, 2016 19:32

stevengj mentioned this pull request Jul 20, 2016

treat .= as syntactic sugar for broadcast! #17510

Merged

4 tasks

stevengj added the broadcast Applying a function over a collection label Aug 2, 2016

stevengj mentioned this pull request Aug 3, 2016

restore getfield deprecation for foo.(1) etc. #17773

Merged
