
BitArray broadcasting #6035

Merged
merged 8 commits on Mar 9, 2014

Conversation

carlobaldassi
Member

This fixes #5877 and touches related issues, in that it makes element-wise comparisons (.==, .<, etc.) broadcast, and also completes broadcasting behaviour for all element-wise operations involving BitArrays (.*, ./, etc.), unless I missed something.

Besides that, there is some minor stuff in here, the most relevant being that it fixes .+ and .- for Bools, which now return Array{Int}. There's also a specialization for Bool^Int.
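As a quick illustration of these semantics (a sketch using current syntax, not code from the PR):

```julia
# Element-wise comparisons broadcast across mismatched shapes,
# and element-wise addition of Bools promotes to Int.
a = [true, false, true]      # 3-element vector
b = [true false true]        # 1x3 matrix

c = a .== b                  # broadcasts to a 3x3 matrix
s = [true] .+ [true]         # Bool .+ Bool returns Array{Int}: [2]
p = true^3                   # Bool^Integer specialization: true
```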

This is a pull request even though, in my view, it's ready for merging, mainly to highlight some issues I found. Here's a list of reasons why I didn't merge it directly:

  1. Since broadcast is @toivoh's (and others') creature, review is in order

  2. The code adds a specialization for broadcast!(f::Function, B::BitArray, ...) and also adds a new function, bitbroadcast, which is not exported: should it be exported?

  3. Performance is good, unless broadcasting is actually needed (i.e. broadcast_shape yields something different from promote_shape) and BitArrays are involved in the argument list.

    3a. In the case promote_shape succeeds, I added fallbacks which make things efficient. One possibly questionable issue is my use of try blocks for that purpose.

    3b. In case it fails, and BitArrays are involved (e.g. trues(5) .+ falses(1,5)), cartesian iteration kills performance, to the point that it is much faster to invoke bitunpack on the arguments first; however, that may clearly become very memory-expensive. I haven't been able to overcome this issue yet, so advice would be welcome on this. (One might imagine unpacking below a certain size, but that is not a very satisfactory solution.)
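The fallback mentioned in point 3a might be sketched as follows (the function name bit_elementwise is hypothetical; the actual code lives inside the broadcast machinery):

```julia
# Try the fast equal-shape path first: promote_shape throws a
# DimensionMismatch when the shapes are incompatible, in which case
# we fall back to genuine (slower) broadcasting.
function bit_elementwise(f, A::BitArray, B::BitArray)
    try
        promote_shape(size(A), size(B))
    catch
        return broadcast(f, A, B)   # shapes differ: broadcast for real
    end
    return map(f, A, B)             # equal shapes: fast chunk-wise path
end
```

For example, bit_elementwise(&, trues(2,1), trues(1,2)) takes the broadcasting branch and yields a 2x2 result.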

Example of timings for the last issue: here's a specialized version which doesn't truly broadcast:

julia> n1=110; n2=323; n3=681; a = randbool(n1,n2,n3); b = randbool(n1,n2,n3);

julia> @time a .* b; # not really broadcasting
elapsed time: 0.004324182 seconds (3024816 bytes allocated)

And here's one where it does:

julia> n1=110; n2=323; n3=681; a = randbool(n1,1,n3); b = randbool(1,n2,1);

julia> @time a .* b; # broadcasting
elapsed time: 1.087884438 seconds (3029848 bytes allocated)

julia> @time bitunpack(a) .* bitunpack(b); # broadcasting
elapsed time: 0.124664672 seconds (24273712 bytes allocated)

Similar timings are obtained for comparison operations like .==.

Part of the problem is that @inbounds is ineffective for BitArrays, but I don't think that's the whole story (even if it would be nice to have some way of making it work, cc @JeffBezanson ). Probably, with sufficient effort, one could come up with some (contrived) way out of this by avoiding multidimensional indexing for BitArrays, but even in that case it is hard to think of a solution for general AbstractArrays (e.g. comparing a BitArray with an Array).

@toivoh
Contributor

toivoh commented Mar 4, 2014

I'm not sure that I will recognize broadcast anymore now that @timholy
has redone it.

It would be nice to understand why it becomes slow when doing actual
broadcasting. Is it because it's not able to exploit the bit-twiddling
optimizations that are in place for the other cases?

Anyway, thanks for doing this!

@timholy
Sponsor Member

timholy commented Mar 4, 2014

First, these are some very nice fixes for addition of bools, etc.

The performance challenges are indeed unfortunate. I'm well aware that this is extremely challenging code to write, and I am amazed by the efficiency of what you've pulled off in other domains. But I can't help but think that because the arrays are traversed in order, there must be an efficient solution here too.

I confess I don't even understand the storage of BitArrays: are they always densely packed, or (e.g., in 2d) is each column aligned to the start of a new Uint64? If the latter is true, you could simply use a stride of 64 along dimension 1.

Presuming that they are densely packed, then a stride along dimension 1 won't work. In that case, I wonder about using the post-expression to do something like this (3d example) for each BitArray A:

Aindex = 1
bitbuffer = A.chunks[Aindex]
bitsremaining = 64 # or whatever is appropriate
newbit = bitbuffer & 0x1 # or whatever is appropriate
@nexprs N k->(bitpostadvance_k = size(A,k) > 1 && !any([size(A,j) for j in k+1:N] .== 1))
for i_3 = 1:sz_3
    # Standard pre-expression
    for i_2 = 1:sz_2
        # Standard pre-expression
        for i_1 = 1:sz_1
            # Standard pre-expression
            # body, which for the BitArray just uses newbit
            if bitpostadvance_1
                # do bit advance
            end
        end
        if bitpostadvance_2
            # do bit advance
        end
    end
    if bitpostadvance_3
        # do bit advance
    end
end

where bit advance means something like

if bitsremaining > 0
    bitbuffer >>= 1  # note the assignment; a bare bitbuffer >> 1 would discard the result
    bitsremaining -= 1
    newbit = bitbuffer & 0x1
else
    Aindex += 1
    bitbuffer = A.chunks[Aindex]
    bitsremaining = 64
end

This is still pretty vague in the details, but perhaps it's a viable direction?
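A simplified, runnable 1-D version of this chunk-walking idea (my sketch, not code from the PR; it relies on the internal chunks field of BitArray) could look like:

```julia
# Traverse a BitVector one bit at a time by shifting through its packed
# 64-bit chunks, instead of indexing B[i] (which recomputes the chunk
# index and bit offset on every access).
function collect_bits(B::BitVector)
    out = Vector{Bool}(undef, length(B))
    isempty(B) && return out
    chunkidx = 1
    buffer = B.chunks[chunkidx]      # bits are stored densely in UInt64 chunks
    bitsused = 0
    for i in 1:length(B)
        if bitsused == 64            # chunk exhausted: advance to the next one
            chunkidx += 1
            buffer = B.chunks[chunkidx]
            bitsused = 0
        end
        out[i] = (buffer & 0x1) == 0x1
        buffer >>= 1
        bitsused += 1
    end
    return out
end
```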

@timholy
Sponsor Member

timholy commented Mar 4, 2014

I should say that I'd also be comfortable with the notion of restricting broadcasting to two arrays when a BitArray is involved. You may need to write specialized versions

bcast(A::AbstractArray, B::BitArray)
bcast(A::BitArray, B::AbstractArray)
bcast(A::BitArray, B::BitArray)

and I'm not sure I'd wish it on anyone to have to generalize that to more inputs.

@timholy
Sponsor Member

timholy commented Mar 4, 2014

Come to think of it, there's an alternative to writing specialized versions. We could add one more element to the dictionary key: whether each relevant array (including the output) is a BitArray. You could pass that vector/tuple/whatever of flags to gen_broadcast_body, and generate any BitArray-specific code using the same strategy used for checkshape (i.e., outside the quote block).

@carlobaldassi
Member Author

Ok, I think I made it (see last commit). @timholy, the key observation was indeed that the arrays are traversed in order; I tried a plethora of different approaches, but in the end the fastest is to use start/next (together with a hierarchy of states), which also has the nice feature of being generic and keeping the code surprisingly clean.
Now in most cases the performance when using BitArrays is on par with that for Arrays, or at least not too different (always well within a factor of 2 in my tests, but I did not cover all cases for sure).
The only drawback I can think of is that in some sense this goes back to linear indexing, while I know there has been an effort to go full-cartesian. But that's actually a detail of how start/next are defined for AbstractArrays, so I don't think it should be much of a concern.

So, unless there are objections, I'm going to merge this tomorrow.
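(In today's Julia the start/next protocol corresponds to iterate.) Stripped of the shape-expansion bookkeeping that real broadcasting needs, the approach can be sketched as follows (names hypothetical):

```julia
# Drive an element-wise combine through the iteration protocol instead of
# cartesian indexing: each input advances its own iteration state, so a
# BitArray can carry chunk-level state rather than recompute bit offsets.
# Assumes dest, A and B all have the same length.
function iter_combine!(f, dest::AbstractArray, A::AbstractArray, B::AbstractArray)
    ita = iterate(A)
    itb = iterate(B)
    for i in eachindex(dest)
        a, sa = ita
        b, sb = itb
        dest[i] = f(a, b)
        ita = iterate(A, sa)
        itb = iterate(B, sb)
    end
    return dest
end

out = iter_combine!(&, falses(4), trues(4), BitVector([true, false, true, false]))
```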

@timholy
Sponsor Member

timholy commented Mar 7, 2014

I'm totally on board with this. For BitArrays, I agree with what I believe is your view, that views don't make sense. So linear indexing is fine.

@carlobaldassi
Member Author

Tim, I might have misinterpreted your last comment, but just to clarify: in the current version of this PR, the array type is not taken into account; the code is written generically for AbstractArray arguments, in terms of start/next. It just happens to be fast for both Arrays and BitArrays this way, so it basically delegates efficiency to the implementation of next. Note that all my attempts to adapt the iteration based on the concrete AbstractArray type resulted in massive performance penalties.
My comment on linear indexing was based on the fact that the current definition of next for AbstractArrays (also used by Arrays) is next(A, i) = (A[i], i+1). It looks like there is an issue for Array views, since this is less efficient than cartesian indexing. I wonder if it's possible to make next efficient for SubArrays.

So this PR is on hold again at the moment.

@carlobaldassi
Member Author

Update: I experimented a bit with SubArrays, and found out that start/next/done iteration could be made as efficient as cartesian nested loops, provided that

  1. Tuples become as efficient as immutable types
  2. Inlining is improved (or at least can be forced)

This is because we can use the index tuple as iteration state, and if everything is manually inlined it basically does the same thing as cartesian. See this gist: the last test function (manually inlined iteration based on custom immutable states) is as fast as the first (standard cartesian loops). Without manual inlining, the penalty is about a factor of 2.

In any case, it doesn't seem that this is feasible, right at this moment. Might be worth revisiting after some of the work which is being done about points 1 and 2 above gets merged.

Perhaps for the time being I could write specialized broadcast versions, e.g. using cartesian when all arguments are StridedArrays and start/next/done iteration otherwise, or something like that. I think that with some experiments I may be able to determine the best strategy in each case.

@carlobaldassi
Member Author

I forgot to add: the proof of principle that efficient tuples + more inlining could provide fast start/next/done iteration on SubArrays is of course totally general, i.e. there is actually nothing specific to SubArrays there. At the core, it's just calling A[I_1, I_2, ...] and using a fancy way of performing nested loops. So that could be used as a general fallback instead of the currently used linear indexing at some point.
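The index-tuple-as-state idea can be illustrated with a small 2-D sketch (hypothetical helpers, not the code from the gist):

```julia
# Advance a 2-D index tuple in column-major order; the tuple itself is the
# (immutable) iteration state, mimicking nested loops without generating
# per-dimension loop code.
next_index(I::NTuple{2,Int}, sz::NTuple{2,Int}) =
    I[1] < sz[1] ? (I[1] + 1, I[2]) : (1, I[2] + 1)

function sum_cartesian(A::AbstractMatrix)
    I = (1, 1)
    s = zero(eltype(A))
    for _ in 1:length(A)
        s += A[I...]                 # plain A[I_1, I_2] indexing
        I = next_index(I, size(A))
    end
    return s
end
```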

@timholy
Sponsor Member

timholy commented Mar 7, 2014

@carlobaldassi, see #3796 (comment)

@carlobaldassi
Member Author

Ah yes, that's it, I missed that, thanks for pointing it out. Those iteration functions are exactly what I also wrote when testing tuples (not shown in the gist). As I said, if you also substitute tuples with immutable types, the 2.5-fold performance penalty which you measured disappears completely.

@timholy
Sponsor Member

timholy commented Mar 7, 2014

Can you clarify what you mean? Do you mean the iterator state itself is an immutable? I was hoping we wouldn't have to generate a separate type for each dimensionality. But if needed, so be it.

@timholy
Sponsor Member

timholy commented Mar 7, 2014

Oh, and that's quite exciting about getting rid of the 2.5x penalty! Will be nice to finally have some good multidimensional iterators, once that inlining branch gets merged.

@carlobaldassi
Member Author

Yes, I got rid of the difference in performance by using an immutable iterator state, as in the gist, where they are generated up to dimension 10 (see the comment at the top to see what they end up looking like).

But I'd rather wait for Tuples to finally be treated like immutable types (which I believe is a long-standing issue) than introduce such utter ugliness into Base. Of course, others may feel differently, especially if that optimization is not in sight.

Commit messages:

  - For example, now [true].+[true] is the same as [true]+[true], i.e. [2]. As for Arrays, this is obtained by special-casing Bools.
  - renamed as gen_broadcast_function, since its purpose is different from the exported version
  - Closes #5877. This also specializes broadcast! when the target is a BitArray and adds an (unexported) bitbroadcast function which creates a BitArray target rather than an Array. Also, some comparison operations between AbstractArray{Bool} are specialized if possible.
  - also specialize Bool^Integer while at it
  - simply by removing a method!
  - introduces specialized versions of the broadcasting core which use start/next instead of cartesian iteration. Used when the arguments only involve Arrays or BitArrays.
@carlobaldassi
Member Author

My tests showed that as soon as a SubArray was involved, the cartesian iteration version performed best. So the last commit contains 4 versions of broadcast!, which dispatch according to (a) whether the return type is a BitArray, and (b) whether all arguments are Union(Array,BitArray). Having 4 different gen_broadcast_body versions is a bit unfortunate/inelegant, but that's the only way I could find to get decent performance in all situations.
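The four-way dispatch described above might be sketched as (names hypothetical; the real code generates full kernels rather than symbols):

```julia
# Select a broadcasting kernel based on (a) whether the destination is a
# BitArray and (b) whether every source is a plain Array or BitArray
# (fast linear iteration) versus some other AbstractArray (cartesian).
const FastLinear = Union{Array, BitArray}

pick_kernel(dest::BitArray, As::FastLinear...)    = :bit_linear
pick_kernel(dest::BitArray, As::AbstractArray...) = :bit_cartesian
pick_kernel(dest::Array,    As::FastLinear...)    = :array_linear
pick_kernel(dest::Array,    As::AbstractArray...) = :array_cartesian
```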

I also added bitbroadcast to the exports and added documentation.

I think this is ready for merging; I'll just wait for Travis to turn green.

carlobaldassi added a commit that referenced this pull request Mar 9, 2014
@carlobaldassi carlobaldassi merged commit 58065e2 into master Mar 9, 2014
@carlobaldassi carlobaldassi deleted the cb/bitbroadcast branch March 9, 2014 23:08
@timholy
Sponsor Member

timholy commented Sep 10, 2014

@carlobaldassi, I think given that the tuple stuff still hasn't materialized, I'd lobby strongly for inclusion of your gist in base rather than waiting for #6437. We can switch to using tuples when that's feasible.

Successfully merging this pull request may close these issues.

Comparison operators do not broadcast