
BitArray broadcasting #6035

Merged
merged 8 commits on Mar 9, 2014

Conversation

carlobaldassi
Member

This fixes #5877 and touches related issues, in that it makes element-wise comparisons (.==, .<, etc.) broadcast, and also completes broadcasting behaviour for all element-wise operations involving BitArrays (.*, ./, etc.), unless I missed something.

Besides that, there is some minor stuff in here, the most relevant being that it fixes .+ and .- for Bools, which now return Array{Int}. There's also a specialization for Bool^Int.
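As a quick illustration of these semantics (a sketch using current syntax, not code from the PR):

```julia
# Element-wise comparisons broadcast across mismatched shapes,
# and element-wise addition of Bools promotes to Int.
a = [true, false, true]      # 3-element vector
b = [true false true]        # 1x3 matrix

c = a .== b                  # broadcasts to a 3x3 matrix
s = [true] .+ [true]         # Bool .+ Bool returns Array{Int}: [2]
p = true^3                   # Bool^Integer specialization: true
```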

This is a pull request even though, in my view, it's ready for merging, mainly to highlight some issues I found. Here's a list of reasons why I didn't merge it directly:

  1. Since broadcast is @toivoh's (and others') creature, review is in order

  2. The code adds a specialization for broadcast!(f::Function, B::BitArray, ...) and also adds a new function, bitbroadcast, which is not exported: should it be exported?

  3. Performance is good, unless broadcasting is actually needed (i.e. broadcast_shape yields something different from promote_shape) and BitArrays are involved in the argument list.

    3a. In the case promote_shape succeeds, I added fallbacks which make things efficient. One possibly questionable issue is my use of try blocks for that purpose.

    3b. In case it fails, and BitArrays are involved (e.g. trues(5) .+ falses(1,5)), cartesian iteration kills performance, to the point that it is much faster to invoke bitunpack on the arguments first; however, that may clearly become very memory-expensive. I haven't been able to overcome this issue yet, so advice would be welcome on this. (One might imagine unpacking below a certain size, but that is not a very satisfactory solution.)
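The fallback mentioned in point 3a might be sketched as follows (the function name bit_elementwise is hypothetical; the actual code lives inside the broadcast machinery):

```julia
# Try the fast equal-shape path first: promote_shape throws a
# DimensionMismatch when the shapes are incompatible, in which case
# we fall back to genuine (slower) broadcasting.
function bit_elementwise(f, A::BitArray, B::BitArray)
    try
        promote_shape(size(A), size(B))
    catch
        return broadcast(f, A, B)   # shapes differ: broadcast for real
    end
    return map(f, A, B)             # equal shapes: fast chunk-wise path
end
```

For example, bit_elementwise(&, trues(2,1), trues(1,2)) takes the broadcasting branch and yields a 2x2 result.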

Example of timings for the last issue: here's a specialized version which doesn't truly broadcast:

julia> n1=110; n2=323; n3=681; a = randbool(n1,n2,n3); b = randbool(n1,n2,n3);

julia> @time a .* b; # not really broadcasting
elapsed time: 0.004324182 seconds (3024816 bytes allocated)

And here's one where it does:

julia> n1=110; n2=323; n3=681; a = randbool(n1,1,n3); b = randbool(1,n2,1);

julia> @time a .* b; # broadcasting
elapsed time: 1.087884438 seconds (3029848 bytes allocated)

julia> @time bitunpack(a) .* bitunpack(b); # broadcasting
elapsed time: 0.124664672 seconds (24273712 bytes allocated)

Similar timings are obtained for comparison operations like .==.

Part of the problem is that @inbounds is ineffective for BitArrays, but I don't think that's the whole story (even if it would be nice to have some way of making it work, cc @JeffBezanson ). Probably, with sufficient effort, one could come up with some (contrived) way out of this by avoiding multidimensional indexing for BitArrays, but even in that case it is hard to think of a solution for general AbstractArrays (e.g. comparing a BitArray with an Array).

@toivoh
Contributor

toivoh commented Mar 4, 2014

I'm not sure that I will recognize broadcast anymore now that @timholy
has redone it.

It would be nice to understand why it becomes slow when doing actual
broadcasting. Is it because it's not able to exploit the bit-twiddling
optimizations that are in place for the other cases?

Anyway, thanks for doing this!

@timholy
Sponsor Member

timholy commented Mar 4, 2014

First, these are some very nice fixes for addition of bools, etc.

The performance challenges are indeed unfortunate. I'm well aware that this is extremely challenging code to write, and I am amazed by the efficiency of what you've pulled off in other domains. But I can't help but think that because the arrays are traversed in order, there must be an efficient solution here too.

I confess I don't even understand the storage of BitArrays: are they always densely packed, or (e.g., in 2d) is each column aligned to the start of a new Uint64? If the latter is true, you could simply use a stride of 64 along dimension 1.

Presuming that they are densely packed, then a stride along dimension 1 won't work. In that case, I wonder about using the post-expression to do something like this (3d example) for each BitArray A:

Aindex = 1
bitbuffer = A.chunks[Aindex]
bitsremaining = 64 # or whatever is appropriate
newbit = bitbuffer & 0x1 # or whatever is appropriate
@nexprs N k->(bitpostadvance_k = size(A,k) > 1 && !any([size(A,j) for j in k+1:N] .== 1))
for i_3 = 1:sz_3
    # Standard pre-expression
    for i_2 = 1:sz_2
        # Standard pre-expression
        for i_1 = 1:sz_1
            # Standard pre-expression
            # body, which for the BitArray just uses newbit
            if bitpostadvance_1
                # do bit advance
            end
        end
        if bitpostadvance_2
            # do bit advance
        end
    end
    if bitpostadvance_3
        # do bit advance
    end
end

where bit advance means something like

if bitsremaining > 0
    bitbuffer >>= 1  # note the assignment; a bare bitbuffer >> 1 would discard the result
    bitsremaining -= 1
    newbit = bitbuffer & 0x1
else
    Aindex += 1
    bitbuffer = A.chunks[Aindex]
    bitsremaining = 64
end

This is still pretty vague in the details, but perhaps it's a viable direction?
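A simplified, runnable 1-D version of this chunk-walking idea (my sketch, not code from the PR; it relies on the internal chunks field of BitArray) could look like:

```julia
# Traverse a BitVector one bit at a time by shifting through its packed
# 64-bit chunks, instead of indexing B[i] (which recomputes the chunk
# index and bit offset on every access).
function collect_bits(B::BitVector)
    out = Vector{Bool}(undef, length(B))
    isempty(B) && return out
    chunkidx = 1
    buffer = B.chunks[chunkidx]      # bits are stored densely in UInt64 chunks
    bitsused = 0
    for i in 1:length(B)
        if bitsused == 64            # chunk exhausted: advance to the next one
            chunkidx += 1
            buffer = B.chunks[chunkidx]
            bitsused = 0
        end
        out[i] = (buffer & 0x1) == 0x1
        buffer >>= 1
        bitsused += 1
    end
    return out
end
```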

@timholy
Sponsor Member

timholy commented Mar 4, 2014

I should say that I'd also be comfortable with the notion of restricting broadcasting to two arrays when a BitArray is involved. You may need to write specialized versions

bcast(A::AbstractArray, B::BitArray)
bcast(A::BitArray, B::AbstractArray)
bcast(A::BitArray, B::BitArray)

and I'm not sure I'd wish it on anyone to have to generalize that to more inputs.

@timholy
Sponsor Member

timholy commented Mar 4, 2014

Come to think of it, there's an alternative to writing specialized versions. We could add one more element to the dictionary key: whether each relevant array (including the output) is a BitArray. You could pass that vector/tuple/whatever of flags to gen_broadcast_body, and generate any BitArray-specific code using the same strategy used for checkshape (i.e., outside the quote block).

@carlobaldassi
Member Author

Ok, I think I made it (see last commit). @timholy, the key observation was indeed that the arrays are traversed in order; I tried a plethora of different approaches, but in the end the fastest is to use start/next (together with a hierarchy of states), which also has the nice feature of being generic and keeping the code surprisingly clean.
Now in most cases the performance when using BitArrays is on par with that for Arrays, or at least not too different (always well within a factor of 2 in my tests, but I did not cover all cases for sure).
The only drawback I can think of is that in some sense this goes back to linear indexing, while I know there has been an effort to go full-cartesian. But that's actually a detail of how start/next are defined for AbstractArrays, so I don't think it should be much of a concern.

So, unless there are objections, I'm going to merge this tomorrow.
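(In today's Julia the start/next protocol corresponds to iterate.) Stripped of the shape-expansion bookkeeping that real broadcasting needs, the approach can be sketched as follows (names hypothetical):

```julia
# Drive an element-wise combine through the iteration protocol instead of
# cartesian indexing: each input advances its own iteration state, so a
# BitArray can carry chunk-level state rather than recompute bit offsets.
# Assumes dest, A and B all have the same length.
function iter_combine!(f, dest::AbstractArray, A::AbstractArray, B::AbstractArray)
    ita = iterate(A)
    itb = iterate(B)
    for i in eachindex(dest)
        a, sa = ita
        b, sb = itb
        dest[i] = f(a, b)
        ita = iterate(A, sa)
        itb = iterate(B, sb)
    end
    return dest
end

out = iter_combine!(&, falses(4), trues(4), BitVector([true, false, true, false]))
```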

@timholy
Sponsor Member

timholy commented Mar 7, 2014

I'm totally on board with this. For BitArrays, I agree with what I believe is your view, that views don't make sense. So linear indexing is fine.

@carlobaldassi
Member Author

Tim, I might have misinterpreted your last comment, but just to clarify: in the current version of this PR, the array type is not taken into account; the code is written generically for AbstractArray arguments, in terms of start/next. It just happens to be fast for both Arrays and BitArrays this way, so it basically delegates efficiency to the implementation of next. Note that all my attempts to adapt the iteration based on the concrete AbstractArray type resulted in massive performance penalties.
My comment on linear indexing was based on the fact that the current definition of next for AbstractArrays (also used by Arrays) is next(A, i) = (A[i], i+1). It looks like there is an issue for Array views, since this is less efficient than cartesian indexing. I wonder if it's possible to make next efficient for SubArrays.

So this PR is on hold again at the moment.

@carlobaldassi
Member Author

Update: I experimented a bit with SubArrays, and found out that start/next/done iteration could be made as efficient as cartesian nested loops, provided that

  1. Tuples become as efficient as immutable types
  2. Inlining is improved (or at least can be forced)

This is because we can use the index tuple as iteration state, and if everything is manually inlined it basically does the same thing as cartesian. See this gist: the last test function (manually inlined iteration based on custom immutable states) is as fast as the first (standard cartesian loops). Without manual inlining, the penalty is about a factor of 2.

In any case, it doesn't seem that this is feasible, right at this moment. Might be worth revisiting after some of the work which is being done about points 1 and 2 above gets merged.

Perhaps for the time being I could write specialized broadcast versions, e.g. using cartesian when all arguments are StridedArrays and start/next/done iteration otherwise, or something like that. I think that with some experiments I may be able to determine the best strategy in each case.

@carlobaldassi
Member Author

I forgot to add: the proof of principle that efficient tuples + more inlining could provide fast start/next/done iteration on SubArrays is of course totally general, i.e. there is actually nothing specific to SubArrays there. At the core, it's just calling A[I_1, I_2, ...] and using a fancy way of performing nested loops. So that could be used as a general fallback instead of the currently used linear indexing at some point.
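The index-tuple-as-state idea can be illustrated with a small 2-D sketch (hypothetical helpers, not the code from the gist):

```julia
# Advance a 2-D index tuple in column-major order; the tuple itself is the
# (immutable) iteration state, mimicking nested loops without generating
# per-dimension loop code.
next_index(I::NTuple{2,Int}, sz::NTuple{2,Int}) =
    I[1] < sz[1] ? (I[1] + 1, I[2]) : (1, I[2] + 1)

function sum_cartesian(A::AbstractMatrix)
    I = (1, 1)
    s = zero(eltype(A))
    for _ in 1:length(A)
        s += A[I...]                 # plain A[I_1, I_2] indexing
        I = next_index(I, size(A))
    end
    return s
end
```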

@timholy
Sponsor Member

timholy commented Mar 7, 2014

@carlobaldassi, see #3796 (comment)

@carlobaldassi
Member Author

Ah yes, that's it, I missed that, thanks for pointing it out. Those iteration functions are exactly what I also wrote when testing tuples (not shown in the gist). As I said, if you also substitute tuples with immutable types, the 2.5-fold performance penalty which you measured disappears completely.

@timholy
Sponsor Member

timholy commented Mar 7, 2014

Can you clarify what you mean? Do you mean the iterator state itself is an immutable? I was hoping we wouldn't have to generate a separate type for each dimensionality. But if needed, so be it.

@timholy
Sponsor Member

timholy commented Mar 7, 2014

Oh, and that's quite exciting about getting rid of the 2.5x penalty! Will be nice to finally have some good multidimensional iterators, once that inlining branch gets merged.

@carlobaldassi
Member Author

Yes, I got rid of the difference in performance by using an immutable iterator state, as in the gist, where they are generated up to dimension 10 (see the comment at the top to see what they end up looking like).

But I'd rather wait for Tuples to finally be treated like immutable types (which I believe is a long-standing issue) than introduce such utter ugliness into Base. Of course, others may feel differently, especially if that optimization is not in sight.

Commit messages:

  - For example, now [true].+[true] is the same as [true]+[true], i.e. [2]. As for Arrays, this is obtained by special-casing Bools.
  - renamed as gen_broadcast_function, since its purpose is different from the exported version
  - Closes #5877. This also specializes broadcast! when the target is a BitArray and adds an (unexported) bitbroadcast function which creates a BitArray target rather than an Array. Also, some comparison operations between AbstractArray{Bool} are specialized if possible.
  - also specialize Bool^Integer while at it
  - simply by removing a method!
  - introduces specialized versions of the broadcasting core which use start/next instead of cartesian iteration. Used when the arguments only involve Arrays or BitArrays.
@carlobaldassi
Member Author

My tests showed that as soon as a SubArray was involved, the cartesian iteration version performed best. So the last commit contains 4 versions of broadcast!, which dispatch according to (a) whether the return type is a BitArray, and (b) whether all arguments are Union(Array,BitArray). Having 4 different gen_broadcast_body versions is a bit unfortunate/inelegant, but that's the only way I could find to get decent performance in all situations.
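The four-way dispatch described above might be sketched as (names hypothetical; the real code generates full kernels rather than symbols):

```julia
# Select a broadcasting kernel based on (a) whether the destination is a
# BitArray and (b) whether every source is a plain Array or BitArray
# (fast linear iteration) versus some other AbstractArray (cartesian).
const FastLinear = Union{Array, BitArray}

pick_kernel(dest::BitArray, As::FastLinear...)    = :bit_linear
pick_kernel(dest::BitArray, As::AbstractArray...) = :bit_cartesian
pick_kernel(dest::Array,    As::FastLinear...)    = :array_linear
pick_kernel(dest::Array,    As::AbstractArray...) = :array_cartesian
```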

I also added bitbroadcast to the exports and added documentation.

I think this is ready for merging; I'll just wait for Travis to turn green.

carlobaldassi added a commit that referenced this pull request Mar 9, 2014
@carlobaldassi carlobaldassi merged commit 58065e2 into master Mar 9, 2014
@carlobaldassi carlobaldassi deleted the cb/bitbroadcast branch March 9, 2014 23:08
@timholy
Sponsor Member

timholy commented Sep 10, 2014

@carlobaldassi, I think given that the tuple stuff still hasn't materialized, I'd lobby strongly for inclusion of your gist in base rather than waiting for #6437. We can switch to using tuples when that's feasible.

Successfully merging this pull request may close these issues.

Comparison operators do not broadcast