Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

count(f, x) should not be equivalent to sum(f, x) #20404

Closed
stevengj opened this issue Feb 2, 2017 · 22 comments
Closed

count(f, x) should not be equivalent to sum(f, x) #20404

stevengj opened this issue Feb 2, 2017 · 22 comments
Labels
collections Data structures holding multiple items, e.g. sets

Comments

@stevengj
Copy link
Member

stevengj commented Feb 2, 2017

I took a look at the count implementation for #20403 (cc @cossio), and was surprised to discover that the current definition is nearly identical to sum (albeit with a type instability):

function count(pred, itr)
    n = 0
    for x in itr
        n += pred(x)
    end
    return n
end

For example, count(sqrt, [1,2,3,5]) returns 6.382332347441762.

We could disallow this case entirely, but one possibility would be to make count(f, x) equivalent to countnz for non-boolean data. Indeed, I'm not sure why we have both countnz and count — couldn't we merge the two functions into a single count function?

@stevengj stevengj added the collections Data structures holding multiple items, e.g. sets label Feb 2, 2017
@stevengj
Copy link
Member Author

stevengj commented Feb 2, 2017

i.e. define

iszero(b::Bool) = !b # maybe more efficient than the fallback b == zero(b) definition
function count{F}(pred::F, itr)
    n = 0
    for x in itr
        n += !iszero(pred(x))
    end
    return n
end
count(itr) = count(identity, itr)
@deprecate countnz(itr) count(itr)

@nalimilan
Copy link
Member

+1, that would be more consistent with find.

@ararslan
Copy link
Member

ararslan commented Feb 2, 2017

I think I'd prefer something like

function count(f, itr)
    n = 0
    for x in itr
        if x
            n += 1
        end
    end
    return n
end
count(itr) = count(identity, itr)
@deprecate countnz(itr) count(!iszero, itr)

This has the benefit of throwing an error for non-boolean data when using count directly.

@stevengj
Copy link
Member Author

stevengj commented Feb 2, 2017

@ararslan, that might actually be slower because it forces a branch. One could do n += f(x)::Bool, I suppose.

However, I'd prefer to have one function that does more rather than restrict the functionality to Bool here… I'm not sure that throwing an error for non-boolean data is a feature. And, as @nalimilan says, this behavior is consistent with the find functions.

stevengj added a commit that referenced this issue Feb 2, 2017
@ararslan
Copy link
Member

ararslan commented Feb 2, 2017

I'm not sure that throwing an error for non-boolean data is a feature

I disagree because without a predicate for non-boolean data, the name count doesn't mean much. What is it counting? I have similar issues with find, and IIRC there's a proposal somewhere to improve the consistency and clarity of find and related functions.

Having f(x)::Bool seems fine to me, especially if a branch is costly, though I'd be curious to see whether the branch makes an appreciable difference in benchmarks.

@stevengj
Copy link
Member Author

stevengj commented Feb 2, 2017

@ararslan, your version with the branch (corrected to call if f(x)) is more than 2x slower than Base.count for @btime count(iseven, 1:10^6) evals=1.

@ararslan
Copy link
Member

ararslan commented Feb 2, 2017

Dang. f(x)::Bool it is, then, in my proposal.

@nalimilan
Copy link
Member

I disagree because without a predicate for non-boolean data, the name count doesn't mean much. What is it counting? I have similar issues with find, and IIRC there's a proposal somewhere to improve the consistency and clarity of find and related functions.

That's https://github.com/JuliaLang/Juleps/blob/master/Find.md, but it doesn't really address the question of whether find should return non-zero entries or true entries. I agree the former behavior can be surprising at first, but it's also more general and potentially useful, so why not keep it?

@ararslan
Copy link
Member

ararslan commented Feb 2, 2017

Because it's unclear. I think being more explicit and saying count(!iszero, x) or find(!iszero, x) makes the intention completely obvious. IMO the one-argument versions should stick to boolean data, where there's no room for confusion, and the two-argument version gives you the ability to supply any kind of data with an appropriate predicate. I think we should be striving for clarity here rather than terseness.

@nalimilan
Copy link
Member

Now that we can do !iszero, I'd tend to agree with you.

@stevengj
Copy link
Member Author

stevengj commented Feb 2, 2017

Probably should continue to have a countnz function if the alternative is to require everyone to do count(!iszero, x). e.g. we want specialized versions for sparse arrays, and I don't know if that is possible with !iszero, because the ! will generate a specialized anonymous function for each call and hence defining a specialized count(::typeof(!iszero), x) method won't work. Of course, we could define a specialized !(::typeof(iszero)) = notiszero too, with notiszero(x) = !iszero(x), I guess, but that seems to be getting messy (and is harder for external packages to overload).

@ararslan
Copy link
Member

ararslan commented Feb 2, 2017

For sparse there's nnz, which should arguably be renamed, though that's somewhat tangential. I did a bit of playing around and it seems that you actually can specialize on typeof(!iszero), which is super cool. So count(::typeof(!iszero), x) is a valid overloadable method.

@tkelman
Copy link
Contributor

tkelman commented Feb 2, 2017

There's an important difference between nnz and countnz though, the former is a structural property whereas the latter tests for and does not include stored zeros. I can't remember if there's a good reason countnz and count are separate though.

@bramtayl
Copy link
Contributor

bramtayl commented Feb 3, 2017

I'd imagine most uses of count would be for the boolean case, in which case sum works just fine. If someone wants the number of non-zero values, why not just sum a generator: sum(!iszero(x) for x in X)?

@nalimilan
Copy link
Member

@ararslan Why do you think having count even if it does the same thing as sum but only for boolean is a good idea? For safety?

@cossio
Copy link
Contributor

cossio commented Feb 3, 2017

@nalimilan Good point. I think count only makes sense if it does something different. (And the difference cannot be that count throws an error with non-booleans while sum doesn't).

@ararslan
Copy link
Member

ararslan commented Feb 3, 2017

@nalimilan I think it's a good idea because it provides both clarity and safety in terms of knowing exactly what you're getting. I think counting nonzeros is a rather odd and surprising default. I assume the reason that we don't currently have a one-argument count method is that it's equivalent to sum in that case, but should we make Bool not a subtype of Number (see #19168), it makes more sense to count non-numbers than to sum them, even if we do still define arithmetic on Bools.

@fp4code
Copy link
Contributor

fp4code commented Feb 19, 2017

To solve the speed point, just add @simd in front of the loop:

function fast_count(pred, itr)
    n = 0
    @simd for x in itr
        n += pred(x)::Bool
    end
    return n
end
a = randn(1000000)
@benchmark fast_count(x->x>1.96, a) # median time:      535.944 μs (0.00% GC)
@benchmark sum(x->x>1.96, a)        # median time:      565.832 μs (0.00% GC)
@benchmark count(x->x>1.96, a)      # median time:        1.370 ms (0.00% GC)

@nalimilan
Copy link
Member

I wouldn't have expected @simd to make a difference here. I thought it mainly allowed floating point optimizations?

@KristofferC
Copy link
Sponsor Member

There are many SIMD instructions for integers.

@nalimilan
Copy link
Member

Yes, but I would expect them to be enabled by default. Do they change the behavior of the code?

@KristofferC
Copy link
Sponsor Member

KristofferC commented Feb 19, 2017

Perhaps some aliasing checks are turned off. I'm not sure, I don't get SIMD without the macro at least.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
collections Data structures holding multiple items, e.g. sets
Projects
None yet
Development

No branches or pull requests

8 participants