Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add insorted for in function in sorted arrays #37490

Merged
merged 16 commits into from
Sep 28, 2020
Merged

Add insorted for in function in sorted arrays #37490

merged 16 commits into from
Sep 28, 2020

Conversation

remi-garcia
Copy link
Contributor

@remi-garcia remi-garcia commented Sep 9, 2020

Closes #37442
The purpose is to add a function that determine whether an item is in the given sorted collection.

EDIT: Usage:

julia> insorted(4, [1, 2, 4, 5, 5, 7]) # single match
true
julia> insorted(5, [1, 2, 4, 5, 5, 7]) # multiple matches
true
julia> insorted(3, [1, 2, 4, 5, 5, 7]) # no match
false
julia> insorted(9, [1, 2, 4, 5, 5, 7]) # no match
false
julia> insorted(0, [1, 2, 4, 5, 5, 7]) # no match
false

Implemented in a similar way as searchsorted. Left a TODO in insorted(a::AbstractRange{<:Real}, x::Real, o::DirectOrdering)::Bool

@remi-garcia remi-garcia changed the title Add insorted for in function in sorted arrays Add insorted for in function in sorted arrays Sep 9, 2020
@jw3126
Copy link
Contributor

jw3126 commented Sep 11, 2020

Thanks a lot! It would be nice to have such a function! This PR adds a lot of code though, maybe that can be reduced.

function insorted(x, coll; kw...)
    !isempty(searchsorted(coll, x; kw...))
end
  • There is are already optimized in methods for ranges. If these are faster then the above for ranges, we could just call them.

base/sort.jl Outdated Show resolved Hide resolved
@remi-garcia
Copy link
Contributor Author

* Can this maybe share more code with `searchsorted`? Following @goretkin [#37442 (comment)](https://github.com/JuliaLang/julia/issues/37442#issuecomment-689248118)
function insorted(x, coll; kw...)
    !isempty(searchsorted(coll, x; kw...))
end

I'll report this possibility on benchmarks. It should be close because intuitively O(insorted)=O(searchsorted) in both cases. Yet I tried to avoid using length(coll) since, if I remember well, it is in O(n).

* There is are already optimized `in` methods for ranges. If these are faster then the above for ranges, we could just call them.

To be honnest I didn't check the source code of in for ranges. But, now that got me thinking, why there is a searchsorted for ranges while find should be as efficient. Ranges are not always sorted?

@jw3126
Copy link
Contributor

jw3126 commented Sep 13, 2020

I'll report this possibility on benchmarks.

Awesome!

Yet I tried to avoid using length(coll) since, if I remember well, it is in O(n).

It of depends on the collection, but usually either a collection does not implement length at all or length is fast.
If we have slow/no length, then I don't think we can implement a fast insorted anyway. At least not in a generic way, maybe in a specialized way for a collection at hand.
I think the right way is to have a short insorted method as above. And if it turns out to be too slow on some collection, prefer adding faster searchsorted methods over adding insorted methods.

But, now that got me thinking, why there is a searchsorted for ranges while find should be as efficient.

The main reason is that a code that uses searchsorted does not crash, when you feed it a range.

Ranges are not always sorted?

Yes they are. At least by the standard <. They are not sorted e.g. by=sin. Anyway if searchsorted (with standard order) is not superfast on a range, that's a bug.

@remi-garcia
Copy link
Contributor Author

Julia 1.0.5:

using Base.Order

midpoint(lo::T, hi::T) where T<:Integer = lo + ((hi - lo) >>> 0x01)
midpoint(lo::Integer, hi::Integer) = midpoint(promote(lo, hi)...)

function _insortedfirst(v::AbstractVector, x, lo::T, hi::T, o::Ordering)::Bool where T<:Integer
    if lt(o, v[hi], x)
        return false
    end
    u = T(1)
    lo = lo - u
    hi = hi + u
    @inbounds while lo < hi - u
        m = midpoint(lo, hi)
        if lt(o, v[m], x)
            lo = m
        else
            hi = m
        end
    end
    if v[hi] == x
        return true
    end
    return false
end

function _insortedlast(v::AbstractVector, x, lo::T, hi::T, o::Ordering)::Bool where T<:Integer
    if lt(o, x, v[lo])
        return false
    end
    u = T(1)
    lo = lo - u
    hi = hi + u
    @inbounds while lo < hi - u
        m = midpoint(lo, hi)
        if lt(o, x, v[m])
            hi = m
        else
            lo = m
        end
    end
    if v[lo] == x
        return true
    end
    return false
end

function insorted(v::AbstractVector, x, ilo::T, ihi::T, o::Ordering)::Bool where T<:Integer
    if isempty(v)
        return false
    elseif lt(o, v[ihi], x)
        return false
    end
    u = T(1)
    lo = ilo - u
    hi = ihi + u
    @inbounds while lo < hi - u
        m = midpoint(lo, hi)
        if lt(o, v[m], x)
            lo = m
        elseif lt(o, x, v[m])
            hi = m
        else
            return _insortedfirst(v, x, max(lo,ilo), m, o) || _insortedlast(v, x, m, min(hi,ihi), o)
        end
    end
    if hi <= ihi && v[hi] == x
        return true
    end
    return false
end

function other_insorted(v::AbstractVector, x, ilo::T, ihi::T, o::Ordering)::Bool where T<:Integer
    r = searchsorted(v, x)
    return length(r) > 0
end

for s in [:insorted, :other_insorted]
    @eval begin
        $s(v::AbstractVector, x, o::Ordering) = (inds = axes(v, 1); $s(v,x,first(inds),last(inds),o))
        $s(v::AbstractVector, x;
           lt=isless, by=identity, rev::Union{Bool,Nothing}=nothing, order::Ordering=Forward) =
            $s(v,x,ord(lt,by,rev,order))
    end
end

using BenchmarkTools

@benchmark:

julia> @benchmark insorted([1:5; 1:5; 1:5], 1, 6, 10, Forward)
BenchmarkTools.Trial: 
  memory estimate:  208 bytes
  allocs estimate:  1
  --------------
  minimum time:     66.106 ns (0.00% GC)
  median time:      67.544 ns (0.00% GC)
  mean time:        80.966 ns (11.95% GC)
  maximum time:     47.349 μs (99.72% GC)
  --------------
  samples:          10000
  evals/sample:     979

julia> @benchmark other_insorted([1:5; 1:5; 1:5], 1, 6, 10, Forward)
BenchmarkTools.Trial: 
  memory estimate:  208 bytes
  allocs estimate:  1
  --------------
  minimum time:     70.034 ns (0.00% GC)
  median time:      70.964 ns (0.00% GC)
  mean time:        84.005 ns (11.40% GC)
  maximum time:     46.641 μs (99.75% GC)
  --------------
  samples:          10000
  evals/sample:     976

julia> @benchmark insorted([1,2,3], 0)
BenchmarkTools.Trial: 
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     45.492 ns (0.00% GC)
  median time:      46.430 ns (0.00% GC)
  mean time:        58.798 ns (17.14% GC)
  maximum time:     47.850 μs (99.82% GC)
  --------------
  samples:          10000
  evals/sample:     990

julia> @benchmark other_insorted([1,2,3], 0)
BenchmarkTools.Trial: 
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     45.584 ns (0.00% GC)
  median time:      46.561 ns (0.00% GC)
  mean time:        59.128 ns (17.20% GC)
  maximum time:     46.606 μs (99.75% GC)
  --------------
  samples:          10000
  evals/sample:     990

julia> @benchmark insorted([1,2,3], 2)
BenchmarkTools.Trial: 
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     47.190 ns (0.00% GC)
  median time:      48.155 ns (0.00% GC)
  mean time:        62.074 ns (17.19% GC)
  maximum time:     48.128 μs (99.82% GC)
  --------------
  samples:          10000
  evals/sample:     988

julia> @benchmark other_insorted([1,2,3], 2)
BenchmarkTools.Trial: 
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     48.591 ns (0.00% GC)
  median time:      49.597 ns (0.00% GC)
  mean time:        62.978 ns (17.44% GC)
  maximum time:     50.001 μs (99.83% GC)
  --------------
  samples:          10000
  evals/sample:     988

julia> @benchmark insorted([1,2,3], 4)
BenchmarkTools.Trial: 
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     43.004 ns (0.00% GC)
  median time:      43.998 ns (0.00% GC)
  mean time:        57.017 ns (19.12% GC)
  maximum time:     47.788 μs (99.85% GC)
  --------------
  samples:          10000
  evals/sample:     991

julia> @benchmark other_insorted([1,2,3], 4)
BenchmarkTools.Trial: 
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     45.670 ns (0.00% GC)
  median time:      46.598 ns (0.00% GC)
  mean time:        59.793 ns (18.05% GC)
  maximum time:     46.788 μs (99.80% GC)
  --------------
  samples:          10000
  evals/sample:     990

Using a dedicated insorted implementation or searchsorted (other_insorted) does not make a big change for AbstractVector. Should I remove the dedicated function?

I checked, for ranges, insorted could be equal to in.

@jw3126
Copy link
Contributor

jw3126 commented Sep 14, 2020

I think this PR introduces a lot of code, not all of which is really needed. So I would approach it like this. Take a step back and replace all the code by just:

function insorted(x, coll; kw...)
    !isempty(searchsorted(coll, x; kw...))
end

Is it missing features? Is it slow in some cases? Is it buggy?
If the answer is yes to one of these, we should think about how to fix it.

@remi-garcia
Copy link
Contributor Author

Ok, here are some @benchmark with the simplified insorted:

julia> @benchmark insorted(1, coll) setup=(coll=sort(rand(1:100, 50)))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.659 ns (0.00% GC)
  median time:      13.394 ns (0.00% GC)
  mean time:        14.758 ns (0.00% GC)
  maximum time:     57.431 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark in(1, coll) setup=(coll=sort(rand(1:100, 50)))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.641 ns (0.00% GC)
  median time:      30.255 ns (0.00% GC)
  mean time:        19.357 ns (0.00% GC)
  maximum time:     68.061 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark insorted(100, coll) setup=(coll=sort(rand(1:100, 50)))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.492 ns (0.00% GC)
  median time:      10.973 ns (0.00% GC)
  mean time:        12.726 ns (0.00% GC)
  maximum time:     43.886 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark in(100, coll) setup=(coll=sort(rand(1:100, 50)))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     34.232 ns (0.00% GC)
  median time:      43.665 ns (0.00% GC)
  mean time:        44.449 ns (0.00% GC)
  maximum time:     101.706 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     994

julia> @benchmark insorted(30, coll) setup=(coll=sort(rand(1:100, 50)))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.148 ns (0.00% GC)
  median time:      15.348 ns (0.00% GC)
  mean time:        15.536 ns (0.00% GC)
  maximum time:     37.629 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark in(30, coll) setup=(coll=sort(rand(1:100, 50)))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.726 ns (0.00% GC)
  median time:      30.250 ns (0.00% GC)
  mean time:        22.406 ns (0.00% GC)
  maximum time:     57.277 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark insorted(60, coll) setup=(coll=sort(rand(1:100, 50)))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.010 ns (0.00% GC)
  median time:      14.994 ns (0.00% GC)
  mean time:        15.677 ns (0.00% GC)
  maximum time:     43.850 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark in(60, coll) setup=(coll=sort(rand(1:100, 50)))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.078 ns (0.00% GC)
  median time:      44.064 ns (0.00% GC)
  mean time:        39.295 ns (0.00% GC)
  maximum time:     77.879 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

For ranges it is slower:

julia> @benchmark insorted(1, coll) setup=(coll=rand(1:100):rand(1:100))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.176 ns (0.00% GC)
  median time:      9.247 ns (0.00% GC)
  mean time:        9.483 ns (0.00% GC)
  maximum time:     24.069 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark in(1, coll) setup=(coll=rand(1:100):rand(1:100))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.208 ns (0.00% GC)
  median time:      2.217 ns (0.00% GC)
  mean time:        2.225 ns (0.00% GC)
  maximum time:     12.624 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark insorted(100, coll) setup=(coll=rand(1:100):rand(1:100))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.752 ns (0.00% GC)
  median time:      10.023 ns (0.00% GC)
  mean time:        10.087 ns (0.00% GC)
  maximum time:     38.033 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark in(100, coll) setup=(coll=rand(1:100):rand(1:100))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.208 ns (0.00% GC)
  median time:      2.219 ns (0.00% GC)
  mean time:        2.236 ns (0.00% GC)
  maximum time:     34.658 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark insorted(50, coll) setup=(coll=rand(1:100):rand(1:100))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.745 ns (0.00% GC)
  median time:      9.210 ns (0.00% GC)
  mean time:        9.273 ns (0.00% GC)
  maximum time:     37.059 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark in(50, coll) setup=(coll=rand(1:100):rand(1:100))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.207 ns (0.00% GC)
  median time:      2.218 ns (0.00% GC)
  mean time:        2.225 ns (0.00% GC)
  maximum time:     30.371 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

I suggest to keep insorted = in for AbstractRange.

@jw3126
Copy link
Contributor

jw3126 commented Sep 14, 2020

Thanks for taking a look at ranges. I think for arrays insorted will really begin to shine for larger arrays. Say > 100 elements.
insorted doing binary search has more branches then in doing a for loop. And for small array sizes the former cost can matter.
So how about this:

"""
some docs....
"""
function insorted end
insorted(x, coll; kw...) = !isempty(searchsorted(coll, x; kw...))
insorted(x, coll::AbstractRange; kw...) = in(x, coll)

@remi-garcia
Copy link
Contributor Author

I just pushed the last changes that roughly correspond to what you suggested 👍🏼

base/sort.jl Outdated Show resolved Hide resolved
base/sort.jl Outdated Show resolved Hide resolved
base/sort.jl Outdated Show resolved Hide resolved
base/sort.jl Show resolved Hide resolved
Copy link
Contributor

@jw3126 jw3126 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Needs also an entry in News.md.

Copy link
Contributor

@jw3126 jw3126 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks a lot @remi-garcia !

@remi-garcia
Copy link
Contributor Author

I just removed tests that called the "old" insorted(x,v,lo,hi,o)` and fixed a typo. Should be working fine now!
Thanks to you @jw3126 for guiding me through this PR! 👍

base/sort.jl Outdated Show resolved Hide resolved
@remi-garcia
Copy link
Contributor Author

I don't get why the buildbot is so "mad" at me. Is that normal or I'm missing something I should fix?

Copy link
Member

@dkarrasch dkarrasch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can click on the "details" button behind the individual CI tests, and check the stack trace of error messages, just like in the normal REPL. Then you can start fixing bugs as usual. Some more style comments on the tests: for functions that return Bools and sound like "a test" already (insorted), you don't need to compare to true or false, but can simply do @test insorted(...) and @test !insorted(...), resp.

test/sorting.jl Outdated Show resolved Hide resolved
@remi-garcia
Copy link
Contributor Author

Thanks for the correction and suggestions. I just understood that I had to click on view all 17373 lines to see something that I understand in CI tests Details... Whoops.

@@ -1063,7 +1063,8 @@ splat(f) = args->f(args...)
in(x)

Create a function that checks whether its argument is [`in`](@ref) `x`, i.e.
a function equivalent to `y -> y in x`.
a function equivalent to `y -> y in x`. See also [`insorted`](@ref) for the use
in sorted collections.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
in sorted collections.
with sorted collections.

?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last comment, maybe a native speaker could quickly help, and then we merge this.

Co-authored-by: Daniel Karrasch <[email protected]>
@dkarrasch dkarrasch merged commit f6d1032 into JuliaLang:master Sep 28, 2020
@StefanKarpinski StefanKarpinski added the triage This should be discussed on a triage call label Sep 28, 2020
@JeffBezanson
Copy link
Sponsor Member

Is there any precedent for this function in other systems? If so, I wonder what it has been called before? We should make sure to use a standard name if one exists.

@remi-garcia
Copy link
Contributor Author

I found binary_search or bsearch. I wanted to be consistent with search-sorted but I should have looked into the std before 😕

@jw3126
Copy link
Contributor

jw3126 commented Sep 29, 2020

I like insorted. I agree, if there is sufficient precedence in other languages, we should adapt out name. binary_search is not a good name for us I think. Since it is neither search (in the sense of searchsorted and friends) nor guaranteed to be "binary" (e.g. it is O(1) for ranges).

@kmsquire
Copy link
Member

There has been some discussion in the past about creating a separate SortedArray type. This would have the benefit of code which uses in (or ) just working, as it does for most other container types, and wouldn't require an additional function.

More generally, a simple Sorted wrapper class could have the same benefits, again without introducing a new function.

xs = sort(ys)  # xs is an array
...
if element in Sorted(xs)
    ...
end

For a more pie-in-the sky idea, it would be nice to have a way of tagging arrays with attributes (like "sorted"):

xs = sort(ys)     # xs is tagged with "sorted"
issorted(xs)      # O(1) if xs is tagged with "sorted"
if element in xs  # O(log N) if xs is tagged with "sorted"
    ...
end

Related: @andyferris's AcceleratedArrays.jl

@jw3126
Copy link
Contributor

jw3126 commented Sep 29, 2020

@goretkin had basically the same suggestion in the issue fixed by this PR

@StefanKarpinski StefanKarpinski removed the triage This should be discussed on a triage call label Oct 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Request: in sorted array
6 participants