Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add view to filter, sort, dropmissing, and unique #2386

Merged
merged 19 commits into from
Sep 9, 2020

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Aug 27, 2020

This is a non-breaking proposal to add an option to filter, sort, dropmissing, and unique to return a view instead of a copy. This is relevant for very large data frames on which these operations can easily eat up all memory. It is also faster than doing a copy (of course if it is worth doing it depends on what one would want to do with the data frame later).

The drawback is that we become type unstable, but this is a case when we have a small union (DataFrame or SubDataFrame) so compiler should handle it. An alternative with Val seemed to be an overkill (though it would be type stable then).

Now I just have written an implementation and proposed docstring updates. If we think this PR is OK I will add tests and update NEWS.md.

@bkamins bkamins added feature performance non-breaking The proposed change is not breaking labels Aug 27, 2020
@bkamins bkamins added this to the 1.0 milestone Aug 27, 2020
@bkamins
Copy link
Member Author

bkamins commented Sep 1, 2020

@nalimilan - no rush, but do you have any opinion on this (this is the first step in deciding what we do with updating of views later)

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a good idea, and I don't see what else this could mean so it's quite safe to add. It's also nice to see that the implementation is quite clean.

Regarding type instability, ideally the compiler will use the fact that view=true is known at compile time (in particular as the default value for a keyword argument) to infer the type as ::DataFrame. I checked that recently for categorical's compress=false argument and it worked quite well. Can you check that it's the case? In some large functions, it could help to move some of the code into helpers, so that the top-level function in which if view appears can be inlined.

src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
Co-authored-by: Milan Bouchet-Valat <[email protected]>
@bkamins
Copy link
Member Author

bkamins commented Sep 3, 2020

Actually for categorical we have that categorical([1,2,3]) is correctly inferred, but for categorical([1,2,3], compress=true) and categorical([1,2,3], compress=false) the return type is inferred as Any.

@nalimilan
Copy link
Member

OK, that's not great, but I guess the compiler cannot do better in that case since there are many possible return types. For view there are only two (concrete) return types. Anyway what matters the most is that when you don't pass view you don't suffer from this.

@bkamins
Copy link
Member Author

bkamins commented Sep 3, 2020

Regarding view keyword argument I will investigate and let you know. I to not think it is a huge problem (it is a small union and DataFrame is type unstable anyway).

We actually have the same issue with select/transfrom/combine where we have ungroup keyword argument.

src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Sep 3, 2020

Regarding view keyword argument I will investigate and let you know.

It seems that @inline solves the instability thanks to JuliaLang/julia#35976. I will write appropriate tests tomorrow.

@bkamins
Copy link
Member Author

bkamins commented Sep 4, 2020

OK, with @inline on Julia nightly we have a similar situation as for categorical:

julia> df = DataFrame(rand(2,3));

julia> @code_warntype sort(df)
Variables
  #self#::Core.Const(sort, false)
  df::DataFrame

Body::DataFrame
1 ─      nothing
│   %2 = Base.vect()::Vector{Any}
│   %3 = (#self#)(df, %2)::DataFrame
└──      return %3

julia> @code_warntype sort(df, view=true)
Variables
  #s211::Core.Const(Base.var"#sort##kw"(), false)
  @_2::NamedTuple{(:view,),Tuple{Bool}}
  @_3::Core.Const(sort, false)
  df::DataFrame

Body::Union{DataFrame, SubDataFrame{DataFrame,DataFrames.Index,Vector{Int64}}}
1 ─ %1 = Base.vect()::Vector{Any}
│   %2 = (#s211)(@_2, @_3, df, %1)::Union{DataFrame, SubDataFrame{DataFrame,DataFrames.Index,Vector{Int64}}}
└──      return %2

(that is - sometimes we have a proper inference and sometimes we do not unfortunately)

@bkamins bkamins added the bug label Sep 4, 2020
@bkamins
Copy link
Member Author

bkamins commented Sep 4, 2020

I have added tests (and caught and old bug in filter with SubDataFrames which is fixed now)

@bkamins
Copy link
Member Author

bkamins commented Sep 4, 2020

@nalimilan - this should be good to have a look at. Thank you!

src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
test/data.jl Outdated
Comment on lines 187 to 191
@test fun(view(df, 1:2, 1:2)) isa DataFrame
@test fun(df, view=false) isa DataFrame
@test fun(view(df, 1:2, 1:2), view=false) isa DataFrame
@test fun(df, view=true) isa SubDataFrame
@test fun(view(df, 1:2, 1:2), view=true) isa SubDataFrame
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also test returned value? Or is that covered elsewhere?

It would also be nice to use @inferred when view=true/false isn't specified to prevent any regression: it would be easy to remove one of the @inlined without realizing why they are here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the test and @inferred (I thought it would fail on Julia 1.0, but it passes - which is good)

@nalimilan
Copy link
Member

nalimilan commented Sep 5, 2020

(that is - sometimes we have a proper inference and sometimes we do not unfortunately)

It's not just "sometimes", right? If you specify view, the type of the result is a Union, and otherwise it's a concrete type (including if you specify other keyword arguments but not view), right? At least that means this PR doesn't introduce any regression. It's not ideal that view=true implies a type instability, but that shouldn't be too bad and hopefully the compiler will improve in the future.

@bkamins
Copy link
Member Author

bkamins commented Sep 5, 2020

and otherwise it's a concrete type (including if you specify other keyword arguments but not view)

I had to do some refactoring of filter, but after these changes the answer is "yes".


@inline function _filter_helper(df::AbstractDataFrame, f, cols...; view::Bool)
@inline function Base.filter(f, df::AbstractDataFrame; view::Bool=false)
rowidxs::BitVector = _filter_helper(f, eachrow(df))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's weird that inference fails here. Anyway, maybe better put the type assertion directly in _filter_helper itself?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I have tried this and this was crashing my Julia earlier. But now it seems to work, so probably I had some weird problem earlier. I pushed the fix for this. Also I think it is not a problem to add this conversion and assertion as it is no-op and at least we signal what we intend to return.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Do we have tests to catch an inference failure if somebody removed the assertion?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I remember why I did it the other way - it was because nightly fails this way, see e.g. https://travis-ci.org/github/JuliaData/DataFrames.jl/jobs/725236829 (I have not tested this in 100% detail to nail down the reason as I have a lot of classes this week 😩).

@JeffBezanson + @Keno : this is a regression on nightly vs Julia 1.5.1 - some problems with type inference. Please let me know if you need more information to investigate.

Use this branch on Julia nightly and run the following to reproduce (interestingly - a correct result is produced, but an error is thrown in the process):

julia> using DataFrames

julia> df = DataFrame(rand(10,1))
10×1 DataFrame
│ Row │ x1        │
│     │ Float64   │
├─────┼───────────┤
│ 1   │ 0.389567  │
│ 2   │ 0.0901768 │
│ 3   │ 0.179161  │
│ 4   │ 0.0274051 │
│ 5   │ 0.696103  │
│ 6   │ 0.321395  │
│ 7   │ 0.476313  │
│ 8   │ 0.628184  │
│ 9   │ 0.610729  │
│ 10  │ 0.048924  │

julia> filter(:x1 => >(0.5), df)
Internal error: encountered unexpected error in runtime:
BoundsError(a=Array{UInt64, (68,)}[0x0000000000000001, 0x0000000000000002, 0x0000000000000000, 0x0000000000000003, 0x0000000000000004, 0x0000000000000005, 0x0000000000000006, 0x0000000000000007, 0x0000000000000008, 0x0000000000000009, 0x000000000000000a, 0x000000000000000b, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x000000000000000c, 0x000000000000000d, 0x000000000000000e, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x000000000000000f, 0x0000000000000010, 0x0000000000000011, 0x0000000000000012, 0x0000000000000013, 0x0000000000000000, 0x0000000000000014, 0x0000000000000015, 0x0000000000000016, 0x0000000000000017, 0x0000000000000000, 0x0000000000000018, 0x0000000000000019, 0x000000000000001a, 0x000000000000001b, 0x0000000000000000, 0x000000000000001c, 0x000000000000001d, 0x000000000000001e, 0x000000000000001f, 0x0000000000000020, 0x0000000000000021, 0x0000000000000022, 0x0000000000000023], i=(69,))
jl_bounds_error_ints at /buildworker/worker/package_linux64/build/src/rtutils.c:183
getindex at ./array.jl:809 [inlined]
getindex at ./abstractarray.jl:1122 [inlined]
DFS at ./compiler/ssair/domtree.jl:197
SNCA at ./compiler/ssair/domtree.jl:269
construct_domtree at ./compiler/ssair/domtree.jl:121
run_passes at ./compiler/ssair/driver.jl:131
optimize at ./compiler/optimize.jl:172
typeinf at ./compiler/typeinfer.jl:32
typeinf_edge at ./compiler/typeinfer.jl:522
abstract_call_method at ./compiler/abstractinterpretation.jl:453
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:129
abstract_call_known at ./compiler/abstractinterpretation.jl:991
abstract_call at ./compiler/abstractinterpretation.jl:1014
abstract_call at ./compiler/abstractinterpretation.jl:998
abstract_eval_statement at ./compiler/abstractinterpretation.jl:1119
typeinf_local at ./compiler/abstractinterpretation.jl:1375
typeinf_nocycle at ./compiler/abstractinterpretation.jl:1431
typeinf at ./compiler/typeinfer.jl:12
typeinf_ext at ./compiler/typeinfer.jl:609
typeinf_ext_toplevel at ./compiler/typeinfer.jl:642
typeinf_ext_toplevel at ./compiler/typeinfer.jl:638
jfptr_typeinf_ext_toplevel_10082.clone_1 at /home/bkamins/Downloads/julia-latest-linux64/julia-371bfa89d4/lib/julia/sys.so (unknown line)
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1750 [inlined]
jl_type_infer at /buildworker/worker/package_linux64/build/src/gf.c:300
jl_generate_fptr at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:293
jl_compile_method_internal at /buildworker/worker/package_linux64/build/src/gf.c:1890
jl_compile_method_internal at /buildworker/worker/package_linux64/build/src/gf.c:1841 [inlined]
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2145 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2336
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1750 [inlined]
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:117
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:206
eval_stmt_value at /buildworker/worker/package_linux64/build/src/interpreter.c:157 [inlined]
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:551
jl_interpret_toplevel_thunk at /buildworker/worker/package_linux64/build/src/interpreter.c:659
top-level scope at REPL[15]:1
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:838
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:788
jl_toplevel_eval_in at /buildworker/worker/package_linux64/build/src/toplevel.c:881
eval at ./boot.jl:344
eval_user_input at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:134
repl_backend_loop at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:195
start_repl_backend at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:180
#run_repl#41 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:311
run_repl at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:299
#844 at ./client.jl:386
jfptr_YY.844_30029.clone_1 at /home/bkamins/Downloads/julia-latest-linux64/julia-371bfa89d4/lib/julia/sys.so (unknown line)
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1750 [inlined]
do_apply at /buildworker/worker/package_linux64/build/src/builtins.c:655
jl_f__apply_latest at /buildworker/worker/package_linux64/build/src/builtins.c:705
#invokelatest#2 at ./essentials.jl:718 [inlined]
invokelatest at ./essentials.jl:717 [inlined]
run_main_repl at ./client.jl:371
exec_options at ./client.jl:301
_start at ./client.jl:484
jfptr__start_22499.clone_1 at /home/bkamins/Downloads/julia-latest-linux64/julia-371bfa89d4/lib/julia/sys.so (unknown line)
jl_apply at /buildworker/worker/package_linux64/build/ui/../src/julia.h:1750 [inlined]
true_main at /buildworker/worker/package_linux64/build/ui/repl.c:106
main at /buildworker/worker/package_linux64/build/ui/repl.c:227
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_start at /home/bkamins/Downloads/julia-latest-linux64/julia-371bfa89d4/bin/julia (unknown line)
3×1 DataFrame
│ Row │ x1       │
│     │ Float64  │
├─────┼──────────┤
│ 1   │ 0.696103 │
│ 2   │ 0.628184 │
│ 3   │ 0.610729 │

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have tests to catch an inference failure if somebody removed the assertion?

Yes - tests will catch this then.

test/data.jl Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Sep 8, 2020

Thank you for approving, but I think let us wait if we get a feedback from core devs on nightly issues.

@bkamins
Copy link
Member Author

bkamins commented Sep 9, 2020

@nalimilan - added NEWS.md entry. Can you have a quick look please before merging?

@bkamins bkamins merged commit 63b5025 into JuliaData:master Sep 9, 2020
@bkamins bkamins deleted the filter_view branch September 9, 2020 08:51
@bkamins
Copy link
Member Author

bkamins commented Sep 9, 2020

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug feature non-breaking The proposed change is not breaking performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants