Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop LazyArrays dependency in favor of our own CSV.Column2 which we c… #546

Merged
merged 3 commits into from
Dec 13, 2019

Conversation

quinnj
Copy link
Member

@quinnj quinnj commented Dec 12, 2019

…an control with regards to copying and iteration. Fixes #539

…an control with regards to copying and iteration. Fixes #539
@codecov
Copy link

codecov bot commented Dec 12, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@b63e628). Click here to learn what that means.
The diff coverage is 47.05%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master     #546   +/-   ##
=========================================
  Coverage          ?   82.61%           
=========================================
  Files             ?        7           
  Lines             ?     1455           
  Branches          ?        0           
=========================================
  Hits              ?     1202           
  Misses            ?      253           
  Partials          ?        0
Impacted Files Coverage Δ
src/iteration.jl 18% <0%> (ø)
src/utils.jl 75.84% <100%> (ø)
src/tables.jl 78.49% <48.57%> (ø)
src/CSV.jl 84.4% <66.66%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b63e628...fe037b4. Read the comment docs.

@bkamins
Copy link
Member

bkamins commented Dec 12, 2019

@tpapp - could you please check your example from https://discourse.julialang.org/t/dataframe-join-error/32141/7 with this change and find out if it fixes the issue (if this is the case the original problem is most probably caused by LazyArrays.jl and an issue should be submitted there). Thank you!

@tpapp
Copy link

tpapp commented Dec 12, 2019

Using this branch, I get a

free(): corrupted unsorted chunks

signal (6): Aborted
in expression starting at REPL[5]:1

error while doing a DataFrame(CSV.File("some_path")). Package statuses (for CSV):

  [336ed68f] CSV v0.5.18 #jq/539 (https://github.com/JuliaData/CSV.jl.git)
  [324d7699] CategoricalArrays v0.7.4
  [a93c6f00] DataFrames v0.20.0
  [48062228] FilePathsBase v0.7.0
  [69de0a69] Parsers v0.3.10
  [2dfb63ee] PooledArrays v0.5.2
  [bd369af6] Tables v0.2.11
  [ea10d353] WeakRefStrings v0.6.1
  [ade2ca70] Dates 
  [a63ad114] Mmap 
  [4ec0a83e] Unicode 

Julia v"1.3.1-pre.0".

I have the tests set up for this now in a separate branch, so it is really easy for me to test new versions, just let me know.

@bkamins
Copy link
Member

bkamins commented Dec 12, 2019

Thank you for following up with this as such tests are much needed to polish things (CSV.jl - although it seems simple - is really complex under the hood from my experience).

@tpapp
Copy link

tpapp commented Dec 12, 2019

I am really sorry that I cannot contribute an MWE data file; I tried making a random one but it fails to exhibit the original problem. FWIW, the two dataframes I am merging have sizes (2666, 63) and (4990, 25), and have integer and categorical columns, with occasional missing data.

Also, while I appreciate the effort that went into making CSV.jl fast, I find difficult to contribute to it because of its coding style: some methods are very long and complex, which I imagine makes CI difficult.

src/tables.jl Outdated
throw(BoundsError(c, i))
end

function Base.copy(c::Column2{T}) where {T}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this method be used also for Column?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, wouldn't it be faster to loop over c.columns and call copyto! on each? I guess you benchmarked that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that would be faster. Actually, we could basically do a memcpy for most of the types which would be even faster.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK copyto! does that for common array types already. Anyway that's the right level of abstraction for this kind of optimization.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is we're not dealing with copyto! from Vector{T} to Vector{T}, but from Vector{UInt64} to Vector{T}. For Int/Float64, the bits are the same, so we can manually memcpy the pointers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like an optimization that could be applied by Base too.

src/tables.jl Outdated
@@ -29,6 +85,55 @@ function Base.copy(c::Column{T, T}) where {T <: Union{String, Union{String, Miss
return A
end

function Base.copy(c::Column2{T, S}) where {T <: Union{String, Union{String, Missing}}, S <: Union{PooledString, Union{PooledString, Missing}}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: it looks like you could at least factor out a large part of code which is common with Column.

@nalimilan
Copy link
Member

@tpapp Have you tried with --check-bounds=yes?

Regarding the creation of a MWE, maybe you can replace each letter/digit with another, random one? Would that be enough to anonymize the data?

@tpapp
Copy link

tpapp commented Dec 12, 2019

With that flag, I get

> df = DataFrame(CSV.File("clean.csv")))
ERROR: BoundsError: attempt to access 4990-element Array{UInt32,1} at index [4991]

Sorry, even if I randomize on the basis of this data it counts as a "transformation" and I think is explicitly forbidden, so I am cautious about these things.

@nalimilan
Copy link
Member

Interesting. Can you post the full backtrace?

@tpapp
Copy link

tpapp commented Dec 12, 2019

ERROR: BoundsError: attempt to access 4990-element Array{UInt32,1} at index [4991]
Stacktrace:
 [1] macro expansion at ./array.jl:782 [inlined]
 [2] macro expansion at ./simdloop.jl:77 [inlined]
 [3] copy(::CSV.Column2{String,PooledString}) at /home/tamas/.julia/packages/CSV/RykhE/src/tables.jl:100
 [4] (::DataFrames.var"#DataFrame#99#102")(::Bool, ::Type{DataFrame}, ::Array{AbstractArray{T,1} where T,1}, ::DataFrames.Index) at /home/tamas/.julia/packages/DataFrames/uPgZV/src/dataframe/dataframe.jl:126
 [5] Type at ./none:0 [inlined]
 [6] #DataFrame#114(::Bool, ::Bool, ::Type{DataFrame}, ::Array{Union{CSV.Column, CSV.Column2},1}, ::Array{Symbol,1}) at /home/tamas/.julia/packages/DataFrames/uPgZV/src/dataframe/dataframe.jl:189
 [7] Type at ./none:0 [inlined]
 [8] #DataFrame#75 at /home/tamas/.julia/packages/CSV/RykhE/src/CSV.jl:1087 [inlined]
 [9] DataFrame(::CSV.File{true}) at /home/tamas/.julia/packages/CSV/RykhE/src/CSV.jl:1087
 [10] top-level scope at REPL[6]:1

@quinnj
Copy link
Member Author

quinnj commented Dec 12, 2019

Ok, just pushed a cleanup/fix commit. @tpapp, would you be willing to try this branch out again?

@tpapp
Copy link

tpapp commented Dec 12, 2019

I get:

julia> df = DataFrame(CSV.File("the_path.csv")))
malloc(): invalid next size (unsorted)

and then I don't get the prompt back, even though the Julia process uses no CPU.

EDIT: this is with --check-bounds=yes.

@quinnj
Copy link
Member Author

quinnj commented Dec 13, 2019

I can replicate a weird malloc-related issue on a very wide dataset. Looking into it

@quinnj
Copy link
Member Author

quinnj commented Dec 13, 2019

Ok, I think I've got it figured out, if you wouldn't mind trying again @tpapp

@tpapp
Copy link

tpapp commented Dec 13, 2019

@quinnj, the latest commit fixes all of my issues.

I appreciate that you guys followed up on this without an MWE.

A = (T == String || T == Union{String, Missing}) ? StringVector{T}(undef, len) : Vector{T}(undef, len)
if T <: Int64 || T <: Float64 || T <: Dates.TimeType
if c isa Column
memcpy!(pointer(A), 1, pointer(c.tape), 1, len * 8)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe unsafe_copy! would work here? There's a unsafe_copyto!(dest::Ptr{T}, src::Ptr{T}, n) method which calls memmove.

@quinnj quinnj merged commit 9ce09db into master Dec 13, 2019
@quinnj quinnj deleted the jq/539 branch December 13, 2019 15:02
@bkamins
Copy link
Member

bkamins commented Dec 13, 2019

Thank you! I think it would be good to have a release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CSV breaks filter!
4 participants