-
Notifications
You must be signed in to change notification settings - Fork 140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Drop LazyArrays dependency in favor of our own CSV.Column2 which we c… #546
Conversation
…an control with regards to copying and iteration. Fixes #539
Codecov Report
@@ Coverage Diff @@
## master #546 +/- ##
=========================================
Coverage ? 82.61%
=========================================
Files ? 7
Lines ? 1455
Branches ? 0
=========================================
Hits ? 1202
Misses ? 253
Partials ? 0
Continue to review full report at Codecov.
|
@tpapp - could you please check your example from https://discourse.julialang.org/t/dataframe-join-error/32141/7 with this change and find out if it fixes the issue (if this is the case the original problem is most probably caused by LazyArrays.jl and an issue should be submitted there). Thank you! |
Using this branch, I get a free(): corrupted unsorted chunks
signal (6): Aborted
in expression starting at REPL[5]:1 error while doing a [336ed68f] CSV v0.5.18 #jq/539 (https://github.com/JuliaData/CSV.jl.git)
[324d7699] CategoricalArrays v0.7.4
[a93c6f00] DataFrames v0.20.0
[48062228] FilePathsBase v0.7.0
[69de0a69] Parsers v0.3.10
[2dfb63ee] PooledArrays v0.5.2
[bd369af6] Tables v0.2.11
[ea10d353] WeakRefStrings v0.6.1
[ade2ca70] Dates
[a63ad114] Mmap
[4ec0a83e] Unicode Julia I have the tests set up for this now in a separate branch, so it is really easy for me to test new versions, just let me know. |
Thank you for following up with this as such tests are much needed to polish things (CSV.jl - although it seems simple - is really complex under the hood from my experience). |
I am really sorry that I cannot contribute an MWE data file; I tried making a random one but it fails to exhibit the original problem. FWIW, the two dataframes I am merging have sizes Also, while I appreciate the effort that went into making CSV.jl fast, I find difficult to contribute to it because of its coding style: some methods are very long and complex, which I imagine makes CI difficult. |
src/tables.jl
Outdated
throw(BoundsError(c, i)) | ||
end | ||
|
||
function Base.copy(c::Column2{T}) where {T} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't this method be used also for Column
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, wouldn't it be faster to loop over c.columns
and call copyto!
on each? I guess you benchmarked that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that would be faster. Actually, we could basically do a memcpy
for most of the types which would be even faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK copyto!
does that for common array types already. Anyway that's the right level of abstraction for this kind of optimization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is we're not dealing with copyto!
from Vector{T}
to Vector{T}
, but from Vector{UInt64}
to Vector{T}
. For Int/Float64, the bits are the same, so we can manually memcpy
the pointers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds like an optimization that could be applied by Base too.
src/tables.jl
Outdated
@@ -29,6 +85,55 @@ function Base.copy(c::Column{T, T}) where {T <: Union{String, Union{String, Miss | |||
return A | |||
end | |||
|
|||
function Base.copy(c::Column2{T, S}) where {T <: Union{String, Union{String, Missing}}, S <: Union{PooledString, Union{PooledString, Missing}}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here: it looks like you could at least factor out a large part of code which is common with Column
.
@tpapp Have you tried with Regarding the creation of a MWE, maybe you can replace each letter/digit with another, random one? Would that be enough to anonymize the data? |
With that flag, I get > df = DataFrame(CSV.File("clean.csv")))
ERROR: BoundsError: attempt to access 4990-element Array{UInt32,1} at index [4991] Sorry, even if I randomize on the basis of this data it counts as a "transformation" and I think is explicitly forbidden, so I am cautious about these things. |
Interesting. Can you post the full backtrace? |
ERROR: BoundsError: attempt to access 4990-element Array{UInt32,1} at index [4991]
Stacktrace:
[1] macro expansion at ./array.jl:782 [inlined]
[2] macro expansion at ./simdloop.jl:77 [inlined]
[3] copy(::CSV.Column2{String,PooledString}) at /home/tamas/.julia/packages/CSV/RykhE/src/tables.jl:100
[4] (::DataFrames.var"#DataFrame#99#102")(::Bool, ::Type{DataFrame}, ::Array{AbstractArray{T,1} where T,1}, ::DataFrames.Index) at /home/tamas/.julia/packages/DataFrames/uPgZV/src/dataframe/dataframe.jl:126
[5] Type at ./none:0 [inlined]
[6] #DataFrame#114(::Bool, ::Bool, ::Type{DataFrame}, ::Array{Union{CSV.Column, CSV.Column2},1}, ::Array{Symbol,1}) at /home/tamas/.julia/packages/DataFrames/uPgZV/src/dataframe/dataframe.jl:189
[7] Type at ./none:0 [inlined]
[8] #DataFrame#75 at /home/tamas/.julia/packages/CSV/RykhE/src/CSV.jl:1087 [inlined]
[9] DataFrame(::CSV.File{true}) at /home/tamas/.julia/packages/CSV/RykhE/src/CSV.jl:1087
[10] top-level scope at REPL[6]:1 |
Ok, just pushed a cleanup/fix commit. @tpapp, would you be willing to try this branch out again? |
I get: julia> df = DataFrame(CSV.File("the_path.csv")))
malloc(): invalid next size (unsorted) and then I don't get the prompt back, even though the Julia process uses no CPU. EDIT: this is with |
I can replicate a weird malloc-related issue on a very wide dataset. Looking into it |
Ok, I think I've got it figured out, if you wouldn't mind trying again @tpapp |
@quinnj, the latest commit fixes all of my issues. I appreciate that you guys followed up on this without an MWE. |
A = (T == String || T == Union{String, Missing}) ? StringVector{T}(undef, len) : Vector{T}(undef, len) | ||
if T <: Int64 || T <: Float64 || T <: Dates.TimeType | ||
if c isa Column | ||
memcpy!(pointer(A), 1, pointer(c.tape), 1, len * 8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe unsafe_copy!
would work here? There's a unsafe_copyto!(dest::Ptr{T}, src::Ptr{T}, n)
method which calls memmove
.
Thank you! I think it would be good to have a release. |
…an control with regards to copying and iteration. Fixes #539