CSV breaks filter! #539

clintonTE · 2019-11-27T05:14:41Z

Judging by the error, it probably breaks other things as well. I find filter! useful for in place operations on large DataFrames.

using DataFrames, CSV

function mwe()
  df = DataFrame(rand(1000,10))

  #seems to work without the csv usage
  dfnocsv = deepcopy(df)
  filter!(r->r.x1<0.5, dfnocsv)
  println("size dfnocsv: ", size(dfnocsv))

  #sending the df to a CSV first breaks the code
  df |> CSV.write("test.csv")
  dfcsv = CSV.read("test.csv") |> DataFrame
  filter!(r->r.x1<0.5, dfcsv)
  println("size dfcsv: ", size(dfcsv))
end

mwe()

Output:

size dfnocsv: (510, 10)
ERROR: LoadError: MethodError: no method matching deleteat!(::LazyArrays.ApplyArray{Float64,1,typeof(vcat),NTuple{6,Array{Float64,1}}}, ::Array{Int64,1})
Closest candidates are:
  deleteat!(::Array{T,1} where T, ::AbstractArray{T,1} where T) at array.jl:1228
  deleteat!(::Array{T,1} where T, ::Any) at array.jl:1227
  deleteat!(::BitArray{1}, ::Any) at bitarray.jl:940
  ...
Stacktrace:
 [1] (::DataFrames.var"#118#119"{Array{Int64,1}})(::LazyArrays.ApplyArray{Float64,1,typeof(vcat),NTuple{6,Array{Float64,1}}}) at C:\Users\Clinton\.julia\packages\DataFrames\yH0f6\src\dataframe\dataframe.jl:707
 [2] foreach(::DataFrames.var"#118#119"{Array{Int64,1}}, ::Array{AbstractArray{T,1} where T,1}) at .\abstractarray.jl:1920
 [3] deleterows! at C:\Users\Clinton\.julia\packages\DataFrames\yH0f6\src\dataframe\dataframe.jl:707 [inlined]
 [4] filter! at C:\Users\Clinton\.julia\packages\DataFrames\yH0f6\src\abstractdataframe\abstractdataframe.jl:789 [inlined]
 [5] mwe() at C:\Users\Clinton\Dropbox\Projects\LeverageChanges\mwe.jl:14
 [6] top-level scope at C:\Users\Clinton\Dropbox\Projects\LeverageChanges\mwe.jl:19
in expression starting at C:\Users\Clinton\Dropbox\Projects\LeverageChanges\mwe.jl:19

clintonTE · 2019-11-27T21:14:03Z

In-placeish temporary work-around:

Edit: work-around doesn't work with bits types. Can just make a new df with DataFrame(view(dfcsv, :, :))

@inline function fixcsv!(df::DataFrame)::DataFrame
  for c ∈ 1:size(df,2)
    df[!,c] = Array(df[!,c])
  end
  return df
end

function mwe()
  df = DataFrame(rand(1000,5))

  #seems to work without the csv usage
  dfnocsv = deepcopy(df)
  filter!(r->r.x1<0.5, dfnocsv)
  println("size dfnocsv: ", size(dfnocsv))

  #sending the df to a CSV first breaks the code
  df |> CSV.write("test.csv")
  dfcsv = CSV.read("test.csv") |> DataFrame |> fixcsv!

  filter!(r->r.x1<0.5, dfcsv)
  println("size dfcsv: ", size(dfcsv))
end

mwe()

Output:

size dfnocsv: (466, 5)
size dfcsv: (466, 5)

nalimilan · 2019-11-28T13:52:43Z

Pass copycols=true or use DataFrame(CSV.File(...)) instead of CSV.read.

@quinnj I really think we should do something about this, it keeps confusing people. :-)

clintonTE · 2019-11-28T18:02:30Z

I should have mentioned that I had tried copycols=true. Following your post, I tried DataFrame(CSV.File(...)) which also did not work.

nalimilan · 2019-11-28T21:04:01Z

Are you using the latest CSV release? There was a bug in a recent release.

clintonTE · 2019-11-28T21:15:54Z

Tried it on master, no dice.

Edit: Just to be clear, here is the self-contained code I am using to test this (my project code is much longer). I put in copycols=true, although previously (v5.13) it worked without that option.

using DataFrames, CSV

function mwe()
  df = DataFrame(rand(1000,5))

  #seems to work without the csv usage
  dfnocsv = deepcopy(df)
  filter!(r->r.x1<0.5, dfnocsv)
  println("size dfnocsv: ", size(dfnocsv))

  #sending the df to a CSV first breaks the code
  df |> CSV.write("test.csv")
  dfcsv = CSV.read("test.csv", copycols=true) |> DataFrame


  filter!(r->r.x1<0.5, dfcsv)
  println("size dfcsv: ", size(dfcsv))
end

mwe()

nalimilan · 2019-11-28T21:21:54Z

Mmm, I guess you're using multiple threads?

clintonTE · 2019-11-28T21:24:27Z

Yep, just the default parameters. However I just tried it with Julia set to a single thread and had the same issue.

nalimilan · 2019-11-28T21:31:39Z

Weird. Can you check the type of df's columns? If they are ApplyArray, then you are using multiple threads.

clintonTE · 2019-11-28T21:38:06Z

Interesting. With Julia set to a single thread (and confirmed withThreads.nthreads()) it fails. But I tried setting threaded=false, and it works. So irrespective of the number of threads I set at startup, it seems threaded makes or breaks the code. Again, copycols makes no difference either way.

nalimilan · 2019-11-28T21:40:17Z

OK, yes, what matters it the type of array you get. With ApplyArray the failure is expected.

@quinnj I think when copycols=true we'd better use plain Array columns. copy(::ApplyArray) returns another array, which doesn't have any clear advantage, but clear drawbacks. We should probably apply this to the DataFrames constructor too if we can.

clintonTE · 2019-11-28T21:41:28Z

Yeah the threaded option is reflected in the column type (ApplyArray vs Array).

bkamins · 2019-11-28T22:00:39Z

I think most people will expect columns of DataFrame to be resizable by default, so what @nalimilan suggests seems reasonable.

ExpandingMan · 2019-12-03T17:07:38Z

I definitely think there needs to be an option to return something mutable. I think it's fine if that's not the default, but right now the problem is that copycols=true does not even render the dataframe mutable. There definitely needs to be a clear, obvious option for returning mutable objects.

bkamins · 2019-12-03T17:31:23Z

Exactly - by default CSV.jl returns immutable columns, but when you pass them to DataFrame constructor with copycols=true it should change them to mutable ones.

quinnj · 2019-12-03T17:45:00Z

I'll be working on fixing this today; sorry for the hassle everyone. I had tried to do quite a bit of robustness testing on LazyArrays before adding it as a dependency, but other people's workflows are hard to predict, and I didn't catch this case that breaks. I have an idea on how to fix things.

ExpandingMan · 2019-12-03T19:38:35Z

Might it be worth introducing a mutable keyword argument for the sake of clarity?

I'm in favor of keeping the default whatever is moset efficient.

nalimilan · 2019-12-04T08:46:24Z

I don't think that's needed: when copycols=true, we have to make a copy, so better allocate a plain Vector instead of a LazyVector anyway.

JianghuiDu · 2019-12-10T16:03:17Z

same problem here. Just updated to Julia 1.3 and now I get LazyArrays.ApplyArray instead of Array and other packages do not have methods to deal with it.

bkamins · 2019-12-10T19:03:33Z

It seems that LazyArrays.jl interacts badly with DataFrames.jl in general, see https://discourse.julialang.org/t/release-announcements-for-dataframes-jl/18258/86. I do not know that package, but we should resolve it somehow (maybe also some fix to DataFrames.jl is in place - I am not sure here).

ExpandingMan · 2019-12-10T20:00:07Z

Does it really "interact badly" or is it just that these are immutable? It seems to me that DataFrames should work with immutable arrays as much as possible, and there should be a clearly defined subset of functions that fail in the immutable case.

@bkamins is standardizing and clarifying this something that has been discussed for DataFrames 1.0?

bkamins · 2019-12-10T20:30:43Z

I always thought this is relatively clear:

all in-place operations require columns to be mutable
all in-place resizing operations require columns to be resizable
all other operations should work fine on immutable columns

We have two problems though:

in some codes we assumed that copy(col) is the same as col[:] and col[axes(col,1)], which does not seem to be the case always - this should be reviewed and standardized;
the problem reported in https://discourse.julialang.org/t/release-announcements-for-dataframes-jl/18258/86 is more severe - join does not mutate source but uses similar on it and then performs copyto! of a view of a ApplyArray object - and this is the operation that fails/produces random results; most likely this is a problem with LazyArrays.jl, but I am not sure for 100%.

ExpandingMan · 2019-12-10T20:39:02Z

Yeah, I'm not too sure if copy(col) being equivalent to col[:] is supposed to be a rule. The best argument against it would seem to be that in some cases you might want indexing to give you a completely different type of array.

For the latter problem, it sounds like the issue is whether similar should be guaranteed to give a mutable array. It seems to me that this function is only really useful if that is true. This might call for us to make a PR to LazyArrays.

bkamins · 2019-12-10T20:55:24Z

I'm not too sure if copy(col) being equivalent to col[:] is supposed to be a rule. The best argument against it would seem to be that in some cases you might want indexing to give you a completely different type of array.

I am not saying it should be, but that this is what the code assumes in some places, so we should review it. Note that copy also may change the type of an array in general (e.g. when you copy a view).

it sounds like the issue is whether similar should be guaranteed to give a mutable array.

And in the case of LazyArrays.jl it does (it produces a standard array from Base). The problem is later with running copyto! on a view of ApplyArray object I think.

ExpandingMan · 2019-12-10T21:16:15Z

And in the case of LazyArrays.jl it does (it produces a standard array from Base). The problem is later with running copyto! on a view of ApplyArray object I think.

That sounds like something that should not be assumed to work. Could you re-produce the problem and create an issue where you deem appropriate?

bkamins · 2019-12-10T21:58:31Z

I cannot reproduce it and that is why I am posting it here (as for sure it can be fixed by changing what CSV.jl returns when a copy is requested). I do not know which package causes the problem, i.e. is the problem in:

how CSV.jl uses LazyArrays.jl (i.e. it is used incorrectly and this causes a bug)
how LazyArrays.jl implements copyto!
how DataFrames.jl uses the above (this is a least likely reason)

…an control with regards to copying and iteration. Fixes #539

quinnj · 2019-12-12T06:04:26Z

Alright, PR up that drops LazyArrays all together in favor of a new CSV.Column2 that has LazyArrays.Vcat-like capabilities, but implements copy like the existing CSV.Column as DataFrames expects. This means that copycols=true with CSV.read will return plain, mutable vectors as before.

#546) * Drop LazyArrays dependency in favor of our own CSV.Column2 which we can control with regards to copying and iteration. Fixes #539 * Fixup copy for Column2; remove some duplicated code; add some fast-paths * Fixes

nalimilan mentioned this issue Dec 1, 2019

Problems with threads and LazyArrays #536

Closed

quinnj added a commit that referenced this issue Dec 12, 2019

Drop LazyArrays dependency in favor of our own CSV.Column2 which we c…

e32e8e6

…an control with regards to copying and iteration. Fixes #539

quinnj mentioned this issue Dec 12, 2019

Drop LazyArrays dependency in favor of our own CSV.Column2 which we c… #546

Merged

quinnj closed this as completed in #546 Dec 13, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CSV breaks filter! #539

CSV breaks filter! #539

clintonTE commented Nov 27, 2019

clintonTE commented Nov 27, 2019 •

edited

Loading

nalimilan commented Nov 28, 2019

clintonTE commented Nov 28, 2019 •

edited

Loading

nalimilan commented Nov 28, 2019

clintonTE commented Nov 28, 2019 •

edited

Loading

nalimilan commented Nov 28, 2019

clintonTE commented Nov 28, 2019 •

edited

Loading

nalimilan commented Nov 28, 2019

clintonTE commented Nov 28, 2019

nalimilan commented Nov 28, 2019

clintonTE commented Nov 28, 2019

bkamins commented Nov 28, 2019

ExpandingMan commented Dec 3, 2019

bkamins commented Dec 3, 2019

quinnj commented Dec 3, 2019

ExpandingMan commented Dec 3, 2019

nalimilan commented Dec 4, 2019

JianghuiDu commented Dec 10, 2019

bkamins commented Dec 10, 2019

ExpandingMan commented Dec 10, 2019

bkamins commented Dec 10, 2019

ExpandingMan commented Dec 10, 2019

bkamins commented Dec 10, 2019

ExpandingMan commented Dec 10, 2019

bkamins commented Dec 10, 2019

quinnj commented Dec 12, 2019

CSV breaks filter! #539

CSV breaks filter! #539

Comments

clintonTE commented Nov 27, 2019

clintonTE commented Nov 27, 2019 • edited Loading

nalimilan commented Nov 28, 2019

clintonTE commented Nov 28, 2019 • edited Loading

nalimilan commented Nov 28, 2019

clintonTE commented Nov 28, 2019 • edited Loading

nalimilan commented Nov 28, 2019

clintonTE commented Nov 28, 2019 • edited Loading

nalimilan commented Nov 28, 2019

clintonTE commented Nov 28, 2019

nalimilan commented Nov 28, 2019

clintonTE commented Nov 28, 2019

bkamins commented Nov 28, 2019

ExpandingMan commented Dec 3, 2019

bkamins commented Dec 3, 2019

quinnj commented Dec 3, 2019

ExpandingMan commented Dec 3, 2019

nalimilan commented Dec 4, 2019

JianghuiDu commented Dec 10, 2019

bkamins commented Dec 10, 2019

ExpandingMan commented Dec 10, 2019

bkamins commented Dec 10, 2019

ExpandingMan commented Dec 10, 2019

bkamins commented Dec 10, 2019

ExpandingMan commented Dec 10, 2019

bkamins commented Dec 10, 2019

quinnj commented Dec 12, 2019

clintonTE commented Nov 27, 2019 •

edited

Loading

clintonTE commented Nov 28, 2019 •

edited

Loading

clintonTE commented Nov 28, 2019 •

edited

Loading

clintonTE commented Nov 28, 2019 •

edited

Loading