Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV breaks filter! #539

Closed
clintonTE opened this issue Nov 27, 2019 · 26 comments · Fixed by #546
Closed

CSV breaks filter! #539

clintonTE opened this issue Nov 27, 2019 · 26 comments · Fixed by #546

Comments

@clintonTE
Copy link

Judging by the error, it probably breaks other things as well. I find filter! useful for in place operations on large DataFrames.

using DataFrames, CSV

function mwe()
  df = DataFrame(rand(1000,10))

  #seems to work without the csv usage
  dfnocsv = deepcopy(df)
  filter!(r->r.x1<0.5, dfnocsv)
  println("size dfnocsv: ", size(dfnocsv))

  #sending the df to a CSV first breaks the code
  df |> CSV.write("test.csv")
  dfcsv = CSV.read("test.csv") |> DataFrame
  filter!(r->r.x1<0.5, dfcsv)
  println("size dfcsv: ", size(dfcsv))
end

mwe()

Output:

size dfnocsv: (510, 10)
ERROR: LoadError: MethodError: no method matching deleteat!(::LazyArrays.ApplyArray{Float64,1,typeof(vcat),NTuple{6,Array{Float64,1}}}, ::Array{Int64,1})
Closest candidates are:
  deleteat!(::Array{T,1} where T, ::AbstractArray{T,1} where T) at array.jl:1228
  deleteat!(::Array{T,1} where T, ::Any) at array.jl:1227
  deleteat!(::BitArray{1}, ::Any) at bitarray.jl:940
  ...
Stacktrace:
 [1] (::DataFrames.var"#118#119"{Array{Int64,1}})(::LazyArrays.ApplyArray{Float64,1,typeof(vcat),NTuple{6,Array{Float64,1}}}) at C:\Users\Clinton\.julia\packages\DataFrames\yH0f6\src\dataframe\dataframe.jl:707
 [2] foreach(::DataFrames.var"#118#119"{Array{Int64,1}}, ::Array{AbstractArray{T,1} where T,1}) at .\abstractarray.jl:1920
 [3] deleterows! at C:\Users\Clinton\.julia\packages\DataFrames\yH0f6\src\dataframe\dataframe.jl:707 [inlined]
 [4] filter! at C:\Users\Clinton\.julia\packages\DataFrames\yH0f6\src\abstractdataframe\abstractdataframe.jl:789 [inlined]
 [5] mwe() at C:\Users\Clinton\Dropbox\Projects\LeverageChanges\mwe.jl:14
 [6] top-level scope at C:\Users\Clinton\Dropbox\Projects\LeverageChanges\mwe.jl:19
in expression starting at C:\Users\Clinton\Dropbox\Projects\LeverageChanges\mwe.jl:19
@clintonTE
Copy link
Author

clintonTE commented Nov 27, 2019

In-placeish temporary work-around:

Edit: work-around doesn't work with bits types. Can just make a new df with DataFrame(view(dfcsv, :, :))

@inline function fixcsv!(df::DataFrame)::DataFrame
  for c  1:size(df,2)
    df[!,c] = Array(df[!,c])
  end
  return df
end

function mwe()
  df = DataFrame(rand(1000,5))

  #seems to work without the csv usage
  dfnocsv = deepcopy(df)
  filter!(r->r.x1<0.5, dfnocsv)
  println("size dfnocsv: ", size(dfnocsv))

  #sending the df to a CSV first breaks the code
  df |> CSV.write("test.csv")
  dfcsv = CSV.read("test.csv") |> DataFrame |> fixcsv!

  filter!(r->r.x1<0.5, dfcsv)
  println("size dfcsv: ", size(dfcsv))
end

mwe()

Output:

size dfnocsv: (466, 5)
size dfcsv: (466, 5)

@nalimilan
Copy link
Member

Pass copycols=true or use DataFrame(CSV.File(...)) instead of CSV.read.

@quinnj I really think we should do something about this, it keeps confusing people. :-)

@clintonTE
Copy link
Author

clintonTE commented Nov 28, 2019

I should have mentioned that I had tried copycols=true. Following your post, I tried DataFrame(CSV.File(...)) which also did not work.

@nalimilan
Copy link
Member

Are you using the latest CSV release? There was a bug in a recent release.

@clintonTE
Copy link
Author

clintonTE commented Nov 28, 2019

Tried it on master, no dice.

Edit: Just to be clear, here is the self-contained code I am using to test this (my project code is much longer). I put in copycols=true, although previously (v5.13) it worked without that option.

using DataFrames, CSV

function mwe()
  df = DataFrame(rand(1000,5))

  #seems to work without the csv usage
  dfnocsv = deepcopy(df)
  filter!(r->r.x1<0.5, dfnocsv)
  println("size dfnocsv: ", size(dfnocsv))

  #sending the df to a CSV first breaks the code
  df |> CSV.write("test.csv")
  dfcsv = CSV.read("test.csv", copycols=true) |> DataFrame


  filter!(r->r.x1<0.5, dfcsv)
  println("size dfcsv: ", size(dfcsv))
end

mwe()

@nalimilan
Copy link
Member

Mmm, I guess you're using multiple threads?

@clintonTE
Copy link
Author

clintonTE commented Nov 28, 2019

Yep, just the default parameters. However I just tried it with Julia set to a single thread and had the same issue.

@nalimilan
Copy link
Member

Weird. Can you check the type of df's columns? If they are ApplyArray, then you are using multiple threads.

@clintonTE
Copy link
Author

Interesting. With Julia set to a single thread (and confirmed withThreads.nthreads()) it fails. But I tried setting threaded=false, and it works. So irrespective of the number of threads I set at startup, it seems threaded makes or breaks the code. Again, copycols makes no difference either way.

@nalimilan
Copy link
Member

OK, yes, what matters it the type of array you get. With ApplyArray the failure is expected.

@quinnj I think when copycols=true we'd better use plain Array columns. copy(::ApplyArray) returns another array, which doesn't have any clear advantage, but clear drawbacks. We should probably apply this to the DataFrames constructor too if we can.

@clintonTE
Copy link
Author

Yeah the threaded option is reflected in the column type (ApplyArray vs Array).

@bkamins
Copy link
Member

bkamins commented Nov 28, 2019

I think most people will expect columns of DataFrame to be resizable by default, so what @nalimilan suggests seems reasonable.

@ExpandingMan
Copy link
Contributor

I definitely think there needs to be an option to return something mutable. I think it's fine if that's not the default, but right now the problem is that copycols=true does not even render the dataframe mutable. There definitely needs to be a clear, obvious option for returning mutable objects.

@bkamins
Copy link
Member

bkamins commented Dec 3, 2019

Exactly - by default CSV.jl returns immutable columns, but when you pass them to DataFrame constructor with copycols=true it should change them to mutable ones.

@quinnj
Copy link
Member

quinnj commented Dec 3, 2019

I'll be working on fixing this today; sorry for the hassle everyone. I had tried to do quite a bit of robustness testing on LazyArrays before adding it as a dependency, but other people's workflows are hard to predict, and I didn't catch this case that breaks. I have an idea on how to fix things.

@ExpandingMan
Copy link
Contributor

Might it be worth introducing a mutable keyword argument for the sake of clarity?

I'm in favor of keeping the default whatever is moset efficient.

@nalimilan
Copy link
Member

I don't think that's needed: when copycols=true, we have to make a copy, so better allocate a plain Vector instead of a LazyVector anyway.

@JianghuiDu
Copy link

same problem here. Just updated to Julia 1.3 and now I get LazyArrays.ApplyArray instead of Array and other packages do not have methods to deal with it.

@bkamins
Copy link
Member

bkamins commented Dec 10, 2019

It seems that LazyArrays.jl interacts badly with DataFrames.jl in general, see https://discourse.julialang.org/t/release-announcements-for-dataframes-jl/18258/86. I do not know that package, but we should resolve it somehow (maybe also some fix to DataFrames.jl is in place - I am not sure here).

@ExpandingMan
Copy link
Contributor

Does it really "interact badly" or is it just that these are immutable? It seems to me that DataFrames should work with immutable arrays as much as possible, and there should be a clearly defined subset of functions that fail in the immutable case.

@bkamins is standardizing and clarifying this something that has been discussed for DataFrames 1.0?

@bkamins
Copy link
Member

bkamins commented Dec 10, 2019

I always thought this is relatively clear:

  • all in-place operations require columns to be mutable
  • all in-place resizing operations require columns to be resizable
  • all other operations should work fine on immutable columns

We have two problems though:

  • in some codes we assumed that copy(col) is the same as col[:] and col[axes(col,1)], which does not seem to be the case always - this should be reviewed and standardized;
  • the problem reported in https://discourse.julialang.org/t/release-announcements-for-dataframes-jl/18258/86 is more severe - join does not mutate source but uses similar on it and then performs copyto! of a view of a ApplyArray object - and this is the operation that fails/produces random results; most likely this is a problem with LazyArrays.jl, but I am not sure for 100%.

@ExpandingMan
Copy link
Contributor

Yeah, I'm not too sure if copy(col) being equivalent to col[:] is supposed to be a rule. The best argument against it would seem to be that in some cases you might want indexing to give you a completely different type of array.

For the latter problem, it sounds like the issue is whether similar should be guaranteed to give a mutable array. It seems to me that this function is only really useful if that is true. This might call for us to make a PR to LazyArrays.

@bkamins
Copy link
Member

bkamins commented Dec 10, 2019

I'm not too sure if copy(col) being equivalent to col[:] is supposed to be a rule. The best argument against it would seem to be that in some cases you might want indexing to give you a completely different type of array.

I am not saying it should be, but that this is what the code assumes in some places, so we should review it. Note that copy also may change the type of an array in general (e.g. when you copy a view).

it sounds like the issue is whether similar should be guaranteed to give a mutable array.

And in the case of LazyArrays.jl it does (it produces a standard array from Base). The problem is later with running copyto! on a view of ApplyArray object I think.

@ExpandingMan
Copy link
Contributor

And in the case of LazyArrays.jl it does (it produces a standard array from Base). The problem is later with running copyto! on a view of ApplyArray object I think.

That sounds like something that should not be assumed to work. Could you re-produce the problem and create an issue where you deem appropriate?

@bkamins
Copy link
Member

bkamins commented Dec 10, 2019

I cannot reproduce it and that is why I am posting it here (as for sure it can be fixed by changing what CSV.jl returns when a copy is requested). I do not know which package causes the problem, i.e. is the problem in:

  • how CSV.jl uses LazyArrays.jl (i.e. it is used incorrectly and this causes a bug)
  • how LazyArrays.jl implements copyto!
  • how DataFrames.jl uses the above (this is a least likely reason)

@quinnj
Copy link
Member

quinnj commented Dec 12, 2019

Alright, PR up that drops LazyArrays all together in favor of a new CSV.Column2 that has LazyArrays.Vcat-like capabilities, but implements copy like the existing CSV.Column as DataFrames expects. This means that copycols=true with CSV.read will return plain, mutable vectors as before.

quinnj added a commit that referenced this issue Dec 13, 2019
#546)

* Drop LazyArrays dependency in favor of our own CSV.Column2 which we can control with regards to copying and iteration. Fixes #539

* Fixup copy for Column2; remove some duplicated code; add some fast-paths

* Fixes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants