CSV breaks filter! #539
Comments
In-placeish temporary work-around: Edit: work-around doesn't work with bits types. Can just make a new df with
@inline function fixcsv!(df::DataFrame)::DataFrame
    for c ∈ 1:size(df,2)
        df[!,c] = Array(df[!,c])
    end
    return df
end
function mwe()
    df = DataFrame(rand(1000,5))
    # seems to work without the csv usage
    dfnocsv = deepcopy(df)
    filter!(r->r.x1<0.5, dfnocsv)
    println("size dfnocsv: ", size(dfnocsv))
    # sending the df to a CSV first breaks the code
    df |> CSV.write("test.csv")
    dfcsv = CSV.read("test.csv") |> DataFrame |> fixcsv!
    filter!(r->r.x1<0.5, dfcsv)
    println("size dfcsv: ", size(dfcsv))
end
mwe()
Output:
size dfnocsv: (466, 5)
size dfcsv: (466, 5) |
Pass @quinnj I really think we should do something about this, it keeps confusing people. :-) |
I should have mentioned that I had tried |
Are you using the latest CSV release? There was a bug in a recent release. |
Tried it on master, no dice. Edit: Just to be clear, here is the self-contained code I am using to test this (my project code is much longer). I put in copycols=true, although previously (v5.13) it worked without that option.
using DataFrames, CSV
function mwe()
    df = DataFrame(rand(1000,5))
    # seems to work without the csv usage
    dfnocsv = deepcopy(df)
    filter!(r->r.x1<0.5, dfnocsv)
    println("size dfnocsv: ", size(dfnocsv))
    # sending the df to a CSV first breaks the code
    df |> CSV.write("test.csv")
    dfcsv = CSV.read("test.csv", copycols=true) |> DataFrame
    filter!(r->r.x1<0.5, dfcsv)
    println("size dfcsv: ", size(dfcsv))
end
mwe()
|
Mmm, I guess you're using multiple threads? |
Yep, just the default parameters. However I just tried it with Julia set to a single thread and had the same issue. |
Weird. Can you check the type of |
Interesting. With Julia set to a single thread (and confirmed with |
OK, yes, what matters is the type of array you get. With @quinnj I think when |
Yeah the |
I think most people will expect columns of |
I definitely think there needs to be an option to return something mutable. I think it's fine if that's not the default, but right now the problem is that |
Exactly - by default CSV.jl returns immutable columns, but when you pass them to |
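For anyone following along, here is a minimal Base-only sketch of the failure mode being discussed. It makes no claim about the column type CSV.jl actually returned: a `SubArray` stands in for any `AbstractVector` that cannot be resized, and `try_delete_first!` is a hypothetical helper written for this illustration, not part of any package. The assumption is that an in-place row filter ultimately needs something like `deleteat!` on each column, which only works on resizable vectors.

```julia
# Stand-in for a lazily built, non-resizable column. A SubArray is used only
# for illustration; it is not the type CSV.jl returned in this issue.
lazy_col = view(collect(0.1:0.1:1.0), :)

# Hypothetical helper: try to delete the first element in place.
# Returns true on success, false if no deleteat! method exists for the type.
function try_delete_first!(col::AbstractVector)
    try
        deleteat!(col, 1)
        return true
    catch err
        err isa MethodError || rethrow()
        return false
    end
end

println(typeof(lazy_col))                 # inspect the column type first
println(lazy_col isa Vector)              # not a plain Vector
println(try_delete_first!(lazy_col))      # in-place deletion is not supported

# Materializing the column yields a standard, resizable Vector.
plain_col = collect(lazy_col)
println(plain_col isa Vector{Float64})
println(try_delete_first!(plain_col))     # deletion now succeeds
```

Checking `typeof` on a column is a quick way to tell whether it is a plain `Vector`, and `collect` is one way to copy such a column into one.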
I'll be working on fixing this today; sorry for the hassle everyone. I had tried to do quite a bit of robustness testing on LazyArrays before adding it as a dependency, but other people's workflows are hard to predict, and I didn't catch this case that breaks. I have an idea on how to fix things. |
Might it be worth introducing a I'm in favor of keeping the default as whatever is most efficient. |
I don't think that's needed: when |
same problem here. Just updated to Julia 1.3 and now I get |
It seems that LazyArrays.jl interacts badly with DataFrames.jl in general, see https://discourse.julialang.org/t/release-announcements-for-dataframes-jl/18258/86. I do not know that package, but we should resolve it somehow (maybe some fix to DataFrames.jl is also in order - I am not sure here). |
Does it really "interact badly" or is it just that these are immutable? It seems to me that @bkamins is standardizing and clarifying this something that has been discussed for DataFrames 1.0? |
I always thought this is relatively clear:
We have two problems though:
|
Yeah, I'm not too sure if For the latter problem, it sounds like the issue is whether |
I am not saying it should be, but that this is what the code assumes in some places, so we should review it. Note that
And in the case of LazyArrays.jl it does (it produces a standard array from Base). The problem is later with running |
That sounds like something that should not be assumed to work. Could you re-produce the problem and create an issue where you deem appropriate? |
I cannot reproduce it and that is why I am posting it here (as for sure it can be fixed by changing what CSV.jl returns when a copy is requested). I do not know which package causes the problem, i.e. is the problem in:
|
…an control with regards to copying and iteration. Fixes #539
Alright, PR up that drops LazyArrays altogether in favor of a new |
Judging by the error, it probably breaks other things as well. I find filter! useful for in place operations on large DataFrames.
Output: