Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow Any type in column #1027

Open
bkamins opened this issue Sep 24, 2022 · 14 comments
Open

Allow Any type in column #1027

bkamins opened this issue Sep 24, 2022 · 14 comments

Comments

@bkamins
Copy link
Member

bkamins commented Sep 24, 2022

Related to user's problem report https://discourse.julialang.org/t/help-with-appending-row-in-read-in-dataframe-weird-behavior/87734.

When you write:

julia> df2 = CSV.read("test.csv", DataFrame, types=Any)
ERROR: ArgumentError: Non-concrete types passed in `types` keyword argument, please provide concrete types for columns: Any

you get an error. What can the user do to allow a flexible eltype for column read-in, like Any or e.g. Real?

@nickrobinson251
Copy link
Collaborator

This sounds similar to JuliaData/DataFrames.jl#3044

It seems DataFrames.jl users expect push!/append! on a DataFrame to do type promotion -- is that a decision worth revisiting?

@bkamins
Copy link
Member Author

bkamins commented Sep 24, 2022

push!/append! can do type promotion - you just need to write promote=true when doing push!/append!.

@bkamins
Copy link
Member Author

bkamins commented Sep 24, 2022

The point of this issue is that user originally had columns with Any eltype, so potentially user might want to read back a table with Any eltype also.

@nickrobinson251
Copy link
Collaborator

nickrobinson251 commented Sep 24, 2022

i think promote=true might be a good recommendation in such a case -- i wonder if there's anything to be done to make that more clearly an option for users (e.g. something like error hints? or "common gotchas" part of docs?)


On the suggestion for CSV.jl, i think right now the types keyword determines how we try to parse a values, in the parse(T, x) sense, not just the eltype of the resulting vector, which is why we require T to be a concrete type.
So, if we were to allow non-concrete types T, we'd have to decide what that means... I suppose it could mean "do type detection (as if no type info were provided), then push! the result to a vector with eltype T (throwing an error if the detected type cannot be converted to a T)". Are there other possible meanings? Perhaps it could mean "parse to any type <:T", with users still expecting a concretely typed column vector to be returned (but e.g. an error thrown if the detected type isn't <:T rather than e.g. falling back to parsing as a string)?

We do get to pick (and document) the meaning... but i slightly worry about providing a foot-gun, since if users expect the second meaning ("find me a concrete type <:T") then they might be surprised by poor performance downstream when working with abstractly-typed columns.

What do you think?

@bkamins
Copy link
Member Author

bkamins commented Sep 24, 2022

The docstring says:

types: a single Type, AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T,Nothing} to be used for column types

So it promises column types. I agree this might not be the case, but then docstring should be clarified.

I agree with your fears, and that is why I opened this issue to discuss what is best (in the end we might drop the original request and just fix the docs).

@bkamins
Copy link
Member Author

bkamins commented Sep 24, 2022

i wonder if there's anything to be done to make that more clearly an option for users (e.g. something like error hints? or "common gotchas" part of docs?)

I will propose something in DataFrames.jl

@bkamins
Copy link
Member Author

bkamins commented Sep 24, 2022

x-ref JuliaData/DataFrames.jl#3177

@quinnj
Copy link
Member

quinnj commented Oct 6, 2022

Hmmm, yeah, this is tricky. One question is whether passing types=Any would result in the actual columns having Any type (like Vector{Any}), or if types=Any is just a signal that we should do normal type inference and you get whatever you get in the result.

@bkamins
Copy link
Member Author

bkamins commented Oct 6, 2022

I imagined that types=Any would result in column having Any type. Also, I think that if we follow this rule, the implementation should not be super hard.

PS. this is a low-priority issue

@nickrobinson251
Copy link
Collaborator

having thought about it a bit longer, i'm coming round to the idea proposed here. That a non-concrete type T sets the eltype of the resulting vector (and we do type detection on the columns as if no type info were provided, and push! the result to a Vector{T}, throwing an error if the detected type is not <:T). Although i think i'd be less of a fan if the implementation turned out to be difficult 😄 As you say at the top Bogumił, currently there's no way to ask for a non-concrete eltype, which i suppose can be a reasonable request.

@bkamins
Copy link
Member Author

bkamins commented Oct 6, 2022

Although i think i'd be less of a fan if the implementation turned out to be difficult

That is why I was thinking of allowing Any as non-concrete type and a special case (in which case the implementation, hopefully, would not be that complicated).

@nickrobinson251
Copy link
Collaborator

oh, you mean special-casing types=Any (different from other abstract types e.g. types=Number)?

@bkamins
Copy link
Member Author

bkamins commented Oct 6, 2022

Yes, as currently types=Number is not allowed. Of course your solution would be more flexible, but I was afraid it might be more complex for @quinnj to do.

@quinnj
Copy link
Member

quinnj commented Oct 6, 2022

but I was afraid it might be more complex for @quinnj to do.

That's fine, I'll just make @nickrobinson251 do it 😛 .

Yeah, I think I like special-casing Any for now and see how that looks. We can always allow more abstract types later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants