Appending Dataframes after CSV.read fails for different length String columns #3044

Closed
jd-foster opened this issue Apr 29, 2022 · 8 comments

@jd-foster

An error arises when using CSV.read(filename, DataFrame) on two or more files with String columns and then calling append! on the results.

This may not be an issue with DataFrames.jl per se. If one file has a column parsed as String15 and the second has longer strings in the same column, parsed as String31, then appending the vectors fails because the column cannot be "promoted" to the longer type.

See also:
JuliaData/CSV.jl#945

MWE:

using CSV, DataFrames
csv_data1  = ["Description,Scenario\n1,LittleOne\n"]
csv_data2 = ["Description,Scenario\n2,Really Long Scenario Name\n"]

df1 = CSV.read(map(IOBuffer, csv_data1), DataFrame)
df2 = CSV.read(map(IOBuffer, csv_data2), DataFrame)

append!(df1,df2)

I get:

┌ Error: Error adding value to column :Scenario.
└ @ DataFrames ~/.julia/packages/DataFrames/MA4YO/src/dataframe/dataframe.jl:1423
ERROR: ArgumentError: string too large (25) to convert to InlineStrings.String15
Stacktrace:
  [1] stringtoolong(T::Type, n::Int64)
    @ InlineStrings ~/.julia/packages/InlineStrings/F5Dhz/src/InlineStrings.jl:264
  [2] String15
    @ ~/.julia/packages/InlineStrings/F5Dhz/src/InlineStrings.jl:247 [inlined]
  [3] convert
    @ ./strings/basic.jl:232 [inlined]
...
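
The failure can be reproduced without CSV.jl or DataFrames.jl. The following is a minimal sketch, assuming only that InlineStrings.jl is installed; the variable names are illustrative and not part of the original report. It shows that narrowing a 25-byte string to String15 throws the same ArgumentError, while the two types promote to String31.

```julia
using InlineStrings

short = String15("LittleOne")                   # 9 bytes, fits in String15
long  = String31("Really Long Scenario Name")   # 25 bytes, needs String31

# Narrowing to the smaller inline type is what fails.
narrow_failed = try
    String15(long)
    false
catch err
    err isa ArgumentError
end

# Pushing into a Vector{String15} goes through the same conversion.
col = [short]
push_failed = try
    push!(col, long)
    false
catch err
    err isa ArgumentError
end

# Widening is fine: the common type is the larger inline string.
widened = promote_type(String15, String31)
```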
@jd-foster changed the title from "Appending Dataframes after CSV.read fails" to "Appending Dataframes after CSV.read fails for different length String columns" on Apr 29, 2022
@bkamins
Member

bkamins commented Apr 29, 2022

This is not a bug but the correct behavior.
It is caused by neither DataFrames.jl nor CSV.jl; it is a feature of InlineStrings.jl.
Unfortunately, it is indeed quite inconvenient.

@quinnj - there is probably nothing we can do automatically about it, right?

The solution is the following:

julia> using CSV, DataFrames

julia> csv_data1  = ["Description,Scenario\n1,LittleOne\n"]
1-element Vector{String}:
 "Description,Scenario\n1,LittleOne\n"

julia> csv_data2 = ["Description,Scenario\n2,Really Long Scenario Name\n"]
1-element Vector{String}:
 "Description,Scenario\n2,Really Long Scenario Name\n"

julia> df1 = CSV.read(map(IOBuffer, csv_data1), DataFrame, stringtype=String)
1×2 DataFrame
 Row │ Description  Scenario  
     │ Int64        String    
─────┼────────────────────────
   1 │           1  LittleOne

julia> df2 = CSV.read(map(IOBuffer, csv_data2), DataFrame, stringtype=String)
1×2 DataFrame
 Row │ Description  Scenario
     │ Int64        String
─────┼────────────────────────────────────────
   1 │           2  Really Long Scenario Name

julia> append!(df1, df2)
2×2 DataFrame
 Row │ Description  Scenario
     │ Int64        String
─────┼────────────────────────────────────────
   1 │           1  LittleOne
   2 │           2  Really Long Scenario Name
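
If the data frames have already been read with inline string columns, the target column can also be converted after the fact. This is a hedged sketch rather than a recommendation from the thread: the data frames are built by hand to mimic what CSV.read inferred above, and it assumes DataFrames.jl and InlineStrings.jl are installed.

```julia
using DataFrames, InlineStrings

# Hand-built stand-ins for the two parsed files (assumed column types).
df1 = DataFrame(Description = [1], Scenario = [String15("LittleOne")])
df2 = DataFrame(Description = [2], Scenario = [String31("Really Long Scenario Name")])

# Replace the String15 column with a plain Vector{String}.
df1.Scenario = String.(df1.Scenario)

# Any inline string converts to String, so the append now succeeds.
append!(df1, df2)
```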

@bkamins closed this as completed on Apr 29, 2022
@jd-foster
Author

Thanks for the reply. Your solution is the one I arrived at too, but it may be hard for new users to discover or understand.

@bkamins
Member

bkamins commented Apr 29, 2022

Yes, I agree. That is why I pinged @quinnj who maintains InlineStrings.jl.

@quinnj
Member

quinnj commented May 3, 2022

@bkamins, what ends up getting called internally for append!? Do we try to promote the column types at all? Or do column types have to be exact matching?

@bkamins
Member

bkamins commented May 3, 2022

By default we use Base.append!, and this is probably the case @jd-foster means.

Alternatively you can write:

julia> append!(df1,df2, promote=true)
2×2 DataFrame
 Row │ Description  Scenario
     │ Int64        String31
─────┼────────────────────────────────────────
   1 │           1  LittleOne
   2 │           2  Really Long Scenario Name

to explicitly ask for promotion; in this case everything works as @jd-foster expects.
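
For comparison, here is a small sketch of the in-place and out-of-place ways to get a promoted column. It assumes DataFrames.jl and InlineStrings.jl are installed and uses hand-built data frames in place of the parsed files; make_df1 is a hypothetical helper used only to get a fresh copy. As far as I know, vcat determines column element types with promote_type, so it should widen the column without any keyword argument.

```julia
using DataFrames, InlineStrings

# Hypothetical helper: a fresh data frame with a String15 column.
make_df1() = DataFrame(Description = [1], Scenario = [String15("LittleOne")])
df2 = DataFrame(Description = [2], Scenario = [String31("Really Long Scenario Name")])

# In place: df1's column is widened as needed.
df1 = make_df1()
append!(df1, df2; promote = true)

# Out of place: vcat allocates new columns with the promoted types.
stacked = vcat(make_df1(), df2)
```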

@quinnj
Member

quinnj commented May 4, 2022

I think I would personally expect promote=true to be the default? Are there problems with assuming that?

@quinnj
Member

quinnj commented May 4, 2022

I guess we currently have this behavior with Base.append!:

julia> x = [1, 2, 3]
3-element Vector{Int64}:
 1
 2
 3

julia> y = [1.2, 3.4, 5.6]
3-element Vector{Float64}:
 1.2
 3.4
 5.6

julia> append!(x, y)
ERROR: InexactError: Int64(1.2)
Stacktrace:
 [1] Int64
   @ ./float.jl:788 [inlined]
 [2] convert
   @ ./number.jl:7 [inlined]
 [3] setindex!
   @ ./array.jl:966 [inlined]
 [4] _unsafe_copyto!(dest::Vector{Int64}, doffs::Int64, src::Vector{Float64}, soffs::Int64, n::Int64)
   @ Base ./array.jl:253
 [5] unsafe_copyto!
   @ ./array.jl:307 [inlined]
 [6] _copyto_impl!
   @ ./array.jl:331 [inlined]
 [7] copyto!
   @ ./array.jl:317 [inlined]
 [8] append!(a::Vector{Int64}, items::Vector{Float64})
   @ Base ./array.jl:1109
 [9] top-level scope
   @ REPL[3]:1

julia> append!(y, x)
9-element Vector{Float64}:
 1.2
 3.4
 5.6
 1.0
 2.0
 3.0
 0.0
 0.0
 0.0

i.e. we'll promote the incoming vector to the original vector type, but won't modify the original vector type to promote to a common type.
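
The same asymmetry in a compact form, as a sketch using only Base. It works on fresh copies so that neither call affects the other; in the transcript above, the three trailing 0.0 entries presumably appear because the failed append!(x, y) had already resized x before throwing, though that detail may depend on the Julia version.

```julia
ints   = [1, 2, 3]
floats = [1.2, 3.4, 5.6]

# Int64 values convert losslessly to Float64, so this direction works.
grown = append!(copy(floats), ints)

# The reverse needs Int64(1.2), which is inexact and throws.
threw = try
    append!(copy(ints), floats)
    false
catch err
    err isa InexactError
end

# vcat allocates a new vector with the promoted element type instead.
joined = vcat(ints, floats)
```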

@bkamins
Member

bkamins commented May 4, 2022

> i.e. we'll promote the incoming vector to the original vector type, but won't modify the original vector type to promote to a common type.

This is exactly why promote=false is the default, except when the cols kwarg is :union or :subset. In those cases promote defaults to true, because promotion is required to make the operation work.
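
A short sketch of the cols=:union case, assuming DataFrames.jl is installed; the column names and values are made up for illustration. No promote keyword is passed, yet column a should be widened to Float64 and the new column b filled with missing for the existing rows.

```julia
using DataFrames

df1 = DataFrame(a = [1, 2])
df2 = DataFrame(a = [1.5], b = ["x"])

# With cols = :union, promote defaults to true:
# a must widen to hold 1.5, and b must allow missing for the old rows.
append!(df1, df2; cols = :union)
```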
