Performance regressions CSV.Rows since 0.5? #752
Thanks for reporting; I did some digging last night and I think I know what's going on; I'll try to get a fix up today.
Fixes #752. This is a case of us not providing quite enough information to the compiler, along with the compiler itself being too clever. The default for `CSV.Rows` is to treat each column as `Union{String, Missing}`, which results in the `V` type parameter of `CSV.Rows` being `CSV.PosLen` instead of `Any`. In that case, we should get pretty good inferrability for `getproperty(::Row2, ::Symbol)`, because we should be able to know the return value will at least be `Union{String, Missing}`.

This knowledge, however, was trapped in the "csv domain" and not expressed clearly enough to the compiler. It inspected `Tables.getcolumn(::Row2, nm::Symbol)` and saw that it called `Tables.getcolumn(::Row2, i::Int)`, which in turn called `Tables.getcolumn(::Row2, T, i, nm)`. This is all fine and expected, except that when we started supporting non-String types for `CSV.Rows` (i.e. you can pass in whatever type you want and we'll parse it directly from the file for each row), we added an additional typed `Tables.getcolumn` method that handled all the non-String columns. Oops. Now the compiler is confused, because from `Tables.getcolumn(::Row2, nm::Symbol)` it knows it can return `missing`, a `String`, or, if we call this third method, an instance of our `V` type parameter, which, if you'll remember, in the default case is `CSV.PosLen`, or more simply, `UInt64`. So we ended up with a return type of `Union{Missing, UInt64, String}`, which makes downstream operations even trickier to figure out.

Luckily, the solution here is just to help connect the dots for the compiler: i.e. define specialized methods that dispatch on `V`, specifically when `V === UInt64`. Then the compiler will see/know that we will only ever call the `Union{String, Missing}` method and can ignore the custom-types codepath. This PR also rearranges a few `@inbounds` uses, since we can avoid the bounds checks further down the stack once we've checked them higher up.
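The shape of the fix can be sketched with a toy model (the names `ToyRow` and `getcol` are hypothetical stand-ins for illustration, not CSV.jl's actual internals): a generic accessor whose return type widens to include the `V` parameter, plus a specialized method for the default `UInt64` case that lets inference narrow back to `Union{String, Missing}`.

```julia
# Toy model of the dispatch fix. `V` plays the role of CSV.Rows' value
# type parameter; `UInt64` stands in for CSV.PosLen.
struct ToyRow{V}
    strings::Vector{String}  # materialized String values
    vals::Vector{V}          # raw typed values for custom-type columns
end

# Generic accessor: a column may yield `missing`, a String, or a raw `V`,
# so inference widens the return type to Union{Missing, String, V}.
getcol(r::ToyRow{V}, i::Int) where {V} =
    i == 0 ? missing : isodd(i) ? r.strings[i] : r.vals[i]

# Specialized method when V === UInt64 (the default case): only a String
# or `missing` can come back, so inference stays Union{Missing, String}.
getcol(r::ToyRow{UInt64}, i::Int) = i == 0 ? missing : r.strings[i]
```

With the specialized method in place, `Base.return_types(getcol, Tuple{ToyRow{UInt64}, Int})` should report `Union{Missing, String}`, while a custom-type instantiation such as `ToyRow{Float64}` still widens to `Union{Missing, String, Float64}`.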
Alright; it was a little tricky to track down, but I've got a fix up for this: #753. One thing to note is that the benchmarking is much cleaner if you put the iteration in a function:

```julia
function read(rows)
    bla = 0
    for r in rows
        bla += hash(r.a)
        bla += hash(r.b)
        bla += hash(r.c)
        bla += hash(r.d)
        bla += hash(r.e)
        bla += hash(r.f)
        bla += hash(r.g)
        bla += hash(r.h)
    end
    bla
end

rows = CSV.Rows("test.csv", reusebuffer=true, header=Symbol.('a':'h'))
@benchmark read(rows)
```

With that rearrangement, I get these timings with the fix in my PR:

```
julia> @benchmark read(rows)
BenchmarkTools.Trial:
  memory estimate:  30.52 MiB
  allocs estimate:  900001
  --------------
  minimum time:     32.326 ms (0.00% GC)
  median time:      37.181 ms (9.83% GC)
  mean time:        37.344 ms (7.23% GC)
  maximum time:     46.213 ms (9.55% GC)
  --------------
  samples:          134
  evals/sample:     1
```

And without the fix in the PR:

```
julia> @benchmark read(rows)
BenchmarkTools.Trial:
  memory estimate:  24.41 MiB
  allocs estimate:  800001
  --------------
  minimum time:     41.163 ms (0.00% GC)
  median time:      44.335 ms (4.09% GC)
  mean time:        44.553 ms (2.50% GC)
  maximum time:     54.575 ms (0.00% GC)
  --------------
  samples:          113
  evals/sample:     1
```

Which seems in line with what I would expect; note that a big update between 0.5.26 and 0.7.7 is the ability to support parsing custom types for any column, which incurred a similar 5-10% performance hit, along with some of the other improvements that have been made.
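The reason wrapping the loop in a function gives cleaner benchmarks is general Julia behavior, not CSV.jl-specific: code at global scope works with untyped globals and is dynamically dispatched, while a function body is compiled against the concrete types of its arguments. A minimal sketch (the names `data` and `sum_hashes` are illustrative):

```julia
# At global scope, `data`'s type is only known at runtime, so a
# global-scope loop over it would be dynamically dispatched each iteration.
data = ["a", "b", "c"]

# Inside a function, `xs` has a concrete type at compile time, so the
# loop and the `hash` calls compile to tight, statically dispatched code.
function sum_hashes(xs)
    acc = UInt(0)
    for x in xs
        acc += hash(x)
    end
    return acc
end

sum_hashes(data)
```

BenchmarkTools makes a similar recommendation for globals: interpolating them into the expression, e.g. `@benchmark read($rows)`, keeps the untyped-global access out of the timed code.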
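The `@inbounds` rearrangement mentioned above follows a common Julia pattern, sketched here with illustrative helper names (not the actual CSV.jl code): validate an index once at the entry point with `@boundscheck`, then let callers that have already proven the index valid elide the check with `@inbounds`.

```julia
# Entry point: validate the index once. The @boundscheck block is removed
# when this (inlined) function is called from an @inbounds region.
@inline function col_at(strings::Vector{String}, i::Int)
    @boundscheck checkbounds(strings, i)
    return @inbounds strings[i]
end

# Inner loop: `eachindex` guarantees valid indices, so the bounds check
# further down the stack can be skipped once it's been checked higher up.
function total_length(strings::Vector{String})
    n = 0
    for i in eachindex(strings)
        s = @inbounds col_at(strings, i)
        n += length(s)
    end
    return n
end
```

Called normally (outside `@inbounds`), `col_at` still throws a `BoundsError` on a bad index, so the safety check is only skipped where the caller has already established it.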
Opened new issue #1075, since the cause is likely different.
When I run the following benchmark on v0.5.26, I get:

Running the same on v0.7.7 is 4x slower:

Is this a performance regression, or have I missed an API change?