Performance regressions CSV.Rows since 0.5? #752
Thanks for reporting; I did some digging last night and I think I know what's going on; I'll try to get a fix up today.
Fixes #752. This is a case of us not providing quite enough information to the compiler, along with the compiler itself being too clever. The default for `CSV.Rows` is to treat each column as `Union{String, Missing}`, which results in the `V` type parameter of `CSV.Rows` being `CSV.PosLen` instead of `Any`. In that case, we should get pretty good inferrability for `getproperty(::Row2, ::Symbol)`, because we should be able to know the return value will at least be `Union{String, Missing}`.

This knowledge, however, was trapped in the "csv domain" and not expressed clearly enough to the compiler. It inspected `Tables.getcolumn(::Row2, nm::Symbol)` and saw that it called `Tables.getcolumn(::Row2, i::Int)`, which in turn called `Tables.getcolumn(::Row2, T, i, nm)`. This is all fine and expected, except that when we started supporting non-String types for `CSV.Rows` (i.e. you can pass in whatever type you want and we'll parse it directly from the file for each row), we added an additional typed `Tables.getcolumn` method that handled all the non-String columns. Oops. Now the compiler is confused, because from `Tables.getcolumn(::Row2, nm::Symbol)` it knows it can return `missing`, a `String`, or, if we call this third method, an instance of our `V` type parameter, which, if you'll remember, in the default case is `CSV.PosLen`, or more simply, `UInt64`. So we ended up with a return type of `Union{Missing, UInt64, String}`, which makes downstream operations even trickier to figure out.

Luckily, the solution here is just to help connect the dots for the compiler: i.e. define specialized methods that dispatch on `V`, specifically when `V === UInt64`. Then the compiler will see/know that we will only ever call the `Union{String, Missing}` method and can ignore the custom-types codepath. This PR also rearranges a few `@inbounds` uses, since we can avoid the bounds checks further down the stack once we've checked them higher up.
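The shape of the fix can be sketched with a toy model (the names `ToyRow` and `getcol` are hypothetical stand-ins for illustration, not CSV.jl's actual internals): a generic accessor whose return type widens to include the `V` parameter, plus a specialized method for the default `UInt64` case that lets inference narrow back to `Union{String, Missing}`.

```julia
# Toy model of the dispatch fix. `V` plays the role of CSV.Rows' value
# type parameter; `UInt64` stands in for CSV.PosLen.
struct ToyRow{V}
    strings::Vector{String}  # materialized String values
    vals::Vector{V}          # raw typed values for custom-type columns
end

# Generic accessor: a column may yield `missing`, a String, or a raw `V`,
# so inference widens the return type to Union{Missing, String, V}.
getcol(r::ToyRow{V}, i::Int) where {V} =
    i == 0 ? missing : isodd(i) ? r.strings[i] : r.vals[i]

# Specialized method when V === UInt64 (the default case): only a String
# or `missing` can come back, so inference stays Union{Missing, String}.
getcol(r::ToyRow{UInt64}, i::Int) = i == 0 ? missing : r.strings[i]
```

With the specialized method in place, `Base.return_types(getcol, Tuple{ToyRow{UInt64}, Int})` should report `Union{Missing, String}`, while a custom-type instantiation such as `ToyRow{Float64}` still widens to `Union{Missing, String, Float64}`.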
Alright; it was a little tricky to track down, but I've got a fix up for this: #753. One thing to note is that the benchmarking is much cleaner if you put the iteration in a function:

```julia
function read(rows)
    bla = 0
    for r in rows
        bla += hash(r.a)
        bla += hash(r.b)
        bla += hash(r.c)
        bla += hash(r.d)
        bla += hash(r.e)
        bla += hash(r.f)
        bla += hash(r.g)
        bla += hash(r.h)
    end
    bla
end

rows = CSV.Rows("test.csv", reusebuffer=true, header=Symbol.('a':'h'))
@benchmark read(rows)
```

With that rearrangement, I get these timings with the fix in my PR:

```
julia> @benchmark read(rows)
BenchmarkTools.Trial:
  memory estimate:  30.52 MiB
  allocs estimate:  900001
  --------------
  minimum time:     32.326 ms (0.00% GC)
  median time:      37.181 ms (9.83% GC)
  mean time:        37.344 ms (7.23% GC)
  maximum time:     46.213 ms (9.55% GC)
  --------------
  samples:          134
  evals/sample:     1
```

And without the fix in the PR:

```
julia> @benchmark read(rows)
BenchmarkTools.Trial:
  memory estimate:  24.41 MiB
  allocs estimate:  800001
  --------------
  minimum time:     41.163 ms (0.00% GC)
  median time:      44.335 ms (4.09% GC)
  mean time:        44.553 ms (2.50% GC)
  maximum time:     54.575 ms (0.00% GC)
  --------------
  samples:          113
  evals/sample:     1
```

Which seems in line with what I would expect; note that a big update between 0.5.26 and 0.7.7 is the ability to support parsing custom types for any column, which incurred a similar 5-10% performance hit, along with some of the other improvements that have been made.
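The reason wrapping the loop in a function gives cleaner benchmarks is general Julia behavior, not CSV.jl-specific: code at global scope works with untyped globals and is dynamically dispatched, while a function body is compiled against the concrete types of its arguments. A minimal sketch (the names `data` and `sum_hashes` are illustrative):

```julia
# At global scope, `data`'s type is only known at runtime, so a
# global-scope loop over it would be dynamically dispatched each iteration.
data = ["a", "b", "c"]

# Inside a function, `xs` has a concrete type at compile time, so the
# loop and the `hash` calls compile to tight, statically dispatched code.
function sum_hashes(xs)
    acc = UInt(0)
    for x in xs
        acc += hash(x)
    end
    return acc
end

sum_hashes(data)
```

BenchmarkTools makes a similar recommendation for globals: interpolating them into the expression, e.g. `@benchmark read($rows)`, keeps the untyped-global access out of the timed code.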
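The `@inbounds` rearrangement mentioned above follows a common Julia pattern, sketched here with illustrative helper names (not the actual CSV.jl code): validate an index once at the entry point with `@boundscheck`, then let callers that have already proven the index valid elide the check with `@inbounds`.

```julia
# Entry point: validate the index once. The @boundscheck block is removed
# when this (inlined) function is called from an @inbounds region.
@inline function col_at(strings::Vector{String}, i::Int)
    @boundscheck checkbounds(strings, i)
    return @inbounds strings[i]
end

# Inner loop: `eachindex` guarantees valid indices, so the bounds check
# further down the stack can be skipped once it's been checked higher up.
function total_length(strings::Vector{String})
    n = 0
    for i in eachindex(strings)
        s = @inbounds col_at(strings, i)
        n += length(s)
    end
    return n
end
```

Called normally (outside `@inbounds`), `col_at` still throws a `BoundsError` on a bad index, so the safety check is only skipped where the caller has already established it.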
Opened new issue #1075, since the cause is likely different.
When I run the following benchmark on v0.5.26, I get:

Running the same on v0.7.7 is 4x slower:

Is this a performance regression, or have I missed an API change?