Overhaul how column pooling works while parsing #962
Conversation
Fixes #950. Ok, so there's a lot going on here; let me try to summarize the motivation and what the changes are doing:

* In 0.8 releases, `CSV.File` would parse the 1st 100 rows of a file, then use the # of unique values and the provided `pool` argument to determine whether a column should be pooled or not.
* In 0.9, this was changed, since just considering the 1st 100 rows tends not to be very representative of large files in terms of the # of unique values. Instead, we parsed the entire file, and afterwards did the calculation for whether the column should be pooled.
* Additionally, in multithreaded parsing, each thread parsed its own chunk of the file, assuming a column would be pooled. After parsing, we needed to "synchronize" the ref codes that would be used by a pooled column. This ended up being an expensive operation, recoding each chunk of a column for the whole file and processing and reprocessing the unique values of each chunk. It was notably expensive when the # of unique values was large (tens of thousands of values), as reported in #950.

That provides the context for why we needed to change; so what is changing here:

* We previously tried to assume a column would be pooled before parsing and build up the "pool dict" while parsing. This ended up complicating a lot of code: additional code to build the dict, complications because the element type doesn't match the column type (`UInt32` for the ref array, `T` for the pool dict key type), and complications for all the promotion machinery (needing to promote/convert pool values instead of ref values). That's in addition to the very complex `syncrefs!` routine, which has been the source of more than a few bugs.
* Proposed in this PR is to ignore pooling until post-parsing, where we start checking unique values and do the calculation for whether a column should be pooled or not.
* Also introduced in this PR is allowing the `pool` keyword argument to be a number greater than `1.0`, which is interpreted as an absolute upper limit on the # of unique values allowed before a column switches from pooled to non-pooled. This seems to make sense for larger files for several reasons: using a % can result in a really large # of allowed unique values, which can be a performance bottleneck (though still allowed in this PR), and it gives us a way to "short-circuit" the pooling check post-parsing; if the set of unique values grows larger than the upper limit, we can immediately abort building an expensive pool with potentially lots of unique values (sketched below). We update the `pool` keyword argument to a default of `500`. Another note: a file must have more rows than the `pool` upper limit in order to pool; otherwise, the column won't be pooled.

One consideration, and potential optimization: if `stringtype=String`, or a column otherwise has a type of `String`, we'll individually allocate a `String` for each row during parsing, instead of getting the natural "interning" benefit we got from the pooling strategies of 0.8 or of the code prior to this PR. During various rounds of benchmarking ever since before 0.8, one thing that has continued to surface is how efficient allocating individual strings actually is. So while we could _potentially_ gain a little efficiency by interning string columns while parsing, the benefit is smaller than people might guess. The memory saved by interning is partly used up by having to keep the intern pool around, and the extra cost of hashing to check for interning can be comparable to just allocating a smallish string and setting it in our parsed array. The other consideration is multithreaded parsing: the interning benefit isn't as strong when each thread has to maintain its own intern pool (not as many values to intern against). We could perhaps explore using a lock-free or spin-lock-based dict/set for interning globally across a column, which may end up providing enough performance benefit. If we do manage to figure out efficient interning, we could leverage the intern pool post-parsing when considering whether a column should be pooled or not.
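Here is a minimal sketch of the post-parsing pooling check with the absolute-limit short-circuit described above. It is illustrative only, not CSV.jl's actual internals; `maybe_pool` and its arguments are hypothetical names:

```julia
using PooledArrays

# Decide post-parsing whether a parsed string column should be pooled,
# aborting as soon as the # of unique values exceeds the absolute limit.
function maybe_pool(column::Vector{String}, limit::Integer)
    # a column only pools if the file has more rows than the limit
    length(column) > limit || return column
    seen = Set{String}()
    for val in column
        push!(seen, val)
        # short-circuit: too many unique values, keep the plain Vector
        length(seen) > limit && return column
    end
    return PooledArray(column)  # few enough unique values: pool
end
```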
Codecov Report
```
@@           Coverage Diff            @@
##            main    #962      +/-  ##
========================================
- Coverage  90.28%  89.95%   -0.33%
========================================
  Files          9       9
  Lines       2306    2211      -95
========================================
- Hits        2082    1989      -93
+ Misses       224     222       -2
```
docs/src/reading.md (outdated)

```diff
@@ -192,7 +192,7 @@ A `Dict{Type, Type}` argument that allows replacing a non-`String` standard type

 ## [`pool`](@id pool)

-Argument that controls whether columns will be returned as `PooledArray`s. Can be provided as a `Bool`, `Float64`, vector of `Bool` or `Float64`, dict mapping column number/name to `Bool` or `Float64`, or a function of the form `(i, name) -> Union{Bool, Float64, Nothing}`. As a `Bool`, controls absolutely whether a column will be pooled or not; if passed as a single `Bool` argument like `pool=true`, then all string columns will be pooled, regardless of cardinality. When passed as a `Float64`, the value should be between `0.0` and `1.0` indicating the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if `pool=0.1`, then all string columns with a unique value % less than 10% will be returned as `PooledArray`, while other string columns will be normal string vectors. As mentioned, when the `pool` argument is a single `Bool` or `Float64`, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a `Bool` or `Float64`. Similar to the [types](@ref types) argument, providing a vector to `pool` should have an element for each column in the data input, while a dict argument can map column number/name to `Bool` or `Float64` for specific columns. Unspecified columns will not be pooled when the argument is a dict.
+Argument that controls whether columns will be returned as `PooledArray`s. Can be provided as a `Bool`, `Float64`, `Integer`, vector of `Bool` or number, dict mapping column number/name to `Bool` or number, or a function of the form `(i, name) -> Union{Bool, Real, Nothing}`. As a `Bool`, controls absolutely whether a column will be pooled or not; if passed as a single `Bool` argument like `pool=true`, then all string columns will be pooled, regardless of cardinality. When passed as a `Float64`, the value should be between `0.0` and `1.0` to indicate the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if `pool=0.1`, then all string columns with a unique value % less than 10% will be returned as `PooledArray`, while other string columns will be normal string vectors. If `pool` is provided as a number greater than `1.0`, it will be treated as an upper limit on the # of unique values allowed to pool the column. For example, `pool=500` means if a String column has less than or equal to 500 unique values, it will be pooled, otherwise, it won't. As mentioned, when the `pool` argument is a single `Bool` or number, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a `Bool` or `Float64`. Similar to the [types](@ref types) argument, providing a vector to `pool` should have an element for each column in the data input, while a dict argument can map column number/name to `Bool` or `Float64` for specific columns. Unspecified columns will not be pooled when the argument is a dict.
```
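For illustration, a few usage examples of the `pool` forms this doc text describes ("data.csv" is a hypothetical file):

```julia
using CSV

CSV.File("data.csv"; pool=true)  # pool all string columns unconditionally
CSV.File("data.csv"; pool=0.1)   # pool string columns with < 10% unique values
CSV.File("data.csv"; pool=500)   # pool string columns with <= 500 unique values
# per-column control via a dict; unspecified columns are not pooled
CSV.File("data.csv"; pool=Dict(1 => true, 2 => 0.2))
```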
I am OK with overloading `pool`, but my concern is clarity of the rules. We need to be very precise here, I think; the description is a bit confusing. Are you making a decision based on type or only on value? For safety I would use the following rules:

- `Bool` - as you propose
- subtype of `AbstractFloat` - then require it to be in the `[0.0, 1.0]` interval and treat it as a fraction
- subtype of `Union{Signed, Unsigned}` and a positive number - then treat it as an absolute value

This in particular resolves the ambiguity if someone passes the literal `1` - is this 100%, or pooling only if there is exactly one unique value? Following the current description it would be treated as 100%, but following what I propose the distinction would be: `1.0` - 100%, and `1` - exactly one value.
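A minimal sketch of this type-based dispatch (a hypothetical `checkpooled` helper, not CSV.jl API; note that in Julia `Bool <: Integer` but `Bool` is not a subtype of `Signed` or `Unsigned`, so these methods don't collide):

```julia
# `pool` decision given the # of unique values and total # of rows
checkpooled(pool::Bool, nuniq::Int, nrows::Int) = pool
function checkpooled(pool::AbstractFloat, nuniq::Int, nrows::Int)
    0.0 <= pool <= 1.0 || throw(ArgumentError("float `pool` must lie in [0.0, 1.0]"))
    return nuniq / nrows < pool        # value is a fraction of unique values
end
function checkpooled(pool::Union{Signed, Unsigned}, nuniq::Int, nrows::Int)
    pool > 0 || throw(ArgumentError("integer `pool` must be positive"))
    return nuniq <= pool               # value is an absolute # of unique values
end
```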
Having thought more about it, the problem with `pool=500` by default is that if you are reading a small data frame with fewer than 500 rows, it would always be pooled - which I am not sure is desirable. In the end maybe we need the same as in `isapprox`: `atol` and `rtol`. In order to keep the `pool` kwarg unchanged and to ensure backward compatibility, maybe we can allow passing a `Tuple` like `(0.2, 500)`?
Yes, I think that's a good distinction to clarify.
@bkamins, so in the implementation, I actually check that the total # of rows parsed is greater than the `pool` value (if an absolute `pool` value is given), and if not, we don't pool. So by default, we would need a file with more than 500 rows, but fewer than 500 unique values, to be pooled.
In the `(0.2, 500)` case, we would check that both conditions are met to pool? i.e. the % of unique values would have to be < 20% AND there would have to be < 500 total unique values? I think that makes sense to me.
> we would check that both conditions are met to pool?

Yes.

> So by default, we would need a file with more than 500 rows, but fewer than 500 unique values, to be pooled.

This is only a half-measure: if you have 600 rows you would still pool, and probably you should not.
Ok, implemented allowing `pool::Tuple{Float64, Int}` with a default of `pool=(0.2, 500)`. I really like that solution; it feels very clean implementation-wise.
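For clarity, a minimal sketch of the agreed tuple semantics (a hypothetical helper, not the actual CSV.jl code): both the relative and the absolute condition must hold for a column to pool.

```julia
# (rtol, atol)-style check: fraction of unique values AND absolute count
function checkpooled(pool::Tuple{Float64, Int}, nuniq::Int, nrows::Int)
    rtol, atol = pool
    return nuniq / nrows <= rtol && nuniq <= atol
end

checkpooled((0.2, 500), 90, 1_000)    # true: 9% unique and 90 <= 500
checkpooled((0.2, 500), 600, 10_000)  # false: 6% unique but 600 > 500
```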
Thank you. Also, I think it should be easy enough to explain in the docs, as many people are used to relative and absolute tolerances; e.g., when doing optimization there is exactly the same issue.