Overhaul how column pooling works while parsing #962
Conversation
Fixes #950. Ok, so there's a lot going on here; let me try to summarize the motivation and what the changes are doing:

* In 0.8 releases, `CSV.File` would parse the 1st 100 rows of a file, then use the # of unique values and the provided `pool` argument to determine whether a column should be pooled or not.
* In 0.9, this was changed, since just considering the 1st 100 rows tends not to be very representative of large files in terms of the # of unique values. Instead, we parsed the entire file, and afterwards did the calculation for whether the column should be pooled.
* Additionally, in multithreaded parsing, each thread parsed its own chunk of the file, assuming a column would be pooled. After parsing, we needed to "synchronize" the ref codes that would be used by a pooled column. This ended up being an expensive operation, recoding each chunk of a column for the whole file and processing and reprocessing the unique values of each chunk. It was notably expensive when the # of unique values was large (tens of thousands of values), as reported in #950.

That provides the context for why we needed to change; so what is changing here:

* We previously tried to assume a column would be pooled before parsing and build up the "pool dict" while parsing. This ended up complicating a lot of code: additional code to build the dict, complications because the element type doesn't match the column type (`UInt32` for the ref array, `T` for the pool dict key type), and complications for all the promotion machinery (needing to promote/convert pool values instead of ref values). That's in addition to the very complex `syncrefs!` routine, which has been the source of more than a few bugs.
* Proposed in this PR is to ignore pooling until post-parsing, where we start checking unique values and do the calculation for whether a column should be pooled or not.
* Also introduced in this PR is allowing the `pool` keyword argument to be a number greater than `1.0`, which is interpreted as an absolute upper limit on the # of unique values allowed before a column switches from pooled to non-pooled. This seems to make sense for larger files for several reasons: using a % can result in a really large # of allowed unique values, which can be a performance bottleneck (though still allowed in this PR), and it gives us a way to "short-circuit" the pooling check post-parsing; if the set of unique values grows larger than the upper limit, we can immediately abort building an expensive pool with potentially lots of unique values (sketched below). We update the `pool` keyword argument to a default of `500`. Another note: a file must have more rows than the `pool` upper limit in order to pool; otherwise, the column won't be pooled.

One consideration, and potential optimization: if `stringtype=String`, or a column otherwise has a type of `String`, we'll individually allocate a `String` for each row during parsing, instead of getting the natural "interning" benefit we got from the pooling strategies of 0.8 or of the code prior to this PR. During various rounds of benchmarking ever since before 0.8, one thing that has continued to surface is how efficient allocating individual strings actually is. So while we could _potentially_ gain a little efficiency by interning string columns while parsing, the benefit is smaller than people might guess. The memory saved by interning is partly used up by having to keep the intern pool around, and the extra cost of hashing to check for interning can be comparable to just allocating a smallish string and setting it in our parsed array. The other consideration is multithreaded parsing: the interning benefit isn't as strong when each thread has to maintain its own intern pool (not as many values to intern against). We could perhaps explore using a lock-free or spin-lock-based dict/set for interning globally across a column, which may end up providing enough performance benefit. If we do manage to figure out efficient interning, we could leverage the intern pool post-parsing when considering whether a column should be pooled or not.
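Here is a minimal sketch of the post-parsing pooling check with the absolute-limit short-circuit described above. It is illustrative only, not CSV.jl's actual internals; `maybe_pool` and its arguments are hypothetical names:

```julia
using PooledArrays

# Decide post-parsing whether a parsed string column should be pooled,
# aborting as soon as the # of unique values exceeds the absolute limit.
function maybe_pool(column::Vector{String}, limit::Integer)
    # a column only pools if the file has more rows than the limit
    length(column) > limit || return column
    seen = Set{String}()
    for val in column
        push!(seen, val)
        # short-circuit: too many unique values, keep the plain Vector
        length(seen) > limit && return column
    end
    return PooledArray(column)  # few enough unique values: pool
end
```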
Codecov Report
```
@@           Coverage Diff            @@
##            main    #962      +/-  ##
========================================
- Coverage  90.28%  89.95%   -0.33%
========================================
  Files          9       9
  Lines       2306    2211      -95
========================================
- Hits        2082    1989      -93
+ Misses       224     222       -2
```
docs/src/reading.md (outdated)

```diff
@@ -192,7 +192,7 @@ A `Dict{Type, Type}` argument that allows replacing a non-`String` standard type

 ## [`pool`](@id pool)

-Argument that controls whether columns will be returned as `PooledArray`s. Can be provided as a `Bool`, `Float64`, vector of `Bool` or `Float64`, dict mapping column number/name to `Bool` or `Float64`, or a function of the form `(i, name) -> Union{Bool, Float64, Nothing}`. As a `Bool`, controls absolutely whether a column will be pooled or not; if passed as a single `Bool` argument like `pool=true`, then all string columns will be pooled, regardless of cardinality. When passed as a `Float64`, the value should be between `0.0` and `1.0` indicating the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if `pool=0.1`, then all string columns with a unique value % less than 10% will be returned as `PooledArray`, while other string columns will be normal string vectors. As mentioned, when the `pool` argument is a single `Bool` or `Float64`, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a `Bool` or `Float64`. Similar to the [types](@ref types) argument, providing a vector to `pool` should have an element for each column in the data input, while a dict argument can map column number/name to `Bool` or `Float64` for specific columns. Unspecified columns will not be pooled when the argument is a dict.
+Argument that controls whether columns will be returned as `PooledArray`s. Can be provided as a `Bool`, `Float64`, `Integer`, vector of `Bool` or number, dict mapping column number/name to `Bool` or number, or a function of the form `(i, name) -> Union{Bool, Real, Nothing}`. As a `Bool`, controls absolutely whether a column will be pooled or not; if passed as a single `Bool` argument like `pool=true`, then all string columns will be pooled, regardless of cardinality. When passed as a `Float64`, the value should be between `0.0` and `1.0` to indicate the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if `pool=0.1`, then all string columns with a unique value % less than 10% will be returned as `PooledArray`, while other string columns will be normal string vectors. If `pool` is provided as a number greater than `1.0`, it will be treated as an upper limit on the # of unique values allowed to pool the column. For example, `pool=500` means if a String column has less than or equal to 500 unique values, it will be pooled, otherwise, it won't. As mentioned, when the `pool` argument is a single `Bool` or number, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a `Bool` or `Float64`. Similar to the [types](@ref types) argument, providing a vector to `pool` should have an element for each column in the data input, while a dict argument can map column number/name to `Bool` or `Float64` for specific columns. Unspecified columns will not be pooled when the argument is a dict.
```
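For illustration, a few usage examples of the `pool` forms this doc text describes ("data.csv" is a hypothetical file):

```julia
using CSV

CSV.File("data.csv"; pool=true)  # pool all string columns unconditionally
CSV.File("data.csv"; pool=0.1)   # pool string columns with < 10% unique values
CSV.File("data.csv"; pool=500)   # pool string columns with <= 500 unique values
# per-column control via a dict; unspecified columns are not pooled
CSV.File("data.csv"; pool=Dict(1 => true, 2 => 0.2))
```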
I am OK with overloading `pool`, but my concern is clarity of the rules. We need to be very precise here, I think; the description is a bit confusing. Are you making a decision based on type or only on value? For safety I would use the following rules:

- `Bool` - as you propose
- subtype of `AbstractFloat` - then require it to be in the `[0.0, 1.0]` interval and treat it as a fraction
- subtype of `Union{Signed, Unsigned}` and a positive number - then treat it as an absolute value

This in particular resolves the ambiguity if someone passes the literal `1` - is this 100%, or pooling only if there is exactly one unique value? Following the current description it would be treated as 100%, but following what I propose the distinction would be: `1.0` - 100%, and `1` - exactly one value.
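A minimal sketch of this type-based dispatch (a hypothetical `checkpooled` helper, not CSV.jl API; note that in Julia `Bool <: Integer` but `Bool` is not a subtype of `Signed` or `Unsigned`, so these methods don't collide):

```julia
# `pool` decision given the # of unique values and total # of rows
checkpooled(pool::Bool, nuniq::Int, nrows::Int) = pool
function checkpooled(pool::AbstractFloat, nuniq::Int, nrows::Int)
    0.0 <= pool <= 1.0 || throw(ArgumentError("float `pool` must lie in [0.0, 1.0]"))
    return nuniq / nrows < pool        # value is a fraction of unique values
end
function checkpooled(pool::Union{Signed, Unsigned}, nuniq::Int, nrows::Int)
    pool > 0 || throw(ArgumentError("integer `pool` must be positive"))
    return nuniq <= pool               # value is an absolute # of unique values
end
```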
Having thought more about it, the problem with `pool=500` by default is that if you are reading a small data frame with fewer than 500 rows, it would always be pooled - which I am not sure is desirable. In the end maybe we need the same as in `isapprox`: `atol` and `rtol`. In order to keep the `pool` kwarg unchanged and to ensure backward compatibility, maybe we can allow passing a `Tuple` like `(0.2, 500)`?
Yes, I think that's a good distinction to clarify.
@bkamins, so in the implementation, I actually check that the total # of rows parsed is greater than the `pool` value (if an absolute `pool` value is given), and if not, we don't pool. So by default, we would need a file with more than 500 rows, but fewer than 500 unique values, to be pooled.
In the `(0.2, 500)` case, we would check that both conditions are met to pool? i.e. the % of unique values would have to be < 20% AND there would have to be < 500 total unique values? I think that makes sense to me.
> we would check that both conditions are met to pool?

Yes.

> So by default, we would need a file with more than 500 rows, but fewer than 500 unique values, to be pooled.

This is only a half-measure: if you have 600 rows you would still pool, and probably you should not.
Ok, implemented allowing `pool::Tuple{Float64, Int}` with a default of `pool=(0.2, 500)`. I really like that solution; it feels very clean implementation-wise.
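For clarity, a minimal sketch of the agreed tuple semantics (a hypothetical helper, not the actual CSV.jl code): both the relative and the absolute condition must hold for a column to pool.

```julia
# (rtol, atol)-style check: fraction of unique values AND absolute count
function checkpooled(pool::Tuple{Float64, Int}, nuniq::Int, nrows::Int)
    rtol, atol = pool
    return nuniq / nrows <= rtol && nuniq <= atol
end

checkpooled((0.2, 500), 90, 1_000)    # true: 9% unique and 90 <= 500
checkpooled((0.2, 500), 600, 10_000)  # false: 6% unique but 600 > 500
```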
Thank you. Also, I think it should be easy enough to explain in the docs, as many people are used to relative and absolute tolerances; e.g., when doing optimization there is exactly the same issue.