Overhaul how column pooling works while parsing #962

Merged: quinnj merged 2 commits into main from jq/pooloverhaul on Jan 15, 2022

Conversation

@quinnj (Member) commented Jan 14, 2022

Fixes #950. Ok, so there's a lot going on here; let me try to summarize
the motivation and what the changes are doing:

  • In 0.8 releases, `CSV.File` would parse the 1st 100 rows of a file,
    then use the # of unique values and the provided `pool` argument to
    determine whether a column should be pooled or not.
  • In 0.9, this was changed, since considering just the 1st 100 rows
    tends not to be very representative of large files in terms of the #
    of unique values. Instead, we parsed the entire file, and afterwards
    did the calculation for whether the column should be pooled.
  • Additionally, in multithreaded parsing, each thread parsed its own
    chunk of the file, assuming a column would be pooled. After parsing,
    we needed to "synchronize" the ref codes that would be used by a
    pooled column. This ended up being an expensive operation: recoding
    each chunk of a column for the whole file means processing and
    reprocessing the unique values of each chunk (a rough sketch of that
    recoding step follows this list). This was notably expensive when the
    # of unique values was large (10s of thousands of values), as reported
    in Slow CSV reading when using multiple threads (#950).
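
To make that cost concrete, here is a rough, hypothetical sketch of the kind of per-chunk recoding the old multithreaded path had to do; `merge_chunk_pools!` is an illustrative name, not the actual `syncrefs!` code:

```julia
# Illustrative sketch only: each thread builds its own chunk-local ref pool,
# and after parsing every chunk's ref codes must be remapped into one merged
# pool. This touches every row again and re-hashes values chunk by chunk.
function merge_chunk_pools!(chunkrefs::Vector{Vector{UInt32}},
                            chunkpools::Vector{Dict{String, UInt32}})
    merged = Dict{String, UInt32}()
    for (refs, pool) in zip(chunkrefs, chunkpools)
        # invert the chunk-local pool: code -> value
        inv = Vector{String}(undef, length(pool))
        for (val, code) in pool
            inv[code] = val
        end
        # recode every element of this chunk against the merged pool
        for i in eachindex(refs)
            refs[i] = get!(merged, inv[refs[i]], UInt32(length(merged) + 1))
        end
    end
    return merged
end
```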

That provides the context for why we needed to change, so what is
changing here:

  • We previously tried to assume a column would be pooled before
    parsing and build up the "pool dict" while parsing. This ended up
    complicating a lot of code: additional code to build the dict,
    complications because the element type doesn't match the column type
    (`UInt32` for the ref array, `T` for the pool dict key type), and
    complications for all the promotion machinery (needing to
    promote/convert pool values instead of ref values). That's in addition
    to the very complex `syncrefs!` routine, which has been the source of
    more than a few bugs.
  • Proposed in this PR is to ignore pooling until post-parsing, where
    we'll start checking unique values and do the calculation for whether
    a column should be pooled or not.
  • Also introduced in this PR is allowing the `pool` keyword argument
    to be a number greater than `1.0`, which will be interpreted as an
    absolute upper limit on the # of unique values allowed before a column
    switches from pooled to non-pooled. This seems to make sense for
    larger files for several reasons: using a % can result in a really
    large # of unique values being allowed, which can be a performance
    bottleneck (though this is still allowed in this PR), and it provides
    a way for us to "short-circuit" the pooling check post-parsing. If the
    pool of unique values gets larger than the upper limit threshold, we
    can immediately abort building an expensive pool with potentially lots
    of unique values. We update the `pool` keyword argument default to
    `500`. Another note is that a file must have a number of rows greater
    than the `pool` upper limit in order to pool; otherwise, the column
    won't be pooled. (A rough sketch of this post-parsing check follows
    this list.)
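
As a rough illustration of the post-parsing check described above (`willpool` is a hypothetical name; the actual internals may differ):

```julia
# Hypothetical sketch of the post-parsing pooling decision; `pool` here is
# either a Bool, a fraction in 0.0..1.0, or an absolute limit greater than 1.
function willpool(column::Vector{String}, pool::Real)
    pool === true && return true
    pool === false && return false
    nrows = length(column)
    if pool > 1.0
        # absolute limit: require more rows than the limit, and short-circuit
        # as soon as the set of unique values would exceed the limit
        nrows > pool || return false
        seen = Set{String}()
        for x in column
            push!(seen, x)
            length(seen) > pool && return false
        end
        return true
    else
        # fraction: pool when the % of unique values is under the threshold
        return length(Set(column)) / nrows < pool
    end
end
```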

One consideration, and potential optimization, is that if
`stringtype=String`, or a column otherwise has a type of `String`, we'll
individually allocate each row's `String` during parsing, instead of
getting the natural "interning" benefit we got with the pooling
strategies of 0.8 or of this package prior to this PR. During various
rounds of benchmarking ever since before 0.8, one thing that has
continued to surface is how efficient allocating individual strings
actually is. So while we could potentially get a little efficiency by
interning string columns while parsing, it's actually a smaller benefit
than people might guess. The memory saved by interning gets partly used
up by having to keep the intern pool around, and the extra cost of
hashing to check the intern pool can be comparable to just allocating a
smallish string and setting it in our parsed array. The other
consideration is multithreaded parsing: the interning benefit isn't as
strong when each thread has to maintain its own intern pool (not as
many values to intern against). We could perhaps explore using a
lock-free or spin-lock based dict/set for interning globally for a
column, which may end up providing enough performance benefit (a rough
sketch of the idea follows below). If we do manage to figure out
efficient interning, we could leverage the intern pool post-parsing
when considering whether a column should be pooled or not.
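
If we did explore a shared intern pool per column, a simple lock-based variant might look like the sketch below; `InternPool` and `intern!` are hypothetical names, and this uses a plain lock rather than the lock-free/spin-lock structure mentioned above:

```julia
# Sketch of a shared, lock-protected intern pool per column; a real version
# would likely want a lock-free or spin-lock structure to reduce contention.
struct InternPool
    lock::ReentrantLock
    values::Dict{String, String}
end
InternPool() = InternPool(ReentrantLock(), Dict{String, String}())

# Return the canonical instance of `s`, inserting it if it hasn't been seen.
function intern!(pool::InternPool, s::String)
    lock(pool.lock) do
        # get! returns the stored (canonical) instance if `s` was seen before
        get!(pool.values, s, s)
    end
end
```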

codecov bot commented Jan 14, 2022

Codecov Report

Merging #962 (edf5609) into main (482a187) will decrease coverage by 0.32%.
The diff coverage is 91.42%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #962      +/-   ##
==========================================
- Coverage   90.28%   89.95%   -0.33%     
==========================================
  Files           9        9              
  Lines        2306     2211      -95     
==========================================
- Hits         2082     1989      -93     
+ Misses        224      222       -2     
Impacted Files   | Coverage Δ
-----------------|----------------------------
src/CSV.jl       | 80.00% <ø> (ø)
src/chunks.jl    | 91.30% <ø> (ø)
src/rows.jl      | 83.76% <ø> (+0.34%) ⬆️
src/utils.jl     | 85.98% <86.66%> (-0.43%) ⬇️
src/file.jl      | 95.19% <92.59%> (-0.50%) ⬇️
src/context.jl   | 88.81% <100.00%> (+0.16%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 482a187...edf5609.

@@ -192,7 +192,7 @@ A `Dict{Type, Type}` argument that allows replacing a non-`String` standard type

## [`pool`](@id pool)

- Argument that controls whether columns will be returned as `PooledArray`s. Can be provided as a `Bool`, `Float64`, vector of `Bool` or `Float64`, dict mapping column number/name to `Bool` or `Float64`, or a function of the form `(i, name) -> Union{Bool, Float64, Nothing}`. As a `Bool`, controls absolutely whether a column will be pooled or not; if passed as a single `Bool` argument like `pool=true`, then all string columns will be pooled, regardless of cardinality. When passed as a `Float64`, the value should be between `0.0` and `1.0` indicating the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if `pool=0.1`, then all string columns with a unique value % less than 10% will be returned as `PooledArray`, while other string columns will be normal string vectors. As mentioned, when the `pool` argument is a single `Bool` or `Float64`, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a `Bool` or `Float64`. Similar to the [types](@ref types) argument, providing a vector to `pool` should have an element for each column in the data input, while a dict argument can map column number/name to `Bool` or `Float64` for specific columns. Unspecified columns will not be pooled when the argument is a dict.
+ Argument that controls whether columns will be returned as `PooledArray`s. Can be provided as a `Bool`, `Float64`, `Integer`, vector of `Bool` or number, dict mapping column number/name to `Bool` or number, or a function of the form `(i, name) -> Union{Bool, Real, Nothing}`. As a `Bool`, controls absolutely whether a column will be pooled or not; if passed as a single `Bool` argument like `pool=true`, then all string columns will be pooled, regardless of cardinality. When passed as a `Float64`, the value should be between `0.0` and `1.0` to indicate the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if `pool=0.1`, then all string columns with a unique value % less than 10% will be returned as `PooledArray`, while other string columns will be normal string vectors. If `pool` is provided as a number greater than `1.0`, it will be treated as an upper limit on the # of unique values allowed to pool the column. For example, `pool=500` means if a String column has less than or equal to 500 unique values, it will be pooled, otherwise, it won't. As mentioned, when the `pool` argument is a single `Bool` or number, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a `Bool` or `Float64`. Similar to the [types](@ref types) argument, providing a vector to `pool` should have an element for each column in the data input, while a dict argument can map column number/name to `Bool` or `Float64` for specific columns. Unspecified columns will not be pooled when the argument is a dict.
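
For reference, the forms described in the updated paragraph would be used roughly like this; the file name and the column keys in the dict are placeholders:

```julia
using CSV

CSV.File("data.csv"; pool=true)  # pool every string column, regardless of cardinality
CSV.File("data.csv"; pool=0.1)   # pool string columns whose % of unique values is under 10%
CSV.File("data.csv"; pool=500)   # pool string columns with at most 500 unique values
CSV.File("data.csv"; pool=Dict(1 => true, 2 => 0.1))  # per-column control by column number
```
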
Member

I am OK with overloading `pool`, but my concern is clarity of the rules. We need to be very precise here, I think. The description is a bit confusing. Are you making the decision based on type or only on value? For safety I would use the following rules:

  • `Bool` - as you propose
  • subtype of `AbstractFloat` - then require it to be in the `[0.0, 1.0]` interval and treat it as a fraction
  • subtype of `Union{Signed, Unsigned}` and a positive number - then treat it as an absolute value

This in particular resolves the ambiguity if someone passes a literal `1`: is this 100%, or pooling only if there is exactly one unique value? Following the current description it would be treated as 100%, but following what I propose, `1.0` would mean 100% and `1` would mean exactly one value.
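
To make the proposed rules concrete, a minimal sketch of the value interpretation; `interpret_pool` is a hypothetical name used only for illustration:

```julia
# Hypothetical sketch of the proposed interpretation rules for `pool`.
function interpret_pool(pool)
    if pool isa Bool
        return pool ? (:always, nothing) : (:never, nothing)
    elseif pool isa AbstractFloat
        0.0 <= pool <= 1.0 || throw(ArgumentError("float `pool` must be in [0.0, 1.0]"))
        return (:fraction, pool)   # e.g. 1.0 means up to 100% unique values
    elseif pool isa Union{Signed, Unsigned}
        pool > 0 || throw(ArgumentError("integer `pool` must be positive"))
        return (:absolute, pool)   # e.g. 1 means at most one unique value
    else
        throw(ArgumentError("unsupported `pool` value: $pool"))
    end
end
```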

Member

Having thought more about it, the problem with `pool=500` by default is that if you are reading a small data frame with fewer than 500 rows, it would always be pooled, which I am not sure is desirable.

In the end maybe we need the same approach as in `isapprox`: `atol` and `rtol`. In order to keep the `pool` kwarg unchanged and to ensure backward compatibility, maybe we can allow passing a `Tuple` like `(0.2, 500)`?

Member Author

Yes, I think that's a good distinction to clarify.

Member Author

@bkamins, so in the implementation, I actually check that the total # of rows parsed is greater than the `pool` value (if an absolute `pool` value is given), and if not, we don't pool. So by default, we would need a file with more than 500 rows, but fewer than 500 unique values, for a column to be pooled.

Member Author

In the `(0.2, 500)` case, we would check that both conditions are met to pool? I.e. the # of unique values would have to be < 20% AND there would have to be < 500 total unique values? I think that makes sense to me.

Member

> we would check that both conditions are met to pool?

Yes

> So by default, we would need a file larger than 500 rows, but less than 500 unique values to be pooled.

This is only a half-measure: if you have 600 rows you would pool, and probably you should not.

Member Author

Ok, implemented allowing `pool::Tuple{Float64, Int}` with a default of `pool=(0.2, 500)`. I really like that solution; it feels very clean implementation-wise.
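
A minimal sketch of what the combined check discussed here amounts to; `willpool` is an illustrative name, not the exact implementation:

```julia
# Both conditions must hold for a column to be pooled: the fraction of unique
# values is under the relative threshold AND the count is under the absolute cap.
function willpool(nunique::Int, nrows::Int, pool::Tuple{Float64, Int})
    rel, cap = pool
    return nunique / nrows <= rel && nunique <= cap
end

willpool(300, 10_000, (0.2, 500))  # true:  3% unique, under the 500 cap
willpool(300, 1_000, (0.2, 500))   # false: 30% unique exceeds the 20% threshold
```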

Member

Thank you. Also, I think it should be easy enough to explain in the docs, as many people are used to relative and absolute tolerances; e.g. when doing optimization there is exactly the same issue.

@quinnj quinnj merged commit cde550c into main Jan 15, 2022
@quinnj quinnj deleted the jq/pooloverhaul branch January 15, 2022 04:53
Development

Successfully merging this pull request may close these issues: Slow CSV reading when using multiple threads (#950)

2 participants