Overhaul how column pooling works while parsing #962

Merged · 2 commits · Jan 15, 2022
Changes from 1 commit
25 changes: 23 additions & 2 deletions docs/src/examples.md
@@ -630,8 +630,8 @@ file = CSV.File(IOBuffer(data); types=Dict(:zipcode => String))
using CSV

# In this file, we have an `id` column and a `code` column. There can be advantages with various DataFrame/table operations
# like joining and grouping when `String` values are "pooled", meaning each unique value is mapped to a `UInt64`. By default,
# `pool=0.1`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide
# like joining and grouping when `String` values are "pooled", meaning each unique value is mapped to a `UInt32`. By default,
# `pool=500`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide
# greater control: `pool=0.4` means that if 40% or less of a column's values are unique, then it will be pooled.
data = """
id,code
@@ -666,4 +666,25 @@ category,amount

file = CSV.File(IOBuffer(data); pool=Dict(1 => true))
file = CSV.File(IOBuffer(data); pool=[true, false])
```

## [Pool absolute threshold](@id pool_absolute_threshold)

```julia
using CSV

# In this file, we have an `id` column and a `code` column. There can be advantages with various DataFrame/table operations
# like joining and grouping when `String` values are "pooled", meaning each unique value is mapped to a `UInt32`. By default,
# `pool=500`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide
# greater control: `pool=2` means that if a column has 2 or fewer unique values, then it will be pooled.
data = """
id,code
A18E9,AT
BF392,GC
93EBC,AT
54EE1,AT
8CD2E,GC
"""

file = CSV.File(IOBuffer(data); pool=2)
```
2 changes: 1 addition & 1 deletion docs/src/reading.md
@@ -192,7 +192,7 @@ A `Dict{Type, Type}` argument that allows replacing a non-`String` standard type

## [`pool`](@id pool)

Argument that controls whether columns will be returned as `PooledArray`s. Can be provided as a `Bool`, `Float64`, vector of `Bool` or `Float64`, dict mapping column number/name to `Bool` or `Float64`, or a function of the form `(i, name) -> Union{Bool, Float64, Nothing}`. As a `Bool`, controls absolutely whether a column will be pooled or not; if passed as a single `Bool` argument like `pool=true`, then all string columns will be pooled, regardless of cardinality. When passed as a `Float64`, the value should be between `0.0` and `1.0` indicating the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if `pool=0.1`, then all string columns with a unique value % less than 10% will be returned as `PooledArray`, while other string columns will be normal string vectors. As mentioned, when the `pool` argument is a single `Bool` or `Float64`, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a `Bool` or `Float64`. Similar to the [types](@ref types) argument, providing a vector to `pool` should have an element for each column in the data input, while a dict argument can map column number/name to `Bool` or `Float64` for specific columns. Unspecified columns will not be pooled when the argument is a dict.
Argument that controls whether columns will be returned as `PooledArray`s. Can be provided as a `Bool`, `Float64`, `Integer`, vector of `Bool` or number, dict mapping column number/name to `Bool` or number, or a function of the form `(i, name) -> Union{Bool, Real, Nothing}`. As a `Bool`, controls absolutely whether a column will be pooled or not; if passed as a single `Bool` argument like `pool=true`, then all string columns will be pooled, regardless of cardinality. When passed as a `Float64`, the value should be between `0.0` and `1.0` and indicates the threshold on the % of unique values under which a column will be pooled. For example, if `pool=0.1`, then all string columns with a unique value % less than 10% will be returned as `PooledArray`, while other string columns will be normal string vectors. If `pool` is provided as a number greater than `1.0`, it is treated as an upper limit on the # of unique values allowed for a column to be pooled. For example, `pool=500` means a `String` column with 500 or fewer unique values will be pooled; otherwise it won't be. As mentioned, when the `pool` argument is a single `Bool` or number, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a `Bool` or `Float64`. Similar to the [types](@ref types) argument, providing a vector to `pool` should have an element for each column in the data input, while a dict argument can map column number/name to `Bool` or `Float64` for specific columns. Unspecified columns will not be pooled when the argument is a dict.
Member:
I am OK with overloading `pool`, but my concern is clarity of the rules; we need to be very precise here, I think. The description is a bit confusing: are you making a decision based on type or only on value? For safety I would use the following rules:

* `Bool` - as you propose
* subtype of `AbstractFloat` - require it to be in the `[0.0, 1.0]` interval and treat it as a fraction
* subtype of `Union{Signed, Unsigned}` and positive - treat it as an absolute value

This in particular resolves the ambiguity when someone passes a literal `1`: is this 100%, or pooling only if there is exactly one unique value? Under the current description it would be treated as 100%; under what I propose, `1.0` means 100% and `1` means exactly one unique value.
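A sketch of what those rules could look like in code; `interpretpool` is a hypothetical helper for illustration, not CSV.jl API:

```julia
# Hypothetical helper illustrating the proposed rules; not part of CSV.jl.
# Normalizes the `pool` argument to either a Float64 fraction or an Int limit.
interpretpool(pool::Bool) = pool ? 1.0 : 0.0  # pool everything or nothing

function interpretpool(pool::AbstractFloat)
    0.0 <= pool <= 1.0 || throw(ArgumentError("fractional pool must be in [0.0, 1.0]"))
    return Float64(pool)  # fraction of unique values allowed
end

function interpretpool(pool::Union{Signed, Unsigned})
    pool > 0 || throw(ArgumentError("absolute pool must be positive"))
    return Int(pool)  # absolute # of unique values allowed
end

interpretpool(1.0)  # 1.0 => pool at up to 100% unique values
interpretpool(1)    # 1   => pool only if there is exactly one unique value
```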

Member:

Having thought more about it, the problem with `pool=500` by default is that if you are reading a small data frame with fewer than 500 rows, it would always be pooled, which I am not sure is desirable.

In the end, maybe we need the same as in `isapprox`: `atol` and `rtol`. To keep the `pool` kwarg unchanged and ensure backward compatibility, maybe we could allow passing a `Tuple` like `(0.2, 500)`?

Member Author:

Yes, I think that's a good distinction to clarify.

Member Author:

@bkamins, so in the implementation, I actually check that the total # of rows parsed is greater than the `pool` value (when an absolute pool value is given), and if not, we don't pool. So by default, we would need a file larger than 500 rows, but with fewer than 500 unique values, to be pooled.
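A minimal sketch of that guard, with a hypothetical `poolcolumn` helper (the actual check lives inside the parsing internals):

```julia
# Hypothetical illustration: an absolute `pool` limit only kicks in once
# more rows than the limit itself have been parsed.
function poolcolumn(nrows::Int, nunique::Int, pool::Int)
    nrows > pool || return false  # small file: don't pool
    return nunique <= pool        # low cardinality: pool
end

poolcolumn(10_000, 120, 500)  # true: many rows, few unique values
poolcolumn(300, 120, 500)     # false: fewer rows than the limit
```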

Member Author:

In the `(0.2, 500)` case, we would check that both conditions are met to pool? i.e. the % of unique values would have to be < 20% AND there would have to be < 500 total unique values? I think that makes sense to me.

Member:

> we would check that both conditions are met to pool?

Yes.

> So by default, we would need a file larger than 500 rows, but with fewer than 500 unique values, to be pooled.

This is only a half-measure: if you have 600 rows you would still pool, and probably you should not.

Member Author:

Ok, implemented allowing `pool::Tuple{Float64, Int}` with a default of `pool=(0.2, 500)`. I really like that solution; it feels very clean implementation-wise.
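A simplified sketch of the tuple semantics settled on above; exact `<` vs `<=` boundaries follow the real implementation, not this illustration:

```julia
# Simplified illustration of the (relative, absolute) pooling rule:
# pool a column only if BOTH the fraction of unique values and the
# absolute number of unique values fall within their thresholds.
function shouldpool(column::AbstractVector, pool::Tuple{Float64, Int})
    frac, limit = pool
    nunique = length(unique(column))
    return nunique <= limit && nunique / length(column) <= frac
end

shouldpool(repeat(["a", "b"], 1000), (0.2, 500))  # true: 2 unique values, 0.1% unique
shouldpool(string.(1:1000), (0.2, 500))           # false: every value is unique
```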

Member:

Thank you. Also, I think it should be easy enough to explain in the docs, as many people are used to relative and absolute tolerances; e.g. when doing optimization there is exactly the same issue.


### Examples
* [Pooled values](@ref pool_example)
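
For concreteness, a minimal usage sketch of the forms described above (the data here is made up for illustration):

```julia
using CSV

data = """
name,dept
alice,eng
bob,eng
carol,ops
"""

CSV.File(IOBuffer(data); pool=true)           # pool every string column, regardless of cardinality
CSV.File(IOBuffer(data); pool=0.1)            # pool string columns with < 10% unique values
CSV.File(IOBuffer(data); pool=500)            # pool string columns with <= 500 unique values
CSV.File(IOBuffer(data); pool=[false, true])  # per column, by position (one element per column)
CSV.File(IOBuffer(data); pool=Dict(2 => 0.5)) # per column, by column number; unspecified columns aren't pooled
CSV.File(IOBuffer(data); pool=(i, name) -> i == 2 ? true : nothing)  # per column, via function
```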
2 changes: 1 addition & 1 deletion src/CSV.jl
@@ -56,7 +56,7 @@ Base.showerror(io::IO, e::Error) = println(io, e.msg)

# constants
const DEFAULT_STRINGTYPE = InlineString
const DEFAULT_POOL = 0.25
const DEFAULT_POOL = 500
const DEFAULT_ROWS_TO_CHECK = 30
const DEFAULT_MAX_WARNINGS = 100
const DEFAULT_MAX_INLINE_STRING_LENGTH = 32
7 changes: 2 additions & 5 deletions src/README.md
@@ -22,14 +22,11 @@ By providing the `pool` keyword argument, users can control how this optimization

Valid inputs for `pool` include:
* A `Bool`, `true` or `false`, which will apply to all string columns parsed; string columns either will _all_ be pooled, or _all_ not pooled
* A `Real`, which will be converted to `Float64`, which should be a value between `0.0` and `1.0`, to indicate the % cardinality threshold _under which_ a column will be pooled. e.g. by passing `pool=0.1`, if a column has less than 10% unique values, it will end up as a `PooledArray`, otherwise a normal array. Like the `Bool` argument, this will apply the same % threshold to only/all string columns
* A `Real`, which will be converted to `Float64`: a value between `0.0` and `1.0` indicates the % cardinality threshold _under which_ a column will be pooled, e.g. by passing `pool=0.1`, if a column has less than 10% unique values, it will end up as a `PooledArray`, otherwise a normal array. Like the `Bool` argument, this applies the same % threshold to only/all string columns. If the number is greater than 1, it is treated as an upper limit on the # of unique values allowed: at or under that limit the column will be pooled, over it the column will be a normal array.
* An `AbstractVector`, where the # of elements should/needs to match the # of columns in the dataset. Each element of the `pool` argument should be a `Bool` or `Real` indicating the pooling behavior for each specific column.
* An `AbstractDict`, with keys as `String`s, `Symbol`s, or `Int`s referring to column names or indices, and values in the `AbstractDict` being `Bool` or `Real` to again signal how specific columns should be pooled

For the implementation of pooling:
* We normalize however the keyword argument was provided to have a `pool` value per column while parsing
* We also have a `pool` field on the `Context` structure so that if columns are widened while parsing, they will take on this value
* For multithreaded parsing, we decide if a column will be pooled or not from the type sampling stage; if a column has a `pool` value of `1.0`, it will _always_ be pooled (as requested), if it has `0.0` it will _not_ be pooled, and if `0.0 < pool < 1.0` then we'll calculate whether or not it will be pooled from the sampled values (see the sketch after this list). As noted above, a `pool` value of `NaN` will also be considered if a column had `String` values sampled and meets the default threshold. Currently, we'll sample `rows_to_check * ntasks` values per column, both configurable via keyword arguments, with defaults `rows_to_check=30` and `ntasks=Threads.nthreads()`. From those samples, we'll calculate the # of unique values divided by the total # of values sampled and compare it with the `pool` value to determine whether the column will ultimately be pooled. The ramification here is that we may "get it wrong" in two different ways: 1) based on the values sampled, we may determine that a column _will_ be pooled even though the total # of unique values we parse ends up over the `pool` threshold, and 2) we may determine a column _shouldn't_ be pooled because of seemingly high cardinality, even though the total # of unique values for the column is ultimately _lower_ than the `pool` threshold. The only way to do things perfectly is to check _all_ the values of the entire column, but in multithreaded parsing, that would be expensive relative to the simpler sampling method. The other unique piece of multithreaded parsing is that we form a column's initial `refpool` field from the sampled values, which individual task-parsing columns will then start with when parsing their local chunks. Two tricky implementation details involved with sampling and pooling are: 1) a column's values may be promoted to a different type _while_ sampling or post-sampling while parsing, and 2) the whole columnset parsed may be widened. For the first case, we take all sampled values and promote to the "widest" type, then when building a potential refpool, only consider values that "are" (i.e. `val isa type`) of the promoted type. That may seem like the obvious course of action, but consider that we may detect a value like `2021-01-01` as a `Date` object while the column type is promoted to `String`; in that case, the parsed `Date(2021, 1, 1)` object _will not_ be in the initial refpool, since `!(Date(2021, 1, 1) isa String)`. For the second tricky case, the columnset isn't widened while type sampling, so "extra" columns are just ignored. The columnset _will_ be widened by each local parsing task that detects the extra columns, and those extra columns will be synchronized/promoted post-parsing as needed. These "widened" columns will only ever be pooled if the user passed `pool=true`, meaning _every_ column for the whole file should be pooled.
* In the single-threaded case, we take a simpler approach by pooling any column with `pool` value `0.0 < pool <= 1.0`, meaning even if we're not totally sure the column will be pooled, we'll pool it while parsing, then decide post-parsing whether the column should actually be pooled or unpooled.
* One of the effects of these approaches to pooling in single vs. multithreading code is that we'll never change _whether a column is pooled_ while parsing; it's either decided post-sampling (multithreaded) or post-parsing (single-threaded).
* A couple of other notes on things we have to account for in the multithreaded case that make it a little more complicated. We have to synchronize ref values between local parsing tasks, because each task encounters unique values in a different order, so different tasks may have the same values mapped to different ref values. We also have to account for the fact that different parsing tasks may have promoted to different types, so we may need to promote the pooled values.
* Once column parsing is done, the cardinality is checked against the individual column pool value and whether the column should be pooled or not is computed.
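
To make the sampling-based decision described above concrete, a stripped-down sketch; `willpool` is hypothetical, and the real code additionally handles type promotion, chunk-local refs, and locking:

```julia
# Hypothetical, simplified version of the multithreaded pooling decision:
# sample `rows_to_check * ntasks` values of a column, then compare the
# sampled unique-value fraction against the column's `pool` value.
function willpool(sample::AbstractVector, pool::Float64)
    pool == 1.0 && return true   # always pool, as requested
    pool == 0.0 && return false  # never pool
    return length(unique(sample)) / length(sample) <= pool
end

sample = rand(["NY", "CA", "TX"], 30 * 4)  # e.g. rows_to_check=30, ntasks=4
willpool(sample, 0.25)                     # true: ~3 unique values in 120 samples
```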
22 changes: 0 additions & 22 deletions src/context.jl
@@ -1,19 +1,3 @@
# a RefPool holds our refs as a Dict, along with a lastref field which is incremented when a new ref is found while parsing pooled columns
mutable struct RefPool
# what? why ::Any here? well, we want flexibility in what kind of refs we stick in here
# it might be Dict{Union{String, Missing}, UInt32}, but it might be some other string type
# or it might not allow `missing`; in short, there are too many options to try and type
# the field concretely; luckily, working with the `refs` field here is limited to
# a very few specific methods, and we'll always have the column type, so we just need
# to make sure we assert the concrete type before using refs
refs::Any
lastref::UInt32
end

# start lastref at 1, since it's reserved for `missing`, so first ref value will be 2
const Refs{T} = Dict{Union{T, Missing}, UInt32}
RefPool(::Type{T}=String) where {T} = RefPool(Refs{T}(), 1)

"""
Internal structure used to track information for a single column in a delimited file.

@@ -25,7 +9,6 @@ Fields:
* `pool`: computed from `pool` keyword argument; `true` is `1.0`, `false` is `0.0`, everything else is `Float64(pool)`; once computed, this field isn't mutated at all while parsing; it's used in type detection to determine whether a column will be pooled or not once a type is detected;
* `columnspecificpool`: if `pool` was provided via a Vector or Dict by the user, then `true`, otherwise `false`; if `false`, then only string column types will attempt pooling
* `column`: the actual column vector to hold parsed values; field is typed as `AbstractVector` and while parsing, we do switches on `col.type` to assert the column type to make code concretely typed
* `refpool`: if the column is pooled (or might be pooled in single-threaded case), this is the column-specific `RefPool` used to track unique parsed values and their `UInt32` ref codes
* `lock`: in multithreaded parsing, we have a top-level set of `Vector{Column}`, then each threaded parsing task makes its own copy to parse its own chunk; when synchronizing column types/pooled refs, the task-local `Column` will `lock(col.lock)` to make changes to the parent `Column`; each task-local `Column` shares the same `lock` of the top-level `Column`
* `position`: for transposed reading, the current column position
* `endposition`: for transposed reading, the expected ending position for this column
Expand All @@ -40,7 +23,6 @@ mutable struct Column
columnspecificpool::Bool
# lazily/manually initialized fields
column::AbstractVector
refpool::RefPool
# per top-level column fields (don't need to copy per task when parsing)
lock::ReentrantLock
position::Int
@@ -71,10 +53,6 @@ function Column(x::Column)
if isdefined(x, :options)
y.options = x.options
end
if isdefined(x, :refpool)
# if parent has refpool from sampling, make a copy
y.refpool = RefPool(copy(x.refpool.refs), x.refpool.lastref)
end
# specifically _don't_ copy/re-use x.column; that needs to be allocated fresh per parsing task
return y
end