Overhaul how column pooling works while parsing #962

Merged · 2 commits · Jan 15, 2022
Changes from 1 commit
25 changes: 23 additions & 2 deletions docs/src/examples.md
@@ -630,8 +630,8 @@ file = CSV.File(IOBuffer(data); types=Dict(:zipcode => String))
using CSV

# In this file, we have an `id` column and a `code` column. There can be advantages with various DataFrame/table operations
# like joining and grouping when `String` values are "pooled", meaning each unique value is mapped to a `UInt64`. By default,
# `pool=0.1`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide
# like joining and grouping when `String` values are "pooled", meaning each unique value is mapped to a `UInt32`. By default,
# `pool=500`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide
# greater control: `pool=0.4` means that if 40% or less of a column's values are unique, then it will be pooled.
data = """
id,code
@@ -666,4 +666,25 @@ category,amount

file = CSV.File(IOBuffer(data); pool=Dict(1 => true))
file = CSV.File(IOBuffer(data); pool=[true, false])
```

## [Pool absolute threshold](@id pool_absolute_threshold)

```julia
using CSV

# In this file, we have an `id` column and a `code` column. There can be advantages with various DataFrame/table operations
# like joining and grouping when `String` values are "pooled", meaning each unique value is mapped to a `UInt32`. By default,
# `pool=500`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide
# greater control: `pool=2` means that if a column has 2 or fewer unique values, then it will be pooled.
data = """
id,code
A18E9,AT
BF392,GC
93EBC,AT
54EE1,AT
8CD2E,GC
"""

file = CSV.File(IOBuffer(data); pool=2)
```
2 changes: 1 addition & 1 deletion docs/src/reading.md
@@ -192,7 +192,7 @@ A `Dict{Type, Type}` argument that allows replacing a non-`String` standard type

## [`pool`](@id pool)

Argument that controls whether columns will be returned as `PooledArray`s. Can be provided as a `Bool`, `Float64`, vector of `Bool` or `Float64`, dict mapping column number/name to `Bool` or `Float64`, or a function of the form `(i, name) -> Union{Bool, Float64, Nothing}`. As a `Bool`, controls absolutely whether a column will be pooled or not; if passed as a single `Bool` argument like `pool=true`, then all string columns will be pooled, regardless of cardinality. When passed as a `Float64`, the value should be between `0.0` and `1.0` indicating the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if `pool=0.1`, then all string columns with a unique value % less than 10% will be returned as `PooledArray`, while other string columns will be normal string vectors. As mentioned, when the `pool` argument is a single `Bool` or `Float64`, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a `Bool` or `Float64`. Similar to the [types](@ref types) argument, providing a vector to `pool` should have an element for each column in the data input, while a dict argument can map column number/name to `Bool` or `Float64` for specific columns. Unspecified columns will not be pooled when the argument is a dict.
Argument that controls whether columns will be returned as `PooledArray`s. Can be provided as a `Bool`, `Float64`, `Integer`, vector of `Bool` or number, dict mapping column number/name to `Bool` or number, or a function of the form `(i, name) -> Union{Bool, Real, Nothing}`. As a `Bool`, controls absolutely whether a column will be pooled or not; if passed as a single `Bool` argument like `pool=true`, then all string columns will be pooled, regardless of cardinality. When passed as a `Float64`, the value should be between `0.0` and `1.0` and indicates the threshold on the % of unique values under which a column will be pooled. For example, if `pool=0.1`, then all string columns with a unique value % less than 10% will be returned as `PooledArray`, while other string columns will be normal string vectors. If `pool` is provided as a number greater than `1.0`, it is treated as an upper limit on the # of unique values allowed for a column to be pooled. For example, `pool=500` means a `String` column with 500 or fewer unique values will be pooled; otherwise it won't be. As mentioned, when the `pool` argument is a single `Bool` or number, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a `Bool` or `Float64`. Similar to the [types](@ref types) argument, providing a vector to `pool` should have an element for each column in the data input, while a dict argument can map column number/name to `Bool` or `Float64` for specific columns. Unspecified columns will not be pooled when the argument is a dict.
Member:
I am OK with overloading `pool`, but my concern is clarity of the rules; we need to be very precise here, I think. The description is a bit confusing: are you making a decision based on type or only on value? For safety I would use the following rules:

* `Bool` - as you propose
* subtype of `AbstractFloat` - require it to be in the `[0.0, 1.0]` interval and treat it as a fraction
* subtype of `Union{Signed, Unsigned}` and positive - treat it as an absolute value

This in particular resolves the ambiguity when someone passes a literal `1`: is this 100%, or pooling only if there is exactly one unique value? Under the current description it would be treated as 100%; under what I propose, `1.0` means 100% and `1` means exactly one unique value.
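A sketch of what those rules could look like in code; `interpretpool` is a hypothetical helper for illustration, not CSV.jl API:

```julia
# Hypothetical helper illustrating the proposed rules; not part of CSV.jl.
# Normalizes the `pool` argument to either a Float64 fraction or an Int limit.
interpretpool(pool::Bool) = pool ? 1.0 : 0.0  # pool everything or nothing

function interpretpool(pool::AbstractFloat)
    0.0 <= pool <= 1.0 || throw(ArgumentError("fractional pool must be in [0.0, 1.0]"))
    return Float64(pool)  # fraction of unique values allowed
end

function interpretpool(pool::Union{Signed, Unsigned})
    pool > 0 || throw(ArgumentError("absolute pool must be positive"))
    return Int(pool)  # absolute # of unique values allowed
end

interpretpool(1.0)  # 1.0 => pool at up to 100% unique values
interpretpool(1)    # 1   => pool only if there is exactly one unique value
```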

Member:

Having thought more about it, the problem with `pool=500` by default is that if you are reading a small data frame with fewer than 500 rows, it would always be pooled, which I am not sure is desirable.

In the end, maybe we need the same as in `isapprox`: `atol` and `rtol`. To keep the `pool` kwarg unchanged and ensure backward compatibility, maybe we could allow passing a `Tuple` like `(0.2, 500)`?

Member Author:

Yes, I think that's a good distinction to clarify.

Member Author:

@bkamins, so in the implementation, I actually check that the total # of rows parsed is greater than the `pool` value (when an absolute pool value is given), and if not, we don't pool. So by default, we would need a file larger than 500 rows, but with fewer than 500 unique values, to be pooled.
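A minimal sketch of that guard, with a hypothetical `poolcolumn` helper (the actual check lives inside the parsing internals):

```julia
# Hypothetical illustration: an absolute `pool` limit only kicks in once
# more rows than the limit itself have been parsed.
function poolcolumn(nrows::Int, nunique::Int, pool::Int)
    nrows > pool || return false  # small file: don't pool
    return nunique <= pool        # low cardinality: pool
end

poolcolumn(10_000, 120, 500)  # true: many rows, few unique values
poolcolumn(300, 120, 500)     # false: fewer rows than the limit
```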

Member Author:

In the `(0.2, 500)` case, we would check that both conditions are met to pool? i.e. the % of unique values would have to be < 20% AND there would have to be < 500 total unique values? I think that makes sense to me.

Member:

> we would check that both conditions are met to pool?

Yes.

> So by default, we would need a file larger than 500 rows, but with fewer than 500 unique values, to be pooled.

This is only a half-measure: if you have 600 rows you would still pool, and probably you should not.

Member Author:

Ok, implemented allowing `pool::Tuple{Float64, Int}` with a default of `pool=(0.2, 500)`. I really like that solution; it feels very clean implementation-wise.
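A simplified sketch of the tuple semantics settled on above; exact `<` vs `<=` boundaries follow the real implementation, not this illustration:

```julia
# Simplified illustration of the (relative, absolute) pooling rule:
# pool a column only if BOTH the fraction of unique values and the
# absolute number of unique values fall within their thresholds.
function shouldpool(column::AbstractVector, pool::Tuple{Float64, Int})
    frac, limit = pool
    nunique = length(unique(column))
    return nunique <= limit && nunique / length(column) <= frac
end

shouldpool(repeat(["a", "b"], 1000), (0.2, 500))  # true: 2 unique values, 0.1% unique
shouldpool(string.(1:1000), (0.2, 500))           # false: every value is unique
```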

Member:

Thank you. Also, I think it should be easy enough to explain in the docs, as many people are used to relative and absolute tolerances; e.g. when doing optimization there is exactly the same issue.


### Examples
* [Pooled values](@ref pool_example)
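
For concreteness, a minimal usage sketch of the forms described above (the data here is made up for illustration):

```julia
using CSV

data = """
name,dept
alice,eng
bob,eng
carol,ops
"""

CSV.File(IOBuffer(data); pool=true)           # pool every string column, regardless of cardinality
CSV.File(IOBuffer(data); pool=0.1)            # pool string columns with < 10% unique values
CSV.File(IOBuffer(data); pool=500)            # pool string columns with <= 500 unique values
CSV.File(IOBuffer(data); pool=[false, true])  # per column, by position (one element per column)
CSV.File(IOBuffer(data); pool=Dict(2 => 0.5)) # per column, by column number; unspecified columns aren't pooled
CSV.File(IOBuffer(data); pool=(i, name) -> i == 2 ? true : nothing)  # per column, via function
```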
2 changes: 1 addition & 1 deletion src/CSV.jl
@@ -56,7 +56,7 @@ Base.showerror(io::IO, e::Error) = println(io, e.msg)

# constants
const DEFAULT_STRINGTYPE = InlineString
const DEFAULT_POOL = 0.25
const DEFAULT_POOL = 500
const DEFAULT_ROWS_TO_CHECK = 30
const DEFAULT_MAX_WARNINGS = 100
const DEFAULT_MAX_INLINE_STRING_LENGTH = 32
7 changes: 2 additions & 5 deletions src/README.md
@@ -22,14 +22,11 @@ By providing the `pool` keyword argument, users can control how this optimization

Valid inputs for `pool` include:
* A `Bool`, `true` or `false`, which will apply to all string columns parsed; string columns either will _all_ be pooled, or _all_ not pooled
* A `Real`, which will be converted to `Float64`, which should be a value between `0.0` and `1.0`, to indicate the % cardinality threshold _under which_ a column will be pooled. e.g. by passing `pool=0.1`, if a column has less than 10% unique values, it will end up as a `PooledArray`, otherwise a normal array. Like the `Bool` argument, this will apply the same % threshold to only/all string columns
* A `Real`, which will be converted to `Float64`: a value between `0.0` and `1.0` indicates the % cardinality threshold _under which_ a column will be pooled, e.g. by passing `pool=0.1`, if a column has less than 10% unique values, it will end up as a `PooledArray`, otherwise a normal array. Like the `Bool` argument, this applies the same % threshold to only/all string columns. If the number is greater than 1, it is treated as an upper limit on the # of unique values allowed: at or under that limit the column will be pooled, over it the column will be a normal array.
* An `AbstractVector`, where the # of elements should/needs to match the # of columns in the dataset. Each element of the `pool` argument should be a `Bool` or `Real` indicating the pooling behavior for each specific column.
* An `AbstractDict`, with keys as `String`s, `Symbol`s, or `Int`s referring to column names or indices, and values in the `AbstractDict` being `Bool` or `Real` to again signal how specific columns should be pooled

For the implementation of pooling:
* We normalize however the keyword argument was provided to have a `pool` value per column while parsing
* We also have a `pool` field on the `Context` structure so that if columns are widened while parsing, they will take on this value
* For multithreaded parsing, we decide if a column will be pooled or not from the type sampling stage; if a column has a `pool` value of `1.0`, it will _always_ be pooled (as requested), if it has `0.0` it will _not_ be pooled, and if `0.0 < pool < 1.0` then we'll calculate whether or not it will be pooled from the sampled values (see the sketch after this list). As noted above, a `pool` value of `NaN` will also be considered if a column had `String` values sampled and meets the default threshold. Currently, we'll sample `rows_to_check * ntasks` values per column, both configurable via keyword arguments, with defaults `rows_to_check=30` and `ntasks=Threads.nthreads()`. From those samples, we'll calculate the # of unique values divided by the total # of values sampled and compare it with the `pool` value to determine whether the column will ultimately be pooled. The ramification here is that we may "get it wrong" in two different ways: 1) based on the values sampled, we may determine that a column _will_ be pooled even though the total # of unique values we parse ends up over the `pool` threshold, and 2) we may determine a column _shouldn't_ be pooled because of seemingly high cardinality, even though the total # of unique values for the column is ultimately _lower_ than the `pool` threshold. The only way to do things perfectly is to check _all_ the values of the entire column, but in multithreaded parsing, that would be expensive relative to the simpler sampling method. The other unique piece of multithreaded parsing is that we form a column's initial `refpool` field from the sampled values, which individual task-parsing columns will then start with when parsing their local chunks. Two tricky implementation details involved with sampling and pooling are: 1) a column's values may be promoted to a different type _while_ sampling or post-sampling while parsing, and 2) the whole columnset parsed may be widened. For the first case, we take all sampled values and promote to the "widest" type, then when building a potential refpool, only consider values that "are" (i.e. `val isa type`) of the promoted type. That may seem like the obvious course of action, but consider that we may detect a value like `2021-01-01` as a `Date` object while the column type is promoted to `String`; in that case, the parsed `Date(2021, 1, 1)` object _will not_ be in the initial refpool, since `!(Date(2021, 1, 1) isa String)`. For the second tricky case, the columnset isn't widened while type sampling, so "extra" columns are just ignored. The columnset _will_ be widened by each local parsing task that detects the extra columns, and those extra columns will be synchronized/promoted post-parsing as needed. These "widened" columns will only ever be pooled if the user passed `pool=true`, meaning _every_ column for the whole file should be pooled.
* In the single-threaded case, we take a simpler approach by pooling any column with `pool` value `0.0 < pool <= 1.0`, meaning even if we're not totally sure the column will be pooled, we'll pool it while parsing, then decide post-parsing whether the column should actually be pooled or unpooled.
* One of the effects of these approaches to pooling in single vs. multithreading code is that we'll never change _whether a column is pooled_ while parsing; it's either decided post-sampling (multithreaded) or post-parsing (single-threaded).
* A couple of other notes on things we have to account for in the multithreaded case that make it a little more complicated. We have to synchronize ref values between local parsing tasks, because each task encounters unique values in a different order, so different tasks may have the same values mapped to different ref values. We also have to account for the fact that different parsing tasks may have promoted to different types, so we may need to promote the pooled values.
* Once column parsing is done, the cardinality is checked against the individual column pool value and whether the column should be pooled or not is computed.
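
To make the sampling-based decision described above concrete, a stripped-down sketch; `willpool` is hypothetical, and the real code additionally handles type promotion, chunk-local refs, and locking:

```julia
# Hypothetical, simplified version of the multithreaded pooling decision:
# sample `rows_to_check * ntasks` values of a column, then compare the
# sampled unique-value fraction against the column's `pool` value.
function willpool(sample::AbstractVector, pool::Float64)
    pool == 1.0 && return true   # always pool, as requested
    pool == 0.0 && return false  # never pool
    return length(unique(sample)) / length(sample) <= pool
end

sample = rand(["NY", "CA", "TX"], 30 * 4)  # e.g. rows_to_check=30, ntasks=4
willpool(sample, 0.25)                     # true: ~3 unique values in 120 samples
```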
22 changes: 0 additions & 22 deletions src/context.jl
@@ -1,19 +1,3 @@
# a RefPool holds our refs as a Dict, along with a lastref field which is incremented when a new ref is found while parsing pooled columns
mutable struct RefPool
# what? why ::Any here? well, we want flexibility in what kind of refs we stick in here
# it might be Dict{Union{String, Missing}, UInt32}, but it might be some other string type
# or it might not allow `missing`; in short, there are too many options to try and type
# the field concretely; luckily, working with the `refs` field here is limited to
# a very few specific methods, and we'll always have the column type, so we just need
# to make sure we assert the concrete type before using refs
refs::Any
lastref::UInt32
end

# start lastref at 1, since it's reserved for `missing`, so first ref value will be 2
const Refs{T} = Dict{Union{T, Missing}, UInt32}
RefPool(::Type{T}=String) where {T} = RefPool(Refs{T}(), 1)

"""
Internal structure used to track information for a single column in a delimited file.

@@ -25,7 +9,6 @@ Fields:
* `pool`: computed from `pool` keyword argument; `true` is `1.0`, `false` is `0.0`, everything else is `Float64(pool)`; once computed, this field isn't mutated at all while parsing; it's used in type detection to determine whether a column will be pooled or not once a type is detected;
* `columnspecificpool`: if `pool` was provided via a Vector or Dict by the user, then `true`, otherwise `false`; if `false`, then only string column types will attempt pooling
* `column`: the actual column vector to hold parsed values; field is typed as `AbstractVector` and while parsing, we do switches on `col.type` to assert the column type to make code concretely typed
* `refpool`: if the column is pooled (or might be pooled in single-threaded case), this is the column-specific `RefPool` used to track unique parsed values and their `UInt32` ref codes
* `lock`: in multithreaded parsing, we have a top-level set of `Vector{Column}`, then each threaded parsing task makes its own copy to parse its own chunk; when synchronizing column types/pooled refs, the task-local `Column` will `lock(col.lock)` to make changes to the parent `Column`; each task-local `Column` shares the same `lock` of the top-level `Column`
* `position`: for transposed reading, the current column position
* `endposition`: for transposed reading, the expected ending position for this column
Expand All @@ -40,7 +23,6 @@ mutable struct Column
columnspecificpool::Bool
# lazily/manually initialized fields
column::AbstractVector
refpool::RefPool
# per top-level column fields (don't need to copy per task when parsing)
lock::ReentrantLock
position::Int
@@ -71,10 +53,6 @@ function Column(x::Column)
if isdefined(x, :options)
y.options = x.options
end
if isdefined(x, :refpool)
# if parent has refpool from sampling, make a copy
y.refpool = RefPool(copy(x.refpool.refs), x.refpool.lastref)
end
# specifically _don't_ copy/re-use x.column; that needs to be allocated fresh per parsing task
return y
end