Allow pool keyword arg to be Tuple{Float64, Int}

JuliaData · Jan 14, 2022 · edf5609
1 parent f1c29f3
Showing 9 changed files with 36 additions and 25 deletions.
diff --git a/docs/src/examples.md b/docs/src/examples.md
@@ -631,7 +631,7 @@ using CSV
 
 # In this file, we have an `id` column and a `code` column. There can be advantages with various DataFrame/table operations 
 # like joining and grouping when `String` values are "pooled", meaning each unique value is mapped to a `UInt32`. By default, 
-# `pool=500`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide 
+# `pool=(0.2, 500)`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide 
 # greater control: `pool=0.4` means that if 40% or less of a column's values are unique, then it will be pooled.
 data = """
 id,code
@@ -668,15 +668,15 @@ file = CSV.File(IOBuffer(data); pool=Dict(1 => true))
 file = CSV.File(IOBuffer(data); pool=[true, false])
 ```
 
-## [Pool absolute threshold](@id pool_absolute_threshold)
+## [Pool with absolute threshold](@id pool_absolute_threshold)
 
 ```julia
 using CSV
 
 # In this file, we have an `id` column and a `code` column. There can be advantages with various DataFrame/table operations 
 # like joining and grouping when `String` values are "pooled", meaning each unique value is mapped to a `UInt32`. By default, 
-# `pool=500`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide 
-# greater control: `pool=2` means that if a column has 2 or fewer unique values, then it will be pooled.
+# `pool=(0.2, 500)`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide 
+# greater control: `pool=(0.5, 2)` means that if a column has 2 or fewer unique values _and_ the total number of unique values is less than 50% of all values, then it will be pooled.
 data = """
 id,code
 A18E9,AT
@@ -686,5 +686,5 @@ BF392,GC
 8CD2E,GC
 """
 
-file = CSV.File(IOBuffer(data); pool=2)
+file = CSV.File(IOBuffer(data); pool=(0.5, 2))
 ```
diff --git a/docs/src/reading.md b/docs/src/reading.md
@@ -192,11 +192,12 @@ A `Dict{Type, Type}` argument that allows replacing a non-`String` standard type
 
 ## [`pool`](@id pool)
 
-Argument that controls whether columns will be returned as `PooledArray`s. Can be provided as a `Bool`, `Float64`, `Integer`, vector of `Bool` or number, dict mapping column number/name to `Bool` or number, or a function of the form `(i, name) -> Union{Bool, Real, Nothing}`. As a `Bool`, controls absolutely whether a column will be pooled or not; if passed as a single `Bool` argument like `pool=true`, then all string columns will be pooled, regardless of cardinality. When passed as a `Float64`, the value should be between `0.0` and `1.0` to indicate the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if `pool=0.1`, then all string columns with a unique value % less than 10% will be returned as `PooledArray`, while other string columns will be normal string vectors. If `pool` is provided as a number greater than `1.0`, it will be treated as an upper limit on the # of unique values allowed to pool the column. For example, `pool=500` means if a String column has less than or equal to 500 unique values, it will be pooled, otherwise, it won't. As mentioned, when the `pool` argument is a single `Bool` or number, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a `Bool` or `Float64`. Similar to the [types](@ref types) argument, providing a vector to `pool` should have an element for each column in the data input, while a dict argument can map column number/name to `Bool` or `Float64` for specific columns. Unspecified columns will not be pooled when the argument is a dict.
+Argument that controls whether columns will be returned as `PooledArray`s. Can be provided as a `Bool`, `Float64`, `Tuple{Float64, Int}`, vector, dict, or a function of the form `(i, name) -> Union{Bool, Real, Tuple{Float64, Int}, Nothing}`. As a `Bool`, controls absolutely whether a column will be pooled or not; if passed as a single `Bool` argument like `pool=true`, then all string columns will be pooled, regardless of cardinality. When passed as a `Float64`, the value should be between `0.0` and `1.0` to indicate the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if `pool=0.1`, then all string columns with a unique value % less than 10% will be returned as `PooledArray`, while other string columns will be normal string vectors. If `pool` is provided as a tuple, like `(0.2, 500)`, the first tuple element is the same as a single `Float64` value, which represents the % cardinality allowed. The second tuple element is an upper limit on the # of unique values allowed to pool the column. For example, `pool=(0.2, 500)` means that a string column is pooled only if it has 500 or fewer unique values _and_ the # of unique values is less than 20% of the total # of values. As mentioned, when the `pool` argument is a single `Bool`, `Real`, or `Tuple{Float64, Int}`, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a `Bool`, `Float64`, or `Tuple{Float64, Int}`. Similar to the [types](@ref types) argument, providing a vector to `pool` should have an element for each column in the data input, while a dict argument can map column number/name to `Bool`, `Float64`, or `Tuple{Float64, Int}` for specific columns. Unspecified columns will not be pooled when the argument is a dict.
 
 ### Examples
   * [Pooled values](@ref pool_example)
   * [Non-string column pooling](@ref nonstring_pool_example)
+  * [Pool with absolute threshold](@ref pool_absolute_threshold)
 
 ## [`downcast`](@id downcast)
 

diff --git a/src/CSV.jl b/src/CSV.jl
@@ -56,7 +56,7 @@ Base.showerror(io::IO, e::Error) = println(io, e.msg)
 
 # constants
 const DEFAULT_STRINGTYPE = InlineString
-const DEFAULT_POOL = 500
+const DEFAULT_POOL = (0.2, 500)
 const DEFAULT_ROWS_TO_CHECK = 30
 const DEFAULT_MAX_WARNINGS = 100
 const DEFAULT_MAX_INLINE_STRING_LENGTH = 32

diff --git a/src/README.md b/src/README.md
@@ -22,9 +22,11 @@ By providing the `pool` keyword argument, users can control how this optimizatio
 
 Valid inputs for `pool` include:
   * A `Bool`, `true` or `false`, which will apply to all string columns parsed; string columns either will _all_ be pooled, or _all_ not pooled
-  * A `Real`, which will be converted to `Float64`, which should be a value between `0.0` and `1.0`, to indicate the % cardinality threshold _under which_ a column will be pooled. e.g. by passing `pool=0.1`, if a column has less than 10% unique values, it will end up as a `PooledArray`, otherwise a normal array. Like the `Bool` argument, this will apply the same % threshold to only/all string columns. If the number is greater than 1, then it will be treated as an upper limit on the # of unique values allowed, under which the column will be pooled, over which will be a normal array.
-  * An `AbstractVector`, where the # of elements should/needs to match the # of columns in the dataset. Each element of the `pool` argument should be a `Bool` or `Real` indicating the pooling behavior for each specific column.
-  * An `AbstractDict`, with keys as `String`s, `Symbol`s, or `Int`s referring to column names or indices, and values in the `AbstractDict` being `Bool` or `Real` to again signal how specific columns should be pooled
+  * A `Real`, which will be converted to `Float64`, which should be a value between `0.0` and `1.0`, to indicate the % cardinality threshold _under which_ a column will be pooled. e.g. by passing `pool=0.1`, if a column has less than 10% unique values, it will end up as a `PooledArray`, otherwise a normal array. Like the `Bool` argument, this will apply the same % threshold to only/all string columns.
+  * A `Tuple{Float64, Int}`, where the 1st element is the same percent threshold on cardinality as above, while the 2nd element is an absolute upper limit on the # of unique values. This is useful for large datasets, where a threshold like 0.2 alone would allow pooled columns with thousands of unique values; it's helpful performance-wise to add an upper limit, as in `pool=(0.2, 500)`, to ensure no pooled column will have more than 500 unique values.
+  * An `AbstractVector`, where the # of elements should/needs to match the # of columns in the dataset. Each element of the `pool` argument should be a `Bool`, `Real`, or `Tuple{Float64, Int}` indicating the pooling behavior for each specific column.
+  * An `AbstractDict`, with keys as `String`s, `Symbol`s, or `Int`s referring to column names or indices, and values in the `AbstractDict` being `Bool`, `Real`, or `Tuple{Float64, Int}` to again signal how specific columns should be pooled
+  * A function of the form `(i, nm) -> Union{Bool, Real, Tuple{Float64, Int}}` where it takes the column index and name as two arguments, and returns one of the first 3 possible pool values from the above list.
 
 For the implementation of pooling:
   * We normalize however the keyword argument was provided to have a `pool` value per column while parsing

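The motivation for the absolute cap in the README text above can be made concrete with a little arithmetic. The sketch below is not part of the commit: `max_unique_allowed` is a hypothetical helper, and it ignores the `- 1` offset that `checkpooled!` applies, so treat the numbers as approximate.

```julia
# Hypothetical helper, for illustration only: roughly how many unique values
# a column may contain and still be pooled under each form of `pool`.
max_unique_allowed(nrows::Integer, pool::Float64) = floor(Int, pool * nrows)

# With a tuple, the percent threshold and the absolute cap must both hold,
# so the smaller of the two bounds wins.
max_unique_allowed(nrows::Integer, pool::Tuple{Float64, Int}) =
    min(floor(Int, pool[1] * nrows), pool[2])

for nrows in (1_000, 1_000_000)
    println(nrows, " rows: pool=0.2 allows about ",
            max_unique_allowed(nrows, 0.2),
            " unique values; pool=(0.2, 500) allows about ",
            max_unique_allowed(nrows, (0.2, 500)))
end
```

For small inputs the percent threshold is the binding constraint; for large inputs the cap takes over, which is the situation the new default `(0.2, 500)` appears to target.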
diff --git a/src/chunks.jl b/src/chunks.jl
@@ -64,7 +64,7 @@ function Chunks(source::ValidSources;
     type=nothing,
     types=nothing,
     typemap::Dict=Dict{Type, Type}(),
-    pool::Union{Bool, Real, AbstractVector, AbstractDict}=DEFAULT_POOL,
+    pool::Union{Bool, Real, AbstractVector, AbstractDict, Base.Callable, Tuple}=DEFAULT_POOL,
     downcast::Bool=false,
     lazystrings::Bool=false,
     stringtype::StringTypes=DEFAULT_STRINGTYPE,

diff --git a/src/context.jl b/src/context.jl
@@ -19,7 +19,7 @@ mutable struct Column
     anymissing::Bool
     userprovidedtype::Bool
     willdrop::Bool
-    pool::Float64
+    pool::Union{Float64, Tuple{Float64, Int}}
     columnspecificpool::Bool
     # lazily/manually initialized fields
     column::AbstractVector
@@ -29,7 +29,7 @@ mutable struct Column
     endposition::Int
     options::Parsers.Options
 
-    Column(type::Type, anymissing::Bool, userprovidedtype::Bool, willdrop::Bool, pool::Float64, columnspecificpool::Bool) =
+    Column(type::Type, anymissing::Bool, userprovidedtype::Bool, willdrop::Bool, pool::Union{Float64, Tuple{Float64, Int}}, columnspecificpool::Bool) =
         new(type, anymissing, userprovidedtype, willdrop, pool, columnspecificpool)
 end
 
@@ -104,7 +104,7 @@ struct Context
     datarow::Int
     options::Parsers.Options
     columns::Vector{Column}
-    pool::Float64
+    pool::Union{Float64, Tuple{Float64, Int}}
     downcast::Bool
     customtypes::Type
     typemap::Dict{Type, Type}
@@ -217,7 +217,7 @@ end
     type::Union{Nothing, Type},
     types::Union{Nothing, Type, AbstractVector, AbstractDict, Function},
     typemap::Dict,
-    pool::Union{Bool, Real, AbstractVector, AbstractDict, Base.Callable},
+    pool::Union{Bool, Real, AbstractVector, AbstractDict, Base.Callable, Tuple},
     downcast::Bool,
     lazystrings::Bool,
     stringtype::StringTypes,

diff --git a/src/file.jl b/src/file.jl
@@ -458,6 +458,7 @@ function checkpooled!(::Type{T}, pertaskcolumns, col, j, ntasks, nrows, ctx) whe
     lastref = Ref{UInt32}(0)
     refs = Vector{UInt32}(undef, nrows)
     k = 1
+    limit = col.pool isa Tuple ? col.pool[2] : typemax(Int)
     for i = 1:ntasks
         column = (pertaskcolumns === nothing ? col.column : pertaskcolumns[i][j].column)::columntype(S)
         for x in column
@@ -494,15 +495,13 @@ function checkpooled!(::Type{T}, pertaskcolumns, col, j, ntasks, nrows, ctx) whe
                 end
             end
             k += 1
-            if col.pool > 1.0 && nrows > col.pool && length(pool) > col.pool
+            if length(pool) > limit
                 return false
             end
         end
     end
-    if col.pool <= 1.0 && ((length(pool) - 1) / nrows) <= col.pool
-        col.column = PooledArray(PooledArrays.RefArray(refs), pool)
-        return true
-    elseif col.pool > 1.0 && nrows > col.pool && length(pool) <= col.pool
+    percent = col.pool isa Tuple ? col.pool[1] : col.pool
+    if ((length(pool) - 1) / nrows) <= percent
         col.column = PooledArray(PooledArrays.RefArray(refs), pool)
         return true
     else

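The two checks in the `checkpooled!` hunk above (an early exit once the number of unique values exceeds the cap, then a percent test at the end) can be sketched in isolation. This is a simplified illustration, not the package's code: `should_pool` is a hypothetical name, a plain `Set` stands in for the real ref/pool bookkeeping, and the `(n - 1) / nrows <= percent` comparison is copied from the diff as-is.

```julia
# Sketch only: `should_pool` is a hypothetical name, not a CSV.jl function.
function should_pool(values::AbstractVector,
                     pool::Union{Float64, Tuple{Float64, Int}})
    # A bare Float64 means "no absolute cap", mirroring `typemax(Int)` in the diff.
    limit = pool isa Tuple ? pool[2] : typemax(Int)
    percent = pool isa Tuple ? pool[1] : pool

    seen = Set{eltype(values)}()
    for x in values
        push!(seen, x)
        # Early exit: stop scanning as soon as the cap is exceeded.
        length(seen) > limit && return false
    end

    # Final percent test, written the same way as in the diff.
    return ((length(seen) - 1) / length(values)) <= percent
end

codes = ["AT", "GC", "GC", "GC"]
println(should_pool(codes, (0.5, 2)))                 # 2 unique values, within both bounds
println(should_pool(["a", "b", "c", "d"], (0.5, 2)))  # third unique value exceeds the cap
println(should_pool(["a", "b", "c", "d"], 0.2))       # no cap, but too many unique values
```

The early exit is what makes the cap a performance feature: a high-cardinality column is rejected after `limit + 1` distinct values rather than after a full pass.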
diff --git a/src/keyworddocs.jl b/src/keyworddocs.jl
@@ -34,7 +34,7 @@ const KEYWORD_DOCS = """
 
   * `types`: a single `Type`, `AbstractVector` or `AbstractDict` of types, or a function of the form `(i, name) -> Union{T, Nothing}` to be used for column types; if a single `Type` is provided, _all_ columns will be parsed with that single type; an `AbstractDict` can map column index `Integer`, or name `Symbol` or `String` to type for a column, i.e. `Dict(1=>Float64)` will set the first column as a `Float64`, `Dict(:column1=>Float64)` will set the column named `column1` to `Float64` and, `Dict("column1"=>Float64)` will set the `column1` to `Float64`; if a `Vector` is provided, it must match the # of columns provided or detected in `header`. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or `nothing` to signal the column's type should be detected while parsing.
   * `typemap::Dict{Type, Type}`: a mapping of a type that should be replaced in every instance with another type, i.e. `Dict(Float64=>String)` would change every detected `Float64` column to be parsed as `String`; only "standard" types are allowed to be mapped to another type, i.e. `Int64`, `Float64`, `Date`, `DateTime`, `Time`, and `Bool`. If a column of one of those types is "detected", it will be mapped to the specified type.
-  * `pool::Union{Bool, Real, AbstractVector, AbstractDict, Function}=$DEFAULT_POOL`: [not supported by `CSV.Rows`] controls whether columns will be built as `PooledArray`; if `true`, all columns detected as `String` will be pooled; alternatively, the proportion of unique values below which `String` columns should be pooled (meaning that if the # of unique strings in a column is under 25%, `pool=0.25`, it will be pooled). If provided as a positive number > 1, it represents the upper threshold for the # of unique values, under which the column will be pooled; this is the default (`pool=$DEFAULT_POOL`). If an `AbstractVector`, each element should be `Bool` or `Real` and the # of elements should match the # of columns in the dataset; if an `AbstractDict`, a `Bool` or `Real` value can be provided for individual columns where the dict key is given as column index `Integer`, or column name as `Symbol` or `String`. If a function is provided, it should take a column index and name as 2 arguments, and return a `Bool`, `Real`, or `nothing` for each column.
+  * `pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=$DEFAULT_POOL`: [not supported by `CSV.Rows`] controls whether columns will be built as `PooledArray`; if `true`, all columns detected as `String` will be pooled; alternatively, the proportion of unique values below which `String` columns should be pooled (meaning that if the # of unique strings in a column is under 25%, `pool=0.25`, it will be pooled). If provided as a `Tuple{Float64, Int}` like `(0.2, 500)`, it represents the percent cardinality threshold as the 1st tuple element (`0.2`), and an upper limit for the # of unique values (`500`), under which the column will be pooled; this is the default (`pool=$DEFAULT_POOL`). If an `AbstractVector`, each element should be `Bool`, `Real`, or `Tuple{Float64, Int}` and the # of elements should match the # of columns in the dataset; if an `AbstractDict`, a `Bool`, `Real`, or `Tuple{Float64, Int}` value can be provided for individual columns where the dict key is given as column index `Integer`, or column name as `Symbol` or `String`. If a function is provided, it should take a column index and name as 2 arguments, and return a `Bool`, `Real`, `Tuple{Float64, Int}`, or `nothing` for each column.
   * `downcast::Bool=false`: controls whether columns detected as `Int64` will be "downcast" to the smallest possible integer type like `Int8`, `Int16`, `Int32`, etc.
   * `stringtype=$DEFAULT_STRINGTYPE`: controls how detected string columns will ultimately be returned; default is `InlineString`, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to `String`. If `String` is passed, all string columns will just be normal `String` values. If `PosLenString` is passed, string columns will be returned as `PosLenStringVector`, which is a special "lazy" `AbstractVector` that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of `PosLenStringVector` makes it read-only, so operations like `push!`, `append!`, or `setindex!` are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
   * `strict::Bool=false`: whether invalid values should throw a parsing error or be replaced with `missing`

diff --git a/src/utils.jl b/src/utils.jl
@@ -22,15 +22,24 @@ finaltype(::Type{HardMissing}) = Missing
 finaltype(::Type{NeedsTypeDetection}) = Missing
 coltype(col) = ifelse(col.anymissing, Union{finaltype(col.type), Missing}, finaltype(col.type))
 
-pooled(col) = col.pool == 1.0
-maybepooled(col) = col.pool > 0.0
+maybepooled(col) = col.pool isa Tuple ? (col.pool[1] > 0.0) : (col.pool > 0.0)
 
-function getpool(x::Real)::Float64
+function getpool(x)::Union{Float64, Tuple{Float64, Int}}
     if x isa Bool
         return x ? 1.0 : 0.0
+    elseif x isa Tuple
+        y = Float64(x[1])
+        (isnan(y) || 0.0 <= y <= 1.0) || throw(ArgumentError("pool tuple 1st argument must be in the range: 0.0 <= x <= 1.0"))
+        try
+            z = Int(x[2])
+            @assert z > 0
+            return (y, z)
+        catch
+            throw(ArgumentError("pool tuple 2nd argument must be a positive integer > 0"))
+        end
     else
         y = Float64(x)
-        (isnan(y) || 0.0 <= y) || throw(ArgumentError("pool argument must be in the range: 0.0 <= x <= 1.0 or a positive integer > 1"))
+        (isnan(y) || 0.0 <= y <= 1.0) || throw(ArgumentError("pool argument must be in the range: 0.0 <= x <= 1.0"))
         return y
     end
 end
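The normalization in `getpool` above has one user-visible consequence worth spelling out: a bare number greater than `1.0`, such as the old default `pool=500`, is now rejected, and the absolute limit has to be passed as the second tuple element. The standalone sketch below illustrates that contract. `normalize_pool` and `check_percent` are hypothetical names, and the sketch deliberately differs from the diff in two ways: it validates the tuple's second element with an explicit check rather than `@assert` inside `try`/`catch`, and it only accepts integer values there rather than anything `Int(...)` can convert.

```julia
# Hypothetical stand-ins for the normalization step; not the package's code.
function check_percent(x)::Float64
    y = Float64(x)
    (isnan(y) || 0.0 <= y <= 1.0) ||
        throw(ArgumentError("pool threshold must be in the range 0.0 <= x <= 1.0"))
    return y
end

function normalize_pool(x)::Union{Float64, Tuple{Float64, Int}}
    if x isa Bool
        # `true` pools everything, `false` pools nothing.
        return x ? 1.0 : 0.0
    elseif x isa Tuple
        length(x) == 2 ||
            throw(ArgumentError("pool tuple must have exactly 2 elements"))
        percent = check_percent(x[1])
        limit = x[2]
        # `Bool` is an `Integer` in Julia, so exclude it explicitly.
        (limit isa Integer && !(limit isa Bool) && limit > 0) ||
            throw(ArgumentError("pool tuple 2nd element must be a positive integer"))
        return (percent, Int(limit))
    else
        return check_percent(x)
    end
end

println(normalize_pool(true))
println(normalize_pool(0.4))
println(normalize_pool((0.2, 500)))

# The pre-commit spelling of an absolute limit is no longer a valid value.
try
    normalize_pool(500)
catch err
    println("rejected: ", err.msg)
end
```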