Allow pool keyword arg to be Tuple{Float64, Int}
quinnj committed Jan 14, 2022
1 parent f1c29f3 commit edf5609
Showing 9 changed files with 36 additions and 25 deletions.
10 changes: 5 additions & 5 deletions docs/src/examples.md
@@ -631,7 +631,7 @@ using CSV

# In this file, we have an `id` column and a `code` column. There can be advantages with various DataFrame/table operations
# like joining and grouping when `String` values are "pooled", meaning each unique value is mapped to a `UInt32`. By default,
-# `pool=500`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide
+# `pool=(0.2, 500)`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide
# greater control: `pool=0.4` means that if 40% or less of a column's values are unique, then it will be pooled.
data = """
id,code
@@ -668,15 +668,15 @@ file = CSV.File(IOBuffer(data); pool=Dict(1 => true))
file = CSV.File(IOBuffer(data); pool=[true, false])
```

-## [Pool absolute threshold](@id pool_absolute_threshold)
+## [Pool with absolute threshold](@id pool_absolute_threshold)

```julia
using CSV

# In this file, we have an `id` column and a `code` column. There can be advantages with various DataFrame/table operations
# like joining and grouping when `String` values are "pooled", meaning each unique value is mapped to a `UInt32`. By default,
-# `pool=500`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide
-# greater control: `pool=2` means that if a column has 2 or fewer unique values, then it will be pooled.
+# `pool=(0.2, 500)`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide
+# greater control: `pool=(0.5, 2)` means that if a column has 2 or fewer unique values _and_ the total number of unique values is less than 50% of all values, then it will be pooled.
data = """
id,code
A18E9,AT
@@ -686,5 +686,5 @@ BF392,GC
8CD2E,GC
"""

-file = CSV.File(IOBuffer(data); pool=2)
+file = CSV.File(IOBuffer(data); pool=(0.5, 2))
```
3 changes: 2 additions & 1 deletion docs/src/reading.md
@@ -192,11 +192,12 @@ A `Dict{Type, Type}` argument that allows replacing a non-`String` standard type

## [`pool`](@id pool)

-Argument that controls whether columns will be returned as `PooledArray`s. Can be provided as a `Bool`, `Float64`, `Integer`, vector of `Bool` or number, dict mapping column number/name to `Bool` or number, or a function of the form `(i, name) -> Union{Bool, Real, Nothing}`. As a `Bool`, controls absolutely whether a column will be pooled or not; if passed as a single `Bool` argument like `pool=true`, then all string columns will be pooled, regardless of cardinality. When passed as a `Float64`, the value should be between `0.0` and `1.0` to indicate the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if `pool=0.1`, then all string columns with a unique value % less than 10% will be returned as `PooledArray`, while other string columns will be normal string vectors. If `pool` is provided as a number greater than `1.0`, it will be treated as an upper limit on the # of unique values allowed to pool the column. For example, `pool=500` means if a String column has less than or equal to 500 unique values, it will be pooled, otherwise, it won't. As mentioned, when the `pool` argument is a single `Bool` or number, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a `Bool` or `Float64`. Similar to the [types](@ref types) argument, providing a vector to `pool` should have an element for each column in the data input, while a dict argument can map column number/name to `Bool` or `Float64` for specific columns. Unspecified columns will not be pooled when the argument is a dict.
+Argument that controls whether columns will be returned as `PooledArray`s. Can be provided as a `Bool`, `Float64`, `Tuple{Float64, Int}`, vector, dict, or a function of the form `(i, name) -> Union{Bool, Real, Tuple{Float64, Int}, Nothing}`. As a `Bool`, controls absolutely whether a column will be pooled or not; if passed as a single `Bool` argument like `pool=true`, then all string columns will be pooled, regardless of cardinality. When passed as a `Float64`, the value should be between `0.0` and `1.0` to indicate the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if `pool=0.1`, then all string columns with a unique value % less than 10% will be returned as `PooledArray`, while other string columns will be normal string vectors. If `pool` is provided as a tuple, like `(0.2, 500)`, the first tuple element is the same as a single `Float64` value, which represents the % cardinality allowed. The second tuple element is an upper limit on the # of unique values allowed to pool the column. So the example, `pool=(0.2, 500)` means if a String column has less than or equal to 500 unique values _and_ the # of unique values is less than 20% of total # of values, it will be pooled, otherwise, it won't. As mentioned, when the `pool` argument is a single `Bool`, `Real`, or `Tuple{Float64, Int}`, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a `Bool`, `Float64`, or `Tuple{Float64, Int}`. Similar to the [types](@ref types) argument, providing a vector to `pool` should have an element for each column in the data input, while a dict argument can map column number/name to `Bool`, `Float64`, or `Tuple{Float64, Int}` for specific columns. Unspecified columns will not be pooled when the argument is a dict.

### Examples
* [Pooled values](@ref pool_example)
* [Non-string column pooling](@ref nonstring_pool_example)
+* [Pool with absolute threshold](@ref pool_absolute_threshold)
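
The tuple rule described in the `pool` paragraph can be sketched in plain Julia. This is only an illustration, not CSV.jl's code: `would_pool` is a hypothetical helper, it counts unique values directly, and it uses `<=` for the percent check as the `src/file.jl` hunk of this commit does (the prose says "less than").

```julia
# Hypothetical helper, not part of CSV.jl: decides whether a column would be
# pooled under a `(percent, limit)` tuple.
function would_pool(values::AbstractVector, pool::Tuple{Float64, Int})
    percent, limit = pool
    nunique = length(unique(values))
    # Both conditions must hold: few enough unique values in absolute terms,
    # and a low enough share of unique values relative to the row count.
    return nunique <= limit && nunique / length(values) <= percent
end

codes = ["AT", "GC", "GC", "AT", "GC", "AT"]  # 2 unique values in 6 rows
ids = ["A1", "B2", "C3", "D4", "E5", "F6"]    # 6 unique values in 6 rows

would_pool(codes, (0.5, 2))    # true: 2 <= 2 and 2/6 <= 0.5
would_pool(ids, (0.5, 2))      # false: 6 unique values exceed the limit of 2
would_pool(codes, (0.2, 500))  # false: 2/6 is above the 20% threshold
```

The last call shows why the percent element matters even when the absolute limit is generous: on a tiny input, two unique values already exceed 20% of the rows.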

## [`downcast`](@id downcast)

2 changes: 1 addition & 1 deletion src/CSV.jl
@@ -56,7 +56,7 @@ Base.showerror(io::IO, e::Error) = println(io, e.msg)

# constants
const DEFAULT_STRINGTYPE = InlineString
-const DEFAULT_POOL = 500
+const DEFAULT_POOL = (0.2, 500)
const DEFAULT_ROWS_TO_CHECK = 30
const DEFAULT_MAX_WARNINGS = 100
const DEFAULT_MAX_INLINE_STRING_LENGTH = 32
8 changes: 5 additions & 3 deletions src/README.md
@@ -22,9 +22,11 @@ By providing the `pool` keyword argument, users can control how this optimizatio

Valid inputs for `pool` include:
* A `Bool`, `true` or `false`, which will apply to all string columns parsed; string columns either will _all_ be pooled, or _all_ not pooled
-* A `Real`, which will be converted to `Float64`, which should be a value between `0.0` and `1.0`, to indicate the % cardinality threshold _under which_ a column will be pooled. e.g. by passing `pool=0.1`, if a column has less than 10% unique values, it will end up as a `PooledArray`, otherwise a normal array. Like the `Bool` argument, this will apply the same % threshold to only/all string columns. If the number is greater than 1, then it will be treated as an upper limit on the # of unique values allowed, under which the column will be pooled, over which will be a normal array.
-* An `AbstractVector`, where the # of elements should/needs to match the # of columns in the dataset. Each element of the `pool` argument should be a `Bool` or `Real` indicating the pooling behavior for each specific column.
-* An `AbstractDict`, with keys as `String`s, `Symbol`s, or `Int`s referring to column names or indices, and values in the `AbstractDict` being `Bool` or `Real` to again signal how specific columns should be pooled
+* A `Real`, which will be converted to `Float64`, which should be a value between `0.0` and `1.0`, to indicate the % cardinality threshold _under which_ a column will be pooled. e.g. by passing `pool=0.1`, if a column has less than 10% unique values, it will end up as a `PooledArray`, otherwise a normal array. Like the `Bool` argument, this will apply the same % threshold to only/all string columns.
+* a `Tuple{Float64, Int}`, where the 1st argument is the same as the above percent threshold on cardinality, while the 2nd argument is an absolute upper limit on the # of unique values. This is useful for large datasets where 0.2 may grow to allow pooled columns with thousands of values; it's helpful performance-wise to put an upper limit like `pool=(0.2, 500)` to ensure no pooled column will have more than 500 unique values.
+* An `AbstractVector`, where the # of elements should/needs to match the # of columns in the dataset. Each element of the `pool` argument should be a `Bool`, `Real`, or `Tuple{Float64, Int}` indicating the pooling behavior for each specific column.
+* An `AbstractDict`, with keys as `String`s, `Symbol`s, or `Int`s referring to column names or indices, and values in the `AbstractDict` being `Bool`, `Real`, or `Tuple{Float64, Int}` to again signal how specific columns should be pooled
+* A function of the form `(i, nm) -> Union{Bool, Real, Tuple{Float64, Int}}` where it takes the column index and name as two arguments, and returns one of the first 3 possible pool values from the above list.
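
As a rough illustration of how the input forms listed above could be resolved to one setting per column, here is a minimal sketch. `pool_for_column` and its `default` keyword are hypothetical names, not CSV.jl internals, and treating a `nothing` return from the function form as "use the default" is an assumption; the rule that dict columns left unspecified are not pooled comes from the `pool` documentation in this commit.

```julia
# Hypothetical helper, not CSV.jl's implementation: resolve the `pool` keyword
# to a single setting for column `i` named `name`.
function pool_for_column(pool, i::Integer, name::Symbol; default=(0.2, 500))
    if pool isa AbstractVector
        return pool[i]  # one entry per column
    elseif pool isa AbstractDict
        # Keys may be column indices, Symbols, or Strings; columns that are
        # not mentioned are not pooled.
        for key in (i, name, String(name))
            haskey(pool, key) && return pool[key]
        end
        return false
    elseif pool isa Base.Callable
        setting = pool(i, name)
        # Assumption: `nothing` means "fall back to the default".
        return setting === nothing ? default : setting
    else
        return pool  # a Bool, Real, or Tuple applies to every column
    end
end

pool_for_column(true, 2, :code)                    # same setting for all columns
pool_for_column([false, (0.5, 2)], 2, :code)       # per-column vector entry
pool_for_column(Dict(:code => 0.4), 1, :id)        # column not in the dict
pool_for_column((i, nm) -> nm == :code ? true : nothing, 1, :id)
```

The sketch stops at picking the raw setting; range checks on the chosen value would be a separate step.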

For the implementation of pooling:
* We normalize however the keyword argument was provided to have a `pool` value per column while parsing
2 changes: 1 addition & 1 deletion src/chunks.jl
@@ -64,7 +64,7 @@ function Chunks(source::ValidSources;
type=nothing,
types=nothing,
typemap::Dict=Dict{Type, Type}(),
-pool::Union{Bool, Real, AbstractVector, AbstractDict}=DEFAULT_POOL,
+pool::Union{Bool, Real, AbstractVector, AbstractDict, Base.Callable, Tuple}=DEFAULT_POOL,
downcast::Bool=false,
lazystrings::Bool=false,
stringtype::StringTypes=DEFAULT_STRINGTYPE,
8 changes: 4 additions & 4 deletions src/context.jl
@@ -19,7 +19,7 @@ mutable struct Column
anymissing::Bool
userprovidedtype::Bool
willdrop::Bool
-pool::Float64
+pool::Union{Float64, Tuple{Float64, Int}}
columnspecificpool::Bool
# lazily/manually initialized fields
column::AbstractVector
@@ -29,7 +29,7 @@ mutable struct Column
endposition::Int
options::Parsers.Options

-Column(type::Type, anymissing::Bool, userprovidedtype::Bool, willdrop::Bool, pool::Float64, columnspecificpool::Bool) =
+Column(type::Type, anymissing::Bool, userprovidedtype::Bool, willdrop::Bool, pool::Union{Float64, Tuple{Float64, Int}}, columnspecificpool::Bool) =
new(type, anymissing, userprovidedtype, willdrop, pool, columnspecificpool)
end

@@ -104,7 +104,7 @@ struct Context
datarow::Int
options::Parsers.Options
columns::Vector{Column}
-pool::Float64
+pool::Union{Float64, Tuple{Float64, Int}}
downcast::Bool
customtypes::Type
typemap::Dict{Type, Type}
@@ -217,7 +217,7 @@ end
type::Union{Nothing, Type},
types::Union{Nothing, Type, AbstractVector, AbstractDict, Function},
typemap::Dict,
-pool::Union{Bool, Real, AbstractVector, AbstractDict, Base.Callable},
+pool::Union{Bool, Real, AbstractVector, AbstractDict, Base.Callable, Tuple},
downcast::Bool,
lazystrings::Bool,
stringtype::StringTypes,
9 changes: 4 additions & 5 deletions src/file.jl
@@ -458,6 +458,7 @@ function checkpooled!(::Type{T}, pertaskcolumns, col, j, ntasks, nrows, ctx) whe
lastref = Ref{UInt32}(0)
refs = Vector{UInt32}(undef, nrows)
k = 1
+limit = col.pool isa Tuple ? col.pool[2] : typemax(Int)
for i = 1:ntasks
column = (pertaskcolumns === nothing ? col.column : pertaskcolumns[i][j].column)::columntype(S)
for x in column
@@ -494,15 +495,13 @@ function checkpooled!(::Type{T}, pertaskcolumns, col, j, ntasks, nrows, ctx) whe
end
end
k += 1
-if col.pool > 1.0 && nrows > col.pool && length(pool) > col.pool
+if length(pool) > limit
return false
end
end
end
-if col.pool <= 1.0 && ((length(pool) - 1) / nrows) <= col.pool
-col.column = PooledArray(PooledArrays.RefArray(refs), pool)
-return true
-elseif col.pool > 1.0 && nrows > col.pool && length(pool) <= col.pool
+percent = col.pool isa Tuple ? col.pool[1] : col.pool
+if ((length(pool) - 1) / nrows) <= percent
col.column = PooledArray(PooledArrays.RefArray(refs), pool)
return true
else
2 changes: 1 addition & 1 deletion src/keyworddocs.jl
@@ -34,7 +34,7 @@ const KEYWORD_DOCS = """
* `types`: a single `Type`, `AbstractVector` or `AbstractDict` of types, or a function of the form `(i, name) -> Union{T, Nothing}` to be used for column types; if a single `Type` is provided, _all_ columns will be parsed with that single type; an `AbstractDict` can map column index `Integer`, or name `Symbol` or `String` to type for a column, i.e. `Dict(1=>Float64)` will set the first column as a `Float64`, `Dict(:column1=>Float64)` will set the column named `column1` to `Float64` and, `Dict("column1"=>Float64)` will set the `column1` to `Float64`; if a `Vector` is provided, it must match the # of columns provided or detected in `header`. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or `nothing` to signal the column's type should be detected while parsing.
* `typemap::Dict{Type, Type}`: a mapping of a type that should be replaced in every instance with another type, i.e. `Dict(Float64=>String)` would change every detected `Float64` column to be parsed as `String`; only "standard" types are allowed to be mapped to another type, i.e. `Int64`, `Float64`, `Date`, `DateTime`, `Time`, and `Bool`. If a column of one of those types is "detected", it will be mapped to the specified type.
-* `pool::Union{Bool, Real, AbstractVector, AbstractDict, Function}=$DEFAULT_POOL`: [not supported by `CSV.Rows`] controls whether columns will be built as `PooledArray`; if `true`, all columns detected as `String` will be pooled; alternatively, the proportion of unique values below which `String` columns should be pooled (meaning that if the # of unique strings in a column is under 25%, `pool=0.25`, it will be pooled). If provided as a positive number > 1, it represents the upper threshold for the # of unique values, under which the column will be pooled; this is the default (`pool=$DEFAULT_POOL`). If an `AbstractVector`, each element should be `Bool` or `Real` and the # of elements should match the # of columns in the dataset; if an `AbstractDict`, a `Bool` or `Real` value can be provided for individual columns where the dict key is given as column index `Integer`, or column name as `Symbol` or `String`. If a function is provided, it should take a column index and name as 2 arguments, and return a `Bool`, `Real`, or `nothing` for each column.
+* `pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=$DEFAULT_POOL`: [not supported by `CSV.Rows`] controls whether columns will be built as `PooledArray`; if `true`, all columns detected as `String` will be pooled; alternatively, the proportion of unique values below which `String` columns should be pooled (meaning that if the # of unique strings in a column is under 25%, `pool=0.25`, it will be pooled). If provided as a `Tuple{Float64, Int}` like `(0.2, 500)`, it represents the percent cardinality threshold as the 1st tuple element (`0.2`), and an upper limit for the # of unique values (`500`), under which the column will be pooled; this is the default (`pool=$DEFAULT_POOL`). If an `AbstractVector`, each element should be `Bool`, `Real`, or `Tuple{Float64, Int}` and the # of elements should match the # of columns in the dataset; if an `AbstractDict`, a `Bool`, `Real`, or `Tuple{Float64, Int}` value can be provided for individual columns where the dict key is given as column index `Integer`, or column name as `Symbol` or `String`. If a function is provided, it should take a column index and name as 2 arguments, and return a `Bool`, `Real`, `Tuple{Float64, Int}`, or `nothing` for each column.
* `downcast::Bool=false`: controls whether columns detected as `Int64` will be "downcast" to the smallest possible integer type like `Int8`, `Int16`, `Int32`, etc.
* `stringtype=$DEFAULT_STRINGTYPE`: controls how detected string columns will ultimately be returned; default is `InlineString`, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to `String`. If `String` is passed, all string columns will just be normal `String` values. If `PosLenString` is passed, string columns will be returned as `PosLenStringVector`, which is a special "lazy" `AbstractVector` that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of `PosLenStringVector` makes it read-only, so operations like `push!`, `append!`, or `setindex!` are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
* `strict::Bool=false`: whether invalid values should throw a parsing error or be replaced with `missing`
17 changes: 13 additions & 4 deletions src/utils.jl
@@ -22,15 +22,24 @@ finaltype(::Type{HardMissing}) = Missing
finaltype(::Type{NeedsTypeDetection}) = Missing
coltype(col) = ifelse(col.anymissing, Union{finaltype(col.type), Missing}, finaltype(col.type))

-pooled(col) = col.pool == 1.0
-maybepooled(col) = col.pool > 0.0
+maybepooled(col) = col.pool isa Tuple ? (col.pool[1] > 0.0) : (col.pool > 0.0)

-function getpool(x::Real)::Float64
+function getpool(x)::Union{Float64, Tuple{Float64, Int}}
if x isa Bool
return x ? 1.0 : 0.0
+elseif x isa Tuple
+y = Float64(x[1])
+(isnan(y) || 0.0 <= y <= 1.0) || throw(ArgumentError("pool tuple 1st argument must be in the range: 0.0 <= x <= 1.0"))
+try
+z = Int(x[2])
+@assert z > 0
+return (y, z)
+catch
+throw(ArgumentError("pool tuple 2nd argument must be a positive integer > 0"))
+end
else
y = Float64(x)
-(isnan(y) || 0.0 <= y) || throw(ArgumentError("pool argument must be in the range: 0.0 <= x <= 1.0 or a positive integer > 1"))
+(isnan(y) || 0.0 <= y <= 1.0) || throw(ArgumentError("pool argument must be in the range: 0.0 <= x <= 1.0"))
return y
end
end
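
To see what the new validation accepts and rejects, the following standalone sketch mirrors the checks in `getpool` so it can run without CSV.jl. `check_pool` is a hypothetical name, and the tuple branch uses an explicit type test instead of the `try`/`@assert` in the diff, so treat it as an approximation of the behavior rather than the package's code.

```julia
# Standalone approximation of the validation above; `check_pool` is a
# hypothetical name chosen to avoid clashing with CSV.jl internals.
function check_pool(x)
    x isa Bool && return x ? 1.0 : 0.0
    if x isa Tuple
        percent = Float64(x[1])
        (isnan(percent) || 0.0 <= percent <= 1.0) ||
            throw(ArgumentError("1st tuple element must be in the range 0.0 <= x <= 1.0"))
        (x[2] isa Integer && x[2] > 0) ||
            throw(ArgumentError("2nd tuple element must be a positive integer"))
        return (percent, Int(x[2]))
    end
    percent = Float64(x)
    (isnan(percent) || 0.0 <= percent <= 1.0) ||
        throw(ArgumentError("pool must be in the range 0.0 <= x <= 1.0"))
    return percent
end

check_pool(true)        # Bool maps to 1.0 (always pool)
check_pool(0.4)         # a plain percent threshold passes through
check_pool((0.2, 500))  # the new tuple form is returned as (0.2, 500)

# A bare count such as 500 no longer fits the 0.0 to 1.0 range, so under
# these checks it raises an ArgumentError instead of acting as a limit.
threw = try
    check_pool(500)
    false
catch err
    err isa ArgumentError
end
```

If the package validates the keyword this way, callers that previously passed `pool=500` would need to switch to a tuple such as `pool=(1.0, 500)` or `pool=(0.2, 500)`.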
