-
Notifications
You must be signed in to change notification settings - Fork 1
Bucketing
Buckets are worth a page of their own as they are quite central to how numerical aggregates are collected and combined. Sorting a data set into buckets of a defined range and calculating aggregates for each bucket allow us to obtain fine-grained statistics while navigating the anonymity protections imposed by Diffix. These fine-grained statistic can be recombined to obtain global metrics such as the data distribution.
For example, counts of data points that fall into sub-ranges (buckets) can be combined to form a histogram:
SELECT
bucket(amount by 500),
count(*)
FROM
loans
GROUP BY 1
Note the use of a custom
bucket
function to truncate the column value to the lower bound of its bucket
This query groups the amount
column into buckets of size 500 and counts the number of values in each. If the original amount
column contained values between 0 and 50_000, this query would return around at most 100 rows. There may be less than 100 rows returned because Diffix suppresses buckets with too few values.
In other words, there is a trade-off: if the bucket size is too small, the data will be suppressed. If the bucket size is too large, statistics are of limited use. One of the key challenges here is to find the optimal bucket size - sufficient precision without sacrificing too much accuracy.
One way of finding the appropriate resolution is to break data up into ever smaller buckets until data contained in bucket is no longer accurate.
Eg. for numeric data we might want to query for multiple bucket sizes to obtain data at different resolutions:
min: 0 max: 100
|________________________________________bucket size: 100___________________________________________| <-- Global stats, limited usefulness
|______________________50________________________|_____________________50___________________________|
|_________20*_______|_________20________|_________20________|_________20________|_________20________|
| |___10*___|____10___|___10____|____10___|___10____|____10___|___10____|____10*__|
| |_5__|__5_|_5__|__5_|_5__|__5_|_5__|__5_|_5__|__5_|_5__|__5_|
*
Suppressed: Smaller bucket counts are not returned in the query but can be estimated from larger buckets.
Note that different-sized buckets can be returned from same query. For example, the following returns a count of the amount
column sorted into buckets of 500, 200 and 100:
SELECT
bucket(amount by 500),
bucket(amount by 200),
bucket(amount by 100),
count(*),
grouping_id(
bucket(amount by 500),
bucket(amount by 200),
bucket(amount by 100)
)
FROM
loans
GROUP BY GROUPING SETS (1,2,3)
Note the use of the Aircloak-specific
grouping_id
function. This generates a bitmask that can used to identify the group that a particular row belongs to. See the docs.
This approach applies most obviously to numerical data, however the same principle can used for other types of data. For example, Text
data columns can be sorted into buckets based on prefixes / postfixes / substrings. Date
ranges can be bucketed similarly to numbers.
Small buckets are more likely to be suppressed than larger ones. If some small buckets are suppressed, we can use the larger buckets to estimate the count for the suppressed buckets. For example, if a bucket of size 100 has a count of 50 and we break it into 5 buckets of size 20, the resulting smaller buckets may have counts of *
/*
/12
/12
/16
. In this case we know that 10 suppressed values belong to the two *
buckets. We can simply apportion this evenly, or maybe according to some expected distribution, in order to estimate the size of the remaining buckets.
Diffix restricts the bucket sizes to multiples of
i * (10 ^ n)
wherei
is1
,2
or5
. So valid bucket sizes are: 0.0001, 0.0002, 0.0005, 0.001, [...], 1, 2, 5, 10, 20, [...], 1_000, 2_000, 5_000, 10_000, ...etc
To help with interpolation it can be useful to define parent-child relationships between buckets. For numerical buckets, this is complicated by the fact that base-2 buckets do not divide nicely into base-5 buckets. For example, you can't divide buckets of size 50
into buckets of size 20
, which is the next smaller bucket size.
One approach is to simply skip a generation for these combinations. So, the parent of a base-2 bucket is the next-larger base-1 bucket, and the child of a base-5 bucket is the next base-1 bucket.
Bucket count: The number of values contained in the bucket
Bucket size: The range of the bucket
Bucket base: 1, 2 or 5
Lower bound: The lower bound of the bucket
Upper bound: The upper bound of the bucket