Skip to content
This repository has been archived by the owner on Jul 2, 2024. It is now read-only.

group by grouping sets - adding a subgroup can change unrelated results. #4826

Open
dandanlen opened this issue Nov 18, 2020 · 1 comment
Open

Comments

@dandanlen
Copy link

Not sure if this is a bug or a feature... either way I don't think it adds any privacy benefit since it can be worked around by decomposing the group by grouping sets query into multiple group by statements.

The following were run against https://demo.aircloak.com/data_sources/NAV_W1_TENANT_WS001.

                select
                    grouping_id(
                        "Quantity"
                    ),
                    "Quantity",
                    count(*),
                    count_noise(*)
                from "SalesCrMemoLine"
                group by grouping sets ((2))

☝️ This is equivalent to just group by 2 and I get the same result whichever way I write the query statement. So far, so good. Link

                select
                    grouping_id(
                        "Type","Quantity"
                    ),
                    "Type","Quantity",
                    count(*),
                    count_noise(*)
                from "SalesCrMemoLine"
                group by grouping sets ((2), (3))

☝️ This is equivalent to the union of group by 2 and group by 3 and I get the results I expect: The results contain the same two rows as in the query above (with different grouping_ids). Link

Now if instead I add a column with the bucket functions something weird happens.

                select
                    grouping_id(
                        bucket ("Line No_" by 2000.0),"Quantity"
                    ),
                    bucket ("Line No_" by 2000.0),"Quantity",
                    count(*),
                    count_noise(*)
                from "SalesCrMemoLine"
                group by grouping sets ((2), (3))

☝️ The two rows for the "Quantity" grouping are still there, but the numbers are totally different. It looks like the mere fact of including another group has added extra noise to the unrelated column.

In other examples, this can even cause a column to be suppressed or removed entirely from the result set.

TLDR;
group by grouping sets ought to be equivalent to the union of the individual group by ${subgroup} queries. But in some cases adding an extra (and unrelated) grouping set can change results from other grouping sets.

@cristianberneanu
Copy link
Member

Nice find!

Each grouping expression generates noise layers, which should apply only to those buckets containing it.
It seems the Cloak always applies these noise layers globally, to all the buckets, returning different noise values for a bucket which should not be affected by unrelated grouping expressions.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants