feat: add ALL/DISTINCT modifier for all set operation types #708

Merged
merged 9 commits into from
Sep 27, 2024
Changes from 7 commits
2 changes: 2 additions & 0 deletions proto/substrait/algebra.proto
@@ -333,9 +333,11 @@ message SetRel {
enum SetOp {
SET_OP_UNSPECIFIED = 0;
SET_OP_MINUS_PRIMARY = 1;
SET_OP_MINUS_PRIMARY_ALL = 7;
SET_OP_MINUS_MULTISET = 2;
SET_OP_INTERSECTION_PRIMARY = 3;
SET_OP_INTERSECTION_MULTISET = 4;
SET_OP_INTERSECTION_MULTISET_ALL = 8;
SET_OP_UNION_DISTINCT = 5;
SET_OP_UNION_ALL = 6;
}
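
The two new enum values slot in beside the existing ones. As a rough illustration only, the sketch below mirrors the enum from this hunk in Python and pairs each SQL form named in the documentation table of this PR with its value. The `SetOp` class, `SQL_TO_SET_OP`, and `set_op_for_sql` are hypothetical helpers written for this example; they are not Substrait's generated bindings.

```python
from enum import IntEnum


class SetOp(IntEnum):
    """Hand-written mirror of SetRel.SetOp as shown in the hunk (assumption: not generated code)."""
    SET_OP_UNSPECIFIED = 0
    SET_OP_MINUS_PRIMARY = 1
    SET_OP_MINUS_PRIMARY_ALL = 7
    SET_OP_MINUS_MULTISET = 2
    SET_OP_INTERSECTION_PRIMARY = 3
    SET_OP_INTERSECTION_MULTISET = 4
    SET_OP_INTERSECTION_MULTISET_ALL = 8
    SET_OP_UNION_DISTINCT = 5
    SET_OP_UNION_ALL = 6


# Only the SQL forms that the documentation table explicitly names are listed.
SQL_TO_SET_OP = {
    "EXCEPT": SetOp.SET_OP_MINUS_PRIMARY,
    "EXCEPT ALL": SetOp.SET_OP_MINUS_PRIMARY_ALL,
    "INTERSECT DISTINCT": SetOp.SET_OP_INTERSECTION_MULTISET,
    "INTERSECT ALL": SetOp.SET_OP_INTERSECTION_MULTISET_ALL,
    "UNION DISTINCT": SetOp.SET_OP_UNION_DISTINCT,
    "UNION ALL": SetOp.SET_OP_UNION_ALL,
}


def set_op_for_sql(keyword: str) -> SetOp:
    """Hypothetical lookup: normalise case and whitespace, fall back to UNSPECIFIED."""
    normalized = " ".join(keyword.upper().split())
    return SQL_TO_SET_OP.get(normalized, SetOp.SET_OP_UNSPECIFIED)
```

The new values take the next free numbers (7 and 8) rather than renumbering existing entries, so the declaration order in the enum differs from the numeric order.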
43 changes: 27 additions & 16 deletions site/docs/relations/logical_relations.md
@@ -268,14 +268,23 @@ The set operation encompasses several set-level operations that support combinin

The set operation type determines both the records that are emitted and the type of the output record.

| Property | Description | Output Nullability
| ----------------------- | ------------------------------------------------------------------------------------------------------------- | ----------------------------- |
| Minus (Primary) | Returns all records from the primary input excluding any matching records from secondary inputs. | The same as the primary input.
| Minus (Multiset) | Returns all records from the primary input excluding any records that are included in *all* secondary inputs. | The same as the primary input.
| Intersection (Primary) | Returns all records from the primary input that match at least one record from *any* secondary inputs. | If a field is nullable in the primary input and in any of the secondary inputs, it is nullable in the output.
| Intersection (Multiset) | Returns all records from the primary input that match at least one record from *all* secondary inputs. | If a field is required in any of the inputs, it is required in the output.
| Union Distinct | Returns all the records from each set, removing any rows that are duplicated (within or across sets). | If a field is nullable in any of the inputs, it is nullable in the output.
| Union All | Returns all records from each set, allowing duplicates. | If a field is nullable in any of the inputs, it is nullable in the output. |
For some set operations, whether a specific record is included in the output, and whether it appears more than once, depends on the number of times it occurs across all inputs. In the following table, treat:
* m: the number of times a record occurs in the primary input (p)
* n1: the number of times a record occurs in the 1st secondary input (s1)
* n2: the number of times a record occurs in the 2nd secondary input (s2)
* ...
* n: the number of times a record occurs in the nth secondary input

| Operation | Description | Examples | Output Nullability
|-----------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------| -----------------------------
| Minus (Primary) | Returns all records from the primary input excluding any matching rows from secondary inputs, removing duplicates.<br/>Each value is treated as a unique member of the set, so duplicates in the first set don’t affect the result.<br/>This operation maps to SQL EXCEPT. | MINUS<br/>&nbsp;&nbsp;p: {1, 2, 2, 3, 3, 3, 4}<br/>&nbsp;&nbsp;s1: {1, 2}<br/>&nbsp;&nbsp;s2: {3}<br/>YIELDS<br/>{4} | The same as the primary input.
| Minus (Primary All) | Returns all records from the primary input excluding any matching records from secondary inputs.<br/>For each specific record returned, the output contains max(0, m - sum(n1, n2, …, n)) copies.<br/>This operation maps to SQL EXCEPT ALL. | MINUS ALL<br/>&nbsp;&nbsp;p: {1, 2, 2, 3, 3, 3, 3}<br/>&nbsp;&nbsp;s1: {1, 2, 3, 4}<br/>&nbsp;&nbsp;s2: {3}<br/>YIELDS<br/>{2, 3, 3} | The same as the primary input.
| Minus (Multiset) | Returns all records from the primary input excluding any records that are included in *all* secondary inputs.<br/>This operation does not have a direct SQL mapping. | MINUS MULTISET<br/>&nbsp;&nbsp;p: {1, 2, 3, 4}<br/>&nbsp;&nbsp;s1: {1, 2}<br/>&nbsp;&nbsp;s2: {1, 2, 3}<br/>YIELDS<br/>{3, 4} | The same as the primary input.
| Intersection (Primary) | Returns all records from the primary input that are present in any secondary input, removing duplicates.<br/>This operation does not have a direct SQL mapping. | INTERSECT<br/>&nbsp;&nbsp;p: {1, 2, 2, 3, 3, 3, 4}<br/>&nbsp;&nbsp;s1: {1, 2, 3, 5}<br/>&nbsp;&nbsp;s2: {2, 3, 6}<br/>YIELDS<br/>{1, 2, 3} | If a field is nullable in the primary input and in any of the secondary inputs, it is nullable in the output.
| Intersection (Multiset) | Returns all records from the primary input that match at least one record from *all* secondary inputs.<br/>This operation maps to SQL INTERSECT DISTINCT. | INTERSECT MULTISET<br/>&nbsp;&nbsp;p: {1, 2, 3, 4}<br/>&nbsp;&nbsp;s1: {2, 3}<br/>&nbsp;&nbsp;s2: {3, 4}<br/>YIELDS<br/>{3} | If a field is required in any of the inputs, it is required in the output.
| Intersection (Multiset All) | Returns all records from the primary input that are present in every secondary input.<br/>For each specific record returned, the output contains min(m, n1, n2, …, n) copies.<br/>This operation maps to SQL INTERSECT ALL. | INTERSECT ALL<br/>&nbsp;&nbsp;p: {1, 2, 2, 3, 3, 3, 4}<br/>&nbsp;&nbsp;s1: {1, 2, 3, 3, 5}<br/>&nbsp;&nbsp;s2: {2, 3, 3, 6}<br/>YIELDS<br/>{2, 3, 3} | If a field is nullable in the primary input and in any of the secondary inputs, it is nullable in the output.
| Union Distinct | Returns all records from each set, removing duplicates.<br/>This operation maps to SQL UNION DISTINCT. | UNION<br/>&nbsp;&nbsp;p: {1, 2, 2, 3, 3, 3, 4}<br/>&nbsp;&nbsp;s1: {2, 3, 5}<br/>&nbsp;&nbsp;s2: {1, 6}<br/>YIELDS<br/>{1, 2, 3, 4, 5, 6} | If a field is nullable in any of the inputs, it is nullable in the output.
| Union All | Returns all records from all inputs.<br/>For each specific record returned, the output contains (m + n1 + n2 + … + n) copies.<br/>This operation maps to SQL UNION ALL. | UNION ALL<br/>&nbsp;&nbsp;p: {1, 2, 2, 3, 3, 3, 4}<br/>&nbsp;&nbsp;s1: {2, 3, 5}<br/>&nbsp;&nbsp;s2: {1, 6}<br/>YIELDS<br/>{1, 2, 2, 3, 3, 3, 4, 2, 3, 5, 1, 6} | If a field is nullable in any of the inputs, it is nullable in the output.

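The copy-count formulas and nullability rules in the table can be checked with a small sketch. This is an illustration under stated assumptions, not Substrait's implementation: records are plain Python values, inputs are lists, and every function name is hypothetical. Results are sorted only to make the output deterministic; the table does not define an ordering. The nullability tuples at the end of the block are assumed inputs chosen for this example.

```python
from collections import Counter
from functools import reduce

# --- Which records are emitted (p = primary input, s = secondary inputs) ---

def minus_primary(p, *s):
    """EXCEPT: distinct primary records absent from every secondary input."""
    return sorted(set(p).difference(*s))

def minus_primary_all(p, *s):
    """EXCEPT ALL: max(0, m - sum(n1, n2, ...)) copies of each record."""
    remaining = Counter(p)
    for secondary in s:
        remaining -= Counter(secondary)  # in-place subtraction drops counts <= 0
    return sorted(remaining.elements())

def intersection_multiset(p, *s):
    """INTERSECT DISTINCT: distinct records present in every input."""
    return sorted(set(p).intersection(*s))

def intersection_multiset_all(p, *s):
    """INTERSECT ALL: min(m, n1, n2, ...) copies of each record."""
    counts = reduce(lambda acc, x: acc & Counter(x), s, Counter(p))
    return sorted(counts.elements())

def union_distinct(p, *s):
    """UNION DISTINCT: every record that occurs anywhere, once."""
    return sorted(set(p).union(*s))

def union_all(p, *s):
    """UNION ALL: m + n1 + n2 + ... copies, in input order."""
    return list(p) + [record for secondary in s for record in secondary]

# --- Output nullability, per field ("R" = required, "N" = nullable) ---

def same_as_primary(primary, *secondary):
    """Rule for the Minus variants."""
    return tuple(primary)

def nullable_if_primary_and_any_secondary(primary, *secondary):
    """Rule for Intersection (Primary) and Intersection (Multiset All)."""
    return tuple(
        "N" if p == "N" and any(sec[i] == "N" for sec in secondary) else "R"
        for i, p in enumerate(primary)
    )

def required_if_any_required(primary, *secondary):
    """Rule for Intersection (Multiset)."""
    return tuple("R" if "R" in column else "N" for column in zip(primary, *secondary))

def nullable_if_any_nullable(primary, *secondary):
    """Rule for both Union variants."""
    return tuple("N" if "N" in column else "R" for column in zip(primary, *secondary))

# Assumed example field nullabilities (hypothetical, for illustration only).
PRIMARY = tuple("RRRRNNNN")
SECONDARY_1 = tuple("RRNNRRNN")
SECONDARY_2 = tuple("RNRNRNRN")
```

For instance, `minus_primary_all([1, 2, 2, 3, 3, 3, 3], [1, 2, 3, 4], [3])` evaluates to `[2, 3, 3]`, which agrees with the MINUS ALL row, and `intersection_multiset_all([1, 2, 2, 3, 3, 3, 4], [1, 2, 3, 3, 5], [2, 3, 3, 6])` also evaluates to `[2, 3, 3]`.
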
Note that for set operations, NULL matches NULL. That is
```
@@ -294,14 +303,16 @@ Input 3: (R, N, R, N, R, N, R, N) Secondary Input

The output type is as follows for the various operations

| Property | Output Type
| ----------------------- | -----------------------------------------------------------------------------------------------------
| Minus (Primary) | (R, R, R, R, N, N, N, N)
| Minus (Multiset) | (R, R, R, R, N, N, N, N)
| Intersection (Primary) | (R, R, R, R, R, N, N, N)
| Intersection (Multiset) | (R, R, R, R, R, R, R, N)
| Union Distinct | (R, N, N, N, N, N, N, N)
| Union All | (R, N, N, N, N, N, N, N)
| Property | Output Type
|-----------------------------| -----------------------------------------------------------------------------------------------------
| Minus (Primary) | (R, R, R, R, N, N, N, N)
| Minus (Primary All) | (R, R, R, R, N, N, N, N)
| Minus (Multiset) | (R, R, R, R, N, N, N, N)
| Intersection (Primary) | (R, R, R, R, R, N, N, N)
| Intersection (Multiset) | (R, R, R, R, R, R, R, N)
| Intersection (Multiset All) | (R, R, R, R, R, N, N, N)
| Union Distinct | (R, N, N, N, N, N, N, N)
| Union All | (R, N, N, N, N, N, N, N)


=== "SetRel Message"