feat: define sideband optimization hints #705

Merged
merged 13 commits into from
Oct 8, 2024
37 changes: 37 additions & 0 deletions proto/substrait/algebra.proto
Original file line number Diff line number Diff line change
@@ -47,6 +47,12 @@ message RelCommon {

substrait.extensions.AdvancedExtension advanced_extension = 10;

// Save or load a system-specific computation for use in optimizing a remote operation.
// The anchor refers to the source/destination of the computation. The computation type
// and number refer to the current relation.
repeated SavedComputation saved_computations = 11;
repeated LoadedComputation loaded_computations = 12;

// The statistics related to a hint (physical properties of records)
message Stats {
double row_count = 1;
@@ -59,6 +65,37 @@ message RelCommon {

substrait.extensions.AdvancedExtension advanced_extension = 10;
}

enum ComputationType {
COMPUTATION_TYPE_UNSPECIFIED = 0;
COMPUTATION_TYPE_HASHTABLE = 1;
COMPUTATION_TYPE_BLOOM_FILTER = 2;
COMPUTATION_TYPE_UNKNOWN = 9999;
}

message SavedComputation {
// The value corresponds to a plan-unique number for that data structure. Any particular
// computation may be saved only once but it may be loaded multiple times.
int32 computation_id = 1;
// The type of this computation. While a plan may use COMPUTATION_TYPE_UNKNOWN for all
// of its types it is recommended to use a more specific type so that the optimization
// is more portable. The consumer should be able to decide if an unknown type here
// matches the same unknown type at a different plan and ignore the optimization if they
// are mismatched.
ComputationType type = 2;
}

message LoadedComputation {
// The value corresponds to a plan-unique number for that data structure. Any particular
// computation may be saved only once but it may be loaded multiple times.
int32 computation_id_reference = 1;
// The type of this computation. While a plan may use COMPUTATION_TYPE_UNKNOWN for all
// of its types it is recommended to use a more specific type so that the optimization
// is more portable. The consumer should be able to decide if an unknown type here
// matches the same unknown type at a different plan and ignore the optimization if they
// are mismatched.
ComputationType type = 2;
}
}
}

3 changes: 2 additions & 1 deletion site/docs/relations/_config
@@ -1,6 +1,7 @@
arrange:
- basics.md
- common_fields.md
- logical_relations.md
- physical_relations.md
- user_defined_relations.md
- embedded_relations.md
- embedded_relations.md
26 changes: 26 additions & 0 deletions site/docs/relations/common_fields.md
@@ -0,0 +1,26 @@
# Common Fields

Every relation contains a common section that holds optional hints and the relation's emit behavior.


## Emit

A relation with a direct emit kind outputs all of its columns as they are, without reordering or selection. A relation that specifies an emit output mapping can emit its output columns in any order and may omit some of them.

???+ info "Relation Output"

* By default, many relations (such as Project) output all of their input columns plus any generated columns. Review each relation to understand its specific default output.
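The difference between the two emit kinds can be sketched in a few lines of Python. This is an illustration only: `apply_emit` is a hypothetical helper, not part of any Substrait library, and it assumes an output mapping is a list of zero-based indices into the relation's default output columns.

```python
from typing import Optional, Sequence


def apply_emit(row: Sequence, output_mapping: Optional[Sequence[int]] = None) -> list:
    """Return one row as a relation would emit it (hypothetical helper).

    output_mapping=None models a direct emit: every column, original order.
    Otherwise each entry is assumed to be a zero-based index into the
    relation's default output columns.
    """
    if output_mapping is None:
        return list(row)
    return [row[i] for i in output_mapping]


row = ["a", 1, 2.5]
direct = apply_emit(row)            # all columns, unchanged order
remapped = apply_emit(row, [2, 0])  # reordered, middle column left out
```

With a mapping of `[2, 0]`, the third column is emitted first and the second column is dropped.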


## Hints

Hints provide information that can improve performance but cannot be used to control a relation's behavior. Table statistics, runtime constraints, name hints, and saved computations all fall into this category.

???+ info "Hint Design"

* If a hint is missing or contains incorrect data, the consumer should be able to ignore it and still arrive at the correct result.
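As a rough illustration of this design rule, the Python sketch below uses row-count hints only to choose which side of a join to build a hash table on. The function `hash_join` and its parameters are hypothetical and are not a Substrait API; the point is that a missing or wrong hint changes the work done, not the result.

```python
from collections import defaultdict


def hash_join(left, right, left_rows_hint=None, right_rows_hint=None):
    """Inner equi-join of (key, value) pairs (hypothetical sketch).

    The hints only influence which input is used as the build side.
    """
    # Build on the left only if both hints are present and claim left is smaller.
    build_left = (
        left_rows_hint is not None
        and right_rows_hint is not None
        and left_rows_hint < right_rows_hint
    )
    build, probe = (left, right) if build_left else (right, left)

    table = defaultdict(list)
    for key, value in build:
        table[key].append(value)

    out = []
    for key, value in probe:
        for other in table.get(key, []):
            # Always emit (key, left_value, right_value), whichever side was built.
            out.append((key, other, value) if build_left else (key, value, other))
    return sorted(out)


left = [(1, "a"), (2, "b")]
right = [(1, "x"), (1, "y"), (3, "z")]

no_hint = hash_join(left, right)
good_hint = hash_join(left, right, left_rows_hint=2, right_rows_hint=3)
bad_hint = hash_join(left, right, left_rows_hint=1000, right_rows_hint=1)
```

All three calls return the same rows; only the choice of build side differs.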


### Saved Computations

Saved computations let one relation save a data structure for use elsewhere in the plan. For instance, consider a plan with a HashEquiJoin and an AggregateDistinct operation. The HashEquiJoin could save its hash table as saved computation id 1, and the AggregateDistinct could then load computation id 1.
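A consumer might handle the HashEquiJoin and AggregateDistinct example roughly as sketched below in Python. The enum values mirror `ComputationType` from the proto in this change; `ComputationStore`, `build_hash_table`, and `distinct_keys` are hypothetical names used only for illustration. The sketch treats two `UNKNOWN` types as matching, which is a simplification: a real consumer has to decide for itself whether two unknown types are compatible.

```python
from enum import Enum
from typing import Any, Dict, Optional, Tuple


class ComputationType(Enum):
    UNSPECIFIED = 0
    HASHTABLE = 1
    BLOOM_FILTER = 2
    UNKNOWN = 9999


class ComputationStore:
    """Hypothetical per-plan store: save once, load any number of times."""

    def __init__(self) -> None:
        self._saved: Dict[int, Tuple[ComputationType, Any]] = {}

    def save(self, computation_id: int, ctype: ComputationType, payload: Any) -> None:
        if computation_id in self._saved:
            raise ValueError(f"computation {computation_id} was already saved")
        self._saved[computation_id] = (ctype, payload)

    def load(self, reference: int, ctype: ComputationType) -> Optional[Any]:
        entry = self._saved.get(reference)
        # A missing or mismatched computation is ignored, as with any hint.
        if entry is None or entry[0] is not ctype:
            return None
        return entry[1]


def build_hash_table(rows):
    """Group (key, value) rows by key, as a hash join's build phase might."""
    table: Dict[Any, list] = {}
    for key, value in rows:
        table.setdefault(key, []).append(value)
    return table


def distinct_keys(rows, store: ComputationStore, reference: int):
    """Distinct keys, reusing a saved hash table when one is available."""
    table = store.load(reference, ComputationType.HASHTABLE)
    if table is not None:
        return sorted(table)                  # reuse the join's hash table
    return sorted({key for key, _ in rows})   # fall back to recomputing


rows = [(1, "a"), (2, "b"), (1, "c")]
store = ComputationStore()
store.save(1, ComputationType.HASHTABLE, build_hash_table(rows))

with_hint = distinct_keys(rows, store, 1)
without_hint = distinct_keys(rows, ComputationStore(), 1)
```

Both paths yield the same distinct keys, so the saved computation only removes duplicate work.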