[Design] Exchange Materialization #12387

wenleix · 2019-02-26T06:29:15Z

A comment-friendly version can be found in https://docs.google.com/document/d/1pQOOsveEN6KQPxiNDIii03lqa5x_O-HXWLt_T80dhRo/edit?usp=sharing .

Introduction

Grouped execution was introduced to Presto in #8951 to support the huge joins and aggregations that arise in ETL pipelines.

When the input tables are already partitioned on the join key or aggregation key (e.g. a bucketed table in Hive), Presto can process a subset of the partitions at a time. This reduces the amount of memory needed to hold the hash table and opens opportunities for partial query recovery (see #12124 for more details).
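The memory saving can be illustrated with a minimal sketch. It is a toy model, not Presto code: the function name `grouped_hash_join` and the dictionary layout (bucket id mapped to rows) are assumptions made for illustration. Only one bucket's hash table is alive at any time, so the peak size is bounded by the largest bucket rather than by the whole build side.

```python
from collections import defaultdict


def grouped_hash_join(build_buckets, probe_buckets):
    """Join two bucketed inputs one bucket at a time.

    Both arguments map bucket id -> list of (key, value) rows and are
    assumed to be bucketed on the join key with the same bucket function.
    Returns the joined rows and the largest hash table that was ever live.
    """
    joined = []
    peak_hash_table_rows = 0
    for bucket_id in sorted(build_buckets):
        # Build a hash table for this bucket only.
        hash_table = defaultdict(list)
        for key, value in build_buckets[bucket_id]:
            hash_table[key].append(value)
        peak_hash_table_rows = max(peak_hash_table_rows, len(build_buckets[bucket_id]))
        # Probe with the matching bucket of the other side.
        for key, value in probe_buckets.get(bucket_id, []):
            for build_value in hash_table.get(key, []):
                joined.append((key, build_value, value))
        # The hash table goes out of scope before the next bucket starts.
    return joined, peak_hash_table_rows


# Hypothetical data, already bucketed on custkey with custkey % 2.
customer = {0: [(2, "bob"), (4, "dave")], 1: [(1, "alice"), (3, "carol")]}
orders = {0: [], 1: [(1, "o1"), (1, "o2"), (3, "o3"), (5, "o4")]}
rows, peak = grouped_hash_join(customer, orders)
```

Here the build side has four rows in total, but the largest hash table held at once contains two.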

While grouped execution lets Presto run large ETL jobs, in terms of both memory and duration, it doesn’t work for unpartitioned tables. To make grouped execution work in such cases, we could materialize the exchange (modeled as a temporary partitioned table).
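As a rough sketch of the idea (a toy model, not the proposed implementation), materializing an exchange amounts to writing the rows of an unpartitioned input into buckets chosen by a hash of the key, so that a later stage can read one bucket at a time. The function name `materialize_exchange` and the in-memory "table" are assumptions for illustration; integer keys are used because Python's `hash` is stable for them.

```python
def materialize_exchange(rows, key_index, bucket_count):
    """Write rows into bucket_count buckets chosen by the hash of the key.

    The returned list of lists stands in for a temporary bucketed table:
    temp_table[i] holds every row whose key hashes to bucket i.
    """
    temp_table = [[] for _ in range(bucket_count)]
    for row in rows:
        bucket_id = hash(row[key_index]) % bucket_count
        temp_table[bucket_id].append(row)
    return temp_table


# Hypothetical unpartitioned input: (custkey, payload).
unpartitioned = [(10, "a"), (11, "b"), (12, "c"), (13, "d"), (10, "e")]
temp_table = materialize_exchange(unpartitioned, key_index=0, bucket_count=3)
```

All rows with the same key land in the same bucket, which is the property grouped execution relies on.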

Materialized Exchange

In this section we will discuss the key question: which exchanges should be materialized?

Hinted by user
- Common Table Expression
- Query Hint (requires non-ANSI SQL syntax)
Automatically decided by engine
- Every REPARTITION exchange
- CBO (e.g. based on estimated memory usage, see Peak memory cost calculation #11591)

In this section, we will discuss Common Table Expressions and materializing every REPARTITION exchange, while leaving CBO as future work. We have also evaluated the possibility of query hints (see appendix), but they are not considered at this moment because they introduce syntax that is incompatible with ANSI SQL.

Consider the following query:

SELECT *
FROM customer JOIN orders USING (custkey)

We can use CTEs to hint that we want a materialization point for the relations defined in the query:

WITH
bucketed_customer AS (
	SELECT * FROM customer
),
bucketed_order AS (
	SELECT * FROM orders
)
SELECT *
FROM bucketed_customer JOIN bucketed_order USING (custkey)

The implementation complexity of CTE depends on how we do it:

- Meta-planning: split the query at each CTE and plan each sub-query individually. A prototype of this approach is available. In this case the CTE does not take part in plan optimization and thus becomes an optimization barrier.
- Involve the CTE in the planner by introducing CTESinkNode and CTESourceNode. This makes the CTE optimizable, but it changes the plan from a tree to a DAG, and we expect this to require significant engineering effort. A recent advance on plan optimization over CTEs can be found in "Optimization of Common Table Expressions in MPP Database Systems".

On the other hand, materializing all REPARTITION exchanges doesn’t require any query change, and we can keep using the existing tree-shaped plan.

We propose the following path:

- We will introduce a session property that allows the engine to materialize exchanges. For now, a materialized exchange will simply be expanded to TableWriterOperator + TableScanOperator.
- In the future, we should allow the CBO to decide which exchanges to materialize. One thought is a session property like “materialize-exchange” that takes three values: NONE, ALWAYS, AUTOMATIC.
- We also need to think about whether it makes sense to abstract Exchange (e.g. introduce ExchangePageSink and ExchangePageSource, and migrate PartitionedOutputOperator as a special case of them).
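A minimal sketch of how such a three-valued property might drive the decision is shown below. The enum name `MaterializeExchange`, the function `should_materialize`, and the memory-threshold rule for AUTOMATIC are all hypothetical; the actual cost-based rule is left as future work in this proposal.

```python
from enum import Enum


class MaterializeExchange(Enum):
    """Hypothetical values for a "materialize-exchange" session property."""
    NONE = "NONE"
    ALWAYS = "ALWAYS"
    AUTOMATIC = "AUTOMATIC"


def should_materialize(strategy, exchange_type,
                       estimated_memory_bytes=None, memory_limit_bytes=None):
    """Decide whether a single exchange should be materialized."""
    # Only REPARTITION exchanges are candidates in this proposal.
    if exchange_type != "REPARTITION":
        return False
    if strategy is MaterializeExchange.NONE:
        return False
    if strategy is MaterializeExchange.ALWAYS:
        return True
    # AUTOMATIC: placeholder rule (an assumption, not the proposed CBO logic).
    # Materialize only when the estimate is known and exceeds the limit.
    if estimated_memory_bytes is None or memory_limit_bytes is None:
        return False
    return estimated_memory_bytes > memory_limit_bytes


decision = should_materialize(MaterializeExchange.ALWAYS, "REPARTITION")
```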

Planner Support

In this section we will discuss how to support materializing exchanges from the planner side. For the same example query, the original simplified plan looks like the following:

During plan fragmenting, the join is assigned ungrouped execution, since the TableScans are not eligible (the tables are not partitioned). See GroupedExecutionTagger.

When exchanges are chosen for materialization, the plan is first “sectioned”, and each ExchangeNode is replaced by TableWriter/TableFinish and TableScan:
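The sectioning step can be sketched as a tree rewrite. This is a toy model under stated assumptions: `PlanNode`, `section_plan`, and the `temp_N` table names are hypothetical and do not correspond to Presto's planner classes. Each REPARTITION exchange is cut out: the subtree below it becomes its own section ending in TableWriter and TableFinish, and the parent reads the result through a TableScan on the temporary table.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    kind: str                      # e.g. "Join", "Exchange", "TableScan"
    children: List["PlanNode"] = field(default_factory=list)
    detail: str = ""               # exchange type or table name


def section_plan(root):
    """Split a plan into sections at every REPARTITION exchange.

    Returns sections in dependency order: writer sections first,
    the final (output) section last.
    """
    sections = []

    def rewrite(node):
        children = [rewrite(child) for child in node.children]
        if node.kind == "Exchange" and node.detail == "REPARTITION":
            temp_table = f"temp_{len(sections)}"
            writer = PlanNode("TableWriter", children, temp_table)
            sections.append(PlanNode("TableFinish", [writer], temp_table))
            # The consumer now scans the temporary table instead.
            return PlanNode("TableScan", [], temp_table)
        return PlanNode(node.kind, children, node.detail)

    sections.append(rewrite(root))
    return sections


plan = PlanNode("Join", [
    PlanNode("Exchange", [PlanNode("TableScan", [], "customer")], "REPARTITION"),
    PlanNode("Exchange", [PlanNode("TableScan", [], "orders")], "REPARTITION"),
])
sections = section_plan(plan)
```

For the example join this yields three sections: one that writes `customer` into a temporary table, one that writes `orders`, and a final section that joins the two temporary tables. Because both join inputs are now scans of partitioned tables, the final section becomes eligible for grouped execution.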

TableLayout and SplitSource

Since the query is still planned only once, before it executes, these temporary materialized tables don’t exist at plan time. This makes it difficult to construct the TableLayout and SplitSource.

We propose the following solution:

- A new prepareExchangeMaterialization method is added to ConnectorMetadata. It returns the information required for planning, such as TableHandle, OutputHandle, and TableLayoutHandle. For the Hive connector, the temporary table will be registered in the in-memory SemiTransactionalMetastore. The data will be removed at the end of the transaction, and the table never gets committed to the Metastore.
- Allow ConnectorSplitSource to be initialized lazily.
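The lazy initialization in the second point can be sketched as follows. This is a toy model, not the ConnectorSplitSource interface: the class `LazySplitSource`, its `next_batch` method, and the supplier callback are assumptions for illustration. The split list is computed only when the first batch is requested, by which time the writer section has produced the temporary table's files.

```python
from itertools import islice


class LazySplitSource:
    """Defers split enumeration until the first batch is requested."""

    def __init__(self, split_supplier):
        self._split_supplier = split_supplier  # called at most once
        self._splits = None

    def next_batch(self, max_size):
        if self._splits is None:
            # First use: the temporary table is expected to exist by now.
            self._splits = iter(self._split_supplier())
        return list(islice(self._splits, max_size))


# At plan time the temporary table has no files yet.
temp_table_files = []
source = LazySplitSource(lambda: list(temp_table_files))

# Later, the writer section finishes and the files appear.
temp_table_files.extend(["bucket_0", "bucket_1", "bucket_2"])

first_batch = source.next_batch(2)
second_batch = source.next_batch(2)
```

Had the supplier been called in the constructor, the source would have seen an empty table.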

Execution Support for Multiple “Queries”

Andrii’s Prototype: arhimondr#1


stale · 2021-07-11T20:57:53Z

This issue has been automatically marked as stale because it has not had any activity in the last 2 years. If you feel that this issue is important, just comment and the stale tag will be removed; otherwise it will be closed in 7 days. This is an attempt to ensure that our open issues remain valuable and relevant so that we can keep track of what needs to be done and prioritize the right things.

This was referenced Feb 26, 2019

Add connector specific partitioning support for remote exchanges #12373

Merged

Calculate query peak memory usage as memory cost #12398

Merged

[WIP] Presto 2.0 #12419

Closed

wenleix changed the title ~~[Proposal] Support Materialized Exchange~~ [Proposal] Materialized Exchange Mar 5, 2019

This was referenced Mar 12, 2019

Introduce Temporary Table SPI #12464

Merged

Support materializing exchanges #12469

Closed

wenleix changed the title ~~[Proposal] Materialized Exchange~~ [Design] Exchange Materialization Mar 15, 2019

arhimondr mentioned this issue Apr 7, 2019

Simplify DistributedExecutionPlanner to SplitSourceFactory #12582

Merged

wenleix mentioned this issue Apr 8, 2019

Support partial merge pushdown #12611

Merged

This was referenced Apr 10, 2019

Support materializing exchanges [Execution Part] #12604

Merged

Support materializing exchanges [Planner Part] #12568

Merged

wenleix mentioned this issue Apr 12, 2019

Allow materializing partitioning inferred by join #12656

Merged

arhimondr mentioned this issue Apr 15, 2019

Allow to change compression codec with a session property #12647

Merged

rschlussel mentioned this issue Jun 22, 2019

Limit the number of concurrent plan sections #12903

Merged

wenleix mentioned this issue Jun 24, 2019

Aggregate over unpartitioned source is not yet supported by materialized exchange #13003

Closed

tdcmeehan pinned this issue Jul 11, 2019

aweisberg added the Roadmap A top level roadmap item label Jul 11, 2019

tdcmeehan unpinned this issue Jul 15, 2019

This was referenced Jul 16, 2019

Improve exchange materialization configuration experience #13085

Merged

Change section schedulng approach #13090

Merged

highker pinned this issue Jul 18, 2019

arhimondr mentioned this issue Aug 29, 2019

Allow anonymous rows in Hive temporary tables #13307

Merged

pguofb mentioned this issue Jul 6, 2020

Initial Support of Adaptive Optimization with Presto Unlimited #14675

Merged

stale bot added the stale label Jul 11, 2021

stale bot closed this as completed Jul 21, 2021

tdcmeehan unpinned this issue Apr 19, 2022
