Parquet limit pushdown (#5404) #5416

tustvold · 2023-02-27T12:22:29Z

Which issue does this PR close?

Relates to #5404

Rationale for this change

apache/arrow-rs#3633 added the ability to push down limits to the Parquet reader. This is particularly important when filter pushdown is enabled on ParquetOptions (soon to be the default), as it allows the limit to be applied before late materialization, which has significant performance benefits.
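To illustrate why applying the limit before late materialization matters, here is a minimal std-only sketch. It is a toy model, not the arrow-rs or DataFusion implementation: `ScanStats`, `scan_then_limit`, and `scan_with_pushed_limit` are hypothetical names, and "materializing" a row stands in for decoding the non-predicate columns of that row.

```rust
/// Counters for a toy scan. All names here are hypothetical and exist only
/// for illustration; they are not part of arrow-rs or DataFusion.
#[derive(Debug, PartialEq)]
struct ScanStats {
    /// Rows handed back to the caller.
    rows_returned: usize,
    /// Rows whose remaining columns had to be decoded.
    rows_materialized: usize,
}

/// Limit applied after the scan: every row that passes the predicate is
/// materialized, and the result is truncated afterwards.
fn scan_then_limit(
    predicate_col: &[i64],
    pred: impl Fn(i64) -> bool,
    limit: usize,
) -> ScanStats {
    let selected: Vec<usize> = predicate_col
        .iter()
        .enumerate()
        .filter(|&(_, &v)| pred(v))
        .map(|(i, _)| i)
        .collect();
    ScanStats {
        rows_returned: selected.len().min(limit),
        rows_materialized: selected.len(),
    }
}

/// Limit pushed into the scan: the selection is cut off at `limit` rows
/// before anything is materialized, so no extra rows are decoded.
fn scan_with_pushed_limit(
    predicate_col: &[i64],
    pred: impl Fn(i64) -> bool,
    limit: usize,
) -> ScanStats {
    let selected: Vec<usize> = predicate_col
        .iter()
        .enumerate()
        .filter(|&(_, &v)| pred(v))
        .map(|(i, _)| i)
        .take(limit)
        .collect();
    ScanStats {
        rows_returned: selected.len(),
        rows_materialized: selected.len(),
    }
}
```

In this model, with 1000 rows, a predicate that keeps even values, and a limit of 10, both functions return 10 rows, but the first materializes 500 rows while the second materializes only 10. The real reader works on row selections over column chunks rather than per-row indices, so this only conveys the shape of the saving.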

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb

Thank you @tustvold -- it would be nice to have a test for this, but I am not sure there is any way to test this really (other than performance benchmarks).

jackwener

Nice job. Thank you @tustvold

jackwener · 2023-02-27T12:41:13Z

Thank you @tustvold -- it would be nice to have a test for this, but I am not sure there is any way to test this really (other than performance benchmarks).

There is tpch in the current benchmarks, may we consider adding clickbench?

alamb · 2023-02-27T13:28:51Z

There is tpch in the current benchmarks, may we consider adding clickbench?

It is a good idea @jackwener -- I have it on my list this week to do some more organizing related to benchmarking. I definitely agree that clickbench would be super helpful.

ursabot · 2023-02-27T13:31:54Z

Benchmark runs are scheduled for baseline = 58cd1bf and contender = 4118076. 4118076 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Parquet limit pushdown (apache#5404)

229a83b

github-actions bot added the core Core DataFusion crate label Feb 27, 2023

tustvold mentioned this pull request Feb 27, 2023

Datafusion v19.rc1 scan parquet 20x slower than DuckDB v0.6.1 on 15GB ClickBench data #5404

Open

alamb approved these changes Feb 27, 2023

jackwener approved these changes Feb 27, 2023

alamb merged commit 4118076 into apache:main Feb 27, 2023
