v0 Datafusion with late materialization #414

Merged: 16 commits into develop from aduffy/df-pushdown-v0 on Jun 26, 2024

Conversation

@a10y (Contributor) commented Jun 25, 2024

Unfinished, just opening this as I continue to get things working.

This PR augments the original Vortex connector for Datafusion with an implementation of filter pushdown that allows us to perform late materialization on as many columns as possible.

Pushdown support can be toggled on/off so we can run benchmarks comparing different strategies.

I'm hoping to have an initial version of this with a benchmark harness tonight.
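
For context, here is a rough sketch of how a pushdown on/off toggle could plug into DataFusion's filter-pushdown hook. The VortexScanOptions struct, the enable_pushdown flag, and the free-function shape are illustrative only and are not taken from this PR:

use datafusion::error::Result as DFResult;
use datafusion::logical_expr::{Expr, TableProviderFilterPushDown};

/// Hypothetical options struct; the real flag name in the PR may differ.
pub struct VortexScanOptions {
    pub enable_pushdown: bool,
}

/// Sketch of a `TableProvider::supports_filters_pushdown`-style hook: report every
/// filter as Inexact when pushdown is enabled (DataFusion may still re-apply the
/// filters above the scan), and Unsupported otherwise.
pub fn pushdown_support(
    options: &VortexScanOptions,
    filters: &[&Expr],
) -> DFResult<Vec<TableProviderFilterPushDown>> {
    let support = if options.enable_pushdown {
        TableProviderFilterPushDown::Inexact
    } else {
        TableProviderFilterPushDown::Unsupported
    };
    Ok(vec![support; filters.len()])
}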

@robert3005 (Member) left a comment:

I know it's unfinished, so I tried to leave only high-level comments.

Review threads:
bench-vortex/benches/datafusion_benchmark.rs (outdated)
vortex-datafusion/src/lib.rs (outdated)
vortex-datafusion/src/plans.rs (outdated)
vortex-datafusion/src/plans.rs (outdated)
vortex-datafusion/src/plans.rs
vortex-datafusion/src/plans.rs (outdated)
vortex-datafusion/src/lib.rs (outdated)
vortex-datafusion/src/lib.rs (outdated)

@a10y (Contributor, Author) commented Jun 26, 2024

Output of the datafusion_benchmark on my MBP.

Note that vortex-nopushdown-uncompressed should actually be vortex-nopushdown-compressed, and vortex-nopushdown-uncompressed #2 is the actual vortex-nopushdown-uncompressed.

[benchmark results screenshot]

Even though this is synthetic data, it still illustrates that decoding overhead is the driving factor in execution time.

There's also a latency gap between uncompressed Vortex with no pushdown and Arrow with no pushdown, but that gap is roughly the ~130µs it takes to do the Vortex -> Arrow conversion (I benchmarked that separately, not in the repo).

@robert3005 (Member) commented Jun 26, 2024

Right now we don't run the filters on compressed data, which would probably be the thing to fix. Anyway, this seems fixable.

@a10y (Contributor, Author) commented Jun 26, 2024

I agree. I'm going to address the last few comments of your original review and then convert this to "Ready for review".

a10y marked this pull request as ready for review on June 26, 2024 at 17:16
a10y changed the title from "[WIP] Late materialization with datafusion" to "v0 Datafusion with late materialization" on Jun 26, 2024
@robert3005 (Member) left a comment:

There are some style nits to avoid clones. We should figure out whether we really need to validate filter column references.

let filter_exprs: Option<Vec<Expr>> = if filters.is_empty() {
    None
} else {
    Some(filters.to_vec())
};

robert3005 (Member):

My earlier comment meant that you can get rid of this to_vec.

a10y (Contributor, Author):

done
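
For reference, a minimal sketch of the suggested change, assuming the filters slice can simply be borrowed for as long as it is needed (the helper name here is made up):

use datafusion::logical_expr::Expr;

// Borrow the filters instead of materializing a new Vec with to_vec().
fn filter_exprs(filters: &[Expr]) -> Option<&[Expr]> {
    (!filters.is_empty()).then_some(filters)
}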

@@ -261,7 +483,7 @@ impl ExecutionPlan for VortexMemoryExec {
            self.array.clone()
        };

-       Self::execute_single_chunk(chunk, &self.projection, context)
+       execute_unfiltered(&chunk, &self.scan_projection)

robert3005 (Member): Not sure why this change; you immediately clone the chunk in the function. How about we also change scan_projection to &[usize]?

a10y (Contributor, Author): Good point.

/// Check if the given expression tree can be pushed down into the scan.
fn can_be_pushed_down(expr: &Expr, schema_columns: &HashSet<String>) -> DFResult<bool> {
    // If the filter references a column not known to our schema, we reject the filter for pushdown.
    // TODO(aduffy): is this necessary? Under what conditions would this happen?

robert3005 (Member): Reading the docs, I don't think this check is necessary. TableProvider returns the schema, so DataFusion will know whether the query is valid or not.

a10y (Contributor, Author): I just removed it; if we end up needing to add it back later for some reason, we can reference this PR.
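
For context, a pushdown check without the column validation might look roughly like this (a sketch of the general technique, not the code that landed): walk the expression tree and accept only node kinds the scan knows how to evaluate.

use datafusion::logical_expr::{BinaryExpr, Expr, Operator};

/// Sketch: decide whether an expression can be evaluated inside the scan.
/// The accepted set of node kinds here is illustrative only.
fn can_be_pushed_down(expr: &Expr) -> bool {
    match expr {
        Expr::Column(_) | Expr::Literal(_) => true,
        Expr::BinaryExpr(BinaryExpr { left, op, right }) => {
            matches!(
                op,
                Operator::Eq
                    | Operator::NotEq
                    | Operator::Lt
                    | Operator::LtEq
                    | Operator::Gt
                    | Operator::GtEq
                    | Operator::And
                    | Operator::Or
            ) && can_be_pushed_down(left)
                && can_be_pushed_down(right)
        }
        _ => false,
    }
}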

a10y enabled auto-merge (squash) on June 26, 2024 at 18:02
a10y disabled auto-merge on June 26, 2024 at 18:02
a10y enabled auto-merge (squash) on June 26, 2024 at 18:02
a10y merged commit 40616b1 into develop on Jun 26, 2024 (2 checks passed)
a10y deleted the aduffy/df-pushdown-v0 branch on June 26, 2024 at 18:04