v0 Datafusion with late materialization #414

Merged: 16 commits into develop from aduffy/df-pushdown-v0 on Jun 26, 2024

Conversation

@a10y (Contributor) commented Jun 25, 2024

Unfinished, just opening this as I continue to get things working.

This PR augments the original Vortex connector for Datafusion with an implementation of filter pushdown that allows us to perform late materialization on as many columns as possible.

Pushdown support can be toggled on/off so we can run benchmarks comparing different strategies.

I'm hoping to have an initial version of this with a benchmark harness tonight.
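
For context, here is a rough sketch of how a pushdown on/off toggle could plug into DataFusion's filter-pushdown hook. The VortexScanOptions struct, the enable_pushdown flag, and the free-function shape are illustrative only and are not taken from this PR:

use datafusion::error::Result as DFResult;
use datafusion::logical_expr::{Expr, TableProviderFilterPushDown};

/// Hypothetical options struct; the real flag name in the PR may differ.
pub struct VortexScanOptions {
    pub enable_pushdown: bool,
}

/// Sketch of a `TableProvider::supports_filters_pushdown`-style hook: report every
/// filter as Inexact when pushdown is enabled (DataFusion may still re-apply the
/// filters above the scan), and Unsupported otherwise.
pub fn pushdown_support(
    options: &VortexScanOptions,
    filters: &[&Expr],
) -> DFResult<Vec<TableProviderFilterPushDown>> {
    let support = if options.enable_pushdown {
        TableProviderFilterPushDown::Inexact
    } else {
        TableProviderFilterPushDown::Unsupported
    };
    Ok(vec![support; filters.len()])
}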

@robert3005 (Member) left a comment:

I know it's unfinished, so I tried to leave only high-level comments.

Review threads:
bench-vortex/benches/datafusion_benchmark.rs (outdated)
vortex-datafusion/src/lib.rs (outdated)
vortex-datafusion/src/plans.rs (outdated)
vortex-datafusion/src/plans.rs (outdated)
vortex-datafusion/src/plans.rs
vortex-datafusion/src/plans.rs (outdated)
vortex-datafusion/src/lib.rs (outdated)
vortex-datafusion/src/lib.rs (outdated)

@a10y (Contributor, Author) commented Jun 26, 2024

Output of the datafusion_benchmark on my MBP.

Note that vortex-nopushdown-uncompressed should actually be vortex-nopushdown-compressed, and vortex-nopushdown-uncompressed #2 is the actual vortex-nopushdown-uncompressed.

[benchmark results screenshot]

Even though this is synthetic data, it still illustrates that decoding overhead is the driving factor in execution time.

There's also a latency gap between uncompressed Vortex with no pushdown and Arrow with no pushdown, but that gap is roughly the ~130µs it takes to do the Vortex -> Arrow conversion (I benchmarked that separately, not in the repo).

@robert3005 (Member) commented Jun 26, 2024

Right now we don't run the filters on compressed data, which would probably be the thing to fix. Anyway, this seems fixable.

@a10y (Contributor, Author) commented Jun 26, 2024

I agree. I'm going to address the last few comments of your original review and then convert this to "Ready for review".

a10y marked this pull request as ready for review on June 26, 2024 at 17:16
a10y changed the title from "[WIP] Late materialization with datafusion" to "v0 Datafusion with late materialization" on Jun 26, 2024
@robert3005 (Member) left a comment:

There are some style nits to avoid clones. We should figure out whether we really need to validate filter column references.

let filter_exprs: Option<Vec<Expr>> = if filters.is_empty() {
    None
} else {
    Some(filters.to_vec())
};

robert3005 (Member):

My earlier comment meant that you can get rid of this to_vec.

a10y (Contributor, Author):

done
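
For reference, a minimal sketch of the suggested change, assuming the filters slice can simply be borrowed for as long as it is needed (the helper name here is made up):

use datafusion::logical_expr::Expr;

// Borrow the filters instead of materializing a new Vec with to_vec().
fn filter_exprs(filters: &[Expr]) -> Option<&[Expr]> {
    (!filters.is_empty()).then_some(filters)
}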

@@ -261,7 +483,7 @@ impl ExecutionPlan for VortexMemoryExec {
            self.array.clone()
        };

-       Self::execute_single_chunk(chunk, &self.projection, context)
+       execute_unfiltered(&chunk, &self.scan_projection)

robert3005 (Member): Not sure why this change; you immediately clone the chunk in the function. How about we also change scan_projection to &[usize]?

a10y (Contributor, Author): Good point.

/// Check if the given expression tree can be pushed down into the scan.
fn can_be_pushed_down(expr: &Expr, schema_columns: &HashSet<String>) -> DFResult<bool> {
    // If the filter references a column not known to our schema, we reject the filter for pushdown.
    // TODO(aduffy): is this necessary? Under what conditions would this happen?

robert3005 (Member): Reading the docs, I don't think this check is necessary. TableProvider returns the schema, so DataFusion will know whether the query is valid or not.

a10y (Contributor, Author): I just removed it; if we end up needing to add it back later for some reason, we can reference this PR.
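
For context, a pushdown check without the column validation might look roughly like this (a sketch of the general technique, not the code that landed): walk the expression tree and accept only node kinds the scan knows how to evaluate.

use datafusion::logical_expr::{BinaryExpr, Expr, Operator};

/// Sketch: decide whether an expression can be evaluated inside the scan.
/// The accepted set of node kinds here is illustrative only.
fn can_be_pushed_down(expr: &Expr) -> bool {
    match expr {
        Expr::Column(_) | Expr::Literal(_) => true,
        Expr::BinaryExpr(BinaryExpr { left, op, right }) => {
            matches!(
                op,
                Operator::Eq
                    | Operator::NotEq
                    | Operator::Lt
                    | Operator::LtEq
                    | Operator::Gt
                    | Operator::GtEq
                    | Operator::And
                    | Operator::Or
            ) && can_be_pushed_down(left)
                && can_be_pushed_down(right)
        }
        _ => false,
    }
}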

a10y enabled auto-merge (squash) on June 26, 2024 at 18:02
a10y disabled auto-merge on June 26, 2024 at 18:02
a10y enabled auto-merge (squash) on June 26, 2024 at 18:02
a10y merged commit 40616b1 into develop on Jun 26, 2024 (2 checks passed)
a10y deleted the aduffy/df-pushdown-v0 branch on June 26, 2024 at 18:04