Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFusion TableProvider for memory arrays #384

Merged
merged 24 commits into from
Jun 18, 2024
Merged

DataFusion TableProvider for memory arrays #384

merged 24 commits into from
Jun 18, 2024

Conversation

a10y
Copy link
Contributor

@a10y a10y commented Jun 18, 2024

Implements a readonly DataFusion TableProvider around an array of DType::Struct rows. See tests in vortex-datafusion/lib.rs for example of how to read from a SQL context.

  • Adds new vortex-datafusion crate. Types are still private until we're more comfortable making them publicly exposed
  • Adds new cfg-flagged converters in vortex-dtype to convert from DType to arrow DataType and arrow Schema

Support for pushing down filters against Vortex arrays will come in a FLUP tomorrow

Cargo.toml Outdated Show resolved Hide resolved
@a10y a10y requested review from robert3005 and gatesn June 18, 2024 02:16
vortex_bail!(InvalidArgument: "only DType::Struct arrays can produce a table provider");
}

let arrow_schema = Schema::try_from(array.dtype())?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually removed this conversion since it is kind of lossy. You can't tell just from a Vortex DType the exact Arrow type, e.g. is it a String or a LargeString?

What we wanted to add was array.arrow_dtype() possibly related to the AsArrow compute function, or add an AsArrowDType compute function or something....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think VarBin and VarBinView are the only places where this is ambiguous, and for those you can introspect the encoding details. I think it makes most sense to just implement this as a method on Flattened...does that sound right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will happen more often - timestamp/date/time will be likely different. Similarly for list/fixedsizedlist/largelist/listview

vortex-datafusion/src/lib.rs Outdated Show resolved Hide resolved
vortex-datafusion/src/lib.rs Show resolved Hide resolved
vortex-datafusion/src/lib.rs Outdated Show resolved Hide resolved
vortex-datafusion/src/lib.rs Outdated Show resolved Hide resolved
Cargo.toml Outdated Show resolved Hide resolved
@a10y
Copy link
Contributor Author

a10y commented Jun 18, 2024

Per our huddle earlier, I've gone ahead and done the following (apologies it's split across a few commits now, if it's too hard to review I can try and stack the PRs more):

  • Removed the AsArrowArray trait entirely, instead centralized the implementation as internal to the Flattened type
  • Removed VarBinView as one of the Flattened variants. ArrayFlatten for VarBinViewArra will do a naive rebuild of the array as a fresh VarBinArray.
  • Moved conversion functions between Vortex and Arrow schema types out of vortex-dtype, instead placing it as a new module within the vortex-datafusion crate

vortex-array/src/array/struct/mod.rs Outdated Show resolved Hide resolved
vortex-array/src/array/varbinview/mod.rs Outdated Show resolved Hide resolved
vortex-array/src/flatten.rs Show resolved Hide resolved
vortex-array/src/flatten.rs Show resolved Hide resolved
vortex-array/src/flatten.rs Show resolved Hide resolved
@a10y
Copy link
Contributor Author

a10y commented Jun 18, 2024

I would like to rename Flatten to something else, but this PR has gotten big enough already :P

@a10y a10y enabled auto-merge (squash) June 18, 2024 21:43
@a10y a10y merged commit 5fbc1af into develop Jun 18, 2024
2 checks passed
@a10y a10y deleted the aduffy/datafusion branch June 18, 2024 21:46
@alamb
Copy link

alamb commented Jun 19, 2024

This is so cool -- great to see this 🚀

Let us know if there is anything we can do to help you along upstream in DataFusion

@a10y
Copy link
Contributor Author

a10y commented Jun 19, 2024

Thanks Andrew, and thanks for the amazing work you're doing with the DataFusion project!

I think the two big things we're tracking are:

  • Better support for StringView, which it seems like you're already tracking and the community is making progress on!
  • Longer term, we think we're going to want to do projection pushdown over nested schemas, just given our data model. It seems like currently the TableProvider doesn't provide hooks for doing that.

@alamb
Copy link

alamb commented Jun 19, 2024

Thanks Andrew, and thanks for the amazing work you're doing with the DataFusion project!

❤️

  • Better support for StringView, which it seems like you're already tracking and the community is making progress on!

Indeed -- it seem to be coming along very nicely. We are tracking in apache/datafusion#10918

  • Longer term, we think we're going to want to do projection pushdown over nested schemas, just given our data model. It seems like currently the TableProvider doesn't provide hooks for doing that.

When you say "projection pushdown" is this what you mean: apache/datafusion#2581 ? I just want to make sure the requests is tracked

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants