Datafusion integration assumes table's data files are local #43

Closed
sd2k opened this issue Dec 13, 2020 · 10 comments
Labels: binding/rust (Issues for the Rust crate), enhancement (New feature or request)

Comments

@sd2k
Contributor

sd2k commented Dec 13, 2020

The Datafusion integration passes a list of file paths representing a table's actual data to Datafusion's ParquetExec, but if the Delta table's StorageBackend is anything other than the FileStorageBackend then this fails because the files aren't local.

I'm not sure where this should be handled though - it feels like this should be part of Datafusion or an extension crate?
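
To make the mismatch concrete, here is a minimal, standard-library-only sketch. The `StorageBackend` trait below is a simplified, synchronous stand-in written for illustration and is not the actual delta-rs trait; `FakeRemoteBackend` and `read_with_local_only_reader` are likewise hypothetical names. It shows a reader that opens paths on the local filesystem failing on a URI that the table's own backend can serve.

```rust
use std::collections::HashMap;
use std::io;

/// Hypothetical, simplified storage abstraction (synchronous for brevity).
trait StorageBackend {
    fn get_obj(&self, path: &str) -> io::Result<Vec<u8>>;
}

/// Backend that reads objects from the local filesystem.
struct LocalBackend;

impl StorageBackend for LocalBackend {
    fn get_obj(&self, path: &str) -> io::Result<Vec<u8>> {
        std::fs::read(path)
    }
}

/// In-memory stand-in for a remote store such as S3 or Azure Blob Storage.
struct FakeRemoteBackend {
    objects: HashMap<String, Vec<u8>>,
}

impl StorageBackend for FakeRemoteBackend {
    fn get_obj(&self, path: &str) -> io::Result<Vec<u8>> {
        self.objects.get(path).cloned().ok_or_else(|| {
            io::Error::new(
                io::ErrorKind::NotFound,
                format!("no such object: {}", path),
            )
        })
    }
}

/// Mimics a reader that only accepts local file paths: it bypasses the
/// table's backend and goes straight to the filesystem.
fn read_with_local_only_reader(path: &str) -> io::Result<Vec<u8>> {
    std::fs::read(path)
}
```

Handing a path such as `s3://bucket/table/part-00000.parquet` to `read_with_local_only_reader` returns an error, while `FakeRemoteBackend::get_obj` returns the bytes for the same path. Any fix therefore has to route reads through the backend rather than through the filesystem.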

@houqp
Member

houqp commented Dec 13, 2020

Yeah, unfortunately, datafusion uses the arrow parquet readers, which only support local files at the moment: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/parquet.rs#L181. I think this is best handled by the rust parquet reader with minor adjustments to datafusion's execution plan after that.

@nevi-me has plans to add S3 support to the parquet reader. If you are interested in extending the reader to support S3 or other cloud storages, I would recommend collaborating with him :)

rtyler added the binding/rust and bug labels Dec 13, 2020

@sd2k
Contributor Author

sd2k commented Dec 14, 2020

> I think this is best handled by the rust parquet reader with minor adjustments to datafusion's execution plan after that.

Makes sense to me!

> @nevi-me has plans to add S3 support to the parquet reader. If you are interested in extending the reader to support S3 or other cloud storages, I would recommend collaborating with him :)

Sounds good, I'll keep an eye on it and try and contribute an Azure reader when the time comes.

@nevi-me
Contributor

nevi-me commented Dec 16, 2020

What could work in the interim is to use DataFusion's in-memory datasource (https://docs.rs/datafusion/2.0.0/datafusion/datasource/memory/index.html). Once the Parquet reader has async support, we can switch to the relevant methods.

rtyler added the enhancement label and removed the bug label Jan 10, 2021

@meastham

@nevi-me is there a bug anywhere to track S3 support? I took a brief look in the Arrow and Datafusion repos and didn't find anything. If you're open to it, it's something that we could potentially look into contributing.

@houqp
Member

houqp commented Jun 18, 2021

@meastham feel free to start a discussion for s3 support in the upstream datafusion github repo or in the arrow dev mailing list.

@gopik

gopik commented Oct 23, 2021

Given the object store support in datafusion, can a blob path integration be implemented, assuming we have an appropriate blob store implementation of the object_store interface?
https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/datasource/object_store/mod.rs

I understand that, given this, we can pass file names prefixed with the appropriate storage handler name from delta-rs. My question is whether the datafusion execution plan integration with this data source is complete or still in progress.
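
The idea of prefixing file names with a storage handler name can be sketched as a scheme-keyed registry. This is a standard-library-only illustration under stated assumptions: the `ObjectStore` trait, `Registry` type, and `resolve` method here are hypothetical and are not DataFusion's actual API. A URI is split at `://`, the scheme selects the registered store, and a path without a scheme falls back to the local store.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for an object store interface.
trait ObjectStore {
    fn name(&self) -> &'static str;
}

struct LocalStore;
struct S3Store;

impl ObjectStore for LocalStore {
    fn name(&self) -> &'static str {
        "local"
    }
}

impl ObjectStore for S3Store {
    fn name(&self) -> &'static str {
        "s3"
    }
}

/// Maps a URI scheme (the "storage handler name") to a store.
struct Registry {
    stores: HashMap<String, Arc<dyn ObjectStore>>,
}

impl Registry {
    /// Starts with only the local store, registered under "file".
    fn new() -> Self {
        let mut stores: HashMap<String, Arc<dyn ObjectStore>> = HashMap::new();
        stores.insert("file".to_string(), Arc::new(LocalStore));
        Registry { stores }
    }

    fn register(&mut self, scheme: &str, store: Arc<dyn ObjectStore>) {
        self.stores.insert(scheme.to_string(), store);
    }

    /// Splits `uri` into (store, path). Paths without a scheme are
    /// treated as local; unknown schemes yield `None`.
    fn resolve<'a>(&self, uri: &'a str) -> Option<(Arc<dyn ObjectStore>, &'a str)> {
        let (scheme, path) = match uri.split_once("://") {
            Some((scheme, path)) => (scheme, path),
            None => ("file", uri),
        };
        self.stores.get(scheme).map(|s| (Arc::clone(s), path))
    }
}
```

With an `S3Store` registered under `"s3"`, resolving `s3://bucket/table/part-0.parquet` selects that store and leaves `bucket/table/part-0.parquet` as the path, so delta-rs would only need to emit fully qualified URIs for each data file.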

@houqp
Member

houqp commented Oct 23, 2021

@gopik yes, we are waiting on upstream object store support for s3. The datafusion execution plan integration is complete other than partition column support, which should be fairly straightforward to add.

@gopik

gopik commented Oct 23, 2021

@houqp When you say upstream object store support for s3, will that be part of the datafusion project, or will it be part of an integration that embeds datafusion?

@houqp
Member

houqp commented Oct 23, 2021

@gopik it will be part of datafusion, see apache/datafusion#907

@roeap
Collaborator

roeap commented Sep 2, 2022

With the adoption of object_store, the datafusion integration now supports all storage backends - there are integration tests as well :).

https://github.com/delta-io/delta-rs/blob/main/rust/tests/integration_datafusion.rs

@roeap roeap closed this as completed Sep 2, 2022