S3 Support #907

ehenry2 · 2021-08-19T14:30:11Z

Now that there is a PR in flight for remote storage systems (#811), I wanted to ask what the plan is for implementing S3 support. Is this something that will live within the DataFusion project or be maintained outside it? Also, does anyone plan on working on this, or is this something you are looking for a contributor for?

Also, I'm new to the project, so what is the best way to ask these kinds of questions: GitHub issues or the mailing list? Thanks!

Dandandan · 2021-08-19T15:39:02Z

My personal feeling is that we probably don't want to have every storage system inside the DataFusion crate by default but either:

- Keep separate crates, like datafusion-s3 which has implementations for some interfaces, and maybe also exposes some helper functions
- Have some optional features (e.g. s3) living inside the crate.

WDYT @alamb @houqp @andygrove @jorgecarleitao
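
To make the second option (an optional feature inside the crate) concrete, here is a minimal Rust sketch of how a Cargo feature could gate a backend at compile time. The feature name `s3` is taken from the comment above; the function `available_backends` is hypothetical and is not part of DataFusion, and the sketch assumes the feature is declared under `[features]` in the crate's Cargo.toml.

```rust
/// Lists the storage backends compiled into this build.
/// `available_backends` is a hypothetical helper, not a DataFusion API.
pub fn available_backends() -> Vec<&'static str> {
    // The local filesystem is always available.
    #[allow(unused_mut)] // `mut` is only needed when the `s3` feature is on
    let mut backends = vec!["local"];

    // This statement is compiled in only when the crate is built with
    // `--features s3` (assumes an `s3` feature is declared in Cargo.toml).
    #[cfg(feature = "s3")]
    backends.push("s3");

    backends
}
```

With the feature disabled, the S3 code (and, in a real crate, its optional dependencies) is left out of the build entirely. The separate-crate option gets the same effect by moving that code behind a crate boundary instead of a `cfg` attribute.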

jimexist · 2021-08-19T17:03:34Z

> My personal feeling is that we probably don't want to have every storage system inside the DataFusion crate by default but either:
>
> - Keep separate crates, like datafusion-s3 which has implementations for some interfaces, and maybe also exposes some helper functions
> - Have some optional features (e.g. s3) living inside the crate.
>
> WDYT @alamb @houqp @andygrove @jorgecarleitao

A separate crate is a good idea.

How would you think of keeping the code as part of the workspace?

ehenry2 · 2021-08-19T17:14:21Z

Another question I wanted to throw out, on implementation details (probably putting the cart before the horse, but I wanted to ask anyway): should we use Rusoto (in maintenance mode) or the new AWS Rust SDK (in alpha, not on crates.io until GA)? My current thinking is to use Rusoto for now, as it will probably be a while until the AWS SDK reaches GA, but I want to hear everyone's thoughts on this.

alamb · 2021-08-19T20:17:40Z

> Keep separate crates, like datafusion-s3 which has implementations for some interfaces, and maybe also exposes some helper functions

Yes I think separate crates is a very good idea to avoid datafusion requiring a massive dependency stack and to keep compile time reasonable.

> How would you think of keeping the code as part of the workspace?

I think this is reasonable for "core" integrations, though going this path also subjects those crates to the Apache Arrow governance model (among other things, a slower release cycle), which may not be needed here.

Maybe we could start it as its own repo / separate crate with whoever does it, and then we can figure out whether we want to bring it back into Apache Arrow, if at all.

> was should we use Rusoto (in maintenance mode) or the new AWS rust sdk (in Alpha, not on crates.io until GA). My current thinking is probably to use Rusoto for now, as it will probably be a while until the AWS SDK gets into GA, but want to hear everyone's thoughts on this.

FWIW we took this approach in IOx (starting with Rusoto) and it has been working well for us: https://github.com/influxdata/influxdb_iox/blob/main/object_store/Cargo.toml#L21-L23 (we have been using it in AWS for several months now and we have not hit any issues with it)

houqp · 2021-08-20T06:38:51Z

I also think a separate crate is a good idea. The design that @yjshen came up with made it very easy to implement different remote storage backends as self-contained plugins.

Maybe it's finally the time to create that general Rust database Github Org that @andygrove mentioned at apache/datafusion-sqlparser-rs#308 (comment)? We could use it as the incubator for various experiments that need collaborations instead of using personal namespace.

> How would you think of keeping the code as part of the workspace?

I think the upside of this is we won't need to deal with IP clearance if we ever need to donate it back to datafusion. The downside is what Andrew said earlier, slower release cycle due to apache governance overhead, which is not too bad for a stable project. For newer and fast evolving projects, it might slow things down unnecessarily.

yjshen · 2021-08-20T06:49:15Z

> Maybe it's finally the time to create that general Rust database Github Org that @andygrove mentioned at apache/datafusion-sqlparser-rs#308 (comment)? We could use it as the incubator for various experiments that need collaborations instead of using a personal namespace.

Wow, that's great!

yjshen · 2021-08-20T08:12:16Z

In case anyone is interested, I will also work on HDFS support right after this.

andygrove · 2021-08-20T13:29:59Z

> Maybe it's finally the time to create that general Rust database Github Org that @andygrove mentioned at sqlparser-rs/sqlparser-rs#308 (comment)? We could use it as the incubator for various experiments that need collaborations instead of using personal namespace.

If we are developing crates that are extensions to DataFusion then I would be nervous about creating a separate org & community with its own governance for these crates. I think it would be fine for generic database-related crates that don't have "arrow" or "datafusion" in the name, such as the sqlparser-rs crate.

We already have the ability to create arrow-experimental-* repos where we do not need so much governance. Perhaps we should try that route first?

ehenry2 · 2021-09-14T19:02:21Z

Wanted to reopen the discussion here...do we have a consensus on which approach we want to go with? Now that the first set of remote storage PRs are merged, this can now be implemented.

alamb · 2021-09-14T20:20:26Z

I think there is a consensus on the architecture within datafusion, now that #950 has been merged

I think we now would need to rework the various table providers (CsvExec, ParquetExec, etc) to use that interface rather than File or Read directly, and then we would be ready to create an S3 plugin for DataFusion

ehenry2 · 2021-09-14T23:10:41Z

@alamb That makes sense. My thought, though, is that S3 support can be worked on in parallel (although it cannot be used until the table providers are updated), as the scaffolding is there with the Object Reader and Object Store traits. In order to do that, however, either a repo would need to be created or one of the other approaches suggested earlier would need to be adopted.
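
To illustrate why a backend can be written before the table providers are updated, here is a minimal Rust sketch of the store/reader split. The two traits are simplified, synchronous stand-ins written for this example; they are not DataFusion's actual Object Store and Object Reader traits, whose signatures differ. `InMemoryStore` and `read_range` are likewise hypothetical, with the in-memory map standing in for a real S3 client.

```rust
use std::collections::BTreeMap;
use std::io::{self, Cursor, Read};
use std::sync::Arc;

/// Simplified stand-in for an "object reader": reads a single object.
pub trait ObjectReader {
    /// Total size of the object in bytes.
    fn length(&self) -> u64;
    /// Returns a reader over at most `length` bytes starting at `start`.
    fn chunk_reader(&self, start: u64, length: usize) -> io::Result<Box<dyn Read>>;
}

/// Simplified stand-in for an "object store": lists and opens objects.
pub trait ObjectStore {
    /// Lists the keys that start with `prefix`, in sorted order.
    fn list(&self, prefix: &str) -> io::Result<Vec<String>>;
    /// Opens a reader for the object stored under `key`.
    fn reader(&self, key: &str) -> io::Result<Box<dyn ObjectReader>>;
}

/// In-memory backend, used here in place of a real S3 client.
#[derive(Default)]
pub struct InMemoryStore {
    objects: BTreeMap<String, Arc<Vec<u8>>>,
}

impl InMemoryStore {
    pub fn put(&mut self, key: &str, data: &[u8]) {
        self.objects.insert(key.to_string(), Arc::new(data.to_vec()));
    }
}

struct InMemoryReader {
    data: Arc<Vec<u8>>,
}

impl ObjectReader for InMemoryReader {
    fn length(&self) -> u64 {
        self.data.len() as u64
    }

    fn chunk_reader(&self, start: u64, length: usize) -> io::Result<Box<dyn Read>> {
        // Clamp the requested range to the object's bounds.
        let start = (start as usize).min(self.data.len());
        let end = start.saturating_add(length).min(self.data.len());
        Ok(Box::new(Cursor::new(self.data[start..end].to_vec())))
    }
}

impl ObjectStore for InMemoryStore {
    fn list(&self, prefix: &str) -> io::Result<Vec<String>> {
        Ok(self.objects.keys().filter(|k| k.starts_with(prefix)).cloned().collect())
    }

    fn reader(&self, key: &str) -> io::Result<Box<dyn ObjectReader>> {
        let data = self.objects.get(key).cloned().ok_or_else(|| {
            io::Error::new(io::ErrorKind::NotFound, format!("no such object: {}", key))
        })?;
        Ok(Box::new(InMemoryReader { data }))
    }
}

/// A consumer (think: a CSV or Parquet scan) that only depends on the traits.
pub fn read_range(store: &dyn ObjectStore, key: &str, start: u64, length: usize) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    store.reader(key)?.chunk_reader(start, length)?.read_to_end(&mut out)?;
    Ok(out)
}
```

Because `read_range` sees only `&dyn ObjectStore`, a backend can be developed and tested against the traits on its own, and plugged into the scans once they stop using `File` or `Read` directly.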

houqp · 2021-09-15T06:07:04Z

Taking @andygrove 's comment into account, unless anyone volunteers to help create the arrow-experimental-datafusion-s3 repo, it might be easier to follow @jimexist 's suggestion to create and manage the s3 plugin crate inside datafusion's workspace. Downstream applications can pull in the plugin dependency by git hash if we are behind on official releases. Also I believe we can release alpha/beta versions to crates.io without requiring an official vote?

alamb · 2021-09-15T10:11:02Z

Starting to work on an experimental crate within arrow-datafusion seems like a good approach to me.

I haven't tracked / studied the requirements for releasing alpha/betas in fine detail, but I think as long as they are clearly marked as not official apache releases, it would be fine to put them on crates.io

rdettai · 2021-09-24T10:14:24Z

#1010 (more exactly rdettai#1) is taking care of re-organizing the TableProviders and ExecutionPlans to integrate the ObjectStore abstraction

matthewmturner · 2021-12-24T15:58:14Z

@alamb @houqp I'm really interested in getting s3 support added.

If no one else is working on this I should be able to pick it up soon (of course I'll likely have questions along the way). Based on the above, it seems the approach would be to add it as a crate in arrow-datafusion. Assuming this is still the case, is there anything else I should know before starting?

houqp · 2021-12-24T19:44:10Z

@matthewmturner we created https://github.com/datafusion-contrib to host these community maintained extensions so it's easier for us to evolve the core datafusion code base.

houqp · 2021-12-24T19:44:58Z

I also recommend taking a look at the new official AWS rust SDK, it seems a lot more mature since we started this discussion.

matthewmturner · 2021-12-24T21:50:18Z

> I also recommend taking a look at the new official AWS rust SDK, it seems a lot more mature since we started this discussion.

Ok. Will check it out.

matthewmturner · 2021-12-25T02:29:18Z

@houqp Separate but related to S3, I'm interested in adding the ability to register an AWS Glue catalog / database.

Is that type of functionality something I could expect DataFusion to have? If so, do you think it makes sense to bundle it with the S3 functionality? I would of course start with S3. This is just for planning purposes (i.e. choosing between Rusoto and the AWS SDK).

houqp · 2021-12-25T05:12:18Z

I think so, but i think the GLUE catalog integration should be implemented as a table provider plugin in a separate crate, not as part of the s3 object store plugin.

matthewmturner · 2021-12-26T03:18:35Z

@houqp for these cloud based services what do you think about a naming convention like the following:

datafusion-[cloud provider]-[service]

i.e.
datafusion-aws-s3
datafusion-aws-glue

houqp · 2021-12-26T05:10:40Z

My suggestion would be naming the repos by extension types, for example: datafusion-objectstore-s3, datafusion-table-glue, etc. But I don't have a strong opinion on this :)

matthewmturner · 2021-12-26T06:04:42Z

makes sense. do you think for glue it might be better to use catalog instead of table?

houqp · 2021-12-26T06:15:51Z

Yeah, i think that's a better name 👍

alamb · 2021-12-26T13:30:30Z

FWIW one example of using the aws s3 object store interface is in IOx: https://github.com/influxdata/influxdb_iox/blob/main/object_store/src/aws.rs in case that is helpful

matthewmturner · 2021-12-26T16:24:19Z

> FWIW one example of using the aws s3 object store interface is in IOx: https://github.com/influxdata/influxdb_iox/blob/main/object_store/src/aws.rs in case that is helpful

thank you

matthewmturner · 2022-01-04T16:17:07Z

@houqp @alamb I'm very interested in getting S3 / Glue support added to datafusion cli. Do you know of a way to feature gate other libraries without adding their code to the datafusion cli codebase? I'm just concerned that the datafusion cli codebase could become messy if we end up adding the code and a feature gate for each functionality like this (i.e. AWS S3, AWS Glue, Azure Blob, custom SQL, etc.). Or maybe that's okay? What do you think?

As an aside, and for your information, someone emailed me and wanted to add Azure Blob Storage support to DataFusion but said they could work on S3 first. I've created a repo in the datafusion-contrib organization (https://github.com/datafusion-contrib/datafusion-objectstore-s3) and will point them to the InfluxDB IOx implementation for inspiration.
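
On the feature-gating question above, one possible shape is a registry keyed by URI scheme, so that the CLI holds a single registration line per backend while the backend code lives in its own crate. The Rust sketch below is only an illustration under that assumption: `StorageBackend`, `BackendRegistry`, `LocalFs` and `FakeS3` are hypothetical names, not DataFusion or datafusion-cli APIs, and `FakeS3` stands in for a type that a plugin crate would export.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical minimal backend interface; a real plugin would expose far more.
pub trait StorageBackend: Send + Sync {
    fn name(&self) -> &'static str;
}

/// Maps a URI scheme ("s3", "file", ...) to the backend registered for it.
#[derive(Default)]
pub struct BackendRegistry {
    backends: HashMap<String, Arc<dyn StorageBackend>>,
}

impl BackendRegistry {
    /// Registers `backend` for `scheme` (case-insensitive) and returns the
    /// backend it replaced, if any.
    pub fn register(
        &mut self,
        scheme: &str,
        backend: Arc<dyn StorageBackend>,
    ) -> Option<Arc<dyn StorageBackend>> {
        self.backends.insert(scheme.to_ascii_lowercase(), backend)
    }

    /// Resolves the backend for a URI such as "s3://bucket/key".
    /// URIs without "://" are treated as local paths ("file" scheme).
    pub fn resolve(&self, uri: &str) -> Option<Arc<dyn StorageBackend>> {
        let scheme = uri.split_once("://").map(|(s, _)| s).unwrap_or("file");
        self.backends.get(&scheme.to_ascii_lowercase()).cloned()
    }
}

/// Always-available local filesystem backend.
pub struct LocalFs;

impl StorageBackend for LocalFs {
    fn name(&self) -> &'static str {
        "local"
    }
}

/// Stand-in for a backend type that an external plugin crate would provide.
pub struct FakeS3;

impl StorageBackend for FakeS3 {
    fn name(&self) -> &'static str {
        "s3"
    }
}
```

In this arrangement the only S3-specific code in the CLI would be a call like `registry.register("s3", Arc::new(FakeS3))`, which could itself sit behind a Cargo feature that pulls in the plugin crate as an optional dependency.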

alamb · 2022-01-04T21:43:22Z

FYI there is some great work from @yahoNanJing in this area on #1062

seddonm1 · 2022-01-05T03:01:12Z

Hi guys,
I built this, which works correctly and uses the official Amazon SDK: https://gist.github.com/seddonm1/2fb5a6892989fe7bf246022a7bd586ee

If it is useful I'm happy to donate it.

houqp · 2022-01-05T04:15:17Z

@seddonm1 I invited you to the datafusion contrib org as well, could you work with @matthewmturner to figure out how you want to collaborate on this?

matthewmturner · 2022-01-05T04:30:19Z

@houqp yup sounds good.

@seddonm1 that would be fantastic if you could donate it. do you have a preference for how you would like to do that? if you want to contribute directly i could work on setting up CI in the repo and testing what you've done on my side.

let me know how you would like to proceed and thanks so much for the donation!

seddonm1 · 2022-01-05T06:00:45Z

@houqp thanks have joined that org.
@matthewmturner I have updated my example with the tests I ran against the Minio docker container (https://docs.min.io/docs/minio-docker-quickstart-guide) (so ignore the passwords as they are just the suggested values). I suggest something similar for testing in CI.

I can do a PR tomorrow AU time.

gopik · 2022-01-05T06:06:35Z

Thanks @seddonm1. minio provides s3 api compatibility for azure blob storage. I'll try running the tests today with azure blob storage.

matthewmturner · 2022-01-05T06:57:15Z

@seddonm1 ok i will look into that. also once im comfortable with the usage i can work on the documentation.

from a code perspective, is there anything in particular that we should be aware of? i.e. any issues with aws rust sdk?

seddonm1 · 2022-01-05T22:26:41Z

for those waiting for this functionality a PR has been raised: datafusion-contrib/datafusion-objectstore-s3#2

If you want to help test you can follow the instructions in the PR and it should work. The abstraction is good @rdettai and will be even better once the async fn chunk_reader can be used 👍

nitisht · 2022-02-04T10:13:56Z

Folks, thank you for the great work on this issue. I see that the repo https://github.com/datafusion-contrib/datafusion-objectstore-s3 has the relevant code added. I am a little hazy on how to get these two projects (datafusion & datafusion-objectstore-s3) compiled and working together. Are there any pointers on this?

matthewmturner · 2022-02-04T15:32:00Z

@nitisht the README on https://github.com/datafusion-contrib/datafusion-objectstore-s3 has some examples of how to use it, and you could also look at our tests to see all the functionality that we have tested (https://github.com/datafusion-contrib/datafusion-objectstore-s3/blob/main/src/object_store/aws.rs).

If there is something that is not clear, or a functionality you need that we haven't yet added, could you please create an issue in that repo?

matthewmturner · 2022-02-04T15:33:51Z

@alamb @houqp im thinking we can close this issue now.

nitisht · 2022-02-05T05:34:48Z

> you could also look at our tests to see all functionality

Thank you for this pointer @matthewmturner. This is precisely what I was looking for; everything is working great in my local testing.

ehenry2 added the enhancement (New feature or request) label Aug 19, 2021

Igosuki mentioned this issue Aug 22, 2021: Read from remote file systems #925 (Closed)

yjshen mentioned this issue Sep 28, 2021: Add support of HDFS as remote object store #1060 (Open)

rdettai mentioned this issue Oct 7, 2021: Expose a static object store registry #1072 (Closed)

houqp mentioned this issue Oct 23, 2021: Datafusion integration assumes table's data files are local delta-io/delta-rs#43 (Closed)

yahoNanJing mentioned this issue Nov 1, 2021: Add support of HDFS as remote object store #1062 (Closed)

yjshen mentioned this issue Nov 6, 2021: Add support of HDFS as remote object store #1223 (Closed)

houqp closed this as completed Feb 5, 2022

tustvold mentioned this issue Apr 12, 2022: [datafusion-contrib] AWS Glue Integration #2206 (Open)
