Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

s3 support? #46

Closed
xiuliren opened this issue Sep 30, 2022 · 7 comments
Closed

s3 support? #46

xiuliren opened this issue Sep 30, 2022 · 7 comments

Comments

@xiuliren
Copy link

It seems that tensorstore does not support AWS S3, but only for GCS. Is there any plan to support it?

@jbms
Copy link
Collaborator

jbms commented Sep 30, 2022

Currently, read-only access to S3 is possible through the http kvstore driver if no authentication is required.

A PR to add S3 support would certainly be welcome. The actual operations should be quite straightforward to implement, similar to the gcs driver. The most complicated part would be handling authentication.

@xiuliren
Copy link
Author

thank you!

the http kvstore driver is good enough for me.

@tchaton
Copy link

tchaton commented Mar 23, 2023

Any updates ?

@sjperkins
Copy link
Contributor

sjperkins commented Mar 28, 2023

A PR to add S3 support would certainly be welcome. The actual operations should be quite straightforward to implement, similar to the gcs driver. The most complicated part would be handling authentication.

@jbms I had a look through the GCS driver. In terms of similarity, were you thinking of using the S3 Rest API?

Requests to Amazon S3 can be authenticated or anonymous. Authenticated access requires credentials that AWS can use to authenticate your requests. When making REST API calls directly from your code, you create a signature using valid credentials and include the signature in your request. For information about various authentication methods and signature calculations, see Authenticating Requests (AWS Signature Version 4).

Making REST API calls directly from your code can be cumbersome. It requires you to write the necessary code to calculate a valid signature to authenticate your requests. We recommend the following alternatives instead:

The AWS docs suggest using the AWS SDK's, perhaps using something like the TransferManager. In that case, a lot of the aws-sdk-cpp code could do much of the heavy lifting instead of using the infrastructure built for the GCS driver.

Use the AWS SDKs to send your requests (see Sample Code and Libraries). With this option, you don't need to write code to calculate a signature for request authentication because the SDK clients authenticate your requests by using access keys that you provide. Unless you have a good reason not to, you should always use the AWS SDKs.

Would it be more desirable to use the REST API for wider compatibility as the S3 is protocol implemented by other software, minio and ceph, for instance.

@jbms
Copy link
Collaborator

jbms commented Mar 28, 2023

Yes, I was thinking of using the REST API directly. Ultimately the computation of the aws signature only requires a small amount of code, and this would allow tensorstore to have full control over the API usage, and ensure all operations are implemented fully asynchronously. However, using the AWS C++ SDK for S3 (which I believe internally accesses the REST API via libcurl) could also be an option. It would probably require less "development" effort, but may require more effort to integrate the AWS SDK into our build system. I think that the AWS SDK can also be used with third-party implementations like minio and ceph. It appears that TransferManager is a higher-level API on top of the low-level S3 API provided by the AWS SDK; I think it would probably be better to use the lower-level API rather than TransferManager.

@jbms jbms reopened this Mar 28, 2023
@jbms
Copy link
Collaborator

jbms commented Mar 28, 2023

If you are interested in adding S3 support, it would certainly be a much-appreciated contribution and we'd be happy to provide guidance on the tensorstore codebase.

@sjperkins
Copy link
Contributor

Yes, I was thinking of using the REST API directly. Ultimately the computation of the aws signature only requires a small amount of code, and this would allow tensorstore to have full control over the API usage, and ensure all operations are implemented fully asynchronously. However, using the AWS C++ SDK for S3 (which I believe internally accesses the REST API via libcurl) could also be an option. It would probably require less "development" effort, but may require more effort to integrate the AWS SDK into our build system. I think that the AWS SDK can also be used with third-party implementations like minio and ceph. It appears that TransferManager is a higher-level API on top of the low-level S3 API provided by the AWS SDK; I think it would probably be better to use the lower-level API rather than TransferManager.

I made a PUT REST API call example where the AWS signature is calculated in Python -- easier to debug and to get an idea of what needs doing. I feel comfortable with the basics involved and it seems doable with the base tensorstore API.

Regarding the AWS C++ SDK, grafting the tensorflow/io bazel files onto the tensorstore build system seems to work, but hopefully this won't be necessary.

If you are interested in adding S3 support, it would certainly be a much-appreciated contribution and we'd be happy to provide guidance on the tensorstore codebase.

Yes, I'd like to have a go at this. I'll take another look at the C++ codebase and write up a basic implementation plan and some questions about it. For e.g.: HMAC-SHA256 support,, INI file parsing support for AWS credentials+config

@sjperkins sjperkins mentioned this issue Mar 31, 2023
23 tasks
copybara-service bot pushed a commit that referenced this issue Sep 1, 2023
Includes changes for includes, style, and some fixes from the following pull request:

#91

This is still a work in progress, so it is not yet available by default.
To use, add a bazel dependency on //tensorstore/kvstore/s3

#46

PiperOrigin-RevId: 562068999
Change-Id: Id660a72224ef865c858484c04985f9fc4f0e3bf5
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants