Need documentation on S3 use #26

Open
codebynumbers opened this issue Nov 23, 2016 · 4 comments
Comments

@codebynumbers
Contributor

It's not clear that TinyS3 is needed, and it's not obvious how to set the AWS keys without digging through the code. It would also be nice if it supported profiles like boto does, but that seems to be a limitation of TinyS3.
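For reference, a minimal sketch of the kind of usage the docs could show, based on tinys3's connection API (the key values, bucket, and paths here are placeholders):

```python
import tinys3

# tinys3 takes the AWS keys explicitly when the connection is created;
# it has no support for ~/.aws/credentials files or named profiles.
conn = tinys3.Connection(
    'YOUR_AWS_ACCESS_KEY',   # placeholder
    'YOUR_AWS_SECRET_KEY',   # placeholder
    tls=True,
)

# Download an object; tinys3 is built on requests, so get() returns a
# requests.Response whose .content holds the object bytes.
response = conn.get('path/to/input.csv', 'my-bucket')
with open('input.csv', 'wb') as f:
    f.write(response.content)
```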

@wdm0006
Owner

wdm0006 commented Nov 23, 2016

Thanks for the input; you're certainly right about the documentation. As for boto vs. tinys3, what extra would that allow that still matches how you would interact with Spark? I've not used profiles in either Spark or boto, so I'm not sure what that would look like.

@codebynumbers
Contributor Author

The nice thing boto allows you to do is avoid specifying your credentials in code at all. It reads them from files on disk (~/.aws/credentials). You can also set up multiple credential sets in a single file as profile sections, so you can specify the profile name to use in the code. This is helpful if you have multiple accounts/IAM roles and need to switch between them quickly.
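For example (a sketch only; this shows boto3 specifically, and the profile and bucket names are made up):

```python
# ~/.aws/credentials can hold several profiles, e.g.:
#
#   [default]
#   aws_access_key_id = ...
#   aws_secret_access_key = ...
#
#   [staging]
#   aws_access_key_id = ...
#   aws_secret_access_key = ...
#
# The code then only names the profile; no keys appear in it.
import boto3

session = boto3.Session(profile_name='staging')  # 'staging' is a made-up profile name
s3 = session.resource('s3')
bucket = s3.Bucket('my-bucket')  # placeholder bucket name
```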

When we use Spark, we usually configure it with IAM roles that control which S3 files it has access to, i.e. we are not embedding credentials in config files. I think the biggest hurdle was the documentation more than the profiles, though.

@wdm0006
Owner

wdm0006 commented Nov 23, 2016

Ah, I see what you mean; that does make sense. First let's tackle the documentation issue, though: I'd like to basically copy the pyspark docs for the implemented methods, because the idea is to work the same way. Separately, some examples for things like pulling files from S3 or accessing RDD data directly for debugging would be helpful, I think. Do you think that would have been enough in your situation?
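Roughly the shape such an example might take (a sketch only: it uses tinys3 for the S3 pull and pyspark's own SparkContext, since the idea is to match that interface; all keys, buckets, and paths are placeholders):

```python
import tinys3
from pyspark import SparkContext

# Pull a text file down from S3 first.
conn = tinys3.Connection('YOUR_AWS_ACCESS_KEY', 'YOUR_AWS_SECRET_KEY', tls=True)
with open('events.log', 'wb') as f:
    f.write(conn.get('logs/events.log', 'my-bucket').content)

# ...then load it through the pyspark-style API.
sc = SparkContext('local', 's3-debug-example')
rdd = sc.textFile('events.log')

# Accessing RDD data directly for debugging:
print(rdd.take(5))   # first five records
print(rdd.count())   # total record count
```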

@codebynumbers
Contributor Author

Yeah that would have been perfect.

