pass aws credentials to the executor #387

Open
Liorba opened this issue Jun 26, 2019 · 3 comments

@Liorba

Liorba commented Jun 26, 2019

Hey, I'm trying to use the papermill airflow operator to read and execute notebooks with airflow.
In our company we use a central airflow instance, so we can't rely on having aws credentials in ~/.aws/credentials, and we can't use environment variables either.
Because our airflow instance is hosted in kubernetes, we can't rely on ec2 roles either.
Our most feasible option is to use airflow connections and explicitly pass aws credentials to papermill.
I assume different users will use different credentials, so being able to pass them explicitly would be broadly useful.

Currently, S3Handler takes no arguments, so there is no way to pass credentials through to the boto3 instance it abstracts.
IMO it would be a good addition to have.

@MSeal
Member

MSeal commented Jun 28, 2019

In general it's hard to make pass-throughs to the handlers, as they're initialized independently of the execution control plane. We could look at adding the ability to pass options from execute to the handlers, but I'd have to think about that some more. Today there are a few ways around this problem.

The first would be to map the credentials to ENV variables for the papermill execution; this doesn't appear to be possible with the airflow operator as implemented. The other supported pattern is to implement your own s3 handler and register it over the base handler, passing your credentials in however you deem fit; you can extend the existing S3 class fairly easily in the process. The third path, which is not a recommended pattern as we might change the contract in the future, is to generate the session, client, and s3 objects and assign them ahead of time over https://github.com/nteract/papermill/blob/master/papermill/s3.py#L143 so the class caches the connection objects.

I'd also suggest opening an airflow issue to see if we can make the operators able to configure iorw options a little more natively in the future.
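
A minimal sketch of the second approach described above (registering a custom handler over the built-in s3:// handler). It assumes papermill's `papermill_io.register()` hook and the read/write/listdir/pretty_path handler interface used by the built-in S3Handler; the handler class name, credential attributes, and path helper are placeholders, and how papermill resolves multiple handlers for the same scheme may depend on the version, so verify against your install:

```python
import boto3
from papermill.iorw import papermill_io


def _split_s3_path(path):
    # "s3://bucket/key" -> ("bucket", "key")
    bucket, _, key = path[len("s3://"):].partition("/")
    return bucket, key


class CredentialedS3Handler:
    # Hypothetical: populate these from an Airflow connection before execution.
    aws_access_key_id = None
    aws_secret_access_key = None

    @classmethod
    def _client(cls):
        return boto3.client(
            "s3",
            aws_access_key_id=cls.aws_access_key_id,
            aws_secret_access_key=cls.aws_secret_access_key,
        )

    @classmethod
    def read(cls, path):
        bucket, key = _split_s3_path(path)
        body = cls._client().get_object(Bucket=bucket, Key=key)["Body"]
        return body.read().decode("utf-8")

    @classmethod
    def write(cls, buf, path):
        bucket, key = _split_s3_path(path)
        cls._client().put_object(Bucket=bucket, Key=key, Body=buf.encode("utf-8"))

    @classmethod
    def listdir(cls, path):
        bucket, prefix = _split_s3_path(path)
        resp = cls._client().list_objects_v2(Bucket=bucket, Prefix=prefix)
        return ["s3://{}/{}".format(bucket, obj["Key"]) for obj in resp.get("Contents", [])]

    @classmethod
    def pretty_path(cls, path):
        return path


# Route s3:// paths through the custom handler (precedence among handlers
# registered for the same scheme may vary by papermill version).
papermill_io.register("s3://", CredentialedS3Handler)
```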

@Gurulhu

Gurulhu commented Jun 14, 2021

@MSeal wouldn't something like this work?
[screenshot of a proposed code change]

Since self.client will always be 's3' and self.s3 is fetched from params already, we can use a user-provided session instance, or follow the current flow if it's None.

Edit: Link to relevant code snippet

Edit 2: Since the S3 methods are only called deep in the stack, we could use either constants like papermill.AWS_KEY or env vars (like BOTO3_ENDPOINT_URL is already used) to set overridable Session() kwargs.
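
For what it's worth, a rough sketch of the env-var idea from Edit 2, assuming hypothetical `PAPERMILL_AWS_*` variable names (not something papermill currently reads) that would be merged into the boto3 Session the same way BOTO3_ENDPOINT_URL is today:

```python
import os

from boto3.session import Session


def _build_session():
    # Hypothetical variable names; boto3 falls back to its normal credential
    # chain for any that are unset.
    overrides = {
        "aws_access_key_id": os.environ.get("PAPERMILL_AWS_ACCESS_KEY_ID"),
        "aws_secret_access_key": os.environ.get("PAPERMILL_AWS_SECRET_ACCESS_KEY"),
        "aws_session_token": os.environ.get("PAPERMILL_AWS_SESSION_TOKEN"),
    }
    return Session(**{k: v for k, v in overrides.items() if v is not None})
```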

@MSeal
Member

MSeal commented Jun 16, 2021

@Gurulhu As your second edit notes, it's fairly deep in the stack, so the read call in the middle and the calls above don't pass arguments down. In theory, if you opened all those levels up to *args/**kwargs you could sort of do it, but there are some kwarg contentions therein that might get confusing. The ENV presence variable is a cleaner way to tackle this, though a static class variable (like S3.AWS_KEY) could be a similarly acceptable workaround. The airflow operator makes some of these options unhelpful, unless that construct has changed since last year.

I'd be happy with someone adding paths here that folks see as helpful, but passing args through the whole stack still feels awkward in the code, imho, if the static options aren't sufficient.
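
A sketch of the third path from the earlier comment (pre-seeding the cached connection objects), with the caveat repeated that this is explicitly not a supported contract. It assumes the class-level cache on papermill.s3.S3 is the s3_session tuple referenced at the linked line, so check your papermill version before relying on it; bucket paths and credentials are placeholders:

```python
import boto3
import papermill as pm
from papermill.s3 import S3

# Credentials would typically come from an Airflow connection; placeholders here.
session = boto3.Session(
    aws_access_key_id="...",
    aws_secret_access_key="...",
)

# Pre-seed the cached (session, client, resource) triple so S3() reuses it
# instead of building its own from the default credential chain.
S3.s3_session = (session, session.client("s3"), session.resource("s3"))

pm.execute_notebook(
    "s3://my-bucket/input.ipynb",
    "s3://my-bucket/output.ipynb",
)
```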
