
[AIRFLOW-6260] Drive _cmd config options by env var #6801

Merged: 1 commit into apache:master on Dec 17, 2019

Conversation

NBardelot
Contributor

This improves the ability to configure Airflow
using Kubernetes best practices. You can provide,
for example, AIRFLOW__CORE__SQL_ALCHEMY_CONN_CMD
referencing a shell script that computes the
connection string using Kubernetes secrets.
And that script can be provided to the container
using a ConfigMap.
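The mechanism described above can be sketched roughly as follows. This is an illustrative sketch, not Airflow's actual implementation: the helper name and resolution logic are assumptions, but the idea matches the PR, where an `AIRFLOW__<SECTION>__<KEY>_CMD` environment variable holds a shell command whose stdout becomes the option value.

```python
import os
import subprocess

def get_option_from_cmd_env(section, key):
    """Hypothetical sketch: if AIRFLOW__<SECTION>__<KEY>_CMD is set in the
    environment, run it as a shell command and use its stdout as the
    configuration value. Returns None when the variable is unset."""
    env_name = "AIRFLOW__{}__{}_CMD".format(section.upper(), key.upper())
    cmd = os.environ.get(env_name)
    if cmd is None:
        return None
    # The command would typically read a mounted Kubernetes secret;
    # here a plain echo stands in for such a script.
    return subprocess.check_output(cmd, shell=True).decode().strip()

os.environ["AIRFLOW__CORE__SQL_ALCHEMY_CONN_CMD"] = (
    "echo postgresql://user:pass@db/airflow"
)
print(get_option_from_cmd_env("core", "sql_alchemy_conn"))
```

In a Kubernetes deployment the `echo` would be replaced by a script shipped via a ConfigMap that reads the password from a mounted secret.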

@ashb
Member

ashb commented Dec 14, 2019

This looks like it's causing failures in the rest of the config tests -- can you check what's going on? You'll likely have to run them locally, as the test output doesn't include the full error for some reason.

@NBardelot
Contributor Author

NBardelot commented Dec 16, 2019

I wanted to reuse "testsection" as a config section, but it's checked in ways that are incompatible with my tests (I want a non-env option in testsection, while another test wants it to have only env options). So I'm putting my tests in a new testcmdenv section to avoid conflicts.
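A test of the kind described here could look roughly like the sketch below. The resolver, the allowlist, and the option names (`itsacommand`, `notacommand`) are illustrative assumptions, not the PR's actual test code; the point is that a `_CMD` env var for a non-allowlisted option must be ignored and the plain config value used instead.

```python
import os
import subprocess

# Hypothetical allowlist: only this option may be driven by a _CMD env var.
ALLOWED_CMD_OPTIONS = {("testcmdenv", "itsacommand")}

def resolve(section, key, config_value):
    """Return the command output for allowlisted options, otherwise the
    plain configuration value; never execute commands for other options."""
    cmd = os.environ.get(
        "AIRFLOW__{}__{}_CMD".format(section.upper(), key.upper())
    )
    if cmd is not None and (section, key) in ALLOWED_CMD_OPTIONS:
        return subprocess.check_output(cmd, shell=True).decode().strip()
    return config_value

# Allowlisted option: the command runs and its output wins.
os.environ["AIRFLOW__TESTCMDENV__ITSACOMMAND_CMD"] = "echo from-cmd"
assert resolve("testcmdenv", "itsacommand", "ignored") == "from-cmd"

# Non-allowlisted option: the env var is present but must NOT be executed.
os.environ["AIRFLOW__TESTCMDENV__NOTACOMMAND_CMD"] = "echo should-not-run"
assert resolve("testcmdenv", "notacommand", "from-config") == "from-config"
```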

@OmerJog
Contributor

OmerJog commented Dec 16, 2019

@NBardelot this is a code change, so you will need to create a Jira ticket for it (so the title won't be [AIRFLOW-XXX]).

This improves the ability to configure Airflow
using Kubernetes best practices. You can provide,
for example, AIRFLOW__CORE__SQL_ALCHEMY_CONN_CMD
referencing a shell script that computes the
connection string using Kubernetes secrets.
And that script can be provided to the container
using a ConfigMap.

Adds a unit test to check that an option that
should NOT be overridden by a command is correctly
read only from the configuration.
NBardelot changed the title from [AIRFLOW-XXX] Drive _cmd config options by env var to [AIRFLOW-6260] Drive _cmd config options by env var on Dec 16, 2019
@codecov-io

Codecov Report

No coverage uploaded for pull request base (master@fd124a5).
The diff coverage is 100%.


@@            Coverage Diff            @@
##             master    #6801   +/-   ##
=========================================
  Coverage          ?   84.29%           
=========================================
  Files             ?      676           
  Lines             ?    38362           
  Branches          ?        0           
=========================================
  Hits              ?    32338           
  Misses            ?     6024           
  Partials          ?        0
Impacted Files Coverage Δ
airflow/configuration.py 93.21% <100%> (ø)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fd124a5...521d0c2.

@potiuk
Member

potiuk commented Dec 17, 2019

I have some doubts about that one.

While I understand where it comes from, it has a number of drawbacks:

  • the script should work the same way from all the entities that can run it (webserver, scheduler, workers): with env variables you just need to make sure they are set the same way, whereas here you likely also have to set up some authentication mechanism on all those entities, or embed the credentials needed to read the credentials in the command itself.
  • it's security by obscurity: once you have access to read the variables and can start/stop Airflow or run the Airflow CLI, you should be able to execute those commands yourself and get the credentials anyway.
  • you open up the possibility of running some dangerous operations: it's enough to set a variable (which the user might not see) to get Airflow to execute an arbitrary command when it starts. Of course, if you can override someone's variables you can probably do a lot more, but it nevertheless looks dangerous.

I wonder what other committers think about it?

@NBardelot
Contributor Author

NBardelot commented Dec 17, 2019

@potiuk

I think all your points come down to evaluating the core concept of sensitive data management in containers. IMHO that's not really a debate (or it would imply having it with the Kubernetes project itself). It is a best practice that sensitive data should only be stored in secrets. Secrets and ConfigMaps are managed alongside the application's deployment: you would provide the same ConfigMaps and secrets to all your Airflow components, the same way you would provide them all with the same environment variables or airflow.cfg file.

With the current Airflow design, the only clean way to convey a configuration that includes sensitive data (user/password for the broker, the database, etc.) to Airflow containers in Kubernetes is to provide the whole airflow.cfg file as a secret. That comes with huge drawbacks:

  • all non-sensitive data cannot be read/analyzed readily anymore
  • you cannot reuse secrets (if the database password is present in two fields, that's two locations where you need to manage it... instead of one secret)
  • you need a tool to generate the airflow.cfg from a template, since you do not want to store the version with sensitive data anywhere in clear-text

It is far better to manage your database password in only one secret, and the script that computes the connection string as a configmap that does not contain the secret in clear-text.
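The pattern described above can be sketched as follows. The mount path, connection-string shape, and function name are assumptions for illustration: a tiny script, shipped to the container via a ConfigMap, reads the single mounted secret and prints the connection string on stdout, so it can serve as the command behind `AIRFLOW__CORE__SQL_ALCHEMY_CONN_CMD`.

```python
import tempfile

def build_conn_string(secret_path):
    """Compose the SQLAlchemy connection string from one secret file,
    e.g. a Kubernetes secret mounted at /var/run/secrets/airflow/db-password.
    Only the password lives in the secret; the rest is plain configuration."""
    with open(secret_path) as f:
        password = f.read().strip()
    return "postgresql://airflow:{}@postgres:5432/airflow".format(password)

# Demo with a temporary file standing in for the mounted secret:
with tempfile.NamedTemporaryFile("w", delete=False) as f:
    f.write("s3cr3t\n")
    path = f.name
print(build_conn_string(path))
```

Because the ConfigMap script contains no clear-text secret, it can be versioned and reviewed freely, while the password itself is managed in exactly one Kubernetes secret.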

@potiuk
Member

potiuk commented Dec 17, 2019

@NBardelot : I agree that in the container world that is how you keep secrets. And I understand your need.

But we seem to be adding a new generic feature that we may have to support in all future versions of Airflow. If people start using it, we should be careful, because the feature can be used in totally unexpected ways. I can imagine (for example) someone implementing a round-robin script that assigns different sql_alchemy_conn values to different instances in order to distribute traffic across several databases (been there, done that). While this is pretty powerful, it's also super dangerous for the maintenance and support of a product. I am pretty sceptical when a product says "and here you can run any script of arbitrary complexity to achieve whatever you want". That does not seem right, because people will abuse it and the product will have to support it indefinitely.

I don't like the "over-generic" approach here; it seems we are killing a fly with a cannon gun (as we say in Polish). But again, this is more to start a discussion than to completely abandon the approach. Maybe I am wrong and others have a different opinion. I would love to hear what others think (@ashb, @kaxil, @feluelle, @Fokko, @dimberman, @mik-laj?). Maybe this qualifies for a devlist discussion and should be moved there?

I totally agree that we should never keep secrets in the same config file together with other settings, but my question is: why do you want to keep secrets in the .cfg file (or, in your case, in an ENV variable) at all? Are there settings that you want kept secret that you cannot set otherwise? Which ones?

Currently the default way for Airflow to keep connection secrets is to store them in the connection database (which for security reasons can be encrypted with a symmetric Fernet key, so the secrets are not visible via the UI), not to keep them in the .cfg file nor in environment variables (the environment variables are mostly a convenience for testing and debugging). This is also security-by-obscurity in a way (but it protects from revealing the password in case someone gets access just to the DB). Maybe we should have a similar approach for anything secret that currently comes from the .cfg file? Or maybe we could use some of the "vault" APIs instead, whether via secrets in Kubernetes, KMS, Apache Vault, etc.; we had a discussion about it on the devlist, I recall. Maybe that's a better approach?

@NBardelot
Contributor Author

NBardelot commented Dec 17, 2019

@potiuk

Please consider that this patch doesn't fundamentally change what already exists in Airflow. You can already achieve the same behaviour by providing an airflow.cfg file that has a _cmd option calling a script. You could even write the full script as a one-liner in the _cmd option. This patch only makes the process smoother and more logical from the "you can override options with env vars" point of view.

Plus, think of it as a way to enable convention over configuration for sensitive data in the Helm chart. One could modify the existing Helm templates and values to standardize the mount path of secrets in Airflow containers, instead of setting multiple environment variables with clear-text sensitive data in the Helm values (see the airflow.mapenvsecrets value in the stable Airflow Helm chart).

For the moment the Helm values must be secured in some way because of those sensitive values. In the Kubernetes paradigm I think there should not even be a need for this: secrets are already there for this purpose, and the platform is responsible for choosing how they are stored (in a vault or whatever), because that is not really the concern of the person deploying Airflow.

seems we are killing a fly with a cannon gun (as we say in Polish)

Never heard that one in Polish (half-Pole here ^^), but it translates as well in French :)

@potiuk
Member

potiuk commented Dec 17, 2019

Geeee. Bad me... I entirely forgot about the _cmd option in Airflow. So it seems we are already past the dangerous line, where a user can not only write arbitrary Python code in DAGs but also arbitrary bash scripts in the configuration :).

Still, what I see is that we allow this only for those "secret-related" cases. From the docs:

The following config options support this _cmd version:

  • sql_alchemy_conn in [core] section
  • fernet_key in [core] section
  • broker_url in [celery] section
  • result_backend in [celery] section
  • password in [atlas] section
  • smtp_password in [smtp] section
  • bind_password in [ldap] section
  • git_password in [kubernetes] section

So if we allow this ENV variable thing here it should also be limited to those variables IMHO.
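Such a restriction can be expressed as a simple allowlist check, sketched below. The constant and function names are illustrative, not Airflow's internals; the pairs mirror the documented list above.

```python
# Only these (section, key) pairs may be driven by an
# AIRFLOW__<SECTION>__<KEY>_CMD environment variable; every other
# option ignores the _CMD mechanism entirely.
CMD_ALLOWED_OPTIONS = frozenset([
    ("core", "sql_alchemy_conn"),
    ("core", "fernet_key"),
    ("celery", "broker_url"),
    ("celery", "result_backend"),
    ("atlas", "password"),
    ("smtp", "smtp_password"),
    ("ldap", "bind_password"),
    ("kubernetes", "git_password"),
])

def cmd_env_allowed(section, key):
    """True only for the documented secret-related options."""
    return (section, key) in CMD_ALLOWED_OPTIONS
```

With such a gate, setting a `_CMD` env var for any other option has no effect, which keeps the attack surface limited to the options that were already executable via airflow.cfg.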

Never heard that one in Polish (half-Pole here ^^), but it translates as well in French :)

What is the French version then :) ?

@NBardelot
Contributor Author

NBardelot commented Dec 17, 2019

@potiuk

Yep, this list of options is the reason why, after @ashb asked me to add a test, I wrote one to ensure that only _cmd options are executed, and non-_cmd ones are not.

In French we have « écraser une mouche avec un marteau » ("to smash a fly with a hammer") ;-)
But there are a lot of variations with bulldozers, bazookas, etc.: https://fr.wiktionary.org/wiki/%C3%A9craser_une_mouche_avec_un_marteau

@potiuk
Member

potiuk commented Dec 17, 2019

Ah cool, indeed. Then I happily approve this one.

merci beaucoup

@potiuk potiuk merged commit b3e8470 into apache:master Dec 17, 2019
kaxil pushed a commit that referenced this pull request Dec 18, 2019
ashb pushed a commit that referenced this pull request Dec 19, 2019
KKcorps pushed a commit to KKcorps/airflow that referenced this pull request Dec 21, 2019
galuszkak pushed a commit to FlyrInc/apache-airflow that referenced this pull request Mar 5, 2020