
[AIRFLOW-6260] Drive _cmd config options by env var #6801

Merged: 1 commit into apache:master on Dec 17, 2019

Conversation

NBardelot
Contributor

This improves the ability to configure Airflow
using Kubernetes best practices. You can provide,
for example, AIRFLOW__CORE__SQL_ALCHEMY_CONN_CMD
referencing a shell script that computes the
connection string using Kubernetes secrets.
And that script can be provided to the container
using a ConfigMap.
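The mechanism described above can be sketched roughly as follows. This is an illustrative sketch, not Airflow's actual implementation: the helper name and resolution logic are assumptions, but the idea matches the PR, where an `AIRFLOW__<SECTION>__<KEY>_CMD` environment variable holds a shell command whose stdout becomes the option value.

```python
import os
import subprocess

def get_option_from_cmd_env(section, key):
    """Hypothetical sketch: if AIRFLOW__<SECTION>__<KEY>_CMD is set in the
    environment, run it as a shell command and use its stdout as the
    configuration value. Returns None when the variable is unset."""
    env_name = "AIRFLOW__{}__{}_CMD".format(section.upper(), key.upper())
    cmd = os.environ.get(env_name)
    if cmd is None:
        return None
    # The command would typically read a mounted Kubernetes secret;
    # here a plain echo stands in for such a script.
    return subprocess.check_output(cmd, shell=True).decode().strip()

os.environ["AIRFLOW__CORE__SQL_ALCHEMY_CONN_CMD"] = (
    "echo postgresql://user:pass@db/airflow"
)
print(get_option_from_cmd_env("core", "sql_alchemy_conn"))
```

In a Kubernetes deployment the `echo` would be replaced by a script shipped via a ConfigMap that reads the password from a mounted secret.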

@ashb
Member

ashb commented Dec 14, 2019

This looks like it's causing failures in the rest of the config tests -- can you check what's going on? You'll likely have to run them locally, as the test output doesn't include the full error for some reason.

@NBardelot
Contributor Author

NBardelot commented Dec 16, 2019

I wanted to reuse "testsection" as a config section, but it's checked in ways that are incompatible with my tests (I want a non-env option in testsection, while another test wants it to have only env options). So I'm putting my tests in a new testcmdenv section to avoid conflicts.
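A test of the kind described here could look roughly like the sketch below. The resolver, the allowlist, and the option names (`itsacommand`, `notacommand`) are illustrative assumptions, not the PR's actual test code; the point is that a `_CMD` env var for a non-allowlisted option must be ignored and the plain config value used instead.

```python
import os
import subprocess

# Hypothetical allowlist: only this option may be driven by a _CMD env var.
ALLOWED_CMD_OPTIONS = {("testcmdenv", "itsacommand")}

def resolve(section, key, config_value):
    """Return the command output for allowlisted options, otherwise the
    plain configuration value; never execute commands for other options."""
    cmd = os.environ.get(
        "AIRFLOW__{}__{}_CMD".format(section.upper(), key.upper())
    )
    if cmd is not None and (section, key) in ALLOWED_CMD_OPTIONS:
        return subprocess.check_output(cmd, shell=True).decode().strip()
    return config_value

# Allowlisted option: the command runs and its output wins.
os.environ["AIRFLOW__TESTCMDENV__ITSACOMMAND_CMD"] = "echo from-cmd"
assert resolve("testcmdenv", "itsacommand", "ignored") == "from-cmd"

# Non-allowlisted option: the env var is present but must NOT be executed.
os.environ["AIRFLOW__TESTCMDENV__NOTACOMMAND_CMD"] = "echo should-not-run"
assert resolve("testcmdenv", "notacommand", "from-config") == "from-config"
```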

@OmerJog
Contributor

OmerJog commented Dec 16, 2019

@NBardelot this is a code change, so you will need to create a Jira ticket for it (so the title won't be [AIRFLOW-XXX]).

This improves the ability to configure Airflow
using Kubernetes best practices. You can provide,
for example, AIRFLOW__CORE__SQL_ALCHEMY_CONN_CMD
referencing a shell script that computes the
connection string using Kubernetes secrets.
And that script can be provided to the container
using a ConfigMap.

Adds a unit test to check that an option that
should NOT be overridden by a command is correctly
read only from the configuration.
NBardelot changed the title from [AIRFLOW-XXX] Drive _cmd config options by env var to [AIRFLOW-6260] Drive _cmd config options by env var on Dec 16, 2019
@codecov-io

Codecov Report

No coverage uploaded for pull request base (master@fd124a5).
The diff coverage is 100%.


@@            Coverage Diff            @@
##             master    #6801   +/-   ##
=========================================
  Coverage          ?   84.29%           
=========================================
  Files             ?      676           
  Lines             ?    38362           
  Branches          ?        0           
=========================================
  Hits              ?    32338           
  Misses            ?     6024           
  Partials          ?        0
Impacted Files Coverage Δ
airflow/configuration.py 93.21% <100%> (ø)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fd124a5...521d0c2.

@potiuk
Member

potiuk commented Dec 17, 2019

I have some doubts about that one.

While I understand where it comes from, it has a number of drawbacks:

  • the script should work the same way from all the entities that can run it (webserver, scheduler, workers): with env variables you just need to make sure they are set the same way, whereas here you likely also have to set up some authentication mechanism on all those entities, or embed the credentials needed to read the credentials in the command itself.
  • it's security by obscurity: once you have access to read the variables and can start/stop Airflow or run the Airflow CLI, you should be able to execute those commands yourself and get the credentials anyway.
  • you open up the possibility of running some dangerous operations: it's enough to set a variable (which the user might not see) to get Airflow to execute an arbitrary command when it starts. Of course, if you can override someone's variables you can probably do a lot more, but it nevertheless looks dangerous.

I wonder what other committers think about it?

@NBardelot
Contributor Author

NBardelot commented Dec 17, 2019

@potiuk

I think all your points come down to evaluating the core concept of sensitive data management in containers. IMHO that's not really a debate (or it would imply having it with the Kubernetes project itself). It is a best practice that sensitive data should only be stored in secrets. Secrets and ConfigMaps are managed alongside the application's deployment: you would provide the same ConfigMaps and secrets to all your Airflow components, the same way you would provide them all with the same environment variables or airflow.cfg file.

With the current Airflow design, the only clean way to convey a configuration that includes sensitive data (user/password for the broker, the database, etc.) to Airflow containers in Kubernetes is to provide the whole airflow.cfg file as a secret. That comes with huge drawbacks:

  • all non-sensitive data cannot be read/analyzed readily anymore
  • you cannot reuse secrets (if the database password is present in two fields, that's two locations where you need to manage it... instead of one secret)
  • you need a tool to generate the airflow.cfg from a template, since you do not want to store the version with sensitive data anywhere in clear-text

It is far better to manage your database password in only one secret, and the script that computes the connection string as a configmap that does not contain the secret in clear-text.
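The pattern described above can be sketched as follows. The mount path, connection-string shape, and function name are assumptions for illustration: a tiny script, shipped to the container via a ConfigMap, reads the single mounted secret and prints the connection string on stdout, so it can serve as the command behind `AIRFLOW__CORE__SQL_ALCHEMY_CONN_CMD`.

```python
import tempfile

def build_conn_string(secret_path):
    """Compose the SQLAlchemy connection string from one secret file,
    e.g. a Kubernetes secret mounted at /var/run/secrets/airflow/db-password.
    Only the password lives in the secret; the rest is plain configuration."""
    with open(secret_path) as f:
        password = f.read().strip()
    return "postgresql://airflow:{}@postgres:5432/airflow".format(password)

# Demo with a temporary file standing in for the mounted secret:
with tempfile.NamedTemporaryFile("w", delete=False) as f:
    f.write("s3cr3t\n")
    path = f.name
print(build_conn_string(path))
```

Because the ConfigMap script contains no clear-text secret, it can be versioned and reviewed freely, while the password itself is managed in exactly one Kubernetes secret.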

@potiuk
Member

potiuk commented Dec 17, 2019

@NBardelot : I agree that in the container world that is how you keep secrets. And I understand your need.

But we seem to be adding a new generic feature that we may have to support in all future versions of Airflow. If people start using it, we should be careful, because the feature can be used in totally unexpected ways. I can imagine (for example) someone implementing a round-robin script that assigns different sql_alchemy_conn values to different instances in order to distribute traffic across several databases (been there, done that). While this is pretty powerful, it's also super dangerous for the maintenance and support of a product. I am pretty sceptical when a product says "and here you can run any script of arbitrary complexity to achieve whatever you want". That does not seem right, because people will abuse it and the product will have to support it indefinitely.

I don't like the "over-generic" approach here; it seems we are killing a fly with a cannon gun (as we say in Polish). But again, this is more to start a discussion than to completely abandon the approach. Maybe I am wrong and others have a different opinion. I would love to hear what others think (@ashb, @kaxil, @feluelle, @Fokko, @dimberman, @mik-laj?). Maybe this qualifies for a devlist discussion and should be moved there?

I totally agree that we should never keep secrets in the same config file together with other settings, but my question is: why do you want to keep secrets in the .cfg file (or, in your case, in an ENV variable) at all? Are there settings that you want kept secret that you cannot set otherwise? Which ones?

Currently the default way for Airflow to keep connection secrets is to store them in the connection database (which for security reasons can be encrypted with a symmetric Fernet key, so the secrets are not visible via the UI), not to keep them in the .cfg file nor in environment variables (the environment variables are mostly a convenience for testing and debugging). This is also security-by-obscurity in a way (but it protects from revealing the password in case someone gets access just to the DB). Maybe we should have a similar approach for anything secret that currently comes from the .cfg file? Or maybe we could use some of the "vault" APIs instead, whether via secrets in Kubernetes, KMS, Apache Vault, etc.; we had a discussion about it on the devlist, I recall. Maybe that's a better approach?

@NBardelot
Contributor Author

NBardelot commented Dec 17, 2019

@potiuk

Please consider that this patch doesn't fundamentally change what already exists in Airflow. You can already achieve the same behaviour by providing an airflow.cfg file that has a _cmd option calling a script. You could even write the full script as a one-liner in the _cmd option. This patch only makes the process smoother and more logical from the "you can override options with env vars" point of view.

Plus, think of it as a way to enable convention over configuration for sensitive data in the Helm chart. One could modify the existing Helm templates and values to standardize the mount path of secrets in Airflow containers, instead of setting multiple environment variables with clear-text sensitive data in the Helm values (see the airflow.mapenvsecrets value in the stable Airflow Helm chart).

For the moment the Helm values must be secured in some way because of those sensitive values. In the Kubernetes paradigm I think there should not even be a need for this: secrets are already there for this purpose, and the platform is responsible for choosing how they are stored (in a vault or whatever), because that is not really the concern of the person deploying Airflow.

seems we are killing a fly with a cannon gun (as we say in Polish)

Never heard that one in Polish (half-Pole here ^^), but it translates as well in French :)

@potiuk
Member

potiuk commented Dec 17, 2019

Geeee. Bad me... I entirely forgot about the _cmd option in Airflow. So it seems we are already past the dangerous line, where a user can not only write arbitrary Python code in DAGs but also arbitrary bash scripts in the configuration :).

Still, what I see is that we allow this only for those "secret-related" cases. From the docs:

The following config options support this _cmd version:

  • sql_alchemy_conn in [core] section
  • fernet_key in [core] section
  • broker_url in [celery] section
  • result_backend in [celery] section
  • password in [atlas] section
  • smtp_password in [smtp] section
  • bind_password in [ldap] section
  • git_password in [kubernetes] section

So if we allow this ENV variable thing here it should also be limited to those variables IMHO.
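Such a restriction can be expressed as a simple allowlist check, sketched below. The constant and function names are illustrative, not Airflow's internals; the pairs mirror the documented list above.

```python
# Only these (section, key) pairs may be driven by an
# AIRFLOW__<SECTION>__<KEY>_CMD environment variable; every other
# option ignores the _CMD mechanism entirely.
CMD_ALLOWED_OPTIONS = frozenset([
    ("core", "sql_alchemy_conn"),
    ("core", "fernet_key"),
    ("celery", "broker_url"),
    ("celery", "result_backend"),
    ("atlas", "password"),
    ("smtp", "smtp_password"),
    ("ldap", "bind_password"),
    ("kubernetes", "git_password"),
])

def cmd_env_allowed(section, key):
    """True only for the documented secret-related options."""
    return (section, key) in CMD_ALLOWED_OPTIONS
```

With such a gate, setting a `_CMD` env var for any other option has no effect, which keeps the attack surface limited to the options that were already executable via airflow.cfg.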

Never heard that one in Polish (half-Pole here ^^), but it translates as well in French :)

What is the French version then :) ?

@NBardelot
Contributor Author

NBardelot commented Dec 17, 2019

@potiuk

Yep, this list of options is the reason why, after @ashb asked me to add a test, I wrote one to ensure that only _cmd options are executed, and non-_cmd ones are not.

In French we have « écraser une mouche avec un marteau » ("to smash a fly with a hammer") ;-)
But there are a lot of variations with bulldozers, bazookas, etc.: https://fr.wiktionary.org/wiki/%C3%A9craser_une_mouche_avec_un_marteau

@potiuk
Member

potiuk commented Dec 17, 2019

Ah cool, indeed. Then I happily approve this one.

merci beaucoup

@potiuk potiuk merged commit b3e8470 into apache:master Dec 17, 2019
kaxil pushed a commit that referenced this pull request Dec 18, 2019
ashb pushed a commit that referenced this pull request Dec 19, 2019
KKcorps pushed a commit to KKcorps/airflow that referenced this pull request Dec 21, 2019
galuszkak pushed a commit to FlyrInc/apache-airflow that referenced this pull request Mar 5, 2020