
Mask passwords and sensitive info in task logs and UI #15599

Merged (5 commits) on May 5, 2021

Conversation

ashb
Member

@ashb ashb commented Apr 29, 2021

Mask secret values from connections and variables

This masks secret values in logs for Connections and Variables.

It behaves as follows:

  • Connection passwords are always masked, wherever they appear.

    This means that if a connection has a password of a, then every a in
    log messages will be replaced with ***

  • "Sensitive" keys from extra_dejson are also masked. Sensitivity is
    determined by the existing mechanism the UI already used, based on the
    name of the key.

  • "Sensitive" Variables are also masked.
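The behaviour above boils down to a logging filter that rewrites each record before it is emitted. A minimal illustrative sketch (not Airflow's actual class -- the real implementation lives in airflow/utils/log/secrets_masker.py and additionally redacts structured arguments, dicts, and exception messages):

```python
import logging
import re


class SecretsMaskerSketch(logging.Filter):
    """Illustrative sketch of a secret-masking log filter.

    Every registered secret string is replaced with '***' in
    rendered log messages.
    """

    def __init__(self):
        super().__init__()
        self._patterns = set()

    def add_mask(self, secret):
        """Register a secret value (e.g. a connection password) to mask."""
        if secret:
            self._patterns.add(re.escape(str(secret)))

    def filter(self, record):
        if self._patterns:
            regex = re.compile("|".join(self._patterns))
            # Render the message first so secrets passed via %-args are caught too.
            record.msg = regex.sub("***", record.getMessage())
            record.args = None
        return True
```

Attaching a filter like this to the task-log handler is why even a print-ed password (which Airflow routes through logging) comes out as *** in the examples below.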

Still to do:

  • Add docs for this
  • Unit-test automatic masking from Connection properties
  • Unit-test automatic masking from Variables
  • Redact Rendered TI fields

Example

Given this dag:

import logging
from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.hooks.base import BaseHook
from airflow.models import Variable
from airflow.operators.bash import BashOperator

with DAG('my_dag', start_date=datetime(2021, 1, 1)) as dag:
    @task
    def foo():
        conn = BaseHook.get_connection('foo_bar')
        print(conn.password)
        logging.info("Password is %s", conn.password)
        logging.info("Variable %s is %s", "api_key", Variable.get("api_key"))
        logging.info("flibble is masked?")

    BashOperator(task_id='bash', bash_command='echo a bar "$foo"',
                 env={'foo': '{{ var.value.api_key }}'})

We get the following lines in the log for the Python task:

[2021-04-29 22:27:34,180] {base.py:78} INFO - Using connection to: id: foo_***. Host: example.com, Port: None, Schema: , Login: foo, Password: ***, extra: {'foo': '***', 'quxx': 'bazz'}
[2021-04-29 22:27:34,181] {logging_mixin.py:104} INFO - ***
[2021-04-29 22:27:34,181] {foo.py:18} INFO - Password is ***
[2021-04-29 22:27:34,182] {foo.py:19} INFO - Variable api_key is ***
[2021-04-29 22:27:34,183] {foo.py:20} INFO - *** is masked?

and for the Bash operator

[2021-04-30 13:36:27,085] {subprocess.py:63} INFO - Running command: ['bash', '-c', 'echo a bar "$foo"']
[2021-04-30 13:36:27,089] {subprocess.py:75} INFO - Output:
[2021-04-30 13:36:27,090] {subprocess.py:79} INFO - a bar ***

And in the UI it looks like this:

(screenshot)

Closes #8421

@boring-cyborg bot added labels area:CLI, area:core-operators (Operators, Sensors and hooks within Core Airflow), area:logging, area:webserver (Webserver related Issues) on Apr 29, 2021
@ashb ashb added this to the Airflow 2.1 milestone Apr 29, 2021
@ashb ashb removed labels area:CLI, area:core-operators on Apr 29, 2021
@kaxil kaxil mentioned this pull request Apr 29, 2021
@github-actions

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.


@task
def my_func():
from airflow.utils.log.secrets_masker import mask_secret
@ashb ashb (Member Author) commented Apr 30, 2021

I'm not 100% sold that this is the right place for the public API to live.

I'm half tempted to make this a lazy import in airflow/__init__.py so this becomes

Suggested change:
- from airflow.utils.log.secrets_masker import mask_secret
+ from airflow import mask_secret

Anyone have any thoughts?

Member

This doesn’t feel like a top level import to me. Maybe the import path should be shorter, but probably not that short.

Member Author

I couldn't think of any ideas for what a shorter import path might look like without being top level 😁

Member Author

I guess it could be airflow.utils.mask_secret

Member

Or airflow.log.mask_secret? (Is it generally frowned upon to add a new module under airflow?)

Member Author

Nope, though as soon as we do that I'm going to want to move a whole load of modules out of airflow.utils 😁 (probably no bad thing)

Member

I’m not going to complain if you do.

Member Author

Tell you what -- I'll leave this here for this PR, then follow up with a refactor.

Member Author

Naming: airflow.log vs airflow.logs vs airflow.logging

Any preference @uranusjr ?

@github-actions

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

@uranusjr
Member

Would this mask too much, and could it provide an attack vector by changing another field to contain the password string? For example, if I have username user_foo and password foo, the log would render the username as user_***, and by comparing it to publicly displayed fields in (say) the web UI I could recover the password string.

@ashb
Member Author

ashb commented Apr 30, 2021

@uranusjr Possibly, but I don't think we need to defend against that level of attack -- it would only show up in the UI like that if it matches a connection the task has accessed.

So yes, there's a theoretical attack surface here, but with the planned work of per-Connection ACLs etc., I think that is mitigated. Plus it only lets you confirm a password you already know, and if you can change the DAG code, there are easier ways of exfiltrating the credentials.

@ashb ashb marked this pull request as ready for review April 30, 2021 16:04
@ashb ashb changed the title Mask passwords Mask passwords and sensitive info in task logs and UI May 4, 2021
@ashb ashb requested review from potiuk and ephraimbuddy May 4, 2021 08:52
ashb added 3 commits May 4, 2021 10:24
This isn't used anywhere yet, but it is the first step in not printing
passwords in the logs
This masks secret values in logs for Connections and Variables.

It behaves as follows:

- Connection passwords are always masked, wherever they appear.

  This means, if a connection has a password of `a`, then _every_ `a` in
  log messages would get replaced with `***`

- "Sensitive" keys from extra_dejson are also masked. Sensitive is
  defined by the "existing" mechanism that the UI used, based upon the
  name of the key.

- "Sensitive" Variables are also masked.
airflow/utils/log/secrets_masker.py (outdated, resolved)
airflow/utils/log/secrets_masker.py (outdated, resolved)
airflow/www/views.py (outdated, resolved)


task.

It will also mask the value of a Variable, or the field of a Connection's extra JSON blob if the name contains
any words in ('password', 'secret', 'passwd', 'authorization', 'api_key', 'apikey', 'access_token'). This list
Member

Can we do an "include"/"exampleinclude" of it from the code, so that we don't have to maintain this list in two places if we add another word here in future?

Member Author

This was moved from elsewhere, so I'll leave it as it is for now
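For reference, the name-based check being discussed can be sketched as follows. This is a hypothetical helper, not Airflow's exact code: the default word list matches the one quoted in the docs diff above, and the real check also consults configuration (sensitive_var_conn_names extends the list).

```python
# Hypothetical sketch of the name-based sensitivity check described above.
DEFAULT_SENSITIVE_WORDS = {
    'password', 'secret', 'passwd', 'authorization',
    'api_key', 'apikey', 'access_token',
}


def should_hide_value_for_key(name):
    """Return True if a Variable name or extra-JSON field name looks sensitive."""
    name = str(name).strip().lower()
    # A field counts as sensitive if any sensitive word appears anywhere in it,
    # so e.g. "my_password" and "API_KEY" both match.
    return any(word in name for word in DEFAULT_SENSITIVE_WORDS)
```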

kaxil pushed a commit to astronomer/airflow that referenced this pull request May 11, 2021
This masks sensitive values in logs for Connections and Variables.

It behaves as follows:

- Connection passwords are always masked, wherever they appear.

  This means, if a connection has a password of `a`, then _every_ `a` in
  log messages would get replaced with `***`

- "Sensitive" keys from extra_dejson are also masked. Sensitive is
  defined by the "existing" mechanism that the UI used, based upon the
  name of the key.

- "Sensitive" Variables are also masked.

(cherry picked from commit d295e70)
(cherry picked from commit 76e3cc20606c05418d8824ec6b47629048c021cf)
kaxil pushed a commit to astronomer/airflow that referenced this pull request May 12, 2021
This masks sensitive values in logs for Connections and Variables.

It behaves as follows:

- Connection passwords are always masked, wherever they appear.

  This means, if a connection has a password of `a`, then _every_ `a` in
  log messages would get replaced with `***`

- "Sensitive" keys from extra_dejson are also masked. Sensitive is
  defined by the "existing" mechanism that the UI used, based upon the
  name of the key.

- "Sensitive" Variables are also masked.

(cherry picked from commit d295e70)
@thierryturpin

Maybe just have a quick check on this: on 2.1.0, with the setting hide_sensitive_var_conn_fields set to True, the connection extra information still appears in clear text. Thanks.

@ashb
Member Author

ashb commented Jun 11, 2021

Maybe just have a quick check on this: on 2.1.0, with the setting hide_sensitive_var_conn_fields set to True, the connection extra information still appears in clear text. Thanks.

Still have it where?

@raajpackt

hide_sensitive_var_conn_fields is set to True, but I am facing an issue where my connection password is being logged in Airflow. Is anyone else facing this issue?

I assume the above field is for the extra JSON. The documentation says connection passwords are masked by default, which is not the case for me. Using Airflow 2.1.

@ashb
Member Author

ashb commented Jun 23, 2021

Which log file?

@thierryturpin

for me it's in the task logs, example:
2021-06-22 17:30:06,353] {base.py:69} INFO - Using connection to: id: XYZ. Host: https://XYZ:8200, Port: None, Schema: , Login: , Password: None, extra: {'vault_role': 'RL123'}

@raajpackt

raajpackt commented Jun 23, 2021

Airflow DAG Task log file.
[2021-06-23 14:03:11,403] {subprocess.py:79} INFO - [2021-06-23 14:03:11,403] {base.py:78} INFO - Using connection to: id:

^ In this line it prints out the connection password as well.
The log is emitted when the get_connection(cls, conn_id: str) function is called: https://airflow.apache.org/docs/apache-airflow/stable/_modules/airflow/hooks/base.html

Quoting the source code which causes it to log in the DAG task log without masking conn.password

log.info(
    "Using connection to: id: %s. Host: %s, Port: %s, Schema: %s, Login: %s, Password: %s, "
    "extra: %s",
    conn.conn_id,
    conn.host,
    conn.port,
    conn.schema,
    conn.login,
    conn.password,
    conn.extra_dejson,
)

@ashb
Member Author

ashb commented Jun 24, 2021

for me it's in the task logs, example:
2021-06-22 17:30:06,353] {base.py:69} INFO - Using connection to: id: XYZ. Host: https://XYZ:8200, Port: None, Schema: , Login: , Password: None, extra: {'vault_role': 'RL123'}

@thierryturpin Nothing in this line looks sensitive -- what did you expect to be masked?

@raajpackt Can you give me a reproduction case that shows this behaviour please?

@thierryturpin

Sorry, my bad. My understanding was that instead of extra: {'vault_role': 'RL123'} I would see extra: XXX, i.e. that everything in the extra would be masked in the logs.

@ashb
Member Author

ashb commented Jun 24, 2021

@thierryturpin Ah, as part of this change it now aims to mask only actual sensitive values.

@thierryturpin

@ashb ok, thx for clarification.

@andydennehy

I'm having the same issue as @raajpackt, the passwords are showing in the logs when using get_connection().

@ashb
Member Author

ashb commented Jun 28, 2021

I'm having the same issue as @raajpackt, the passwords are showing in the logs when using get_connection().

Need more info please.

Do you have any custom logging configured? Can you provide a minimal reproduction case please?

@cpdean

cpdean commented Aug 3, 2021

I am trying to get my installation of airflow to mask passwords in the rendered template UI of a given task. It seems this pull request was merged and made available somewhere between versions 2.0.0 and 2.1.0

❯ git log 2.0.0..2.1.0 | rg 15599
    Mask passwords and sensitive info in task logs and UI (#15599)
commit 6019c78cb475800f58714a9dabb747b9415599c8

I have airflow 2.1.2, but I still see passwords being rendered and freely visible to anyone with access to airflow.

Is there something I have to do to enable masking passwords rendered in templates to the UI?

I'm using an Admin account locally to test.

Here is the variable I would like to be masked in templates, airflow correctly masks it in the variable UI

(screenshot)

On the rendered template page of a given task, however, it freely shows the password to me:

(screenshot)

I have airflow 2.1.2 installed, .release:2.1.2+d25854dd413aa68ea70fb1ade7fe01425f456192 per the screenshot from the UI footer:

(screenshot)

What am I doing wrong here? How can I allow dockerized tasks to have access to the systems they need without broadcasting the credentials to everyone?

Thanks!

@ashb
Member Author

ashb commented Aug 4, 2021

@cpdean Have you customized the Airflow logging at all?

@cpdean

cpdean commented Aug 4, 2021

I don't think that I have, but I'm not sure. Here's the logging stanza from my airflow.cfg. It was generated by whatever airflow does by default when you first run it.

[logging]
# The folder where airflow should store its log files
# This path must be absolute
base_log_folder = $AIRFLOW_HOME/logs

# Airflow can store logs remotely in AWS S3, Google Cloud Storage or Elastic Search.
# Set this to True if you want to enable remote logging.
remote_logging = False

# Users must supply an Airflow connection id that provides access to the storage
# location.
remote_log_conn_id =
remote_base_log_folder =
encrypt_s3_logs = False

# Logging level
logging_level = WARN

# Logging level for Flask-appbuilder UI
fab_logging_level = WARN

# Logging class
# Specify the class that will specify the logging configuration
# This class has to be on the python classpath
# Example: logging_config_class = my.path.default_local_settings.LOGGING_CONFIG
logging_config_class =

# Flag to enable/disable Colored logs in Console
# Colour the logs when the controlling terminal is a TTY.
colored_console_log = True

# Log format for when Colored logs is enabled
colored_log_format = [%%(blue)s%%(asctime)s%%(reset)s] {%%(blue)s%%(filename)s:%%(reset)s%%(lineno)d} %%(log_color)s%%(levelname)s%%(reset)s - %%(log_color)s%%(message)s%%(reset)s
colored_formatter_class = airflow.utils.log.colored_log.CustomTTYColoredFormatter

# Format of Log line
log_format = [%%(asctime)s] {%%(filename)s:%%(lineno)d} %%(levelname)s - %%(message)s
simple_log_format = %%(asctime)s %%(levelname)s - %%(message)s

# Log filename format
log_filename_template = {{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log
log_processor_filename_template = {{ filename }}.log
dag_processor_manager_log_location = $AIRFLOW_HOME/logs/dag_processor_manager/dag_processor_manager.log

# Name of handler to read task instance logs.
# Default to use task handler.
task_log_reader = task

In case it's relevant, I also have this in my airflow.cfg

[admin]
# UI to hide sensitive variable fields when set to True
hide_sensitive_variable_fields = True

@ppatel8-wooliex

You still need to set the two parameters below in airflow.cfg in order to hide the secret in logs and the rendered template.

hide_sensitive_var_conn_fields = True
sensitive_var_conn_names = 'password', 'secret', 'passwd', 'authorization', 'api_key', 'apikey', 'access_token'

@Jujubug88

Same issue as cpdean: I have hide_sensitive_var_conn_fields set to true with the sensitive_var_conn_names as you listed. I was under the impression those apply to the JSON extras, not the native connection's login/password.

It is the native connection's password that is being exposed in the logs.

@ppatel8-wooliex

@Jujubug88 Alternatively, you can try using your own masking as suggested here:
https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/index.html
The example below uses Vault, but I think you can mask any param value or connection string you want.

When you use Vault to get the secret, you can mask it before passing it as params. It will then display as *** in the log as well as in the rendered template.

You can use the code below to mask a secret fetched from Vault.

from textwrap import dedent

from airflow.operators.bash import BashOperator
from airflow.utils.log.secrets_masker import mask_secret

# `client` here is an already-configured Vault client
openssl_service_account_key_read_response = client.secrets.kv.read_secret_version(
    path=openssl_service_account_secret_path, mount_point=vault_mount_point
)
service_account_secret = openssl_service_account_key_read_response['data']['data'][openssl_service_account_key]
mask_secret(service_account_secret)

sample_commands = dedent('''
    set +x;
    pwd;
    cd /home/airflow/gcs/data/bin/
    chmod 775 -R .;
    ./bin/test.sh '{{ params.service_account_secret }}';
''')
task_1 = BashOperator(
    task_id='',
    bash_command=sample_commands,
    params={
        'service_account_secret': service_account_secret,
    },
)

@cpdean

cpdean commented Aug 13, 2021

you still need to set the below 2 parameters into airflow.cfg in order to hide the secret in logs and rendered template.

hide_sensitive_var_conn_fields = True
sensitive_var_conn_names = 'password', 'secret', 'passwd', 'authorization', 'api_key', 'apikey', 'access_token'

When I apply these settings under the [core] stanza (inferred from the default config template as per the changes in this PR), I see identical behavior: passwords are not masked in rendered templates on Airflow 2.1.2. These config options don't seem to have any effect on the code @ashb added in this PR.

@dstandish
Contributor

dstandish commented Aug 15, 2021

I am noticing in 2.1.2 that passwords are not masked when running in interactive console locally ... but in deployed environment they are 🤔

i.e. passwords from this log line in base hook:

            log.info(
                "Using connection to: id: %s. Host: %s, Port: %s, Schema: %s, Login: %s, Password: %s, "
                "extra: %s",
                conn.conn_id,
                conn.host,
                conn.port,
                conn.schema,
                conn.login,
                redact(conn.password),
                redact(conn.extra_dejson),
            )
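For readers wondering what redact() does in the snippet above, here is a rough sketch, assuming only the name-based sensitivity check discussed earlier in the thread. This is hypothetical, not Airflow's actual implementation -- the real function in airflow.utils.log.secrets_masker also masks values registered via mask_secret, regardless of key name.

```python
# Rough, hypothetical sketch of a redact() helper for structured values.
SENSITIVE_WORDS = {'password', 'secret', 'passwd', 'authorization',
                   'api_key', 'apikey', 'access_token'}


def _is_sensitive(name):
    return any(word in name.lower() for word in SENSITIVE_WORDS)


def redact(item, name=None):
    """Mask values whose key name looks sensitive, recursing into containers."""
    if name is not None and _is_sensitive(name):
        return '***'
    if isinstance(item, dict):
        return {key: redact(value, name=key) for key, value in item.items()}
    if isinstance(item, (list, tuple)):
        return type(item)(redact(value) for value in item)
    return item
```

Note that under this name-based sketch, redact(conn.password) with no key name would pass the value through unchanged; the value-based masking registered from the Connection is what catches bare passwords.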

@cpdean

cpdean commented Aug 20, 2021

Spent the day playing around with this some more. I've been able to eventually get passwords to mask, but sometimes they leak through. When a task has been queued, just before it starts running or before it has succeeded, the password is visible. What's going on here?

Task queued:
(screenshot)

Task finished:
(screenshot)

@Poovarasan-dv

@ashb I am using Airflow version 2.2.4 in an AKS cluster. In our Airflow logs, I can see the username and passwords for connections. I already encrypt the passwords using the Fernet key, and I also changed the Airflow conf file: sensitive_var_conn_names = comma,separated,sensitive,names,password,secret,extra,passwd

But I can still see the passwords in the Airflow logs. Can anyone help with how to mask connection passwords in the logs?

@potiuk
Member

potiuk commented Jun 12, 2022

I think (hard to say) that your problem may be solved when #24362 is merged.

Labels
area:logging, area:webserver (Webserver related Issues), full tests needed (We need to run full set of tests for this PR to merge)

Successfully merging this pull request may close these issues.

Hide sensitive data in UI