Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Airflow 4923] Fix Databricks hook leaks API secret in logs #5635

Merged
merged 3 commits into from
Jul 24, 2019

Conversation

neil90
Copy link
Contributor

@neil90 neil90 commented Jul 22, 2019

When users store token in the extra field for a Databricks connection the DatabricksHook leaks the token to the airflow logs. Adding ability(and updating docs) for the hook to get the host from the extra json will not cause the Basehook.get_connection to send the extra json to the airflow logs since 'host' field will be empty.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"

https://issues.apache.org/jira/browse/AIRFLOW-4923

Description

  • Here are some details about my PR, including screenshots of any UI changes:

When users store token in the extra field for a Databricks connection the DatabricksHook leaks the token to the airflow logs. Adding ability(and updating docs) for the hook to get the host from the extra json will not cause the Basehook.get_connection to send the extra json to the airflow logs since 'host' field will be empty.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Updated testcase to look for host in extra json
    tests.contrib.hooks.test_databricks_hook:DatabricksHookTokenTest.test_submit_run

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
    • If you implement backwards incompatible changes, please leave a note in the Updating.md so we can assign it to a appropriate release

Code Quality

  • Passes flake8

Copy link

@zenyui zenyui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

Two questions:

  1. do you know why BaseHook.get_connection has this behavior of logging connections ONLY when there is a host set? Is this going to be true long term?
  2. would you guys consider having a third option like setting username to token and using the password field? I believe this JSON filed is always visible in plaintext in the airflow UI. (Can be a future PR, just a thought).

@neil90
Copy link
Contributor Author

neil90 commented Jul 22, 2019

Hi @zenyui

  1. the BaseHook.get_connection once it gets the connection info checks to see if the host field exists and if it does it logs the extra field - Link

  2. Currently users can do something similar which is put their login name into the username field and the token into password field.

Basically it is now similar to aws_hook functionality

@dimberman dimberman merged commit db770cf into apache:master Jul 24, 2019
serkef pushed a commit to serkef/airflow that referenced this pull request Jul 24, 2019
)

* Update databricks operator

* Updated token auth to get from extra_dejson

* Update test DatabricksHookTokenTest to use get host from 'extra'
wmorris75 pushed a commit to modmed/incubator-airflow that referenced this pull request Jul 29, 2019
)

* Update databricks operator

* Updated token auth to get from extra_dejson

* Update test DatabricksHookTokenTest to use get host from 'extra'
ashb pushed a commit to ashb/airflow that referenced this pull request Oct 16, 2019
)

* Update databricks operator

* Updated token auth to get from extra_dejson

* Update test DatabricksHookTokenTest to use get host from 'extra'

(cherry picked from commit db770cf)
@neil90
Copy link
Contributor Author

neil90 commented Aug 19, 2020

Hi @zenyui

  1. the BaseHook.get_connection once it gets the connection info checks to see if the host field exists and if it does it logs the extra field - Link
  2. Currently users can do something similar which is put their login name into the username field and the token into password field.

Basically it is now similar to aws_hook functionality

If somebody stumbles upon this PR, Ignore point 2, it is incorrect. Putting Login and Password causes it to Authenticate via Basic Auth. There can be a future PR to add this ability via adding an some identifier in the extra json to let us know we are using the token in the password field and authenticate via token rather then Basic Auth which keeps the token not in plain text.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants