Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Configure HttpHook's auth_type from Connection #35591

Open
wants to merge 137 commits into
base: main
Choose a base branch
from

Conversation

Joffreybvn
Copy link
Contributor

@Joffreybvn Joffreybvn commented Nov 12, 2023

Hello,

This PR makes possible to setup and parameterize HttpHook and HttpAsyncHook's auth_type from the Connection UI.

Concretely, this PR:

  • Add two extra settings in Http 'Extra' Connection:
    • The reserved auth_type field to define a auth class
    • The reserved auth_kwargs field to provide a dict of extra parameters to the auth_type class.
  • The auth_type is validated against a list of Auth classes, to protect against code injection.
  • The list of Auth classes can be customized:
    • Via the airflow_local_settings.
    • Via the new AIRFLOW__HTTP__EXTRA_AUTH_TYPES config
  • The changes are applied to both HttpHook and HttpAsyncHook, via a new common mixin class.
  • The UI of the Http Connection is refactored:
    • To make those Auth classes more discoverable, with a SelectField for auth_type
    • To have a convenient dedicated field for auth_kwargs

Side effect of the UI changes: The Extra field was until now used to pass params to the Headers (anything in the Extra was passed to the Headers). But now, auth_kwargs and auth_type are also being written over there, which I don't find very convenient. Furthermore, this PR add logic to exclude those keys from the Headers (IMO this start to be a bit of tech-debt). And finally, user cannot pass a header named like those keys (it is unlikely, but it could happen).

Thus, I propose to deprecate headers parameters passed directly in the 'Extra' field. And to pass them via a dedicated "Headers" field.

UI:
Screenshot from 2024-01-05 09-05-03

Screenshot from 2024-01-05 09-04-59

Side effect:
Screenshot from 2024-01-05 09-11-34

I also tried to add a CodeMirrorField (for Headers and Auth kwargs), and a CollapsibleField (to hide Extra), but it was a bit too much compared to the initial goal of this PR. Maybe in a future one.

Use-case:

The auth_type is typically a subclass of request.AuthBase. Many custom Auth classes exist for many different protocols. Sometimes, passing only two hard-coded conn.username and conn.password is not enough: The Auth class expects more than two arguments.

Examples:

Right now, to deal with those cases, they are three possibilities:

  • Using functools.partial, like mentioned in this PR, in the dag file / in the operator declaration.
    Opinion: The dag developer should not care about handling the connection. He just want a working connection_id to call an endpoint (especially if its a beginner / low-experienced dev). Furthermore, some parameters are sensitive and cannot be written in a dag.
  • Writing a custom Hook, which dispatch the parameters from the Connection correctly (eventually using partial).
    Opinion: This is not okay. Other hooks are doing better. Take the ODBCHook, which allows to parameterize every aspect of the connection without subclassing anything:
    • Defining which driver has to be used
    • Defining connection schema for SQLAlchemy
    • Parameterize the driver via extra driver-specific parameters
    • Parameterize the behavior of pyodbc ("connect_kwargs").
    Everything can be controlled in the Connection UI ! I'd expect the HttpHook to behave similarly, and let me configure everything, which includes the underlying authentication.
  • Misusing the "username" and "password" fields of a Connection, to stack and pass multiple parameters in it + implementing a thin layer on top of a Auth class to re-dispatch the parameters.
    Opinion: This is definitively a bad workaround. I'm mentioning it because this PR won't entirely solve the issue, and this may (continue to) happen.

Coming to this PR, I propose to add two reserved field: "auth_type" and "auth_kwargs", which are passed to the underlying Auth class. No breaking change. This solve most of the issues: a partial is not needed anymore, a subclass is not needed anymore, and there are less cases where conn.username and conn.password will be misused.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@potiuk
Copy link
Member

potiuk commented Nov 15, 2023

This looks good in principle (small and simple, yet powerful), but it would require a more complete documentation and examples in order to be mergable.

@Joffreybvn
Copy link
Contributor Author

@potiuk What about a more aggressive PR, with most likely a breaking change ? (Sorry for the big chunk of text, here is the important part:)

This PR could copy further what the ODBCHook does, to:

  • Implement a way to set a custom Auth class from the Connection UI / conn.extra_json.
  • Remove the two hard-coded conn.username and conn.password from class instantiation to replaces them by keywords arguments, eventually customizable in conn.extra_json too.
  • Implement a way to customize proxy, ssl verify, and all other extra requests parameters from the Connection UI / conn.extra_json.

Would that be okay ?

@potiuk
Copy link
Member

potiuk commented Nov 15, 2023

  • Implement a way to set a custom Auth class from the Connection UI / conn.extra_json.

I'd say I am not so thrilled by the other option. and Certainly would not comment on it unless you show the code rather than explain in words what you really mean by changing it. I am not sure if you can pass in words what you want to do do without actually trying and implementing POC where you would show the code and we could assess how "breaking" it is. Airflow is used by 10s of thousands of enterprises and we cannot afford breaking changes that will make everyones workflows broken when migrating. We can break few peple workflows (this is inevitable) but not everyoene's

So before even attempting that, you should answer yourself a few questions. And decide if you want to go there at all.

  • Will that make everyone's connection stop working and require them to manually modify their connnections ? -> certainly NOT - probably not even in Airflow 3 (which we do not even plan yet).

  • Will there be change in any of the Public APIs that you can use to access connections ? -> absolutely notv in Airflow 2. This is forbidden by SemVer and we cannot change the API (incliuding Connection Python object that is retrieved by Hooks and operators) deliberately in Airflow 2.

  • Will there be any migration (automated or not) for the users? Will it handle all the cases?

  • what exactly do you want to achieve and Why you think is benefitial - i.e. is the chnge cost and potential problems justified by the benefit ?

  • How will it impact the UI and ways of defining connections via other mechanisms (env vars, secrets? )

....

I think most of thos questions are only worth looking in detail when there is at least Proof-Of-Concept where discussion can be done over the code rather than abstract concept of the change :) . Otherwise It will take too much time of those who review it to understand what you really want to do - having a code to look at is pretty much starting point of someone looking at proposing a change touching this part - part that is pretty much "core" of Airflow and part of Public Interface of Airflow: https://airflow.apache.org/docs/apache-airflow/stable/public-airflow-interface.html

@Joffreybvn Joffreybvn force-pushed the feature/http-extra-auth-parameter-in-connection branch from c8574c1 to ba6d715 Compare November 23, 2023 06:19
@Joffreybvn Joffreybvn marked this pull request as draft November 23, 2023 21:30
@Joffreybvn Joffreybvn force-pushed the feature/http-extra-auth-parameter-in-connection branch from 73a3cfe to 0daaa70 Compare November 23, 2023 23:26
@Joffreybvn Joffreybvn marked this pull request as ready for review November 24, 2023 00:20
@Joffreybvn
Copy link
Contributor Author

Joffreybvn commented Nov 24, 2023

Thanks for the detailed answer !

I won't go for proxy settings and extra parameters in this PR. Just for info, adding a tool like forwarder solve globally all proxy configuration issues. And disabling ssl "verify" can be done via the CURL_CA_BUNDLE trick instead of adding parameters to control that in the Connection.


For the rest, currently this PR:

  • Allows to configure the Auth class from the Connection
  • Allows to pass extra parameters to instantiate the Auth class from the Connection

There is no breaking change in Airflow. It may be a breaking change for user relying on the previous logic of the property (see below's code-review). But as the property was introduced recently, that won't break many users' worflows.


@auth_type.setter
def auth_type(self, v):
self._auth_type = v
Copy link
Contributor Author

@Joffreybvn Joffreybvn Nov 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced the auth_type property by a simple class attribute. In a previous PR (#29206 (review) - 8 months ago), it was introduced as a mechanism to detect if the user passed a custom auth class. I replaced that by the self._is_auth_type_setup attribute.

Why ? Maintaining this property means adding in it all the logic to load and return a auth_type potentially defined in the Connection. Which is useless regarding how this property is used in the rest of the codebase (here in livy, and in dbt hook).

@Joffreybvn Joffreybvn changed the title Pass parameters to HttpHook's auth_type from Connection Configure HttpHook's auth_type from Connection Nov 24, 2023
@Joffreybvn Joffreybvn force-pushed the feature/http-extra-auth-parameter-in-connection branch from 14d4b30 to af69628 Compare November 25, 2023 15:19
@potiuk
Copy link
Member

potiuk commented Nov 29, 2023

i just realised there is likely one big problem here - security.

While we cannot prevent it completely for some kind of connections (this is why Connection Editing user should be highly priviledged, introducing RCE deliberately is another thing.

If I understand correctly, someone who edits connection can decide which arbtirary class will be instantiated and executed when HTTP connection is established via HTTP Hook ? Which - if I understand correctly is basically a "no-go" - we removed a number of cases like that from the past from a number of providers precisely for that reason.

Is there any way we can make UI connection "declarative" for that? for example we could limit the list of predefined auth types we can choose. Does it make sense at all?

@Joffreybvn Joffreybvn marked this pull request as draft December 5, 2023 06:09
@Joffreybvn Joffreybvn marked this pull request as draft December 5, 2023 06:09
@Joffreybvn Joffreybvn force-pushed the feature/http-extra-auth-parameter-in-connection branch 3 times, most recently from 8ddedcf to b01d620 Compare December 6, 2023 19:25
@Joffreybvn
Copy link
Contributor Author

Joffreybvn commented Dec 6, 2023

Makes total sense ! I added a AIRFLOW__HTTP__EXTRA_AUTH_TYPES parameter for the Airflow 'Deployment manager' to control which classes can be set as "auth_type". By default, only classes from requests.auth and the ones added via this parameter can be used. If any other class is setup, the following message will appear in the logs, when executing the task:
MicrosoftTeams-image


Still WIP: I'm looking into customizing the UI to have special fields based on the extra params.

@Joffreybvn Joffreybvn force-pushed the feature/http-extra-auth-parameter-in-connection branch 3 times, most recently from 8763e34 to 11408ce Compare December 20, 2023 15:18
@Joffreybvn Joffreybvn force-pushed the feature/http-extra-auth-parameter-in-connection branch from c22fd95 to 29c3cff Compare December 23, 2023 13:25
@Joffreybvn Joffreybvn force-pushed the feature/http-extra-auth-parameter-in-connection branch 2 times, most recently from c9e5d10 to 53a38eb Compare January 5, 2024 07:49
@Joffreybvn Joffreybvn force-pushed the feature/http-extra-auth-parameter-in-connection branch from 7f0e8fb to 070be4b Compare January 8, 2024 18:21
@potiuk
Copy link
Member

potiuk commented Aug 2, 2024

@dabla - for this one we have to make sure that the tests are "compatible" with older versions of Airflow - because that is how we are testing compatibility of providers with older versions of Airflow - we run the tests after installing old airflow versions. So some assumptions that work in main version of airflow might not work. Which means that sometimes you might need to add tests/test_utils/compat.py code to make such test be compatible with Airflow 2.7+ and import / run some common test code from there.

Details here: https://github.com/apache/airflow/blob/main/contributing-docs/testing/unit_tests.rst#compatibility-provider-unit-tests-against-older-airflow-releases

@dabla
Copy link
Contributor

dabla commented Aug 3, 2024

@dabla - for this one we have to make sure that the tests are "compatible" with older versions of Airflow - because that is how we are testing compatibility of providers with older versions of Airflow - we run the tests after installing old airflow versions. So some assumptions that work in main version of airflow might not work. Which means that sometimes you might need to add tests/test_utils/compat.py code to make such test be compatible with Airflow 2.7+ and import / run some common test code from there.

Details here: https://github.com/apache/airflow/blob/main/contributing-docs/testing/unit_tests.rst#compatibility-provider-unit-tests-against-older-airflow-releases

Hello @potiuk , I know this is a tuff one on different levels. Will check the compat part and come back to it. For the backward compat with 2.7 I did the following in the Hook itself, maybe that's the thing that needs to be changed then:

class ConnectionWithExtra(Connection):
    """
    Patched Connection class added for backward compatibility.
    Implements the `get_extra_dejson` method which was added in the Connection class in Airflow 2.10.0.
    This patched class must be removed once the http provider depends on Airflow 2.10.0 or higher.
    """

    def __init__(
        self,
        conn_id: str | None = None,
        conn_type: str | None = None,
        description: str | None = None,
        host: str | None = None,
        login: str | None = None,
        password: str | None = None,
        schema: str | None = None,
        port: int | None = None,
        extra: str | dict | None = None,
        uri: str | None = None,
    ):
        super().__init__(conn_id, conn_type, description, host, login, password, schema, port, extra, uri)

    def get_extra_dejson(self, nested: bool = False) -> dict:
        """
        Deserialize extra property to JSON.
        :param nested: Determines whether nested structures are also deserialized into JSON (default False).
        """
        extra = {}

        if self.extra:
            try:
                if nested:
                    for key, value in json.loads(self.extra).items():
                        extra[key] = value
                        if isinstance(value, str):
                            with suppress(JSONDecodeError):
                                extra[key] = json.loads(value)
                else:
                    extra = json.loads(self.extra)
            except JSONDecodeError:
                self.log.exception("Failed parsing the json for conn_id %s", self.conn_id)

            # Mask sensitive keys from this list
            mask_secret(extra)

        return extra

    # Then in HttpHook

    @cached_property
    def connection(self) -> Connection:
        """
        Return a cached connection property.

        This method calls the original get_connection method from the BaseHook and returns a patched version
        of the Connection class which also has the `get_extra_dejson` method that has been added in Apache
        Airflow since 2.10.0. Once the provider depends on Airflow version 2.10.0 or higher, this method and
        the ConnectionWithExtra class can be removed.
        """
        conn = BaseHook.get_connection(conn_id=self.http_conn_id)

        return ConnectionWithExtra(
            conn_id=self.http_conn_id,
            conn_type=conn.conn_type,
            description=conn.description,
            host=conn.host,
            login=conn.login,
            password=conn.password,
            schema=conn.schema,
            port=conn.port,
            extra=conn.extra,
        )

I've also added a test on it to check when it has to be removed once http provider depends on Airflow 2.9.3 or higher.

@dabla
Copy link
Contributor

dabla commented Sep 3, 2024

@potiuk Maybe we can merge this one for Airflow 3.0.0 ? Ofc dunno what the implications will be regarding UI as it will not be FAB anymore, but I'm willing to help there also if needed.

@potiuk
Copy link
Member

potiuk commented Sep 5, 2024

@potiuk Maybe we can merge this one for Airflow 3.0.0 ? Ofc dunno what the implications will be regarding UI as it will not be FAB anymore, but I'm willing to help there also if needed.

I am not opposed but there is a change that would be required to be backported to Airflow 2 for that one to work (with select field not supported there) and I am not sure if we want to do it this way with FAB components being exposed via connection.

I'd say that one looks like a great candidate to implement as Airflow-3 feature only once we know how connection definition will look like for providers in Airlfow 3 react-ui world.

Also there are certain security implications here that we have to be aware of. The "connection configuration user" has special role in Airflow's security model https://airflow.apache.org/docs/apache-airflow/stable/security/security_model.html#connection-configuration-users as such user can do considerably more "security-related" actions and some of those actions might be considered "security vulnerabilities" if it can be done by users with lower priviledges (RCE, DOS etc. - especially when we have "Test connection" feature enabled, that is very, very sensitive subject).

So we should be really careful about anything that is exposed to the user who has "connection configuration" permissions. And I would not like "convenience" to trump "security" here - and by adding more features with back-compatibility with the old "code-bound" mechanism we increase the security attack surface.

Adding it in "declarative based future mechanism" which presumably will be much more secure by design seems like a better idea.

CC: @jscheffl @bbovenzi @Joffreybvn - WDYT?

@dabla dabla requested a review from jscheffl as a code owner October 8, 2024 19:15
@jscheffl
Copy link
Contributor

Oh, this is a longer thread. Took a moment to read it through.

As this PR is currently constructed it can not be merged - as we decided that Provider and Core changes need to be separated (See https://github.com/apache/airflow/blob/main/dev/README_AIRFLOW3_DEV.md#developing-for-providers-and-helm-chart).

Besides this (even if the PR is split between core and provider) the changes add a cross depency, the provider will only work with the Core adjustments (because the additional connection fields only will be accepted by a new core) and this would be a problem supporting old airflow versions. And the other way around, yes the changes in core would require to be in Airflow 3.0. Whereas the facility for extended connection fields is to be re-worked and to be designed - not there today.

I would not like to add features on the legacy, this would make the migration harder and changed anyway would never be released.

So I assume - even though the extensions look nice (not considering the security limits we need to cut off) it would need to be parked until all required mechanisms are in place for Airflow 3.

Would it make sense to "strip" the PR down to be 100% compliant to Airflow 2.8++?

@dabla
Copy link
Contributor

dabla commented Oct 18, 2024

Oh, this is a longer thread. Took a moment to read it through.

As this PR is currently constructed it can not be merged - as we decided that Provider and Core changes need to be separated (See https://github.com/apache/airflow/blob/main/dev/README_AIRFLOW3_DEV.md#developing-for-providers-and-helm-chart).

Besides this (even if the PR is split between core and provider) the changes add a cross depency, the provider will only work with the Core adjustments (because the additional connection fields only will be accepted by a new core) and this would be a problem supporting old airflow versions. And the other way around, yes the changes in core would require to be in Airflow 3.0. Whereas the facility for extended connection fields is to be re-worked and to be designed - not there today.

I would not like to add features on the legacy, this would make the migration harder and changed anyway would never be released.

So I assume - even though the extensions look nice (not considering the security limits we need to cut off) it would need to be parked until all required mechanisms are in place for Airflow 3.

Would it make sense to "strip" the PR down to be 100% compliant to Airflow 2.8++?

I think it would indeed be better to wait until providers are dependent on Airflow 3.x before merging this one. What do you think @joffreybienvenu-infrabel ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants