Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change dataset URI validation to raise warning instead of error in Airflow 2.9 #39670

Merged
merged 3 commits into from
May 17, 2024

Conversation

tatiana
Copy link
Contributor

@tatiana tatiana commented May 16, 2024

Closes: #39486

Context

Valid DAGs that worked in Airflow 2.8.x and had tasks with outlets with specific URIs, such as Dataset("postgres://postgres:5432/postgres.dbt.stg_customers"), stopped working in Airflow 2.9.0 & Airflow 2.9.1, after #37005 was merged.

This was a breaking change in an Airflow minor version. We should avoid this.

Airflow < 3.0 should raise a warning, and from Airflow 3.0, we can make errors by default. We can have a feature flag to allow users who want to see this in advance to enable errors in Airflow 2. x, but this should not be the default behaviour.

The DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration).

How to reproduce

By running the following DAG with apache-airflow==2.9.1 and apache-airflow-providers-postgres==5.11.0, as an example:

from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator



with DAG(dag_id='empty_operator_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:

    task1 = EmptyOperator(
        task_id='empty_task1',
        dag=dag,
        outlets=[Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")]
    )

    task2 = EmptyOperator(
        task_id='empty_task2',
        dag=dag
    )

    task1 >> task2

Causes to the exception:

Broken DAG: [/usr/local/airflow/dags/example_issue.py]
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/datasets/__init__.py", line 81, in _sanitize_uri
    parsed = normalizer(parsed)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/postgres/datasets/postgres.py", line 34, in sanitize_uri
    raise ValueError("URI format postgres:// must contain database, schema, and table names")
ValueError: URI format postgres:// must contain database, schema, and table names

About the changes introduced

This PR introduces the following:

  1. A boolean configuration within [core], named strict_dataset_uri_validation, which should be False by default.

  2. When this configuration is False, Airflow should raise a warning saying:

From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX.
  1. If this configuration is True, Airflow should raise the exception, as it does now in Airflow 2.9.0 and 2.9.1

  2. From Airflow 3.0, we change this configuration to be True by default.

@tatiana tatiana marked this pull request as ready for review May 16, 2024 14:24
@tatiana tatiana changed the title Change dataset validation to raise warning Change dataset URI validation to raise warning instead of error in Airflow 2.9 May 16, 2024
@pankajkoti pankajkoti added this to the Airflow 2.9.2 milestone May 16, 2024
@pankajkoti
Copy link
Member

Since we have multiple approvals on this, I'm merging this. However, please drop a comment here in case someone has comments and we can address those in a subsequent PR.

@pankajkoti pankajkoti merged commit a07d799 into apache:main May 17, 2024
42 checks passed
@pankajkoti pankajkoti deleted the fix-39486 branch May 17, 2024 06:51
RNHTTR pushed a commit to RNHTTR/airflow that referenced this pull request Jun 1, 2024
…rflow 2.9 (apache#39670)


Closes: apache#39486

# Context

Valid DAGs that worked in Airflow 2.8.x  and had tasks with outlets with specific URIs, such as `Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")`, stopped working in Airflow 2.9.0 & Airflow 2.9.1, after apache#37005 was merged.

This was a breaking change in an Airflow minor version. We should avoid this.

Airflow < 3.0 should raise a warning, and from Airflow 3.0, we can make errors by default. We can have a feature flag to allow users who want to see this in advance to enable errors in Airflow 2. x, but this should not be the default behaviour.

The DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration).

# How to reproduce

By running the following DAG with `apache-airflow==2.9.1` and `apache-airflow-providers-postgres==5.11.0`, as an example:
```
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator



with DAG(dag_id='empty_operator_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:

    task1 = EmptyOperator(
        task_id='empty_task1',
        dag=dag,
        outlets=[Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")]
    )

    task2 = EmptyOperator(
        task_id='empty_task2',
        dag=dag
    )

    task1 >> task2
```

Causes to the exception:
```
Broken DAG: [/usr/local/airflow/dags/example_issue.py]
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/datasets/__init__.py", line 81, in _sanitize_uri
    parsed = normalizer(parsed)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/postgres/datasets/postgres.py", line 34, in sanitize_uri
    raise ValueError("URI format postgres:// must contain database, schema, and table names")
ValueError: URI format postgres:// must contain database, schema, and table names
```

# About the changes introduced

This PR introduces the following:

1. A boolean configuration within `[core],` named `strict_dataset_uri_validation,` which should be `False` by default.

2. When this configuration is `False,` Airflow should raise a warning saying:
```
From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX.
```

3. If this configuration is `True,` Airflow should raise the exception, as it does now in Airflow 2.9.0 and 2.9.1

4. From Airflow 3.0, we change this configuration to be `True` by default.
@utkarsharma2 utkarsharma2 added type:improvement Changelog: Improvements type:bug-fix Changelog: Bug Fixes and removed type:improvement Changelog: Improvements labels Jun 3, 2024
ephraimbuddy pushed a commit that referenced this pull request Jun 4, 2024
…rflow 2.9 (#39670)

Closes: #39486

Valid DAGs that worked in Airflow 2.8.x  and had tasks with outlets with specific URIs, such as `Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")`, stopped working in Airflow 2.9.0 & Airflow 2.9.1, after #37005 was merged.

This was a breaking change in an Airflow minor version. We should avoid this.

Airflow < 3.0 should raise a warning, and from Airflow 3.0, we can make errors by default. We can have a feature flag to allow users who want to see this in advance to enable errors in Airflow 2. x, but this should not be the default behaviour.

The DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration).

By running the following DAG with `apache-airflow==2.9.1` and `apache-airflow-providers-postgres==5.11.0`, as an example:
```
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

with DAG(dag_id='empty_operator_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:

    task1 = EmptyOperator(
        task_id='empty_task1',
        dag=dag,
        outlets=[Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")]
    )

    task2 = EmptyOperator(
        task_id='empty_task2',
        dag=dag
    )

    task1 >> task2
```

Causes to the exception:
```
Broken DAG: [/usr/local/airflow/dags/example_issue.py]
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/datasets/__init__.py", line 81, in _sanitize_uri
    parsed = normalizer(parsed)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/postgres/datasets/postgres.py", line 34, in sanitize_uri
    raise ValueError("URI format postgres:// must contain database, schema, and table names")
ValueError: URI format postgres:// must contain database, schema, and table names
```

This PR introduces the following:

1. A boolean configuration within `[core],` named `strict_dataset_uri_validation,` which should be `False` by default.

2. When this configuration is `False,` Airflow should raise a warning saying:
```
From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX.
```

3. If this configuration is `True,` Airflow should raise the exception, as it does now in Airflow 2.9.0 and 2.9.1

4. From Airflow 3.0, we change this configuration to be `True` by default.

(cherry picked from commit a07d799)
ephraimbuddy pushed a commit that referenced this pull request Jun 5, 2024
…rflow 2.9 (#39670)

Closes: #39486

Valid DAGs that worked in Airflow 2.8.x  and had tasks with outlets with specific URIs, such as `Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")`, stopped working in Airflow 2.9.0 & Airflow 2.9.1, after #37005 was merged.

This was a breaking change in an Airflow minor version. We should avoid this.

Airflow < 3.0 should raise a warning, and from Airflow 3.0, we can make errors by default. We can have a feature flag to allow users who want to see this in advance to enable errors in Airflow 2. x, but this should not be the default behaviour.

The DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration).

By running the following DAG with `apache-airflow==2.9.1` and `apache-airflow-providers-postgres==5.11.0`, as an example:
```
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

with DAG(dag_id='empty_operator_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:

    task1 = EmptyOperator(
        task_id='empty_task1',
        dag=dag,
        outlets=[Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")]
    )

    task2 = EmptyOperator(
        task_id='empty_task2',
        dag=dag
    )

    task1 >> task2
```

Causes to the exception:
```
Broken DAG: [/usr/local/airflow/dags/example_issue.py]
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/datasets/__init__.py", line 81, in _sanitize_uri
    parsed = normalizer(parsed)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/postgres/datasets/postgres.py", line 34, in sanitize_uri
    raise ValueError("URI format postgres:// must contain database, schema, and table names")
ValueError: URI format postgres:// must contain database, schema, and table names
```

This PR introduces the following:

1. A boolean configuration within `[core],` named `strict_dataset_uri_validation,` which should be `False` by default.

2. When this configuration is `False,` Airflow should raise a warning saying:
```
From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX.
```

3. If this configuration is `True,` Airflow should raise the exception, as it does now in Airflow 2.9.0 and 2.9.1

4. From Airflow 3.0, we change this configuration to be `True` by default.

(cherry picked from commit a07d799)
utkarsharma2 pushed a commit that referenced this pull request Jun 5, 2024
…rflow 2.9 (#39670)

Closes: #39486

Valid DAGs that worked in Airflow 2.8.x  and had tasks with outlets with specific URIs, such as `Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")`, stopped working in Airflow 2.9.0 & Airflow 2.9.1, after #37005 was merged.

This was a breaking change in an Airflow minor version. We should avoid this.

Airflow < 3.0 should raise a warning, and from Airflow 3.0, we can make errors by default. We can have a feature flag to allow users who want to see this in advance to enable errors in Airflow 2. x, but this should not be the default behaviour.

The DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration).

By running the following DAG with `apache-airflow==2.9.1` and `apache-airflow-providers-postgres==5.11.0`, as an example:
```
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

with DAG(dag_id='empty_operator_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:

    task1 = EmptyOperator(
        task_id='empty_task1',
        dag=dag,
        outlets=[Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")]
    )

    task2 = EmptyOperator(
        task_id='empty_task2',
        dag=dag
    )

    task1 >> task2
```

Causes to the exception:
```
Broken DAG: [/usr/local/airflow/dags/example_issue.py]
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/datasets/__init__.py", line 81, in _sanitize_uri
    parsed = normalizer(parsed)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/postgres/datasets/postgres.py", line 34, in sanitize_uri
    raise ValueError("URI format postgres:// must contain database, schema, and table names")
ValueError: URI format postgres:// must contain database, schema, and table names
```

This PR introduces the following:

1. A boolean configuration within `[core],` named `strict_dataset_uri_validation,` which should be `False` by default.

2. When this configuration is `False,` Airflow should raise a warning saying:
```
From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX.
```

3. If this configuration is `True,` Airflow should raise the exception, as it does now in Airflow 2.9.0 and 2.9.1

4. From Airflow 3.0, we change this configuration to be `True` by default.

(cherry picked from commit a07d799)
ephraimbuddy pushed a commit that referenced this pull request Jun 5, 2024
…rflow 2.9 (#39670)

Closes: #39486

Valid DAGs that worked in Airflow 2.8.x  and had tasks with outlets with specific URIs, such as `Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")`, stopped working in Airflow 2.9.0 & Airflow 2.9.1, after #37005 was merged.

This was a breaking change in an Airflow minor version. We should avoid this.

Airflow < 3.0 should raise a warning, and from Airflow 3.0, we can make errors by default. We can have a feature flag to allow users who want to see this in advance to enable errors in Airflow 2. x, but this should not be the default behaviour.

The DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration).

By running the following DAG with `apache-airflow==2.9.1` and `apache-airflow-providers-postgres==5.11.0`, as an example:
```
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

with DAG(dag_id='empty_operator_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:

    task1 = EmptyOperator(
        task_id='empty_task1',
        dag=dag,
        outlets=[Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")]
    )

    task2 = EmptyOperator(
        task_id='empty_task2',
        dag=dag
    )

    task1 >> task2
```

Causes to the exception:
```
Broken DAG: [/usr/local/airflow/dags/example_issue.py]
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/datasets/__init__.py", line 81, in _sanitize_uri
    parsed = normalizer(parsed)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/postgres/datasets/postgres.py", line 34, in sanitize_uri
    raise ValueError("URI format postgres:// must contain database, schema, and table names")
ValueError: URI format postgres:// must contain database, schema, and table names
```

This PR introduces the following:

1. A boolean configuration within `[core],` named `strict_dataset_uri_validation,` which should be `False` by default.

2. When this configuration is `False,` Airflow should raise a warning saying:
```
From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX.
```

3. If this configuration is `True,` Airflow should raise the exception, as it does now in Airflow 2.9.0 and 2.9.1

4. From Airflow 3.0, we change this configuration to be `True` by default.

(cherry picked from commit a07d799)
tatiana pushed a commit to astronomer/astronomer-cosmos that referenced this pull request Jul 17, 2024
I bumped the astronomer image to 11.6 since it uses the newest airflow,
incorporating [tatiana's change in airflow
version](apache/airflow#39670) for datasets and
I am installing requirements.txt by default in the image, since if the
user wants to add some custom package, he needs to change not only the
requirements.txt file but also the Dockerfile. With this only adding
dependencies to requirements.txt will already install them in the image
for dev development.
romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Jul 26, 2024
…rflow 2.9 (apache#39670)


Closes: apache#39486

# Context

Valid DAGs that worked in Airflow 2.8.x  and had tasks with outlets with specific URIs, such as `Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")`, stopped working in Airflow 2.9.0 & Airflow 2.9.1, after apache#37005 was merged.

This was a breaking change in an Airflow minor version. We should avoid this.

Airflow < 3.0 should raise a warning, and from Airflow 3.0, we can make errors by default. We can have a feature flag to allow users who want to see this in advance to enable errors in Airflow 2. x, but this should not be the default behaviour.

The DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration).

# How to reproduce

By running the following DAG with `apache-airflow==2.9.1` and `apache-airflow-providers-postgres==5.11.0`, as an example:
```
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator



with DAG(dag_id='empty_operator_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:

    task1 = EmptyOperator(
        task_id='empty_task1',
        dag=dag,
        outlets=[Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")]
    )

    task2 = EmptyOperator(
        task_id='empty_task2',
        dag=dag
    )

    task1 >> task2
```

Causes to the exception:
```
Broken DAG: [/usr/local/airflow/dags/example_issue.py]
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/datasets/__init__.py", line 81, in _sanitize_uri
    parsed = normalizer(parsed)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/postgres/datasets/postgres.py", line 34, in sanitize_uri
    raise ValueError("URI format postgres:// must contain database, schema, and table names")
ValueError: URI format postgres:// must contain database, schema, and table names
```

# About the changes introduced

This PR introduces the following:

1. A boolean configuration within `[core],` named `strict_dataset_uri_validation,` which should be `False` by default.

2. When this configuration is `False,` Airflow should raise a warning saying:
```
From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX.
```

3. If this configuration is `True,` Airflow should raise the exception, as it does now in Airflow 2.9.0 and 2.9.1

4. From Airflow 3.0, we change this configuration to be `True` by default.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:bug-fix Changelog: Bug Fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Strict validation in Dataset URI in Airflow 2.9 breaks some DAGs
5 participants