
[Bug] Strict validation in Dataset URI in Airflow 2.9 breaks some DAGs #39486

Closed
tatiana opened this issue May 8, 2024 · 4 comments · Fixed by #39670
Labels: affected_version:2.9, area:core, area:datasets, good first issue, kind:bug

Comments

tatiana (Contributor) commented May 8, 2024

Apache Airflow version

2.9.0 & 2.9.1

What happened?

Valid DAGs that worked in Airflow 2.8.x and had tasks whose outlets used specific URIs, such as Dataset("postgres://postgres:5432/postgres.dbt.stg_customers"), stopped working in Airflow 2.9.0, after #37005 was merged.

What do you think should happen instead?

This is a breaking change in an Airflow minor version. We should avoid this.

Airflow < 3.0 should raise a warning, and from Airflow 3.0 we can raise errors by default. We can add a feature flag so that users who want to see this behaviour in advance can enable errors in Airflow 2.x, but this should not be the default behaviour.

The DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration).

How to reproduce

Run the following DAG, for example:

from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator


with DAG(dag_id='empty_operator_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:

    # This outlet URI parsed fine in Airflow 2.8.x; in 2.9.0/2.9.1 the postgres
    # provider's strict URI validation rejects it at DAG parse time.
    task1 = EmptyOperator(
        task_id='empty_task1',
        dag=dag,
        outlets=[Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")]
    )

    task2 = EmptyOperator(
        task_id='empty_task2',
        dag=dag
    )

    task1 >> task2

This causes the following exception:

Broken DAG: [/usr/local/airflow/dags/example_issue.py]
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/datasets/__init__.py", line 81, in _sanitize_uri
    parsed = normalizer(parsed)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/postgres/datasets/postgres.py", line 34, in sanitize_uri
    raise ValueError("URI format postgres:// must contain database, schema, and table names")
ValueError: URI format postgres:// must contain database, schema, and table names
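
For reference, the provider's validator wants database, schema, and table as separate path components. A minimal sketch of a URI that passes, assuming the `postgres://host:port/database/schema/table` layout implied by the error message:

```
from airflow.datasets import Dataset

# Rejected by the 2.9 strict validation: database.schema.table is packed
# into a single path segment.
# Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")

# Accepted: database, schema, and table as three path components
# (layout assumed from the error message and the provider docs).
valid = Dataset("postgres://postgres:5432/postgres/dbt/stg_customers")
```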

Operating System

All

Versions of Apache Airflow Providers

apache-airflow-providers-postgres==5.11.0

Deployment

Astronomer

Deployment details

This issue can be seen in any Airflow 2.9.0 or 2.9.1 deployment.

Anything else?

An apparent "quick-fix" could be to replace "." by "/" in the Dataset URI of the DAG. However, this would break any DAGs that use data-aware schedules and consume from this Dataset - in a silent way. For this reason, let's fix this in Airflow 2.9 and prepare users for a breaking change in Airflow 3.0.

Suggested approach to fix the problem (see the sketch after this list):

  1. Introduce a boolean configuration within [core], named strict_dataset_uri_validation, which should be False by default.

  2. When this configuration is False, Airflow should raise a warning saying:

     From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX.

  3. When this configuration is True, Airflow should raise the exception, as it does now in Airflow 2.9.0 and 2.9.1.

  4. From Airflow 3.0, we change this configuration to be True by default.
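
A minimal, self-contained sketch of how that gate could work. The normalizer below is a stand-in for the provider hook that raised in the traceback, and `STRICT` stands in for reading the proposed `[core] strict_dataset_uri_validation` option; none of this is the actual patch:

```
import warnings
from urllib.parse import SplitResult, urlsplit

STRICT = False  # stand-in for conf.getboolean("core", "strict_dataset_uri_validation")


def postgres_normalizer(parsed: SplitResult) -> SplitResult:
    # Stand-in for the provider validation that raised in the traceback above.
    if len(parsed.path.strip("/").split("/")) != 3:
        raise ValueError("URI format postgres:// must contain database, schema, and table names")
    return parsed


def sanitize_uri(uri: str) -> str:
    parsed = urlsplit(uri)
    try:
        parsed = postgres_normalizer(parsed)
    except ValueError:
        if STRICT:
            raise  # Airflow 3 behaviour: hard failure at DAG parse time
        warnings.warn(
            f"From Airflow 3, Airflow will be more strict with Dataset URIs, "
            f"and the URI {uri} will no longer be valid. Please follow the "
            f"expected standard.",
            UserWarning,
        )
    return parsed.geturl()


# Warns instead of raising while STRICT is False:
print(sanitize_uri("postgres://postgres:5432/postgres.dbt.stg_customers"))
```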

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

tatiana added the area:core, kind:bug and needs-triage labels May 8, 2024
tatiana changed the title from "[Bug] Strict validation in Dataset URI in Airflow 2.9 break some DAGs" to "[Bug] Strict validation in Dataset URI in Airflow 2.9 breaks some DAGs" May 8, 2024
uranusjr added this to the Airflow 2.9.2 milestone May 10, 2024
uranusjr (Member) commented:

Tentatively targeting 2.9.2, I think we can argue this is a bug fix (the issue being backward incompatibility).

tatiana (Contributor, Author) commented May 10, 2024

@uranusjr I'll try to get the PR ready for today - I can't assign the ticket to myself, but feel free to assign it if possible.

eladkal added the area:datasets, affected_version:2.9 and good first issue labels and removed the needs-triage label May 11, 2024
kaxil (Member) commented May 15, 2024

@tatiana Any progress on this one?

tatiana (Contributor, Author) commented May 16, 2024

@kaxil I just proposed a PR to fix the problem: #39670

pankajkoti pushed a commit that referenced this issue May 17, 2024
…rflow 2.9 (#39670)


Closes: #39486

# About the changes introduced

This PR introduces the following:

1. A boolean configuration within `[core]`, named `strict_dataset_uri_validation`, which is `False` by default.

2. When this configuration is `False`, Airflow raises a warning saying:
```
From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX.
```

3. When this configuration is `True`, Airflow raises the exception, as it does in Airflow 2.9.0 and 2.9.1.

4. From Airflow 3.0, this configuration becomes `True` by default.
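
With that in place, opting in to the Airflow 3 behaviour ahead of time would presumably look like this (the option name is the one proposed above; the environment variable follows Airflow's standard `AIRFLOW__SECTION__KEY` mapping):

```
# In airflow.cfg:
[core]
strict_dataset_uri_validation = True

# Or equivalently via an environment variable:
export AIRFLOW__CORE__STRICT_DATASET_URI_VALIDATION=True
```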
RNHTTR pushed a commit to RNHTTR/airflow that referenced this issue Jun 1, 2024
ephraimbuddy pushed a commit that referenced this issue Jun 4, 2024
ephraimbuddy pushed a commit that referenced this issue Jun 5, 2024
utkarsharma2 pushed a commit that referenced this issue Jun 5, 2024
ephraimbuddy pushed a commit that referenced this issue Jun 5, 2024
eladkal removed this from the Airflow 2.9.2 milestone Jun 9, 2024
romsharon98 pushed a commit to romsharon98/airflow that referenced this issue Jul 26, 2024