Make postgresql default engine args comply with SA 2.0 #38362
Conversation
I do not think we ever use row count after executemany. I am not 100% sure (maybe others who know SQLAlchemy better can confirm) - but I believe the only place where SQLAlchemy runs executemany for updates would be in our "bulk_" methods. The only place I could find where a bulk method would return anything is (in dag.py):

```python
return cls.bulk_write_to_db(dags=dags, session=session)
```

But after closer inspection it always returns None - because that's what `bulk_write_to_db` returns. I also looked at the places where we use `rowcount` after an update:

```python
deactivated_dagmodel = session.execute(
    update(DagModel)
    .where(DagModel.dag_id.in_(to_deactivate))
    .values(is_active=False)
    .execution_options(synchronize_session="fetch")
)
deactivated = deactivated_dagmodel.rowcount
```

```python
num_failed = session.execute(
    update(Job)
    .where(
        Job.job_type == "SchedulerJob",
        Job.state == JobState.RUNNING,
        Job.latest_heartbeat < (timezone.utcnow() - timedelta(seconds=timeout)),
    )
    .values(state=JobState.FAILED)
).rowcount
```

and:

```python
update(TI)
.where(
    TI.state == TaskInstanceState.DEFERRED,
    TI.trigger_timeout < timezone.utcnow(),
)
.values(
    state=TaskInstanceState.SCHEDULED,
    next_method="__fail__",
    next_kwargs={"error": "Trigger/execution timeout"},
    trigger_id=None,
)
```

```python
count += session.execute(
    update(TI)
    .where(
        TI.dag_id == self.dag_id,
        TI.run_id == self.run_id,
        tuple_in_condition((TI.task_id, TI.map_index), schedulable_ti_ids_chunk),
    )
    .values(state=TaskInstanceState.SCHEDULED)
    .execution_options(synchronize_session=False)
).rowcount
```

and:

```python
session.execute(
    update(TI)
    .where(
        TI.dag_id == self.dag_id,
        TI.run_id == self.run_id,
        tuple_in_condition((TI.task_id, TI.map_index), dummy_ti_ids_chunk),
    )
    .values(
        state=TaskInstanceState.SUCCESS,
        start_date=timezone.utcnow(),
        end_date=timezone.utcnow(),
        duration=0,
    )
    .execution_options(
        synchronize_session=False,
    )
).rowcount
```

Here is the doc.

It would be good if others (@ashb @dstandish @kaxil @uranusjr) could take a look as well and see if they can confirm that we should not be affected.
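To illustrate the distinction this relies on, here is a minimal sketch using stdlib `sqlite3` instead of psycopg2/PostgreSQL (hypothetical table and values, not Airflow's actual schema): all the snippets above read `rowcount` after a *single* multi-row UPDATE statement, which is the well-defined case; psycopg2's executemany fast-path helpers are where rowcount accuracy is documented to vary.

```python
import sqlite3

# Single multi-row UPDATE, mirroring the "deactivate DAGs" pattern above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag (dag_id TEXT, is_active INTEGER)")
conn.executemany("INSERT INTO dag VALUES (?, 1)", [("a",), ("b",), ("c",)])

to_deactivate = ("a", "b")
cur = conn.execute(
    "UPDATE dag SET is_active = 0 WHERE dag_id IN (?, ?)", to_deactivate
)
# rowcount is the affected-row total for the single statement.
deactivated = cur.rowcount
print(deactivated)  # 2
```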
LGTM. But I think we need more confirmation that it should be safe to change the default
Note this is not a change in the default, so behavior should remain the same in 1.4. Here is a snippet for local validation:

```python
from sqlalchemy import create_engine
from importlib.metadata import version

SQL_ALCHEMY_CONN = "postgresql+psycopg2://postgres:[email protected]:25433/airflow"

args = {
    "current": {
        "executemany_mode": "values",
        "executemany_values_page_size": 10000,
        "executemany_batch_page_size": 2000,
    },
    "proposed": {
        "executemany_mode": "values_plus_batch",
        "executemany_values_page_size": 10000,
        "executemany_batch_page_size": 2000,
    },
    "not-set": {},
}

print(f"SQLAlchemy version: {version('sqlalchemy')}")
for name, pg_default_args in args.items():
    print(f" {name} ".center(72, "="))
    try:
        engine = create_engine(SQL_ALCHEMY_CONN, **pg_default_args, future=True)
    except Exception as e:
        print(f"Input: {pg_default_args.get('executemany_mode')!r}, error: {e}")
    else:
        print(f"Input: {pg_default_args.get('executemany_mode')!r}, output: {engine.dialect.executemany_mode}")
        engine.dispose()
```

Outputs on different versions
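As a quick sanity check on the validation snippet above (a plain-dict sketch, no database needed): the "current" and "proposed" arg sets differ only in the mode, not in either page size.

```python
# The two arg sets from the validation snippet above.
current = {
    "executemany_mode": "values",
    "executemany_values_page_size": 10000,
    "executemany_batch_page_size": 2000,
}
proposed = {
    "executemany_mode": "values_plus_batch",
    "executemany_values_page_size": 10000,
    "executemany_batch_page_size": 2000,
}
changed = {k for k in current if current[k] != proposed[k]}
print(changed)  # {'executemany_mode'}
```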
Yeah, it just postpones the error to another value - in SA 2.0 it should be
I am convinced then :)
And a modified snippet - regular tests still can't run with SA20 😢

```python
from sqlalchemy import create_engine
from importlib.metadata import version

from airflow.settings import DEFAULT_ENGINE_ARGS

SQL_ALCHEMY_CONN = "postgresql+psycopg2://postgres:[email protected]:25433/airflow"

args = {
    "current": {
        "executemany_mode": "values",
        "executemany_values_page_size": 10000,
        "executemany_batch_page_size": 2000,
    },
    "from-airflow-settings": DEFAULT_ENGINE_ARGS.get("postgresql"),
    "not-set": {},
}

print(f"SQLAlchemy version: {version('sqlalchemy')}")
for name, pg_default_args in args.items():
    print(f" {name} ".center(72, "="))
    print(f"Input: {pg_default_args}")
    try:
        engine = create_engine(SQL_ALCHEMY_CONN, **pg_default_args)
    except Exception as e:
        print(f"Error: {e}")
    else:
        print(f"executemany_mode: {getattr(engine.dialect, 'executemany_mode', 'NOT-EXISTS')}")
        print(f"executemany_batch_page_size: {getattr(engine.dialect, 'executemany_batch_page_size', 'NOT-EXISTS')}")
        print(f"insertmanyvalues_page_size: {getattr(engine.dialect, 'insertmanyvalues_page_size', 'NOT-EXISTS')}")
        print(f"executemany_values_page_size: {getattr(engine.dialect, 'executemany_values_page_size', 'NOT-EXISTS')}")
        engine.dispose()
```

```
SQLAlchemy version: 2.0.28
=============================== current ================================
Input: {'executemany_mode': 'values', 'executemany_values_page_size': 10000, 'executemany_batch_page_size': 2000}
Error: Invalid value for 'executemany_mode': 'values'
======================== from-airflow-settings =========================
Input: {'executemany_mode': 'values_plus_batch', 'insertmanyvalues_page_size': 10000, 'executemany_batch_page_size': 2000}
executemany_mode: symbol('EXECUTEMANY_VALUES_PLUS_BATCH')
executemany_batch_page_size: 2000
insertmanyvalues_page_size: 10000
executemany_values_page_size: NOT-EXISTS
=============================== not-set ================================
Input: {}
executemany_mode: symbol('EXECUTEMANY_VALUES')
executemany_batch_page_size: 100
insertmanyvalues_page_size: 1000
executemany_values_page_size: NOT-EXISTS
```

```
SQLAlchemy version: 1.4.52
=============================== current ================================
Input: {'executemany_mode': 'values', 'executemany_values_page_size': 10000, 'executemany_batch_page_size': 2000}
executemany_mode: symbol('executemany_values_plus_batch')
executemany_batch_page_size: 2000
insertmanyvalues_page_size: NOT-EXISTS
executemany_values_page_size: 10000
======================== from-airflow-settings =========================
Input: {'executemany_mode': 'values_plus_batch', 'executemany_values_page_size': 10000, 'executemany_batch_page_size': 2000}
executemany_mode: symbol('executemany_values_plus_batch')
executemany_batch_page_size: 2000
insertmanyvalues_page_size: NOT-EXISTS
executemany_values_page_size: 10000
=============================== not-set ================================
Input: {}
executemany_mode: symbol('executemany_values')
executemany_batch_page_size: 100
insertmanyvalues_page_size: NOT-EXISTS
executemany_values_page_size: 1000
```
* Change Postgres default `executemany_mode` to `'values_plus_batch'`
* `executemany_values_page_size` -> `insertmanyvalues_page_size` for SA20

`'values'` is the old mode from SQLAlchemy <= 1.3:
SA 1.3: https://docs.sqlalchemy.org/en/13/dialects/postgresql.html#psycopg2-fast-execution-helpers
SA 1.4: https://docs.sqlalchemy.org/en/14/dialects/postgresql.html#psycopg2-fast-execution-helpers
SA 2.0: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html#psycopg2-fast-execution-helpers
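The rename can be sketched as follows (a hypothetical helper, not the PR's actual code): pick the page-size keyword based on the installed SQLAlchemy major version, since `executemany_values_page_size` from 1.x became `insertmanyvalues_page_size` in 2.0, while `'values_plus_batch'` is accepted by both.

```python
def pg_engine_args(sqlalchemy_version: str) -> dict:
    """Hypothetical helper: Postgres engine kwargs valid for the given
    SQLAlchemy version. 'values_plus_batch' works on both 1.4 and 2.0,
    but the values page-size keyword was renamed in 2.0."""
    major = int(sqlalchemy_version.split(".")[0])
    args = {
        "executemany_mode": "values_plus_batch",
        "executemany_batch_page_size": 2000,
    }
    if major >= 2:
        args["insertmanyvalues_page_size"] = 10000
    else:
        args["executemany_values_page_size"] = 10000
    return args

print(pg_engine_args("2.0.28"))
print(pg_engine_args("1.4.52"))
```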
It still works with 1.4; however, it presumably fails in 2.0.
As far as I understand, `values_plus_batch` is the same as `values` in 1.4: sqlalchemy/sqlalchemy#5401

This one is a continuation of: #38066

closes: #38064 ?
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards-incompatible changes, please leave a note in a newsfragment file, named `{pr_number}.significant.rst` or `{issue_number}.significant.rst`, in newsfragments.