Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for InvocationMode.DBT_RUNNER for local execution mode #850

Merged
merged 36 commits into from
Feb 27, 2024

Conversation

jbandoro
Copy link
Collaborator

Description

This PR adds dbtRunner programmatic invocation for ExecutionMode.LOCAL. I decided to not make a new execution mode for each (e.g. ExecutionMode.LOCAL_DBT_RUNNER) and all of the child operators but instead added an additional config ExecutionConfig.invocation_mode where InvocationMode.DBT_RUNNER could be specified. This is so that users who are already using local execution mode could use dbt runner and see performance improvements.

With the dbtRunnerResult it makes it easy to know whether the dbt run was successful and logs do not need to be parsed but are still logged in the operator:

image

Performance Testing

After #827 was added, I modified it slightly to use postgres adapter instead of sqlite because the latest dbt-core support for sqlite is 1.4 when programmatic invocation requires >=1.5.0. I got the following results comparing subprocess to dbt runner for 10 models:

  1. InvocationMode.SUBPROCESS:
Ran 10 models in 23.77661895751953 seconds
NUM_MODELS=10
TIME=23.77661895751953
  1. InvocationMode.DBT_RUNNER:
Ran 10 models in 8.390100002288818 seconds
NUM_MODELS=10
TIME=8.390100002288818

So using InvocationMode.DBT_RUNNER is almost 3x faster, and can speed up dag runs if there are a lot of models that execute relatively quickly since there seems to be a 1-2s speed up per task.

One thing I found while working on this is that a manifest is stored in the result if you parse a project with the runner, and can be reused in subsequent commands to avoid reparsing. This could be a useful way for caching the manifest if we use dbt runner for dbt ls parsing and could speed up the initial render as well.

I thought at first it would be easy to have this also work for virtualenv execution, since I at first thought the entire execute method was run in the virtualenv, which is not the case since the virtualenv operator creates a virtualenv and then passes the executable path to a subprocess. It may be possible to have this work for virtualenv and would be better suited for a follow-up PR.

Related Issue(s)

closes #717

Breaking Change?

None

Checklist

  • I have made corresponding changes to the documentation (if required)
  • I have added tests that prove my fix is effective or that my feature works - added unit tests and integration tests.

@jbandoro jbandoro requested a review from a team as a code owner February 17, 2024 01:12
@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Feb 17, 2024
@dosubot dosubot bot added area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc dbt:run Primarily related to dbt run command or functionality execution:local Related to Local execution environment profile:postgres Related to Postgres ProfileConfig labels Feb 17, 2024
Copy link

netlify bot commented Feb 17, 2024

Deploy Preview for sunny-pastelito-5ecb04 canceled.

Name Link
🔨 Latest commit ce2efd9
🔍 Latest deploy log https://app.netlify.com/sites/sunny-pastelito-5ecb04/deploys/65de1f5888b8d20008f37d86

@jbandoro
Copy link
Collaborator Author

The performance tests for this branch (push) are succeeding, but the performance tests for (pull_request_target) are failing because I added in a check to the peformance dag test to ensure it was successful and the main GH action file does not have the necessary postgres service that was added in this PR.

Copy link
Collaborator

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fantastic performance improvement, @jbandoro ! Great work!

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Feb 27, 2024
@tatiana tatiana mentioned this pull request Feb 27, 2024
@jbandoro
Copy link
Collaborator Author

The performance tests are now failing because I removed this branch from the on-push from the test workflow. The performance tests passed in the previous commit here: https://github.com/astronomer/astronomer-cosmos/actions/runs/8068823803

@jbandoro jbandoro merged commit 1168f58 into main Feb 27, 2024
53 of 57 checks passed
@jbandoro jbandoro deleted the 717-add-dbtrunner-local-executor-2 branch February 27, 2024 18:02
tatiana added a commit that referenced this pull request Mar 1, 2024
Features

* Add dbt docs natively in Airflow via plugin by @dwreeves in #737
* Add support for ``InvocationMode.DBT_RUNNER`` for local execution mode
by @jbandoro in #850
* Support partial parsing to render DAGs faster when using
``ExecutionMode.LOCAL``, ``ExecutionMode.VIRTUALENV`` and
``LoadMode.DBT_LS`` by @dwreeves in #800
* Add Azure Container Instance as Execution Mode by @danielvdende in
#771
* Add dbt build operators by @dylanharper-qz in #795
* Add dbt profile config variables to mapped profile by @ykuc in #794
* Add more template fields to ``DbtBaseOperator`` by @dwreeves in #786

Bug fixes

* Make ``PostgresUserPasswordProfileMapping`` schema argument optional
by @FouziaTariq in #683
* Fix ``folder_dir`` not showing on logs for ``DbtDocsS3LocalOperator``
by @PrimOox in #856
* Improve ``dbt ls`` parsing resilience to missing tags/config by
@tatiana in #859
* Fix ``operator_args`` modified in place in Airflow converter by
@jbandoro in #835
* Fix Docker and Kubernetes operators execute method resolution by
@jbandoro in #849

Docs

* Fix docs homepage link by @jlaneve in #860
* Fix docs ``ExecutionConfig.dbt_project_path`` by @jbandoro in #847
* Fix typo in MWAA getting started guide by @jlaneve in #846

Others

* Add performance integration tests by @jlaneve in #827
* Add ``connect_retries`` to databricks profile to fix expensive
integration failures by @jbandoro in #826
* Add import sorting (isort) to Cosmos by @jbandoro in #866
* Add Python 3.11 to CI/tests by @tatiana and @jbandoro in #821, #824
and #825
* Fix failing ``test_created_pod`` for
``apache-airflow-providers-cncf-kubernetes`` after v8.0.0 update by
@jbandoro in #854
* Extend ``DatabricksTokenProfileMapping`` test to include session
properties by @tatiana in #858
* Fix broken integration test uncovered from Pytest 8.0 update by
@jbandoro in #845
* Pre-commit hook updates in #834, #843 and #852
@tatiana tatiana mentioned this pull request May 2, 2024
tatiana added a commit that referenced this pull request May 13, 2024
Features

* Add dbt docs natively in Airflow via plugin by @dwreeves in #737
* Add support for ``InvocationMode.DBT_RUNNER`` for local execution mode
by @jbandoro in #850
* Support partial parsing to render DAGs faster when using
``ExecutionMode.LOCAL``, ``ExecutionMode.VIRTUALENV`` and
``LoadMode.DBT_LS`` by @dwreeves in #800
* Improve performance by 22-35% or more by caching partial parse
artefact by @tatiana in #904
* Add Azure Container Instance as Execution Mode by @danielvdende in
#771
* Add dbt build operators by @dylanharper-qz in #795
* Add dbt profile config variables to mapped profile by @ykuc in #794
* Add more template fields to ``DbtBaseOperator`` by @dwreeves in #786
* Add ``pip_install_options`` argument to operators by @octiva in #808

Bug fixes

* Make ``PostgresUserPasswordProfileMapping`` schema argument optional
by @FouziaTariq in #683
* Fix ``folder_dir`` not showing on logs for ``DbtDocsS3LocalOperator``
by @PrimOox in #856
* Improve ``dbt ls`` parsing resilience to missing tags/config by
@tatiana in #859
* Fix ``operator_args`` modified in place in Airflow converter by
@jbandoro in #835
* Fix Docker and Kubernetes operators execute method resolution by
@jbandoro in #849
* Fix ``TrinoBaseProfileMapping`` required parameter for non method
authentication by @AlexandrKhabarov in #921
* Fix global flags for lists by @ms32035 in #863
* Fix ``GoogleCloudServiceAccountDictProfileMapping`` when getting
values from the Airflow connection ``extra__`` keys by @glebkrapivin in
#923
* Fix using the dag as a keyword argument as ``specific_args_keys`` in
DbtTaskGroup by @tboutaour in #916
* Fix ACI integration (``DbtAzureContainerInstanceBaseOperator``) by
@danielvdende in #872
* Fix setting dbt project dir to the tmp dir by @dwreeves in #873
* Fix dbt docs operator to not use ``graph.gpickle`` file when
``--no-write-json`` is passed by @dwreeves in #883
* Make Pydantic a required dependency by @pankajkoti in #939
* Gracefully error if users try to ``emit_datasets`` with ``Airflow
2.9.0`` or ``2.9.1`` by @tatiana in #948
* Fix parsing tests that have no parents in #933 by @jlaneve
* Correct ``root_path`` in partial parse cache by @pankajkoti in #950

Docs

* Fix docs homepage link by @jlaneve in #860
* Fix docs ``ExecutionConfig.dbt_project_path`` by @jbandoro in #847
* Fix typo in MWAA getting started guide by @jlaneve in #846
* Fix typo related to exporting docs to GCS by @tboutaour in #922
* Improve partial parsing docs by @tatiana in #898
* Improve docs for datasets for airflow >= 2.4 by @SiddiqueAhmad in #879
* Improve test behaviour docs to highlight ``warning`` feature in the
``virtualenv`` mode by @mc51 in #910
* Fix docs typo by @SiddiqueAhmad in #917
* Improve Astro docs by @RNHTTR in #951

Others

* Add performance integration tests by @jlaneve in #827
* Enable ``append_env`` in ``operator_args`` by default by @tatiana in
#899
* Change default ``append_env`` behaviour depending on Cosmos
``ExecutionMode`` by @pankajkoti and @pankajastro in #954
* Expose the ``dbt`` graph in the ``DbtToAirflowConverter`` class by
@tommyjxl in #886
* Improve dbt docs plugin rendering padding by @dwreeves in #876
* Add ``connect_retries`` to databricks profile to fix expensive
integration failures by @jbandoro in #826
* Add import sorting (isort) to Cosmos by @jbandoro in #866
* Add Python 3.11 to CI/tests by @tatiana and @jbandoro in #821, #824
and #825
* Fix failing ``test_created_pod`` for
``apache-airflow-providers-cncf-kubernetes`` after v8.0.0 update by
@jbandoro in #854
* Extend ``DatabricksTokenProfileMapping`` test to include session
properties by @tatiana in #858
* Fix broken integration test uncovered from Pytest 8.0 update by
@jbandoro in #845
* Add Apache Airflow 2.9 to the test matrix by @tatiana in #940
* Replace deprecated ``DummyOperator`` by ``EmptyOperator`` if Airflow
>=2.4.0 by @tatiana in #900
* Improve logs to troubleshoot issue in 1.4.0a2 with astro-cli by
@tatiana in #947
* Fix issue when publishing a new release to PyPI by @tatiana in #946
* Pre-commit hook updates in #820, #834, #843 and #852, #890, #896,
#901, #905, #908, #919, #931, #941
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this pull request Jul 14, 2024
…stronomer#850)

## Description

This PR adds `dbtRunner` programmatic invocation for
`ExecutionMode.LOCAL`. I decided to not make a new execution mode for
each (e.g. `ExecutionMode.LOCAL_DBT_RUNNER`) and all of the child
operators but instead added an additional config
`ExecutionConfig.invocation_mode` where `InvocationMode.DBT_RUNNER`
could be specified. This is so that users who are already using local
execution mode could use dbt runner and see performance improvements.

With the `dbtRunnerResult` it makes it easy to know whether the dbt run
was successful and logs do not need to be parsed but are still logged in
the operator:


![image](https://github.com/astronomer/astronomer-cosmos/assets/79104794/76a4cf82-f0f2-4133-8d68-a0a6a145b1d8)

## Performance Testing

After astronomer#827 was added, I modified it slightly to use postgres adapter
instead of sqlite because the latest dbt-core support for sqlite is 1.4
when programmatic invocation requires >=1.5.0. I got the following
results comparing subprocess to dbt runner for 10 models:

1. `InvocationMode.SUBPROCESS`:
```shell
Ran 10 models in 23.77661895751953 seconds
NUM_MODELS=10
TIME=23.77661895751953
```
2. `InvocationMode.DBT_RUNNER`:
```shell
Ran 10 models in 8.390100002288818 seconds
NUM_MODELS=10
TIME=8.390100002288818
```

So using `InvocationMode.DBT_RUNNER` is almost 3x faster, and can speed
up dag runs if there are a lot of models that execute relatively quickly
since there seems to be a 1-2s speed up per task.


One thing I found while working on this is that a
[manifest](https://docs.getdbt.com/reference/programmatic-invocations#reusing-objects)
is stored in the result if you parse a project with the runner, and can
be reused in subsequent commands to avoid reparsing. This could be a
useful way for caching the manifest if we use dbt runner for dbt ls
parsing and could speed up the initial render as well.


I thought at first it would be easy to have this also work for
virtualenv execution, since I at first thought the entire `execute`
method was run in the virtualenv, which is not the case since the
virtualenv operator creates a virtualenv and then passes the executable
path to a subprocess. It may be possible to have this work for
virtualenv and would be better suited for a follow-up PR.

## Related Issue(s)

closes astronomer#717 

## Breaking Change?

None

## Checklist

- [x] I have made corresponding changes to the documentation (if
required)
- [x] I have added tests that prove my fix is effective or that my
feature works - added unit tests and integration tests.
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this pull request Jul 14, 2024
Features

* Add dbt docs natively in Airflow via plugin by @dwreeves in astronomer#737
* Add support for ``InvocationMode.DBT_RUNNER`` for local execution mode
by @jbandoro in astronomer#850
* Support partial parsing to render DAGs faster when using
``ExecutionMode.LOCAL``, ``ExecutionMode.VIRTUALENV`` and
``LoadMode.DBT_LS`` by @dwreeves in astronomer#800
* Add Azure Container Instance as Execution Mode by @danielvdende in
astronomer#771
* Add dbt build operators by @dylanharper-qz in astronomer#795
* Add dbt profile config variables to mapped profile by @ykuc in astronomer#794
* Add more template fields to ``DbtBaseOperator`` by @dwreeves in astronomer#786

Bug fixes

* Make ``PostgresUserPasswordProfileMapping`` schema argument optional
by @FouziaTariq in astronomer#683
* Fix ``folder_dir`` not showing on logs for ``DbtDocsS3LocalOperator``
by @PrimOox in astronomer#856
* Improve ``dbt ls`` parsing resilience to missing tags/config by
@tatiana in astronomer#859
* Fix ``operator_args`` modified in place in Airflow converter by
@jbandoro in astronomer#835
* Fix Docker and Kubernetes operators execute method resolution by
@jbandoro in astronomer#849

Docs

* Fix docs homepage link by @jlaneve in astronomer#860
* Fix docs ``ExecutionConfig.dbt_project_path`` by @jbandoro in astronomer#847
* Fix typo in MWAA getting started guide by @jlaneve in astronomer#846

Others

* Add performance integration tests by @jlaneve in astronomer#827
* Add ``connect_retries`` to databricks profile to fix expensive
integration failures by @jbandoro in astronomer#826
* Add import sorting (isort) to Cosmos by @jbandoro in astronomer#866
* Add Python 3.11 to CI/tests by @tatiana and @jbandoro in astronomer#821, astronomer#824
and astronomer#825
* Fix failing ``test_created_pod`` for
``apache-airflow-providers-cncf-kubernetes`` after v8.0.0 update by
@jbandoro in astronomer#854
* Extend ``DatabricksTokenProfileMapping`` test to include session
properties by @tatiana in astronomer#858
* Fix broken integration test uncovered from Pytest 8.0 update by
@jbandoro in astronomer#845
* Pre-commit hook updates in astronomer#834, astronomer#843 and astronomer#852
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this pull request Jul 14, 2024
Features

* Add dbt docs natively in Airflow via plugin by @dwreeves in astronomer#737
* Add support for ``InvocationMode.DBT_RUNNER`` for local execution mode
by @jbandoro in astronomer#850
* Support partial parsing to render DAGs faster when using
``ExecutionMode.LOCAL``, ``ExecutionMode.VIRTUALENV`` and
``LoadMode.DBT_LS`` by @dwreeves in astronomer#800
* Improve performance by 22-35% or more by caching partial parse
artefact by @tatiana in astronomer#904
* Add Azure Container Instance as Execution Mode by @danielvdende in
astronomer#771
* Add dbt build operators by @dylanharper-qz in astronomer#795
* Add dbt profile config variables to mapped profile by @ykuc in astronomer#794
* Add more template fields to ``DbtBaseOperator`` by @dwreeves in astronomer#786
* Add ``pip_install_options`` argument to operators by @octiva in astronomer#808

Bug fixes

* Make ``PostgresUserPasswordProfileMapping`` schema argument optional
by @FouziaTariq in astronomer#683
* Fix ``folder_dir`` not showing on logs for ``DbtDocsS3LocalOperator``
by @PrimOox in astronomer#856
* Improve ``dbt ls`` parsing resilience to missing tags/config by
@tatiana in astronomer#859
* Fix ``operator_args`` modified in place in Airflow converter by
@jbandoro in astronomer#835
* Fix Docker and Kubernetes operators execute method resolution by
@jbandoro in astronomer#849
* Fix ``TrinoBaseProfileMapping`` required parameter for non method
authentication by @AlexandrKhabarov in astronomer#921
* Fix global flags for lists by @ms32035 in astronomer#863
* Fix ``GoogleCloudServiceAccountDictProfileMapping`` when getting
values from the Airflow connection ``extra__`` keys by @glebkrapivin in
astronomer#923
* Fix using the dag as a keyword argument as ``specific_args_keys`` in
DbtTaskGroup by @tboutaour in astronomer#916
* Fix ACI integration (``DbtAzureContainerInstanceBaseOperator``) by
@danielvdende in astronomer#872
* Fix setting dbt project dir to the tmp dir by @dwreeves in astronomer#873
* Fix dbt docs operator to not use ``graph.gpickle`` file when
``--no-write-json`` is passed by @dwreeves in astronomer#883
* Make Pydantic a required dependency by @pankajkoti in astronomer#939
* Gracefully error if users try to ``emit_datasets`` with ``Airflow
2.9.0`` or ``2.9.1`` by @tatiana in astronomer#948
* Fix parsing tests that have no parents in astronomer#933 by @jlaneve
* Correct ``root_path`` in partial parse cache by @pankajkoti in astronomer#950

Docs

* Fix docs homepage link by @jlaneve in astronomer#860
* Fix docs ``ExecutionConfig.dbt_project_path`` by @jbandoro in astronomer#847
* Fix typo in MWAA getting started guide by @jlaneve in astronomer#846
* Fix typo related to exporting docs to GCS by @tboutaour in astronomer#922
* Improve partial parsing docs by @tatiana in astronomer#898
* Improve docs for datasets for airflow >= 2.4 by @SiddiqueAhmad in astronomer#879
* Improve test behaviour docs to highlight ``warning`` feature in the
``virtualenv`` mode by @mc51 in astronomer#910
* Fix docs typo by @SiddiqueAhmad in astronomer#917
* Improve Astro docs by @RNHTTR in astronomer#951

Others

* Add performance integration tests by @jlaneve in astronomer#827
* Enable ``append_env`` in ``operator_args`` by default by @tatiana in
astronomer#899
* Change default ``append_env`` behaviour depending on Cosmos
``ExecutionMode`` by @pankajkoti and @pankajastro in astronomer#954
* Expose the ``dbt`` graph in the ``DbtToAirflowConverter`` class by
@tommyjxl in astronomer#886
* Improve dbt docs plugin rendering padding by @dwreeves in astronomer#876
* Add ``connect_retries`` to databricks profile to fix expensive
integration failures by @jbandoro in astronomer#826
* Add import sorting (isort) to Cosmos by @jbandoro in astronomer#866
* Add Python 3.11 to CI/tests by @tatiana and @jbandoro in astronomer#821, astronomer#824
and astronomer#825
* Fix failing ``test_created_pod`` for
``apache-airflow-providers-cncf-kubernetes`` after v8.0.0 update by
@jbandoro in astronomer#854
* Extend ``DatabricksTokenProfileMapping`` test to include session
properties by @tatiana in astronomer#858
* Fix broken integration test uncovered from Pytest 8.0 update by
@jbandoro in astronomer#845
* Add Apache Airflow 2.9 to the test matrix by @tatiana in astronomer#940
* Replace deprecated ``DummyOperator`` by ``EmptyOperator`` if Airflow
>=2.4.0 by @tatiana in astronomer#900
* Improve logs to troubleshoot issue in 1.4.0a2 with astro-cli by
@tatiana in astronomer#947
* Fix issue when publishing a new release to PyPI by @tatiana in astronomer#946
* Pre-commit hook updates in astronomer#820, astronomer#834, astronomer#843 and astronomer#852, astronomer#890, astronomer#896,
astronomer#901, astronomer#905, astronomer#908, astronomer#919, astronomer#931, astronomer#941
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc dbt:run Primarily related to dbt run command or functionality execution:local Related to Local execution environment lgtm This PR has been approved by a maintainer profile:postgres Related to Postgres ProfileConfig size:XL This PR changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support to DBT_RUNNER execution mode
3 participants