
Support partial parsing #800

Merged
merged 19 commits into from
Feb 19, 2024
Conversation

dwreeves
Collaborator

@dwreeves dwreeves commented Jan 15, 2024

Description

dbt uses partial_parse.msgpack to make rendering things a lot faster. This PR adds support for partial_parse.msgpack in the following places:

  • ExecutionMode.LOCAL
  • ExecutionMode.VIRTUALENV
  • LoadMode.DBT_LS

This PR also allows users to explicitly turn off partial parsing. If this is done, then --no-partial-parse will be passed as an explicit flag in all dbt command invocations (i.e. all ExecutionModes and LoadMode.DBT_LS, albeit not the dbt deps invocation.)
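
As a minimal sketch of the behavior described above (hypothetical helper names, not Cosmos's actual code):

```python
# Sketch: when partial parsing is explicitly disabled, every dbt
# invocation except `dbt deps` gets --no-partial-parse as a global flag.
def build_global_flags(partial_parse: bool) -> list[str]:
    return [] if partial_parse else ["--no-partial-parse"]


def build_command(subcommand: str, partial_parse: bool) -> list[str]:
    if subcommand == "deps":
        return ["dbt", subcommand]
    return ["dbt", *build_global_flags(partial_parse), subcommand]
```

For example, `build_command("run", partial_parse=False)` produces `["dbt", "--no-partial-parse", "run"]`, while `dbt deps` is left untouched.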

This should address some performance complaints that users have, e.g. this message from Slack: https://apache-airflow.slack.com/archives/C059CC42E9W/p1704483361206829. That said, this user will also need to provide a partial_parse.msgpack.

My experimentation and a look at the dbt-core source code confirm that dbt does not use manifest.json when partial parsing. It appears that these are just output artifacts, not input artifacts; only partial_parse.msgpack is used. (There are a couple of ways to confirm this other than just checking the source code.)

Also, I added a minor refactor of a feature I added a year ago: I added send_sigint() to the custom subprocess hook, since this custom subprocess hook did not exist back when I added the original feature (if you want me to split this refactor into a different PR, let me know).

API change

I decided the best way to go about this would be to just add a partial_parse: bool to both the execution config and render config. For example:

dbt_group = DbtTaskGroup(
    ...,
    execution_config=ExecutionConfig(
        ...,
        partial_parse=True
    ),
    render_config=RenderConfig(
        ...,
        partial_parse=False
    )
)

That said, in all honesty users will not need to set this at all, unless they want to suppress the little warning message about not being able to find the partial_parse.msgpack. This is because, by default, this addition searches for a msgpack if it exists, which is already the existing behavior in a sense, except that right now the msgpack file never exists (dbt does look for it, though).

When inserting into the AbstractDbtBaseOperator, I did not use global_boolean_flags. See the subsection below for why.

Other execution performance improvements

The main reason I am adding this feature is that it should dramatically improve performance for users. However, it is not the only way to improve execution performance.

It's possible that we should add a way to add the flag --no-write-json as an explicit kwarg to the dbt base operator. Right now users can support this via dbt_cmd_global_flags=["--no-write-json"]. Some users (e.g. those using Elementary, or those using the dbt local operator callback kwarg) will want to write the JSON, but I suspect the majority of users will not. Similarly, --log-level-file is not used at all, and at minimum dbt should work best the vast majority of time with --no-cache-selected-only set.

It's also possible there should be some sort of "performant" mode that automatically sets all these defaults for optimal performance:

  • --no-write-json
  • --log-level-file=none
  • --no-cache-selected-only

Perhaps a "performant" config would be too cumbersome to implement (I would agree with that). In which case the docs could also have a section on performance tips.

A note on global_boolean_flags

I did not add the partial parse support to global_boolean_flags because it doesn't quite fit into the existing paradigm for this. Right now the default for each of these global_boolean_flags is False, whereas the default for partial parse is actually True. This makes fitting it in awkward.

I think it's possible that just having a tuple[str] is insufficient here, as some flags you may want to add (not just --no-partial-parse but also --no-write-json) are by default True and must be explicitly turned off. This means that the rendering Cosmos does with flags, f'--{flag.replace("_", "-")}', is ineffective for flags like these.

Right now, we have an example of putting no in front with no_version_check: the default behavior of version checking is True, but the flag itself is negated, so its default value is actually False.

My proposal is that, instead of global_boolean_flags: tuple[str], this should be global_boolean_flags: tuple[str | tuple[str, str | None, str | None]]. When a global flag is a tuple[str, str | None, str | None], the first element should be the flag name, the second the option to emit "if true," and the third the option to emit "if false." None indicates that, when true/false (respectively), nothing is emitted.

For example:

class AbstractDbtBaseOperator(BaseOperator, metaclass=ABCMeta):
    ...
    global_boolean_flags = (
        ("no_version_check", "--no-version-check", None),
        ("cache_selected_only", "--cache-selected-only", None),
        ("partial_parse", None, "--no-partial-parse"),
    )

And Cosmos can keep supporting str parsing for backwards compatibility. It's pretty straightforward to convert the data type:

if isinstance(flag, str):
    flag = (flag, f'--{flag.replace("_", "-")}', None)
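
Putting the proposal together, the rendering logic might look like the following sketch (based on the tuple format proposed above; not existing Cosmos code):

```python
# Sketch of the proposed global_boolean_flags rendering. Flags are the
# tuples from the proposal: (name, emit-if-true, emit-if-false).
GLOBAL_BOOLEAN_FLAGS = (
    ("no_version_check", "--no-version-check", None),
    ("cache_selected_only", "--cache-selected-only", None),
    ("partial_parse", None, "--no-partial-parse"),
)

def render_global_flags(settings: dict) -> list[str]:
    rendered = []
    for flag in GLOBAL_BOOLEAN_FLAGS:
        if isinstance(flag, str):
            # Backwards-compatible str form: emit the flag when truthy.
            flag = (flag, f'--{flag.replace("_", "-")}', None)
        name, if_true, if_false = flag
        chosen = if_true if settings.get(name) else if_false
        if chosen is not None:
            rendered.append(chosen)
    return rendered
```

For example, render_global_flags({"no_version_check": True, "partial_parse": False}) yields ["--no-version-check", "--no-partial-parse"].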

Related Issue(s)

  • Resolves #791
  • Partially resolves #785 (#785 should probably be split into two stages: (1) support for partial parsing; (2) (a) allowing the dbt project dir / manifest / partial_parse.msgpack to come from cloud storage, and (b) letting dbt compile dump into cloud storage)

Breaking Change?

Should not break anything. This doesn't do anything when partial_parse.msgpack is missing, and the default behavior (partial_parse=True) does not alter the dbt cmd flags.

Checklist

  • I have made corresponding changes to the documentation (if required)
  • I have added tests that prove my fix is effective or that my feature works

@dwreeves dwreeves requested a review from a team as a code owner January 15, 2024 18:46
@dwreeves dwreeves requested a review from a team January 15, 2024 18:46
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jan 15, 2024

netlify bot commented Jan 15, 2024

Deploy Preview for sunny-pastelito-5ecb04 canceled.

Name Link
🔨 Latest commit a9e9412
🔍 Latest deploy log https://app.netlify.com/sites/sunny-pastelito-5ecb04/deploys/65d3b14bceb262000716e169

@dosubot dosubot bot added area:parsing Related to parsing DAG/DBT improvement, issues, or fixes dbt:parse Primarily related to dbt parse command or functionality execution:local Related to Local execution environment parsing:custom Related to custom parsing, like custom DAG parsing, custom DBT parsing, etc labels Jan 15, 2024

codecov bot commented Jan 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (9af7067) 94.72% compared to head (a9e9412) 95.07%.

@@            Coverage Diff             @@
##             main     #800      +/-   ##
==========================================
+ Coverage   94.72%   95.07%   +0.35%     
==========================================
  Files          56       56              
  Lines        2520     2540      +20     
==========================================
+ Hits         2387     2415      +28     
+ Misses        133      125       -8     


@dwreeves
Collaborator Author

dwreeves commented Jan 18, 2024

I don't know how the coverage check works, i.e. whether it requires 100% coverage of new commits or just that overall coverage goes up, but I helped increase the coverage of the code by adding tests for the subprocess hook.

Collaborator

@jlaneve jlaneve left a comment


This is awesome for two reasons: the change in and of itself, and the writeup on where we should look in the future. Apologies it's taken longer to review, but I'm going to check this out today.

I agree the performant mode is nice in theory but may be difficult to maintain. Maybe we should create a docs page specifically on performance with our recommendations? Inclusive of this flag, using the manifest parsing method, installing LibYAML, etc. That way, for users who care about performance, there's a single doc for all optimization recommendations.

Quick question re: how this partial parsing plays out in practice. My assumption is that a user would:

  • run dbt parse locally / in CICD (pre-deploying), which generates the necessary artifacts
  • push the entire project to the Airflow deployment
  • set the partial_parse flag

Does that sound right?

@dwreeves
Collaborator Author

dwreeves commented Feb 1, 2024

  1. run dbt parse locally / in CICD (pre-deploying), which generates the necessary artifacts

Yes, or dbt compile. Both of these commands will produce a manifest.json and a partial_parse.msgpack. (A note: this code addition only uses partial_parse.msgpack; however, a user pre-compiling for performance reasons would probably also want to use manifest.json as a load method, for similar performance reasons. So, two birds with one stone.)
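
For reference, both commands write their artifacts into the project's target/ directory by default:

```
my_dbt_project/
├── dbt_project.yml
└── target/
    ├── manifest.json          # usable for manifest-based loading
    └── partial_parse.msgpack  # used for partial parsing
```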

  1. push the entire project to the Airflow deployment
  2. set the partial_parse flag

By default, partial_parse is True. They wouldn't need to explicitly set it unless they wanted to.

I made this API decision (i.e. the decision to have default as True) in part because the user already receives the following warning from dbt when they are not using partial parsing:

Unable to do partial parsing because saved manifest not found. Starting full parse.

Meaning that both a user who intended to use partial parsing and one who did not are already warned by dbt that it is not happening; and by default dbt looks for the file. So I think having the default behavior of the operator match the default behavior of dbt is sensible. (Also, the performance cost of checking for the file and failing to find it is negligible, so there is very little harm in doing so.)

That's a lot of text for a very small API decision, but I think it's really important to get stuff like this right for popular packages. And I do think this is the right call.


I agree the performant mode is nice in theory but may be difficult to maintain. Maybe we should create a docs page specifically on performance with our recommendations? Inclusive of this flag, using the manifest parsing method, installing LibYAML, etc. That way, for users who care about performance, there's a single doc for all optimization recommendations.

I do like this approach. I agree that this would be better than a "performant" mode.

@dwreeves
Collaborator Author

@jbandoro Done

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Feb 19, 2024
@jbandoro jbandoro merged commit 39acd4c into astronomer:main Feb 19, 2024
60 checks passed
@tatiana tatiana mentioned this pull request Feb 27, 2024
tatiana added a commit that referenced this pull request Mar 1, 2024
Features

* Add dbt docs natively in Airflow via plugin by @dwreeves in #737
* Add support for ``InvocationMode.DBT_RUNNER`` for local execution mode
by @jbandoro in #850
* Support partial parsing to render DAGs faster when using
``ExecutionMode.LOCAL``, ``ExecutionMode.VIRTUALENV`` and
``LoadMode.DBT_LS`` by @dwreeves in #800
* Add Azure Container Instance as Execution Mode by @danielvdende in
#771
* Add dbt build operators by @dylanharper-qz in #795
* Add dbt profile config variables to mapped profile by @ykuc in #794
* Add more template fields to ``DbtBaseOperator`` by @dwreeves in #786

Bug fixes

* Make ``PostgresUserPasswordProfileMapping`` schema argument optional
by @FouziaTariq in #683
* Fix ``folder_dir`` not showing on logs for ``DbtDocsS3LocalOperator``
by @PrimOox in #856
* Improve ``dbt ls`` parsing resilience to missing tags/config by
@tatiana in #859
* Fix ``operator_args`` modified in place in Airflow converter by
@jbandoro in #835
* Fix Docker and Kubernetes operators execute method resolution by
@jbandoro in #849

Docs

* Fix docs homepage link by @jlaneve in #860
* Fix docs ``ExecutionConfig.dbt_project_path`` by @jbandoro in #847
* Fix typo in MWAA getting started guide by @jlaneve in #846

Others

* Add performance integration tests by @jlaneve in #827
* Add ``connect_retries`` to databricks profile to fix expensive
integration failures by @jbandoro in #826
* Add import sorting (isort) to Cosmos by @jbandoro in #866
* Add Python 3.11 to CI/tests by @tatiana and @jbandoro in #821, #824
and #825
* Fix failing ``test_created_pod`` for
``apache-airflow-providers-cncf-kubernetes`` after v8.0.0 update by
@jbandoro in #854
* Extend ``DatabricksTokenProfileMapping`` test to include session
properties by @tatiana in #858
* Fix broken integration test uncovered from Pytest 8.0 update by
@jbandoro in #845
* Pre-commit hook updates in #834, #843 and #852
tatiana added a commit that referenced this pull request May 1, 2024
…ct (#904)

Improve the performance of running the benchmark DAG with 100 tasks by 34%
and the benchmark DAG with 10 tasks by 22%, by persisting the dbt
partial parse artifact in Airflow nodes. The gain can be even
higher for dbt projects that take more time to be parsed.

With the introduction of #800, Cosmos supports using dbt partial parsing
files. This feature has led to a substantial performance improvement,
particularly for large dbt projects, both during Airflow DAG parsing
(using LoadMode.DBT_LS) and also Airflow task execution (when using
`ExecutionMode.LOCAL` and `ExecutionMode.VIRTUALENV`).

There were two limitations with the initial support for partial parsing,
which the current PR aims to address:

1. DAGs using Cosmos `ProfileMapping` classes could not leverage this
feature. This is because the partial parsing relies on profile files not
changing, and by default, Cosmos would mock the dbt profile in several
parts of the code. The consequence is that users trying Cosmos 1.4.0a1
will see the following message:
```
13:33:16  Unable to do partial parsing because profile has changed
13:33:16  Unable to do partial parsing because env vars used in profiles.yml have changed
```

2. The user had to explicitly provide a `partial_parse.msgpack` file in
the original project folder for their Airflow deployment - and if, for
any reason, this became outdated, the user would not leverage the
partial parsing feature. Since Cosmos runs dbt tasks from within a
temporary directory, the partial parse would be stale for some users: it
would be updated in the temporary directory, but the next time the task
was run, Cosmos/dbt would not leverage the recently updated
`partial_parse.msgpack` file.

The current PR addresses these two issues respectively by:

1. Allowing users that want to leverage Cosmos `ProfileMapping` and
partial parsing to use `RenderConfig(enable_mock_profile=False)`

2. Introducing a Cosmos cache directory where we are persisting partial
parsing files. This feature is enabled by default, but users can opt out
by setting the Airflow configuration `[cosmos][enable_cache] = False`
(exporting the environment variable `AIRFLOW__COSMOS__ENABLE_CACHE=0`).
Users can also define the temporary directory used to store these files
using the `[cosmos][cache_dir]` Airflow configuration. By default,
Cosmos will create and use a folder `cosmos` inside the system's
temporary directory:
https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir .
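
In airflow.cfg form, the cache settings described above look like this (values illustrative):

```
[cosmos]
enable_cache = False
cache_dir = /tmp/cosmos-cache
```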

This PR affects both DAG parsing and task execution. Although it does
not introduce an optimisation per se, it makes the partial parse feature
implemented in #800 available to more users.

Closes: #722

I updated the documentation in the PR: #898

Some future steps related to optimization associated to caching to be
addressed in separate PRs:
i. Change how we create mocked profiles, to create the file itself in
the same way, referencing an environment variable with the same name -
and only changing the value of the environment variable (#924)
ii. Extend caching to the `profiles.yml` created by Cosmos in the newly
introduced `tmp/cosmos` without the need to recreate it every time
(#925).
iii. Extend caching to the Airflow DAG/Task group as a pickle file -
this approach is more generic and would work for every type of DAG
parsing and executor. (#926)
iv. Support persisting/fetching the cache from remote storage so we
don't have to replicate it for every Airflow scheduler and worker node.
(#927)
v. Cache dbt deps lock file/avoid installing dbt steps every time. We
can leverage `package-lock.yml` introduced in dbt 1.7
(https://docs.getdbt.com/reference/commands/deps#predictable-package-installs),
but ideally, we'd have a strategy to support older versions of dbt as
well. (#930)
vi. Support caching `partial_parse.msgpack` even when vars change:
https://medium.com/@sebastian.daum89/how-to-speed-up-single-dbt-invocations-when-using-changing-dbt-variables-b9d91ce3fb0d
vii. Support partial parsing in Docker and Kubernetes Cosmos executors
(#929)
viii. Centralise all the Airflow-based config into Cosmos settings.py &
create a dedicated docs page containing information about these (#928)

**How to validate this change**

Run the performance benchmark against this and the `main` branch,
checking the value of `/tmp/performance_results.txt`.

Example of commands run locally:

```
# Setup
AIRFLOW_HOME=`pwd` AIRFLOW_CONN_AIRFLOW_DB="postgres://postgres:[email protected]:5432/postgres" PYTHONPATH=`pwd` AIRFLOW_HOME=`pwd` AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=20000 AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=20000 hatch run tests.py3.11-2.7:test-performance-setup

# Run test for 100 dbt models per DAG:
MODEL_COUNT=100 AIRFLOW_HOME=`pwd` AIRFLOW_CONN_AIRFLOW_DB="postgres://postgres:[email protected]:5432/postgres" PYTHONPATH=`pwd` AIRFLOW_HOME=`pwd` AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=20000 AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=20000 hatch run tests.py3.11-2.7:test-performance
```

An example of output when running 100 with the main branch:
```
NUM_MODELS=100
TIME=114.18614888191223
MODELS_PER_SECOND=0.8757629623135543
DBT_VERSION=1.7.13
```

And with the current PR:
```
NUM_MODELS=100
TIME=75.17766404151917
MODELS_PER_SECOND=1.33018232576064
DBT_VERSION=1.7.13
```
tatiana added a commit that referenced this pull request May 1, 2024
Improves docs to highlight the limitations of the partial parsing
approach (introduced in #800), following up on the feedback on #722 and the changes introduced in #904
@tatiana tatiana mentioned this pull request May 2, 2024
tatiana added a commit that referenced this pull request May 13, 2024
Features

* Add dbt docs natively in Airflow via plugin by @dwreeves in #737
* Add support for ``InvocationMode.DBT_RUNNER`` for local execution mode
by @jbandoro in #850
* Support partial parsing to render DAGs faster when using
``ExecutionMode.LOCAL``, ``ExecutionMode.VIRTUALENV`` and
``LoadMode.DBT_LS`` by @dwreeves in #800
* Improve performance by 22-35% or more by caching partial parse
artefact by @tatiana in #904
* Add Azure Container Instance as Execution Mode by @danielvdende in
#771
* Add dbt build operators by @dylanharper-qz in #795
* Add dbt profile config variables to mapped profile by @ykuc in #794
* Add more template fields to ``DbtBaseOperator`` by @dwreeves in #786
* Add ``pip_install_options`` argument to operators by @octiva in #808

Bug fixes

* Make ``PostgresUserPasswordProfileMapping`` schema argument optional
by @FouziaTariq in #683
* Fix ``folder_dir`` not showing on logs for ``DbtDocsS3LocalOperator``
by @PrimOox in #856
* Improve ``dbt ls`` parsing resilience to missing tags/config by
@tatiana in #859
* Fix ``operator_args`` modified in place in Airflow converter by
@jbandoro in #835
* Fix Docker and Kubernetes operators execute method resolution by
@jbandoro in #849
* Fix ``TrinoBaseProfileMapping`` required parameter for non method
authentication by @AlexandrKhabarov in #921
* Fix global flags for lists by @ms32035 in #863
* Fix ``GoogleCloudServiceAccountDictProfileMapping`` when getting
values from the Airflow connection ``extra__`` keys by @glebkrapivin in
#923
* Fix using the dag as a keyword argument as ``specific_args_keys`` in
DbtTaskGroup by @tboutaour in #916
* Fix ACI integration (``DbtAzureContainerInstanceBaseOperator``) by
@danielvdende in #872
* Fix setting dbt project dir to the tmp dir by @dwreeves in #873
* Fix dbt docs operator to not use ``graph.gpickle`` file when
``--no-write-json`` is passed by @dwreeves in #883
* Make Pydantic a required dependency by @pankajkoti in #939
* Gracefully error if users try to ``emit_datasets`` with ``Airflow
2.9.0`` or ``2.9.1`` by @tatiana in #948
* Fix parsing tests that have no parents in #933 by @jlaneve
* Correct ``root_path`` in partial parse cache by @pankajkoti in #950

Docs

* Fix docs homepage link by @jlaneve in #860
* Fix docs ``ExecutionConfig.dbt_project_path`` by @jbandoro in #847
* Fix typo in MWAA getting started guide by @jlaneve in #846
* Fix typo related to exporting docs to GCS by @tboutaour in #922
* Improve partial parsing docs by @tatiana in #898
* Improve docs for datasets for airflow >= 2.4 by @SiddiqueAhmad in #879
* Improve test behaviour docs to highlight ``warning`` feature in the
``virtualenv`` mode by @mc51 in #910
* Fix docs typo by @SiddiqueAhmad in #917
* Improve Astro docs by @RNHTTR in #951

Others

* Add performance integration tests by @jlaneve in #827
* Enable ``append_env`` in ``operator_args`` by default by @tatiana in
#899
* Change default ``append_env`` behaviour depending on Cosmos
``ExecutionMode`` by @pankajkoti and @pankajastro in #954
* Expose the ``dbt`` graph in the ``DbtToAirflowConverter`` class by
@tommyjxl in #886
* Improve dbt docs plugin rendering padding by @dwreeves in #876
* Add ``connect_retries`` to databricks profile to fix expensive
integration failures by @jbandoro in #826
* Add import sorting (isort) to Cosmos by @jbandoro in #866
* Add Python 3.11 to CI/tests by @tatiana and @jbandoro in #821, #824
and #825
* Fix failing ``test_created_pod`` for
``apache-airflow-providers-cncf-kubernetes`` after v8.0.0 update by
@jbandoro in #854
* Extend ``DatabricksTokenProfileMapping`` test to include session
properties by @tatiana in #858
* Fix broken integration test uncovered from Pytest 8.0 update by
@jbandoro in #845
* Add Apache Airflow 2.9 to the test matrix by @tatiana in #940
* Replace deprecated ``DummyOperator`` by ``EmptyOperator`` if Airflow
>=2.4.0 by @tatiana in #900
* Improve logs to troubleshoot issue in 1.4.0a2 with astro-cli by
@tatiana in #947
* Fix issue when publishing a new release to PyPI by @tatiana in #946
* Pre-commit hook updates in #820, #834, #843 and #852, #890, #896,
#901, #905, #908, #919, #931, #941
tatiana added a commit that referenced this pull request May 15, 2024
[Daniel Reeves](https://www.linkedin.com/in/daniel-reeves-27700545/)
(@dwreeves ) is an experienced Open-Source Developer currently working
as a Data Architect at Battery Ventures. He has significant experience
with Apache Airflow, SQL, and Python and has contributed to many [OSS
projects](https://github.com/dwreeves).

Not only has he been using Cosmos since its early stages, but since
January 2023, he has actively contributed to the project:
![Screenshot 2024-05-14 at 10 47
30](https://github.com/astronomer/astronomer-cosmos/assets/272048/57829cb6-7eee-4b02-998b-46cc7746f15a)

He has been a critical driver for the Cosmos 1.4 release, and some of
his contributions include new features, bug fixes, and documentation
improvements, including:
* Creation of an Airflow plugin to render dbt docs:
#737
* Support using dbt partial parsing file:
#800
* Add more template fields to `DbtBaseOperator`:
#786
* Add cancel on kill functionality:
#101
* Make region optional in Snowflake profile mapping:
#100
* Fix the dbt docs operator to not look for `graph.pickle`:
#883

He thinks about the project long-term and proposes thorough solutions to
problems faced by the community, as can be seen in Github tickets:
* Introducing composability in the middle layer of Cosmos's API:
#895
* Establish a general pattern for uploading artifacts to storage:
#894
* Support `operator_arguments` injection at a node level:
#881

One of Daniel's notable traits is his collaborative and supportive
approach. He has actively engaged with users in the #airflow-dbt Slack
channel, demonstrating his commitment to fostering a supportive
community.

We want to promote him as a Cosmos committer and maintainer for all
these, recognising his constant efforts and achievements towards our
community. Thank you very much, @dwreeves !
pankajkoti pushed a commit that referenced this pull request Jun 27, 2024
Partial parsing support was introduced in #800 and improved in #904
(caching). However, when the caching layer was introduced, we removed
support for using partial parsing when the cache was disabled.

This PR solves the issue.

Fix: #1041
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this pull request Jul 14, 2024
## Description

This PR adds a step to our CI to measure how quickly Cosmos can run
models. This is part of a larger initiative to make the project more
performant now that it's reached a certain level of maturity.

How it works:
- We now have [a test that generates a dbt project with a certain number
of sequential
models](https://github.com/astronomer/astronomer-cosmos/blob/performance-int-tests/tests/perf/test_performance.py)
(based on a parameter that gets passed in), runs a simple DAG, and
measures task throughput (measured in terms of models run per second
- I've extended our CI to run this test for 1, 10, 50, and 100 models to
start
- This CI reports out a GitHub Actions output that gets shown in the
actions summary, [at the
bottom](https://github.com/astronomer/astronomer-cosmos/actions/runs/7894490582)

While this isn't perfect, it's a step in the right direction - we now
have some general visibility! Note that these numbers may not be
indicative of a production Airflow environment running something like
the Kubernetes Executor, because this runs a local executor on GH
Actions runners. Still, it's meant as a benchmark to demonstrate whether
we're moving in the right direction or not.

As part of this, I've also refactored our tests to call a script from
the pyproject file instead of embedding the scripts directly in the
file. This should make it easier to maintain and track changes.

<!-- Add a brief but complete description of the change. -->

## Related Issue(s)

<!-- If this PR closes an issue, you can use a keyword to auto-close.
-->
<!-- i.e. "closes #0000" -->
astronomer#800

## Breaking Change?

<!-- If this introduces a breaking change, specify that here. -->

## Checklist

- [ ] I have made corresponding changes to the documentation (if
required)
- [ ] I have added tests that prove my fix is effective or that my
feature works
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this pull request Jul 14, 2024
## Description

dbt uses `partial_parse.msgpack` to make rendering things a lot faster.
This PR adds support for `partial_parse.msgpack` in the following
places:

- `ExecutionMode.LOCAL`
- `ExecutionMode.VIRTUALENV`
- `LoadMode.DBT_LS`

This PR also allows users to explicitly _turn off_ partial parsing. If
this is done, then `--no-partial-parse` will be passed as an explicit
flag in all dbt command invocations (i.e. all `ExecutionMode`s and
`LoadMode.DBT_LS`, albeit not the `dbt deps` invocation.)

This should address some performance complaints that users have, e.g.
this message from Slack:
https://apache-airflow.slack.com/archives/C059CC42E9W/p1704483361206829
Albeit, this user will also need to provide a `partial_parse.msgpack`.

My experimentation and looking at dbt-core source code confirms that dbt
does not use `manifest.json` when partial parsing. It appears that these
are just output artifacts, but not input artifacts. Only
`partial_parse.msgpack` is used. (There are a couple ways to confirm
this other than just checking source code

Also, I added a minor refactor of a feature I added a year ago: I added
`send_sigint()` to the custom subprocess hook, since this custom
subprocess hook did not exist back when I added it (if you want me to
split this refactor into a different PR then let me know).

### API change

I decided the best way to go about this would be to just add a
`partial_parse: bool` to both the execution config and render config.
For example:

```python
dbt_group = DbtTaskGroup(
    ...,
    execution_config=ExecutionConfig(
        ...,
        partial_parse=True
    ),
    render_config=RenderConfig(
        ...,
        partial_parse=False
    )
)
```

That said, in all honesty users will not need to set this at all, except
unless they want to suppress the little warning message about not being
able to find the `partial_parse.msgpack`. This is because by default
this addition searches for a msgpack if it exists, which is already the
existing behavior in a sense, except right now the msgpack file never
exists (dbt does look for it though).

When inserting into the `AbstractDbtBaseOperator`, I did not use
`global_boolean_flags`. See the subsection below for why.

### Other execution performance improvements

The main reason I am adding this feature is that it should dramatically
improve performance for users. However, it is not the only way to
improve

It's possible that we should add a way to add the flag `--no-write-json`
as an explicit kwarg to the dbt base operator. Right now users can
support this via `dbt_cmd_global_flags=["--no-write-json"]`. Some users
(e.g. those using Elementary, or those using the dbt local operator
`callback` kwarg) will want to write the JSON, but I suspect the
majority of users will not. Similarly, `--log-level-file` is not used at
all, and at minimum dbt should work best the vast majority of time with
`--no-cache-selected-only` set.

It's also possible there should be some sort of "performant" mode that
automatically sets all these defaults for optimal performance:

- `--no-write-json`
- `--log-level-file=none`
- `--no-cache-selected-only`

Perhaps a "performant" config would be too cumbersome to implement (I
would agree with that). In which case the docs could also have a section
on performance tips.

### A note on `global_boolean_flags`

I did not add the partial parse support to `global_boolean_flags`
because it doesn't quite fit into the existing paradigm for this. Right
now the default for each of these `global_boolean_flags` is False,
whereas the default for partial parse is actually True. This makes
fitting it in awkward.

I think it's possible that just having a `tuple[str]` is insufficient
here, as some flags you may want to add (not just `--no-partial-parse`
but also `--no-write-json` are by default _True_ and must be explicitly
turned off. Meaning that the parsing Cosmos does with flags of
`'--{flag.replace("_", "-")}'` is ineffective for flags like this.

Right now, we have an example of putting _no_ in front with
`no_version_check`. Meaning that the default behavior of version
checking is True, but the flag itself starts as negated so the default
is actually `False`.

My proposal is, instead of `global_boolean_flags: tuple[str]`, this
should instead be `global_boolean_flags: tuple[str | tuple[str, str |
None, str | None]]`. In the case that a global flag is a `tuple[str, str
| None, str | None]`, then the first arg should be the flag, the second
should be "if true," and the third should be "if false." `None`
indicates, when true/false (respectively), then do nothing.

For example:

```python
class AbstractDbtBaseOperator(BaseOperator, metaclass=ABCMeta):
    ...
    global_boolean_flags = (
        ("no_version_check", "--no-version-check", None),
        ("cache_selected_only", "--cache-selected-only", None),
        ("partial_parse", None, "--no-partial-parse"),
    )

```

And Cosmos would want to keep supporting plain `str` flags for
backwards compatibility. Converting the data type is straightforward:

```python
if isinstance(flag, str):
    flag = (flag, f'--{flag.replace("_", "-")}', None)
```
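Putting the two pieces together, the proposed tri-state tuples could be rendered into CLI arguments roughly as follows. This is a sketch of the proposal, not Cosmos's actual implementation; `render_global_flags` is a made-up name:

```python
from __future__ import annotations

# Sketch of the proposed rendering: each entry is either a plain attribute
# name (backwards-compatible form) or a (attr, if_true, if_false) tuple,
# where None means "emit nothing" for that truth value.

def render_global_flags(flags, operator) -> list[str]:
    args: list[str] = []
    for flag in flags:
        if isinstance(flag, str):
            # Legacy form: emit the dash-cased flag only when truthy.
            flag = (flag, f'--{flag.replace("_", "-")}', None)
        attr, if_true, if_false = flag
        rendered = if_true if getattr(operator, attr, None) else if_false
        if rendered is not None:
            args.append(rendered)
    return args
```

With the `global_boolean_flags` example above, an operator with `partial_parse=False` would produce `--no-partial-parse`, while `partial_parse=True` would produce nothing — matching the default dbt behavior.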

## Related Issue(s)

- Resolves astronomer#791
- Partially resolves astronomer#785
- astronomer#785 should probably be split up into two different stages: (1)
support for partial parsing (2) (a) dbt project dir / manifest /
`partial_parse.msgpack` is allowed to come from cloud storage. (b) `dbt
compile` is able to dump into cloud storage.

## Breaking Change?

Should not break anything. This doesn't do anything when
`partial_parse.msgpack` is missing, and the default behavior
(`partial_parse=True`) does not alter the dbt cmd flags.

## Checklist

- [x] I have made corresponding changes to the documentation (if
required)
- [x] I have added tests that prove my fix is effective or that my
feature works

---------

Co-authored-by: Tatiana Al-Chueyr <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Julian LaNeve <[email protected]>
Co-authored-by: Justin Bandoro <[email protected]>
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this pull request Jul 14, 2024
Features

* Add dbt docs natively in Airflow via plugin by @dwreeves in astronomer#737
* Add support for ``InvocationMode.DBT_RUNNER`` for local execution mode
by @jbandoro in astronomer#850
* Support partial parsing to render DAGs faster when using
``ExecutionMode.LOCAL``, ``ExecutionMode.VIRTUALENV`` and
``LoadMode.DBT_LS`` by @dwreeves in astronomer#800
* Add Azure Container Instance as Execution Mode by @danielvdende in
astronomer#771
* Add dbt build operators by @dylanharper-qz in astronomer#795
* Add dbt profile config variables to mapped profile by @ykuc in astronomer#794
* Add more template fields to ``DbtBaseOperator`` by @dwreeves in astronomer#786

Bug fixes

* Make ``PostgresUserPasswordProfileMapping`` schema argument optional
by @FouziaTariq in astronomer#683
* Fix ``folder_dir`` not showing on logs for ``DbtDocsS3LocalOperator``
by @PrimOox in astronomer#856
* Improve ``dbt ls`` parsing resilience to missing tags/config by
@tatiana in astronomer#859
* Fix ``operator_args`` modified in place in Airflow converter by
@jbandoro in astronomer#835
* Fix Docker and Kubernetes operators execute method resolution by
@jbandoro in astronomer#849

Docs

* Fix docs homepage link by @jlaneve in astronomer#860
* Fix docs ``ExecutionConfig.dbt_project_path`` by @jbandoro in astronomer#847
* Fix typo in MWAA getting started guide by @jlaneve in astronomer#846

Others

* Add performance integration tests by @jlaneve in astronomer#827
* Add ``connect_retries`` to databricks profile to fix expensive
integration failures by @jbandoro in astronomer#826
* Add import sorting (isort) to Cosmos by @jbandoro in astronomer#866
* Add Python 3.11 to CI/tests by @tatiana and @jbandoro in astronomer#821, astronomer#824
and astronomer#825
* Fix failing ``test_created_pod`` for
``apache-airflow-providers-cncf-kubernetes`` after v8.0.0 update by
@jbandoro in astronomer#854
* Extend ``DatabricksTokenProfileMapping`` test to include session
properties by @tatiana in astronomer#858
* Fix broken integration test uncovered from Pytest 8.0 update by
@jbandoro in astronomer#845
* Pre-commit hook updates in astronomer#834, astronomer#843 and astronomer#852
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this pull request Jul 14, 2024
…ct (astronomer#904)

Improve the performance to run the benchmark DAG with 100 tasks by 34%
and the benchmark DAG with 10 tasks by 22%, by persisting the dbt
partial parse artifact in Airflow nodes. This performance can be even
higher in the case of dbt projects that take more time to be parsed.

With the introduction of astronomer#800, Cosmos supports using dbt partial parsing
files. This feature has led to a substantial performance improvement,
particularly for large dbt projects, both during Airflow DAG parsing
(using LoadMode.DBT_LS) and also Airflow task execution (when using
`ExecutionMode.LOCAL` and `ExecutionMode.VIRTUALENV`).

There were two limitations with the initial support to partial parsing,
which the current PR aims to address:

1. DAGs using Cosmos `ProfileMapping` classes could not leverage this
feature. This is because the partial parsing relies on profile files not
changing, and by default, Cosmos would mock the dbt profile in several
parts of the code. The consequence is that users trying Cosmos 1.4.0a1
will see the following message:
```
13:33:16  Unable to do partial parsing because profile has changed
13:33:16  Unable to do partial parsing because env vars used in profiles.yml have changed
```

2. The user had to explicitly provide a `partial_parse.msgpack` file in
the original project folder for their Airflow deployment - and if, for
any reason, this became outdated, the user would not leverage the
partial parsing feature. Since Cosmos runs dbt tasks from within a
temporary directory, the partial parse file would become stale for some
users: it would be updated in the temporary directory, but the next time
the task was run, Cosmos/dbt would not leverage the recently updated
`partial_parse.msgpack` file.

The current PR addresses these two issues respectively by:

1. Allowing users that want to leverage Cosmos `ProfileMapping` and
partial parsing to use `RenderConfig(enable_mock_profile=False)`

2. Introducing a Cosmos cache directory where we are persisting partial
parsing files. This feature is enabled by default, but users can opt out
by setting the Airflow configuration `[cosmos][enable_cache] = False`
(exporting the environment variable `AIRFLOW__COSMOS__ENABLE_CACHE=0`).
Users can also define the temporary directory used to store these files
using the `[cosmos][cache_dir]` Airflow configuration. By default,
Cosmos will create and use a folder `cosmos` inside the system's
temporary directory:
https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir .
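The cache-directory resolution described above can be approximated as follows. This is a sketch, not Cosmos's actual code: it reads only the equivalent environment variables (`AIRFLOW__COSMOS__ENABLE_CACHE`, `AIRFLOW__COSMOS__CACHE_DIR`), whereas the real implementation goes through Airflow's configuration machinery, and `resolve_cache_dir` is a made-up name:

```python
from __future__ import annotations

import os
import tempfile
from pathlib import Path

# Approximation of the [cosmos] cache settings via their environment
# variable equivalents; Cosmos itself reads Airflow configuration.

def resolve_cache_dir() -> Path | None:
    """Return the partial-parse cache dir, or None when caching is disabled."""
    enabled = os.environ.get("AIRFLOW__COSMOS__ENABLE_CACHE", "1").lower()
    if enabled in ("0", "false"):
        return None
    custom = os.environ.get("AIRFLOW__COSMOS__CACHE_DIR")
    cache_dir = Path(custom) if custom else Path(tempfile.gettempdir()) / "cosmos"
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir
```

With no configuration set, this yields a `cosmos` folder under the system temporary directory, matching the default behavior described above.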

This PR affects both DAG parsing and task execution. Although it does
not introduce an optimisation per se, it makes the partial parse feature
implemented in astronomer#800 available to more users.

Closes: astronomer#722

I updated the documentation in the PR: astronomer#898

Some future steps related to optimization associated to caching to be
addressed in separate PRs:
i. Change how we create mocked profiles: create the file itself in the
same way, referencing an environment variable with the same name, and
only change the value of the environment variable (astronomer#924)
ii. Extend caching to the `profiles.yml` created by Cosmos in the newly
introduced `tmp/cosmos` without the need to recreate it every time
(astronomer#925).
iii. Extend caching to the Airflow DAG/Task group as a pickle file -
this approach is more generic and would work for every type of DAG
parsing and executor. (astronomer#926)
iv. Support persisting/fetching the cache from remote storage so we
don't have to replicate it for every Airflow scheduler and worker node.
(astronomer#927)
v. Cache dbt deps lock file/avoid installing dbt steps every time. We
can leverage `package-lock.yml`, introduced in dbt 1.7
(https://docs.getdbt.com/reference/commands/deps#predictable-package-installs),
but ideally, we'd have a strategy to support older versions of dbt as
well. (astronomer#930)
vi. Support caching `partial_parse.msgpack` even when vars change:
https://medium.com/@sebastian.daum89/how-to-speed-up-single-dbt-invocations-when-using-changing-dbt-variables-b9d91ce3fb0d
vii. Support partial parsing in Docker and Kubernetes Cosmos executors
(astronomer#929)
viii. Centralise all the Airflow-based config into Cosmos settings.py &
create a dedicated docs page containing information about these (astronomer#928)

**How to validate this change**

Run the performance benchmark against this and the `main` branch,
checking the value of `/tmp/performance_results.txt`.

Example of commands run locally:

```
# Setup
AIRFLOW_HOME=`pwd` AIRFLOW_CONN_AIRFLOW_DB="postgres://postgres:[email protected]:5432/postgres" PYTHONPATH=`pwd` AIRFLOW_HOME=`pwd` AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=20000 AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=20000 hatch run tests.py3.11-2.7:test-performance-setup

# Run test for 100 dbt models per DAG:
MODEL_COUNT=100 AIRFLOW_HOME=`pwd` AIRFLOW_CONN_AIRFLOW_DB="postgres://postgres:[email protected]:5432/postgres" PYTHONPATH=`pwd` AIRFLOW_HOME=`pwd` AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=20000 AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=20000 hatch run tests.py3.11-2.7:test-performance
```

An example of output when running 100 with the main branch:
```
NUM_MODELS=100
TIME=114.18614888191223
MODELS_PER_SECOND=0.8757629623135543
DBT_VERSION=1.7.13
```

And with the current PR:
```
NUM_MODELS=100
TIME=75.17766404151917
MODELS_PER_SECOND=1.33018232576064
DBT_VERSION=1.7.13
```
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this pull request Jul 14, 2024
Improves docs to highlight the limitations of the partial parsing
approach (introduced in astronomer#800), following up on the feedback on astronomer#722 and the changes introduced in astronomer#904
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this pull request Jul 14, 2024
Features

* Add dbt docs natively in Airflow via plugin by @dwreeves in astronomer#737
* Add support for ``InvocationMode.DBT_RUNNER`` for local execution mode
by @jbandoro in astronomer#850
* Support partial parsing to render DAGs faster when using
``ExecutionMode.LOCAL``, ``ExecutionMode.VIRTUALENV`` and
``LoadMode.DBT_LS`` by @dwreeves in astronomer#800
* Improve performance by 22-35% or more by caching partial parse
artefact by @tatiana in astronomer#904
* Add Azure Container Instance as Execution Mode by @danielvdende in
astronomer#771
* Add dbt build operators by @dylanharper-qz in astronomer#795
* Add dbt profile config variables to mapped profile by @ykuc in astronomer#794
* Add more template fields to ``DbtBaseOperator`` by @dwreeves in astronomer#786
* Add ``pip_install_options`` argument to operators by @octiva in astronomer#808

Bug fixes

* Make ``PostgresUserPasswordProfileMapping`` schema argument optional
by @FouziaTariq in astronomer#683
* Fix ``folder_dir`` not showing on logs for ``DbtDocsS3LocalOperator``
by @PrimOox in astronomer#856
* Improve ``dbt ls`` parsing resilience to missing tags/config by
@tatiana in astronomer#859
* Fix ``operator_args`` modified in place in Airflow converter by
@jbandoro in astronomer#835
* Fix Docker and Kubernetes operators execute method resolution by
@jbandoro in astronomer#849
* Fix ``TrinoBaseProfileMapping`` required parameter for non method
authentication by @AlexandrKhabarov in astronomer#921
* Fix global flags for lists by @ms32035 in astronomer#863
* Fix ``GoogleCloudServiceAccountDictProfileMapping`` when getting
values from the Airflow connection ``extra__`` keys by @glebkrapivin in
astronomer#923
* Fix using the dag as a keyword argument as ``specific_args_keys`` in
DbtTaskGroup by @tboutaour in astronomer#916
* Fix ACI integration (``DbtAzureContainerInstanceBaseOperator``) by
@danielvdende in astronomer#872
* Fix setting dbt project dir to the tmp dir by @dwreeves in astronomer#873
* Fix dbt docs operator to not use ``graph.gpickle`` file when
``--no-write-json`` is passed by @dwreeves in astronomer#883
* Make Pydantic a required dependency by @pankajkoti in astronomer#939
* Gracefully error if users try to ``emit_datasets`` with ``Airflow
2.9.0`` or ``2.9.1`` by @tatiana in astronomer#948
* Fix parsing tests that have no parents in astronomer#933 by @jlaneve
* Correct ``root_path`` in partial parse cache by @pankajkoti in astronomer#950

Docs

* Fix docs homepage link by @jlaneve in astronomer#860
* Fix docs ``ExecutionConfig.dbt_project_path`` by @jbandoro in astronomer#847
* Fix typo in MWAA getting started guide by @jlaneve in astronomer#846
* Fix typo related to exporting docs to GCS by @tboutaour in astronomer#922
* Improve partial parsing docs by @tatiana in astronomer#898
* Improve docs for datasets for airflow >= 2.4 by @SiddiqueAhmad in astronomer#879
* Improve test behaviour docs to highlight ``warning`` feature in the
``virtualenv`` mode by @mc51 in astronomer#910
* Fix docs typo by @SiddiqueAhmad in astronomer#917
* Improve Astro docs by @RNHTTR in astronomer#951

Others

* Add performance integration tests by @jlaneve in astronomer#827
* Enable ``append_env`` in ``operator_args`` by default by @tatiana in
astronomer#899
* Change default ``append_env`` behaviour depending on Cosmos
``ExecutionMode`` by @pankajkoti and @pankajastro in astronomer#954
* Expose the ``dbt`` graph in the ``DbtToAirflowConverter`` class by
@tommyjxl in astronomer#886
* Improve dbt docs plugin rendering padding by @dwreeves in astronomer#876
* Add ``connect_retries`` to databricks profile to fix expensive
integration failures by @jbandoro in astronomer#826
* Add import sorting (isort) to Cosmos by @jbandoro in astronomer#866
* Add Python 3.11 to CI/tests by @tatiana and @jbandoro in astronomer#821, astronomer#824
and astronomer#825
* Fix failing ``test_created_pod`` for
``apache-airflow-providers-cncf-kubernetes`` after v8.0.0 update by
@jbandoro in astronomer#854
* Extend ``DatabricksTokenProfileMapping`` test to include session
properties by @tatiana in astronomer#858
* Fix broken integration test uncovered from Pytest 8.0 update by
@jbandoro in astronomer#845
* Add Apache Airflow 2.9 to the test matrix by @tatiana in astronomer#940
* Replace deprecated ``DummyOperator`` by ``EmptyOperator`` if Airflow
>=2.4.0 by @tatiana in astronomer#900
* Improve logs to troubleshoot issue in 1.4.0a2 with astro-cli by
@tatiana in astronomer#947
* Fix issue when publishing a new release to PyPI by @tatiana in astronomer#946
* Pre-commit hook updates in astronomer#820, astronomer#834, astronomer#843 and astronomer#852, astronomer#890, astronomer#896,
astronomer#901, astronomer#905, astronomer#908, astronomer#919, astronomer#931, astronomer#941
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this pull request Jul 14, 2024
[Daniel Reeves](https://www.linkedin.com/in/daniel-reeves-27700545/)
(@dwreeves ) is an experienced Open-Source Developer currently working
as a Data Architect at Battery Ventures. He has significant experience
with Apache Airflow, SQL, and Python and has contributed to many [OSS
projects](https://github.com/dwreeves).

Not only has he been using Cosmos since its early stages, but since
January 2023, he has actively contributed to the project:
![Screenshot 2024-05-14 at 10 47
30](https://github.com/astronomer/astronomer-cosmos/assets/272048/57829cb6-7eee-4b02-998b-46cc7746f15a)

He has been a critical driver for the Cosmos 1.4 release, and some of
his contributions include new features, bug fixes, and documentation
improvements, including:
* Creation of an Airflow plugin to render dbt docs:
astronomer#737
* Support using dbt partial parsing file:
astronomer#800
* Add more template fields to `DbtBaseOperator`:
astronomer#786
* Add cancel on kill functionality:
astronomer#101
* Make region optional in Snowflake profile mapping:
astronomer#100
* Fix the dbt docs operator to not look for `graph.pickle`:
astronomer#883

He thinks about the project long-term and proposes thorough solutions to
problems faced by the community, as can be seen in GitHub issues:
* Introducing composability in the middle layer of Cosmos's API:
astronomer#895
* Establish a general pattern for uploading artifacts to storage:
astronomer#894
* Support `operator_arguments` injection at a node level:
astronomer#881

One of Daniel's notable traits is his collaborative and supportive
approach. He has actively engaged with users in the #airflow-dbt Slack
channel, demonstrating his commitment to fostering a supportive
community.

For all these reasons, we want to promote him to Cosmos committer and
maintainer, recognising his constant efforts and achievements towards
our community. Thank you very much, @dwreeves !
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this pull request Jul 14, 2024
Partial parsing support was introduced in astronomer#800 and improved in astronomer#904
(caching). However, as the caching layer was introduced, we removed
support to use partial parsing if the cache was disabled.

This PR solves the issue.

Fix: astronomer#1041