
Add support to store and fetch dbt ls cache in remote stores #1147

Merged
15 commits merged into main from remote-db-ls-cache on Aug 16, 2024

Conversation

@pankajkoti (Contributor) commented Aug 9, 2024

Summary

This PR introduces the functionality to store and retrieve the dbt ls output cache in remote storage systems. This enhancement improves the efficiency and scalability of cache management for Cosmos dbt projects that use the dbt ls cache option (enabled by default) introduced in PR #1014.

Key Changes

  1. dbt ls Cache Storage in Remote Stores:
    Added support to store the dbt ls cache as a JSON file in a remote storage path configured in the Airflow settings under the cosmos section.
    The cache file is saved in the specified remote storage path and its filename carries the cosmos_cache__ prefix.
  2. Cache Retrieval from Remote Stores:
    Implemented logic to check whether the cache exists in the remote storage path before falling back to the Variable cache.
    If remote_cache_dir is specified and the cache exists in the remote store, it is read and used; if the specified path does not exist, Cosmos tries to create it (see the sketch after this list).
  3. Backward Compatibility:
    Maintained backward compatibility by allowing users to continue using local cache storage through Airflow Variables if a remote_cache_dir is not specified.
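For illustration, here is a minimal sketch of the lookup order described in item 2. The helper name, cache-key layout, and file naming are assumptions for this example; the actual logic lives in cosmos/cache.py.

```python
import json
from typing import Optional

from airflow.io.path import ObjectStoragePath  # Airflow Object Storage, available from 2.8
from airflow.models import Variable


def get_dbt_ls_cache(cache_key: str, remote_cache_dir: Optional[str], conn_id: Optional[str] = None) -> Optional[dict]:
    """Illustrative lookup: remote store first (if configured), else Airflow Variables."""
    if remote_cache_dir:
        base = ObjectStoragePath(remote_cache_dir, conn_id=conn_id)
        cache_file = base / f"cosmos_cache__{cache_key}.json"  # assumed file naming
        if cache_file.exists():
            with cache_file.open("r") as fp:
                return json.load(fp)
        return None
    # Backward-compatible fallback: Variable-based cache when remote_cache_dir is not set.
    raw = Variable.get(f"cosmos_cache__{cache_key}", default_var=None)
    return json.loads(raw) if raw else None
```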

Impact

  1. Scalability: Enables the use of remote, scalable storage systems for dbt cache management.
  2. Performance: Reduces the load on Airflow's metadata database by offloading cache storage to external systems.
  3. Flexibility: Provides users with the option to choose between local (Airflow metadata using Variables) and remote cache storage based on their infrastructure needs.

Configuration

To leverage this feature, users need to set remote_cache_dir under the cosmos section of their Airflow settings. This path should point to a compatible remote storage location. You can also specify remote_cache_dir_conn_id, the Airflow connection Cosmos should use to reach your remote store. If it is not specified, Cosmos infers the scheme from the specified path and uses the default Airflow connection ID for that scheme.
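For example, in airflow.cfg (the bucket path and connection ID below are placeholders):

```ini
[cosmos]
remote_cache_dir = s3://my-bucket/cosmos_dbt_ls_cache/
remote_cache_dir_conn_id = my_aws_conn
```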

Testing

  1. Tested with multiple remote storage backends (AWS S3 and Google Cloud Storage) to ensure compatibility and reliability.
  2. Verified that cache retrieval falls back to the Variable-based caching approach if remote_cache_dir is not configured.

Documentation

Updated the documentation to include instructions on configuring remote_cache_dir.

Limitations

  1. Users must be on Airflow 2.8 or higher, because the underlying Airflow Object Storage feature we use to access remote stores was introduced in that version. If users specify a remote_cache_dir on an older Airflow version, they will encounter an error indicating the version requirement (see the illustrative guard after this list).
  2. Tasks spend slightly longer in the queued state (approximately 1-2 seconds of queued duration versus 0-1 seconds with the previous Variable approach) because of the remote storage calls needed to retrieve the cache.
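As an illustration of the first limitation, a guard along these lines (not the exact Cosmos code or error message) would reject the setting on older Airflow versions:

```python
from airflow.version import version as airflow_version
from packaging.version import Version

# Illustrative check only; Cosmos raises its own error describing the version requirement.
if Version(airflow_version) < Version("2.8"):
    raise RuntimeError("remote_cache_dir requires Airflow >= 2.8 (Airflow Object Storage).")
```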

closes: #1072

Breaking Change?

No

Checklist

  • I have made corresponding changes to the documentation (if required)
  • I have added tests that prove my fix is effective or that my feature works


netlify bot commented Aug 9, 2024

Deploy Preview for sunny-pastelito-5ecb04 canceled.

Latest commit: 94eebc1
Latest deploy log: https://app.netlify.com/sites/sunny-pastelito-5ecb04/deploys/66be67aa6c3ed00008c88c40


codecov bot commented Aug 9, 2024

Codecov Report

Attention: Patch coverage is 97.56098% with 1 line in your changes missing coverage. Please review.

Project coverage is 96.37%. Comparing base (e1ff924) to head (94eebc1).
Report is 1 commit behind head on main.

Files Patch % Lines
cosmos/cache.py 95.23% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1147   +/-   ##
=======================================
  Coverage   96.36%   96.37%           
=======================================
  Files          64       64           
  Lines        3443     3480   +37     
=======================================
+ Hits         3318     3354   +36     
- Misses        125      126    +1     


@tatiana (Collaborator) left a comment

Great work, @pankajkoti! I left some feedback; we can discuss and iterate on it tomorrow in a call to speed up getting this merged.

@tatiana (Collaborator) left a comment

Hi @pankajkoti, summarising what we agreed:

  1. Rename the configuration, so we minimize the amount of configuration and are consistent with how we are setting up variables when hosting docs:
    • drop remote_cache_path and reuse pre-existing cache_dir
    • rename remote_cache_conn_id to cache_conn_id
  2. Avoid exposing the method in ProjectConfig - if we could keep cache aux functions in either cache.py or graph.py, it would be ideal. We should refactor this in the future, as we discussed. If for some reason this refactor leads to the error "this class is not fork safe", it's okay to expose it in ProjectConfig as a property prefixed with _. If we manage to get this to work in cache.py or graph.py, we can consider decorating it with lru_cache to avoid computing it twice.
  3. Make sure delete_unused_dbt_ls_cache works for the remote cache
  4. Do not have a dedicated CI job for this feature, but try to leverage something like the following in the relevant tests, setting the necessary envvars: @patch.dict(os.environ, {"AIRFLOW__COSMOS__ENABLE_CACHE": "False"}, clear=True). We could also consider running one of the integration DAGs with this enabled, but without the need to create a new CI job (an illustrative test skeleton follows this list).
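A minimal test skeleton following this suggestion might look like the snippet below; the test name and assertion are placeholders, not the actual Cosmos tests.

```python
import os
from unittest.mock import patch


@patch.dict(os.environ, {"AIRFLOW__COSMOS__ENABLE_CACHE": "False"}, clear=True)
def test_dbt_ls_with_cache_disabled():
    # With the env var patched, the code under test should skip both the
    # Variable-based and the remote dbt ls cache paths.
    assert os.environ["AIRFLOW__COSMOS__ENABLE_CACHE"] == "False"
```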

It's looking great, and it will be very exciting to have this feature out.

@pankajkoti (Contributor, Author) commented

Thanks @tatiana for the review. I have addressed the review comments as per our recent discussion:

  1. Rename the configuration, so we minimize the amount of configuration and are consistent with how we are setting up variables when hosting docs:

    • drop remote_cache_path and reuse pre-existing cache_dir
    • rename remote_cache_conn_id to cache_conn_id

At this time, we couldn't use the existing cache_dir due to other caching implementation implications that are challenging to address. I've renamed it to remote_cache_dir and noted in the documentation that this is an experimental feature, which will be merged into cache_dir in future releases.

  2. Avoid exposing the method in ProjectConfig - if we could keep cache aux functions in either cache.py or graph.py, it would be ideal. We should refactor this in the future, as we discussed. If for some reason this refactor leads to the error "this class is not fork safe", it's okay to expose it in ProjectConfig as a property prefixed with _. If we manage to get this to work in cache.py or graph.py, we can consider decorating it with lru_cache to avoid computing it twice.

Yes, as per the pairing session, we moved it to cache.py. For now, we avoid initialising the configuration at module level and instead defer it until the cache is retrieved or updated, due to fsspec constraints (a rough sketch of this follows below).
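A rough sketch of what "deferring" means here, with illustrative names rather than the real cosmos/cache.py code: the ObjectStoragePath is only constructed inside the function that reads or writes the cache, not at import time.

```python
from typing import Optional

from airflow.io.path import ObjectStoragePath


def _remote_cache_dir_path(remote_cache_dir: str, conn_id: Optional[str] = None) -> ObjectStoragePath:
    # Building the path lazily, per call, avoids touching fsspec / Object Storage
    # when the module is imported, which is where the constraints showed up.
    path = ObjectStoragePath(remote_cache_dir, conn_id=conn_id)
    path.mkdir(parents=True, exist_ok=True)  # create the configured path if it does not exist
    return path
```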

  3. Make sure delete_unused_dbt_ls_cache works for the remote cache

I have created a new function, delete_unused_dbt_ls_remote_cache_files, and included it in the example DAG. I ensured that it identifies stale cache files and deletes them (and verified that the files were actually deleted in the remote store).
I haven't added tests for it yet because remote storage configuration is not yet set up in CI; I have marked it as nocover for now and will add tests in a follow-up PR, once we enable CI integration tests for remote storage. I hope that's okay.
(Screenshot: snapshot of deletion logs, 2024-08-16.)
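For reference, wiring this cleanup helper into a maintenance DAG could look roughly like the sketch below; the import path and the max-age argument are assumptions for illustration, so please check the example DAG added in this PR for the actual usage.

```python
from datetime import datetime, timedelta

from airflow.decorators import dag, task


@dag(start_date=datetime(2024, 8, 1), schedule="@weekly", catchup=False)
def cosmos_remote_cache_cleanup():
    @task
    def purge_stale_remote_cache():
        # Assumed import path and signature; see cosmos/cache.py and the example DAG.
        from cosmos.cache import delete_unused_dbt_ls_remote_cache_files

        delete_unused_dbt_ls_remote_cache_files(max_age_last_usage=timedelta(days=30))

    purge_stale_remote_cache()


cosmos_remote_cache_cleanup()
```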

  4. Do not have a dedicated CI job for this feature, but try to leverage something like the following in the relevant tests, setting the necessary envvars: @patch.dict(os.environ, {"AIRFLOW__COSMOS__ENABLE_CACHE": "False"}, clear=True). We could also consider running one of the integration DAGs with this enabled, but without the need to create a new CI job.

Yes, I have removed the separate job for now. As discussed, I will follow up in a subsequent PR to address this point.

Requesting a re-review please.

cc: @pankajastro

@tatiana tatiana merged commit 41053ed into main Aug 16, 2024
63 checks passed
@tatiana tatiana deleted the remote-db-ls-cache branch August 16, 2024 10:55
@tatiana tatiana mentioned this pull request Aug 16, 2024
@pankajkoti pankajkoti mentioned this pull request Aug 16, 2024
pankajkoti added a commit that referenced this pull request Aug 20, 2024
New Features

* Add support for loading manifest from cloud stores using Airflow
Object Storage by @pankajkoti in #1109
* Cache ``package-lock.yml`` file by @pankajastro in #1086
* Support persisting the ``LoadMode.VIRTUALENV`` directory by @tatiana
in #1079
* Add support to store and fetch ``dbt ls`` cache in remote stores by
@pankajkoti in #1147
* Add default source nodes rendering by @arojasb3 in #1107
* Add Teradata ``ProfileMapping`` by @sc250072 in #1077

Enhancements

* Add ``DatabricksOauthProfileMapping`` profile by @CorsettiS in #1091
* Use ``dbt ls`` as the default parser when ``profile_config`` is
provided by @pankajastro in #1101
* Add task owner to dbt operators by @wornjs in #1082
* Extend Cosmos custom selector to support + when using paths and tags
by @mvictoria in #1150
* Simplify logging by @dwreeves in #1108

Bug fixes

* Fix Teradata ``ProfileMapping`` target invalid issue by @sc250072 in
#1088
* Fix empty tag in case of custom parser by @pankajastro in #1100
* Fix ``dbt deps`` of ``LoadMode.DBT_LS`` should use
``ProjectConfig.dbt_vars`` by @tatiana in #1114
* Fix import handling by lazy loading hooks introduced in PR #1109 by
@dwreeves in #1132
* Fix Airflow 2.10 regression and add Airflow 2.10 in test matrix by
@pankajastro in #1162

Docs

* Fix typo in azure-container-instance docs by @pankajastro in #1106
* Use Airflow trademark as it has been registered by @pankajastro in
#1105

Others

* Run some example DAGs in Kubernetes execution mode in CI by
@pankajastro in #1127
* Install requirements.txt by default during dev env spin up by
@CorsettiS in #1099
* Remove ``DbtGraph.current_version`` dead code by @tatiana in #1111
* Disable test for Airflow-2.5 and Python-3.11 combination in CI by
@pankajastro in #1124
* Pre-commit hook updates in #1074, #1113, #1125, #1144, #1154,  #1167

---------

Co-authored-by: Pankaj Koti <[email protected]>
Co-authored-by: Pankaj Singh <[email protected]>
Labels

  • area:config: Related to configuration, like YAML files, environment variables, or executor configuration
  • area:performance: Related to performance, like memory usage, CPU usage, speed, etc.
  • dbt:list: Primarily related to dbt list command or functionality
  • lgtm: This PR has been approved by a maintainer
  • size:L: This PR changes 100-499 lines, ignoring generated files
Development

Successfully merging this pull request may close these issues.

[Feature] Allow storing dbt ls cache into Object Store