refactoring-external-materialization #332
Ah these are all good questions, will inline some comments shortly
@@ -27,6 +30,8 @@
{%- set backup_relation = make_backup_relation(target_relation, backup_relation_type) -%}
-- as above, the backup_relation should not already exist
{%- set preexisting_backup_relation = load_cached_relation(backup_relation) -%}

-- What is grants here?
GRANTS in the context of e.g. Snowflake or another cloud DWH. DuckDB doesn't have an equivalent concept yet, though it wouldn't surprise me to see it in the context of MotherDuck or Iceberg/Delta tables.
I will skip it for now to simplify, but the access grants here are an exciting topic. We want to export from the current DuckDB context into external Parquet, Delta, or Iceberg, and register a view over that location so the table can be referenced in the next steps.
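To make that concrete, here is a minimal sketch of the export-and-register step in plain DuckDB, assuming a local Parquet file and made-up relation/path names (the real macros derive these from the model and target config):

```python
import duckdb

con = duckdb.connect("dev.duckdb")

# Export the model's result set out of DuckDB into an external Parquet file.
con.sql("""
    COPY (SELECT 42 AS answer)
    TO 'my_model.parquet' (FORMAT PARQUET)
""")

# Register a view over the exported location so downstream models can keep
# referencing the relation by name in the next steps.
con.sql("""
    CREATE OR REPLACE VIEW my_model AS
    SELECT * FROM read_parquet('my_model.parquet')
""")
```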
@@ -76,9 +82,10 @@
{{ drop_relation_if_exists(temp_relation) }}

-- register table into glue
-- I don't know Glue, so can you explain it to me a bit?
There's a lot of history here -- the ability to register an external table with the AWS Glue catalog (and thus make it accessible to e.g. AWS Athena) actually predates the concept of dbt-duckdb plugins (in the BasePlugin sense). The original idea of plugins for external relations was to generalize this capability to other cataloging systems.
Nice, I see. I will need your support from the requirements point of view, because I am not sure I can picture all the use cases implemented so far. I don't have much experience in the AWS world (more Azure and Google Cloud), but I am happy to learn about Glue and Athena.
I will skip this one until we agree on the structure of the code; then we can talk about the details and each use case.
{%- set language = model['language'] -%}

{%- set target_relation = this.incorporate(type='view') %}

-- Continue as normal materialization
-- Why do we have temp and intermediate relation?
The idea is to not destroy an existing relation that has the same name as this one (if it exists in the catalog) in case something fails during the construction of the relation defined by this model.
I understand; we prepare everything, do all the transformations, and at the very end we exchange the names.
I understand how it fits into the context, but I am not yet sure what it means for my implementation right now, so please make sure I don't overlook it there.
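For illustration, a rough sketch of that prepare-and-swap flow in plain DuckDB; the `__dbt_tmp`/`__dbt_backup` names are made up here, while the actual materialization gets them from dbt's intermediate/backup relation helpers:

```python
import duckdb

con = duckdb.connect("dev.duckdb")
con.sql("CREATE TABLE IF NOT EXISTS my_model AS SELECT 1 AS id")  # pre-existing relation

# 1. Build the new version under a temporary name, so a failure here
#    leaves the existing `my_model` untouched.
con.sql("CREATE OR REPLACE TABLE my_model__dbt_tmp AS SELECT 2 AS id")

# 2. Only after the build succeeds, swap the names: move the old relation
#    aside and promote the intermediate one.
con.sql("ALTER TABLE my_model RENAME TO my_model__dbt_backup")
con.sql("ALTER TABLE my_model__dbt_tmp RENAME TO my_model")

# 3. Finally, drop the backup.
con.sql("DROP TABLE my_model__dbt_backup")
```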
-- What does load_cached_relation do?
Built-in dbt method; dbt will do the equivalent of caching the DB catalog metadata it loaded during the run so as to minimize the number of round-trips it does to the database (less relevant for DuckDB, more relevant for e.g. MotherDuck.)
I will have to take a look into MotherDuck here, because I still don't understand how this fits in and where I have to be careful not to overlook functionality. So please, if you see something that goes against MotherDuck logic, make me aware of it.
Hi @jwills, the main goal for now is to showcase the code structure; then we can go into details. To simplify a bit, I will move the following out of scope for now:
So, I added the structure and will start with a simple implementation of the native plugin, which will hopefully show whether this makes sense. This is still a work in progress, but if you have comments or questions I would be happy to answer them. I also have a general question: is there a concept of a release candidate / nightly build that people can try before they start to use it? :D I am not sure I can get the API 1-to-1 in the first run.
There isn't a nightly or some such thing as of yet; if we add it, we will add it as a net-new thing that is disabled by default (which is how we've done a bunch of other things, like e.g. the fast CSV loading for …).
Hi @jwills, can you take a look? I did a first draft of the native functionality and a new test which should cover some of the general cases.
Hi @jwills,
The new code flow provides a nice feature: the exported file can be referenced directly at the target destination. That sounds trivial for the native format, but it is especially important for the further implementations of the Iceberg, Delta, (BigQuery, Snowflake?) adapters, which can now be implemented within external materializations. What is there still to do:
has to do:
nice to have:
I hope this gives you a bit of an overview of what I did; I am happy to answer all questions. I am still fighting with mypy and the formatter in the pre-commit hooks, so if you have suggestions on how to go about this I would appreciate it a lot.
@milicevica23 ah that is very cool, thanks for the update! I have some stuff going on this week but I will dive into this on Thursday!
@milicevica23 when you get a chance could you redo the formatting according to the …?
Some notes/comments/Qs -- a general comment is that there is a lot of "coursor" vs. "cursor" stuff in the codebase now; it would be good to do a find/replace on "coursor" -> "cursor".
The biggest Q is the one on the write-perf impact of DuckDB doing an extra level of indirection via Arrow before it writes the results of a query to a file.
def store(self, df: DuckDBPyRelation, target_config: TargetConfig, cursor=None):
    location = target_config.location.path
    options = external_write_options(location, target_config.config.get("options", {}))
    cursor.sql(f"COPY (SELECT * FROM df) to '{location}' ({options})")
Yeah so this is my core question: does this extra level of indirection here (query to df to query of df that writes to an external file) introduce overhead or a perf hit that is going to cause issues for existing external users? Or is the DuckDB optimizer smart enough to optimize this away (for e.g. Arrow record batches or some such thing)?
Hi @jwills, a very good question.
I am unsure whether it converts to Arrow here, but I will try something empirically today to show whether this is the case. I intentionally hand over a DuckDBPyRelation so the plugin's store function can choose the needed conversion.
Independent of the result, we would have a target_config.config.model.compiled_code variable, which we can embed into the string, and this should be the same as in the old path.
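As a hypothetical sketch of that idea (reusing the names from the store snippet above; the compiled_code attribute path is the one assumed in this thread, not a confirmed API), the compiled SQL could be embedded directly into the COPY statement so DuckDB streams the query result straight to the file, just like the old path:

```python
def store(self, df: DuckDBPyRelation, target_config: TargetConfig, cursor=None):
    # `df` is unused in this variant; kept only for signature compatibility.
    # Embed the model's compiled SQL directly (assumed attribute path).
    compiled_code = target_config.config.model.compiled_code
    location = target_config.location.path
    options = external_write_options(location, target_config.config.get("options", {}))
    cursor.sql(f"COPY ({compiled_code}) TO '{location}' ({options})")
```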
Hi @jwills, in my unscientific empirical experiment, we can see that the new flow/method is faster than the previous one, where we first create a table and then export it.
I also added the conversion via an Arrow table and a pandas data frame to make the comparison easier.
The code is here: https://github.com/milicevica23/wtf-memory/blob/main/002-basics-duckdb/export-parquet-benchmark.py
Please be aware that there can be some mistakes in terms of caching and memory swapping, but I think it is enough to show that there is no significant difference in how DuckDB retrieves the data.
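For reference, a stripped-down sketch of that kind of comparison (the linked script is the authoritative version; the query, file names, and variant labels here are illustrative only):

```python
import time
import duckdb

con = duckdb.connect()
query = "SELECT i, i * 2 AS doubled FROM range(10000000) t(i)"

# a) old flow: materialize a table first, then export it
start = time.perf_counter()
con.sql(f"CREATE OR REPLACE TABLE tmp AS {query}")
con.sql("COPY tmp TO 'a.parquet' (FORMAT PARQUET)")
print("table then export:", time.perf_counter() - start)

# b) hand over a relation and export it through the df indirection,
#    as the new store() implementation does
start = time.perf_counter()
df = con.sql(query)
con.sql("COPY (SELECT * FROM df) TO 'b.parquet' (FORMAT PARQUET)")
print("relation indirection:", time.perf_counter() - start)

# c) stream the query result straight into the file, no intermediate table
start = time.perf_counter()
con.sql(f"COPY ({query}) TO 'c.parquet' (FORMAT PARQUET)")
print("direct COPY:", time.perf_counter() - start)
```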
I also did a bit of experimenting with memory usage to see if I can visualize it; the new c) variant is better because there is no intermediate materialization.
The other variants can be seen here: https://github.com/milicevica23/wtf-memory/tree/main/002-basics-duckdb/parquet-export
Is there any update on the PR for the support of the Delta/Iceberg/Hudi table formats? This would be a great feature! Let me know if there is an estimate and plan for the support. Thanks in advance!
I'm mostly opting to defer to the support for Iceberg/Delta that is getting built into DuckDB extensions (e.g. https://github.com/duckdb/uc_catalog indicates where I think things are going); I don't want to have a custom way of handling these table formats/catalogs inside of dbt-duckdb that is different from the extensions unless it's absolutely necessary.
As @jwills said, we more or less stopped here because of the native support; it is better to focus on the underlying functionality, which in the long run benefits more DuckDB users than just dbt-duckdb users, and to concentrate in this repository on the dbt benefits and workflow around dbt core functionalities.
Hi @jwills,
I wanted to open a pull request to ask questions and potentially propose the refactoring.
In the next few days I will try to list all the requirements and showcase the implementation of the current external materialization here: https://github.com/l-mds/demo-dbt-duckdb-external-materialization
I would also like to go through that list and implement the missing tests before I start to refactor, to ensure that the API stays the same. I would also have some questions and need support there.