
Unable to evolve the schema #124

Closed
Fokko opened this issue Nov 17, 2020 · 3 comments · Fixed by #125

Comments

@Fokko
Contributor

Fokko commented Nov 17, 2020

We're running into an issue with Spark + DBT. When we add a column to an existing model, it won't be added to the table itself.

Let's say we have the following:
[screenshot: the model definition]

This translates into the following table:
[screenshot: the resulting table]

However, if we add a column to the model and rerun it:
[screenshot: the model with the added column]

We don't see the freshly added column in the resulting table:
[screenshot: the resulting table, still without the new column]

If we remove the existing table:

DROP TABLE fokko.my_first_dbt_model;

And rerun it again:
[screenshot: the model run]

Then the column appears and everything looks sane again.

However, more disturbingly, when we then remove the column again:
[screenshot: the failing run]
Spark gives an error that it is unable to resolve the column.

Looking at the logs reveals the issue:

    create temporary view my_first_dbt_model__dbt_tmp as
    with random_data as (
        SELECT
            0 AS just_a_number,
            0 AS a_slightly_bigger_number

        UNION ALL

    ...

    )

And then it performs the insert overwrite:

    insert overwrite table fokko.my_first_dbt_model
    select `just_a_number`, `a_slightly_bigger_number`, `a_random_number` from my_first_dbt_model__dbt_tmp

So the issue is that, when selecting the columns from the dbt_tmp view, dbt takes the columns of the existing table instead of the columns of the model. Changing this to use the model's columns should allow us to evolve the schema.
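
For illustration, here is a sketch of what the generated statement would need to look like after adding a column to the model (`a_new_number` is a made-up column name), with the column list driven by the model rather than by the existing table:

    -- sketch only: `a_new_number` is a hypothetical, freshly added column
    insert overwrite table fokko.my_first_dbt_model
    select `just_a_number`, `a_slightly_bigger_number`, `a_random_number`, `a_new_number`
    from my_first_dbt_model__dbt_tmp
    -- note: the target table must also be able to accept the new column,
    -- e.g. with Delta's schema evolution options enabled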

@jtcohen6
Contributor

@Fokko thanks for the detailed writeup!

In dbt more generally, incremental models cannot handle column additions or deletions without a --full-refresh. Figuring out a reasonable and extensible approach to this, across databases, is one of our longest-lived issues (dbt-labs/dbt-core#1132).
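
For reference, the current workaround is to rebuild the incremental model from scratch, e.g. (using the model name from this issue):

    # drops and recreates the incremental table, picking up the new column
    dbt run --full-refresh --models my_first_dbt_model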

The reason that dbt grabs the list of columns from the existing table is to handle cases where the set of columns is the same between the temp view (new records) and the existing table (old records), but the order is different (#59).

When you talk about schema evolution, is this or this what you have in mind? We could write some more code here: if file_format = 'delta' and evolve_schema = true, then the incremental materialization shouldn't grab the list of columns, but instead allow the insert overwrite or merge to reconcile schema differences.
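
As a rough sketch of that idea (the `evolve_schema` config and the variable names below are hypothetical, not an existing dbt-spark API):

    {% if config.get('file_format') == 'delta' and config.get('evolve_schema', false) %}
        {# hypothetical branch: skip the column-list reconciliation and let
           Delta reconcile the schema differences itself #}
        insert overwrite table {{ target_relation }}
        select * from {{ tmp_relation }}
    {% else %}
        {# current behavior: pin the insert to the existing table's columns #}
        insert overwrite table {{ target_relation }}
        select {{ dest_cols_csv }} from {{ tmp_relation }}
    {% endif %}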

@Fokko
Contributor Author

Fokko commented Nov 17, 2020

I agree that deletions are extremely hard, as is handling breaking changes in general, for example changing an integer column to a string. I don't have to tell you, as dbt-labs/dbt-core#1132 shows. Also, as a committer on Parquet and Avro, I've had my fair share of this :)

The things that you've pointed out are exactly what I meant. There are two situations here that apply for Delta:

  • The full import, without partitioning. As @charlottevdscheun mentioned earlier in replace partitionOverwriteMode inside merge strategy #117, we currently use INSERT INTO, but I would suggest replacing this with CREATE OR REPLACE TABLE. That lets us update the table atomically while also allowing the schema to change, keeps the full history of the table, and is fully supported by Delta (see the sketch after this list).
  • In the case of partition by, forward-compatible schema evolution is allowed: we can add new fields to a partition, and they will simply be null for the other partitions.
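
As a first sketch of the first bullet (assuming the model from this issue and a Delta file format):

    -- atomically replaces both the contents and the schema in one operation;
    -- previous versions stay queryable through Delta's time travel
    CREATE OR REPLACE TABLE fokko.my_first_dbt_model
    USING delta
    AS
    SELECT
        0 AS just_a_number,
        0 AS a_slightly_bigger_number,
        0 AS a_random_number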

I'll put together more detailed code examples in the coming days, so we can discuss it further.

@jtcohen6
Contributor

I'd definitely welcome contributions to support both of those on Delta.

    Currently we use INSERT INTO, but I would suggest replacing this with CREATE OR REPLACE TABLE. That lets us update the table atomically while also allowing the schema to change, keeps the full history of the table, and is fully supported by Delta.

Agreed! I think it would make sense to update the table materialization to use create or replace table where possible. That's really what we've wanted there all along. Using the incremental materialization for full table replacement is a bit of a hack in the meantime :)
