
Unable to evolve the schema #124

Closed
Fokko opened this issue Nov 17, 2020 · 3 comments · Fixed by #125

Comments

@Fokko
Contributor

Fokko commented Nov 17, 2020

We're running into an issue with Spark + DBT. When we add a column to an existing model, it won't be added to the table itself.

Let's say we have the following:
[screenshot: the model definition]

This translates into the following table:
[screenshot: the resulting table]

However, if we add a column to the model and rerun it:
[screenshot: the model with the added column]

We don't see the freshly added column in the resulting table:
[screenshot: the resulting table, still without the new column]

If we remove the existing table:

DROP TABLE fokko.my_first_dbt_model;

And rerun it again:
[screenshot: the model run]

Then the column appears and everything looks sane again.

However, more disturbingly, when we then remove the column again:
[screenshot: the failing run]
Spark gives an error that it is unable to resolve the column.

Looking at the logs reveals the issue:

    create temporary view my_first_dbt_model__dbt_tmp as
    with random_data as (
        SELECT
            0 AS just_a_number,
            0 AS a_slightly_bigger_number

        UNION ALL

    ...

    )

And then it performs the insert overwrite:

    insert overwrite table fokko.my_first_dbt_model
    select `just_a_number`, `a_slightly_bigger_number`, `a_random_number` from my_first_dbt_model__dbt_tmp

So the issue is that, when selecting the columns from the dbt_tmp view, dbt takes the columns of the existing table instead of the columns of the model. Changing this to use the model's columns should allow us to evolve the schema.
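
For illustration, here is a sketch of what the generated statement would need to look like after adding a column to the model (`a_new_number` is a made-up column name), with the column list driven by the model rather than by the existing table:

    -- sketch only: `a_new_number` is a hypothetical, freshly added column
    insert overwrite table fokko.my_first_dbt_model
    select `just_a_number`, `a_slightly_bigger_number`, `a_random_number`, `a_new_number`
    from my_first_dbt_model__dbt_tmp
    -- note: the target table must also be able to accept the new column,
    -- e.g. with Delta's schema evolution options enabled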

@jtcohen6
Contributor

@Fokko thanks for the detailed writeup!

In dbt more generally, incremental models cannot handle column additions or deletions without a --full-refresh. Figuring out a reasonable and extensible approach to this, across databases, is one of our longest-lived issues (dbt-labs/dbt-core#1132).
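
For reference, the current workaround is to rebuild the incremental model from scratch, e.g. (using the model name from this issue):

    # drops and recreates the incremental table, picking up the new column
    dbt run --full-refresh --models my_first_dbt_model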

The reason that dbt grabs the list of columns from the existing table is to handle cases where the set of columns is the same between the temp view (new records) and the existing table (old records), but the order is different (#59).

When you talk about schema evolution, is this or this what you have in mind? We could write some more code here: if file_format = 'delta' and evolve_schema = true, then the incremental materialization shouldn't grab the list of columns, but instead allow the insert overwrite or merge to reconcile schema differences.
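
As a rough sketch of that idea (the `evolve_schema` config and the variable names below are hypothetical, not an existing dbt-spark API):

    {% if config.get('file_format') == 'delta' and config.get('evolve_schema', false) %}
        {# hypothetical branch: skip the column-list reconciliation and let
           Delta reconcile the schema differences itself #}
        insert overwrite table {{ target_relation }}
        select * from {{ tmp_relation }}
    {% else %}
        {# current behavior: pin the insert to the existing table's columns #}
        insert overwrite table {{ target_relation }}
        select {{ dest_cols_csv }} from {{ tmp_relation }}
    {% endif %}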

@Fokko
Contributor Author

Fokko commented Nov 17, 2020

I agree that deletions are extremely hard, as is handling breaking changes in general, for example changing an integer column to a string. I don't have to tell you, as dbt-labs/dbt-core#1132 shows. Also, as a committer on Parquet and Avro, I've had my fair share of this :)

The things that you've pointed out are exactly what I meant. There are two situations here that apply for Delta:

  • The full import, without partitioning. As @charlottevdscheun mentioned earlier in replace partitionOverwriteMode inside merge strategy #117, we currently use INSERT INTO, but I would suggest replacing this with CREATE OR REPLACE TABLE. That lets us update the table atomically while also allowing the schema to change, keeps the full history of the table, and is fully supported by Delta (see the sketch after this list).
  • In the case of partition by, forward-compatible schema evolution is allowed: we can add new fields to a partition, and they will simply be null for the other partitions.
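
As a first sketch of the first bullet (assuming the model from this issue and a Delta file format):

    -- atomically replaces both the contents and the schema in one operation;
    -- previous versions stay queryable through Delta's time travel
    CREATE OR REPLACE TABLE fokko.my_first_dbt_model
    USING delta
    AS
    SELECT
        0 AS just_a_number,
        0 AS a_slightly_bigger_number,
        0 AS a_random_number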

I'll put together more detailed code examples in the coming days, so we can discuss it further.

@jtcohen6
Contributor

I'd definitely welcome contributions to support both of those on Delta.

    Currently we use INSERT INTO, but I would suggest replacing this with CREATE OR REPLACE TABLE. That lets us update the table atomically while also allowing the schema to change, keeps the full history of the table, and is fully supported by Delta.

Agreed! I think it would make sense to update the table materialization to use create or replace table where possible. That's really what we've wanted there all along. Using the incremental materialization for full table replacement is a bit of a hack in the meantime :)
