Incremental insert overwrite requires identical column ordering #59

Closed
jtcohen6 opened this issue Mar 9, 2020 · 1 comment · Fixed by #60

jtcohen6 commented Mar 9, 2020

Column order matters

In Spark, tables store their partition columns last. In the scenario featured in our integration test, given a seed file `seed`:

```
id,first_name,last_name,email,gender,ip_address
1,Jack,Hunter,[email protected],Male,59.80.20.168
2,Kathryn,Walker,[email protected],Female,194.121.179.35
3,Gerald,Ryan,[email protected],Male,11.3.212.243
4,Bonnie,Spencer,[email protected],Female,216.32.196.175
5,Harold,Taylor,[email protected],Male,253.10.246.136
```

And an incremental model:

```sql
{{
    config(
        materialized='incremental',
        partition_by='id',
        file_format='parquet'
    )
}}

select * from {{ ref('seed') }}
```

The resulting table will look like:

| first_name | last_name | email | gender | ip_address | id |
|------------|-----------|-------|--------|------------|----|
| Jack | Hunter | [email protected] | Male | 59.80.20.168 | 1 |
| Kathryn | Walker | [email protected] | Female | 194.121.179.35 | 2 |
| Gerald | Ryan | [email protected] | Male | 11.3.212.243 | 3 |
| Bonnie | Spencer | [email protected] | Female | 216.32.196.175 | 4 |
| Harold | Taylor | [email protected] | Male | 253.10.246.136 | 5 |
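This reordering isn't a dbt quirk; it's how Spark lays out partitioned tables created via CTAS. A minimal sketch outside of dbt (the table name `partition_demo` is hypothetical; everything else follows the scenario above):

```sql
-- Spark moves the partition column to the end of the schema,
-- regardless of its position in the source select.
create table dbt_jcohen.partition_demo
using parquet
partitioned by (id)
as select * from dbt_jcohen.seed;

describe dbt_jcohen.partition_demo;
-- first_name    string
-- last_name     string
-- email         string
-- gender        string
-- ip_address    string
-- id            int
-- # Partition Information
-- # col_name    data_type    comment
-- id            int
```

Note the trailing `# Partition Information` / `# col_name` rows in the `describe` output; they resurface below.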

In subsequent incremental runs, dbt would attempt to run two queries:

```sql
create temporary view incremental_relation__dbt_tmp as
    select * from dbt_jcohen.seed;

insert overwrite table dbt_jcohen.incremental_relation
    partition (id)
    select * from incremental_relation__dbt_tmp
```
Since the columns in seed are in a different order from the columns in incremental_relation (partitioned on id), and `insert overwrite ... select *` pairs columns by position rather than by name, the result would be:

| first_name | last_name | email | gender | ip_address | id |
|------------|-----------|-------|--------|------------|----|
| Kathryn | Walker | [email protected] | Female | 194.121.179.35 | 2 |
| Harold | Taylor | [email protected] | Male | 253.10.246.136 | 5 |
| Bonnie | Spencer | [email protected] | Female | 216.32.196.175 | 4 |
| Jack | Hunter | [email protected] | Male | 59.80.20.168 | 1 |
| Gerald | Ryan | [email protected] | Male | 11.3.212.243 | 3 |
| 3 | Gerald | Ryan | [email protected] | Male | |
| 4 | Bonnie | Spencer | [email protected] | Female | |
| 5 | Harold | Taylor | [email protected] | Male | |
| 1 | Jack | Hunter | [email protected] | Male | |
| 2 | Kathryn | Walker | [email protected] | Female | |
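Given that positional matching, one way out is for dbt to enumerate the columns explicitly, with the partition column last to match the target table's layout. A sketch of that shape (not necessarily the exact SQL that #60 generates):

```sql
-- Explicit column list: each value lands in the intended column,
-- and the dynamic partition column (id) comes last.
insert overwrite table dbt_jcohen.incremental_relation
    partition (id)
    select first_name, last_name, email, gender, ip_address, id
    from incremental_relation__dbt_tmp
```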

Why hasn't the integration test been failing?

The equality test between seed and incremental_relation has been passing because we didn't have the right quoting character defined for Spark. `"` is the default quoting character in dbt-core; in Spark, `"` encloses a string literal, not a column name.
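A quick illustration of the difference (a hypothetical session against the same schema):

```sql
-- Double quotes: a string literal, one constant value per record
select "id" from dbt_jcohen.seed limit 1;   -- returns 'id'

-- Backticks: an identifier, the actual column
select `id` from dbt_jcohen.seed limit 1;   -- returns e.g. 1
```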

Therefore, a query like

```sql
-- setup

with a as (

    select * from dbt_jcohen.incremental_relation

),

b as (

    select * from dbt_jcohen.seed

),

a_minus_b as (

    select "first_name", "last_name", "email", "gender", "ip_address", "id", "# Partition Information", "# col_name", "id" from a

    except

    select "first_name", "last_name", "email", "gender", "ip_address", "id", "# Partition Information", "# col_name", "id" from b

),

b_minus_a as (

    select "first_name", "last_name", "email", "gender", "ip_address", "id", "# Partition Information", "# col_name", "id" from b

    except

    select "first_name", "last_name", "email", "gender", "ip_address", "id", "# Partition Information", "# col_name", "id" from a

),

unioned as (

    select * from a_minus_b
    union all
    select * from b_minus_a

),

final as (

    select (select count(*) from unioned) +
        (select abs(
            (select count(*) from a_minus_b) -
            (select count(*) from b_minus_a)
            ))
        as count

)

select count from final
```

Looks okay prima facie. There are some metadata/comment column names included, which weirdly does not error. I thought to run just the snippet

```sql
select "first_name", "last_name", "email", "gender", "ip_address", "id", "# Partition Information", "# col_name", "id" from a
```

which returns:

```
first_name last_name email gender ip_address id # Partition Information # col_name id
first_name last_name email gender ip_address id # Partition Information # col_name id
first_name last_name email gender ip_address id # Partition Information # col_name id
first_name last_name email gender ip_address id # Partition Information # col_name id
first_name last_name email gender ip_address id # Partition Information # col_name id
first_name last_name email gender ip_address id # Partition Information # col_name id
first_name last_name email gender ip_address id # Partition Information # col_name id
first_name last_name email gender ip_address id # Partition Information # col_name id
first_name last_name email gender ip_address id # Partition Information # col_name id
first_name last_name email gender ip_address id # Partition Information # col_name id
first_name last_name email gender ip_address id # Partition Information # col_name id
```

Yeah. Every "column" in that select list is a string literal, so a and b each return nothing but identical constant rows, both except branches come up empty, and the equality test passes regardless of the underlying data.

Solutions

Fokko commented Mar 16, 2020

Wow, great work! This is a tricky one.
