fix(ingest/dbt): introduce lowercase column urn option #7418

alex-magno · 2023-02-23T13:51:08Z

This PR addresses the inconsistent URN casing issue in the dbt source. Closes #7377

The approach is to introduce a convert_urns_to_lowercase flag, similar to the one in the Snowflake ingestion source, so that recipe users can opt in to forcing all URNs to lowercase.

Also, this is my first time contributing to DataHub, so please review carefully and give feedback 🙏

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

alex-magno · 2023-02-23T13:55:00Z

metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_common.py

    ) -> str:
        db_fqn = self.get_db_fqn()
-        if target_platform != DBT_PLATFORM:
+        if (target_platform != DBT_PLATFORM) and convert_urns_to_lowercase:


This part concerns me because it doesn't preserve backwards compatibility. If convert_urns_to_lowercase=False (the default), db_fqn won't be lowercased, which may cause problems in setups where dbt ingestion is already running. Could it cause a mismatch with the source platform nodes? 🤔

Let me know what you think and whether we can do it differently. The other parts of the code keep backwards compatibility when convert_urns_to_lowercase=False.

hsheth2 · 2023-03-01

I need to check this in more detail, but I thought this was supposed to lowercase all outbound lineage edges by default. Basically, dbt nodes retain their original casing, but their lineage to the Snowflake/BigQuery/etc. tables would always be lowercased.

alex-magno · 2023-03-06

@hsheth2 nice! If you confirm that outbound lineage (db_fqn, in this case) needs to be lowercased by default, it's just a matter of removing this condition and reverting to the original behavior 👍

hsheth2 · 2023-03-06

To make sure I understand, what was the issue you were seeing that necessitated this change? Was it that the overall lineage didn't match up, or was it related to column-level lineage?
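For readers following the thread, here is a minimal sketch of the behavior under discussion: references to the target platform are lowercased, while dbt's own nodes keep their casing. The helper name dataset_name_for_urn is hypothetical and not taken from dbt_common.py; only the DBT_PLATFORM comparison mirrors the diff above.

```python
DBT_PLATFORM = "dbt"  # platform name used for dbt's own nodes


def dataset_name_for_urn(db_fqn: str, target_platform: str) -> str:
    """Return the dataset name to place in a dataset urn.

    Hypothetical helper: names that point at the target platform are
    lowercased, dbt's own nodes keep the casing that dbt reported.
    """
    if target_platform != DBT_PLATFORM:
        return db_fqn.lower()
    return db_fqn


# Edge to the warehouse table: lowercased.
print(dataset_name_for_urn("ANALYTICS.Public.Orders", "snowflake"))
# The dbt node itself: casing untouched.
print(dataset_name_for_urn("ANALYTICS.Public.Orders", DBT_PLATFORM))
```

Gating the lowercasing on an opt-in flag, as the diff proposes, would change the first call's result for existing users who never set the flag, which is the compatibility concern raised above.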

hsheth2 · 2023-03-08T00:35:23Z

Having looked over #7063 and #7377 again, I think I understand what's going on.

The model_name is only used internally in the dbt sources, so that one should be fine to leave as-is.

The db_fqns are already being lowercased by default in the dbt source when we're referencing an external system (e.g. Snowflake, BigQuery, etc.). As such, the recommended approach is that you do set convert_urns_to_lowercase=True in Snowflake/BigQuery/etc.

#7063 definitely did introduce a regression. That regression is only specific to Snowflake. In the Snowflake source, convert_urns_to_lowercase applies to all urns, including the schemaField references. For BigQuery / other sources, convert_urns_to_lowercase only applies to dataset urns.
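To illustrate why the regression shows up only at the column level, here is a rough sketch that builds urn strings by hand. The two helper functions are local stand-ins written for this example, and the table and column names are made up; it assumes dbt reports the Snowflake column as ORDER_ID while the Snowflake source lowercases it.

```python
def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    # Local stand-in that formats a dataset urn string.
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def schema_field_urn(parent_urn: str, field_path: str) -> str:
    # Local stand-in that formats a schemaField urn string.
    return f"urn:li:schemaField:({parent_urn},{field_path})"


# Both sources agree on the (lowercased) dataset urn.
table = dataset_urn("snowflake", "analytics.public.orders")

# Snowflake source with convert_urns_to_lowercase: column is lowercased too.
from_snowflake = schema_field_urn(table, "ORDER_ID".lower())
# dbt source without column lowercasing: column keeps its reported casing.
from_dbt = schema_field_urn(table, "ORDER_ID")

print(from_snowflake == from_dbt)  # the two references do not line up
```

Because the dataset portion already matches, table-level lineage connects, and only the schemaField references diverge.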

All that said, I think what we actually need is a convert_column_urns_to_lowercase option for dbt, which should only be enabled when dbt sits on top of Snowflake. It might even make sense to dynamically set the default depending on the value of target_platform.

Let me know what you think @alex-magno. cc @remisalmon, since you also commented on #7377.

remisalmon · 2023-03-08T23:56:34Z

> All that said, I think what we actually need is a convert_column_urns_to_lowercase option for dbt, which should only be enabled when dbt sits on top of Snowflake. It might even make sense to dynamically set the default depending on the value of target_platform.
>
> Let me know what you think @alex-magno. cc @remisalmon, since you also commented on #7377.

This last point makes sense considering that dbt itself has a different default for Snowflake specifically: https://docs.getdbt.com/reference/project-configs/quoting#default

(The convert_column_urns_to_lowercase option may still be useful if some DataHub users enable quoting in their dbt-Snowflake project, as a rare edge case.)

alex-magno · 2023-03-09T10:37:59Z

Thanks for the input @hsheth2 and @remisalmon! It definitely makes sense.

Personally, I would prefer to make this type of change explicit rather than implicit, meaning I would go for a convert_column_urns_to_lowercase option so the user can specifically opt in to this behavior. I think it is more generic and covers possible edge cases like the one @remisalmon mentioned.

Let me know if you all agree and I'm happy to do it. I will also add it for dbt-cloud, which I forgot to do 👍

hsheth2 · 2023-03-13T19:28:16Z

Let's go ahead and add the convert_column_urns_to_lowercase option, which will apply to the column names in the SchemaMetadata object but leave the current behavior for dataset urns as-is.

I agree that it makes sense to match the defaults of dbt here. Basically if the user explicitly sets convert_column_urns_to_lowercase, then we respect that. If they haven't set it, then we default it to true when target_platform == snowflake and false otherwise. My general philosophy is that we want to optimize the defaults for a "it works out of the box" experience, even if it requires a bit of implicit behavior.
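The defaulting rule described here can be sketched with a plain dataclass. This is only an illustration under stated assumptions: the class name DbtConfigSketch is hypothetical, and it uses only the standard library rather than the actual config model in the dbt source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DbtConfigSketch:
    """Hypothetical stand-in for the dbt source config."""

    target_platform: str
    # None means "the user did not set it"; resolved in __post_init__.
    convert_column_urns_to_lowercase: Optional[bool] = None

    def __post_init__(self) -> None:
        if self.convert_column_urns_to_lowercase is None:
            # Not set explicitly: default to True only on Snowflake.
            self.convert_column_urns_to_lowercase = (
                self.target_platform.lower() == "snowflake"
            )


print(DbtConfigSketch("snowflake").convert_column_urns_to_lowercase)
print(DbtConfigSketch("bigquery").convert_column_urns_to_lowercase)
# An explicit value always wins over the platform-based default.
print(DbtConfigSketch("snowflake", False).convert_column_urns_to_lowercase)
```

Using None as the "unset" marker is what lets an explicit False on Snowflake be told apart from a missing value.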

Re: dbt cloud - because dbt core and dbt cloud share the common logic, your PR already adds support for dbt-cloud too :)

hsheth2

@alex-magno I made some tweaks, but otherwise LGTM:

do the lowercasing on the way out instead of the way in
use a pydantic validator to handle the snowflake special casing
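One possible reading of "lowercasing on the way out" is that parsed columns keep the casing dbt reported and the lowercase transform is applied only when field paths are emitted. A minimal sketch of that reading follows; ColumnSketch and emitted_field_paths are hypothetical names, not the merged implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnSketch:
    """Hypothetical parsed column; name keeps the casing dbt reported."""

    name: str
    data_type: str


def emitted_field_paths(columns: List[ColumnSketch], lowercase: bool) -> List[str]:
    # Apply the casing choice at emission time, leaving parsed data untouched.
    return [c.name.lower() if lowercase else c.name for c in columns]


parsed = [ColumnSketch("ORDER_ID", "NUMBER"), ColumnSketch("Created_At", "TIMESTAMP")]

print(emitted_field_paths(parsed, lowercase=True))
print(emitted_field_paths(parsed, lowercase=False))
```

Keeping the transform at the output boundary means the parsing code never needs to know about the option.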

alex-magno and others added 2 commits February 23, 2023 13:46

fix(ingest/dbt): introduce lowercase urn option

237216d

Merge branch 'master' into fix/dbt-urn-casing

7105d98

github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Feb 23, 2023

anshbansal added the community-contribution PR or Issue raised by member(s) of DataHub Community label Feb 27, 2023

jjoyce0510 assigned hsheth2 Feb 28, 2023

alex-magno and others added 2 commits March 15, 2023 10:50

Merge branch 'master' into fix/dbt-urn-casing

f540a55

fix(ingest/dbt): add covnert_columns_urns_to_lowercase option

528702f

alex-magno requested a review from hsheth2 March 15, 2023 12:54

simplify + add test

dfc70f4

hsheth2 approved these changes Mar 17, 2023

fix lint

bdeec0c

hsheth2 changed the title ~~fix(ingest/dbt): introduce lowercase urn option~~ fix(ingest/dbt): introduce lowercase column urn option Mar 20, 2023

hsheth2 merged commit 6ab606b into datahub-project:master Mar 20, 2023
