Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(ingest/dbt): introduce lowercase column urn option #7418

Merged
merged 6 commits into from
Mar 20, 2023

Conversation

alex-magno
Copy link
Contributor

@alex-magno alex-magno commented Feb 23, 2023

This PR tries to solve the inconsistent URN casing issue with DBT. Closes #7377

The approach here is to introduce a convert_urns_to_lowercase flag, similar to what we have in the Snowflake ingestion. So the user of the recipe can have the option to force all URNs to lowercase.

Also, this is my first time contributing to Datahub so please review carefully and give feedback 🙏

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

) -> str:
db_fqn = self.get_db_fqn()
if target_platform != DBT_PLATFORM:
if (target_platform != DBT_PLATFORM) and convert_urns_to_lowercase:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part concerns me because it won't assure backwards compatibility. If convert_urns_to_lowercase=False (default), then db_fqn won't get lowercased. It may cause problems in setups where DBT ingestion is already running. Possible mismatch to source platform nodes? 🤔

Let me know what you think and if we can do it differently. The other parts of the code keep backwards compatibility when convert_urns_to_lowercase=False

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to check this in more detail, but I thought this was supposed to lowercase all outbound lineage edges by default. Basically dbt nodes retain their original casing, but their lineage to the snowflake/bigquery/etc tables would always be lowercased

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hsheth2 nice! If you confirm that outbound lineages (db_fqn, in this case) need to be lowercased by default, its just a matter of removing this condition, reverting it back to the original behavior 👍

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make sure I understand, what was the issue you were seeing that necessitated this change? Was it that the overall lineage didn't match up, or was it related to column-level lineage?

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Feb 23, 2023
@anshbansal anshbansal added the community-contribution PR or Issue raised by member(s) of DataHub Community label Feb 27, 2023
@hsheth2
Copy link
Collaborator

hsheth2 commented Mar 8, 2023

Having looked over #7063 and #7377 again, I think I understand what's going on.

The model_name is only used internally in the dbt sources, so that one should be fine to leave as-is.

The db_fqns are already being lowercased by default in the dbt source when we're referencing an external system (e.g. Snowflake, BigQuery, etc.). As such, the recommended approach is that you do set convert_urns_to_lowercase=True in Snowflake/BigQuery/etc.

#7063 definitely did introduce a regression. That regression is only specific to Snowflake. In the Snowflake source, convert_urns_to_lowercase applies to all urns, including the schemaField references. For BigQuery / other sources, convert_urns_to_lowercase only applies to dataset urns.

All that said, I think what we actually need is a convert_column_urns_to_lowercase option for dbt, which should only be enabled when dbt sits on top of Snowflake. It might even make sense to dynamically set the default depending on the value of target_platform.

Let me know what you think @alex-magno. cc @remisalmon, since you also commented on #7377.

@remisalmon
Copy link
Contributor

All that said, I think what we actually need is a convert_column_urns_to_lowercase option for dbt, which should only be enabled when dbt sits on top of Snowflake. It might even make sense to dynamically set the default depending on the value of target_platform.

Let me know what you think @alex-magno. cc @remisalmon, since you also commented on #7377.

This last point makes sense considering that dbt itself has a different default for Snowflake specifically: https://docs.getdbt.com/reference/project-configs/quoting#default

(the convert_column_urns_to_lowercase option may still be useful if some DataHub user enable quoting in their dbt-Snowflake project as a very edge case)

@alex-magno
Copy link
Contributor Author

Thanks for the input @hsheth2 and @remisalmon! It definitely makes sense.

Personally I would prefer setting this type of change explicitly rather than implicitly - meaning that I would go for a convert_column_urns_to_lowercase so the user can specifically opt-in for this behavior. I think it is more generic and covers possible edge cases like @remisalmon mentioned.

Let me know if you all agree and I'm happy to do it. I will also add it for dbt-cloud as well, which I forgot to do 👍

@hsheth2
Copy link
Collaborator

hsheth2 commented Mar 13, 2023

Let's go ahead and add the convert_column_urns_to_lowercase option, which will apply to the column names in the SchemaMetadata object but leaves the current behavior for dataset urns as-is.

I agree that it makes sense to match the defaults of dbt here. Basically if the user explicitly sets convert_column_urns_to_lowercase, then we respect that. If they haven't set it, then we default it to true when target_platform == snowflake and false otherwise. My general philosophy is that we want to optimize the defaults for a "it works out of the box" experience, even if it requires a bit of implicit behavior.

Re: dbt cloud - because dbt core and dbt cloud share the common logic, your PR already adds support for dbt-cloud too :)

@alex-magno alex-magno requested a review from hsheth2 March 15, 2023 12:54
Copy link
Collaborator

@hsheth2 hsheth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alex-magno made some tweaks, but otherwise LGTM

  1. do the lowercasing on the way out instead of the way in
  2. use a pydantic validator to handle the snowflake special casing

@hsheth2 hsheth2 changed the title fix(ingest/dbt): introduce lowercase urn option fix(ingest/dbt): introduce lowercase column urn option Mar 20, 2023
@hsheth2 hsheth2 merged commit 6ab606b into datahub-project:master Mar 20, 2023
iprentic pushed a commit to iprentic/datahub that referenced this pull request Mar 20, 2023
shirshanka pushed a commit to shirshanka/datahub that referenced this pull request Mar 22, 2023
shirshanka pushed a commit to shirshanka/datahub that referenced this pull request Mar 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community ingestion PR or Issue related to the ingestion of metadata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistent URN casing in DBT ingestion
4 participants