Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Respect case-sensitive json key path extractions #269

Open
sgoley opened this issue Sep 29, 2022 · 7 comments
Open

Respect case-sensitive json key path extractions #269

sgoley opened this issue Sep 29, 2022 · 7 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@sgoley
Copy link

sgoley commented Sep 29, 2022

Describe the bug
sqlfmt utility automatically formats json keys which can be case sensitive.
It will make all keys fully lower case and return nulls as values if the .sql keys do not access the actual case sensitive json keys.

To Reproduce
Use sqlfmt on any sql file that includes something like the following key path extractions:

src:salesperson.name <- all lowercase
SRC:salesperson.name <- column ref lower case
SRC:Salesperson.Name <- json keys have first letter capitals

Expected behavior
sqlfmt should be sensitive to adapter specific json extraction notations.

Example: Snowflake "dot" or "bracket" semi-structured notation

Example: Postgres "colon" semi-structured notation

Exception: Bigquery at least appears to have only stringified semi-structured extracts so this adapter should be safe as-is.

@tconbeer
Copy link
Owner

Thanks for the report, @sgoley !

BigQuery is case-sensitive when accessing JSON properties using dot notation, also (mentioned in this discussion). I don't think I'll be solving the BQ issue, since parsing the names as field/property names instead of table/column names is out of scope as long as I'm the sole maintainer.

BUT if colons are rare enough to make parsing easy (e.g., they don't appear in other, similar contexts), I would consider respecting case sensitivity here. The hardest and most time-consuming part of this will just be researching the colon operator in all of the major dialects and determining how unique its usage as a JSON property accessor is. I know the colon can also be used to slice arrays in some dialects, but that should be easy to disambiguate from this operator (since it will be directly inside of square brackets). I don't know if there are other uses for the colon in SQL.

If you or anyone else could contribute this research, I'd be very grateful.

Until then, the workaround is to quote all json properties that are not snake_case:

src:salesperson.name
"SRC":salesperson.name
"SRC":"Salesperson"."Name"

@tconbeer tconbeer added bug Something isn't working help wanted Extra attention is needed labels Sep 29, 2022
@tconbeer tconbeer changed the title [Bug] Handling for non-string json key path extractions Respect case-sensitive json key path extractions Sep 29, 2022
@gregbayer
Copy link

gregbayer commented Feb 21, 2023

I'm running into this issue in with DBT + Snowflake as well.
Example: After formating, parse_json(event_properties):numFieldsFilledIn becomes parse_json(event_properties):numfieldsfilledin which returns null.

@tconbeer
Copy link
Owner

Thanks. Still looking for help researching the various uses for the colon in different dialects

@ryaminal
Copy link

Oh man, i just got burned by this. took me forever to find the problem. This was for snowflake specifically.

I know pyfmt is opinionated without options, but perhaps one of the opinions could be to force quotation marks around the keys e.g. c:"FIELD_ONE" instead of just to lowering them or perhaps force bracket notation e.g. c['FIELD_ONE'] ?

@jlucas91
Copy link

Coming over from #568.

Completely empathize with the challenges of sole maintenance and respect the call here. If I might suggest a documentation change to save others in my shoes similar pain - I would consider documenting the known cases where sqlfmt will break a project. We missed this change when implementing sqlfmt on a portion of our project, which ended up causing a significant outage for us in production (since case changes lead to silent failures). Ultimately the responsibility lies fully on us for not catching it. But given it appears to be a known concern - I might consider listing it in the Beta warning block (or similar) in the docs. We had (incorrectly in retrospect) assumed that given sqlfmt is now the default implementation in DBT cloud that it was fairly unlikely to break a dbt/snowflake project out of the box.

@ryaminal
Copy link

@tconbeer , i was looking at this to work on i). after a quick perusal of the code, it appears the casing happens here in node_manager.py.

would a modification to standardize_value in node_manager be where this work is done?

Is one solution for this to add a dialect-specific check for a SEMICOLON token(and others) that either:

  • doesn't touch the case
  • surrounds in double quotes
  • uses brackets

to simplify this work, would it be possible/advised that we break this in to dialect-specific implementation? not meaning a different implementation but slowly adding this for snowflake first then BQ or whatever.

my guess is doing snowflake first would solve the problem for the largest amount of users and potentially simplify this implementation.

@brianschillaci
Copy link

I just ran into this issue as well.
Something like properties:someRandomKey was formatted to properties:somerandomkey which broke the logic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

6 participants