
T-SQL HASHBYTES not Converted #1508

Closed
cjkoester opened this issue Apr 29, 2023 · 2 comments

Comments


cjkoester commented Apr 29, 2023

T-SQL HASHBYTES function is not replaced when converting to Spark SQL.

```python
import sqlglot

qry = "SELECT HASHBYTES('SHA2_256', 'input') as hash"
sqlglot.transpile(qry, read="tsql", write="spark")[0]
```

Output:

```sql
SELECT HASHBYTES('SHA2_256', 'input') AS hash
```

Expected output:

```sql
SELECT SHA2('input', 256) AS hash
```

This conversion is complicated by the fact that HASHBYTES and its algorithm argument can translate to different functions in Spark SQL:

| T-SQL | Spark SQL |
| --- | --- |
| `HASHBYTES('SHA1', 'input')` | `SHA1('input')` |
| `HASHBYTES('SHA2_256', 'input')` | `SHA2('input', 256)` |
| `HASHBYTES('SHA2_512', 'input')` | `SHA2('input', 512)` |
| `HASHBYTES('MD5', 'input')` | `MD5('input')` |
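The mapping above can be sketched in plain Python. This is only an illustration of the algorithm-name-to-function translation, not sqlglot's actual transform API; the helper name `hashbytes_to_spark` and the string-based approach are assumptions:

```python
# Hypothetical helper illustrating the HASHBYTES -> Spark SQL mapping.
# This is plain string manipulation, not sqlglot's real transform machinery.

def hashbytes_to_spark(algorithm: str, arg_sql: str) -> str:
    """Map a T-SQL HASHBYTES algorithm name to a Spark SQL call."""
    algorithm = algorithm.upper()
    if algorithm.startswith("SHA2_"):
        # SHA2_256 / SHA2_512 -> SHA2(expr, 256) / SHA2(expr, 512)
        bits = algorithm.split("_", 1)[1]
        return f"SHA2({arg_sql}, {bits})"
    if algorithm in ("SHA", "SHA1"):
        return f"SHA1({arg_sql})"
    if algorithm == "MD5":
        return f"MD5({arg_sql})"
    raise ValueError(f"No Spark SQL equivalent for HASHBYTES({algorithm!r})")

print(hashbytes_to_spark("SHA2_256", "'input'"))  # SHA2('input', 256)
```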

It's also important to note that equivalent functions in T-SQL and Spark SQL don't produce matching outputs since HASHBYTES returns VARBINARY and Spark SQL equivalents (sha1, sha2, etc.) return hex strings. This may or may not be important depending on the project.
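The mismatch can be reproduced outside either engine with Python's `hashlib`, where `digest()` stands in for HASHBYTES' VARBINARY result and `hexdigest()` for Spark's string result (an illustration only, not either engine's implementation):

```python
import hashlib

h = hashlib.sha256(b"input")

raw = h.digest()         # 32 raw bytes, analogous to HASHBYTES' VARBINARY
hex_str = h.hexdigest()  # 64-char lowercase hex string, like Spark's sha2(..., 256)

print(len(raw), len(hex_str))   # 32 64
print(raw == hex_str.encode())  # False: same hash, different representations
print(raw.hex() == hex_str)     # True once the bytes are hex-encoded
```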

@cjkoester cjkoester changed the title T-SQL Hashbytes not Converted T-SQL HASHBYTES not Converted Apr 29, 2023
tobymao (Owner) commented Apr 29, 2023

thanks for the clear input and output. we’ll have this in soon.

should we convert hex strings into binary? does spark support binary types?

cjkoester (Author) commented

Thanks for sharing this library and your quick response!

There is a binary type in Spark, but at the moment I'm not sure how to get an equivalent result to HASHBYTES in Spark, or if the added complexity is warranted. My interest in this involves data warehouse migrations, where the hashes aren't necessarily required to match between systems.

It is trivial to modify T-SQL to match Spark, but that is the reverse of this scenario.

The T-SQL below returns the same result as SELECT sha2('input', 256) as hash in Spark.

```sql
SELECT lower(convert(char(64), HASHBYTES('SHA2_256', 'input'), 2)) as hash
```
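Why this lines up, sketched with `hashlib`: T-SQL's `CONVERT(..., 2)` hex-encodes the VARBINARY without a `0x` prefix (producing uppercase hex), and `lower()` then matches Spark's lowercase hex output. A minimal model, assuming that CONVERT behavior:

```python
import hashlib

digest = hashlib.sha256(b"input").digest()  # VARBINARY analogue of HASHBYTES

# CONVERT(char(64), ..., 2): hex-encode without a '0x' prefix, uppercase.
tsql_hex = digest.hex().upper()

# lower(...) then matches what sha2('input', 256) returns in Spark.
spark_hex = hashlib.sha256(b"input").hexdigest()
print(tsql_hex.lower() == spark_hex)  # True
```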
