Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: Support converting zero bytes and high-bit-set bytes to octal sequences for VARBINARY #65

Open
Obsecurus opened this issue Nov 10, 2020 · 4 comments

Comments

@Obsecurus
Copy link

Obsecurus commented Nov 10, 2020

PostgreSQL has a function encode(data, 'escape') where data is a bytea.
https://www.postgresql.org/docs/9.4/functions-binarystring.html

Currently when using athencli w/ the Athena JDBC the default output for VARBINARY is to convert to UTF8 and if unable to convert it just outputs the hex representation of the VARBINARY. It would be extremely useful to have the ability to support this output mode directly in athenacli for any VARBINARY column when set.

Here is an example select on a VARBINARY column from Athena when using the Python pyathenajdbc library which is a wrapper for the Athena JAR. Notice how the ASCII converts but other other bytes are left as-is:

b'\x03\x00\x00+&\xe0\x00\x00\x00\x00\x00Cookie: mstshash=hello\r\n\x01\x00\x08\x00\x03\x00\x00\x00'

@zzl0
Copy link
Contributor

zzl0 commented Nov 13, 2020

@Obsecurus Thanks for your feedback.

From what I understood, the database engine behind Athena is Presto, could you use Presto's binary function to encode the binary data?

https://prestodb.io/docs/0.217/functions/binary.html

@Obsecurus
Copy link
Author

you can encode to utf-8 but then this loses all non-ASCII characters. i think anyone using athenacli and working with VARBINARY would want this sort of functionality. So essentially athenacli is receiving the correct bytes it's all just in how they are finally displayed. Ideally as utf-8 and hex escaped for things that won't be encode.

@Obsecurus
Copy link
Author

So the same query in athenacli using the from_utf8() PrestoDB function for example yields:

^C^@^@/*�^@^@^@^@^@Cookie: mstshash=Administr\r\n^A^@^H^@^C^@^@^@

@Obsecurus
Copy link
Author

@zzl0 did you have any ideas how we might approach this? We could contribute I just haven't done any dev against athenacli yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants