Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MySql source : source binary node transforms to text node #5878

Closed
Tracked by #6994
DoNotPanicUA opened this issue Sep 7, 2021 · 10 comments · Fixed by #8047
Closed
Tracked by #6994

MySql source : source binary node transforms to text node #5878

DoNotPanicUA opened this issue Sep 7, 2021 · 10 comments · Fixed by #8047

Comments

@DoNotPanicUA
Copy link
Contributor

DoNotPanicUA commented Sep 7, 2021

Found during work on #3932

Current Behavior

Initial investigation showed that the data type test gets airbyte messages with binary values as text.
As a result, check isBinary returns false and the value contains useless data (not even string version).

Note! The issue is not investigated fully. There might be a global issue of binary values.

Expected Behavior

Binary values in Airbyte message data should be in Binnary nodes for proper processing.

@irynakruk
Copy link
Contributor

The issue with BinaryNode is general for all source connectors: after formatting binary data to the Airbyte record, data is stored in BinaryNode, but after Jsons.serialize(object) -> OBJECT_MAPPER.writeValueAsString(object) (READ command) records are transformed to a String, so all BinaryNodes are "lost" and binary data is stored as String.
Deserialization: Jsons.deserialize(jsonString) -> OBJECT_MAPPER.readTree(jsonString).

@sherifnada
Copy link
Contributor

@irynakruk the source should output the binary data as a base64 or hex-encoded string. Can you elaborate on what's needed, code-wise, to make this change? what's the scope/subtasks?

@irynakruk
Copy link
Contributor

@sherifnada I'll investigate and come with an update

@irynakruk
Copy link
Contributor

@sherifnada in current implementation binary data is read using JdbcSourceOperations.putBinary()

protected void putBinary(final ObjectNode node, final String columnName, final ResultSet resultSet, final int index) throws SQLException {
    node.put(columnName, resultSet.getBytes(index));
  }

The node, which is inserted, is of type BinaryNode.

So, e.x. we have 'test' value in MySql VARBINARY column, created BinaryNode after read contains
value = "dGVzdA==", data = [116, 101, 115, 116]. So value is already an encoded Base64 string.

During serialization it is passed as "dGVzdA==" String value and is inserted to a destination as String value without any additional conversion.

Is this an expected behavior based on your question? Or maybe I misunderstood you?

@sherifnada
Copy link
Contributor

I think encoding it as b64 is reasonable behavior for the moment. Let's add a note to the data type mappings in the docs and call it good I think. @DoNotPanicUA do you have any thoughts on this?

@DoNotPanicUA
Copy link
Contributor Author

DoNotPanicUA commented Oct 29, 2021

@sherifnada,
At this moment we actually lost original data at all. I think it definitely should be changed here.
What actually happens in the middle of our dataflow:

  1. A source reads binary data type and puts it as Binary Node. For example, the original text is test.
  2. Inside the Airbite core we do a serialization/deserialization of the message which transforms the Binary node into a Text node.
  3. Destination gets a message with the Text node and puts it into the destination place. But instead of the original text test we get the encoded value dGVzdA== in the end.

Note! Of course, the user can decode the values after all, but it doesn't sound like a convenient tool :)

In my humble opinion, we should transform all binary values into text by our own inside sources.
Please note, that we don't match source binary types with something similar at destinations. Binary types are very specific for each destination, so it makes sense to store them as text. (As we already do)

@irynakruk
Copy link
Contributor

Based on Andrii's comment, if we come up with a decision to transform all binary values into a text, there are two possible ways to do it:

  • use ResultSet.getString() method instead of ResultSet.getBytes(): in this case we'll rely on custom JDBC driver implementation of each database, but we won't need to worry about a charset;

  • decode ResultSet.getBytes() value on our side, but we'll be responsible for a charset.

I think, with proper tests which should cover binary datatype for all source JDBC connectors, it'd better to move on with the first option.

In case any connector would have an issue with ResultSet.getString() for binary data type, we could add custom transformer with required charset for a particular connector.

@sherifnada
Copy link
Contributor

sherifnada commented Nov 1, 2021

just to make sure we are aligned, the desired behavior of the connector is to output JSONs as follows:

{
  "binaryField": "<base64 representation of the binary data>"
}

this would allow us to use the contentEncoding directive of json schema to convert this back to binary in the destination. We should separately make tickets to:

  1. annotate the type for those fields with contentEncoding=base64 in our json schema, and
  2. make destinations recognize the encoding and write binary columns

So for now, the source should output only base64 string encoding of the binary data, and the destination should be expected to convert it to binary again. @irynakruk @DoNotPanicUA do you think this makes sense?

@DoNotPanicUA DoNotPanicUA assigned mkhokh-33 and unassigned irynakruk Nov 2, 2021
@DoNotPanicUA
Copy link
Contributor Author

I totally agree that it's the best solution, but I'm not sure that we can make it in one dash. I see a few major challenges here:

  • JSON schema for Java primitives needs rework. We found that problem when I was fixing BiqQuery Denormalized Destination. (By the way, @VitaliiMaltsev is working on this improvement for another issue right now)
  • The common transformation for destinations should be pretty simple, but we also should properly map with binary data types and make sure we can handle it at our implementation. At the most optimistic scenario, we need to do an investigation of it.

So, I propose to split this task into three milestones:

  1. Fix the current state and start processing binary values as decoded strings. (As Irina proposed)
  2. Improve our JSON schema logic for the Java sources
  3. Make a soft migration of all destinations from Binary-String to Binary-Binary processing.

@sherifnada
For me, it's hard to estimate what is more important here unblock users or store the data at binary data types. So, we need your final solution here. Shall we focus on fixing the issue and moving to your solution step by step or it's better to focus on the final solution from the begging and fix this issue as part of the big change?

@sherifnada
Copy link
Contributor

@DoNotPanicUA this sounds like a great approach. So we will first make sure the source outputs b64 encoding, then change the JSON schema representation to express that the data is binary, then migrate destinations to start handling this correctly.

I think we should do it step by step like you proposed Andrii.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants