MySql source : source binary node transforms to text node #5878

DoNotPanicUA · 2021-09-07T12:43:37Z

Found during work on #3932

Current Behavior

Initial investigation showed that the data type test gets airbyte messages with binary values as text.
As a result, check isBinary returns false and the value contains useless data (not even string version).

Note! The issue is not investigated fully. There might be a global issue of binary values.

Expected Behavior

Binary values in Airbyte message data should be in Binnary nodes for proper processing.

The text was updated successfully, but these errors were encountered:

irynakruk · 2021-10-27T14:55:34Z

The issue with BinaryNode is general for all source connectors: after formatting binary data to the Airbyte record, data is stored in BinaryNode, but after Jsons.serialize(object) -> OBJECT_MAPPER.writeValueAsString(object) (READ command) records are transformed to a String, so all BinaryNodes are "lost" and binary data is stored as String.
Deserialization: Jsons.deserialize(jsonString) -> OBJECT_MAPPER.readTree(jsonString).

sherifnada · 2021-10-27T20:00:15Z

@irynakruk the source should output the binary data as a base64 or hex-encoded string. Can you elaborate on what's needed, code-wise, to make this change? what's the scope/subtasks?

irynakruk · 2021-10-28T13:15:43Z

@sherifnada I'll investigate and come with an update

irynakruk · 2021-10-28T18:49:35Z

@sherifnada in current implementation binary data is read using JdbcSourceOperations.putBinary()

protected void putBinary(final ObjectNode node, final String columnName, final ResultSet resultSet, final int index) throws SQLException {
    node.put(columnName, resultSet.getBytes(index));
  }

The node, which is inserted, is of type BinaryNode.

So, e.x. we have 'test' value in MySql VARBINARY column, created BinaryNode after read contains
value = "dGVzdA==", data = [116, 101, 115, 116]. So value is already an encoded Base64 string.

During serialization it is passed as "dGVzdA==" String value and is inserted to a destination as String value without any additional conversion.

Is this an expected behavior based on your question? Or maybe I misunderstood you?

sherifnada · 2021-10-28T19:31:17Z

I think encoding it as b64 is reasonable behavior for the moment. Let's add a note to the data type mappings in the docs and call it good I think. @DoNotPanicUA do you have any thoughts on this?

DoNotPanicUA · 2021-10-29T14:44:42Z

@sherifnada,
At this moment we actually lost original data at all. I think it definitely should be changed here.
What actually happens in the middle of our dataflow:

A source reads binary data type and puts it as Binary Node. For example, the original text is test.
Inside the Airbite core we do a serialization/deserialization of the message which transforms the Binary node into a Text node.
Destination gets a message with the Text node and puts it into the destination place. But instead of the original text test we get the encoded value dGVzdA== in the end.

Note! Of course, the user can decode the values after all, but it doesn't sound like a convenient tool :)

In my humble opinion, we should transform all binary values into text by our own inside sources.
Please note, that we don't match source binary types with something similar at destinations. Binary types are very specific for each destination, so it makes sense to store them as text. (As we already do)

irynakruk · 2021-10-29T15:35:22Z

Based on Andrii's comment, if we come up with a decision to transform all binary values into a text, there are two possible ways to do it:

use ResultSet.getString() method instead of ResultSet.getBytes(): in this case we'll rely on custom JDBC driver implementation of each database, but we won't need to worry about a charset;
decode ResultSet.getBytes() value on our side, but we'll be responsible for a charset.

I think, with proper tests which should cover binary datatype for all source JDBC connectors, it'd better to move on with the first option.

In case any connector would have an issue with ResultSet.getString() for binary data type, we could add custom transformer with required charset for a particular connector.

sherifnada · 2021-11-01T21:34:44Z

just to make sure we are aligned, the desired behavior of the connector is to output JSONs as follows:

{
  "binaryField": "<base64 representation of the binary data>"
}

this would allow us to use the contentEncoding directive of json schema to convert this back to binary in the destination. We should separately make tickets to:

annotate the type for those fields with contentEncoding=base64 in our json schema, and
make destinations recognize the encoding and write binary columns

So for now, the source should output only base64 string encoding of the binary data, and the destination should be expected to convert it to binary again. @irynakruk @DoNotPanicUA do you think this makes sense?

DoNotPanicUA · 2021-11-02T16:54:57Z

I totally agree that it's the best solution, but I'm not sure that we can make it in one dash. I see a few major challenges here:

JSON schema for Java primitives needs rework. We found that problem when I was fixing BiqQuery Denormalized Destination. (By the way, @VitaliiMaltsev is working on this improvement for another issue right now)
The common transformation for destinations should be pretty simple, but we also should properly map with binary data types and make sure we can handle it at our implementation. At the most optimistic scenario, we need to do an investigation of it.

So, I propose to split this task into three milestones:

Fix the current state and start processing binary values as decoded strings. (As Irina proposed)
Improve our JSON schema logic for the Java sources
Make a soft migration of all destinations from Binary-String to Binary-Binary processing.

@sherifnada
For me, it's hard to estimate what is more important here unblock users or store the data at binary data types. So, we need your final solution here. Shall we focus on fixing the issue and moving to your solution step by step or it's better to focus on the final solution from the begging and fix this issue as part of the big change?

sherifnada · 2021-11-05T18:31:52Z

@DoNotPanicUA this sounds like a great approach. So we will first make sure the source outputs b64 encoding, then change the JSON schema representation to express that the data is binary, then migrate destinations to start handling this correctly.

I think we should do it step by step like you proposed Andrii.

DoNotPanicUA added the type/bug Something isn't working label Sep 7, 2021

sherifnada added area/connectors Connector related issues lang/java labels Sep 10, 2021

DoNotPanicUA mentioned this issue Sep 21, 2021

MySql Source: Fix data processing #6093

Merged

38 tasks

alexandr-shegeda mentioned this issue Oct 13, 2021

[EPIC] Support all possible data-types for database sources #6994

Closed

10 tasks

irynakruk self-assigned this Oct 21, 2021

DoNotPanicUA assigned mkhokh-33 and unassigned irynakruk Nov 2, 2021

tuliren mentioned this issue Nov 3, 2021

Aggregate all existing database issues #7600

Closed

mkhokh-33 linked a pull request Nov 17, 2021 that will close this issue

🐛 Source MySQL: transform binary data base64 format #8047

Merged

40 tasks

mkhokh-33 closed this as completed in #8047 Nov 23, 2021

mkhokh-33 mentioned this issue Nov 23, 2021

Create end to end tests that check that binary data is passed correctly from GA DB sources to GA destinations #8210

Closed

igrankova added connectors/sources-database connectors/source/mysql labels Jan 12, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MySql source : source binary node transforms to text node #5878

MySql source : source binary node transforms to text node #5878

DoNotPanicUA commented Sep 7, 2021 •

edited

Loading

irynakruk commented Oct 27, 2021

sherifnada commented Oct 27, 2021

irynakruk commented Oct 28, 2021

irynakruk commented Oct 28, 2021

sherifnada commented Oct 28, 2021

DoNotPanicUA commented Oct 29, 2021 •

edited

Loading

irynakruk commented Oct 29, 2021

sherifnada commented Nov 1, 2021 •

edited

Loading

DoNotPanicUA commented Nov 2, 2021

sherifnada commented Nov 5, 2021

MySql source : source binary node transforms to text node #5878

MySql source : source binary node transforms to text node #5878

Comments

DoNotPanicUA commented Sep 7, 2021 • edited Loading

Current Behavior

Expected Behavior

irynakruk commented Oct 27, 2021

sherifnada commented Oct 27, 2021

irynakruk commented Oct 28, 2021

irynakruk commented Oct 28, 2021

sherifnada commented Oct 28, 2021

DoNotPanicUA commented Oct 29, 2021 • edited Loading

irynakruk commented Oct 29, 2021

sherifnada commented Nov 1, 2021 • edited Loading

DoNotPanicUA commented Nov 2, 2021

sherifnada commented Nov 5, 2021

DoNotPanicUA commented Sep 7, 2021 •

edited

Loading

DoNotPanicUA commented Oct 29, 2021 •

edited

Loading

sherifnada commented Nov 1, 2021 •

edited

Loading