Flink: Don't fail to serialize IcebergSourceSplit when there is too many delete files #9464
Conversation
@@ -147,7 +147,7 @@ byte[] serializeV2() throws IOException {
    for (FileScanTask fileScanTask : fileScanTasks) {
      String taskJson = FileScanTaskParser.toJson(fileScanTask);
-     out.writeUTF(taskJson);
+     writeBytes(out, taskJson);
The reason I don't use out.writeBytes here is that it is buggy. The buffer position is already incremented by 1 here (the actual place where the increment happens) and is then incremented by the whole size once again here, which makes it problematic.
So I took some inspiration from that code and introduced a custom version of it further down in the class.
for (int i = 0; i < s.length(); i++) {
  out.writeByte(s.charAt(i));
}
out.writeByte('\n');
The reason for this is that the deserialize method now uses readLine instead of readUTF (which no longer works here); that is the only way I could think of to still load the tasks one by one in an iterator fashion.
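For illustration, a minimal sketch of the line-based read path described above (names are hypothetical and it uses plain java.io.DataInput rather than Flink's DataInputDeserializer; not the exact PR code). Each task was written as one JSON string terminated by '\n', so the reader can pull the task JSON strings back one line at a time:

import java.io.DataInput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class LineBasedReadSketch {
  static List<String> readTaskJsonLines(DataInput in) throws IOException {
    List<String> taskJsonList = new ArrayList<>();
    String line = in.readLine(); // returns null once the input is exhausted
    while (line != null) {
      taskJsonList.add(line);
      line = in.readLine();
    }
    return taskJsonList;
  }
}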
What happens if we have a comment with \n in it?
Let's take a step back before rushing to a solution. Here are some things we have to solve:
My current ideas:
Thanks for starting the work on this @javrasya!
Good catches @pvary, thank you. What if we take full inspiration from writeUTF and build our own writer that supports longer texts? By the way, the reason it limits the size to 65K max is that the first 2 bytes of the serialized value hold the length of the UTF string, and that is an unsigned short, which can be at most 65K. I have introduced my own writeUTF, called writeLongUTF/readLongUTF. It writes the leading bytes holding the length as an int, which is 4 bytes, instead of an unsigned short. That would be the most compatible (but still not fully compatible) approach with V2. Do you mind taking a look at those changes here and letting me know what you think? I didn't want to update this PR directly without talking to you about it. If you think it is a good idea, I can proceed and merge it on this branch first and we can continue the discussion here. The utilities in the Helper class are not exactly as they are in the Flink source, but very similar. However, it is not compatible with V2, since that uses the initial 2 bytes to indicate the length. Introducing V3 is a good idea, as you suggested, but I am not sure how we would distinguish a split serialized with V2 earlier from one serialized with V3 though 🤔 Do you know how this was done from V1 to V2? Can you help me there?
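For reference, a minimal sketch of the writeLongUTF/readLongUTF wire format being proposed: a 4-byte int length prefix instead of writeUTF's 2-byte unsigned short. This is simplified (standard UTF-8 via String.getBytes instead of writeUTF's modified UTF-8 encoding, and plain java.io interfaces instead of Flink's DataOutputSerializer/DataInputDeserializer), so it illustrates the format only, not the exact PR code:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class LongUtfSketch {
  static void writeLongUTF(DataOutput out, String str) throws IOException {
    byte[] utf = str.getBytes(StandardCharsets.UTF_8);
    out.writeInt(utf.length); // 4-byte length, so payloads over 65,535 bytes are fine
    out.write(utf);
  }

  static String readLongUTF(DataInput in) throws IOException {
    int length = in.readInt();
    byte[] utf = new byte[length];
    in.readFully(utf);
    return new String(utf, StandardCharsets.UTF_8);
  }
}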
Another idea is to still use the original approach but write the UTF data as chunks: main...javrasya:iceberg:issue-9410-introduce-v3-serde-write-utf-as-chunks. It would still break compatibility with V2 though. I don't know how that could be achieved without breaking it, since we don't know the version information in the serialized value of a split 🤔
Forget what I said about not knowing the version; I was looking at the code and saw that the version information is written when the enumerator serializes the data. I guess introducing V3 would be the way to go, so that wouldn't be a problem.
I would prefer creating a well-balanced writeLongUTF solution which would be a candidate to get into the Flink code later. I prefer your solution where you write out the length as an int instead of an unsigned short, but instead of reserving the whole byte array in advance, I would prefer writing out the data in 64k-sized byte arrays (a shorter one at the end). Not sure about the read path. Maybe there is no way to optimize the reserved memory at that point, as we have to keep the Strings in memory anyway... @stevenzwu: any suggestions?
I was also leaning toward that one @pvary. But what benefit do you think writing out the data in 64k-sized byte arrays would bring? DataOutputSerializer from Flink also writes into a byte array under the hood. It even resizes the byte array every time its size is exceeded, which is even more expensive than doing it only once (resizing does an arraycopy into a new, bigger byte array).
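To illustrate the cost being discussed, here is a generic grow-by-copy sketch (not Flink's actual DataOutputSerializer code, just the general pattern): every resize allocates a bigger array and copies the existing bytes over, which is why two resizes are worse than one.

class GrowByCopySketch {
  // Returns a larger buffer containing the old contents; the old array becomes garbage.
  static byte[] grow(byte[] buffer, int minCapacityAdd) {
    int newSize = Math.max(buffer.length * 2, buffer.length + minCapacityAdd);
    byte[] resized = new byte[newSize];
    System.arraycopy(buffer, 0, resized, 0, buffer.length);
    return resized;
  }
}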
I like Peter's suggestion of a proper writeLongUTF.
The split/FileScanTask should only contain ASCII. The other alternative is to get the byte array (UTF-8) from the String. I am not too concerned about the performance if this is only applied to strings longer than 64 KB.
I don't quite agree with the current implementation, as it writes 2 bytes for every char. It will double the size/bytes. We are hurting the common scenarios to address this corner case.
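A tiny illustration of the size concern (hypothetical path string, just to show the difference): UTF-8 stays at one byte per ASCII character, while writing each char as a two-byte value doubles the payload.

import java.nio.charset.StandardCharsets;

class EncodingSizeSketch {
  public static void main(String[] args) {
    String asciiPath = "data/delete-file-0001.parquet";                // 29 ASCII characters
    int utf8Size = asciiPath.getBytes(StandardCharsets.UTF_8).length;  // 29 bytes
    int twoBytesPerChar = asciiPath.length() * 2;                      // 58 bytes
    System.out.println(utf8Size + " bytes vs " + twoBytesPerChar + " bytes");
  }
}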
If I understand correctly, the doc field could still contain non-ASCII characters.
@pvary you are right. I agree that the doc field may contain non-ASCII chars.
Started the discussion on the Flink dev list: https://lists.apache.org/thread/ocm6cj0h8o3wbwo7fz2l1b4odss750rk. Let's see what the community thinks about adding this feature to Flink.
Is the conclusion to wait for them to provide that new utility, or to proceed and improve this PR according to our discussions?
Let's give the community a few more days to decide. Then we could proceed.
Created the FLINK jira: https://issues.apache.org/jira/browse/FLINK-34228
Also, here is the Flink PR: apache/flink#24191. Once the PR has been accepted there, the serialization format is finalized.
Do you want me to proceed and update my PR accordingly, @pvary?
I would wait at least until the PR is merged to the Flink main branch.
@javrasya: Finally the Flink PR got merged. See the related jira. So we have a clear path forward. The Flink serializer will be released only with Flink 1.20.0. In the meantime we can create our own method with the same write format, with a deprecation message saying it should be replaced with the Flink code once that is released. @javrasya: Do you have time to move this forward? Thanks,
Hi @pvary, Good idea. Let me have a look at it and update this PR.
@javrasya wanted to check if you will have time to update the PR this week. If not, I can take over and update this PR accordingly.
if (utflen > Integer.MAX_VALUE - 4) {
  throw new UTFDataFormatException("Encoded string is too long: " + utflen);
}
resize is already happening in the out.writeInt function, so I don't think the following line is needed. The resize function is also private, so we don't have access to it here.
out.writeInt only increases the buffer by 4 bytes to hold the integer.
I think we need the resize call here, but that poses a blocker because DataOutputSerializer#resize(int minCapacityAdd) is a private method. I am not sure this can work by just copying read/writeLongUTF.
if (utflen > Integer.MAX_VALUE - 4) {
  throw new UTFDataFormatException("Encoded string is too long: " + utflen);
} else if (this.position > this.buffer.length - utflen - 2) {
  resize((int) utflen + 4);
}
Wouldn't out.write(bytearr, 0, count);
in writeUTFBytes(out, str, (int) utflen);
call the resize for us?
@Override
public void write(byte[] b, int off, int len) throws IOException {
  if (len < 0 || off > b.length - len) {
    throw new ArrayIndexOutOfBoundsException();
  }
  if (this.position > this.buffer.length - len) {
    resize(len);
  }
  System.arraycopy(b, off, this.buffer, this.position, len);
  this.position += len;
}
@pvary you are right.
I guess the resize call from the Flink PR is not necessary then. This is where @javrasya had the question for this thread, but I guess @javrasya's original comment was not entirely inaccurate.
if (utflen > Integer.MAX_VALUE - 4) {
  throw new UTFDataFormatException("Encoded string is too long: " + utflen);
} else if (this.position > this.buffer.length - utflen - 2) {
  resize((int) utflen + 4);
}
Exactly, as @pvary said, writeUTFBytes will already resize the array as needed. The only downside of not resizing it up front like in the Flink PR is that there could be two resizes: one when writing out.writeInt((int) utflen) and another for out.write(bytearr, 0, count). This won't happen all the time, but when it does, the original array will be copied to a new, resized array twice by the resize function.
As an improvement, maybe I can combine these two by turning writeUTFBytes into writeUTFBytesWithSize and writing everything in one call as out.write(bytearr, 0, count + 4) (the extra 4 bytes are for the size), so that resizing of the actual byte array would be done at most once. Then it would behave exactly as in the Flink PR.
Wdyt @pvary @stevenzwu?
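A hypothetical sketch of the writeUTFBytesWithSize idea described above (names follow the suggestion, simplified to standard UTF-8, not merged code): staging the 4-byte length and the encoded bytes in one array means the underlying buffer is resized at most once by a single write call.

import java.io.DataOutput;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class WriteUtfBytesWithSizeSketch {
  static void writeUTFBytesWithSize(DataOutput out, String str) throws IOException {
    byte[] utf = str.getBytes(StandardCharsets.UTF_8);
    int count = utf.length;
    byte[] bytearr = new byte[count + 4];
    // Big-endian length header, matching what out.writeInt would have written.
    bytearr[0] = (byte) ((count >>> 24) & 0xFF);
    bytearr[1] = (byte) ((count >>> 16) & 0xFF);
    bytearr[2] = (byte) ((count >>> 8) & 0xFF);
    bytearr[3] = (byte) (count & 0xFF);
    System.arraycopy(utf, 0, bytearr, 4, count);
    out.write(bytearr, 0, count + 4); // single write, so at most one downstream resize
  }
}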
As an improvement, maybe I can combine these two by turning writeUTFBytes into writeUTFBytesWithSize and write only once as out.write(bytearr, 0, count + 4) (4 which is for the size) so that resizing of the actual byte array would be done once when needed.
I am not sure that is necessary. With @pvary's explanation, I think the current code is correct.
With the current code, it is possible that it resizes twice due to these two lines. I am not saying it would run incorrectly, but two resize operations would cause two array copies (when the buffer is not big enough), which is worse than one resize. Or am I missing something 🤔?
The early resize in the Flink PR prevents that. We don't have such behavior in this PR for the reasons mentioned earlier. My suggestion above would reduce the number of resize operations to one, since we would write both the UTF length and the UTF bytes in one go.
@javrasya: I would keep your code even with the possibility of the double resize.
This is only temporary code until the Flink code becomes available. If we do not see an actual bottleneck, I would keep the code as simple as possible.
…too many delete files which are created due to an upsert operation on the table and then it is not possible to consume from such a table
…s V3 to allow smooth migration
…tion in the Flink code Ahmet
…duce code duplication
Force-pushed from 88fe53c to 1b5a26f
I've done some more refactorings.
@javrasya: Please fix the failures as well.
Done @pvary, I guess the triggers require an approval to kick in.
LGTM
@javrasya there is still a style check failure:
Execution failed for task ':iceberg-flink:iceberg-flink-1.18:checkstyleTest'.
> A failure occurred while executing org.gradle.api.plugins.quality.internal.CheckstyleAction
> Checkstyle rule violations were found. See the report at: file:///home/runner/work/iceberg/iceberg/flink/v1.18/flink/build/reports/checkstyle/test.html
Checkstyle files with violations: 1
Checkstyle violations by severity: [error:3]
Sorry for the inconvenience, I didn't know how to run the checks locally; now I know, and it is all good on my machine. Pushed the changes. 🙏 @stevenzwu
Thanks for the PR @javrasya, and @stevenzwu for the review! @javrasya: Please port the changes to v1.19 and v1.17. Thanks,
Is that done manually @pvary, or is there a utility in the repo to do that?
…h bigger payload
Co-authored-by: Elkhan Dadashov <[email protected]>
…any delete files (apache#9464)
Co-authored-by: Elkhan Dadashov <[email protected]>
The changes in this PR address the problem described in #9410, which makes it problematic to consume from an Iceberg table that receives regular upserts.
The problem occurs on this line with Flink's DataOutputSerializer when the serialized size of an object exceeds 65 KB, and that limit is easily exceeded for tables that accumulate many delete files due to regular upserts.
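A small standalone reproduction of the underlying limit, using java.io.DataOutputStream, whose writeUTF shares the 65,535-byte restriction that Flink's DataOutputSerializer.writeUTF hits here (the class and string below are illustrative, not taken from the PR):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

class WriteUtfLimitDemo {
  public static void main(String[] args) throws IOException {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 70_000; i++) {
      sb.append('a'); // ASCII chars encode to 1 byte each in modified UTF-8
    }
    try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
      out.writeUTF(sb.toString()); // encoded length 70,000 > 65,535
    } catch (UTFDataFormatException e) {
      System.out.println("Hit the writeUTF limit: " + e.getMessage());
    }
  }
}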