Failed to assign splits due to the serialized split size #9410
@stevenzwu @pvary could you guys take a look, please?
@stevenzwu: After a quick check, I have found this: the split serializer writes the task JSON with DataOutputStream.writeUTF, which encodes the string length as an unsigned short. This means that anything above 64k cannot be serialized by writeUTF. We could use writeBytes (or a plain length-prefixed write) instead.
The upside is that it should work regardless of the size of the string.
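For illustration, here is a minimal, self-contained sketch of the limit being described: DataOutputStream.writeUTF stores the encoded length in an unsigned short, so anything above 65535 bytes fails, while a plain length-prefixed write does not. The class and variable names are illustrative, not Iceberg code.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.nio.charset.StandardCharsets;

public class WriteUtfLimitDemo {
  public static void main(String[] args) throws IOException {
    // A string whose UTF-8 form exceeds 65535 bytes, like a task JSON with many delete files.
    String bigJson = "x".repeat(70_000);

    DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
    try {
      out.writeUTF(bigJson); // fails: writeUTF stores the length in an unsigned short
    } catch (UTFDataFormatException e) {
      System.out.println("writeUTF failed: " + e.getMessage());
    }

    // Length-prefixed alternative: a 4-byte int length followed by raw UTF-8 bytes
    // has no 64 KB ceiling. This is the idea behind the writeBytes suggestion,
    // not Iceberg's actual serializer code.
    byte[] utf8 = bigJson.getBytes(StandardCharsets.UTF_8);
    out.writeInt(utf8.length);
    out.write(utf8);
    System.out.println("length-prefixed write succeeded: " + utf8.length + " bytes");
  }
}
```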
It seems this bug was introduced in version 1.4.0, which is fairly new. I tried fixing it by tweaking the SplitAssignerFactory I pass down to the IcebergSource, but even when I reduce each split to a single FileScanTask, it still exceeds the 65k limit. So I ended up downgrading to 1.3 until it is fixed in 1.4; my app works with Iceberg 1.3.
@javrasya: Any idea what causes the big size? Wide table? Column stats? Long filename?
@pvary No idea to be honest, since I ran into this in production and I can't really go deep and debug there. I wouldn't say it is a wide table in terms of number of columns, and I don't know about column stats. Is there a way to fetch those from somewhere? Here is the metadata I do know about one of the files.
File name:
Schema: [
{
"Name": "__key",
"Type": "string",
"Parameters": {
"iceberg.field.current": "true",
"iceberg.field.id": "1",
"iceberg.field.optional": "true"
}
},
{
"Name": "eventid",
"Type": "string",
"Parameters": {
"iceberg.field.current": "true",
"iceberg.field.id": "2",
"iceberg.field.optional": "false"
}
},
{
"Name": "someIdColumn1",
"Type": "string",
"Parameters": {
"iceberg.field.current": "true",
"iceberg.field.id": "3",
"iceberg.field.optional": "false"
}
},
{
"Name": "someIdColumn2",
"Type": "string",
"Parameters": {
"iceberg.field.current": "true",
"iceberg.field.id": "4",
"iceberg.field.optional": "true"
}
},
{
"Name": "someIdColumn3",
"Type": "string",
"Parameters": {
"iceberg.field.current": "true",
"iceberg.field.id": "5",
"iceberg.field.optional": "false"
}
},
{
"Name": "someIdColumn4",
"Type": "string",
"Parameters": {
"iceberg.field.current": "true",
"iceberg.field.id": "6",
"iceberg.field.optional": "true"
}
},
{
"Name": "someIdColumn5",
"Type": "string",
"Parameters": {
"iceberg.field.current": "true",
"iceberg.field.id": "7",
"iceberg.field.optional": "true"
}
},
{
"Name": "someIdColumn6",
"Type": "string",
"Parameters": {
"iceberg.field.current": "true",
"iceberg.field.id": "8",
"iceberg.field.optional": "true"
}
},
{
"Name": "someIdColumn7",
"Type": "string",
"Parameters": {
"iceberg.field.current": "true",
"iceberg.field.id": "9",
"iceberg.field.optional": "true"
}
},
{
"Name": "someIdColumn8",
"Type": "string",
"Parameters": {
"iceberg.field.current": "true",
"iceberg.field.id": "10",
"iceberg.field.optional": "true"
}
},
{
"Name": "someIdColumn9",
"Type": "string",
"Parameters": {
"iceberg.field.current": "true",
"iceberg.field.id": "11",
"iceberg.field.optional": "true"
}
},
{
"Name": "someIdColumn10",
"Type": "string",
"Parameters": {
"iceberg.field.current": "true",
"iceberg.field.id": "12",
"iceberg.field.optional": "true"
}
},
{
"Name": "country",
"Type": "string",
"Parameters": {
"iceberg.field.current": "true",
"iceberg.field.id": "13",
"iceberg.field.optional": "false"
}
},
{
"Name": "createdat",
"Type": "timestamp",
"Parameters": {
"iceberg.field.current": "true",
"iceberg.field.id": "14",
"iceberg.field.optional": "false"
}
}
]
@javrasya: This table should not be too wide, and the statistics should be limited as well (unless you did some specific tweaking there). My best guess is your first suggestion:
How did you try to achieve this?
@pvary I wasn't aware of that. The way I did it is that I introduced my own SplitAssignerFactory and SplitAssigner and passed them down to the source. Here is the code for that custom SplitAssigner: https://gist.github.com/javrasya/98cfe90bd1a2585c56c4c3346a518477 But even though I managed to reduce the number of tasks per split, it was still big enough to raise the same error, so it did not solve my problem. How do you limit the statistics? I didn't do anything apart from creating my table and ingesting the data into it. Also, do you think a table with 14 columns is too wide?
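For readers following along, here is a minimal sketch of how a custom assigner factory is plugged into the FLIP-27 source. The builder method names (assignerFactory, streaming) are recalled from the 1.4.x API and should be verified against the version in use; the gist above contains the actual custom assigner.

```java
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.source.IcebergSource;
import org.apache.iceberg.flink.source.assigner.SimpleSplitAssignerFactory;

public class SourceWiring {
  // Builds a batch IcebergSource with an explicit split assigner factory.
  // Swap SimpleSplitAssignerFactory for a custom factory such as the one in the gist.
  static IcebergSource<RowData> buildSource(TableLoader tableLoader) {
    return IcebergSource.forRowData()
        .tableLoader(tableLoader)
        .assignerFactory(new SimpleSplitAssignerFactory())
        .streaming(false)
        .build();
  }
}
```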
@javrasya: A table with 14 columns should not cause any issues, and the default stats could not cause issues either. I made a mistake reading the code: combined splits also could not cause any issues, as we serialize the tasks one by one in a loop, so the problem must be with one of them. My current theory is that we need to check the delete files.
Could it be that you have multiple deletes for the specific split, which makes the serialized split too big? See iceberg/core/src/main/java/org/apache/iceberg/FileScanTaskParser.java, lines 69 to 75 at 2101ac2.
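As a way to test that theory, here is a small diagnostic sketch (assuming FileScanTaskParser.toJson is accessible in 1.4.x, as the linked lines suggest) that reports how many delete files a single task carries and how large its JSON form is:

```java
import java.nio.charset.StandardCharsets;
import org.apache.iceberg.FileContent;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.FileScanTaskParser;

public class TaskInspector {
  // Prints the delete-file breakdown and serialized JSON size of one FileScanTask.
  // 65535 bytes is the writeUTF ceiling discussed above.
  static void inspect(FileScanTask task) {
    long equalityDeletes = task.deletes().stream()
        .filter(d -> d.content() == FileContent.EQUALITY_DELETES)
        .count();
    long positionDeletes = task.deletes().size() - equalityDeletes;
    int jsonBytes = FileScanTaskParser.toJson(task).getBytes(StandardCharsets.UTF_8).length;

    System.out.printf("file=%s equalityDeletes=%d positionDeletes=%d jsonBytes=%d (limit 65535)%n",
        task.file().path(), equalityDeletes, positionDeletes, jsonBytes);
  }
}
```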
We don't do any deletes, actually. I will try to debug it locally somehow on that single file it was failing on to see why it is so big. But regardless, what do you think the remedy would be in such a case? Even if it were deletes, should we avoid deletes because they break the downstream?
@javrasya: I think we should fix the serialization issue, but I would like to understand the root cause before jumping to solutions. The deletes seem to be an issue anyway, which should be fixed, but if you do not have any deletes in the table, then there must be another source of the issue too. We have to identify and fix that as well.
Hi @pvary, I failed to debug it locally. I couldn't reproduce it, since it is one file out of many, it takes time to reach it, and I can't really do that in a debug session. We don't do deletes, but recently I have done the following:
Do you think any of these operations would leave delete files behind?
@javrasya: I do not see yet how any of the above changes could create delete files. For debugging: could you use conditional breakpoints, or put a breakpoint where the exception is thrown?
I couldn't do this, @pvary. The split is far ahead, and some time is needed to get there in the application. My local environment is not able to run the app on real data and hit this problematic split in debug mode; it crashes because debug mode is quite resource-consuming. I am still looking for a way to reproduce and debug it, but I just wanted to let you know the status. (Side note: I ended up re-creating my entire source data, and that way it did not have this problem. But I still want to find out why it exceeds that limit.)
Hi again @pvary. I managed to run it in debug mode, and the JSON which is being deserialized is crammed with delete files (equality deletes).
Another question: is there any configuration available right now to skip those delete files, so that I don't have to wait for you guys to fix it?
@javrasya: A few things:
One way to get rid of the deletes is compaction. Currently I think only Spark does a full compaction where the deletes are removed as well.
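A minimal sketch of that Spark compaction route follows (the catalog and table names "my_catalog" and "db.events" are placeholders); note that this procedure targets position delete files specifically:

```java
import org.apache.spark.sql.SparkSession;

public class CompactPositionDeletes {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-delete-compaction")
        .getOrCreate();

    // Rewrites position delete files; it does not remove equality deletes.
    spark.sql(
        "CALL my_catalog.system.rewrite_position_delete_files(table => 'db.events')")
        .show();
  }
}
```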
Thanks for your input. I tried rewrite_position_delete_files but no luck.
First of all, the current Flink Iceberg source (FLIP-27 or the old one) doesn't support streaming reads with row-level deletes; it only reads append-only snapshots/commits.
No, that is not possible today, and it wouldn't be correct either. Regarding the usage of
Thanks for the answer @stevenzwu; you are right, I know we shouldn't do streaming reads on an Iceberg table unless it is append-only. But we don't stream in this case. Every day we need to scan the table entirely up to a certain snapshot id (we are using asOfSnapshotId, not endSnapshotId) to produce what we need. No checkpointing, no savepoints either.
It still feels weird to allow a split that big to be created. Wouldn't it be possible to make the delete files lazy and load them in the respective task node rather than on the coordinator node? It is a network cost in the cluster, and it will keep growing for upsert-style tables, which end up with so many EQUALITY_DELETES, or am I interpreting it wrongly 🤔? The delete files in the split seem to be the only thing that can bloat its size.
Ah, I didn't know it is a batch read mode using asOfSnapshotId. The problem is that an equality delete file can be associated with many data files; that is probably why you are seeing many of them in one split. That is an unfortunate implication of equality deletes, and skipping those delete files won't be correct. The delete compaction that was suggested earlier should help. Did you use Spark for that? Spark batch should generate position deletes, which are easier for the read path. Regardless, I would agree with @pvary's suggestion of writeBytes to fix the 64 KB size limit. Curious how many delete files you saw in one split/data file?
I agree, not having that limit would be better in any case. I see. Yes, I did use Spark (rewrite_position_delete_files) to clean up positional deletes. Maybe it helped, because I have very few positional deletes afterwards, but I still have too many equality deletes, which still causes the limit to be exceeded. Did I use the wrong Spark procedure? I also expired many snapshots, which I hoped would help with compaction, but still no luck.
I have the changes, and a test which makes use of some mocking/spying, working locally. It is actually not that trivial by using
…too many delete files which are created due to an upsert operation on the table, and then it is not possible to consume from such a table
I took the liberty of creating the PR since I had the changes locally. Hope you guys don't mind 🙏
I see. Spark delete compaction only handles position deletes (not equality deletes). That behavior makes sense, because Spark only writes position deletes. I see there is an
I think RewriteDataFilesAction could help you there. If I remember correctly, it will apply all the delete files on reading, so in theory, after a compaction, you should not need the delete files anymore. That said, I haven't tried it myself, so this should be checked.
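Here is a sketch of that suggestion via the Spark procedure form of RewriteDataFiles. The delete-file-threshold option asks the rewrite to pick up any data file with at least that many associated deletes, so the rewritten files no longer reference them. The catalog and table names are placeholders, and as noted above this is untested advice:

```java
import org.apache.spark.sql.SparkSession;

public class CompactDataFiles {
  // Rewrites data files so that the rewritten files no longer carry references to
  // the old (equality or position) delete files. A following snapshot expiry
  // can then physically remove the unreferenced delete files.
  static void rewrite(SparkSession spark) {
    spark.sql(
        "CALL my_catalog.system.rewrite_data_files("
            + "table => 'db.events', "
            + "options => map('delete-file-threshold', '1'))")
        .show();
  }
}
```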
Thank you both. You are right @stevenzwu, it is sad that there is no implementation yet for
…f serializing/deserializing and it is the most backward compatible, but supports texts longer than 65k. This introduces a breaking change: splits serialized before this change cannot be deserialized, because it uses an int (4 bytes) instead of an unsigned short (2 bytes) as the first bytes indicating the length of the serialized text
Tried rewrite_data_files via Spark; not really sure if it would do the same with
Any other suggestion is appreciated.
…n can be due to the size of an unsigned short (65535 bytes). The text is broken down into chunks that always fit within that limit and written to the buffer iteratively
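For readers skimming the linked commits, this is roughly the serialization idea they describe: prefix the UTF-8 bytes with a 4-byte int length (rather than writeUTF's unsigned short), optionally writing the text in chunks. The helper below is an illustrative sketch, not the code that landed in Iceberg:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class LongUtf {
  private LongUtf() {}

  // Writes the full byte length as an int, then the raw UTF-8 bytes,
  // so strings larger than 65535 bytes round-trip.
  static void write(DataOutput out, String value) throws IOException {
    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  static String read(DataInput in) throws IOException {
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }
}
```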
@javrasya: The Spark RewriteDataFiles should create a new snapshot in the table. If the query reads this new snapshot, then it should not read the old delete files anymore. If ExpireSnapshots is used, then sooner or later these data files will be removed, as nobody will reference them anymore.
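A sketch of that snapshot-expiry step via the Spark procedure (the catalog, table, and retention timestamp are placeholders):

```java
import org.apache.spark.sql.SparkSession;

public class ExpireOldSnapshots {
  // Drops snapshots older than the given timestamp; files (including delete files)
  // that are no longer referenced by any remaining snapshot become eligible for removal.
  static void expire(SparkSession spark) {
    spark.sql(
        "CALL my_catalog.system.expire_snapshots("
            + "table => 'db.events', "
            + "older_than => TIMESTAMP '2024-01-01 00:00:00')")
        .show();
  }
}
```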
@pvary, you are right. It is all immutable, so it makes sense that the rewrite operation creates another snapshot and that I should be reading that one, not a prior one. Once I refer to the new snapshot, it should no longer pick up the old delete files. Thank you both, it is much appreciated. @pvary @stevenzwu
…s V3 to allow smooth migration
…tion in the Flink code
…duce code duplication
Apache Iceberg version
1.4.2 (latest release)
Query engine
Flink
Please describe the bug 🐞
Hi there, I am trying to consume records from an Iceberg table in my Flink application and I am running into the following issue:
Not really sure why it gets too big, but when I looked at the source code here, it might be because there are too many file scan tasks in one split, and that is why this is happening.
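One way to check that speculation without running the Flink job is to plan the scan with the core API and measure each task's JSON size directly. This is a sketch, assuming FileScanTaskParser.toJson is accessible in your version; how the Table instance is loaded (catalog, identifier) is environment-specific and omitted here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.FileScanTaskParser;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

public class OversizedTaskFinder {
  // Prints every file scan task whose JSON form exceeds writeUTF's 65535-byte limit,
  // along with its data file path and delete-file count.
  static void findOversizedTasks(Table table) throws IOException {
    try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
      for (FileScanTask task : tasks) {
        int jsonBytes = FileScanTaskParser.toJson(task).getBytes(StandardCharsets.UTF_8).length;
        if (jsonBytes > 65535) {
          System.out.printf("%s: %d bytes, %d delete files%n",
              task.file().path(), jsonBytes, task.deletes().size());
        }
      }
    }
  }
}
```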