Inconsistent row count across versions #1132
Comments
Nothing too exotic in the
|
Hi @dev-goyal thank you for raising this issue, that looks like a critical issue we want to resolve asap. Based on what you mentioned here:
Is it correct to assume that this is a table with position deletes? We've previously observed a correctness issue when reading tables with position deletes. That issue has since been resolved, but there may be more edge cases that still need to be handled here. I think the best approach is to understand more about the characteristics of this specific table that trigger the problem, and then replicate the issue in a controlled test sample. |
No position deletes, it has equality deletes! Exact delete query:
|
Hi @dev-goyal - thank you for sharing the query! I don't think that query means that the table is using equality deletes. I'm not an expert on delete files, but according to this mailing list thread, the community seems to be under the impression that equality delete markers are only produced by Flink: https://lists.apache.org/thread/3s36xkmgj01996mkmg0lqxw2k5lhlxf4 If you are deleting records using Spark SQL, and if you are using Merge-On-Read mode for deletion on your table property ( |
Thanks @sungwy, that makes sense to me - I am indeed using MOR (version 2). Let me know how else I might be able to help. |
I'm on @dev-goyal's team. We're reverting to 0.6.1 in the interim as it doesn't seem to suffer from this bug. It's difficult to construct a minimal reproducible example, but let us know if there are any tests we can run on our side that would be useful for you. A little more info: we run a daily ETL which adds rows to our Iceberg table, and then we have a separate job where PyIceberg consumes some of that data. On some days PyIceberg gets the count right, while periodically (~2x per week) it misses many rows, as Dev showed in his original message. (On all days, Athena gets the count correct though.) We've not been able to determine what makes the bad days special, nor has a diff of the data between days shown any obvious pattern in which rows are missing. |
@daturkel and @dev-goyal - I really appreciate you both being so patient regarding this issue, and keeping this open line of communication as we debug this issue together.
Yes, that's unfortunately what I'm having issues with as well. I'm trying to create some tests to replicate this issue with little luck - here's a draft PR where I'm investigating the issue for reference: #1141
Yes, that would be incredibly helpful. Here are some experiments I think will help us localize the problem in no particular order...
If this specific table scan only has 1 task / file associated with it (i.e. If you feel it would be helpful to debug this issue together, feel free to send me a message on the Iceberg Slack channel! It may be helpful to take a look at your specific table together. |
Good to know that Athena gets the correct count. From that, we can assume that the table state is correct, i.e. the catalog state, the metadata, and the data files. This means the culprit is likely in how PyIceberg reads the underlying table. I also noticed that the "wrong count" (v0.7.1) is smaller than the "right count" (v0.6.1). |
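If the table metadata is trusted, it also gives a rough sanity range for the count. The sketch below is only an illustration, not something from PyIceberg itself: the function names are hypothetical, and it assumes a full, unfiltered scan of a table that uses position deletes only, where each position-delete record removes at most one row. The inputs are the per-file record counts taken from the data files and from the distinct delete files in the scan plan.

```python
def row_count_bounds(data_file_counts, position_delete_counts):
    """Return (lower, upper) bounds on the row count of an unfiltered scan.

    Assumption: only position deletes are present, so every delete record
    removes at most one row from the data files.
    """
    upper = sum(data_file_counts)
    lower = max(0, upper - sum(position_delete_counts))
    return lower, upper


def classify_count(observed, data_file_counts, position_delete_counts):
    """Label an observed row count relative to the metadata-derived bounds."""
    lower, upper = row_count_bounds(data_file_counts, position_delete_counts)
    if observed < lower:
        return "below lower bound: the reader is dropping rows"
    if observed > upper:
        return "above upper bound: the reader is duplicating rows"
    return "within bounds"


# Made-up numbers for illustration: two data files, one delete file.
print(row_count_bounds([100, 50], [10]))
print(classify_count(120, [100, 50], [10]))
```

A count inside the range is not proof of correctness, but a count below the lower bound would point at the read path rather than at the table state.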
Hi @dev-goyal and @daturkel are you still seeing the same issues? |
Hi @sungwy and @kevinjqliu , we are still having this issue (we've reverted to v0.6.1 to avoid it) but have been a bit busy leading up to a launch so haven't been able to troubleshoot as much. In a week or so when things cool down we would love to troubleshoot some more. To answer one question though, the wrong count was definitely always smaller than the right count! |
Hi @daturkel and @dev-goyal I was finally able to find the root cause and put up a fix for this issue on this PR: #1141. Would you be able to install from the branch and confirm if it fixes your issue? Sung |
Hi @sungwy absolutely - give me a couple days please, but I will prioritize testing this ASAP. Thank you so much for prioritizing the fix, we much appreciate it! |
Apache Iceberg version
0.7.1 (latest release)
Please describe the bug 🐞
Noticing some fairly weird behaviour with PyIceberg: with the exact same code run against different versions of the library, we're seeing different row counts returned. We have cross-checked with Athena and can confirm that the 0.6.1 count is the correct one. Any ideas on where to look when debugging this?
Can confirm that the output of .plan_files() and the delete_files are identical across the two versions.
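One way to make that comparison repeatable is to reduce each plan to a sorted, order-independent summary and diff the summaries between environments. This is a minimal sketch under assumptions: `summarize_plan` and `fake_task` are hypothetical helpers, and the attribute names (`task.file.file_path`, `task.file.record_count`, `task.delete_files`) follow PyIceberg's scan tasks as I understand them. The stand-in objects are used only so the example runs without a catalog.

```python
from types import SimpleNamespace


def summarize_plan(tasks):
    """Reduce planned scan tasks to a sorted list of comparable tuples."""
    summary = []
    for task in tasks:
        # Sort delete-file paths so set ordering does not affect the result.
        delete_paths = tuple(sorted(d.file_path for d in task.delete_files))
        summary.append((task.file.file_path, task.file.record_count, delete_paths))
    return sorted(summary)


def fake_task(path, records, delete_paths=()):
    """Hypothetical stand-in for a planned task, for demonstration only."""
    return SimpleNamespace(
        file=SimpleNamespace(file_path=path, record_count=records),
        delete_files=[SimpleNamespace(file_path=p) for p in delete_paths],
    )


# Made-up paths; the same plan in two different orders should compare equal.
plan_a = [
    fake_task("s3://bucket/a.parquet", 100, ["s3://bucket/d1.parquet"]),
    fake_task("s3://bucket/b.parquet", 50),
]
plan_b = list(reversed(plan_a))
print(summarize_plan(plan_a) == summarize_plan(plan_b))
```

With a real table, the input would be something like `summarize_plan(table.scan().plan_files())`, run once per installed version, with the two results written to disk and diffed.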