Core: Only validate the current partition specs #5707
Conversation
```java
if (unboundSpec.specId() == defaultSpecId) {
  builder.add(unboundSpec.bind(schema));
```
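For context, a hedged sketch of the loop this hunk lives in (the surrounding code is paraphrased, and `bindUnchecked` is a hypothetical helper named here only for illustration): the idea is that only the current/default spec gets full validation against the schema, while historical specs are not re-validated because they may reference columns that have since been dropped.

```java
// Paraphrased sketch; bindUnchecked is hypothetical, not a confirmed API.
for (UnboundPartitionSpec unboundSpec : unboundSpecs) {
  if (unboundSpec.specId() == defaultSpecId) {
    // the current spec must bind cleanly against the current schema
    builder.add(unboundSpec.bind(schema));
  } else {
    // historical specs may reference dropped columns, so skip validation
    builder.add(unboundSpec.bindUnchecked(schema));
  }
}
```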
[doubt] As per my understanding, this might not work with the v1 partition spec: we still have a void transform for dropped partitions, which holds a reference to the dropped source field. Should we update `PartitionSpec#checkCompatibility` to handle void transforms? Thoughts?
Also, should we add a unit test to `TableMetadataParserTest`?
@singhpk234 Can you elaborate a bit more? I'm not sure that I understand the issue. For V1, nothing changes when using `spec` instead of `specs`, since it takes the other branch:
iceberg/core/src/main/java/org/apache/iceberg/TableMetadataParser.java, lines 393 to 402 in 24d5a53:
```java
Preconditions.checkArgument(
    formatVersion == 1, "%s must exist in format v%s", PARTITION_SPECS, formatVersion);
// partition spec is required for older readers, but is always set to the default if the spec
// array is set. it is only used to default the spec map is missing, indicating that the
// table metadata was written by an older writer.
defaultSpecId = TableMetadata.INITIAL_SPEC_ID;
specs =
    ImmutableList.of(
        PartitionSpecParser.fromJsonFields(
            schema, TableMetadata.INITIAL_SPEC_ID, JsonUtil.get(PARTITION_SPEC, node)));
```
Apologies for not being clear. What I meant was that mapping the current schema to the current spec might still cause an issue for tables in the V1 format. Considering the UT in this PR (the last DDL), the current spec will have a void transform for `day_of_ts`, since `day_of_ts` was dropped from partitioning (in the V1 format, we replace the dropped transform with a void transform). Now, when we attempt to bind the current partition spec to the current schema, the current schema will not have `day_of_ts`, but `PartitionSpec#checkCompatibility` will call `schema.findType(field.sourceId())` on the current schema for the void transform of `day_of_ts`, and it will fail because we have removed `day_of_ts` from the current schema.
A sample UT to reproduce (modified from this PR):
```java
@Test
public void testDropColumnOfOldPartitionField() {
  // default table created in v1 format
  sql(
      "CREATE TABLE %s (id bigint NOT NULL, ts timestamp, day_of_ts date) USING iceberg PARTITIONED BY (day_of_ts)",
      tableName);
  sql("ALTER TABLE %s REPLACE PARTITION FIELD day_of_ts WITH days(ts)", tableName);
  sql("ALTER TABLE %s DROP COLUMN day_of_ts", tableName);
}
```
@singhpk234 Thanks for the thorough explanation. I wasn't aware of the replacement by a void transform, and it indeed has a similar issue.
Thanks for raising this; let me think of a solution. Do you know the historical reason for replacing it with a void transform?
@Fokko, this explanation from @rdblue nicely explains the motivation behind introducing void transforms in the v1 format.
Thanks for linking, @singhpk234; that is indeed a very clear explanation and it makes a lot of sense.
I've added a condition to skip the validation when we encounter a `VoidTransform`. Since it is almost always compatible, I think we should be okay with that, but I'm curious to hear what others think.
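A minimal sketch of such a skip condition inside `PartitionSpec#checkCompatibility` (not the exact PR diff; detecting the void transform via its `"void"` string form is an assumption made for illustration):

```java
// Sketch only, not the exact PR change. Iterates the spec's fields and skips
// the source-column lookup when the transform is a void transform.
for (PartitionField field : spec.fields()) {
  if ("void".equals(field.transform().toString())) {
    // a void transform always produces null, so it stays compatible even
    // after its source column has been dropped from the schema
    continue;
  }
  Type sourceType = schema.findType(field.sourceId());
  ValidationException.check(
      sourceType != null, "Cannot find source column for partition field: %s", field);
}
```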
Force-pushed from 24d5a53 to 4f7db1f:
If a field is being deleted that used to be part of a partition spec, that will throw an error because it cannot resolve the field anymore. Closes apache#5676 Closes apache#5707 Closes apache#5399
Force-pushed from 4f7db1f to 2844525.
```java
public void testDropColumnOfOldPartitionFieldV1() {
  // default table created in v1 format
  sql(
      "CREATE TABLE %s (id bigint NOT NULL, ts timestamp, day_of_ts date) USING iceberg PARTITIONED BY (day_of_ts)",
```
Shall we explicitly set `format-version=1`, in case the default changes in the future?
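A minimal sketch of that change, assuming the same `sql()` helper as the surrounding test; Iceberg's Spark DDL accepts the format version as a table property:

```java
sql(
    "CREATE TABLE %s (id bigint NOT NULL, ts timestamp, day_of_ts date) "
        + "USING iceberg "
        + "PARTITIONED BY (day_of_ts) "
        + "TBLPROPERTIES ('format-version'='1')", // pin v1 explicitly
    tableName);
```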
Great suggestion, added! 👍🏻
@Fokko, it doesn't look like this change was pushed?
I was working on a follow-up to fix the other issue in the read path. I held off on pushing until I had a proper fix, but it turns out to be a bit more complicated than I originally anticipated. I've created a new PR: #5907
```java
sql("ALTER TABLE %s REPLACE PARTITION FIELD day_of_ts WITH days(ts)", tableName);
sql("ALTER TABLE %s DROP COLUMN day_of_ts", tableName);
```
Can we also include a SQL statement to read back the data from the table after the drop? We have noticed the same issue in Trino, but for us the drop column still succeeds; it's the subsequent read operations that start failing.
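A hedged sketch of such a read-back check appended to the repro test, assuming the surrounding Spark test base where `sql(...)` returns `List<Object[]>`; the inserted row is illustrative:

```java
// seed one row so the post-drop read has data to scan
sql("INSERT INTO %s VALUES (1, timestamp '2022-01-01 10:00:00', date '2022-01-01')", tableName);
sql("ALTER TABLE %s REPLACE PARTITION FIELD day_of_ts WITH days(ts)", tableName);
sql("ALTER TABLE %s DROP COLUMN day_of_ts", tableName);

// this is the kind of read that reportedly starts failing in Trino
List<Object[]> rows = sql("SELECT id, ts FROM %s", tableName);
Assert.assertEquals("Should read back the row after dropping day_of_ts", 1, rows.size());
```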
Thanks @marton-bod, the `ALTER TABLE` threw an exception before, but you are right that subsequent reads are also failing. Digging into this.
Hmm, interesting. This is what I assumed would happen, and why I went with a spec change (a disruptive change).
@marton-bod: I have tagged you in one of the Slack discussions where I proposed a spec change to handle this.
If Fokko doesn't find an easy way to fix this, we can discuss my approach after his experiment.
```java
@@ -421,6 +421,31 @@ public void testSparkTableAddDropPartitions() throws Exception {
        "spark table partition should be empty", 0, sparkTable().partitioning().length);
  }

  @Test
  public void testDropColumnOfOldPartitionFieldV1() {
```
I don't think this is the right place to test this. There are partition spec evolution tests in core, which is a much better place than here.
I think that this looks good as a first step. We should also add a test or two to exercise the read path and see if anything needs to be fixed there.
Thanks for fixing this, @Fokko! I like this version of the fix.
After this change, while you can drop partition columns, such tables become broken if they hadn't been optimized before dropping the column. Basically, see #4563
If a field is being deleted that used to be part of a partition spec, that will throw an error because it cannot resolve the field anymore. See the issue for more details.
Closes #5676