TableMetadataBuilder #587
ToDos:
|
Fixes #232 |
Thanks @c-thiel for this PR. I've skimmed through it and it looks great to me. However, this PR is too large to review (3k lines); would you mind splitting it into smaller ones? For example, we can add one PR for methods involved in one |
Thanks for your feedback @liurenjie1024. This isn't really a refactoring of the builder; it's more of a complete rewrite. The old builder allowed creating corrupt metadata in various ways. Splitting it up by I would currently prefer to keep it as a larger block mainly because:
We now have a vision of what it could look like in the end. Before putting any more effort in, we should answer the following questions:
Those points might change the overall design quite a bit and might require a re-write of After we have answered those questions, and if we still think splitting makes sense, I can try to find time to build stacked PRs. Maybe just splitting normalization / validation in |
@liurenjie1024 I tried to cut a few things out - but not along the lines of
After they are all merged, I'll rebase this PR for the actual builder. |
Hi @c-thiel, sorry for the late reply.
I've gone through the new builder and I think your design is the right direction.
To be honest, I don't quite understand the use case. We can ask for background on this in the dev channel, but I don't think this is a blocker for this PR; we can always add it later.
I've taken a look at the comments on these two PRs: apache/iceberg#6701 apache/iceberg#7445 And I think the reason behavior is the |
I agree that this should be required, as I mentioned in #550 |
That sounds reasonable to me. If one PR per table update is too much of a burden, could we split them by component, for example sort order, partition spec, and schema changes? |
@liurenjie1024 thanks for the feedback!
The problem with changing it later is that it changes the semantics of the function. Right now we expect source_id to match the In my opinion ids are much cleaner than names (we might have dropped and re-added a column with the same name in the meantime), so I am OK with going forward. However, moving over to Java semantics will require new endpoints (i.e. Give me a thumbs up if that's OK for you. I'll also open a discussion in the dev channel to get some more opinions. |
To be honest, I don't think we should add the argument. My reasoning is as follows: Maybe @nastra or @Fokko could add some comments on the intention of that parameter? |
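To make the ids-versus-names point concrete, here is a minimal sketch. The `Field` struct and the `resolve_by_id`, `resolve_by_name` and `schema_after_readd` helpers are hypothetical and not part of the iceberg-rust API; they only model a column that was dropped and later re-added under the same name, which is the case mentioned above.

```rust
/// Hypothetical, simplified schema field; not the iceberg-rust type.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    id: i32,
    name: String,
}

/// Look a field up by its stable id.
fn resolve_by_id(schema: &[Field], source_id: i32) -> Option<&Field> {
    schema.iter().find(|f| f.id == source_id)
}

/// Look a field up by its (possibly re-used) name.
fn resolve_by_name<'a>(schema: &'a [Field], name: &str) -> Option<&'a Field> {
    schema.iter().find(|f| f.name == name)
}

/// Assumed history: column "ts" originally had id 2, was dropped,
/// and was later re-added with the fresh id 3.
fn schema_after_readd() -> Vec<Field> {
    vec![
        Field { id: 1, name: "id".to_string() },
        Field { id: 3, name: "ts".to_string() },
    ]
}
```

In this sketch a sort field that still refers to the old id 2 no longer resolves, whereas a lookup by the name "ts" silently binds to the re-added column with id 3, which may hold unrelated data.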
I have reviewed most PRs that I am confident can be merged. The only one left is #615, for which I need more input. |
This PR is now ready for first reviews.
Some Remarks:
- For `add_sort_order` and `add_partition_spec`, the Java code re-builds the added sort-order against the current schema by matching column names. This implementation currently does not do this. Adding this feature would require `PartitionSpec` (bound) to store the schema it was bound against (probably a good idea anyway) and splitting `SortOrder` into bound and unbound, where the bound `SortOrder` also stores the schema it was bound against. Instead, this implementation assumes that provided sort-orders and partition-specs are valid for the current schema. Compatibility with the current schema is tested.
- The `add_schema` method does not require a `new_last_column_id` argument. In Java there is a todo to achieve the same. I put my reasoning in a comment in the code, feel free to comment on it.
- `new()` behaviour: it now re-assigns field-ids to start from 0. Some tests started from 1 before. Re-assigning the ids, just like in Java, ensures that fresh metadata always has fresh and correct ids even if they are created manually or re-used from another metadata.
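The last two remarks can be sketched roughly as follows. This is not the code from this PR: `Field`, `reassign_field_ids` and `next_last_column_id` are hypothetical names, the schema is flat (real Iceberg schemas are nested), and the "take the maximum" rule is only an assumption about how `add_schema` could derive the last column id without an extra argument.

```rust
/// Hypothetical flat schema field; not the iceberg-rust type.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    id: i32,
    name: String,
}

/// Re-assign ids in declaration order, starting from `first_id`.
/// Returns the new fields and the highest id handed out
/// (`None` for an empty schema).
fn reassign_field_ids(fields: &[Field], first_id: i32) -> (Vec<Field>, Option<i32>) {
    let mut next_id = first_id;
    let mut out = Vec::with_capacity(fields.len());
    for f in fields {
        out.push(Field { id: next_id, name: f.name.clone() });
        next_id += 1;
    }
    let highest = if out.is_empty() { None } else { Some(next_id - 1) };
    (out, highest)
}

/// Assumed rule: the last column id never decreases, so it can be derived
/// from the current value and the highest id in the added schema.
fn next_last_column_id(current: i32, added: &[Field]) -> i32 {
    let highest = added.iter().map(|f| f.id).max().unwrap_or(current);
    current.max(highest)
}
```

With this shape, fields that arrive with arbitrary ids (for example copied from other metadata) come out numbered consecutively from the chosen start, and the caller never has to pass a last column id by hand.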