-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: View metadata implementation #7759
Conversation
@nastra Is it possible to abstract a common interface for table and view because they have very similar methods? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks great to me just some minor remaining comments, and should be good to merge. Thanks a ton for carrying this forward @nastra! I'll reply on the original PR and we can probably close that one.
core/src/test/resources/org.apache.iceberg.view/ViewMetadataInvalidCurrentSchema.json
Outdated
Show resolved
Hide resolved
@nastra Thanks for your response, it indeed makes the code more complex. |
return versionsById().get(versionId); | ||
} | ||
|
||
@Value.Lazy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be Derived
rather than Lazy
. Lazy makes it more complicated and there's no real benefit to not just calculating this at construction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally I agree with you that @Value.Derived
is the better option.
However, given that we're deriving from a collection, it makes it difficult/impossible to do validation on the collection before eager validation on the derived field kicks in.
See also my comment in #7759 (comment) about default collection behavior and why it's better to have @Value.Lazy
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, does it make sense to make this Lazy
at all? They're not lazy or derived in TableMetadata
:
public Schema schema() {
return schemasById.get(currentSchemaId);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok removing @Value.Lazy
here, since the only thing it buys us is that it doesn't re-compute the value from the underlying hash map, which is O(1) anyway
versions().subList(versions().size() - versionHistorySizeToKeep(), versions().size()); | ||
List<ViewHistoryEntry> history = | ||
history().subList(history().size() - versionHistorySizeToKeep(), history().size()); | ||
return ImmutableViewMetadata.builder().from(this).versions(versions).history(history).build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is confusing. The builder should enforce the maximum number of versions, not the check method. It's awkward that this returns a copy of the view metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems to be the "official" way of normalizing an object: http://immutables.github.io/immutable.html#normalization
Not sure it's worth adding our own builder class for this particular case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is strange to me and I'm not a big fan of having a checkAndNormalize
in the interface, but okay I guess? It doesn't seem concerning enough to block using Immutables
but it is annoying that the pattern requires exposing additional methods that don't have a clear contract (like validate
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree here completely.
It would be great if there could be a better & shorter alternative to achieve this.
The (longer) alternative would be to have our own builder that internally uses ImmutableViewMetadata.builder()
and does the "normalization" but that doesn't prevent anyone from using ImmutableViewMetadata.builder()
directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but if we're using our own builder then what's the point of immutables?
For now, I think we should just get this in to unblock the next PR. We should follow up with the API that adds a new view representation and version. If that goes smoothly then it will be fine. If it doesn't fit then we can remove immutables and go with a direct implementation.
assertThatThrownBy(() -> ImmutableViewMetadata.builder().build()) | ||
.isInstanceOf(IllegalStateException.class) | ||
.hasMessage( | ||
"Cannot build ViewMetadata, some of required attributes are not set [formatVersion, location, currentSchemaId, currentVersionId]"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't schemas and versions required to be non-null and non-empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Collections are empty by default if not set and typically they are being checked in a @Value.Check
method. See also immutables/immutables#1429 (comment) for some historical reasons why Collections are being empty by default if not set.
The other alternative would be to use the Java Bean Validation API, which would allow expressing validations on Collections via annotations, but I don't think we'd want to do that here.
That's the main reason why I'd like to keep currentVersion()
and schema()
being @Value.Lazy
so that we don't have misleading error messaging.
What I mean here is that for example, if currentVersionId()
is set but versions()
is empty, we would see java.lang.NullPointerException: currentVersion
(because currentVersion
cannot be derived from versions()
and thus ends up being null) rather than java.lang.IllegalArgumentException: Cannot find current version 23 in view versions: [1]
core/src/test/java/org/apache/iceberg/view/TestViewMetadata.java
Outdated
Show resolved
Hide resolved
69d38ba
to
29dd333
Compare
properties(), | ||
ViewProperties.VERSION_HISTORY_SIZE, | ||
ViewProperties.VERSION_HISTORY_SIZE_DEFAULT); | ||
Preconditions.checkArgument( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this validation fails, won't it cause the view to be broken and unfixable? We won't be able to construct the ViewMetadata
so there would be no way to fix this if another implementation sets it to -1
or something for unlimited history. We would either need to document this requirement in the spec or fail more gracefully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes that is true, one wouldn't be able to construct the ViewMetadata
object`, because it's technically failing validation.
For keeping unlimited history I think we have the following options:
- allow
-1
to indicateunlimited
(not sure if we have any other properties that carry such semantics). - let users use a large value with
Integer.MAX_VALUE
being the maximum. This seems to be more appropriate IMO, since the view spec mentionsThe number of versions to retain is controlled by the table property: version.history.num-entries.
. However, I can also see the argument to "which value is large enough?".
Whatever we think carries the right semantics to indicate unlimited
history should go into the validation logic + the spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would rather not fail construction because of this. If the value is invalid, then let's just ignore it, warn, and not change the versions list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems reasonable, I've updated this to issue a WARN and not modify the history
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks about ready, although there are some minor comments.
Co-authored-by: John Zhuge <[email protected]> Co-authored-by: Eduard Tudenhoefner <[email protected]>
Thanks, @nastra! And thanks to @amogh-jahagirdar and @jzhuge for major parts of this as well! |
This is a continuation PR of #6559 and I've addressed all remaining comments.
@stevenzwu @jackye1995 @rdblue since you guys were reviewing the original PR, I've added you here as well. Please take a look when you can.