Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: View metadata implementation #6559

Closed

Conversation

amogh-jahagirdar
Copy link
Contributor

Co-authored-by: John Zhuge [email protected]

Discussed offline with @jzhuge, builds on the core parser changes already done on #4657 but based off the latest API changes.

private final List<ViewHistoryEntry> versionLog;
private final List<Schema> schemas;
private final Map<Integer, Schema> schemasById;
private final String metadataFileLocation;
Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't store metadataFileLocation in the metadata file, I'll remove this

Comment on lines 31 to 53
@RunWith(Parameterized.class)
public class TestViewMetadataParser extends ParserTestBase<ViewMetadata> {
private static ViewVersion version1 =
BaseViewVersion.builder()
.versionId(1)
.timestampMillis(4353L)
.addRepresentation(
BaseSQLViewRepresentation.builder().query("select 'foo' foo").schemaId(1).build())
.build();

private static ViewHistoryEntry historyEntry1 = BaseViewHistoryEntry.of(4353L, 1);

private static final Schema TEST_SCHEMA =
new Schema(
1,
Types.NestedField.required(1, "x", Types.LongType.get()),
Types.NestedField.required(2, "y", Types.LongType.get(), "comment"),
Types.NestedField.required(3, "z", Types.LongType.get()));

private static ViewVersion version2 =
BaseViewVersion.builder()
.versionId(2)
.timestampMillis(5555L)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add more tests for view metadata parsing. One thing I did just a bit differently than the original PR is move the JSON to a separate file in resources

Comment on lines 141 to 171
if (node.has(SCHEMAS)) {
JsonNode schemaArray = node.get(SCHEMAS);
Schema currentSchema = null;
Preconditions.checkArgument(
schemaArray.isArray(), "Cannot parse schemas from non-array: %s", schemaArray);
// current schema ID is required when the schema array is present
currentSchemaId = JsonUtil.getInt(CURRENT_SCHEMA_ID, node);
// parse the schema array
ImmutableList.Builder<Schema> builder = ImmutableList.builder();
for (JsonNode schemaNode : schemaArray) {
Schema schema = SchemaParser.fromJson(schemaNode);
if (schema.schemaId() == currentSchemaId) {
currentSchema = schema;
}
builder.add(schema);
}
Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doubt I still have is I think schemas and current schema should always be set, and thus the spec should be reflected so that it's required. Maybe there's a case that we don't want schema for the entire view and it's up to individual representations. In that case in my mind for any SQL representation in the view, the schema-id should be required (which then means there must be a schema list considering we only support SQL representations). Right now it's marked as optional in the spec

@@ -41,8 +40,8 @@ default Type type() {
/** The default namespace when the view is created. */
Namespace defaultNamespace();

/** The query output schema at version create time, without aliases. */
Schema schema();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change it to schema ID? If we know the schema ID, we should also be able to retrieve the Schema object

@jackye1995
Copy link
Contributor

I am trying see if we can break this further so that it's easier to review for a broader audience. There are some implementations of the basic interface, such as BaseViewVersion, BaseSQLViewRepresentation, BaseViewHistoryEntry, can we have a PR to first add these objects and their corresponding parsers in 3 small PRs? This would help the reviewers, also help to add tests for each feature we implement. Otherwise there is currently just 1 test for the entire view metadata.

@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented Jan 11, 2023

Agreed @jackye1995 we can break this down further for easier review. I'll raise the version, representation and history entry PRs separately and this one can be focused on the view metadata as a whole.

@nastra nastra self-requested a review January 13, 2023 14:15
nastra
nastra previously requested changes Jan 13, 2023
Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amogh-jahagirdar thanks for working on this. I've suggested a bunch of improvements that I think will make the code much shorter to read and easier to maintain in the long run. Also I think we need to add a few more tests around nullability in the Parsers

@amogh-jahagirdar
Copy link
Contributor Author

Thanks @nastra I'll be taking these suggestions in all the split PRs I'm raising. Agreed, more tests on nullability/missing fields are needed, and now that we use Immutable dependency in the metrics implementation, we have a good precedent to use it here as well which will simplify a lot of the boilerplate code

if (o == null || getClass() != o.getClass()) {
return false;
}
BaseSQLViewRepresentation that = (BaseSQLViewRepresentation) o;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: this method needs to have line spacing applied.


@Override
public String toString() {
return "BaseViewDefinition{"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use a helper rather than building this by hand? It's hard to read this way. We usually base these on MoreObjects.

internalWrite(metadata, outputFile, false);
}

public static void toJson(ViewMetadata metadata, JsonGenerator generator) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should check for null on metadata + add a test

}

public static ViewMetadata fromJson(JsonNode node) {
Preconditions.checkArgument(node != null, "Cannot parse view metadata from null json");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be good to add a test for this

static final int DEFAULT_VIEW_FORMAT_VERSION = 1;
static final int SUPPORTED_VIEW_FORMAT_VERSION = 1;

public static ImmutableViewMetadata.Builder buildFrom(ViewMetadata metadata) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this used anywhere? Also we can probably just use ImmutableViewMetadata.builder().from(metadata).xyz().build() here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or is the goal here to just create a copy? In that case we can also just do ImmutableViewMetadata.copyOf(metadata)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is trying to follow TableMetadata.buildFrom()?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is ImmutableViewMetadata defined? I couldn't find it in this PR or existing code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevenzwu ImmutableViewMetadata is a generated class from the @Value.Immutable annotation. See also https://immutables.github.io/ for some additional details

return schemasById().get(currentSchemaId());
}

private static Map<Integer, ViewVersion> indexVersions(List<ViewVersion> versions) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think indexVersions and indexSchemas don't need to be static

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't static preferred? If someone makes a change in the future, I'd rather they be static since that is more strict. I could see changing from private to a more open visibility, in which case we would prefer static.

}
}

private ViewMetadataParser() {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think this should be at the top

import org.junit.Test;
import org.junit.runners.Parameterized;

public class TestViewMetadata {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be called TestViewMetadataParser instead?

import org.apache.iceberg.types.Types;
import org.assertj.core.api.Assertions;
import org.junit.Assert;
import org.junit.Test;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth changing this new test to using JUnit5

.currentVersionId(2)
.formatVersion(1)
.build();
assertSameViewMetadata(expectedViewMetadata, ViewMetadataParser.fromJson(json));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: assertSame usually suggests that there's some sort of identity comparison between two objects.

generator.writeEndObject();
}

static String toJson(ViewMetadata viewMetadata) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe move this before the other toJson(..) method

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this is not public?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jack makes a good point. A few of these should be reviewed. In addition to this, which reasonably should be public, I think that toJson(ViewMetadata, JsonGenerator) should not be public until we need it in another package.

ViewProperties.VERSION_HISTORY_SIZE_DEFAULT);

Preconditions.checkArgument(
numVersionsToKeep >= 1, "%s must be positive", ViewProperties.VERSION_HISTORY_SIZE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add details of the received value of numVersionsToKeep in error message

.build();
}

static ViewMetadata fromJson(String json) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this is not public?

@@ -0,0 +1,80 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like puffin files are in path org/apache/iceberg/puffin/v1, we should follow the same pattern

import org.immutables.value.Value;

@Value.Immutable
public abstract class ViewMetadata implements Serializable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other view objects are interfaces. Why does this use an abstract class? Moving between interfaces and classes is a binary incompatible change, so I think that this should be an interface instead. Is that not possible for some reason?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess why this is an abstract class rather than an Interface is because of indexVersions() and indexSchemas() can't be made private if we use an Interface here, meaning that they would become part of the API, which I don't think we want

Preconditions.checkArgument(
formatVersion <= ViewMetadata.SUPPORTED_VIEW_FORMAT_VERSION,
"Cannot read unsupported version %s",
formatVersion);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: what is the value of adding this here rather than in check()?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this should go into check(), because otherwise you could create an invalid ViewMetadata object

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, @nastra fixed this in https://github.com/apache/iceberg/pull/7759/files#diff-037c118271eb576698f0829d876d5c49919eea4bd20e6b2f98b6072cfd0d4b08R133-R135 where all the validation is done in the special check method he referred to.

versions.add(version);
}

Preconditions.checkArgument(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a lot of these checks should be in the ViewMetadata class rather than here. Otherwise, you can construct invalid ViewMetadata instances.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, @nastra fixed this in https://github.com/apache/iceberg/pull/7759/files#diff-037c118271eb576698f0829d876d5c49919eea4bd20e6b2f98b6072cfd0d4b08R133-R135 where all the validation is done in the special check method he referred to.

Preconditions.checkArgument(
numVersionsToKeep >= 1, "%s must be positive", ViewProperties.VERSION_HISTORY_SIZE);

if (versions.size() > numVersionsToKeep) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do this here? I don't think it belongs in the parser.

Copy link
Contributor

@nastra nastra Jun 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this shouldn't be here. The better option would be to put this into a special version of a @Check method that allows normalization. This would allow to retain the number of versions that are desired in ViewMetadata directly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, @nastra fixed this in https://github.com/apache/iceberg/pull/7759/files#diff-037c118271eb576698f0829d876d5c49919eea4bd20e6b2f98b6072cfd0d4b08R133-R135 where all the validation is done in the special check method he referred to.

static final int DEFAULT_VIEW_FORMAT_VERSION = 1;
static final int SUPPORTED_VIEW_FORMAT_VERSION = 1;

public static ImmutableViewMetadata.Builder buildFrom(ViewMetadata metadata) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is ImmutableViewMetadata defined? I couldn't find it in this PR or existing code

}

private static Map<Integer, Schema> indexSchemas(List<Schema> schemas) {
if (schemas == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could schemas be null? the spec says it is optional. but it is also says A list of schemas, the same as the ‘schemas’ field from Iceberg table spec.

It is optional to support both v1 and v2 tables?

@amogh-jahagirdar
Copy link
Contributor Author

@nastra is carrying this forward in #7759, which addresses the pending comments I'll be closing this one just to avoid confusion
Thanks everyone!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants