Flink: switch to use SortKey for data statistics #9212

Merged on Dec 8, 2023 (6 commits)

Conversation

stevenzwu
Contributor

No description provided.

@github-actions github-actions bot added the flink label Dec 4, 2023
@stevenzwu stevenzwu requested a review from pvary December 5, 2023 17:33
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class TestDataStatisticsOperator {
private final Schema schema =
Contributor

Would it be worth having an e2e test with a wider record and a narrower SortKey?

@stevenzwu stevenzwu force-pushed the flink-sort-key branch 2 times, most recently from fa153d6 to e594f9e Compare December 7, 2023 01:01
operator.processElement(new StreamRecord<>(GenericRowData.of(StringData.fromString("a"))));
operator.processElement(new StreamRecord<>(GenericRowData.of(StringData.fromString("b"))));
operator.processElement(new StreamRecord<>(genericRowDataA));
operator.processElement(new StreamRecord<>(genericRowDataA));
Contributor

Nit: could we do something like this here:

genericRowDataA_2 = GenericRowData.of(StringData.fromString("a"), 2);
operator.processElement(new StreamRecord<>(genericRowDataA_2));

Contributor Author

Do you mean not reusing the object? If so, I would just get rid of the pre-constructed objects and always construct them on the fly.

Contributor

Sorry - my midnight review was not clear enough. 😢

I was trying to suggest processing a different RowData object with the same key, like:

private final GenericRowData genericRowDataA_1 = GenericRowData.of(StringData.fromString("a"), 1);
private final GenericRowData genericRowDataA_2 = GenericRowData.of(StringData.fromString("a"), 2);
private final GenericRowData genericRowDataB = GenericRowData.of(StringData.fromString("b"), 3);
[..]
operator.processElement(new StreamRecord<>(genericRowDataA_1));
operator.processElement(new StreamRecord<>(genericRowDataA_2));
operator.processElement(new StreamRecord<>(genericRowDataB));

I know that we have an individual test for the correct grouping, but I consider this an e2e test, and it would be nice to cover this case here as well.
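
For readers following the thread, the case being requested can be sketched in isolation: distinct rows that share a sort key should be counted under one key. The sketch below is plain Java and does not use Flink's RowData or Iceberg's SortKey. The `Row` class, `sortKey`, and `countBySortKey` are hypothetical stand-ins introduced only for illustration, and the assumption is that the sort key is the first (string) column of a wider two-column row.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SortKeyCountSketch {
  // Hypothetical stand-in for a two-column row (string, int); not Flink's RowData.
  static final class Row {
    final String name;
    final int value;

    Row(String name, int value) {
      this.name = name;
      this.value = value;
    }
  }

  // Assumption: the sort key is narrower than the row and uses only the first column.
  static String sortKey(Row row) {
    return row.name;
  }

  // Count rows per sort key; rows that differ only in non-key columns share a count.
  static Map<String, Long> countBySortKey(List<Row> rows) {
    Map<String, Long> counts = new LinkedHashMap<>();
    for (Row row : rows) {
      counts.merge(sortKey(row), 1L, Long::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    List<Row> rows = List.of(new Row("a", 1), new Row("a", 2), new Row("b", 3));
    System.out.println(countBySortKey(rows)); // prints {a=2, b=1}
  }
}
```

The two "a" rows are different objects with different non-key values, yet they land in the same bucket, which is the behavior the suggested test inputs would exercise.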

Contributor Author

Got it. For readability, I would just get rid of the pre-constructed objects and always construct them on the fly then.

Contributor

@pvary pvary left a comment

One minor comment, but this is really nice work.

Thanks, Steven!

@stevenzwu stevenzwu merged commit beb41b6 into apache:main Dec 8, 2023
13 checks passed
@stevenzwu
Contributor Author

Thanks @pvary for the review.

stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Dec 8, 2023
stevenzwu added a commit that referenced this pull request Dec 9, 2023
@Override
public void writeSnapshot(DataOutputView out) throws IOException {
Preconditions.checkState(schema != null, "Invalid schema: null");
Preconditions.checkState(sortOrder != null, "Invalid sort order: null");
Contributor

How about using Preconditions.checkNotNull to check whether schema and sortOrder are null?
Same for lines 329 and 330.

Contributor

I personally prefer to avoid checkNotNull, as the result is a NullPointerException. Generally, NullPointerExceptions are something very unexpected, while IllegalArgumentExceptions are a different category for me (configuration errors and such).

That is why I did not flag this during the review, but, taking a second look, this could seem strange to someone less biased.
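
To make the trade-off in this exchange concrete, here is a small sketch of which exception each style surfaces. In Guava's Preconditions, checkState throws IllegalStateException, checkArgument throws IllegalArgumentException, and checkNotNull throws NullPointerException. The helpers below are hand-rolled stand-ins that mirror that behavior so the example runs without any dependency; they are not the Iceberg or Guava classes themselves, and `failureType` is a hypothetical helper added only for the demonstration.

```java
public class PreconditionStyleSketch {
  // Stand-in mirroring Guava's checkState: a failed check is an IllegalStateException.
  static void checkState(boolean expression, String message) {
    if (!expression) {
      throw new IllegalStateException(message);
    }
  }

  // Stand-in mirroring Guava's checkNotNull: a null reference is a NullPointerException.
  static <T> T checkNotNull(T reference, String message) {
    if (reference == null) {
      throw new NullPointerException(message);
    }
    return reference;
  }

  // Hypothetical helper: run a check and report the simple name of what it threw.
  static String failureType(Runnable check) {
    try {
      check.run();
      return "none";
    } catch (RuntimeException e) {
      return e.getClass().getSimpleName();
    }
  }

  public static void main(String[] args) {
    Object schema = null;
    // Style used in the PR: explicit null comparison inside checkState.
    System.out.println(failureType(() -> checkState(schema != null, "Invalid schema: null")));
    // Style suggested in the comment: checkNotNull.
    System.out.println(failureType(() -> checkNotNull(schema, "Invalid schema: null")));
  }
}
```

Run as written, this prints IllegalStateException and then NullPointerException. Both styles carry the same message, so the choice mostly affects how a caller or log reader categorizes the failure.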

@stevenzwu stevenzwu deleted the flink-sort-key branch December 16, 2023 21:33
lisirrx pushed a commit to lisirrx/iceberg that referenced this pull request Jan 4, 2024
devangjhabakh pushed a commit to cdouglas/iceberg that referenced this pull request Apr 22, 2024
3 participants