Flink: Emit watermarks from the IcebergSource #8553
Minor typo about title
Fixed, thanks!
cc: @sundargates - you might be interested in this
Here is the design doc that discussed the proposal from this PR compared to a previous approach: https://docs.google.com/document/d/1JCF77SkeB8Z4h5pXRCRcMh4aiAVuBzHM5IwohLwNCto. We are going with this approach due to its simpler implementation and its reuse of most of what the Flink framework provides in terms of watermark alignment.
Created #8803 to make it possible to avoid keeping all of the stats when creating the ScanTasks
4e92111 to 372f93d
@stevenzwu, @sundargates: Finally, I was able to get #8803 in. So I updated the PR, and the new version allows to set a
Please review!
still need to check the unit test code. didn't quite get it on the first look
    });

    // Use static variable to collect the windows, since other solutions were flaky
    windows.clear();
This is somewhat inelegant, but the windowed.collectAsync() solution was flaky. It failed once per ~1000 runs. Even though the window operator was called and emitted the records (checked by logs), the results did not show up in the resultIterator. Maybe some Flink glitch?
So I opted for a static ConcurrentMap instead to collect the generated windows.
I am not sure how the windows variable works here. Won't it be serialized and shipped to the task manager?
If the windowing test is flaky, I would be inclined to just remove it. I feel the throttling test method below this is more relevant and clear.
If I used a non-static windows variable, it would be serialized, and I could not use it in my test. That is why I had to use a static variable, which only works correctly in tests that run everything in one JVM.
Flakiness is solved by this change.
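The serialization point above can be illustrated with a minimal sketch in plain Java, with no Flink dependency. All names here (StaticCollectorSketch, Collector, STATIC_WINDOWS, roundTrip) are hypothetical and are not taken from the PR's test code. The round trip through Java serialization stands in for shipping a function to a task: the instance field ends up in a separate copy, while the static field is shared by everything in the same JVM.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class StaticCollectorSketch {
  // Shared by every instance in this JVM; Java serialization skips static fields.
  static final Map<String, Integer> STATIC_WINDOWS = new ConcurrentHashMap<>();

  // Hypothetical stand-in for a user function that would be serialized and shipped to a task.
  static final class Collector implements Serializable {
    private static final long serialVersionUID = 1L;

    // Instance state: each deserialized copy gets its own map.
    final Map<String, Integer> instanceWindows = new ConcurrentHashMap<>();

    void collect(String window, int count) {
      instanceWindows.put(window, count);
      STATIC_WINDOWS.put(window, count);
    }
  }

  // Serializes and deserializes the collector, producing an independent copy.
  static Collector roundTrip(Collector original) throws IOException, ClassNotFoundException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(original);
    }
    try (ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
      return (Collector) in.readObject();
    }
  }

  public static void main(String[] args) throws Exception {
    Collector inTest = new Collector();
    Collector inTask = roundTrip(inTest); // the copy that would run inside the task
    inTask.collect("window-1", 3);

    System.out.println(inTest.instanceWindows.isEmpty()); // true: the test's copy saw nothing
    System.out.println(STATIC_WINDOWS.get("window-1"));   // 3: visible through the static map
  }
}
```

As noted in the thread, this only holds when the test and the tasks run in one JVM; with separate task manager processes each JVM would have its own static map.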
one more nit comment on Javadoc.
otherwise, LGTM
@@ -429,6 +444,30 @@ public Builder<T> setAll(Map<String, String> properties) {
      return this;
    }

    /**
     * Emits watermarks once per split based on the file statistics for the given split. The
nit: "based on the file statistics" -> "based on the min value of column statistics from file metadata"
@@ -453,6 +492,18 @@ public IcebergSource<T> build() {
        contextBuilder.project(FlinkSchemaUtil.convert(icebergSchema, projectedFlinkSchema));
      }

      SerializableRecordEmitter<T> emitter = SerializableRecordEmitter.defaultEmitter();
      if (watermarkColumn != null) {
I am fine with tabling this as a follow-up.
- The default combining logic is fine for batch processing, where ordering doesn't matter. For streaming reads and watermark alignment, ordering is important because it affects data buffering in the Flink state. That is precisely what watermark alignment is intended to solve.
- Flink watermark alignment has been designed mostly with unbounded splits (like Kafka) in mind. Also, within an unbounded Kafka split, records are FIFO, so there is a natural order within a split.
@pvary not sure every user understands the internal details. At least we can document this option of disabling combining for better ordering in the doc. Then users can make an informed choice. @dchristle can probably also chime in here and help review the doc change in a follow-up PR.
    this.eventTimeFieldId = field.fieldId();
    this.eventTimeFieldName = eventTimeFieldName;
    // Use the timeUnit only for Long columns.
    this.timeUnit = typeID.equals(TypeID.LONG) ? timeUnit : TimeUnit.MICROSECONDS;
Thanks for the pointer!
Created #9137 to tackle this
Merged to main.
There are several cases when it is worthwhile to emit watermarks from the source.
In these cases we need to make sure that the emitted watermarks are in order.
If we do this correctly we could use:
The PR has 2 main additions to the current IcebergSource:
- IcebergSourceRecordEmitter interface to do the actual record emission
  - EventTimeExtractorRecordEmitter which emits the watermark for every split and emits the record too
- IcebergEventTimeExtractor interface to extract the split watermark, and the record timestamp
The solution uses the OrderedSplitAssignerFactory to order the splits based on the split watermark, and enhances the IcebergSource so the users could provide the eventTimeExtractor
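The ordering idea in the description can be sketched in plain Java, without Flink or Iceberg dependencies. This is only an illustration under stated assumptions: Split, minEventTimeMs, and watermarksInReadOrder are hypothetical names, not the PR's classes, and the per-split watermark is assumed to be the minimum event time found in the split's column statistics, as discussed in the review. Sorting the splits by that value before reading means the watermarks emitted once per split never go backwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SplitWatermarkSketch {
  // Hypothetical stand-in for a split: a name plus the minimum event time (ms)
  // that would be taken from the column statistics of its files.
  static final class Split {
    final String name;
    final long minEventTimeMs;

    Split(String name, long minEventTimeMs) {
      this.name = name;
      this.minEventTimeMs = minEventTimeMs;
    }
  }

  // Orders the splits by their watermark and returns the watermark emitted for each
  // split, in the order the splits would be read.
  static List<Long> watermarksInReadOrder(List<Split> splits) {
    List<Split> ordered = new ArrayList<>(splits);
    ordered.sort(Comparator.comparingLong(s -> s.minEventTimeMs));

    List<Long> watermarks = new ArrayList<>();
    for (Split split : ordered) {
      // One watermark per split, emitted before the records of that split.
      watermarks.add(split.minEventTimeMs);
    }
    return watermarks;
  }

  public static void main(String[] args) {
    List<Split> discovered = Arrays.asList(
        new Split("file-c", 3000L),
        new Split("file-a", 1000L),
        new Split("file-b", 2000L));

    System.out.println(watermarksInReadOrder(discovered)); // [1000, 2000, 3000]
  }
}
```

Without the sort, the same input would produce 3000, 1000, 2000, and the later, smaller values could not be emitted as valid watermarks, which is why the split order matters here.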