RDB Shredder: optimize DAG by excluding count #582
chuwy changed the title from "RDB Shredder: optimize DAG by exclusing count" to "RDB Shredder: optimize DAG by excluding count" on Sep 30, 2021
chuwy added a commit that referenced this issue on Oct 20, 2021
spenes pushed a commit that referenced this issue on Oct 28, 2021
spenes pushed a commit that referenced this issue on Nov 3, 2021
dilyand pushed a commit that referenced this issue on Nov 8, 2021
chuwy added a commit that referenced this issue on Nov 16, 2021
spenes pushed a commit that referenced this issue on Nov 19, 2021
chuwy added a commit that referenced this issue on Nov 24, 2021
spenes pushed a commit that referenced this issue on Dec 2, 2021
chuwy added a commit that referenced this issue on Dec 3, 2021
Currently we use `countAsync` to count events in the final RDD, but it has proven very inefficient with regard to caching, and it is not actually asynchronous. We can replace `countAsync` with an event counter implemented as a Spark accumulator to get rid of the additional job.

Counters implemented on top of accumulators are usually a bad idea because they can be executed more than once, but we decided it's a reasonable trade-off because:

This commit also simplifies the DAG by joining the good and bad data outputs into a single step. Finally, it saves a shredded type only if it has been successfully shredded.
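
To illustrate the trade-off, here is a toy model in plain Python rather than Spark. The `Accumulator` class, `run_task`, `shred_with_counter` and the upper-casing "shred" step are all hypothetical stand-ins, not the RDB Shredder code or the Spark API. The sketch only shows how a counter updated inside a transformation stays exact when every task runs once and over-counts when a task is re-executed.

```python
from typing import Callable, Iterable, List, Set


class Accumulator:
    """Toy stand-in for a Spark accumulator: tasks add to it, the driver reads it."""

    def __init__(self) -> None:
        self.value = 0

    def add(self, n: int) -> None:
        self.value += n


def run_task(partition: Iterable[str],
             fn: Callable[[str], str],
             fail_first: bool = False) -> List[str]:
    """Apply fn to one partition; optionally simulate a lost first attempt."""
    records = list(partition)
    if fail_first:
        # The first attempt does all the work (and all the side effects),
        # but its output is discarded, so the task has to run again.
        for record in records:
            fn(record)
    return [fn(record) for record in records]


def shred_with_counter(partitions: List[List[str]],
                       counter: Accumulator,
                       flaky: Set[int] = frozenset()) -> List[str]:
    """Process every partition once, counting events as a side effect."""

    def shred(event: str) -> str:
        counter.add(1)        # counting happens inside the transformation
        return event.upper()  # placeholder for the real shredding work

    output: List[str] = []
    for index, partition in enumerate(partitions):
        output.extend(run_task(partition, shred, fail_first=index in flaky))
    return output


partitions = [["a", "b"], ["c"]]

clean = Accumulator()
clean_output = shred_with_counter(partitions, clean)
print(len(clean_output), clean.value)      # 3 3

retried = Accumulator()
retried_output = shred_with_counter(partitions, retried, flaky={0})
print(len(retried_output), retried.value)  # 3 5
```

In the second run the output still holds three events, but the counter reports five because partition 0 was processed twice. A separate counting pass over the output would be exact, at the cost of traversing the data again, which is the extra job this issue proposes to remove.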