RDB Shredder: optimize DAG by excluding count #582

Closed
chuwy opened this issue Sep 30, 2021 · 0 comments

chuwy commented Sep 30, 2021

Currently we use countAsync to count events in the final RDD, but it has proven very inefficient with regard to caching, and in practice it is not asynchronous either. We can replace countAsync with an event counter implemented as a Spark accumulator to get rid of the additional job.

Counters implemented on top of accumulators are usually a bad idea because an update can be applied more than once when a task is re-attempted, but we decided it's a reasonable trade-off because:

  1. We never need an exact number of events - it's purely informational
  2. The chance of another attempt is very small at the point where we increment

This commit also simplifies the DAG by joining the good and bad data outputs into a single step. Finally, it saves a shredded type only if it has been successfully shredded.
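
For illustration, the accumulator-based counter described above might look roughly like the sketch below. This is not the code from the actual commit: the object name, the writeAndCount helper, the accumulator name and the plain-string events are all hypothetical, and the output is written with saveAsTextFile only to keep the example small. It relies on SparkContext.longAccumulator and assumes Spark is on the classpath.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: names and structure are assumptions, not the shredder's real code
object AccumulatorCountSketch {

  /** Writes `events` to `outputPath` and returns an approximate count of written events. */
  def writeAndCount(spark: SparkSession, events: Seq[String], outputPath: String): Long = {
    val sc = spark.sparkContext

    // The accumulator name is only used for display in the Spark UI
    val eventCounter = sc.longAccumulator("shredded-events")

    val shredded = sc.parallelize(events).map { event =>
      // Incremented in the same stage that produces the output, so no separate
      // count job is needed. A re-attempted task would increment again, which
      // is why the resulting number is informational only.
      eventCounter.add(1L)
      event
    }

    // The only action: writing the output also drives the counter
    shredded.saveAsTextFile(outputPath)

    // Read on the driver, after the action has completed
    eventCounter.sum
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("accumulator-count-sketch")
      .getOrCreate()
    try {
      // saveAsTextFile requires a path that does not exist yet
      val out = Files.createTempDirectory("shredded").resolve("output").toString
      val written = writeAndCount(spark, Seq("event-1", "event-2", "event-3"), out)
      println(s"Events written (approximate): $written")
    } finally spark.stop()
  }
}
```

The increment sits as close to the write as possible, which matches point 2 above: the later in the DAG the counter is updated, the less work a retry can double-count.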

@chuwy chuwy changed the title RDB Shredder: optimize DAG by exclusing count RDB Shredder: optimize DAG by excluding count Sep 30, 2021
@chuwy chuwy closed this as completed in 7dc411c Jan 19, 2022
@chuwy chuwy self-assigned this Jan 20, 2022
@chuwy chuwy added this to the 2.1.0 milestone Jan 20, 2022