RDB Shredder: optimize DAG by excluding count #582

chuwy · 2021-09-30T11:44:49Z

Currently we use countAsync to count events in the final RDD, but it has proven very inefficient with regard to caching, and it is not actually asynchronous. We can replace countAsync with an event counter implemented as a Spark accumulator to get rid of the additional job.

Counters implemented on top of accumulators are usually a bad idea because their increments can be executed more than once, but we decided it's a reasonable trade-off because:

We never need an exact number of events - it's purely informational
The chance of another attempt is very small at the point where we increment
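
To illustrate the trade-off above, here is a minimal toy model in plain Python. It is not Spark code and not the Shredder's implementation: `ToyAccumulator`, `run_task` and `run_job` are hypothetical names, and a "retry" is simulated by simply running a partition twice. It only shows why a side-effecting counter can overcount when a task is re-executed, while the job's output stays the same.

```python
from dataclasses import dataclass


@dataclass
class ToyAccumulator:
    """Add-only counter; a stand-in for an accumulator (toy model, not Spark)."""
    value: int = 0

    def add(self, n: int) -> None:
        self.value += n


def run_task(partition, acc):
    """Process one partition, bumping the counter once per event as a side effect."""
    out = []
    for event in partition:
        acc.add(1)
        out.append(event.upper())
    return out


def run_job(partitions, acc, retried=()):
    """Run every partition once; partitions whose index is in `retried` run twice.

    The extra run mimics a re-executed task: its output is discarded,
    but its counter increments have already been applied.
    """
    results = []
    for i, part in enumerate(partitions):
        if i in retried:
            run_task(part, acc)  # discarded attempt, counter still incremented
        results.extend(run_task(part, acc))
    return results


partitions = [["a", "b"], ["c"], ["d", "e", "f"]]

clean = ToyAccumulator()
out_clean = run_job(partitions, clean)

retry = ToyAccumulator()
out_retry = run_job(partitions, retry, retried={2})

print(len(out_clean), clean.value)  # 6 6
print(len(out_retry), retry.value)  # 6 9
```

In the second run the output still contains six events, but the counter reports nine. That is acceptable only because the number is informational and a re-attempt at the increment point is expected to be rare.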

This commit also simplifies the DAG by joining the good and bad data outputs into a single step. Finally, it saves a shredded type only if it has been shredded successfully.

chuwy changed the title ~~RDB Shredder: optimize DAG by exclusing count~~ RDB Shredder: optimize DAG by excluding count Sep 30, 2021

chuwy added a commit that referenced this issue Oct 20, 2021

RDB Shredder: optimize DAG by excluding count (close #582)

f883b54

spenes pushed a commit that referenced this issue Oct 28, 2021

RDB Shredder: optimize DAG by excluding count (close #582)

3c3e3bd

spenes pushed a commit that referenced this issue Nov 3, 2021

RDB Shredder: optimize DAG by excluding count (close #582)

e18a890

dilyand pushed a commit that referenced this issue Nov 8, 2021

RDB Shredder: optimize DAG by excluding count (close #582)

fa00982

chuwy added a commit that referenced this issue Nov 16, 2021

RDB Shredder: optimize DAG by excluding count (close #582)

998a6dd

spenes pushed a commit that referenced this issue Nov 19, 2021

RDB Shredder: optimize DAG by excluding count (close #582)

23c4461

chuwy added a commit that referenced this issue Nov 24, 2021

RDB Shredder: optimize DAG by excluding count (close #582)

177d5b3

spenes pushed a commit that referenced this issue Dec 2, 2021

RDB Shredder: optimize DAG by excluding count (close #582)

42c21d0

chuwy added a commit that referenced this issue Dec 3, 2021

RDB Shredder: optimize DAG by excluding count (close #582)

3766517

chuwy closed this as completed in 7dc411c Jan 19, 2022

chuwy self-assigned this Jan 20, 2022

chuwy added this to the 2.1.0 milestone Jan 20, 2022
