RDB Shredder: persist synthetic duplicates on disk #142
chuwy changed the title from "RDB Shredder: persist shredded entities on disk" to "RDB Shredder: persist synthetic duplicates entities on disk", and then to "RDB Shredder: persist synthetic duplicates on disk" (Apr 16, 2019).

chuwy added commits referencing this issue on Apr 16, 17, 18, 19, and 21, Jun 25, and Jul 18, 2019.
Currently, the RDB Shredder job has a lineage with three RDDs branched off from `goodWithSyntheticDupes`, an RDD in which the non-deterministic `UUID.randomUUID()` operation is invoked. This means that if Spark does not have enough memory to store a block with synthetic duplicates, it can evict the block and recompute it later for one of the derived RDDs. During that recomputation a new `event_id` will be generated that won't correspond to `events`, which results in orphaned shredded entities.

In other words, a single enriched event is shredded into three entities:

- `events` - the atomic TSV
- `jsons` - all shredded entities: contexts and self-describing events (except the `iglu:com.snowplowanalytics.snowplow/duplicate/jsonschema/1-0-0` context)
- the `duplicate` context

that won't have the same `event_id`/`root_id`, which means they cannot be joined in Redshift.

Orphans appear only for events that went through synthetic deduplication, i.e. events that have the same `event_id` but different `fingerprint`s. These events are usually generated by bots rather than real users and rarely exceed 0.5% of all events.

Restructuring the RDD lineage (using the new Analytics SDK 0.4.0) to union `jsons` with `duplicate`, and persisting the RDD with newly generated `event_id`s using the `MEMORY_AND_DISK_SER` storage level, should solve the problem.
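The hazard can be reproduced outside Spark with a minimal sketch. Here a `def` models an evicted block that Spark recomputes separately for each derived RDD, while a `lazy val` models a block persisted with `MEMORY_AND_DISK_SER`, which all downstream branches observe identically. The names below are purely illustrative and not from the RDB Shredder codebase:

```scala
import java.util.UUID

object UuidRecompute {
  // A `def` re-evaluates on every access, like an evicted RDD block being
  // recomputed for each derived RDD: every recomputation yields a new event_id.
  def recomputed: String = UUID.randomUUID().toString

  // A `lazy val` evaluates once and caches the result, like an RDD persisted
  // with MEMORY_AND_DISK_SER: every downstream branch sees the same event_id.
  lazy val persisted: String = UUID.randomUUID().toString

  def main(args: Array[String]): Unit = {
    // Two "branches" reading the non-persisted value diverge, so the atomic
    // event and its shredded entities would end up with different event_ids.
    println(recomputed == recomputed)

    // The persisted value is stable across branches.
    println(persisted == persisted)
  }
}
```

In Spark terms, the fix is to call `.persist(StorageLevel.MEMORY_AND_DISK_SER)` on the RDD immediately after the fresh `event_id`s are generated, so an eviction spills the serialized block to disk instead of discarding it and re-running the non-deterministic computation.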