RDB Shredder: persist synthetic duplicates on disk #142

Closed
chuwy opened this issue Apr 16, 2019 · 0 comments

chuwy commented Apr 16, 2019

Currently, the RDB Shredder job has a lineage with three RDDs branched off from goodWithSyntheticDupes, an RDD in which the non-deterministic UUID.randomUUID() operation is invoked. This means that if Spark does not have enough memory to store a block with synthetic duplicates, it can evict that block and recompute it later for one of the derived RDDs. During that recomputation a new event_id is generated that does not correspond to the one in events, which leads to orphaned shredded entities.
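The recomputation hazard can be illustrated without Spark. The following is a minimal sketch in plain Python, not the Shredder code: `with_synthetic_ids` is a hypothetical stand-in for the lazily evaluated goodWithSyntheticDupes RDD, and calling it twice plays the role of Spark recomputing an evicted block for a second derived RDD.

```python
import uuid

def with_synthetic_ids(events):
    """Hypothetical stand-in for a lazily evaluated RDD.

    Every call recomputes the rows and draws fresh random UUIDs,
    the way a recomputed Spark block would.
    """
    return [{"event_id": str(uuid.uuid4()), "payload": e} for e in events]

events = ["page_view", "link_click"]

# Two derived outputs each re-evaluate the parent instead of sharing one result.
atomic_ids = [row["event_id"] for row in with_synthetic_ids(events)]
shredded_root_ids = [row["event_id"] for row in with_synthetic_ids(events)]

# The same input events end up with different IDs in the two outputs.
ids_match = atomic_ids == shredded_root_ids
print(ids_match)
```

Because the IDs are random, the two lists differ, even though both were derived from the same input events.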

In other words, a single enriched event is shredded into three entities:

  1. events - atomic TSV
  2. jsons - all shredded entities: contexts and self-describing events (except the iglu:com.snowplowanalytics.snowplow/duplicate/jsonschema/1-0-0 context)
  3. duplicate context

After a recomputation these entities won't share the same event_id/root_id, which means they cannot be joined in Redshift.
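To make "orphaned" concrete, here is a small sketch of the check a join would effectively perform. The row shapes, the IDs, and the `find_orphans` helper are illustrative assumptions, not the actual Redshift schema or Shredder code.

```python
def find_orphans(event_ids, shredded_rows):
    """Return shredded rows whose root_id has no matching event_id."""
    known = set(event_ids)
    return [row for row in shredded_rows if row["root_id"] not in known]

# Hypothetical data: the second shredded row got a regenerated ID.
event_ids = ["id-a", "id-b"]
shredded_rows = [
    {"root_id": "id-a", "entity": "context"},
    {"root_id": "id-z", "entity": "duplicate"},
]

orphans = find_orphans(event_ids, shredded_rows)
print(orphans)
```

Here the row with root_id `id-z` has no parent in events, so a join on event_id/root_id would silently drop it.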

Orphans appear only for events that went through synthetic deduplication, i.e. events that have the same event_id but different fingerprints. These events are usually generated by bots rather than real users, and rarely exceed 0.5% of events.

Restructuring the RDD lineage (using the new Analytics SDK 0.4.0) to union jsons with duplicate, and persisting the RDD with newly generated event_ids at the MEMORY_AND_DISK_SER storage level, should solve the problem.
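The idea behind the fix can be sketched in plain Python, again without Spark. In this sketch the `persisted` list stands in for the persisted RDD, and `assign_synthetic_ids` and the row shapes are hypothetical. IDs are generated exactly once, and both the atomic output and the unioned jsons/duplicate output are derived from that single materialized result.

```python
import uuid

def assign_synthetic_ids(events):
    """Draw a fresh random ID per event (the non-deterministic step)."""
    return [{"event_id": str(uuid.uuid4()), "payload": e} for e in events]

# Materialize once; this list plays the role of the persisted RDD,
# so later consumers read stored rows instead of recomputing them.
persisted = assign_synthetic_ids(["page_view", "link_click"])

# Atomic events output.
atomic_ids = [row["event_id"] for row in persisted]

# jsons and the duplicate context are derived together (the "union")
# from the same persisted rows.
shredded = [
    {"root_id": row["event_id"], "entity": entity}
    for row in persisted
    for entity in ("context", "duplicate")
]

known = set(atomic_ids)
orphans = [row for row in shredded if row["root_id"] not in known]
print(len(orphans))
```

In Spark itself, a storage level that spills to disk matters because a memory-only cache can still be evicted under memory pressure, which would bring the recomputation back.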

chuwy added this to the Release 31 milestone Apr 16, 2019
chuwy self-assigned this Apr 16, 2019
chuwy changed the title "RDB Shredder: persist shredded entities on disk" to "RDB Shredder: persist synthetic duplicates entities on disk" Apr 16, 2019
chuwy added the bug label Apr 16, 2019
chuwy changed the title "RDB Shredder: persist synthetic duplicates entities on disk" to "RDB Shredder: persist synthetic duplicates on disk" Apr 16, 2019
chuwy closed this as completed in a949db9 Aug 16, 2019