RDB Shredder: persist synthetic duplicates on disk #142
chuwy changed the title from "RDB Shredder: persist shredded entities on disk" to "RDB Shredder: persist synthetic duplicates entities on disk", and then to "RDB Shredder: persist synthetic duplicates on disk" (Apr 16, 2019).

chuwy added commits referencing this issue on Apr 16, 17, 18, 19, and 21, Jun 25, and Jul 18, 2019.
Currently, the RDB Shredder job has a lineage with three RDDs branched off from `goodWithSyntheticDupes`, an RDD in which the non-deterministic `UUID.randomUUID()` operation is invoked. This means that if Spark does not have enough memory to store a block with synthetic duplicates, it can evict the block and recompute it later for one of the derived RDDs. During that recomputation a new `event_id` will be generated that won't correspond to `events`, which results in orphaned shredded entities.

In other words, a single enriched event is shredded into three entities:

- `events` - the atomic TSV
- `jsons` - all shredded entities: contexts and self-describing events (except the `iglu:com.snowplowanalytics.snowplow/duplicate/jsonschema/1-0-0` context)
- the `duplicate` context

that won't have the same `event_id`/`root_id`, which means they cannot be joined in Redshift.

Orphans appear only for events that went through synthetic deduplication, i.e. events that have the same `event_id` but different `fingerprint`s. These events are usually generated by bots rather than real users and rarely exceed 0.5% of all events.

Restructuring the RDD lineage (using the new Analytics SDK 0.4.0) to union `jsons` with `duplicate`, and persisting the RDD with newly generated `event_id`s using the `MEMORY_AND_DISK_SER` storage level, should solve the problem.
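The hazard can be reproduced outside Spark with a minimal sketch. Here a `def` models an evicted block that Spark recomputes separately for each derived RDD, while a `lazy val` models a block persisted with `MEMORY_AND_DISK_SER`, which all downstream branches observe identically. The names below are purely illustrative and not from the RDB Shredder codebase:

```scala
import java.util.UUID

object UuidRecompute {
  // A `def` re-evaluates on every access, like an evicted RDD block being
  // recomputed for each derived RDD: every recomputation yields a new event_id.
  def recomputed: String = UUID.randomUUID().toString

  // A `lazy val` evaluates once and caches the result, like an RDD persisted
  // with MEMORY_AND_DISK_SER: every downstream branch sees the same event_id.
  lazy val persisted: String = UUID.randomUUID().toString

  def main(args: Array[String]): Unit = {
    // Two "branches" reading the non-persisted value diverge, so the atomic
    // event and its shredded entities would end up with different event_ids.
    println(recomputed == recomputed)

    // The persisted value is stable across branches.
    println(persisted == persisted)
  }
}
```

In Spark terms, the fix is to call `.persist(StorageLevel.MEMORY_AND_DISK_SER)` on the RDD immediately after the fresh `event_id`s are generated, so an eviction spills the serialized block to disk instead of discarding it and re-running the non-deterministic computation.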