Common: get rid of atomic-events folder #183

Closed
chuwy opened this issue Mar 23, 2020 · 2 comments

chuwy commented Mar 23, 2020

We can treat atomic events as a (special) TSV-shredded type in the output data, using the iglu:com.snowplowanalytics.snowplow/atomic/jsonschema/1-0-0 schema (snowplow/iglu-central#778). This would allow us to get rid of one S3DistCp step and a dedicated RDD action in the Shredder.
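For illustration, the shredded output would then carry atomic data as just one more partition next to the self-describing entities, instead of a separate atomic-events folder. The bucket, run ID, entity, and exact folder convention below are hypothetical:

```
s3://archive/shredded/good/run=2020-03-23-12-00-00/
  vendor=com.snowplowanalytics.snowplow/name=atomic/format=tsv/model=1/   <- was atomic-events/
  vendor=com.acme/name=checkout/format=tsv/model=1/
```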

This would require:

  1. A change in EmrEtlRunner to make it skip the atomic S3DistCp step (otherwise it fails, as there's no data)
  2. Making RDB Shredder 0.19.0 (assuming it will implement the change) compatible only with Loader 0.19.0 and above
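On the Shredder side, here is a minimal Spark sketch of the idea, assuming hypothetical names throughout (the real RDB Shredder is structured differently); it only shows atomic events folded into the single partitioned shredded write:

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch only: a shredded record reduced to the fields needed
// to pick an output partition.
final case class Shredded(vendor: String, name: String, format: String, model: Int, data: String)

object AtomicAsShreddedSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("sketch").getOrCreate()
    import spark.implicits._

    // Stand-ins for enriched atomic events (TSV lines) and shredded entities.
    val atomicLines = Seq("app1\tweb\t2020-03-23...", "app1\tweb\t2020-03-23...")
    val entities    = Seq(Shredded("com.acme", "checkout", "tsv", 1, "sku-123"))

    // The proposal: tag each atomic event with the
    // com.snowplowanalytics.snowplow/atomic/1-0-0 schema key...
    val atomic = atomicLines.map(Shredded("com.snowplowanalytics.snowplow", "atomic", "tsv", 1, _))

    // ...so one partitioned write replaces the dedicated atomic-events RDD
    // action (and the S3DistCp step that moved its output afterwards).
    (atomic ++ entities).toDF()
      .write
      .partitionBy("vendor", "name", "format", "model")
      .text("/tmp/shredded/good/run=2020-03-23-12-00-00")

    spark.stop()
  }
}
```

With the atomic rows tagged like any other shredded type, the write above is the only action Spark has to run, which is what lets both the dedicated RDD action and the extra S3DistCp step disappear.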

Did I miss anything, @stdfalse? ☝️

@chuwy chuwy self-assigned this Mar 23, 2020
@stdfalse
Collaborator

Sounds good @chuwy.

Just a note on (1): this is a dedicated step only if consolidated_shredded_output is enabled. I believe the logic will be: if rdb_shredder >= '0.19.0' and consolidated_shredded_output is True, do not submit the step. You might need to check how the consolidation is implemented to ensure this won't cause any side effects.
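EmrEtlRunner itself is a Ruby application, so the sketch below (in Scala, with made-up config field and function names) is only an illustrative restatement of that gating condition:

```scala
// Illustrative only: restates the gating logic from the comment above.
final case class RunnerConfig(rdbShredderVersion: String, consolidatedShreddedOutput: Boolean)

object AtomicDistcpGate {
  /** Submit the atomic S3DistCp step only while the Shredder still writes a
    * separate atomic-events folder, i.e. before 0.19.0 or whenever
    * consolidated_shredded_output is disabled. */
  def shouldSubmitAtomicDistcp(config: RunnerConfig): Boolean = {
    def parts(v: String): Seq[Int] = v.split('.').toSeq.map(_.toInt)
    // Naive numeric comparison, enough for plain "0.19.0"-style versions.
    val atLeast0190 = parts(config.rdbShredderVersion).zipAll(parts("0.19.0"), 0, 0)
      .find { case (a, b) => a != b }
      .forall { case (a, b) => a > b }
    !(atLeast0190 && config.consolidatedShreddedOutput)
  }

  def main(args: Array[String]): Unit = {
    assert(!shouldSubmitAtomicDistcp(RunnerConfig("0.19.0", consolidatedShreddedOutput = true)))
    assert(shouldSubmitAtomicDistcp(RunnerConfig("0.18.2", consolidatedShreddedOutput = true)))
    assert(shouldSubmitAtomicDistcp(RunnerConfig("0.19.0", consolidatedShreddedOutput = false)))
  }
}
```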

What are the benefits of this change? Will the absence of the RDD action have a notable performance impact?

chuwy commented Mar 24, 2020

You might need to check how the consolidation is implemented to ensure this won't cause any side effects.

👍

What are the benefits of this change? Will the absence of the RDD action have a notable performance impact?

Maybe not as notable as we'd like - just around 2-3%, and a bit more if we use more than one node, because Spark will be able to partition the data evenly. Also, the fewer steps we have, the fewer points of failure there are. However, the main reason is that we need to unify the data and simplify the flow for future refactoring.
