Rewrite DataAggregators #561
I made a plan after doing a spike in #704. We discussed in person (@englehardt, @vringar, and @birdsarah) and agree this is a reasonable way forward. Open questions:
Notes:
The goal of a first PR would be to:
Follow-on PR:
Follow-on PRs:
When standardizing crawl_id (aka browser_id), task_id, and visit_id, we will need a new strategy that spans both backends, because at the moment local SQLite and Parquet handle these IDs very differently.
I propose introducing some new wording around this to clarify what we mean:
#232 should also be addressed by this issue.
If we split structured and unstructured storage we'd currently have the following 4 Aggregators:
In a second step we could then move the SQLite Aggregator to a SQLAlchemy-based one, so that we could write to a Cloud SQL DB. This would give us atomic commits with no batching and enable an exactly-once guarantee.
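To make the "atomic commits with no batching" idea concrete, here is a minimal sketch. It uses the standard-library `sqlite3` module rather than SQLAlchemy, only to stay dependency-free. The class names (`StructuredStorage`, `SqliteStructuredStorage`), the method names, and the `http_requests` schema are all hypothetical and are not taken from the existing code base. Each record is written in its own transaction, so a failed write leaves nothing half-stored. This alone does not provide exactly-once delivery; it only shows the per-record commit.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Any, Dict


class StructuredStorage(ABC):
    """Hypothetical interface for structured (tabular) records."""

    @abstractmethod
    def store_record(self, table: str, visit_id: int, record: Dict[str, Any]) -> None:
        """Persist one record belonging to the given visit."""


class SqliteStructuredStorage(StructuredStorage):
    """Sketch of a structured backend that commits every record on its own."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        # Illustrative schema only; the real tables are defined elsewhere.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS http_requests (visit_id INTEGER, url TEXT)"
        )

    def store_record(self, table: str, visit_id: int, record: Dict[str, Any]) -> None:
        columns = ["visit_id"] + list(record)
        placeholders = ", ".join("?" for _ in columns)
        # Table and column names are assumed to come from trusted code, not user input.
        sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
        # Using the connection as a context manager commits on success
        # and rolls back if the statement raises.
        with self.conn:
            self.conn.execute(sql, [visit_id, *record.values()])

    def count(self, table: str) -> int:
        return self.conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

A caller would create one storage object and call `store_record("http_requests", visit_id, {"url": ...})` for each record. An unstructured counterpart (for page content, for example) could follow the same pattern behind a separate interface.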
Also, we should reconsider #230, as this would remove the need for a bunch of sockets and possibly a whole process.