Structured aggregator #704
from .DataAggregator.BaseAggregator import (ACTION_TYPE_FINALIZE,
                                            RECORD_TYPE_SPECIAL)
from .data_aggregator import BaseAggregator, build_data_aggregator_class
from .data_aggregator.base import ACTION_TYPE_FINALIZE, RECORD_TYPE_SPECIAL
As an aside, @englehardt and I discussed incrementally moving the codebase to more conventional Python naming (https://www.python.org/dev/peps/pep-0008/#package-and-module-names).
Does this mean we'll also rename the base module from automation to openwpm?
Not any time soon. But that seems logical to me.
Codecov Report
@@ Coverage Diff @@
## master #704 +/- ##
==========================================
- Coverage 37.48% 37.42% -0.07%
==========================================
Files 28 30 +2
Lines 3073 3078 +5
==========================================
Hits 1152 1152
- Misses 1921 1926 +5
Thank you for spearheading this issue.
I'm not sure yet, but I don't see why it couldn't be coerced into working. For what it's worth, I'm not sure that in the end we would end up with a solution that looks like this: it's a little unconventional, and I think it probably makes it harder to read the code base and quickly get a clear sense of what's going on. That said, it's a path to untangle the situation we're in and get into a new one.
If you're being polite, and think it wouldn't work, then it's fine to ask to add the listener process to this PR.
output_format = manager_params["output_format"]
"""
Suggest renaming data_directory and s3_bucket
to a consistent thing like "output_folder"
How about output_url?
That way we can handle any PyFilesystem backend, as long as it's installed.
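To make the output_url idea concrete, here is a minimal sketch of how a single URL could stand in for both data_directory and s3_bucket. It is not code from this PR and it deliberately uses only the standard library rather than PyFilesystem; the function name `describe_output`, the returned keys, and the example URLs are all hypothetical.

```python
from urllib.parse import urlsplit


def describe_output(output_url):
    """Hypothetical helper: map one output_url onto a storage backend.

    A bare path or a file:// URL is treated as local storage and an
    s3:// URL as a bucket plus key prefix. Anything else is rejected.
    """
    parts = urlsplit(output_url)
    scheme = parts.scheme or "file"  # bare paths have no scheme
    if scheme == "file":
        return {"backend": "local", "location": parts.path}
    if scheme == "s3":
        return {
            "backend": "s3",
            "bucket": parts.netloc,
            "prefix": parts.path.lstrip("/"),
        }
    raise ValueError(f"Unsupported output_url scheme: {scheme!r}")


if __name__ == "__main__":
    print(describe_output("/home/user/crawl-data"))
    print(describe_output("s3://example-bucket/crawl-1"))
```

With a filesystem abstraction such as PyFilesystem installed, the scheme dispatch above would presumably be delegated to that library instead of being written by hand.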
Closing this in favor of a new PR based on the plan outlined above and discussed with @englehardt and @vringar in person.
This PR is functionally a no-op, but it starts us down the path of splitting up the aggregators (see #701, #652, #561).
This PR dynamically builds a class DataAggregator pulling in either former LocalAggregator methods or former S3Aggregator methods.
We should be able to build on this relatively simply to start adding, for example, Parquet saving locally. There may be some combinations we never want to support, e.g. ldb in an S3 context (it seems like we could support that, though it would be a little odd for distributed crawls).
I'm thinking that once we have unpacked the classes so that the type of structured/content data and how you save it are separate, we can work on the pieces to unify the structured data processing with something like SQLAlchemy and a single schema.
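For readers unfamiliar with the approach, the dynamic class construction described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's actual code: `BaseAggregator` and `build_data_aggregator_class` are names from the diff, but the signature, the `LocalMethods` and `S3Methods` mixins, the `save_record` method, and the "local"/"s3" format values are hypothetical.

```python
class BaseAggregator:
    """Stand-in for the shared aggregator base class (details assumed)."""

    def __init__(self, manager_params):
        self.manager_params = manager_params


class LocalMethods:
    """Hypothetical mixin holding former LocalAggregator behaviour."""

    def save_record(self, table, record):
        return f"local:{table}"


class S3Methods:
    """Hypothetical mixin holding former S3Aggregator behaviour."""

    def save_record(self, table, record):
        return f"s3:{table}"


# Assumed mapping from manager_params["output_format"] to a method set.
_BACKENDS = {"local": LocalMethods, "s3": S3Methods}


def build_data_aggregator_class(output_format):
    """Build a DataAggregator class for the requested output format."""
    try:
        mixin = _BACKENDS[output_format]
    except KeyError:
        raise ValueError(f"Unsupported output_format: {output_format!r}")
    # type(name, bases, namespace) creates the class at runtime; listing
    # the mixin first lets its methods take precedence over the base.
    return type("DataAggregator", (mixin, BaseAggregator), {})


if __name__ == "__main__":
    manager_params = {"output_format": "s3"}
    DataAggregator = build_data_aggregator_class(manager_params["output_format"])
    aggregator = DataAggregator(manager_params)
    print(aggregator.save_record("site_visits", {}))
```

Keeping the storage-specific methods in separate mixins is what would let a later PR add, say, a local Parquet variant by registering one more entry in the mapping.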
Open questions:
Notes:
The goal of a first PR would be to:
Follow-on PR:
Follow-on PRs: