
Structured aggregator #704

Closed
wants to merge 6 commits into from

Conversation

@birdsarah (Contributor) commented Jun 26, 2020

This PR is functionally a no-op, but it starts us down the path of splitting up the aggregators (see #701, #652, #561).

This PR dynamically builds a class DataAggregator pulling in either former LocalAggregator methods or former S3Aggregator methods.
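A minimal sketch of what that dynamic construction could look like (all class and function names here are illustrative assumptions, not the actual OpenWPM code):

```python
# Hypothetical sketch: pick the mixin that matches the requested backend
# and assemble the concrete DataAggregator class at runtime with type().

class BaseAggregator:
    """Shared launch/shutdown plumbing."""
    def launch(self) -> str:
        return "listener launched"

class LocalAggregatorMixin:
    def save_record(self, record: str) -> str:
        return f"local:{record}"

class S3AggregatorMixin:
    def save_record(self, record: str) -> str:
        return f"s3:{record}"

def build_data_aggregator_class(output_format: str) -> type:
    # Select the former LocalAggregator or S3Aggregator methods based on
    # the configured output format.
    mixin = {"local": LocalAggregatorMixin, "s3": S3AggregatorMixin}[output_format]
    return type("DataAggregator", (mixin, BaseAggregator), {})
```

Because the class is built once, per configuration, the rest of the codebase can keep instantiating a single `DataAggregator` name regardless of backend.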

We should be able to build on this relatively simply to start adding, for example, parquet saving locally. There may be some combinations we never want to support - e.g. leveldb in an s3 context (it seems like we could support that, though, even if it would be a little odd for distributed crawls).

I'm thinking that once we have unpacked the classes so that the type of structured/content data and how you save it are separate, we can work on unifying the structured data processing with something like sqlalchemy and a single schema.

Open questions:

  • Should you be able to send structured data and unstructured data to separate destinations? Can we live with not being able to do this for now?
  • Why not have more listeners - one running process_record (listening to data coming up from the socket and parsing it out) and then one each for structured and unstructured data? (Just a thought - this could pave the way for one process_record listener per browser if that were useful.)

Notes:

  • I don't see the need for s3_bucket and s3_directory
  • I don't see the need for users to be able to specify the database name - they can just specify the folder, like they do with s3
  • I think this can work gracefully if we unify some concepts - like handling batches - which are in both.
  • The tricky bit is that sometimes there's interplay between what you're storing and where you're storing it, so we need a graceful interface that allows for a matrix / mixin approach even if not all elements of the matrix will be supported. In particular, blob output types (sqlite, leveldb) are not going to get cloud support. But there's no reason for parquet and gzip not to have local support. And postgres and rocksdb, for example, should work as local or cloud... I think I have a plan.
DataAggregator - should be scoped down to controlling the listener process and passing messages to it
- launch
- shutdown
- get_status
- get_most_recent_status
- get_new_completed_visits

It does not need to be responsible for making a visit_id.
It would perhaps be better named DataProcessBus.
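Sketched as an interface, that scoped-down class might look like this — the method names come from the list above, but the queue-based bodies are placeholder assumptions:

```python
# Illustrative DataProcessBus sketch: it only controls the listener
# process and relays messages/status; it does not mint visit_ids.
import queue

class DataProcessBus:
    def __init__(self):
        self.status_queue = queue.Queue()
        self.completed_visits = []
        self._running = False

    def launch(self):
        self._running = True  # would spawn the listener process

    def shutdown(self):
        self._running = False  # would signal and join the listener process

    def get_status(self):
        return self.status_queue.get(block=False)

    def get_most_recent_status(self):
        # Drain the queue and keep only the latest status message.
        status = None
        while not self.status_queue.empty():
            status = self.status_queue.get()
        return status

    def get_new_completed_visits(self):
        visits, self.completed_visits = self.completed_visits, []
        return visits
```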

Listener - consider renaming to DataProcess 
- it has top level methods for setup and teardown, draining the queue, processing records, managing the
concept of a batch
- it sets up a "structured" and an "unstructured" class (unstructured only if needed - save_content=True) - examples below
- The (un)structured class that is used then decides how it wants to work with batches, e.g. there's a pretty big difference between
S3Parquet and LocalSqlite.
- The storage destination becomes the lowest level of implementation detail (see class structure below) instead of being
wrapped up at the top level.
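A rough sketch of how the listener (DataProcess) could wire this together — the class names and the save_content gating are assumptions, not the final design:

```python
# Illustrative DataProcess sketch: it owns the drain/parse loop and
# delegates batch handling to a structured (and optionally unstructured)
# storage class. All names are hypothetical.

class StructuredStorage:
    def __init__(self):
        self.batch = []
    def add(self, record):
        self.batch.append(record)
    def flush(self):
        # Each storage class decides how batches are handled; here we
        # just report and clear the batch.
        n, self.batch = len(self.batch), []
        return n

class UnstructuredStorage(StructuredStorage):
    pass

class DataProcess:
    def __init__(self, save_content: bool = False):
        self.structured = StructuredStorage()
        # The unstructured class is only set up when content saving is on.
        self.unstructured = UnstructuredStorage() if save_content else None

    def process_record(self, record_type: str, data):
        # Parse data coming up from the socket and route it.
        if record_type == "content" and self.unstructured:
            self.unstructured.add(data)
        else:
            self.structured.add(data)
```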

- Structured
-- Parquet
--- S3Parquet
--- LocalParquet
-- Sql
--- Postgresql
--- LocalSqlite

- Unstructured
-- LocalLevelDB
-- Gzip
--- LocalGzip
--- S3Gzip
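The tree above could be sketched as plain inheritance, with the storage destination as the leaf-level detail (class bodies are hypothetical stubs mirroring the outline):

```python
# Sketch of the proposed hierarchy: the format (Parquet, Sql, Gzip) sits
# in the middle and the destination (local vs S3) is the leaf.

class Structured:
    kind = "structured"

class Parquet(Structured):
    def serialize(self, records):
        return f"parquet[{len(records)}]"

class S3Parquet(Parquet):
    destination = "s3"

class LocalParquet(Parquet):
    destination = "local"

class Sql(Structured):
    pass

class LocalSqlite(Sql):
    destination = "local"

class Unstructured:
    kind = "unstructured"

class LocalLevelDB(Unstructured):
    destination = "local"

class Gzip(Unstructured):
    pass

class LocalGzip(Gzip):
    destination = "local"

class S3Gzip(Gzip):
    destination = "s3"
```

Serialization logic lives once on the format class, while each leaf only supplies destination-specific details, which is what makes the matrix / mixin approach tractable.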

The goal of a first PR would be to:

  • Only change TaskManager in two ways:
    • Simplified launch of "DataAggregator" 7c8538d#diff-df517bce823800d647ba110df4679453L261-L265
    • Remove visit_id responsibility out of DataAggregator
    • Rest of TaskManager interface remains unchanged, but guts of DataAggregator gets pulled out and put back together with:
      • new DataProcess / Listener that has taken on new responsibilities as outlined above
      • only implement: LocalGzip, LocalSqlite, S3Parquet, S3Gzip

Follow-on PR:

  • Implement new manager params to allow for:
    • explicit specification of the desired structured / unstructured class
    • unification of the params that can be unified

Follow-on PRs:

  • Implement additional classes e.g. LocalParquet, LocalGzip
  • Unify structured processing to have one schema backing it (so everyone gets task id for example)

@birdsarah birdsarah requested a review from vringar June 26, 2020 06:42
```diff
-from .DataAggregator.BaseAggregator import (ACTION_TYPE_FINALIZE,
-                                            RECORD_TYPE_SPECIAL)
+from .data_aggregator import BaseAggregator, build_data_aggregator_class
+from .data_aggregator.base import ACTION_TYPE_FINALIZE, RECORD_TYPE_SPECIAL
```
@birdsarah (Contributor Author)
As an aside. @englehardt and I discussed incrementally moving the codebase to more conventional python naming (https://www.python.org/dev/peps/pep-0008/#package-and-module-names)

Contributor

Does this mean we'll also rename the base module from automation to openwpm?

Contributor Author

Not any time soon. But that seems logical to me.

@codecov

codecov bot commented Jun 26, 2020

Codecov Report

Merging #704 into master will decrease coverage by 0.06%.
The diff coverage is 73.63%.


```
@@            Coverage Diff             @@
##           master     #704      +/-   ##
==========================================
- Coverage   37.48%   37.42%   -0.07%
==========================================
  Files          28       30       +2
  Lines        3073     3078       +5
==========================================
  Hits         1152     1152
- Misses       1921     1926       +5
```
| Impacted Files | Coverage Δ |
| --- | --- |
| `automation/data_aggregator/base.py` | 56.81% <0.00%> (ø) |
| `automation/data_aggregator/parquet_schema.py` | 100.00% <ø> (ø) |
| `automation/data_aggregator/s3.py` | 24.25% <44.73%> (ø) |
| `automation/data_aggregator/builder.py` | 92.30% <92.30%> (ø) |
| `automation/data_aggregator/local.py` | 40.00% <97.87%> (ø) |
| `automation/TaskManager.py` | 76.05% <100.00%> (-0.39%) ⬇️ |
| `automation/data_aggregator/__init__.py` | 100.00% <100.00%> (ø) |
| `automation/BrowserManager.py` | 43.22% <0.00%> (-2.94%) ⬇️ |
| `automation/MPLogger.py` | 65.24% <0.00%> (-1.22%) ⬇️ |
... and 4 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 627f440...700aab7. Read the comment docs.

@vringar
Contributor

vringar commented Jun 29, 2020

Thank you for spearheading this issue.
I really like this design but don't quite understand how it would work for the listener process.
Do you want to show how you would go about doing that here or would you want to file a separate PR for that?

@birdsarah
Contributor Author

> Thank you for spearheading this issue.
> I really like this design but don't quite understand how it would work for the listener process.
> Do you want to show how you would go about doing that here or would you want to file a separate PR for that?

I'm not sure yet, but I don't see why it couldn't be coerced into working.

For what it's worth, I'm not sure that in the end we would end up with a solution that looks like this - it's a little unconventional, and I think it probably makes it harder to read the code base and quickly get a clear sense of what's going on. That said, it's a path to untangle the situation we're in and get into a new situation.

@birdsarah
Contributor Author

> I really like this design but don't quite understand how it would work for the listener process.

If you're being polite, and think it wouldn't work, then it's fine to ask to add the listener process to this PR.

@birdsarah birdsarah marked this pull request as draft July 1, 2020 21:26
```python
output_format = manager_params["output_format"]
```

Suggest renaming data_directory and s3_bucket to a consistent thing like "output_folder".
Contributor
How about output_url? That way we handle any PyFilesystem backend as long as it's installed.

@birdsarah
Contributor Author

Closing this in favor of a new PR based on the plan outlined above and discussed with @englehardt and @vringar in person.
