
Structured aggregator #704

Closed
wants to merge 6 commits into from

Conversation

@birdsarah (Contributor) commented Jun 26, 2020

This PR is functionally a no-op, but it starts us down the path of splitting up the aggregators (see #701, #652, #561).

This PR dynamically builds a class DataAggregator pulling in either former LocalAggregator methods or former S3Aggregator methods.
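A minimal sketch of what that dynamic construction could look like (all class and function names here are illustrative assumptions, not the actual OpenWPM code):

```python
# Hypothetical sketch: pick the mixin that matches the requested backend
# and assemble the concrete DataAggregator class at runtime with type().

class BaseAggregator:
    """Shared launch/shutdown plumbing."""
    def launch(self) -> str:
        return "listener launched"

class LocalAggregatorMixin:
    def save_record(self, record: str) -> str:
        return f"local:{record}"

class S3AggregatorMixin:
    def save_record(self, record: str) -> str:
        return f"s3:{record}"

def build_data_aggregator_class(output_format: str) -> type:
    # Select the former LocalAggregator or S3Aggregator methods based on
    # the configured output format.
    mixin = {"local": LocalAggregatorMixin, "s3": S3AggregatorMixin}[output_format]
    return type("DataAggregator", (mixin, BaseAggregator), {})
```

Because the class is built once, per configuration, the rest of the codebase can keep instantiating a single `DataAggregator` name regardless of backend.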

We should be able to build on this relatively simply to start adding, for example, parquet saving locally. There may be some combinations we never want to support - e.g. leveldb in an s3 context (it seems like we could support that, though, even if it would be a little odd for distributed crawls).

I'm thinking that once we have unpacked the classes so that the type of structured/content data and how you save it are separate, we can work on unifying the structured data processing with something like sqlalchemy and a single schema.

Open questions:

  • Should you be able to send structured data and unstructured data to separate destinations? Can we live with not being able to do this for now?
  • Why not have more listeners - one running process_record (listening to data coming up from the socket and parsing it out) and then one each for structured and unstructured data? (Just a thought - this could pave the way for one process_record listener per browser if that were useful.)

Notes:

  • I don't see the need for s3_bucket and s3_directory
  • I don't see the need for users to be able to specify the database name - they can just specify the folder, like they do with s3
  • I think this can work gracefully if we unify some concepts - like handling batches - which are in both.
  • The tricky bit is that sometimes there's interplay between what you're storing and where you're storing it, so we need a graceful interface that allows for a matrix / mixin approach even if not all elements of the matrix will be supported. In particular, blob output types (sqlite, leveldb) are not going to get cloud support. But there's no reason for parquet and gzip not to have local support. And postgres and rocksdb, for example, should work as local or cloud... I think I have a plan.
DataAggregator - should be scoped down to controlling the listener process and passing messages to it
- launch
- shutdown
- get_status
- get_most_recent_status
- get_new_completed_visits

It does not need to be responsible for making a visit_id.
It would perhaps be better named DataProcessBus.
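Sketched as an interface, that scoped-down class might look like this — the method names come from the list above, but the queue-based bodies are placeholder assumptions:

```python
# Illustrative DataProcessBus sketch: it only controls the listener
# process and relays messages/status; it does not mint visit_ids.
import queue

class DataProcessBus:
    def __init__(self):
        self.status_queue = queue.Queue()
        self.completed_visits = []
        self._running = False

    def launch(self):
        self._running = True  # would spawn the listener process

    def shutdown(self):
        self._running = False  # would signal and join the listener process

    def get_status(self):
        return self.status_queue.get(block=False)

    def get_most_recent_status(self):
        # Drain the queue and keep only the latest status message.
        status = None
        while not self.status_queue.empty():
            status = self.status_queue.get()
        return status

    def get_new_completed_visits(self):
        visits, self.completed_visits = self.completed_visits, []
        return visits
```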

Listener - consider renaming to DataProcess 
- it has top level methods for setup and teardown, draining the queue, processing records, managing the
concept of a batch
- it sets up a "structured" and an "unstructured" class (unstructured only if needed - save_content=True) - examples below
- The (un)structured class that is used then decides how it wants to work with batches, e.g. there's a pretty big difference between
S3Parquet and LocalSqlite.
- The storage destination becomes the lowest level of implementation detail (see class structure below) instead of being
wrapped up at the top level.
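A rough sketch of how the listener (DataProcess) could wire this together — the class names and the save_content gating are assumptions, not the final design:

```python
# Illustrative DataProcess sketch: it owns the drain/parse loop and
# delegates batch handling to a structured (and optionally unstructured)
# storage class. All names are hypothetical.

class StructuredStorage:
    def __init__(self):
        self.batch = []
    def add(self, record):
        self.batch.append(record)
    def flush(self):
        # Each storage class decides how batches are handled; here we
        # just report and clear the batch.
        n, self.batch = len(self.batch), []
        return n

class UnstructuredStorage(StructuredStorage):
    pass

class DataProcess:
    def __init__(self, save_content: bool = False):
        self.structured = StructuredStorage()
        # The unstructured class is only set up when content saving is on.
        self.unstructured = UnstructuredStorage() if save_content else None

    def process_record(self, record_type: str, data):
        # Parse data coming up from the socket and route it.
        if record_type == "content" and self.unstructured:
            self.unstructured.add(data)
        else:
            self.structured.add(data)
```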

- Structured
-- Parquet
--- S3Parquet
--- LocalParquet
-- Sql
--- Postgresql
--- LocalSqlite

- Unstructured
-- LocalLevelDB
-- Gzip
--- LocalGzip
--- S3Gzip
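The tree above could be sketched as plain inheritance, with the storage destination as the leaf-level detail (class bodies are hypothetical stubs mirroring the outline):

```python
# Sketch of the proposed hierarchy: the format (Parquet, Sql, Gzip) sits
# in the middle and the destination (local vs S3) is the leaf.

class Structured:
    kind = "structured"

class Parquet(Structured):
    def serialize(self, records):
        return f"parquet[{len(records)}]"

class S3Parquet(Parquet):
    destination = "s3"

class LocalParquet(Parquet):
    destination = "local"

class Sql(Structured):
    pass

class LocalSqlite(Sql):
    destination = "local"

class Unstructured:
    kind = "unstructured"

class LocalLevelDB(Unstructured):
    destination = "local"

class Gzip(Unstructured):
    pass

class LocalGzip(Gzip):
    destination = "local"

class S3Gzip(Gzip):
    destination = "s3"
```

Serialization logic lives once on the format class, while each leaf only supplies destination-specific details, which is what makes the matrix / mixin approach tractable.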

The goal of a first PR would be to:

  • Only change TaskManager in two ways:
    • Simplified launch of "DataAggregator" 7c8538d#diff-df517bce823800d647ba110df4679453L261-L265
    • Remove visit_id responsibility out of DataAggregator
    • Rest of TaskManager interface remains unchanged, but guts of DataAggregator gets pulled out and put back together with:
      • new DataProcess / Listener that has taken on new responsibilities as outlined above
      • only implement: LocalGzip, LocalSqlite, S3Parquet, S3Gzip

Follow-on PR:

  • Implement new manager params to allow for:
    • explicit specification of the desired structured / unstructured class
    • unification of the params that can be unified

Follow-on PRs:

  • Implement additional classes e.g. LocalParquet, LocalGzip
  • Unify structured processing to have one schema backing it (so everyone gets task id for example)

@birdsarah birdsarah requested a review from vringar June 26, 2020 06:42
```diff
-from .DataAggregator.BaseAggregator import (ACTION_TYPE_FINALIZE,
-                                            RECORD_TYPE_SPECIAL)
+from .data_aggregator import BaseAggregator, build_data_aggregator_class
+from .data_aggregator.base import ACTION_TYPE_FINALIZE, RECORD_TYPE_SPECIAL
```
@birdsarah (Contributor Author)
As an aside. @englehardt and I discussed incrementally moving the codebase to more conventional python naming (https://www.python.org/dev/peps/pep-0008/#package-and-module-names)

Contributor

Does this mean we'll also rename the base module from automation to openwpm?

Contributor Author

Not any time soon. But that seems logical to me.

@codecov

codecov bot commented Jun 26, 2020

Codecov Report

Merging #704 into master will decrease coverage by 0.06%.
The diff coverage is 73.63%.


```
@@            Coverage Diff             @@
##           master     #704      +/-   ##
==========================================
- Coverage   37.48%   37.42%   -0.07%
==========================================
  Files          28       30       +2
  Lines        3073     3078       +5
==========================================
  Hits         1152     1152
- Misses       1921     1926       +5
```
| Impacted Files | Coverage Δ |
| --- | --- |
| `automation/data_aggregator/base.py` | 56.81% <0.00%> (ø) |
| `automation/data_aggregator/parquet_schema.py` | 100.00% <ø> (ø) |
| `automation/data_aggregator/s3.py` | 24.25% <44.73%> (ø) |
| `automation/data_aggregator/builder.py` | 92.30% <92.30%> (ø) |
| `automation/data_aggregator/local.py` | 40.00% <97.87%> (ø) |
| `automation/TaskManager.py` | 76.05% <100.00%> (-0.39%) ⬇️ |
| `automation/data_aggregator/__init__.py` | 100.00% <100.00%> (ø) |
| `automation/BrowserManager.py` | 43.22% <0.00%> (-2.94%) ⬇️ |
| `automation/MPLogger.py` | 65.24% <0.00%> (-1.22%) ⬇️ |
... and 4 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 627f440...700aab7. Read the comment docs.

@vringar
Contributor

vringar commented Jun 29, 2020

Thank you for spearheading this issue.
I really like this design but don't quite understand how it would work for the listener process.
Do you want to show how you would go about doing that here or would you want to file a separate PR for that?

@birdsarah
Contributor Author

> Thank you for spearheading this issue.
> I really like this design but don't quite understand how it would work for the listener process.
> Do you want to show how you would go about doing that here or would you want to file a separate PR for that?

I'm not sure yet, but I don't see why it couldn't be coerced into working.

For what it's worth, I'm not sure that in the end we would end up with a solution that looks like this - it's a little unconventional, and I think it probably makes it harder to read the code base and quickly get a clear sense of what's going on. That said, it's a path to untangle the situation we're in and get into a new situation.

@birdsarah
Contributor Author

> I really like this design but don't quite understand how it would work for the listener process.

If you're being polite, and think it wouldn't work, then it's fine to ask to add the listener process to this PR.

@birdsarah birdsarah marked this pull request as draft July 1, 2020 21:26
```python
output_format = manager_params["output_format"]
```

Suggest renaming data_directory and s3_bucket to a consistent thing like "output_folder".
Contributor
How about output_url? That way we handle any PyFilesystem backend as long as it's installed.

@birdsarah
Contributor Author

Closing this in favor of a new PR based on the plan outlined above and discussed with @englehardt and @vringar in person.
