
Rewrite DataAggregators #561

Closed
vringar opened this issue Feb 14, 2020 · 5 comments · Fixed by #753

@vringar (Contributor) commented Feb 14, 2020

If we split structured and unstructured storage, we'd currently have the following four Aggregators:

  • Structured
    • SQLite
    • Parquet on S3
  • Unstructured
    • LevelDB
    • S3

In a second step we could then move the SQLite Aggregator to a SQLAlchemy-based one, which would let us write to a Cloud SQL DB. This would give us atomic commits with no batching and enable an exactly-once guarantee.
We should also reconsider #230, as this would remove the need for a bunch of sockets and possibly a whole process.
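The transaction-per-visit idea can be sketched with the stdlib sqlite3 module; an SQLAlchemy Session pointed at a Cloud SQL database would follow the same pattern. Table and function names here are illustrative, not OpenWPM API:

```python
# Sketch of atomic per-visit commits (stdlib sqlite3 for illustration; an
# SQLAlchemy Session against Cloud SQL would use the same shape).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE site_visits (visit_id INTEGER PRIMARY KEY, site_url TEXT)"
)

def write_visit(conn, visit_id, site_url):
    # The connection context manager commits on success and rolls back on
    # any exception, so a visit is stored exactly once or not at all.
    with conn:
        conn.execute(
            "INSERT INTO site_visits VALUES (?, ?)", (visit_id, site_url)
        )

write_visit(conn, 1, "https://example.com")
try:
    write_visit(conn, 1, "https://example.com")  # duplicate -> rolled back
except sqlite3.IntegrityError:
    pass
print(conn.execute("SELECT COUNT(*) FROM site_visits").fetchone()[0])  # 1
```

Because each visit is one transaction, there is no batch to lose on crash, which is where the exactly-once guarantee comes from.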

@birdsarah (Contributor) commented:

I made a plan after doing a spike in #704. We discussed in person (@englehardt, @vringar, and @birdsarah) and agree this is a reasonable way forward.

Open questions:

  • Should you be able to send structured data and unstructured data to separate destinations? Can we live with not being able to do this for now?
  • Why not have more listeners - one running process_record (listening to data coming up from the socket and parsing it out) and then one each for structured and unstructured data? (just a thought - this could pave the way for one process_record listener per browser if that was useful)

Notes:

  • Don't see the need for s3_bucket and s3_directory
  • Don't see the need for users to be able to specify the database name - they can just specify the folder, as they do with S3
  • I think this can work gracefully if we unify some concepts - like handling batches - which are in both.
  • The tricky bit is that sometimes there's interplay between what you're storing and where you're storing it, so we need a graceful interface that allows for a matrix / mixin approach, even if not all elements of the matrix will be supported. In particular, blob output types (SQLite, LevelDB) are not going to get cloud support, but there's no reason for Parquet and gzip not to have local support. And PostgreSQL and RocksDB, for example, should work as local or cloud. I think I have a plan.
DataAggregator - should be scoped down to controlling the listener process and passing messages to it
- launch
- shutdown
- get_status
- get_most_recent_status
- get_new_completed_visits

It does not need to be responsible for making a visit_id.
It would perhaps be better named a DataProcessBus.
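The scoped-down responsibilities could look roughly like this. This is a hypothetical sketch, not the proposed implementation: a thread stands in for the real listener process, and all method and message names are illustrative:

```python
# Sketch of a scoped-down DataAggregator / "DataProcessBus": it only starts
# the listener and shuttles messages to it; it no longer mints visit_ids.
import queue
import threading

class DataAggregator:
    def launch(self):
        self._inbox, self._outbox = queue.Queue(), queue.Queue()
        self._completed = []
        self._thread = threading.Thread(target=self._listen, daemon=True)
        self._thread.start()

    def _listen(self):
        # Stand-in for the real Listener/DataProcess loop.
        while True:
            kind, payload = self._inbox.get()
            if kind == "shutdown":
                return
            if kind == "record":
                self._completed.append(payload["visit_id"])
            elif kind == "poll_completed":
                done, self._completed = self._completed, []
                self._outbox.put(done)

    def get_status(self):
        return self._thread.is_alive()

    def get_new_completed_visits(self):
        self._inbox.put(("poll_completed", None))
        return self._outbox.get()

    def shutdown(self):
        self._inbox.put(("shutdown", None))
        self._thread.join()

agg = DataAggregator()
agg.launch()
agg._inbox.put(("record", {"visit_id": 42}))
print(agg.get_new_completed_visits())  # [42]
agg.shutdown()
```

The point of the sketch is the narrowed surface: everything else (batching, serialization, destinations) lives behind the listener.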

Listener - consider renaming to DataProcess 
- it has top level methods for setup and teardown, draining the queue, processing records, managing the
concept of a batch
- it sets up a "structured" and an "unstructured" class (unstructured only if needed - save_content=True) - examples below
- The (un)structured class that is used then decides how it wants to work with batches; e.g., S3Parquet and LocalSqlite differ quite a bit here.
- The storage destination becomes the lowest level of implementation detail (see class structure below) instead of being
wrapped up at the top level.

- Structured
-- Parquet
--- S3Parquet
--- LocalParquet
-- Sql
--- Postgresql
--- LocalSqlite

- Unstructured
-- LocalLevelDB
-- Gzip
--- LocalGzip
--- S3Gzip
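The matrix / mixin idea behind this tree can be sketched as a format class composed with a destination class, with concrete classes like S3Parquet being one cell of the matrix. All class names and methods below are illustrative, not the proposed API:

```python
# Illustrative format-x-destination matrix: a format class decides how
# records are serialized, a destination class decides where bytes land,
# and a concrete class is the composition of the two.
import gzip
import json

class LocalDestination:
    def __init__(self):
        self.files = {}  # path -> bytes; stand-in for the local filesystem

    def put(self, path, data):
        self.files[path] = data

class GzipFormat:
    extension = ".jsonl.gz"

    def serialize(self, records):
        lines = "\n".join(json.dumps(r) for r in records)
        return gzip.compress(lines.encode())

class GzipLocal(GzipFormat, LocalDestination):
    """One cell of the matrix: gzip format + local destination (cf. LocalGzip)."""

    def flush_batch(self, name, records):
        self.put(name + self.extension, self.serialize(records))

store = GzipLocal()
store.flush_batch("batch-0", [{"visit_id": 1}])
print(list(store.files))  # ['batch-0.jsonl.gz']
```

Swapping LocalDestination for an S3 destination class would yield the S3Gzip cell without touching the format logic, which is what makes unsupported cells (e.g. cloud SQLite) an explicit choice rather than an accident of the class layout.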

The goal of a first PR would be to:

  • Only change TaskManager in two ways:
    • Simplified launch of "DataAggregator" 7c8538d#diff-df517bce823800d647ba110df4679453L261-L265
    • Remove visit_id responsibility out of DataAggregator
    • Rest of TaskManager interface remains unchanged but guts of DataAggregator gets pulled out and put back together with:
      • new DataProcess / Listener that has taken on new responsibilities as outlined above
      • only implement: LocalGzip, LocalSqlite, S3Parquet, S3Gzip

Follow-on PR:

  • Implement new manager params to allow for:
    • explicit specification of the desired structured / unstructured class
    • unification of some params that can be unified

Follow-on PRs:

  • Implement additional classes e.g. LocalParquet, LocalGzip
  • Unify structured processing to have one schema backing it (so everyone gets task id for example)

@birdsarah (Contributor) commented:

When standardizing crawl_id (aka browser_id), task_id, and visit_id, we will need a new strategy that spans backends, because at the moment local SQLite and Parquet handle these very differently.
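One hypothetical way to make IDs span backends (an illustration, not the strategy decided here) is to mint them in a single place as random positive integers small enough to be represented exactly everywhere:

```python
# Hypothetical sketch: mint crawl/task/visit ids centrally as random
# positive integers below 2**53, so the same value round-trips exactly
# through SQLite INTEGER columns, Parquet int64, and JSON/float64 numbers.
import random

MAX_ID = 2**53  # largest range of integers exactly representable in float64

def new_id(rng=random):
    return rng.randint(1, MAX_ID - 1)

visit_id = new_id()
assert 0 < visit_id < MAX_ID
```

Central minting also removes per-backend ID logic, which is the "spans" property the comment asks for.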

@vringar (Contributor, Author) commented Sep 28, 2020

I propose introducing some new wording around this to clarify what we mean:

  • records: formerly known as structured content
  • blobs: formerly known as unstructured content
  • storage provider: formerly known as data aggregator
  • data aggregator: the thread that handles the sockets and queues and feeds data into the storage providers

@vringar (Contributor, Author) commented Nov 6, 2020

#232 should also be addressed by this issue.

@englehardt changed the title from "Rewrite Aggregators to support PostgresSQL for structured data" to "Rewrite DataAggregators" Nov 9, 2020
@englehardt englehardt assigned vringar and unassigned birdsarah Nov 9, 2020
@vringar (Contributor, Author) commented Nov 30, 2020
