
Rewrite DataAggregators #561

Closed
vringar opened this issue Feb 14, 2020 · 5 comments · Fixed by #753

@vringar (Contributor) commented Feb 14, 2020

If we split structured and unstructured storage, we'd currently have the following four Aggregators:

  • Structured
    • SQLite
    • Parquet on S3
  • Unstructured
    • LevelDB
    • S3

In a second step we could then move the SQLite Aggregator to a SQLAlchemy-based one, which would let us write to a Cloud SQL DB. This would give us atomic commits with no batching and enable an exactly-once guarantee.
We should also reconsider #230, as this would remove the need for a bunch of sockets and possibly a whole process.
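The transaction-per-visit idea can be sketched with the stdlib sqlite3 module; an SQLAlchemy Session pointed at a Cloud SQL database would follow the same pattern. Table and function names here are illustrative, not OpenWPM API:

```python
# Sketch of atomic per-visit commits (stdlib sqlite3 for illustration; an
# SQLAlchemy Session against Cloud SQL would use the same shape).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE site_visits (visit_id INTEGER PRIMARY KEY, site_url TEXT)"
)

def write_visit(conn, visit_id, site_url):
    # The connection context manager commits on success and rolls back on
    # any exception, so a visit is stored exactly once or not at all.
    with conn:
        conn.execute(
            "INSERT INTO site_visits VALUES (?, ?)", (visit_id, site_url)
        )

write_visit(conn, 1, "https://example.com")
try:
    write_visit(conn, 1, "https://example.com")  # duplicate -> rolled back
except sqlite3.IntegrityError:
    pass
print(conn.execute("SELECT COUNT(*) FROM site_visits").fetchone()[0])  # 1
```

Because each visit is one transaction, there is no batch to lose on crash, which is where the exactly-once guarantee comes from.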

@birdsarah (Contributor) commented:

I made a plan after doing a spike in #704. We discussed in person (@englehardt, @vringar, and @birdsarah) and agree this is a reasonable way forward.

Open questions:

  • Should you be able to send structured data and unstructured data to separate destinations? Can we live with not being able to do this for now?
  • Why not have more listeners - one running process_record (listening to data coming up from the socket and parsing it out) and then one each for structured and unstructured data? (just a thought - this could pave the way for one process_record listener per browser if that was useful)

Notes:

  • Don't see the need for s3_bucket and s3_directory
  • Don't see the need for users to be able to specify the database name - they can just specify the folder, as they do with S3
  • I think this can work gracefully if we unify some concepts - like handling batches - which are in both.
  • The tricky bit is that sometimes there's interplay between what you're storing and where you're storing it, so we need a graceful interface that allows for a matrix / mixin approach, even if not all elements of the matrix will be supported. In particular, blob output types (SQLite, LevelDB) are not going to get cloud support, but there's no reason for Parquet and gzip not to have local support. And PostgreSQL and RocksDB, for example, should work as local or cloud. I think I have a plan.
DataAggregator - should be scoped down to controlling the listener process and passing messages to it
- launch
- shutdown
- get_status
- get_most_recent_status
- get_new_completed_visits

It does not need to be responsible for making a visit_id.
It would perhaps be better named a DataProcessBus.
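The scoped-down responsibilities could look roughly like this. This is a hypothetical sketch, not the proposed implementation: a thread stands in for the real listener process, and all method and message names are illustrative:

```python
# Sketch of a scoped-down DataAggregator / "DataProcessBus": it only starts
# the listener and shuttles messages to it; it no longer mints visit_ids.
import queue
import threading

class DataAggregator:
    def launch(self):
        self._inbox, self._outbox = queue.Queue(), queue.Queue()
        self._completed = []
        self._thread = threading.Thread(target=self._listen, daemon=True)
        self._thread.start()

    def _listen(self):
        # Stand-in for the real Listener/DataProcess loop.
        while True:
            kind, payload = self._inbox.get()
            if kind == "shutdown":
                return
            if kind == "record":
                self._completed.append(payload["visit_id"])
            elif kind == "poll_completed":
                done, self._completed = self._completed, []
                self._outbox.put(done)

    def get_status(self):
        return self._thread.is_alive()

    def get_new_completed_visits(self):
        self._inbox.put(("poll_completed", None))
        return self._outbox.get()

    def shutdown(self):
        self._inbox.put(("shutdown", None))
        self._thread.join()

agg = DataAggregator()
agg.launch()
agg._inbox.put(("record", {"visit_id": 42}))
print(agg.get_new_completed_visits())  # [42]
agg.shutdown()
```

The point of the sketch is the narrowed surface: everything else (batching, serialization, destinations) lives behind the listener.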

Listener - consider renaming to DataProcess 
- it has top level methods for setup and teardown, draining the queue, processing records, managing the
concept of a batch
- it sets up a "structured" and an "unstructured" class (unstructured only if needed - save_content=True) - examples below
- The (un)structured class that is used then decides how it wants to work with batches; e.g., S3Parquet and LocalSqlite differ quite a bit here.
- The storage destination becomes the lowest level of implementation detail (see class structure below) instead of being
wrapped up at the top level.

- Structured
-- Parquet
--- S3Parquet
--- LocalParquet
-- Sql
--- Postgresql
--- LocalSqlite

- Unstructured
-- LocalLevelDB
-- Gzip
--- LocalGzip
--- S3Gzip
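The matrix / mixin idea behind this tree can be sketched as a format class composed with a destination class, with concrete classes like S3Parquet being one cell of the matrix. All class names and methods below are illustrative, not the proposed API:

```python
# Illustrative format-x-destination matrix: a format class decides how
# records are serialized, a destination class decides where bytes land,
# and a concrete class is the composition of the two.
import gzip
import json

class LocalDestination:
    def __init__(self):
        self.files = {}  # path -> bytes; stand-in for the local filesystem

    def put(self, path, data):
        self.files[path] = data

class GzipFormat:
    extension = ".jsonl.gz"

    def serialize(self, records):
        lines = "\n".join(json.dumps(r) for r in records)
        return gzip.compress(lines.encode())

class GzipLocal(GzipFormat, LocalDestination):
    """One cell of the matrix: gzip format + local destination (cf. LocalGzip)."""

    def flush_batch(self, name, records):
        self.put(name + self.extension, self.serialize(records))

store = GzipLocal()
store.flush_batch("batch-0", [{"visit_id": 1}])
print(list(store.files))  # ['batch-0.jsonl.gz']
```

Swapping LocalDestination for an S3 destination class would yield the S3Gzip cell without touching the format logic, which is what makes unsupported cells (e.g. cloud SQLite) an explicit choice rather than an accident of the class layout.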

The goal of a first PR would be to:

  • Only change TaskManager in two ways:
    • Simplified launch of "DataAggregator" 7c8538d#diff-df517bce823800d647ba110df4679453L261-L265
    • Remove visit_id responsibility out of DataAggregator
    • Rest of TaskManager interface remains unchanged but guts of DataAggregator gets pulled out and put back together with:
      • new DataProcess / Listener that has taken on new responsibilities as outlined above
      • only implement: LocalGzip, LocalSqlite, S3Parquet, S3Gzip

Follow-on PR:

  • Implement new manager params to allow for:
    • explicit specification of the desired structured / unstructured class
    • unification of some params that can be unified

Follow-on PRs:

  • Implement additional classes e.g. LocalParquet, LocalGzip
  • Unify structured processing to have one schema backing it (so everyone gets task id for example)

@birdsarah (Contributor) commented:

When standardizing crawl_id (aka browser_id), task_id, and visit_id, we will need a new strategy that spans backends, because at the moment local SQLite and Parquet handle these very differently.
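One hypothetical way to make IDs span backends (an illustration, not the strategy decided here) is to mint them in a single place as random positive integers small enough to be represented exactly everywhere:

```python
# Hypothetical sketch: mint crawl/task/visit ids centrally as random
# positive integers below 2**53, so the same value round-trips exactly
# through SQLite INTEGER columns, Parquet int64, and JSON/float64 numbers.
import random

MAX_ID = 2**53  # largest range of integers exactly representable in float64

def new_id(rng=random):
    return rng.randint(1, MAX_ID - 1)

visit_id = new_id()
assert 0 < visit_id < MAX_ID
```

Central minting also removes per-backend ID logic, which is the "spans" property the comment asks for.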

@vringar (Contributor, Author) commented Sep 28, 2020

I propose introducing some new wording around this to clarify what we mean:

  • records: formerly known as structured content
  • blobs: formerly known as unstructured content
  • storage provider: formerly known as data aggregator
  • data aggregator: the thread that handles the sockets and queues and feeds data into the storage providers

@vringar (Contributor, Author) commented Nov 6, 2020

#232 should also be addressed by this issue.

@englehardt changed the title from "Rewrite Aggregators to support PostgresSQL for structured data" to "Rewrite DataAggregators" Nov 9, 2020
@englehardt englehardt assigned vringar and unassigned birdsarah Nov 9, 2020
@vringar (Contributor, Author) commented Nov 30, 2020
