RDB Loader: support for table auto-creation and auto-migration #7

chuwy · 2017-08-17T11:16:11Z

Migrated from snowplow/snowplow#185

chuwy · 2018-01-23T08:04:05Z

I'd like to make this the main scope of R30. It depends mostly on #81, plus a few more Iglu tickets.

This feature can (or even must) first be used in test mode, so we should add a corresponding setting to enable it.

Long story short: with auto-migration enabled, the algorithm should be as follows:

1. RDB Shredder shreds enriched data, saves it to S3 and writes a record to the processing manifest listing the shredded data types it found.
2. RDB Loader finds new records in the manifest and checks which types already exist in the DB and which do not.
3. Fetch JSON Schemas for all types in a new folder.
4. For existing types: generate JSONPaths using Schema DDL on the fly [2].
5. For new types: fetch the JSON Schema from the registry, then create DDL and JSONPaths using Schema DDL.
6. Create new tables.
7. Load both existing and new types using the just-generated JSONPaths files.
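The steps above could be sketched roughly as the following planning function. This is only an illustration in Python, not RDB Loader code: `plan_load`, the action names and the schema keys are hypothetical, and the manifest and the database are reduced to a plain list and a set.

```python
def plan_load(manifest_types, existing_tables):
    """Turn the shredded types found in a manifest record into ordered actions.

    manifest_types: schema keys written to the manifest by the shredder.
    existing_tables: schema keys that already have a table in the DB.
    Both are stand-ins for the real manifest and database lookups.
    """
    types = sorted(set(manifest_types))
    new_types = [t for t in types if t not in existing_tables]

    # Schemas are fetched for every type, JSONPaths are generated for every type.
    actions = [("fetch_schema", t) for t in types]
    actions += [("generate_jsonpaths", t) for t in types]
    # Only types without a table need DDL and a CREATE TABLE.
    actions += [("generate_ddl", t) for t in new_types]
    actions += [("create_table", t) for t in new_types]
    # Loading happens last, once all tables exist.
    actions += [("load", t) for t in types]
    return actions


plan = plan_load(
    ["com.acme/click/1", "com.acme/view/1"],  # hypothetical schema keys
    existing_tables={"com.acme/click/1"},
)
for action in plan:
    print(action)
```

Keeping the plan as data rather than executing it directly would also suit the test mode mentioned above, since the plan can be printed without touching the DB.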

Problems we need to solve:

1. Obsolete, manually generated DDLs/JSONPaths. Our users have lots of JSON Schemas for which DDLs and JSONPaths were not generated with Igluctl, which means that the order of columns is unpredictable. We already have some plans for migrating obsolete DDLs (igluctl: add Redshift DDL migrations iglu#312), but at first we can go with a "pre-generated assets" approach: first we check the JSONPaths bucket and if an asset exists, we do not generate assets on the fly and use the existing one instead. I also don't think this is a temporary approach. Instead, it will allow our users to have custom optimizations in DDLs: better compression, filtered out folders etc.
2. Even JSONPaths files generated on the fly still need to be stored somewhere. I think Redshift always requires JSONPaths to be stored on S3. Need to double-check this.
3. So far this approach will work only for initial tables. We do have migrations, but they hardly cover 50% of table alterations.
4. Also, we need an automatic version check in Iglu before production usage.
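The "pre-generated assets" lookup from the first problem might look something like this. It is a minimal sketch under loud assumptions: the bucket is modelled as a dict, and `resolve_jsonpaths` and `generate` are hypothetical names, not Schema DDL or RDB Loader functions.

```python
def resolve_jsonpaths(schema_key, bucket, generate):
    """Return (jsonpaths, source), preferring a pre-generated asset.

    bucket: dict standing in for the JSONPaths bucket (key -> list of paths).
    generate: callable standing in for on-the-fly generation from a schema.
    """
    if schema_key in bucket:
        # A hand-written or previously generated asset always wins.
        return bucket[schema_key], "pre-generated"
    return generate(schema_key), "generated"


# Hypothetical data: one type has a custom asset, the other does not.
bucket = {"com.acme/click/1": ["$.data.id", "$.data.target"]}
generate = lambda key: ["$.data.id"]

print(resolve_jsonpaths("com.acme/click/1", bucket, generate))
print(resolve_jsonpaths("com.acme/view/1", bucket, generate))
```

Returning the source alongside the paths is a deliberate choice here: logging it per load would make it easier to tell which of the two code paths produced a given column order.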

alexanderdean · 2018-01-23T21:31:54Z

I think this design is a good start, but needs some more work.

The most obvious thing is the JSON Paths files themselves - if we continue with this approach, then:

We have to continue generating and hosting (in S3) JSON Paths files
We cannot use this functionality with other relational databases like Postgres or Azure SQL DW

I am keen that we make the leap to removing the need for JSON Paths files altogether. Essentially, the shred process no longer writes out JSONs, but TSV files - to put it another way: a "virtual JSON Paths file" is applied to the JSON inside the shred step.
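To illustrate what applying a "virtual JSON Paths file" inside the shred step could mean, here is a small Python sketch that flattens one JSON document into a TSV row. The path representation (lists of keys), the empty string for missing values and the tab/newline escaping are all assumptions made for the example, not a description of the actual shredder.

```python
import json


def extract(doc, path):
    """Follow a list of keys/indexes into doc; return None if a step is missing."""
    current = doc
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return None
    return current


def to_tsv_row(doc, paths):
    """Apply an ordered list of paths to doc and join the values with tabs."""
    cells = []
    for path in paths:
        value = extract(doc, path)
        if value is None:
            cells.append("")  # assumption: missing/null becomes an empty cell
        elif isinstance(value, str):
            # Keep the row on one line and the column count stable.
            cells.append(value.replace("\t", " ").replace("\n", " "))
        else:
            # Numbers, booleans and nested structures are written as JSON.
            cells.append(json.dumps(value, sort_keys=True))
    return "\t".join(cells)


doc = {"schema": {"vendor": "com.acme"}, "data": {"id": 7, "tags": ["a"]}}
paths = [["schema", "vendor"], ["data", "id"], ["data", "missing"], ["data", "tags"]]
print(to_tsv_row(doc, paths))
```

The order of `paths` plays the role the JSON Paths file plays today: it fixes the column order, but it never has to be written to S3.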

alexanderdean · 2018-01-23T21:39:16Z

> first we check the JSONPaths bucket and if an asset exists, we do not generate assets on the fly and use the existing one instead.

I think this is over-complicated - hard to reason about and debug... I think it would be better to:

Come up with a guided upgrade process to fix all the Redshift target tables. This is a one-time operation, run out-of-band by the DBA
Ideally come up with a clever way (a flag table? comment on the schema?) for the new RDB Loader release to check that the guided upgrade has been successfully completed before loading
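As one possible shape for the flag-table idea, the check before loading might be sketched like this. It uses Python's built-in `sqlite3` purely as a stand-in for the target database, and the table name `loader_upgrade_flag`, its `version` column and the `upgrade_completed` helper are all hypothetical.

```python
import sqlite3

FLAG_TABLE = "loader_upgrade_flag"  # hypothetical name, written by the guided upgrade


def upgrade_completed(conn, required_version):
    """Return True only if the flag table records at least required_version."""
    try:
        row = conn.execute(f"SELECT MAX(version) FROM {FLAG_TABLE}").fetchone()
    except sqlite3.OperationalError:
        # Flag table does not exist: the guided upgrade was never run.
        return False
    # MAX() over an empty table yields NULL, which also means "not upgraded".
    return row[0] is not None and row[0] >= required_version


conn = sqlite3.connect(":memory:")
print(upgrade_completed(conn, 1))  # no flag table yet

conn.execute(f"CREATE TABLE {FLAG_TABLE} (version INTEGER)")
conn.execute(f"INSERT INTO {FLAG_TABLE} VALUES (1)")
print(upgrade_completed(conn, 1))  # flag written by the upgrade
```

A loader following this sketch would refuse to load when the check returns False, which keeps the failure mode explicit instead of silently loading into tables with an unexpected column order.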

> So far this approach will work only for initial tables. We do have migrations, but they hardly cover 50% of table alterations.

It feels like it's safest for the initial version to just support initial tables - rather than muddying the water with some migration support...

chuwy · 2020-03-19T20:30:51Z

Closing in favor of #152

chuwy mentioned this issue Aug 17, 2017

StorageLoader: auto-create and auto-migrate tables snowplow/snowplow#185

Closed

chuwy changed the title ~~Support for table auto-creation and auto-migration~~ RDB Loader: support for table auto-creation and auto-migration Jan 23, 2018

chuwy self-assigned this Jan 23, 2018

chuwy mentioned this issue Jun 29, 2018

Schema DDL: Add ability to generate Postgres JSON Paths snowplow/iglu#193

Closed

chuwy mentioned this issue May 8, 2019

EmrEtlRunner: add support for shredded TSV data snowplow/snowplow#4074

Closed

chuwy closed this as completed Mar 19, 2020

chuwy added the duplicate label Mar 19, 2020
