Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RDB Loader: support for table auto-creation and auto-migration #7

Closed
chuwy opened this issue Aug 17, 2017 · 4 comments
Closed

RDB Loader: support for table auto-creation and auto-migration #7

chuwy opened this issue Aug 17, 2017 · 4 comments
Assignees

Comments

@chuwy
Copy link
Contributor

chuwy commented Aug 17, 2017

Migrated from snowplow/snowplow#185

@chuwy chuwy changed the title Support for table auto-creation and auto-migration RDB Loader: support for table auto-creation and auto-migration Jan 23, 2018
@chuwy chuwy self-assigned this Jan 23, 2018
@chuwy
Copy link
Contributor Author

chuwy commented Jan 23, 2018

I'd like to make it a main scope of R30. Depends on #81 mostly, and few more iglu tickets.

This feature can (or even must) be first used in test-mode, so we should add a corresponding setting to enable it.

Long story short: with enabled auto-migration algorithm should be following:

  1. RDB Shredder shreds enriched data, saves it to S3 and writes a record to processing manifest with found shredded data types
  2. RDB Loader finds new records on manifest and checks what types already exist in DB and what are not.
  3. Fetch JSON Schemas for all types in a new folder
  4. For existing types: generate a JSONPath using Schema DDL on-fly [2]
  5. For new types - fetch JSON Schema from registry, create DDL and JSONPaths using Schema DDL
  6. Create new tables
  7. Load both existing and new types using just generated JSONPath files

Problems we need to solve:

  1. Obsolete, manually generated DDLs/JSONPath. Our users have lots of JSON Schemas, for which DDLs and JSONPaths were not generated with Igluctl, which means that order of columns is unpredictable. We already have some plans about obsolete-DDLs migration (igluctl: add Redshift DDL migrations iglu#312), but at first we can go with "pre-generated assets" approach: first we check JSONPaths bucket and if asset exists - we do not generate assets on-fly and use existing instead. I also don't think this is a temporary approach. Instead it will allow our users to have custom optimizations in DDLs: better compressions, filtered out folders etc.
  2. Even on-fly generated JSONPath files still need to be stored somewhere. I think Redshift always requires JSONPaths to be stored on S3. Need to double-check it.
  3. So far this approach will work only for initial tables. We do have migrations, but they hardly cover 50% of table alterations.
  4. Also, we need an automatic versions check in Iglu before production usage.

@alexanderdean
Copy link
Member

I think this design is a good start, but needs some more work.

The most obvious thing is the JSON Paths files themselves - if we continue with this approach, then:

  1. We have to continue in generating and hosting (in S3) JSON Paths files
  2. We cannot use this functionality with other relational databases like Postgres or Azure SQL DW

I am keen that we make the leap to removing the need for JSON Paths files altogether. Essentially, the shred process no longer writes out JSONs, but TSV files - to put it another way: a "virtual JSON Paths file" is applied to the JSON inside the shred step.

@alexanderdean
Copy link
Member

alexanderdean commented Jan 23, 2018

first we check JSONPaths bucket and if asset exists - we do not generate assets on-fly and use existing instead.

I think this is over-complicated - hard to reason about and debug... I think it would be better to:

  • Come up with a guided upgrade process to fix all the Redshift target tables. This is a one-time operation, run out-of-band by the DBA
  • Ideally come up with a clever way (a flag table? comment on the schema?) for the new RDB Loader release to check that the guided upgrade has been successfully completed before loading

So far this approach will work only for initial tables. We do have migrations, but they hardly cover 50% of table alterations.

It feels like it's safest for the initial version to just support initial tables - rather than muddying the water with some migration support...

@chuwy
Copy link
Contributor Author

chuwy commented Mar 19, 2020

Closing in favor of #152

@chuwy chuwy closed this as completed Mar 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants