[ETL-659] Snowflake git integration #123

Merged (8 commits) on Jul 15, 2024

Conversation

@philerooski (Contributor) commented Jul 3, 2024

Deploys Snowflake objects to a specific environment as part of our CI/CD process. An environment is analogous to a database. We assume that account-level objects have already been created.

The following objects are created:

  • RECOVER_GIT_REPOSITORY git repository
  • RECOVER_{environment} database (I will refer to this merely as the RECOVER database from here on, for simplicity's sake).
  • PARQUET schema
  • PARQUET_FORMAT file format.
  • PARQUET_S3 external stage.
  • A table for each Parquet dataset in our external stage (manually defined)

Deployment logic and object DDL are organized as a hierarchy:

snowflake/objects
└── database
    └── recover
        └── schema
            └── parquet
                ├── file_format
                ├── stage
                └── table

(*Almost) every level in this hierarchy has a deploy.sql that deploys all child objects relative to the current directory.

* Due to technical limitations, there is no functioning deploy.sql file under snowflake/objects/database, which would have deployed every database (we currently have only one database in our deployment).

snowflake/objects
├── database
│   ├── deploy.sql
│   └── recover
│       ├── deploy.sql
│       └── schema
│           ├── deploy.sql
│           └── parquet
│               ├── deploy.sql
│               ├── file_format
│               │   ├── deploy.sql
│               │   └── parquet_format.sql
│               ├── stage
│               │   ├── deploy.sql
│               │   └── parquet_s3.sql
│               └── table
│                   ├── deploy.sql
│                   ├── enrolledparticipants_customfields_symptoms_parquet.sql
│                   ├── enrolledparticipants_customfields_treatments_parquet.sql
│                   ├── enrolledparticipants_parquet.sql
│                   ├── fitbitactivitylogs_parquet.sql
│                   ├── fitbitdailydata_parquet.sql
│                   ├── ...
└── deploy.sql

For example, the file located at snowflake/objects/database/recover/deploy.sql will deploy all objects under the RECOVER database, and snowflake/objects/database/recover/schema/parquet/deploy.sql will deploy the file formats, stages, and tables under the PARQUET schema.

The child objects, that is, the schemas, file formats, stages, tables – anything that is not a database – are defined in such a way as to be agnostic to their parent context. For example, snowflake/objects/database/recover/schema/parquet/deploy.sql will deploy the PARQUET schema and all its child objects to whichever database your Snowflake user assumes. There is nothing in the SQL which restricts the PARQUET schema to be created within the RECOVER database. Likewise, the tables can be deployed to any schema, although their DDL is specific to the columns in our Parquet datasets.

The entrypoint for deployment is snowflake/objects/deploy.sql. When a commit is pushed to a branch, our CI/CD process will instantiate all Snowflake objects to an environment-specific database prefixed with your branch name, e.g., RECOVER_ETL_659. Details on how to manually deploy are provided in the file itself.
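
For a rough idea, a manual deployment from a feature branch might look something like the sketch below, assuming deploy.sql consumes a Jinja-templated environment variable through a USING clause. The branch path and variable name are illustrative; the authoritative instructions are in snowflake/objects/deploy.sql.

-- Sketch only: branch path and variable name are hypothetical
ALTER GIT REPOSITORY recover_git_repository FETCH;  -- refresh the repository clone if it already exists
EXECUTE IMMEDIATE
  FROM @recover_git_repository/branches/etl-659/snowflake/objects/deploy.sql
  USING (environment => 'ETL_659');  -- consumed as {{ environment }} in the templated SQL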

Implementation details

  • This PR is an incomplete solution to Snowflake deployment. In a production-ready deployment configuration, we would already have a production database with tables containing data from which we could easily derive tables to deploy to a dev environment. Lacking that, we define our tables from scratch and don't yet load any data into them. Having the ability to create tables from scratch is a critical component for doing reproducible deployments, but it's not an effective production-level deployment strategy. For what work remains to be done to have a fully-featured deployment, see this comment.
  • Objects are created in a few different ways:
    • CREATE OR REPLACE will destroy any existing objects and create a new object. This is a powerful, declarative way of enforcing object configurations, but it's not always effective to recreate objects and sometimes existing objects ought to be reused.
    • CREATE ... IF NOT EXISTS will create an object only if it doesn't already exist. We use this statement to preserve the state of that object and its child objects; the prototypical use case is preserving tables containing data along with their parent objects. I use this statement with DATABASE and SCHEMA objects because they are parents of table objects. However, for dev deployments I instead CREATE OR REPLACE the database object, in anticipation of cloning preexisting production tables, so that each new deployment produces a completely reproducible dev environment (see the sketch after this list).
    • CREATE OR ALTER is a mechanism for declaratively updating table objects. More information here.
  • Table data types are defined to be as large as possible without consuming extraneous storage or requiring additional processing. For more information, see String & binary data types and Numeric data types.
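
To make the distinctions above concrete, here is a minimal sketch of the three statement styles using hypothetical object names (not the actual DDL in this PR):

-- Lightweight, easily replaced object: fully declarative, no state to manage
CREATE OR REPLACE FILE FORMAT my_parquet_format TYPE = PARQUET;

-- Parent of tables holding data: keep it if it already exists
CREATE SCHEMA IF NOT EXISTS my_schema;

-- Table: update the definition in place when the DDL changes, preserving existing data
CREATE OR ALTER TABLE my_table (
    id NUMBER,
    value VARCHAR
);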

@philerooski requested a review from a team as a code owner on July 3, 2024 00:18

@thomasyu888 (Member) left a comment

🔥 LGTM - with some minor comments - but going to pre-approve!

I wanted to touch on this: "This PR is an incomplete solution to Snowflake deployment."

I am still a bit confused by this, but could you outline what you think a "production" deployment looks like via CI/CD? When you push into main here, I think it creates a recover_main database?

/*
Create the Parquet file format
*/
CREATE OR REPLACE FILE FORMAT {{ parquet_file_format_name }}

Member

Should this be CREATE ... IF NOT EXISTS?

Contributor Author

It's a tradeoff.

CREATE OR REPLACE is the best way to ensure we are deploying in a declarative way without having to worry about state and requires minimal overhead. FILE FORMAT is a lightweight object and can be replaced easily.

But given the caveat about FILE FORMAT and EXTERNAL TABLE, we either need to enforce top-down that external tables which use a FILE FORMAT are always recreated during deployment, or – if we were to instead use CREATE ... IF NOT EXISTS – we need to rely on the developer to know that changes they make to the FILE FORMAT object won't be reflected in subsequent deployments of an existing environment unless they manually delete the object. And if they manually delete the object, then they also have to manually deal with any fallout from the caveat I mentioned above! Alternatively, they could ALTER the file format, but at that point the developer has to be familiar with and remember this caveat and manage the state of those objects themselves; CI/CD won't do it for them. Of course, we'd have to deal with this complexity when deploying changes to a FILE FORMAT in staging or prod, too.

Fortunately, since EXTERNAL TABLE objects are merely metadata (schema and table properties), they are also lightweight objects and can be replaced easily. So my preference is to use CREATE OR REPLACE for both object types to ensure that we are deploying objects in a state that's reflected in their DDL.
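
For concreteness, a minimal sketch of the two alternatives discussed here, with a hypothetical format name in place of the templated one:

-- Option A (what this PR does): declarative; edits to the DDL always take effect on
-- the next deployment, at the cost of recreating the object
CREATE OR REPLACE FILE FORMAT my_parquet_format TYPE = PARQUET;

-- Option B: preserves existing state; edits to the DDL are silently ignored on an
-- existing deployment until someone manually drops or alters the object
CREATE FILE FORMAT IF NOT EXISTS my_parquet_format TYPE = PARQUET;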

Member

@philerooski Are we planning on using EXTERNAL TABLE? That's an important caveat you mentioned - but if we aren't, it's probably ok to do this

Contributor Author

@thomasyu888 We are not, but if we were I would still recommend this approach because external tables are easy to replace.

.github/workflows/upload-and-deploy.yaml (resolved)
@@ -0,0 +1,11 @@
CREATE OR ALTER TABLE ENROLLEDPARTICIPANTS_CUSTOMFIELDS_SYMPTOMS (

@thomasyu888 (Member) commented Jul 3, 2024

This is neat, does it automatically put you in the right schema? I did see the USE SCHEMA in one of the scripts!

Contributor Author

If you are deploying from an entrypoint in database/recover/schema/parquet/deploy.sql (where you see the USE SCHEMA statement) or higher, then yes. But if you are using another schema and you EXECUTE IMMEDIATE FROM this file, then this table will be created in that other schema.
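
Roughly, and with hypothetical database and branch names (ignoring any templated variables the real scripts may require):

-- Deploying from the parquet deploy.sql or higher: the USE SCHEMA in that script
-- pins the table to the PARQUET schema of the target database
EXECUTE IMMEDIATE FROM @recover_git_repository/branches/etl-659/snowflake/objects/database/recover/schema/parquet/deploy.sql;

-- Executing the table script directly from another schema: the table lands wherever
-- the session's current database and schema happen to point
USE SCHEMA SOME_OTHER_DB.SCRATCH;  -- hypothetical
EXECUTE IMMEDIATE FROM @recover_git_repository/branches/etl-659/snowflake/objects/database/recover/schema/parquet/table/enrolledparticipants_parquet.sql;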

@philerooski (Contributor Author) commented:

I wanted to touch on this "This PR is an incomplete solution to Snowflake deployment." I am still a bit confused by this, but could you outline what you think a "production" deployment looks like via CI/CD?

To summarize what the deployment does in this PR, it effectively creates all the objects necessary to create tables (plus a few extra – FILE FORMAT and STAGE objects – in anticipation of future work) and then creates tables for each of our Parquet datasets. It doesn't load any data, it merely creates the tables (or alters them, if they already exist).

This is less than ideal for a developer. They don't want to load any data and they definitely don't want to do it manually! So in the future, when we have Snowflake tables in production which already contain our data, we can modify the default behavior of this deployment so that we clone those tables, rather than create empty tables. That's what makes this an incomplete solution.

As for what a production-ready deployment system looks like, I believe we ought to have multiple ways of deploying tables:

  1. By cloning from existing tables in production
  2. By creating the table from scratch and loading data into the table via COPY INTO from an external stage.
  3. By updating the table if there are changes, and otherwise doing nothing

(1) is our ideal dev environment deployment scenario. It requires minimal overhead and we get a production database in an isolated environment that we can muck around in.
(2) is our bootstrap scenario. If – for whatever reason – we need to redeploy our database entirely from scratch, we should be able to do that.
(3) is our production scenario. We want to update the table metadata if there have been any changes, but data loading happens via a separate mechanism not related to deployment.
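
Sketched in Snowflake SQL with hypothetical table and path names, the three modes might look like:

-- (1) Dev: clone the production database into an isolated, branch-specific copy
CREATE OR REPLACE DATABASE RECOVER_ETL_659 CLONE RECOVER_MAIN;

-- (2) Bootstrap: create the table from scratch, then load it from the external stage
CREATE OR REPLACE TABLE MY_DATASET (PARTICIPANTIDENTIFIER VARCHAR, LOGDATE TIMESTAMP);
COPY INTO MY_DATASET
  FROM @PARQUET_S3/my_dataset/                       -- hypothetical stage path
  FILE_FORMAT = (FORMAT_NAME = 'PARQUET_FORMAT')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

-- (3) Prod: apply table metadata changes only; data loading happens through a
-- separate mechanism unrelated to deployment
CREATE OR ALTER TABLE MY_DATASET (PARTICIPANTIDENTIFIER VARCHAR, LOGDATE TIMESTAMP, NEW_COLUMN VARCHAR);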

when you push into main here, I think it creates a recover_main database?

No, RECOVER_MAIN is one of the databases which we already assume to exist as detailed in snowflake/objects/deploy.sql. I use Snowflake scripting to specify that we only create or replace a database if this is not the main or staging branch/environment.
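
A rough sketch of that conditional, assuming a Jinja-templated {{ environment }} variable; the actual logic lives in snowflake/objects/deploy.sql and may differ:

-- Sketch only: Jinja renders {{ environment }} to a literal before execution
EXECUTE IMMEDIATE
$$
BEGIN
    IF ('{{ environment }}' <> 'main' AND '{{ environment }}' <> 'staging') THEN
        -- dev deployments get a fresh, reproducible database every time
        CREATE OR REPLACE DATABASE RECOVER_{{ environment }};
    ELSE
        -- main/staging databases are assumed to exist and are never replaced
        CREATE DATABASE IF NOT EXISTS RECOVER_{{ environment }};
    END IF;
END;
$$;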

Create an external stage over the RECOVER Git repository so that we can
use EXECUTE IMMEDIATE FROM statements.
*/
CREATE OR REPLACE GIT REPOSITORY recover_git_repository

Member

I think eventually we will need to ALTER this instead of CREATE OR REPLACE.

Contributor Author

@thomasyu888 Could you explain why we might need to maintain an existing GIT REPOSITORY object? By using CREATE OR REPLACE we guarantee that our repository is up to date without needing to worry about whether the repository has already been created – which would require a FETCH statement.
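
For reference, a hedged sketch of the two approaches; the integration name and origin URL are illustrative, not the actual values in this deployment:

-- Replace every time: the repository object is recreated, so it is always up to date
CREATE OR REPLACE GIT REPOSITORY recover_git_repository
  API_INTEGRATION = my_git_api_integration                 -- hypothetical integration name
  ORIGIN = 'https://github.com/Sage-Bionetworks/my-repo';  -- hypothetical repository URL

-- Keep and refresh: preserve the existing object, but then a FETCH is required so the
-- clone reflects the latest commits before EXECUTE IMMEDIATE FROM runs
CREATE GIT REPOSITORY IF NOT EXISTS recover_git_repository
  API_INTEGRATION = my_git_api_integration
  ORIGIN = 'https://github.com/Sage-Bionetworks/my-repo';
ALTER GIT REPOSITORY recover_git_repository FETCH;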

Member

@philerooski So, maybe I'm jumping ahead and just thinking of the dev and prod deployment. I guess it's not too much lift to create or replace these every time we push to a branch.

I think I'm struggling a bit conceptually to see the production CI/CD pipeline so I'll wait for you to get the first data flowing in and we can revisit.

@thomasyu888 (Member) commented Jul 3, 2024

Also, I guess the issue is that there are two parallel tracks here that I'm getting mixed up:

  1. Development workflow
  2. Dev and prod deployment

What we are currently tackling here is (2), in that every time you add a file (let's say a table) or add a scheduled task or dynamic table, the CI/CD is going to run through this entire stack to create the resource without creating all the other resources again.

For (1), like you've mentioned, it would be CLONE DATABASE, then create all the new resources per new branch. We do NOT load all the data up again for existing data; we just load data for the new resources or tables we're adding.

Is that right?

Contributor Author

I guess it's not too much lift to create or replace these every time we push to a branch.

No, it's not much overhead at all. And it's the only way to do stateless/declarative deployment in Snowflake – with the exception of objects which support the CREATE OR ALTER syntax.

the CI/CD is going to run through this entire stack to create the resource without creating all the other resources again.

I'm a little confused about how you think this works, because you said above that we create or replace most objects every time we push to a branch, but this statement implies that we'll be able to know when a change is independent of already deployed objects, which would require a dependency graph.

For (1), like you've mentioned, it would be CLONE DATABASE, then create all the new resources per new branch. We do NOT load all the data up again for existing data; we just load data for the new resources or tables we're adding.

That's right. Does it make things clearer if I put it like this: The deployment process for dev/staging/prod is exactly the same except for the hard to replace objects, like tables with data.

@thomasyu888 (Member) commented Jul 3, 2024

we'll be able to know when a change is independent of already deployed objects, which would require a dependency graph.

Hmmm. Let's merge this and see how it plays out - I think we are actually on the same page, but I just need to see it in action.

Snowflake takes care of what has already been deployed (hence the CREATE... IF NOT EXISTS, or CREATE OR ALTER statements)

That does help!

Contributor Author

Slight revision: the deployment process for dev/staging/prod is exactly the same except for the hard-to-replace objects and their parent objects, like tables with data and their schema and database.

* An API INTEGRATION with an API_ALLOWED_PREFIXES value
containing 'https://github.com/Sage-Bionetworks/' and
API_PROVIDER = GIT_HTTPS_API.
- STORAGE INTEGRATION `RECOVER_PROD_S3`

Contributor

Do we also want a recover_dev_s3? Not sure if we want non-releases to go through any of this.

Contributor Author

We actually do already have a storage integration for our dev account. But it's not a dependency for this deployment, and since in the future we will clone prod tables (or maybe we want to sample them to keep things lightweight?) to their own isolated environment I don't see why we would want to work with the pilot data.

@BryanFauble (Contributor) left a comment

This is awesome. Great work getting this together.

sonarcloud bot commented Jul 15, 2024

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
5.6% Duplication on New Code

See analysis details on SonarCloud
