
[ETL-676] Create data loading procedure for each parquet datatype #132

Merged: 5 commits merged into main from etl-676 on Jul 23, 2024

Conversation

@philerooski (Contributor) commented Jul 22, 2024

For each of our Parquet data types, we create a stored procedure which can do a COPY INTO operation from the S3 location of that data type into a Snowflake table. Additionally, I've added an external stage for our development S3 bucket. We do not call these stored procedures as part of the deployment process. Instead, we will call the procedures using Tasks (future work).

  • snowflake/objects/database/recover/schema/parquet/procedure/copy_into_table_from_stage.sql
    This is the stored procedure logic. This procedure will copy data from a specific stage (S3) location, but can copy into any Snowflake table (specified by the target_table parameter). A hedged sketch of this pattern follows the list below.
  • snowflake/objects/database/recover/schema/parquet/procedure/deploy.sql
    This handles the logic of deploying the above procedure for each of our Parquet data types.
  • snowflake/objects/database/recover/schema/parquet/deploy.sql
    This is our (relative) top-level deployment script, which handles deployment of all the child objects under the Parquet schema. We've added some logic which will create the procedures. Depending on whether this is a dev deployment or a prod/staging deployment (i.e., the branch is main), the procedures will reference the development or production stage, respectively.
  • snowflake/objects/database/recover/schema/parquet/stage/*
    Everything under this directory is related to deploying the prod and dev stages.
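
To make the procedure's shape concrete, here is a minimal sketch of what a COPY INTO stored procedure like this might look like. This is an illustration rather than the PR's actual code: the target_table parameter comes from the description above, but the {{ stage_name }} template variable, the COPY options, and the return value are assumptions.

    -- Hypothetical sketch of copy_into_table_from_stage.sql; {{ stage_name }}
    -- is an assumed template variable rendered at deploy time (dev vs. prod),
    -- and the COPY options are illustrative.
    CREATE OR REPLACE PROCEDURE copy_into_table_from_stage(target_table VARCHAR)
        RETURNS VARCHAR
        LANGUAGE SQL
    AS
    $$
    BEGIN
        -- Build the COPY INTO statement dynamically so one procedure body can
        -- serve any target table; the stage sub-path mirrors the data type's
        -- S3 prefix.
        EXECUTE IMMEDIATE
            'COPY INTO ' || target_table ||
            ' FROM @{{ stage_name }}/' || target_table ||
            ' FILE_FORMAT = (TYPE = PARQUET)' ||
            ' MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE';
        RETURN 'Copied into ' || target_table;
    END;
    $$;

The procedure/deploy.sql script would then render and create a procedure like this once per Parquet data type, with the top-level deploy.sql deciding (based on the branch) whether {{ stage_name }} resolves to the development or production stage.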

@philerooski philerooski requested a review from a team as a code owner July 22, 2024 03:57
sonarcloud bot commented Jul 22, 2024

@thomasyu888 (Member) left a comment
🔥 LGTM! The test seems to be failing for certain portions though.

@BryanFauble (Contributor) left a comment

These changes look great!

Are there any instructions written up around how to run these items via the CLI? I know during the Codespaces session you were covering a few ways that running these is a bit different from passing arguments normally. I'm thinking about this from the perspective of "If I need to make a change, what is the best way for me to test the changes?"

@philerooski (Contributor, Author) commented

@BryanFauble

I know during the Codespaces session you were covering a few ways that running these is a bit different from passing arguments normally. I'm thinking about this from the perspective of "If I need to make a change, what is the best way for me to test the changes?"

This is one area where I feel like Snowflake starts tripping over itself. The only way to invoke a SQL script locally (afaik) is with snow sql -f, but that requires Jinja variables to be formatted like &{ this }, whereas in an actual deployment you would use EXECUTE IMMEDIATE FROM statements to invoke the SQL, which use the {{ this }} format. And we can't just snow sql -q "EXECUTE IMMEDIATE FROM ..." to run local SQL files, because Snowflake requires the top-level SQL script in your EXECUTE IMMEDIATE FROM chain to be in a stage, which means we need to push the code to an internal Snowflake stage or push it to Git and use an external stage.

I've been developing without variables as much as possible, then amending my git commits whenever I'm ready to start using variables. This is pretty inconvenient. It might be easier to replace {{ this }} with &{ this } when developing and then switch the formatting when I'm ready to push to Git. I'm still trying to figure out the best way to go about development.
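
For anyone following along, the mismatch looks roughly like this; the variable name (git_branch) and object names below are made up for illustration, and the point is the two variable syntaxes:

    -- Local development: the script is run directly with the CLI, which
    -- expects the &{ ... } variable style, e.g. (assuming variables are
    -- supplied with -D):
    --   snow sql -f deploy.sql -D "git_branch=etl-676"
    CREATE STAGE IF NOT EXISTS parquet_stage_&{ git_branch };

    -- Actual deployment: the same script is invoked with EXECUTE IMMEDIATE
    -- FROM, which renders Jinja-style variables and requires the top-level
    -- script to live in a stage:
    --   EXECUTE IMMEDIATE FROM @deploy_stage/deploy.sql
    --     USING (git_branch => 'main');
    CREATE STAGE IF NOT EXISTS parquet_stage_{{ git_branch }};

The same line of SQL has to change shape depending on how it is invoked, which is what makes local testing awkward.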

@philerooski (Contributor, Author) commented

@thomasyu888

The test seems to be failing for certain portions though.

Seems like it might have had something to do with Synapse being down for maintenance Sunday evening?

@BryanFauble (Contributor) commented


@philerooski What I have been doing in the Terraform world is to just commit my changes and push to origin:
Sage-Bionetworks-Workflows/eks-stack@main...ibcdpe-935-vpc-updates

I then let the CI/CD pipeline run automatically and look at the results in the UI. I could do all of this locally, but this has been more than sufficient for me.

Perhaps we could follow the same approach here, where you push changes to origin and watch the GitHub Action that runs?

What this would catch in your case is a mistake made when you "switch the formatting when I'm ready to push to Git".

@philerooski (Contributor, Author) commented

@BryanFauble That's the process I use now, letting the CI/CD do the deployment after I amend the commit and force push. A few (minor-ish) downsides:

  1. My deployment method doesn't take into account unchanged "stacks". Everything needs to be deployed from scratch, which is a lot slower than you would hope if all you're trying to do is update a single object. So I end up manually commenting out time-consuming steps like snowflake/objects/.../table/deploy.sql.
  2. RECOVER's CI/CD includes a lot of other time-consuming steps, like AWS deployments and tests, which aren't blocking the Snowflake deployment in the CI/CD, but also aren't related and don't need to run.

@philerooski philerooski merged commit 6a610b1 into main Jul 23, 2024
18 checks passed
@philerooski philerooski deleted the etl-676 branch July 23, 2024 19:09