
[SNOW-90] Introduce CI job to test schemachange updates against a clone #78

Status: Open. Wants to merge 79 commits into base branch `dev`. Showing changes from 57 of 79 commits.

Commits:
ed485ba
Initial commit. A simple clone of the DB
jaymedina Oct 23, 2024
b388915
Install snowsql
jaymedina Oct 23, 2024
e3c575a
add pw
jaymedina Oct 23, 2024
ae8d75d
add debug step
jaymedina Oct 23, 2024
4371e0b
.
jaymedina Oct 23, 2024
c9989e5
debugging
jaymedina Oct 23, 2024
1b136f0
syntax
jaymedina Oct 23, 2024
680ecaa
directly pass vars
jaymedina Oct 23, 2024
7d946ed
new .sql file to run the command isntead
jaymedina Oct 23, 2024
a342f85
global vars
jaymedina Oct 23, 2024
244216b
dynamically change the name of the clone DB
jaymedina Oct 23, 2024
4f35760
syntax
jaymedina Oct 23, 2024
ab2a048
syntax
jaymedina Oct 23, 2024
be2e82f
using IDENTIFIER for db object
jaymedina Oct 23, 2024
fb4a208
.
jaymedina Oct 23, 2024
913ef4c
.
jaymedina Oct 23, 2024
ded7408
set var sub true
jaymedina Oct 23, 2024
b25b10d
add identifier
jaymedina Oct 23, 2024
2bdc6ef
.
jaymedina Oct 23, 2024
2090fad
.
jaymedina Oct 23, 2024
b0ebf0c
typo
jaymedina Oct 23, 2024
d91d5ab
.
jaymedina Oct 23, 2024
6a7b631
.
jaymedina Oct 23, 2024
d8c8927
?
jaymedina Oct 23, 2024
38da909
.
jaymedina Oct 23, 2024
c75abba
go back to running command in ci step itself
jaymedina Oct 23, 2024
1a5160f
role is data_engineer
jaymedina Oct 23, 2024
1edbb83
branch sha
jaymedina Oct 23, 2024
1bc7f04
Run schemachagne on the clone
jaymedina Oct 24, 2024
a94ea31
.
jaymedina Oct 24, 2024
cee2c5e
.
jaymedina Oct 24, 2024
df3a91f
remove clone_db.sql
jaymedina Oct 24, 2024
46c8390
label name changed
jaymedina Oct 25, 2024
ca73fb6
change job name. change clone name convention. fix typos. new delete_…
jaymedina Oct 28, 2024
6fe39ca
address syntax issues in clone name
jaymedina Oct 28, 2024
88c7b00
add a status message
jaymedina Oct 28, 2024
939c9f7
missing env var
jaymedina Oct 28, 2024
4116019
vars
jaymedina Oct 28, 2024
bac91b8
sysadmin to run schemachange
jaymedina Oct 28, 2024
1ffafa9
syntax fix
jaymedina Oct 28, 2024
c44ae9e
refactor and introduce drop_clone job
jaymedina Oct 28, 2024
a3eade5
no refactoring
jaymedina Oct 28, 2024
d76999f
drop_clone will only run after PR merging
jaymedina Oct 28, 2024
89f06b4
new contributing.md
jaymedina Oct 28, 2024
93253a2
updating contributing
jaymedina Oct 29, 2024
6e0d537
move tip
jaymedina Oct 29, 2024
9b187c1
.
jaymedina Oct 29, 2024
3bbd05c
.
jaymedina Oct 29, 2024
4f042c0
.
jaymedina Oct 29, 2024
d2a2452
.
jaymedina Oct 29, 2024
ba646fa
remove redundancies from drop_clone job
jaymedina Oct 29, 2024
6e24035
testing
jaymedina Oct 29, 2024
fc3a85d
testing done
jaymedina Oct 29, 2024
cb218a5
final comment on time travel
jaymedina Oct 29, 2024
06996ab
Update CONTRIBUTING.md
jaymedina Oct 29, 2024
ea07518
Update CONTRIBUTING.md
jaymedina Oct 29, 2024
9d4aaaa
fix typo. small formatting update.
jaymedina Oct 29, 2024
03dfc5a
only run create_clone on open prs. run drop_clone on closed and merge…
jaymedina Oct 30, 2024
ab9297a
improve clarity in contributing.md
jaymedina Oct 30, 2024
8498feb
rephrasing
jaymedina Oct 31, 2024
ed645f3
use snowflake cli
jaymedina Nov 4, 2024
ced4d2c
snowflake connection
jaymedina Nov 4, 2024
3a90e92
.
jaymedina Nov 4, 2024
a85a5ca
lol
jaymedina Nov 4, 2024
8e78bd8
.
jaymedina Nov 4, 2024
294244f
.
jaymedina Nov 4, 2024
f75fb49
we no longer need to auth via the query command
jaymedina Nov 4, 2024
65cfebb
run for every commit when the PR is labeled
jaymedina Nov 5, 2024
9d8fde0
V script to make dummy table
jaymedina Nov 5, 2024
6f2052a
quick change
jaymedina Nov 5, 2024
55cbfd8
rename V to see if that fixes the problem
jaymedina Nov 5, 2024
9a622ca
change to R script
jaymedina Nov 5, 2024
2c9bab2
rename table. update drop_clone job.
jaymedina Nov 5, 2024
912b974
.
jaymedina Nov 5, 2024
d0880d8
try running schemachange as data_engineer
jaymedina Nov 5, 2024
c6ba8d5
.
jaymedina Nov 5, 2024
8b50065
grant privileges to dpe_engineer
jaymedina Nov 6, 2024
00c2a3c
attempt at using dpe_engineer for schematic again
jaymedina Nov 6, 2024
5babc72
put both configs in the same file
jaymedina Nov 6, 2024
111 changes: 111 additions & 0 deletions .github/workflows/test_with_clone.yaml
@@ -0,0 +1,111 @@
---
name: Test Changes with Cloned DB

on:
pull_request:
types: [ labeled, closed ]
Contributor commented:
Does this trigger if the PR is created with the label already?

Contributor commented:

Also -- what if the PR is already labeled, I make some additional commits to my branch, and I want to CREATE OR REPLACE ... CLONE again? Do I need to remove the label and add it again?

jaymedina (Author) replied:

> Does this trigger if the PR is created with the label already?

Good point, I also had this concern but from what I've read, CI jobs can't get triggered on a closed PR. To be safe, I'll add this as a test after this gets merged.

> Do I need to remove the label and add it again?

Yes, you would need to remove it and add it again. I'll add this as a note in the contributing doc.


Contributor commented:

This has the potential to cause confusion if the clone of my PR does not reflect the HEAD of my feature branch. My preferred implementation would be that new commits to my branch also trigger the clone, if my branch meets the other conditions (PR/labeled).

jaymedina (Author) replied on Oct 31, 2024:

> This has the potential to cause confusion if the clone of my PR does not reflect the HEAD of my feature branch.

I would argue that this shouldn't cause confusion, since you can match the "last updated" timestamp on your DB to when you made your last commit, but I see your point.

If we go with that implementation, we may as well trigger the job for every commit regardless of whether it's labeled or not. I'm still on the fence with that idea. You mentioned that your CI pipeline for RECOVER triggers a zero copy clone for every commit, right? Did you experience any technical issues for edge cases where commits are made before the previous run was complete?

Contributor replied:

If we had continued Snowflake development for RECOVER we probably would have used CLONE in some capacity. Instead, we had a stored procedure for each table that could be manually invoked to load data into that table, and a SQL script to do this in bulk for every data type, although this script wasn't used in our CI/CD. Snowflake stuff for RECOVER was never used in a production capacity, it was primarily a proof of concept.

> I'm still on the fence with that idea

How come?

thomasyu888 (Member) commented on Nov 1, 2024:

If we follow the schemachange workflow, the clone is mainly a space for you to test things manually. Unsure if every commit is necessary right now - but I wonder if we could iteratively get there, or when we have the bandwidth to actually move away from schemachange.

There would be additional setup required right now for schemachange to deploy on feature branches into the clone, and would require more reworking on the CI/CD pipeline.

If we have plans on moving away from schemachange in the future, we may want to follow the schemachange framework for now: https://github.com/Snowflake-Labs/schemachange?tab=readme-ov-file#sample-devops-process-flow.

I chose this particular framework for ease, and the setup time was much faster than my experience with using the Snow CLI to manage all of this. The downside, of course, is the annoying V_ and R_ scripts.

jaymedina (Author) replied on Nov 4, 2024:

> How come?

My main concern is the risk of running the job for a commit before the previous job has finished. In other words: back-to-back commits causing cloned DB conflicts, or clashing between R scripts that are running against the same clone.

I think you raise a valid point re: manually triggering the job could lead to version inconsistency between the feature branch and the cloned DB, so I'm open to the run-for-every-commit implementation. But I need to test the edge cases I'm worried about on this PR first, and then we can move forward from there. @philerooski
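One way to guard against the overlapping-run edge case discussed above would be a GitHub Actions concurrency group keyed on the head branch, so that a new commit cancels any in-flight run for the same PR. This is a sketch, not part of this PR:

```yaml
# Hypothetical addition to the workflow file: at most one run per branch,
# newer runs cancel older in-progress ones.
concurrency:
  group: clone-${{ github.head_ref }}
  cancel-in-progress: true
```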


permissions:
contents: read

jobs:

create_clone_and_run_schemachange:
runs-on: ubuntu-latest
if: contains(github.event.pull_request.labels.*.name, 'create_clone_and_run_schemachange')
environment: dev
env:
SNOWSQL_PWD: ${{ secrets.SNOWSQL_PWD }}
SNOWSQL_ACCOUNT: ${{ secrets.SNOWSQL_ACCOUNT }}
SNOWSQL_USER: ${{ secrets.SNOWSQL_USER }}
SNOWSQL_WAREHOUSE: ${{ secrets.SNOWSQL_WAREHOUSE }}
SNOWSQL_CLONE_ROLE: DATA_ENGINEER
SNOWSQL_SCHEMACHANGE_ROLE: SYSADMIN
SNOWFLAKE_SYNAPSE_DATA_WAREHOUSE_DATABASE_ORIG: ${{ vars.SNOWFLAKE_SYNAPSE_DATA_WAREHOUSE_DATABASE }}
SNOWFLAKE_SYNAPSE_STAGE_STORAGE_INTEGRATION: ${{ vars.SNOWFLAKE_SYNAPSE_STAGE_STORAGE_INTEGRATION }}
SNOWFLAKE_SYNAPSE_STAGE_URL: ${{ vars.SNOWFLAKE_SYNAPSE_STAGE_URL }}
CLONE_NAME: "${{ vars.SNOWFLAKE_SYNAPSE_DATA_WAREHOUSE_DATABASE }}_${{ github.head_ref }}"
STACK: ${{ vars.STACK }}

steps:

- uses: actions/checkout@v4
- uses: actions/setup-python@v4
with:
python-version: '3.10'

- name: Install python libraries
shell: bash
run: |
pip install schemachange==3.6.1
pip install numpy==1.26.4
pip install pandas==1.5.3

- name: Install SnowSQL
run: |
curl -O https://sfc-repo.snowflakecomputing.com/snowsql/bootstrap/1.2/linux_x86_64/snowsql-1.2.9-linux_x86_64.bash
SNOWSQL_DEST=~/bin SNOWSQL_LOGIN_SHELL=~/.profile bash snowsql-1.2.9-linux_x86_64.bash

- name: Verify SnowSQL installation
run: ~/bin/snowsql -v

- name: Sanitize Clone Name
run: |
CLONE_NAME_SANITIZED="${CLONE_NAME//[^a-zA-Z0-9_]/_}"
echo "Clone name has been updated! The clone name will be: ${CLONE_NAME_SANITIZED}"
echo "SNOWFLAKE_SYNAPSE_DATA_WAREHOUSE_DATABASE=${CLONE_NAME_SANITIZED}" >> $GITHUB_ENV

- name: Zero-copy clone the database
shell: bash
run: |
~/bin/snowsql -r $SNOWSQL_CLONE_ROLE -q "CREATE OR REPLACE DATABASE $SNOWFLAKE_SYNAPSE_DATA_WAREHOUSE_DATABASE CLONE $SNOWFLAKE_SYNAPSE_DATA_WAREHOUSE_DATABASE_ORIG;"
Contributor commented:

Some interesting things I learned while reading the docs on cloning:

  • A cloned table does not include the load history of the source table. One consequence of this is that data files that were loaded into a source table can be loaded again into its clones.
  • Cloning a database or schema does not clone objects of the following types in the database or schema: External tables, Internal (Snowflake) stages
  • The "Object references" section describes how objects which reference a fully-qualified name (like my_database.my_schema.my_table) also reference the fully-qualified name in their clone. This could potentially be problematic if someone has defined in the production database a view (for example) which references a stage by its fully-qualified name.

thomasyu888 (Member) replied on Nov 1, 2024:

Thanks for taking the time to look into these things!

  • Re - the loading of source tables: This is ok, the dev warehouse has small enough data that if we were ever to mess with the raw snapshot scheduled task pipeline, it would be ok to pull in more data
  • Re - the not cloned objects: That's ok, we currently do not have any external or internal snowflake stages.
  • Re - the last point: could you expand with an example?
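As a hypothetical illustration of the fully-qualified-name concern (database, schema, and object names here are invented for the example, not taken from the repository):

```sql
-- A view in the source database defined with a fully-qualified reference:
CREATE VIEW synapse_data_warehouse_dev.analytics.recent_events AS
SELECT * FROM synapse_data_warehouse_dev.raw.events;

-- After: CREATE DATABASE my_clone CLONE synapse_data_warehouse_dev;
-- the clone's copy of this view still reads from
-- synapse_data_warehouse_dev.raw.events, NOT my_clone.raw.events,
-- so "testing" queries against the clone's view would hit production data.
```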


- name: Run schemachange on the clone
shell: bash
run: |
schemachange \
-f synapse_data_warehouse \
-a $SNOWSQL_ACCOUNT \
-u $SNOWSQL_USER \
-r $SNOWSQL_SCHEMACHANGE_ROLE \
-w $SNOWSQL_WAREHOUSE \
--config-folder synapse_data_warehouse

drop_clone:
runs-on: ubuntu-latest
if: github.event.pull_request.merged == true
environment: dev
env:
SNOWSQL_PWD: ${{ secrets.SNOWSQL_PWD }}
SNOWSQL_ACCOUNT: ${{ secrets.SNOWSQL_ACCOUNT }}
SNOWSQL_USER: ${{ secrets.SNOWSQL_USER }}
SNOWSQL_WAREHOUSE: ${{ secrets.SNOWSQL_WAREHOUSE }}
SNOWSQL_CLONE_ROLE: DATA_ENGINEER
CLONE_NAME: "${{ vars.SNOWFLAKE_SYNAPSE_DATA_WAREHOUSE_DATABASE }}_${{ github.head_ref }}"

steps:

- uses: actions/checkout@v4
- uses: actions/setup-python@v4
with:
python-version: '3.10'

- name: Install SnowSQL
run: |
curl -O https://sfc-repo.snowflakecomputing.com/snowsql/bootstrap/1.2/linux_x86_64/snowsql-1.2.9-linux_x86_64.bash
SNOWSQL_DEST=~/bin SNOWSQL_LOGIN_SHELL=~/.profile bash snowsql-1.2.9-linux_x86_64.bash

- name: Verify SnowSQL installation
run: ~/bin/snowsql -v

- name: Sanitize Clone Name
run: |
CLONE_NAME_SANITIZED="${CLONE_NAME//[^a-zA-Z0-9_]/_}"
echo "SNOWFLAKE_SYNAPSE_DATA_WAREHOUSE_DATABASE=${CLONE_NAME_SANITIZED}" >> $GITHUB_ENV
echo "Clone name has been updated! The clone name will be: ${CLONE_NAME_SANITIZED}"

- name: Drop the clone
shell: bash
run: |
~/bin/snowsql -r $SNOWSQL_CLONE_ROLE -q "DROP DATABASE IF EXISTS $SNOWFLAKE_SYNAPSE_DATA_WAREHOUSE_DATABASE;"
104 changes: 104 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,104 @@
# Contributing Guidelines

Welcome, and thanks for your interest in contributing to the `snowflake` repository! :snowflake:

By contributing, you are agreeing that we may redistribute your work under this [license](https://github.com/Sage-Bionetworks/snowflake/tree/snow-90-auto-db-clone?tab=License-1-ov-file#).

## Getting Started

To start contributing, follow these steps to set up and develop on your local repository:

### 1. Clone the Repository

```bash
git clone https://github.com/Sage-Bionetworks/snowflake
```

### 2. Fetch the Latest `dev` Branch

After cloning, navigate to the repository directory:

```bash
cd snowflake
```

Then, fetch the latest updates from the `dev` branch to ensure you’re working with the latest codebase:

```bash
git fetch origin dev
```

### 3. Create a New Branch Off `dev`

Create and checkout your feature branch from the latest `dev` branch. Name it based on the Jira ticket number and your feature/fix. For example:

```bash
git checkout -b snow-123-new-feature origin/dev
```

Your branch will now be tracking `origin/dev`, which you can merge with or rebase onto if a merge conflict occurs. For more guidance
on how to resolve merge conflicts, [see here](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/addressing-merge-conflicts/about-merge-conflicts#resolving-merge-conflicts).

> [!IMPORTANT]
> If you plan to run the automated testing described in section [Running CI Jobs for Database Testing](#running-ci-jobs-for-database-testing), your branch name needs to start with `snow-`,
> otherwise the test deployment will fail.
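Before pushing, you can sanity-check your branch name locally. This is a hypothetical helper, not part of the repository:

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify a branch name uses the required "snow-" prefix.
check_branch_name() {
  case "$1" in
    snow-*) echo "ok" ;;
    *)      echo "invalid: branch name must start with 'snow-'" ;;
  esac
}

check_branch_name "snow-123-new-feature"   # prints: ok
check_branch_name "feature/new-thing"      # prints the invalid message
```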

### 4. Push to The Remote Branch

Once you've made your changes and committed them locally, push your branch to the remote repository:

```bash
git push origin snow-123-new-feature
```

### 5. Create a Draft Pull Request

In order to initiate automated testing you will need to work on a draft pull request (PR) on GitHub. After pushing your commits to
the remote branch in Step 4, use the GitHub UI to initiate a PR and convert it to draft mode.

## Running CI Jobs for Database Testing

This repository includes automated CI jobs to validate changes against a cloned database. If you want to trigger these jobs to test your changes in an isolated database environment, please follow the steps below:

### 1. Add the Label

Add the label `create_clone_and_run_schemachange` to your PR to trigger the CI workflow. This job does two things:

* Creates a zero-copy clone of the development database.
* Runs your proposed schema changes against the clone, verifying that your updates work correctly without
affecting the real development database. After the PR is merged, the clone is automatically dropped to free up resources.

> [!IMPORTANT]
> Your cloned database is a clone of the development database as it exists at the time of cloning. Please be mindful that **there may be
> other developers working on their own version of the development database, whose changes may not be reflected in your clone**. Keep an
> eye out for other PRs with the `create_clone_and_run_schemachange` label, and ensure that you are not performing changes on the
> same tables to avoid conflicts.

### 2. Perform Inspection using Snowsight

You can use Snowsight to manually inspect the schema changes in your cloned database. We recommend using a SQL worksheet for manual quality assurance queries, e.g. ensuring there is no row duplication in new or updated tables.
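For example, a duplication check might look like the following (the table and key column names are placeholders for your own objects):

```sql
-- Find key values that appear more than once in the table under test
SELECT id, COUNT(*) AS n_rows
FROM my_schema.my_new_table
GROUP BY id
HAVING COUNT(*) > 1;
-- An empty result means no duplicate keys were introduced.
```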

> [!TIP]
> Your database will be named after your feature branch so it's easy to find on Snowsight. For example, if your feature branch is called
> `snow-123-new-feature`, your database might be called `SYNAPSE_DATA_WAREHOUSE_DEV_SNOW_123_NEW_FEATURE`.
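The mapping from branch name to clone name can be sketched as follows, mirroring the sanitization the workflow applies (the development database name here is an assumption for illustration):

```shell
#!/usr/bin/env bash
# Sketch: how a feature branch name maps to a clone database name.
DB_ORIG="SYNAPSE_DATA_WAREHOUSE_DEV"   # assumed dev database name
BRANCH="snow-123-new-feature"

CLONE_NAME="${DB_ORIG}_${BRANCH}"
# Replace every character that is not alphanumeric or underscore with "_"
CLONE_NAME_SANITIZED="${CLONE_NAME//[^a-zA-Z0-9_]/_}"
# Snowflake stores unquoted identifiers in upper case
echo "${CLONE_NAME_SANITIZED^^}"   # SYNAPSE_DATA_WAREHOUSE_DEV_SNOW_123_NEW_FEATURE
```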

### 3. Manually Drop the Cloned Database (Optional)

There is a second job in the workflow (`drop_clone`) that drops your branch's database clone once your PR has been merged into `dev`.
In other words, once your cloned database is created for testing, it will persist until your PR is merged (unless you manually drop it).

An initial clone of the development database does not incur new storage costs. **However**, once a clone deviates from the original
(e.g. when new schema changes are applied for testing), the cloned database begins to incur costs for as long as it exists in our warehouse.
**Please be mindful of how long your PR stays open**, as cloned databases are not dropped until the PR is merged. For example, if your PR is open for more than a week, consider manually dropping your cloned database on Snowflake to avoid unnecessary cost.
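Dropping the clone is a single statement in a Snowsight worksheet (the database name below is an example following the naming convention above):

```sql
DROP DATABASE IF EXISTS SYNAPSE_DATA_WAREHOUSE_DEV_SNOW_123_NEW_FEATURE;
```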

> [!NOTE]
> Keep in mind that after dropping your cloned database, you will still have access to it through Snowflake's "Time Travel"
> feature. Your database is retained through "Time Travel" for a set retention period before it is permanently deleted. To see
> how long your database can be accessed after dropping, run the following query in a SQL worksheet on Snowsight and look
> for the keyword `DATA_RETENTION_TIME_IN_DAYS`:
>
> ```sql
> SHOW PARAMETERS IN DATABASE <your-database-name>;
> ```

Following these guidelines helps maintain a clean, efficient, and well-tested codebase. Thank you for contributing!
2 changes: 1 addition & 1 deletion README.md
@@ -54,7 +54,7 @@ pip install "snowflake-connector-python[pandas]" "synapseclient[pandas]" python-

## Contributing

WIP
For contribution guidelines, please see the `CONTRIBUTING.md` file in this repository.

## Visualizing with Streamlit
