Added deploy with modal. (#1805)
* Added deploy with modal.

* A few minor fixes

* updated links as per comment

* Updated as per the comments.

* Update docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-modal.md

* Updated

* Updated as per comments

* Updated

* minor fix for relative link

* Incorporated comments and new script provided.

* Added the snippets

* Updated

* Updated

* updated poetry.lock

* Updated "poetry.lock"

* Added "__init__.py"

* Updated snippets.py

* Updated path in Makefile

* Added __init__.py in walkthroughs

* Adjusted for black

* Modified mypy.ini: added the pattern module_name_pattern = '[a-zA-Z0-9_\-]+'

* updated

* renamed deploy-a-pipeline to deploy_a_pipeline

* Updated for errors in linting

* small changes

* bring back deploy-a-pipeline

* bring back deploy-a-pipeline in sidebar

* fix path to snippet

* update lock file

* fix path to snippet in tags

* fix Duplicate module named "snippets"

* rename snippets to code, refactor article, fix mypy errors

* fix black errors

* rename code to deploy_snippets

* add pytest testing for modal function

* move example article to the bottom

* update lock file

---------

Co-authored-by: Anton Burnashev <[email protected]>
Co-authored-by: Alena <[email protected]>
3 people authored Nov 7, 2024
1 parent 0c6fd65 commit f5a64be
Showing 7 changed files with 562 additions and 6 deletions.
1 change: 0 additions & 1 deletion Makefile
@@ -75,7 +75,6 @@ lint-and-test-examples:
 	poetry run mypy --config-file mypy.ini docs/examples
 	cd docs/examples && poetry run pytest
 
-
 test-examples:
 	cd docs/examples && poetry run pytest
 
113 changes: 113 additions & 0 deletions docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-modal.md
@@ -0,0 +1,113 @@
---
title: Deploy with Modal
description: How to deploy a pipeline with Modal
keywords: [how to, deploy a pipeline, Modal]
canonical: https://modal.com/blog/analytics-stack
---

# Deploy with Modal

## Introduction to Modal

[Modal](https://modal.com/) is a serverless platform designed for developers. It allows you to run and deploy code in the cloud without managing infrastructure.

With Modal, you can perform tasks like running generative models, large-scale batch jobs, and job queues, all while easily scaling compute resources.

### Modal features

- Serverless Compute: No infrastructure management; scales automatically from zero to thousands of CPUs/GPUs.
- Cloud Functions: Run Python code in the cloud instantly and scale horizontally.
- GPU/CPU Scaling: Easily attach GPUs for heavy tasks like AI model training with a single line of code.
- Web Endpoints: Expose any function as an HTTPS API endpoint quickly.
- Scheduled Jobs: Convert Python functions into scheduled tasks effortlessly.

To learn more, please refer to [Modal's documentation](https://modal.com/docs).
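
To give a flavor of the API, here is a minimal, hypothetical Modal app (the app name, image contents, and function are illustrative, not part of this walkthrough) that runs a Python function in the cloud on an hourly schedule:

```py
import modal

# Build the container image the function will run in
image = modal.Image.debian_slim().pip_install("requests")

app = modal.App("hello-modal", image=image)


@app.function(schedule=modal.Period(hours=1))
def hello() -> None:
    # Runs in a cloud container on a schedule, with no servers to manage
    print("Hello from Modal!")
```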


## How to run dlt on Modal

Here’s a dlt project set up to copy data from a public MySQL database into DuckDB as a destination:

### Step 1: Initialize source
Run the `dlt init` CLI command to initialize the SQL database source and set up the `sql_database_pipeline.py` template.
```sh
dlt init sql_database duckdb
```
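
This generates, among other files, the `sql_database_pipeline.py` script and a `.dlt` folder with `config.toml` and `secrets.toml` for configuration and credentials.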

### Step 2: Define Modal Image
Open `sql_database_pipeline.py` and define the Modal Image you want to run `dlt` in:
<!--@@@DLT_SNIPPET ./deploy_snippets/deploy-with-modal-snippets.py::modal_image-->

### Step 3: Define Modal Function
A Modal Function is a containerized environment that runs tasks.
It can run on a schedule (e.g., daily or via a cron expression), request more CPU/memory, and scale out across
multiple containers.

Here’s how to include your SQL pipeline in the Modal Function:

<!--@@@DLT_SNIPPET ./deploy_snippets/deploy-with-modal-snippets.py::modal_function-->

### Step 4: Set up credentials
You can securely store your credentials using Modal secrets. When you reference a secret within a Modal script,
it is automatically set as an environment variable. dlt natively supports environment variables, so your
credentials integrate seamlessly. For example, to declare a connection string, define it as follows:
```text
SOURCES__SQL_DATABASE__CREDENTIALS=mysql+pymysql://[email protected]:4497/Rfam
```
When the Modal Function runs, dlt automatically picks up the credentials from this environment variable.
For more details, please refer to the [documentation](../../general-usage/credentials/setup#environment-variables).
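
For example, assuming the `sql-secret` name used by the snippet in this walkthrough, you could create the secret with the Modal CLI (the value shown is the public demo database; substitute your own):

```sh
modal secret create sql-secret \
    "SOURCES__SQL_DATABASE__CREDENTIALS=mysql+pymysql://[email protected]:4497/Rfam"
```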

### Step 5: Run pipeline
To run your pipeline a single time, use the following command:
```sh
modal run sql_database_pipeline.py
```

### Step 6: Deploy
To deploy your pipeline on Modal for continuous execution or scheduling, use this command:
```sh
modal deploy sql_database_pipeline.py
```
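
Once deployed, any schedule attached to the function (for example, `modal.Period(days=1)` in the Modal Function snippet above) triggers runs automatically, and you can monitor executions from the Modal dashboard.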

## Advanced configuration
### Modal Proxy

If your database sits in a private network, you can use [Modal Proxy](https://modal.com/docs/reference/modal.Proxy) as a bastion server (available to Enterprise customers).
To connect to a production read replica, attach the proxy to the function definition and change the hostname to localhost:
```py
@app.function(
    secrets=[
        modal.Secret.from_name("postgres-read-replica-prod"),
    ],
    schedule=modal.Cron("24 6 * * *"),
    proxy=modal.Proxy.from_name("prod-postgres-proxy", environment_name="main"),
    timeout=3000,
)
def task_pipeline(dev: bool = False) -> None:
    pg_url = f'postgresql://{os.environ["PGUSER"]}:{os.environ["PGPASSWORD"]}@localhost:{os.environ["PGPORT"]}/{os.environ["PGDATABASE"]}'
```
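
The connection string can then be handed to the source explicitly. Below is a sketch of the rest of the function body; the pipeline and dataset names are illustrative, and it assumes a Postgres driver such as `psycopg2` is installed in the image:

```py
    import dlt
    from dlt.sources.sql_database import sql_database

    # Pass the proxied connection string straight to the source
    source = sql_database(credentials=pg_url).with_resources("task", "worker")

    pipeline = dlt.pipeline(
        pipeline_name="pg_replica_pipeline",
        destination="duckdb",
        dataset_name="pg_replica_data",
    )
    print(pipeline.run(source))
```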

### Capturing deletes
To capture updates or deleted rows from your Postgres database, consider dlt's [Postgres CDC replication feature](../../dlt-ecosystem/verified-sources/pg_replication), which tracks changes and deletions in the data.

### Sync multiple tables in parallel
To sync multiple tables in parallel, map each table copy job to a separate container using [`starmap`](https://modal.com/docs/reference/modal.Function#starmap):

```py
@app.function(timeout=3000, schedule=modal.Cron("29 11 * * *"))
def main(dev: bool = False):
    tables = [
        ("task", "enqueued_at", dev),
        ("worker", "launched_at", dev),
        ...
    ]
    list(load_table_from_database.starmap(tables))
```
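
`load_table_from_database` is not defined in the snippet above. A hypothetical implementation, taking the table name and an incremental cursor column as arguments, might look like this:

```py
@app.function(timeout=3000)
def load_table_from_database(table: str, incremental_key: str, dev: bool = False) -> None:
    import dlt
    from dlt.sources.sql_database import sql_database

    # Load a single table, incrementally on the given cursor column
    source = sql_database().with_resources(table)
    source.resources[table].apply_hints(incremental=dlt.sources.incremental(incremental_key))

    pipeline = dlt.pipeline(
        pipeline_name=f"load_{table}",
        destination="duckdb",
        dataset_name="parallel_load_data",
    )
    print(pipeline.run(source))
```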

## More examples

For a practical, real-world example, check out the article ["Building a Cost-Effective Analytics Stack with Modal, dlt, and dbt"](https://modal.com/blog/analytics-stack).

This article illustrates how to automate a workflow for loading data from Postgres into Snowflake using dlt, providing valuable insights into building an efficient analytics pipeline.
docs/website/docs/walkthroughs/deploy-a-pipeline/deploy_snippets/__init__.py
Empty file.
68 changes: 68 additions & 0 deletions docs/website/docs/walkthroughs/deploy-a-pipeline/deploy_snippets/deploy-with-modal-snippets.py
@@ -0,0 +1,68 @@
import os

import modal

from tests.pipeline.utils import assert_load_info

# @@@DLT_SNIPPET_START modal_image
# Define the Modal Image
image = modal.Image.debian_slim().pip_install(
    "dlt>=1.1.0",
    "dlt[duckdb]",  # destination
    "dlt[sql_database]",  # source (MySQL)
    "pymysql",  # database driver for the MySQL source
)

app = modal.App("example-dlt", image=image)

# Modal Volume used to store the duckdb database file
vol = modal.Volume.from_name("duckdb-vol", create_if_missing=True)
# @@@DLT_SNIPPET_END modal_image


# @@@DLT_SNIPPET_START modal_function
@app.function(
    volumes={"/data/": vol},
    schedule=modal.Period(days=1),
    secrets=[modal.Secret.from_name("sql-secret")],
)
def load_tables() -> None:
    import dlt
    from dlt.sources.sql_database import sql_database

    # Define the source database credentials; in production, you would save this
    # as a Modal Secret, which can be referenced here as an environment variable
    os.environ["SOURCES__SQL_DATABASE__CREDENTIALS"] = (
        "mysql+pymysql://[email protected]:4497/Rfam"
    )
    # Load tables "family" and "genome"
    source = sql_database().with_resources("family", "genome")

    # Create the dlt pipeline object
    pipeline = dlt.pipeline(
        pipeline_name="sql_to_duckdb_pipeline",
        destination=dlt.destinations.duckdb(
            "/data/rfam.duckdb"
        ),  # write the duckdb database file to this location, which is mounted to the Modal Volume
        dataset_name="sql_to_duckdb_pipeline_data",
        progress="log",  # output the progress of the pipeline
    )

    # Run the pipeline
    load_info = pipeline.run(source)

    # Print run statistics
    print(load_info)
    # @@@DLT_SNIPPET_END modal_function

    assert_load_info(load_info)


def test_modal_snippet() -> None:
    import pytest
    from modal.exception import ExecutionError

    # Calling .remote() without a running Modal App raises an ExecutionError
    with pytest.raises(ExecutionError) as excinfo:
        load_tables.remote()
    # >> modal.exception.ExecutionError:
    # >> Function has not been hydrated with the metadata it needs to run on Modal,
    # >> because the App it is defined on is not running.
    assert "hydrated" in str(excinfo.value)
1 change: 1 addition & 0 deletions docs/website/sidebars.js
@@ -283,6 +283,7 @@ const sidebars = {
         'walkthroughs/deploy-a-pipeline/deploy-with-kestra',
         'walkthroughs/deploy-a-pipeline/deploy-with-dagster',
         'walkthroughs/deploy-a-pipeline/deploy-with-prefect',
+        'walkthroughs/deploy-a-pipeline/deploy-with-modal',
       ]
     },
     {
poetry.lock
Large diffs are not rendered by default.
