Added deploy with modal. (#1805)
* Added deploy with modal.

* A few minor fixes

* updated links as per comment

* Updated as per the comments.

* Update docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-modal.md

* Updated

* Updated as per comments

* Updated

* minor fix for relative link

* Incorporated comments and new script provided.

* Added the snippets

* Updated

* Updated

* updated poetry.lock

* Updated "poetry.lock"

* Added "__init__.py"

* Updated snippets.py

* Updated path in Makefile

* Added __init__.py in walkthroughs

* Adjusted for black

* Modified mypy.ini: added the pattern module_name_pattern = '[a-zA-Z0-9_\-]+'

* updated

* renamed deploy-a-pipeline to deploy_a_pipeline

* Updated for errors in linting

* small changes

* bring back deploy-a-pipeline

* bring back deploy-a-pipeline in sidebar

* fix path to snippet

* update lock file

* fix path to snippet in tags

* fix Duplicate module named "snippets"

* rename snippets to code, refactor article, fix mypy errors

* fix black errors

* rename code to deploy_snippets

* add pytest testing for modal function

* move example article to the bottom

* update lock file

---------

Co-authored-by: Anton Burnashev <[email protected]>
Co-authored-by: Alena <[email protected]>
3 people authored Nov 7, 2024
1 parent 0c6fd65 commit f5a64be
Showing 7 changed files with 562 additions and 6 deletions.
1 change: 0 additions & 1 deletion Makefile
@@ -75,7 +75,6 @@ lint-and-test-examples:
 	poetry run mypy --config-file mypy.ini docs/examples
 	cd docs/examples && poetry run pytest
 
-
 test-examples:
 	cd docs/examples && poetry run pytest
 
113 changes: 113 additions & 0 deletions docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-modal.md
@@ -0,0 +1,113 @@
---
title: Deploy with Modal
description: How to deploy a pipeline with Modal
keywords: [how to, deploy a pipeline, Modal]
canonical: https://modal.com/blog/analytics-stack
---

# Deploy with Modal

## Introduction to Modal

[Modal](https://modal.com/) is a serverless platform designed for developers. It allows you to run and deploy code in the cloud without managing infrastructure.

With Modal, you can perform tasks like running generative models, large-scale batch jobs, and job queues, all while easily scaling compute resources.

### Modal features

- Serverless Compute: No infrastructure management; scales automatically from zero to thousands of CPUs/GPUs.
- Cloud Functions: Run Python code in the cloud instantly and scale horizontally.
- GPU/CPU Scaling: Easily attach GPUs for heavy tasks like AI model training with a single line of code.
- Web Endpoints: Expose any function as an HTTPS API endpoint quickly.
- Scheduled Jobs: Convert Python functions into scheduled tasks effortlessly.

To learn more, please refer to [Modal's documentation](https://modal.com/docs).
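
To give a flavor of the API, here is a minimal, hypothetical Modal app (the app name, image contents, and function are illustrative, not part of this walkthrough) that runs a Python function in the cloud on an hourly schedule:

```py
import modal

# Build the container image the function will run in
image = modal.Image.debian_slim().pip_install("requests")

app = modal.App("hello-modal", image=image)


@app.function(schedule=modal.Period(hours=1))
def hello() -> None:
    # Runs in a cloud container on a schedule, with no servers to manage
    print("Hello from Modal!")
```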


## How to run dlt on Modal

Here’s a dlt project set up to copy data from a public MySQL database into DuckDB as a destination:

### Step 1: Initialize source
Run the `dlt init` CLI command to initialize the SQL database source and set up the `sql_database_pipeline.py` template.
```sh
dlt init sql_database duckdb
```
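
This generates, among other files, the `sql_database_pipeline.py` script and a `.dlt` folder with `config.toml` and `secrets.toml` for configuration and credentials.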

### Step 2: Define Modal Image
Open `sql_database_pipeline.py` and define the Modal Image you want to run `dlt` in:
<!--@@@DLT_SNIPPET ./deploy_snippets/deploy-with-modal-snippets.py::modal_image-->

### Step 3: Define Modal Function
A Modal Function is a containerized environment that runs tasks.
It can run on a schedule (e.g., daily or via a cron expression), request more CPU/memory, and scale out across
multiple containers.

Here’s how to include your SQL pipeline in the Modal Function:

<!--@@@DLT_SNIPPET ./deploy_snippets/deploy-with-modal-snippets.py::modal_function-->

### Step 4: Set up credentials
You can securely store your credentials using Modal secrets. When you reference a secret within a Modal script,
it is automatically set as an environment variable. dlt natively supports environment variables, so your
credentials integrate seamlessly. For example, to declare a connection string, define it as follows:
```text
SOURCES__SQL_DATABASE__CREDENTIALS=mysql+pymysql://[email protected]:4497/Rfam
```
When the Modal Function runs, dlt automatically picks up the credentials from this environment variable.
For more details, please refer to the [documentation](../../general-usage/credentials/setup#environment-variables).
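
For example, assuming the `sql-secret` name used by the snippet in this walkthrough, you could create the secret with the Modal CLI (the value shown is the public demo database; substitute your own):

```sh
modal secret create sql-secret \
    "SOURCES__SQL_DATABASE__CREDENTIALS=mysql+pymysql://[email protected]:4497/Rfam"
```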

### Step 5: Run pipeline
To run your pipeline a single time, use the following command:
```sh
modal run sql_database_pipeline.py
```

### Step 6: Deploy
To deploy your pipeline on Modal for continuous execution or scheduling, use this command:
```sh
modal deploy sql_database_pipeline.py
```
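
Once deployed, any schedule attached to the function (for example, `modal.Period(days=1)` in the Modal Function snippet above) triggers runs automatically, and you can monitor executions from the Modal dashboard.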

## Advanced configuration
### Modal Proxy

If your database sits in a private network, you can use [Modal Proxy](https://modal.com/docs/reference/modal.Proxy) as a bastion server (available to Enterprise customers).
To connect to a production read replica, attach the proxy to the function definition and change the hostname to localhost:
```py
@app.function(
    secrets=[
        modal.Secret.from_name("postgres-read-replica-prod"),
    ],
    schedule=modal.Cron("24 6 * * *"),
    proxy=modal.Proxy.from_name("prod-postgres-proxy", environment_name="main"),
    timeout=3000,
)
def task_pipeline(dev: bool = False) -> None:
    pg_url = f'postgresql://{os.environ["PGUSER"]}:{os.environ["PGPASSWORD"]}@localhost:{os.environ["PGPORT"]}/{os.environ["PGDATABASE"]}'
```
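
The connection string can then be handed to the source explicitly. Below is a sketch of the rest of the function body; the pipeline and dataset names are illustrative, and it assumes a Postgres driver such as `psycopg2` is installed in the image:

```py
    import dlt
    from dlt.sources.sql_database import sql_database

    # Pass the proxied connection string straight to the source
    source = sql_database(credentials=pg_url).with_resources("task", "worker")

    pipeline = dlt.pipeline(
        pipeline_name="pg_replica_pipeline",
        destination="duckdb",
        dataset_name="pg_replica_data",
    )
    print(pipeline.run(source))
```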

### Capturing deletes
To capture updates or deleted rows from your Postgres database, consider dlt's [Postgres CDC replication feature](../../dlt-ecosystem/verified-sources/pg_replication), which tracks changes and deletions in the data.

### Sync multiple tables in parallel
To sync multiple tables in parallel, map each table copy job to a separate container using [`starmap`](https://modal.com/docs/reference/modal.Function#starmap):

```py
@app.function(timeout=3000, schedule=modal.Cron("29 11 * * *"))
def main(dev: bool = False):
    tables = [
        ("task", "enqueued_at", dev),
        ("worker", "launched_at", dev),
        ...
    ]
    list(load_table_from_database.starmap(tables))
```
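
`load_table_from_database` is not defined in the snippet above. A hypothetical implementation, taking the table name and an incremental cursor column as arguments, might look like this:

```py
@app.function(timeout=3000)
def load_table_from_database(table: str, incremental_key: str, dev: bool = False) -> None:
    import dlt
    from dlt.sources.sql_database import sql_database

    # Load a single table, incrementally on the given cursor column
    source = sql_database().with_resources(table)
    source.resources[table].apply_hints(incremental=dlt.sources.incremental(incremental_key))

    pipeline = dlt.pipeline(
        pipeline_name=f"load_{table}",
        destination="duckdb",
        dataset_name="parallel_load_data",
    )
    print(pipeline.run(source))
```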

## More examples

For a practical, real-world example, check out the article ["Building a Cost-Effective Analytics Stack with Modal, dlt, and dbt"](https://modal.com/blog/analytics-stack).

This article illustrates how to automate a workflow for loading data from Postgres into Snowflake using dlt, providing valuable insights into building an efficient analytics pipeline.
docs/website/docs/walkthroughs/deploy-a-pipeline/deploy_snippets/__init__.py
Empty file.
68 changes: 68 additions & 0 deletions docs/website/docs/walkthroughs/deploy-a-pipeline/deploy_snippets/deploy-with-modal-snippets.py
@@ -0,0 +1,68 @@
import os

import modal

from tests.pipeline.utils import assert_load_info

# @@@DLT_SNIPPET_START modal_image
# Define the Modal Image
image = modal.Image.debian_slim().pip_install(
    "dlt>=1.1.0",
    "dlt[duckdb]",  # destination
    "dlt[sql_database]",  # source (MySQL)
    "pymysql",  # database driver for the MySQL source
)

app = modal.App("example-dlt", image=image)

# Modal Volume used to store the duckdb database file
vol = modal.Volume.from_name("duckdb-vol", create_if_missing=True)
# @@@DLT_SNIPPET_END modal_image


# @@@DLT_SNIPPET_START modal_function
@app.function(
    volumes={"/data/": vol},
    schedule=modal.Period(days=1),
    secrets=[modal.Secret.from_name("sql-secret")],
)
def load_tables() -> None:
    import dlt
    from dlt.sources.sql_database import sql_database

    # Define the source database credentials; in production, you would save this
    # as a Modal Secret, which can be referenced here as an environment variable
    os.environ["SOURCES__SQL_DATABASE__CREDENTIALS"] = (
        "mysql+pymysql://[email protected]:4497/Rfam"
    )
    # Load tables "family" and "genome"
    source = sql_database().with_resources("family", "genome")

    # Create the dlt pipeline object
    pipeline = dlt.pipeline(
        pipeline_name="sql_to_duckdb_pipeline",
        destination=dlt.destinations.duckdb(
            "/data/rfam.duckdb"
        ),  # write the duckdb database file to this location, which is mounted to the Modal Volume
        dataset_name="sql_to_duckdb_pipeline_data",
        progress="log",  # output the progress of the pipeline
    )

    # Run the pipeline
    load_info = pipeline.run(source)

    # Print run statistics
    print(load_info)
    # @@@DLT_SNIPPET_END modal_function

    assert_load_info(load_info)


def test_modal_snippet() -> None:
    import pytest
    from modal.exception import ExecutionError

    # Calling .remote() without a running Modal App raises an ExecutionError
    with pytest.raises(ExecutionError) as excinfo:
        load_tables.remote()
    # >> modal.exception.ExecutionError:
    # >> Function has not been hydrated with the metadata it needs to run on Modal,
    # >> because the App it is defined on is not running.
    assert "hydrated" in str(excinfo.value)
1 change: 1 addition & 0 deletions docs/website/sidebars.js
@@ -283,6 +283,7 @@ const sidebars = {
         'walkthroughs/deploy-a-pipeline/deploy-with-kestra',
         'walkthroughs/deploy-a-pipeline/deploy-with-dagster',
         'walkthroughs/deploy-a-pipeline/deploy-with-prefect',
+        'walkthroughs/deploy-a-pipeline/deploy-with-modal',
       ]
     },
     {
poetry.lock
Large diffs are not rendered by default.
