Update documentation with API information and Tutorials #771

Merged
merged 11 commits on Mar 10, 2023
2 changes: 1 addition & 1 deletion docs/_templates/base.html
@@ -3,4 +3,4 @@
<script async src="https://www.googletagmanager.com/gtag/js?id=G-DPRKFYT3T9"></script>
<script>window.dataLayer = window.dataLayer || [];function gtag(){dataLayer.push(arguments);}gtag('js', new Date());gtag('config', 'G-DPRKFYT3T9');</script>
{{ super() }}
{% endblock %}
{% endblock %}
2 changes: 1 addition & 1 deletion docs/faq.md
@@ -42,7 +42,7 @@ Using the cloud provider's console, verify the bucket exists. If so, ensure that
To enable support for requester pays buckets, run `skyplane config set requester_pays true`.
```

# How can I switch between GCP projects?
## How can I switch between GCP projects?
We recommend re-setting GCP credentials locally by running `rm -r ~/.config/gcloud`, then re-running `gcloud auth application-default login`. You can then set the project ID you want with `gcloud config set project <PROJECT_ID>`. Once you've updated authentication and the project, you can run `skyplane init --reinit-gcp`.
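
For reference, the full reset sequence from above (substitute your own project ID):
```bash
# remove cached GCP credentials, then re-authenticate
rm -r ~/.config/gcloud
gcloud auth application-default login

# point gcloud at the project you want Skyplane to use
gcloud config set project <PROJECT_ID>

# re-initialize Skyplane's GCP configuration
skyplane init --reinit-gcp
```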

If you get an error saying `Compute Engine API has not been used in project 507282715251 before or it is disabled`, wait a few minutes for the API enablement to take effect and re-run `skyplane init`.
44 changes: 18 additions & 26 deletions docs/index.rst
@@ -7,54 +7,38 @@ Welcome to Skyplane!
<iframe src="https://ghbtns.com/github-btn.html?user=skyplane-project&repo=skyplane&type=star&count=true&size=large" frameborder="0" scrolling="0" width="170" height="30" title="GitHub"></iframe>

.. note::

This project is under active development.

**🔥 Blazing fast bulk data transfers between any cloud 🔥**

Skyplane is a tool for blazingly fast bulk data transfers in the cloud. Skyplane manages parallelism, data partitioning, and network paths to optimize data transfers, and can also spin up VM instances to increase transfer throughput.

You can use Skyplane to transfer data:
* Between buckets within a cloud provider
* Between object stores across multiple cloud providers
* (experimental) Between local storage and cloud object stores

Copy a large dataset in the cloud in a minute, not hours:

.. code-block:: bash

$ pip install skyplane[aws]
$ skyplane init
$ skyplane [sync/cp] [local/s3/gs/azure]://mybucket/big_dataset [local/s3/gs/azure]://mybucket2/


Skyplane supports copying data between any major public cloud:

.. image:: /_static/supported-destinations.png
.. include:: summary.md
:parser: myst_parser.sphinx_

Contents
--------
---------


.. toctree::
:maxdepth: 2
:caption: Overview

installation
quickstart
benchmark
configure
architecture
performance_stats_collection
faq

.. toctree::
:maxdepth: 2
:caption: Tutorials

tutorial_dataloader
tutorial_airflow

.. toctree::
:maxdepth: 4
:caption: Developer documentation

build_from_source
contributing
roadmap
debugging

.. toctree::
@@ -64,6 +48,14 @@ Contents
skyplane_api
skyplane_cli

.. toctree::
:maxdepth: 2
:caption: Learn More

benchmark
performance_stats_collection
roadmap

.. toctree::
:caption: Community

26 changes: 3 additions & 23 deletions docs/quickstart.rst → docs/installation.rst
@@ -1,9 +1,7 @@
***************
Getting Started
Installation
***************

Installation
-----------------------
We're ready to install Skyplane. It's as easy as:

.. code-block:: bash
@@ -25,8 +23,8 @@ We're ready to install Skyplane. It's as easy as:

$ GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=1 GRPC_PYTHON_BUILD_SYSTEM_ZLIB=1 pip install skyplane[all]

Cloud Credentials
-----------------------
Setting up Cloud Credentials
-----------------------------
Skyplane needs access to cloud credentials to perform transfers. To get started with setting up credentials, make sure you have cloud provider CLI tools installed:

.. code-block:: bash
@@ -59,21 +57,3 @@ Now, you can initialize Skyplane with your desired cloud providers. Skyplane aut

---> Setup cloud provider connectors:
$ skyplane init


Transferring Data
-------------------

We're ready to use Skyplane! Let's use `skyplane cp` to copy files from AWS to GCP:

.. code-block:: bash

---> 🎸 Ready to rock and roll! Copy some files:
$ skyplane cp -r s3://... gs://...

To transfer only new objects, you can instead use `skyplane sync`:

.. code-block:: bash

---> Copy only diff
$ skyplane sync s3://... gs://...
56 changes: 56 additions & 0 deletions docs/quickstart.md
@@ -0,0 +1,56 @@
# Quickstart

## CLI
The simplest way to run transfers with Skyplane is to use the CLI. To transfer files from AWS to GCP, you can run:
```bash
skyplane cp -r s3://... gs://...
```
You can also sync directories to avoid copying data that is already in the destination location:
```bash
skyplane sync s3://... gs://...
```


## Python API
You can also run Skyplane from the Python API client. To copy a single object or folder, you can run:
```python
import skyplane

client = skyplane.SkyplaneClient()
client.copy(src="s3://bucket-src/key", dst="s3://bucket-dst/key", recursive=False)
```
This will create a Skyplane dataplane (i.e., a cluster of VMs), execute the transfer, and tear down the cluster upon completion.

You can also execute multiple transfers on the same dataplane to reduce overhead from VM startup time. To do this, you can define a dataplane object and provision it:
```python
dp = client.dataplane("aws", "us-east-1", "aws", "us-east-2", n_vms=8)
dp.provision()
```
This will create a dataplane for transfers between `us-east-1` and `us-east-2` with 8 VMs per region. Now, we can queue transfer jobs in this dataplane:
```python
# queue transfer
dp.queue_copy("s3://bucket1/key1", "s3://bucket2/key1")
dp.queue_copy("s3://bucket1/key2", "s3://bucket2/key2")

# execute transfer
tracker = dp.run_async()

# monitor transfer status
remaining_bytes = tracker.query_bytes_remaining()
```
The queued transfer won't run until you call `dp.run()` or `dp.run_async()`. Once you start the transfer, you can monitor it with the returned `tracker` object. Once the transfer is complete, make sure to deprovision the dataplane to avoid incurring cloud costs:
```python
# tear down the dataplane
dp.deprovision()
```
You can also have Skyplane deprovision automatically by using `dp.auto_deprovision()`:
```python
with dp.auto_deprovision():
    dp.provision()
    dp.queue_copy(...)
    tracker = dp.run_async()
```
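Putting these pieces together, here is a minimal end-to-end sketch. The bucket names are hypothetical, and the completion check assumes `query_bytes_remaining()` returns `0` once all queued transfers finish (as in the Airflow tutorial):
```python
import time

import skyplane

client = skyplane.SkyplaneClient()
dp = client.dataplane("aws", "us-east-1", "aws", "us-east-2", n_vms=8)
with dp.auto_deprovision():
    dp.provision()
    dp.queue_copy("s3://bucket1/key1", "s3://bucket2/key1")
    tracker = dp.run_async()

    # poll until all queued bytes have been transferred
    while True:
        bytes_remaining = tracker.query_bytes_remaining()
        if bytes_remaining == 0:
            break
        time.sleep(1)
    tracker.join()
```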
Now you can programmatically transfer terabytes of data across clouds! To see some examples of applications you can build with the API, you can check out our tutorials on how to [load training data from another region](tutorial_dataloader.md) and [build an Airflow operator](tutorial_airflow.md).



30 changes: 30 additions & 0 deletions docs/summary.md
@@ -0,0 +1,30 @@
**🔥 Blazing fast bulk data transfers between any cloud 🔥**

```bash
pip install skyplane[aws]
skyplane init
skyplane [sync/cp] [local/s3/gs/azure]://mybucket/big_dataset [local/s3/gs/azure]://mybucket2/
```

Skyplane is a tool for blazingly fast bulk data transfers between object stores in the cloud. It provisions a fleet of VMs in the cloud to transfer data in parallel while using compression and bandwidth tiering to reduce cost.

Skyplane is:
1. 🔥 Blazing fast ([110x faster than AWS DataSync](https://skyplane.org/en/latest/benchmark.html))
2. 🤑 Cheap (4x cheaper than rsync)
3. 🌐 Universal (AWS, Azure and GCP)

You can use Skyplane to transfer data:
* between object stores within a cloud provider (e.g. AWS us-east-1 to AWS us-west-2)
* between object stores across multiple cloud providers (e.g. AWS us-east-1 to GCP us-central1)
* between local storage and cloud object stores (experimental)

Skyplane currently supports the following source and destination endpoints (any source and destination can be combined):

| Endpoint | Source | Destination |
|--------------------|--------------------|--------------------|
| AWS S3 | ✅ | ✅ |
| Google Storage | ✅ | ✅ |
| Azure Blob Storage | ✅ | ✅ |
| Local Disk | ✅ | (in progress) |
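
For example, any pairing from the table can be expressed with the same `skyplane cp` command. The bucket names below are hypothetical, and the Azure path follows the `azure://` form used above:

```bash
# AWS S3 -> Google Storage
skyplane cp -r s3://mybucket/big_dataset gs://mybucket2/big_dataset

# Google Storage -> Azure Blob Storage
skyplane cp -r gs://mybucket2/big_dataset azure://mybucket3/big_dataset
```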

Skyplane is an actively developed project. It will have 🔪 SHARP EDGES 🔪. Please file an issue or ask the contributors via [the #help channel on our Slack](https://join.slack.com/t/skyplaneworkspace/shared_invite/zt-1cxmedcuc-GwIXLGyHTyOYELq7KoOl6Q) if you encounter bugs.
78 changes: 78 additions & 0 deletions docs/tutorial_airflow.md
@@ -0,0 +1,78 @@
# Creating an Airflow Operator

Skyplane can be easily incorporated into an Airflow DAG using a `SkyplaneOperator`, which can be used in data transfer tasks, for example in place of the `S3ToGCSOperator`. The following example demonstrates a data analytics workflow where data is transferred from S3 to GCS to build a BigQuery dataset and then used in a PySpark data analysis job.

![airflow](_static/api/airflow.png)

In this tutorial, we extend Airflow's `BaseOperator` object to create a custom Skyplane operator, called `SkyplaneOperator`. We first define the fields of the `SkyplaneOperator`:
```python
from typing import Optional

import skyplane
from airflow.models import BaseOperator  # type: ignore


class SkyplaneOperator(BaseOperator):
    template_fields = (
        "src_provider",
        "src_bucket",
        "src_region",
        "dst_provider",
        "dst_bucket",
        "dst_region",
        "config_path",
    )

    def __init__(
        self,
        src_provider: str,
        src_bucket: str,
        src_region: str,
        dst_provider: str,
        dst_bucket: str,
        dst_region: str,
        aws_config: Optional[skyplane.AWSConfig] = None,
        gcp_config: Optional[skyplane.GCPConfig] = None,
        azure_config: Optional[skyplane.AzureConfig] = None,
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)
        self.src_provider = src_provider
        self.src_bucket = src_bucket
        self.src_region = src_region
        self.dst_provider = dst_provider
        self.dst_bucket = dst_bucket
        self.dst_region = dst_region
        self.aws_config = aws_config
        self.gcp_config = gcp_config
        self.azure_config = azure_config

    def execute(self, context):
        pass
```
Inside the `execute` function, we can instantiate a Skyplane client to create a dataplane and execute transfers:
```python
def execute(self, context):
    client = skyplane.SkyplaneClient(aws_config=self.aws_config, gcp_config=self.gcp_config, azure_config=self.azure_config)
    dp = client.dataplane(self.src_provider, self.src_region, self.dst_provider, self.dst_region, n_vms=1)
    with dp.auto_deprovision():
        dp.provision()
        dp.queue_copy(self.src_bucket, self.dst_bucket, recursive=True)
        tracker = dp.run_async()
```
We can also add reporting on the transfer:
```python
import time

with dp.auto_deprovision():
    ...
    print("Waiting for transfer to complete...")
    while True:
        bytes_remaining = tracker.query_bytes_remaining()
        if bytes_remaining is None:
            print("Transfer not yet started")
        elif bytes_remaining > 0:
            print(f"{(bytes_remaining / (2 ** 30)):.2f}GB left")
        else:
            break
        time.sleep(1)
    tracker.join()
    print("Transfer complete!")
```
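With the operator defined, here is a minimal sketch of how it might be wired into a DAG. The bucket names, regions, provider strings, and scheduling settings are hypothetical:
```python
from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="skyplane_s3_to_gcs",
    start_date=datetime(2023, 3, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # single task that runs the Skyplane transfer when the DAG is triggered
    transfer = SkyplaneOperator(
        task_id="transfer_imagenet",
        src_provider="aws",
        src_bucket="s3://my-src-bucket/imagenet",
        src_region="us-east-1",
        dst_provider="gcp",
        dst_bucket="gs://my-dst-bucket/imagenet",
        dst_region="us-central1",
    )
```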
86 changes: 86 additions & 0 deletions docs/tutorial_dataloader.md
@@ -0,0 +1,86 @@
# Faster Training Data Loading

This tutorial explains how you can use the Skyplane API to quickly download data from an object store located in a different region or cloud than your training instance. See the full workflow [here](https://github.com/skyplane-project/skyplane/tree/main/examples).

Large-scale machine learning (ML) training typically includes a step for acquiring training data. The following example illustrates an ML workflow where the original ImageNet data is stored in an S3 bucket in the US-East-1 region.

![imagenet_training](_static/api/imagenet.png)

## Remote vs. Local Regions
Say that you have a VM for running training jobs in an AWS region, `us-west-2`. Reading data from a same-region S3 bucket will be very fast and free. However, if your data is in another region or cloud provider, reading the data will be much slower and will also incur per-GB egress fees. In this tutorial, we assume that our data is in a bucket in `us-east-1` (the remote region), but we are running training from another region, `us-west-2` (the local region).


## Reading data from S3
Directly reading data from S3 can be convenient, since it avoids having to download your entire dataset before starting to train. In this tutorial, we create an `ImageNetS3` dataset that extends AWS's `S3IterableDataset` object.

```python
import io

import skyplane
import torch
import torchvision.transforms as transforms
from PIL import Image
from torch.utils.data import IterableDataset, DataLoader
from awsio.python.lib.io.s3.s3dataset import S3IterableDataset


class ImageNetS3(IterableDataset):
    def __init__(self, url_list, shuffle_urls=True):
        self.s3_iter_dataset = S3IterableDataset(url_list, shuffle_urls)
        self.transform = transforms.Compose(
            [
                transforms.RandomResizedCrop(224),
                transforms.RandomHorizontalFlip(),
                transforms.ToTensor(),
                transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
            ]
        )

    def data_generator(self):
        try:
            while True:
                # Based on the alphabetical order of files, the label is fetched before the image,
                # e.g. for files 0186304.cls and 0186304.jpg, 0186304.cls will be fetched first
                _, label_fobj = next(self.s3_iter_dataset_iterator)
                _, image_fobj = next(self.s3_iter_dataset_iterator)
                label = int(label_fobj)
                image_np = Image.open(io.BytesIO(image_fobj)).convert("RGB")

                # Apply torchvision transforms if provided
                if self.transform is not None:
                    image_np = self.transform(image_np)
                yield image_np, label

        except StopIteration:
            return

    def __iter__(self):
        # __iter__ is required by IterableDataset: set up the underlying S3 iterator,
        # then yield (image, label) pairs from data_generator
        self.s3_iter_dataset_iterator = iter(self.s3_iter_dataset)
        return self.data_generator()
```
We can create a data loader with the data located in our remote bucket:
```python
remote_bucket_url = "s3://us-east-1-bucket"
data_urls = [
    (remote_bucket_url + "/" if not remote_bucket_url.endswith("/") else remote_bucket_url) + f"imagenet-train-{i:06d}.tar"
    for i in range(100)
]
dataset = ImageNetS3(data_urls)
train_loader = DataLoader(dataset, batch_size=256, num_workers=2)
```
However, the latency of this dataloader will be very high and likely degrade training performance.

## Transferring Data with Skyplane
We can improve our data loader's performance by transferring the data to a local region first. We can do this by running:
```python
local_bucket_url = "s3://us-west-2-bucket"

# Step 1: Create a Skyplane API client. It will read your AWS credentials from the AWS CLI by default
client = skyplane.SkyplaneClient(aws_config=skyplane.AWSConfig())

# Step 2: Copy the data from the remote bucket to the local bucket.
client.copy(src=remote_bucket_url, dst=local_bucket_url, recursive=True)
```
Once the copy completes, the following code will read the training data from the local bucket with low latency and no egress cost:
```python
data_urls = [
    (local_bucket_url + "/" if not local_bucket_url.endswith("/") else local_bucket_url) + f"imagenet-train-{i:06d}.tar"
    for i in range(100)
]
dataset = ImageNetS3(data_urls)
train_loader = DataLoader(dataset, batch_size=256, num_workers=2)
```
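As a usage sketch, you can then iterate over the loader as in any PyTorch training loop. The batch shape shown in the comment assumes the transforms and batch size defined above:
```python
# iterate a few batches to sanity-check the local-region data loader
for step, (images, labels) in enumerate(train_loader):
    print(step, images.shape, labels.shape)  # e.g. torch.Size([256, 3, 224, 224])
    if step == 2:
        break
```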