Update documentation with API information and Tutorials #771

Merged
merged 11 commits on Mar 10, 2023
2 changes: 1 addition & 1 deletion docs/_templates/base.html
@@ -3,4 +3,4 @@
<script async src="https://www.googletagmanager.com/gtag/js?id=G-DPRKFYT3T9"></script>
<script>window.dataLayer = window.dataLayer || [];function gtag(){dataLayer.push(arguments);}gtag('js', new Date());gtag('config', 'G-DPRKFYT3T9');</script>
{{ super() }}
{% endblock %}
{% endblock %}
2 changes: 1 addition & 1 deletion docs/faq.md
@@ -42,7 +42,7 @@ Using the cloud provider's console, verify the bucket exists. If so, ensure that
To enable support for requester pays buckets, run `skyplane config set requester_pays true`.
```

# How can I switch between GCP projects?
## How can I switch between GCP projects?
We recommend re-setting GCP credentials locally by running `rm -r ~/.config/gcloud`, then re-running `gcloud auth application-default login`. You can then set the project ID you want with `gcloud config set project <PROJECT_ID>`. Once you've updated authentication and the project, you can run `skyplane init --reinit-gcp`.
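
For reference, the full reset sequence from above (substitute your own project ID):
```bash
# remove cached GCP credentials, then re-authenticate
rm -r ~/.config/gcloud
gcloud auth application-default login

# point gcloud at the project you want Skyplane to use
gcloud config set project <PROJECT_ID>

# re-initialize Skyplane's GCP configuration
skyplane init --reinit-gcp
```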

If you get an error saying `Compute Engine API has not been used in project 507282715251 before or it is disabled`, wait a few minutes for the API enablement to take effect and re-run `skyplane init`.
44 changes: 18 additions & 26 deletions docs/index.rst
@@ -7,54 +7,38 @@ Welcome to Skyplane!
<iframe src="https://ghbtns.com/github-btn.html?user=skyplane-project&repo=skyplane&type=star&count=true&size=large" frameborder="0" scrolling="0" width="170" height="30" title="GitHub"></iframe>

.. note::

This project is under active development.

**🔥 Blazing fast bulk data transfers between any cloud 🔥**

Skyplane is a tool for blazingly fast bulk data transfers in the cloud. Skyplane manages parallelism, data partitioning, and network paths to optimize data transfers, and can also spin up VM instances to increase transfer throughput.

You can use Skyplane to transfer data:
* Between buckets within a cloud provider
* Between object stores across multiple cloud providers
* (experimental) Between local storage and cloud object stores

Copy a large dataset in the cloud in a minute, not hours:

.. code-block:: bash

$ pip install skyplane[aws]
$ skyplane init
$ skyplane [sync/cp] [local/s3/gs/azure]://mybucket/big_dataset [local/s3/gs/azure]://mybucket2/


Skyplane supports copying data between any major public cloud:

.. image:: /_static/supported-destinations.png
.. include:: summary.md
:parser: myst_parser.sphinx_

Contents
--------
---------


.. toctree::
:maxdepth: 2
:caption: Overview

installation
quickstart
benchmark
configure
architecture
performance_stats_collection
faq

.. toctree::
:maxdepth: 2
:caption: Tutorials

tutorial_dataloader
tutorial_airflow

.. toctree::
:maxdepth: 4
:caption: Developer documentation

build_from_source
contributing
roadmap
debugging

.. toctree::
@@ -64,6 +48,14 @@ Contents
skyplane_api
skyplane_cli

.. toctree::
:maxdepth: 2
:caption: Learn More

benchmark
performance_stats_collection
roadmap

.. toctree::
:caption: Community

26 changes: 3 additions & 23 deletions docs/quickstart.rst → docs/installation.rst
@@ -1,9 +1,7 @@
***************
Getting Started
Installation
***************

Installation
-----------------------
We're ready to install Skyplane. It's as easy as:

.. code-block:: bash
@@ -25,8 +23,8 @@ We're ready to install Skyplane. It's as easy as:

$ GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=1 GRPC_PYTHON_BUILD_SYSTEM_ZLIB=1 pip install skyplane[all]

Cloud Credentials
-----------------------
Setting up Cloud Credentials
-----------------------------
Skyplane needs access to cloud credentials to perform transfers. To get started with setting up credentials, make sure you have cloud provider CLI tools installed:

.. code-block:: bash
@@ -59,21 +57,3 @@ Now, you can initialize Skyplane with your desired cloud providers. Skyplane aut

---> Setup cloud provider connectors:
$ skyplane init


Transferring Data
-------------------

We're ready to use Skyplane! Let's use `skyplane cp` to copy files from AWS to GCP:

.. code-block:: bash

---> 🎸 Ready to rock and roll! Copy some files:
$ skyplane cp -r s3://... gs://...

To transfer only new objects, you can instead use `skyplane sync`:

.. code-block:: bash

---> Copy only diff
$ skyplane sync s3://... gs://...
56 changes: 56 additions & 0 deletions docs/quickstart.md
@@ -0,0 +1,56 @@
# Quickstart

## CLI
The simplest way to run transfers with Skyplane is to use the CLI. To transfer files from AWS to GCP, you can run:
```bash
skyplane cp -r s3://... gs://...
```
You can also sync directories to avoid copying data that is already in the destination location:
```bash
skyplane sync s3://... gs://...
```


## Python API
You can also run Skyplane from the Python API client. To copy a single object or folder, you can run:
```python
import skyplane

client = skyplane.SkyplaneClient()
client.copy(src="s3://bucket-src/key", dst="s3://bucket-dst/key", recursive=False)
```
This will create a Skyplane dataplane (i.e., a cluster of VMs), execute the transfer, and tear down the cluster upon completion.

You can also execute multiple transfers on the same dataplane to reduce overhead from VM startup time. To do this, you can define a dataplane object and provision it:
```python
dp = client.dataplane("aws", "us-east-1", "aws", "us-east-2", n_vms=8)
dp.provision()
```
This will create a dataplane for transfers between `us-east-1` and `us-east-2` with 8 VMs per region. Now, we can queue transfer jobs in this dataplane:
```python
# queue transfer
dp.queue_copy("s3://bucket1/key1", "s3://bucket2/key1")
dp.queue_copy("s3://bucket1/key2", "s3://bucket2/key2")

# execute transfer
tracker = dp.run_async()

# monitor transfer status
remaining_bytes = tracker.query_bytes_remaining()
```
The queued transfer won't run until you call `dp.run()` or `dp.run_async()`. Once you start the transfer, you can monitor it with the returned `tracker` object. Once the transfer is complete, make sure to deprovision the dataplane to avoid incurring cloud costs:
```python
# tear down the dataplane
dp.deprovision()
```
You can also have Skyplane deprovision automatically by using `dp.auto_deprovision()`:
```python
with dp.auto_deprovision():
    dp.provision()
    dp.queue_copy(...)
    tracker = dp.run_async()
```
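Putting these pieces together, here is a minimal end-to-end sketch. The bucket names are hypothetical, and the completion check assumes `query_bytes_remaining()` returns `0` once all queued transfers finish (as in the Airflow tutorial):
```python
import time

import skyplane

client = skyplane.SkyplaneClient()
dp = client.dataplane("aws", "us-east-1", "aws", "us-east-2", n_vms=8)
with dp.auto_deprovision():
    dp.provision()
    dp.queue_copy("s3://bucket1/key1", "s3://bucket2/key1")
    tracker = dp.run_async()

    # poll until all queued bytes have been transferred
    while True:
        bytes_remaining = tracker.query_bytes_remaining()
        if bytes_remaining == 0:
            break
        time.sleep(1)
    tracker.join()
```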
Now you can programmatically transfer terabytes of data across clouds! To see some examples of applications you can build with the API, you can check out our tutorials on how to [load training data from another region](tutorial_dataloader.md) and [build an Airflow operator](tutorial_airflow.md).



30 changes: 30 additions & 0 deletions docs/summary.md
@@ -0,0 +1,30 @@
**🔥 Blazing fast bulk data transfers between any cloud 🔥**

```bash
pip install skyplane[aws]
skyplane init
skyplane [sync/cp] [local/s3/gs/azure]://mybucket/big_dataset [local/s3/gs/azure]://mybucket2/
```

Skyplane is a tool for blazingly fast bulk data transfers between object stores in the cloud. It provisions a fleet of VMs in the cloud to transfer data in parallel while using compression and bandwidth tiering to reduce cost.

Skyplane is:
1. 🔥 Blazing fast ([110x faster than AWS DataSync](https://skyplane.org/en/latest/benchmark.html))
2. 🤑 Cheap (4x cheaper than rsync)
3. 🌐 Universal (AWS, Azure and GCP)

You can use Skyplane to transfer data:
* between object stores within a cloud provider (e.g. AWS us-east-1 to AWS us-west-2)
* between object stores across multiple cloud providers (e.g. AWS us-east-1 to GCP us-central1)
* between local storage and cloud object stores (experimental)

Skyplane currently supports the following source and destination endpoints (any source and destination can be combined):

| Endpoint | Source | Destination |
|--------------------|--------------------|--------------------|
| AWS S3 | ✅ | ✅ |
| Google Storage | ✅ | ✅ |
| Azure Blob Storage | ✅ | ✅ |
| Local Disk | ✅ | (in progress) |
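
For example, any pairing from the table can be expressed with the same `skyplane cp` command. The bucket names below are hypothetical, and the Azure path follows the `azure://` form used above:

```bash
# AWS S3 -> Google Storage
skyplane cp -r s3://mybucket/big_dataset gs://mybucket2/big_dataset

# Google Storage -> Azure Blob Storage
skyplane cp -r gs://mybucket2/big_dataset azure://mybucket3/big_dataset
```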

Skyplane is an actively developed project. It will have 🔪 SHARP EDGES 🔪. Please file an issue or ask the contributors via [the #help channel on our Slack](https://join.slack.com/t/skyplaneworkspace/shared_invite/zt-1cxmedcuc-GwIXLGyHTyOYELq7KoOl6Q) if you encounter bugs.
78 changes: 78 additions & 0 deletions docs/tutorial_airflow.md
@@ -0,0 +1,78 @@
# Creating an Airflow Operator

Skyplane can be easily incorporated into an Airflow DAG using a `SkyplaneOperator`, which can be used in data transfer tasks, for example in place of the `S3ToGCSOperator`. The following example demonstrates a data analytics workflow where data is transferred from S3 to GCS to build a BigQuery dataset and then used in a PySpark data analysis job.

![airflow](_static/api/airflow.png)

In this tutorial, we extend Airflow's `BaseOperator` object to create a custom Skyplane operator, called `SkyplaneOperator`. We first define the fields of the `SkyplaneOperator`:
```python
from typing import Optional

import skyplane
from airflow.models import BaseOperator  # type: ignore


class SkyplaneOperator(BaseOperator):
    template_fields = (
        "src_provider",
        "src_bucket",
        "src_region",
        "dst_provider",
        "dst_bucket",
        "dst_region",
        "config_path",
    )

    def __init__(
        self,
        src_provider: str,
        src_bucket: str,
        src_region: str,
        dst_provider: str,
        dst_bucket: str,
        dst_region: str,
        aws_config: Optional[skyplane.AWSConfig] = None,
        gcp_config: Optional[skyplane.GCPConfig] = None,
        azure_config: Optional[skyplane.AzureConfig] = None,
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)
        self.src_provider = src_provider
        self.src_bucket = src_bucket
        self.src_region = src_region
        self.dst_provider = dst_provider
        self.dst_bucket = dst_bucket
        self.dst_region = dst_region
        self.aws_config = aws_config
        self.gcp_config = gcp_config
        self.azure_config = azure_config

    def execute(self, context):
        pass
```
Inside the `execute` function, we can instantiate a Skyplane client to create a dataplane and execute transfers:
```python
def execute(self, context):
    client = skyplane.SkyplaneClient(aws_config=self.aws_config, gcp_config=self.gcp_config, azure_config=self.azure_config)
    dp = client.dataplane(self.src_provider, self.src_region, self.dst_provider, self.dst_region, n_vms=1)
    with dp.auto_deprovision():
        dp.provision()
        dp.queue_copy(self.src_bucket, self.dst_bucket, recursive=True)
        tracker = dp.run_async()
```
We can also add reporting on the transfer:
```python
import time

with dp.auto_deprovision():
    ...
    print("Waiting for transfer to complete...")
    while True:
        bytes_remaining = tracker.query_bytes_remaining()
        if bytes_remaining is None:
            print("Transfer not yet started")
        elif bytes_remaining > 0:
            print(f"{(bytes_remaining / (2 ** 30)):.2f}GB left")
        else:
            break
        time.sleep(1)
    tracker.join()
    print("Transfer complete!")
```
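With the operator defined, here is a minimal sketch of how it might be wired into a DAG. The bucket names, regions, provider strings, and scheduling settings are hypothetical:
```python
from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="skyplane_s3_to_gcs",
    start_date=datetime(2023, 3, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # single task that runs the Skyplane transfer when the DAG is triggered
    transfer = SkyplaneOperator(
        task_id="transfer_imagenet",
        src_provider="aws",
        src_bucket="s3://my-src-bucket/imagenet",
        src_region="us-east-1",
        dst_provider="gcp",
        dst_bucket="gs://my-dst-bucket/imagenet",
        dst_region="us-central1",
    )
```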
86 changes: 86 additions & 0 deletions docs/tutorial_dataloader.md
@@ -0,0 +1,86 @@
# Faster Training Data Loading

This tutorial explains how you can use the Skyplane API to quickly download data from an object store located in a different region or cloud than your training instance. See the full workflow [here](https://github.com/skyplane-project/skyplane/tree/main/examples).

Large-scale machine learning (ML) training typically includes a step for acquiring training data. The following example illustrates an ML workflow where the original ImageNet data is stored in an S3 bucket in the US-East-1 region.

![imagenet_training](_static/api/imagenet.png)

## Remote vs. Local Regions
Say that you have a VM for running training jobs in an AWS region, `us-west-2`. Reading data from a same-region S3 bucket will be very fast and free. However, if your data is in another region or cloud provider, reading the data will be much slower and will also incur per-GB egress fees. In this tutorial, we assume that our data is in a bucket in `us-east-1` (the remote region), but we are running training from another region, `us-west-2` (the local region).


## Reading data from S3
Directly reading data from S3 can be convenient, since it avoids having to download your entire dataset before starting to train. In this tutorial, we create an `ImageNetS3` dataset that extends AWS's `S3IterableDataset` object.

```python
import io

import skyplane
import torch
import torchvision.transforms as transforms
from PIL import Image
from torch.utils.data import IterableDataset, DataLoader
from awsio.python.lib.io.s3.s3dataset import S3IterableDataset


class ImageNetS3(IterableDataset):
    def __init__(self, url_list, shuffle_urls=True):
        self.s3_iter_dataset = S3IterableDataset(url_list, shuffle_urls)
        self.transform = transforms.Compose(
            [
                transforms.RandomResizedCrop(224),
                transforms.RandomHorizontalFlip(),
                transforms.ToTensor(),
                transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
            ]
        )

    def data_generator(self):
        try:
            while True:
                # Based on the alphabetical order of files, the label is fetched before the image,
                # e.g. for files 0186304.cls and 0186304.jpg, 0186304.cls will be fetched first
                _, label_fobj = next(self.s3_iter_dataset_iterator)
                _, image_fobj = next(self.s3_iter_dataset_iterator)
                label = int(label_fobj)
                image_np = Image.open(io.BytesIO(image_fobj)).convert("RGB")

                # Apply torchvision transforms if provided
                if self.transform is not None:
                    image_np = self.transform(image_np)
                yield image_np, label

        except StopIteration:
            return

    def __iter__(self):
        # __iter__ is required by IterableDataset: set up the underlying S3 iterator,
        # then yield (image, label) pairs from data_generator
        self.s3_iter_dataset_iterator = iter(self.s3_iter_dataset)
        return self.data_generator()
```
We can create a data loader with the data located in our remote bucket:
```python
remote_bucket_url = "s3://us-east-1-bucket"
data_urls = [
    (remote_bucket_url + "/" if not remote_bucket_url.endswith("/") else remote_bucket_url) + f"imagenet-train-{i:06d}.tar"
    for i in range(100)
]
dataset = ImageNetS3(data_urls)
train_loader = DataLoader(dataset, batch_size=256, num_workers=2)
```
However, the latency of this dataloader will be very high and likely degrade training performance.

## Transferring Data with Skyplane
We can improve our data loader's performance by transferring the data to a local region first. We can do this by running:
```python
local_bucket_url = "s3://us-west-2-bucket"

# Step 1: Create a Skyplane API client. It will read your AWS credentials from the AWS CLI by default
client = skyplane.SkyplaneClient(aws_config=skyplane.AWSConfig())

# Step 2: Copy the data from the remote bucket to the local bucket.
client.copy(src=remote_bucket_url, dst=local_bucket_url, recursive=True)
```
Once the copy completes, the following code will read the training data from the local bucket with low latency and no egress cost:
```python
data_urls = [
    (local_bucket_url + "/" if not local_bucket_url.endswith("/") else local_bucket_url) + f"imagenet-train-{i:06d}.tar"
    for i in range(100)
]
dataset = ImageNetS3(data_urls)
train_loader = DataLoader(dataset, batch_size=256, num_workers=2)
```
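As a usage sketch, you can then iterate over the loader as in any PyTorch training loop. The batch shape shown in the comment assumes the transforms and batch size defined above:
```python
# iterate a few batches to sanity-check the local-region data loader
for step, (images, labels) in enumerate(train_loader):
    print(step, images.shape, labels.shape)  # e.g. torch.Size([256, 3, 224, 224])
    if step == 2:
        break
```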