From 5f10edd8df099ad436db0934d5517c8205be2cc7 Mon Sep 17 00:00:00 2001
From: Asim Biswal
Date: Tue, 28 Feb 2023 01:35:19 +0000
Subject: [PATCH 01/10] adding api example to quickstart

---
 docs/quickstart.rst | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/docs/quickstart.rst b/docs/quickstart.rst
index 915f40113..06d01c5f2 100644
--- a/docs/quickstart.rst
+++ b/docs/quickstart.rst
@@ -61,10 +61,10 @@ Now, you can initialize Skyplane with your desired cloud providers. Skyplane aut
 
     $ skyplane init
 
-Transferring Data
--------------------
+Transferring Data via Skyplane CLI
+------------------------------------
 
-We're ready to use Skyplane! Let's use `skyplane cp` to copy files from AWS to GCP:
+We're ready to use the Skyplane CLI! Let's use `skyplane cp` to copy files from AWS to GCP:
 
 .. code-block:: bash
 
     ---> 🎸 Ready to rock and roll! Copy some files:
     $ skyplane cp -r s3://... gs://...
 
 To transfer only new objects, you can instead use `skyplane sync`:
 
 .. code-block:: bash
 
     ---> Copy only diff
     $ skyplane sync s3://... gs://...
+
+Transferring Data via Skyplane API
+------------------------------------
+
+We can also leverage the power of the Skyplane API! To access Skyplane and its functions, you can import it in your Python code like this:
+
+.. code-block:: python
+
+    import skyplane
+
+To start a simple copy job using the Skyplane API, we simply create a `SkyplaneClient` and call `copy`:
+
+.. code-block:: python
+    :caption: Example of a simple API copy that automatically deprovisions the VMs
+
+    import skyplane
+
+    client = skyplane.SkyplaneClient(aws_config=skyplane.AWSConfig())
+    client.copy(src="s3://skycamp-demo-src/synset_labels.txt", dst="s3://skycamp-demo-us-east-2/imagenet-bucket/synset_labels.txt", recursive=False)
+
+.. note::
+    In this example, we use a default AWSConfig, which infers AWS credentials from the environment.
\ No newline at end of file

From f72799fdba6261c527dc9a4446a47dc801b9d575 Mon Sep 17 00:00:00 2001
From: Sarah Wooders
Date: Fri, 3 Mar 2023 19:18:11 -0800
Subject: [PATCH 02/10] add tutorials section

---
 docs/faq.md                 |  2 +-
 docs/index.rst              |  6 +++
 docs/tutorial_airflow.md    | 69 ++++++++++++++++++++++++++++
 docs/tutorial_dataloader.md | 89 +++++++++++++++++++++++++++++++++++++
 4 files changed, 165 insertions(+), 1 deletion(-)
 create mode 100644 docs/tutorial_airflow.md
 create mode 100644 docs/tutorial_dataloader.md

diff --git a/docs/faq.md b/docs/faq.md
index 76f8fa175..705284387 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -42,7 +42,7 @@ Using the cloud provider's console, verify the bucket exists. If so, ensure that
 To enable support for requester pays buckets, run `skyplane config set requester_pays true`.
 ```
 
-# How can I switch between GCP projects?
+## How can I switch between GCP projects?
 
 We recommend re-setting GCP credentials locally by running `rm -r ~/.config/gcloud` then re-running `gcloud auth application-default login`. You can then set the project ID you want with `gcloud config set project <project_id>`. Once you've updated authentication and the project, you can run `skyplane init --reinit-gcp`. If you get an error saying `Compute Engine API has not been used in project 507282715251 before or it is disabled`, wait a few minutes for the API enablement to take effect and re-run `skyplane init`.
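Putting the steps from that FAQ answer together, the full reset sequence looks like this (the `<project_id>` value is a placeholder for your own GCP project):

```bash
# Reset local GCP credentials and log in again.
rm -r ~/.config/gcloud
gcloud auth application-default login

# Point gcloud at the desired project, then re-initialize Skyplane's GCP support.
gcloud config set project <project_id>
skyplane init --reinit-gcp
```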
diff --git a/docs/index.rst b/docs/index.rst
index 5eae9938e..06de84cb3 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -47,6 +47,12 @@ Contents
    performance_stats_collection
    faq
 
+.. toctree::
+   :maxdepth: 2
+   :caption: Tutorials
+
+   tutorial_dataloader
+   tutorial_airflow
 
 .. toctree::
    :maxdepth: 4

diff --git a/docs/tutorial_airflow.md b/docs/tutorial_airflow.md
new file mode 100644
index 000000000..505fcb89b
--- /dev/null
+++ b/docs/tutorial_airflow.md
@@ -0,0 +1,69 @@
+# Creating an Airflow Operator
+
+Skyplane can be easily incorporated into an Airflow DAG using a SkyplaneOperator, which can be used in data transfer tasks, for example to replace the `S3ToGCSOperator`. The following example demonstrates a data analytics workflow where data is transferred from S3 to GCS to build a BigQuery dataset and then used in a PySpark data analysis job.
+
+![airflow](_static/api/airflow.png)
+
+In this tutorial, we extend Airflow's `BaseOperator` object to create a custom Skyplane operator, called `SkyplaneOperator`. We first define the fields of the `SkyplaneOperator`:
+```
+from airflow.models import BaseOperator  # type: ignore
+
+class SkyplaneOperator(BaseOperator):
+    template_fields = (
+        "src_provider",
+        "src_bucket",
+        "src_region",
+        "dst_provider",
+        "dst_bucket",
+        "dst_region",
+        "config_path",
+    )
+
+    def __init__(
+        self,
+        *,
+        src_provider: str,
+        src_bucket: str,
+        src_region: str,
+        dst_provider: str,
+        dst_bucket: str,
+        dst_region: str,
+        config_path: str,
+        **kwargs,
+    ) -> None:
+        super().__init__(**kwargs)
+        self.src_provider = src_provider
+        self.src_bucket = src_bucket
+        self.src_region = src_region
+        self.dst_provider = dst_provider
+        self.dst_bucket = dst_bucket
+        self.dst_region = dst_region
+        self.config_path = config_path
+
+    def execute(self, context):
+        pass
+```
+Inside the `execute` function, we can instantiate and call the Skyplane API to execute transfers:
+```
+import skyplane
+
+def execute(self, context):
+    aws_config, gcp_config, azure_config = skyplane.SkyplaneAuth.load_from_config_file(self.config_path)
+    client = skyplane.SkyplaneClient(aws_config=aws_config, gcp_config=gcp_config, azure_config=azure_config)
+    dp = client.dataplane(self.src_provider, self.src_region, self.dst_provider, self.dst_region, n_vms=1)
+    with dp.auto_deprovision():
+        dp.provision()
+        dp.queue_copy(self.src_bucket, self.dst_bucket, recursive=True)
+        tracker = dp.run_async()
+```
+We can also add reporting on the transfer by adding a few lines:
+```
+    with dp.auto_deprovision():
+        ...
+        reporter = skyplane.SimpleReporter(tracker)
+
+        # monitor the transfer
+        while reporter.update():
+            time.sleep(1)
+```
\ No newline at end of file
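As a usage sketch (not part of the original tutorial), here is how the operator defined above might be wired into a DAG. The module path, bucket names, regions, schedule, and config path are all hypothetical placeholders:

```python
from datetime import datetime

from airflow import DAG

# Assumes the SkyplaneOperator class above was saved in a local module.
from skyplane_operator import SkyplaneOperator

with DAG(
    dag_id="s3_to_gcs_analytics",
    start_date=datetime(2023, 3, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Placeholder buckets and regions; replace with your own.
    transfer = SkyplaneOperator(
        task_id="skyplane_s3_to_gcs",
        src_provider="aws",
        src_bucket="s3://example-src-bucket",
        src_region="us-east-1",
        dst_provider="gcp",
        dst_bucket="gs://example-dst-bucket",
        dst_region="us-central1",
        config_path="~/.skyplane/config",
    )
```

Downstream tasks, such as a BigQuery load or a PySpark job as in the workflow diagram above, can then depend on `transfer` with the usual `>>` operator.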
diff --git a/docs/tutorial_dataloader.md b/docs/tutorial_dataloader.md
new file mode 100644
index 000000000..983ca2ef9
--- /dev/null
+++ b/docs/tutorial_dataloader.md
@@ -0,0 +1,89 @@
+# Loading Data from S3 for Model Training
+
+This tutorial explains how you can use the Skyplane API to quickly download data from an object store located in a different region or cloud than your training instance. See full workflow here https://github.com/skyplane-project/skyplane/tree/main/examples.
+
+Large-scale machine learning (ML) training typically includes a step for acquiring training data. The following example illustrates an ML workflow where the original ImageNet data is stored in an S3 bucket in the US-East-1 region.
+
+In many cases, datasets and virtual machines (VMs) are located in different regions. This can lead to slow data transfer speeds and high costs for data egress fees when using cloud provider tools, such as aws s3 cp, to download data to the VM's local disk. Skyplane offers a solution by allowing a fast and more cost-effective transfer of the dataset to an S3 bucket in the same region as the VM (e.g. US-West-2), with direct streaming of the data to the model without the need for downloading it to the local folder.
+
+![imagenet_training](_static/api/imagenet.png)
+This process is as simple as adding just two lines of code, similar to the demonstration of the Skyplane simple copy.
+
+## Remove vs. Local Regions
+Say that you have a VM for running training jobs in an AWS region, `us-west-2`. Reading data from a same-region S3 bucket will be very fast and free. However, if your data is in another region or cloud provider, reading the data will be much slower and will also incur per-GB egress fees. In this tutorial, we assume that our data is in a bucket in `us-east-1` (the remote region), but we are running training from another region, `us-west-2` (the local region).
+
+
+## Reading data from S3
+Directly reading data from S3 can be convenient, since it avoids having to download your entire dataset before starting to train. In this tutorial, we create an `ImageNetS3` dataset that extends AWS's `S3IterableDataset` object.
+
+```
+import io
+
+import skyplane
+import torch
+import torchvision.transforms as transforms
+from PIL import Image
+from torch.utils.data import IterableDataset, DataLoader
+from awsio.python.lib.io.s3.s3dataset import S3IterableDataset
+
+class ImageNetS3(IterableDataset):
+    def __init__(self, url_list, shuffle_urls=True):
+        self.s3_iter_dataset = S3IterableDataset(url_list, shuffle_urls)
+        self.transform = transforms.Compose(
+            [
+                transforms.RandomResizedCrop(224),
+                transforms.RandomHorizontalFlip(),
+                transforms.ToTensor(),
+                transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
+            ]
+        )
+
+    def __iter__(self):
+        # Create a fresh iterator over the underlying S3 dataset for data_generator to consume.
+        self.s3_iter_dataset_iterator = iter(self.s3_iter_dataset)
+        return self.data_generator()
+
+    def data_generator(self):
+        try:
+            while True:
+                # Based on the alphabetical order of files, the label is fetched before the image,
+                # e.g. for files 0186304.cls and 0186304.jpg, 0186304.cls will be fetched first.
+                _, label_fobj = next(self.s3_iter_dataset_iterator)
+                _, image_fobj = next(self.s3_iter_dataset_iterator)
+                label = int(label_fobj)
+                image_np = Image.open(io.BytesIO(image_fobj)).convert("RGB")
+
+                # Apply torchvision transforms if provided
+                if self.transform is not None:
+                    image_np = self.transform(image_np)
+                yield image_np, label
+
+        except StopIteration:
+            return
+```
+We can create a data loader with the data located in our remote bucket:
+```
+remote_bucket_url = "s3://us-east-1-bucket"
+data_urls = [
+    (remote_bucket_url + "/" if not remote_bucket_url.endswith("/") else remote_bucket_url) + f"imagenet-train-{i:06d}.tar"
+    for i in range(100)
+]
+dataset = ImageNetS3(data_urls)
+train_loader = DataLoader(dataset, batch_size=256, num_workers=2)
+```
+However, the latency of this data loader will be very high and will likely degrade training performance.
+
+## Transferring Data with Skyplane
+We can improve our data loader's performance by transferring the data to a local region first. We can do this by running:
+```
+local_bucket_url = "s3://us-west-2-bucket"
+
+# Step 1: Create a Skyplane API client. It will read your AWS credentials from the AWS CLI by default.
+client = skyplane.SkyplaneClient(aws_config=skyplane.AWSConfig())
+
+# Step 2: Copy the data from the remote bucket to the local bucket.
+client.copy(src=remote_bucket_url, dst=local_bucket_url, recursive=True)
+```
+Once the copy completes, the following code will be able to read the training data from the bucket with low latency and no egress cost:
+```
+data_urls = [
+    (local_bucket_url + "/" if not local_bucket_url.endswith("/") else local_bucket_url) + f"imagenet-train-{i:06d}.tar"
+    for i in range(100)
+]
+dataset = ImageNetS3(data_urls)
+train_loader = DataLoader(dataset, batch_size=256, num_workers=2)
+```
+
+
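To see the loader driven end to end, here is a minimal training-loop sketch. It is not part of the original example: the ResNet-50 model, loss, and SGD settings are illustrative stand-ins for whatever you actually train, and `train_loader` is the loader built above:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Illustrative stand-ins; any model and optimizer work with the loader above.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = models.resnet50(num_classes=1000).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

model.train()
for step, (images, labels) in enumerate(train_loader):
    images, labels = images.to(device), labels.to(device)

    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()

    if step % 10 == 0:
        print(f"step {step}: loss {loss.item():.4f}")
```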
From 8970282affb068a3d7e51d1614c488fb3cbc0bac Mon Sep 17 00:00:00 2001
From: Sarah Wooders
Date: Fri, 3 Mar 2023 19:31:57 -0800
Subject: [PATCH 03/10] make a more info section

---
 docs/index.rst | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/docs/index.rst b/docs/index.rst
index 06de84cb3..2432f1b4c 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -41,10 +41,8 @@ Contents
    :caption: Overview
 
    quickstart
-   benchmark
    configure
    architecture
-   performance_stats_collection
    faq
 
 .. toctree::
@@ -60,7 +58,6 @@ Contents
 
    build_from_source
    contributing
-   roadmap
    debugging
 
 .. toctree::
@@ -70,6 +67,14 @@ Contents
    skyplane_api
    skyplane_cli
 
+.. toctree::
+   :maxdepth: 2
+   :caption: Learn More
+
+   benchmark
+   performance_stats_collection
+   roadmap
+
 .. toctree::
    :caption: Community

From bfc86e7b764d3d87278a416bebf40abc238a2ac3 Mon Sep 17 00:00:00 2001
From: Sarah Wooders
Date: Sat, 4 Mar 2023 12:00:12 -0800
Subject: [PATCH 04/10] add quickstart section

---
 docs/_templates/base.html                 |  6 ---
 docs/index.rst                            |  1 +
 docs/{quickstart.rst => installation.rst} | 46 +----------------------
 docs/quickstart.md                        | 21 +++++++++++
 4 files changed, 24 insertions(+), 50 deletions(-)
 delete mode 100644 docs/_templates/base.html
 rename docs/{quickstart.rst => installation.rst} (52%)
 create mode 100644 docs/quickstart.md

diff --git a/docs/_templates/base.html b/docs/_templates/base.html
deleted file mode 100644
index c36165199..000000000
--- a/docs/_templates/base.html
+++ /dev/null
@@ -1,6 +0,0 @@
-{%- extends "!base.html" %}
-{% block extrahead %}
-
-
-{{ super() }}
-{% endblock %}
\ No newline at end of file

diff --git a/docs/index.rst b/docs/index.rst
index 2432f1b4c..ed1c729ac 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -40,6 +40,7 @@ Contents
    :maxdepth: 2
    :caption: Overview
 
+   installation
    quickstart
    configure
    architecture

diff --git a/docs/quickstart.rst b/docs/installation.rst
similarity index 52%
rename from docs/quickstart.rst
rename to docs/installation.rst
index 06d01c5f2..9c8a152aa 100644
--- a/docs/quickstart.rst
+++ b/docs/installation.rst
@@ -1,9 +1,7 @@
 ***************
-Getting Started
+Installation
 ***************
 
-Installation
------------------------
 We're ready to install Skyplane. It's as easy as:
 
 .. code-block:: bash
 
     $ GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=1 GRPC_PYTHON_BUILD_SYSTEM_ZLIB=1 pip install skyplane[all]
 
-Cloud Credentials
+Setting up Cloud Credentials
 -----------------------
 Skyplane needs access to cloud credentials to perform transfers. To get started with setting up credentials, make sure you have cloud provider CLI tools installed:
 
 .. code-block:: bash
 
 Now, you can initialize Skyplane with your desired cloud providers. Skyplane aut
 
     ---> Setup cloud provider connectors:
     $ skyplane init
-
-
-Transferring Data via Skyplane CLI
-------------------------------------
-
-We're ready to use the Skyplane CLI! Let's use `skyplane cp` to copy files from AWS to GCP:
-
-.. code-block:: bash
-
-    ---> 🎸 Ready to rock and roll! Copy some files:
-    $ skyplane cp -r s3://... gs://...
-
-To transfer only new objects, you can instead use `skyplane sync`:
-
-.. code-block:: bash
-
-    ---> Copy only diff
-    $ skyplane sync s3://... gs://...
-
-Transferring Data via Skyplane API
-------------------------------------
-
-We can also leverage the power of the Skyplane API! To access Skyplane and its functions, you can import it in your Python code like this:
-
-.. code-block:: python
-
-    import skyplane
-
-To start a simple copy job using the Skyplane API, we simply create a `SkyplaneClient` and call `copy`:
-
-.. code-block:: python
-    :caption: Example of a simple API copy that automatically deprovisions the VMs
-
-    import skyplane
-
-    client = skyplane.SkyplaneClient(aws_config=skyplane.AWSConfig())
-    client.copy(src="s3://skycamp-demo-src/synset_labels.txt", dst="s3://skycamp-demo-us-east-2/imagenet-bucket/synset_labels.txt", recursive=False)
-
-.. note::
-    In this example, we use a default AWSConfig, which infers AWS credentials from the environment.
\ No newline at end of file

diff --git a/docs/quickstart.md b/docs/quickstart.md
new file mode 100644
index 000000000..9b1c27cb4
--- /dev/null
+++ b/docs/quickstart.md
@@ -0,0 +1,21 @@
+# Quickstart
+
+## CLI
+To transfer files from AWS to GCP, you can run:
+```
+skyplane cp -r s3://... gs://...
+```
+You can also sync directories to avoid copying data that is already in the destination location:
+```
+skyplane sync s3://... gs://...
+```
+
+
+## Python API
+You can also use Skyplane from a Python API client.
+```
+import skyplane
+
+client = skyplane.SkyplaneClient(aws_config=skyplane.AWSConfig())
+client.copy(src="s3://skycamp-demo-src/synset_labels.txt", dst="s3://skycamp-demo-us-east-2/imagenet-bucket/synset_labels.txt", recursive=False)
+```
\ No newline at end of file
From 08f75237a6eac402b1c9f81d70193e82e222eeaf Mon Sep 17 00:00:00 2001
From: Sarah Wooders
Date: Sat, 4 Mar 2023 12:15:21 -0800
Subject: [PATCH 05/10] update index with summary

---
 docs/index.rst  | 24 ++----------------------
 docs/summary.md | 30 ++++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+), 22 deletions(-)
 create mode 100644 docs/summary.md

diff --git a/docs/index.rst b/docs/index.rst
index ed1c729ac..551710378 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -7,30 +7,10 @@ Welcome to Skyplane!
 
 .. note::
-    This project is under active development.
 
-**🔥 Blazing fast bulk data transfers between any cloud 🔥**
-
-Skyplane is a tool for blazingly fast bulk data transfers in the cloud. Skyplane manages parallelism, data partitioning, and network paths to optimize data transfers, and can also spin up VM instances to increase transfer throughput.
-
-You can use Skyplane to transfer data:
-* Between buckets within a cloud provider
-* Between object stores across multiple cloud providers
-* (experimental) Between local storage and cloud object stores
-
-Copy a large dataset in the cloud in a minute, not hours:
-
-.. code-block:: bash
-
-    $ pip install skyplane[aws]
-    $ skyplane init
-    $ skyplane [sync/cp] [local/s3/gs/azure]://mybucket/big_dataset [local/s3/gs/azure]://mybucket2/
-
-Skyplane supports copying data between any major public cloud:
-
-.. image:: /_static/supported-destinations.png
+.. include:: summary.md
+   :parser: myst_parser.sphinx_
 
 Contents
 --------

diff --git a/docs/summary.md b/docs/summary.md
new file mode 100644
index 000000000..b04c02d12
--- /dev/null
+++ b/docs/summary.md
@@ -0,0 +1,30 @@
+**🔥 Blazing fast bulk data transfers between any cloud 🔥**
+
+```
+pip install skyplane[aws]
+skyplane init
+skyplane [sync/cp] [local/s3/gs/azure]://mybucket/big_dataset [local/s3/gs/azure]://mybucket2/
+```
+
+Skyplane is a tool for blazingly fast bulk data transfers between object stores in the cloud. It provisions a fleet of VMs in the cloud to transfer data in parallel while using compression and bandwidth tiering to reduce cost.
+
+Skyplane is:
+1. 🔥 Blazing fast ([110x faster than AWS DataSync](https://skyplane.org/en/latest/benchmark.html))
+2. 🤑 Cheap (4x cheaper than rsync)
+3. 🌐 Universal (AWS, Azure and GCP)
+
+You can use Skyplane to transfer data:
+* between object stores within a cloud provider (e.g. AWS us-east-1 to AWS us-west-2)
+* between object stores across multiple cloud providers (e.g. AWS us-east-1 to GCP us-central1)
+* between local storage and cloud object stores (experimental)
+
+Skyplane currently supports the following source and destination endpoints (any source and destination can be combined):
+
+| Endpoint           | Source | Destination   |
+|--------------------|--------|---------------|
+| AWS S3             | ✅     | ✅            |
+| Google Storage     | ✅     | ✅            |
+| Azure Blob Storage | ✅     | ✅            |
+| Local Disk         | ✅     | (in progress) |
+
+Skyplane is an actively developed project. It will have 🔪 SHARP EDGES 🔪. Please file an issue or ask the contributors via [the #help channel on our Slack](https://join.slack.com/t/skyplaneworkspace/shared_invite/zt-1cxmedcuc-GwIXLGyHTyOYELq7KoOl6Q) if you encounter bugs.
\ No newline at end of file

From 8a5f0a91ba21fe1f6876567597498fff4a869d61 Mon Sep 17 00:00:00 2001
From: Sarah Wooders
Date: Sat, 4 Mar 2023 12:48:36 -0800
Subject: [PATCH 06/10] write out python api section

---
 docs/quickstart.md          | 45 ++++++++++++++++++++++++++++++---------
 docs/tutorial_dataloader.md |  4 ++--
 2 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/docs/quickstart.md b/docs/quickstart.md
index 9b1c27cb4..765144071 100644
--- a/docs/quickstart.md
+++ b/docs/quickstart.md
@@ -1,7 +1,7 @@
 # Quickstart
 
 ## CLI
-To transfer files from AWS to GCP, you can run:
+The simplest way to run transfers on Skyplane is to use the CLI. To transfer files from AWS to GCP, you can run:
 ```
 skyplane cp -r s3://... gs://...
 ```
 You can also sync directories to avoid copying data that is already in the destination location:
 ```
 skyplane sync s3://... gs://...
 ```
 
 
 ## Python API
-You can also use Skyplane from a Python API client.
+You can also use Skyplane from a Python API client. To copy a single object or folder, you can run:
 ```
 import skyplane
 
-client = skyplane.SkyplaneClient(aws_config=skyplane.AWSConfig())
-client.copy(src="s3://skycamp-demo-src/synset_labels.txt", dst="s3://skycamp-demo-us-east-2/imagenet-bucket/synset_labels.txt", recursive=False)
+client = skyplane.SkyplaneClient()
+client.copy(src="s3://bucket-src/key", dst="s3://bucket-dst/key", recursive=False)
 ```
+This will create a Skyplane dataplane (i.e. cluster), execute the transfer, and tear down the cluster upon completion.
+
+You can also execute multiple transfers on the same dataplane to reduce the overhead from VM startup time. To do this, you can define a dataplane object and provision it:
+```
+dp = client.dataplane("aws", "us-east-1", "aws", "us-east-2", n_vms=8)
+dp.provision()
+```
+This will create a dataplane for transfers between `us-east-1` and `us-east-2` with 8 VMs per region. Now, we can queue transfer jobs on this dataplane:
+```
+# queue transfer
+dp.queue_copy("s3://bucket1/key1", "s3://bucket2/key1")
+dp.queue_copy("s3://bucket1/key2", "s3://bucket2/key2")
+
+# execute transfer
+tracker = dp.run_async()
+
+# monitor transfer status
+remaining_bytes = tracker.query_bytes_remaining()
+```
+The queued transfers won't run until you call `dp.run()` or `dp.run_async()`. Once you run the transfer, you can monitor it with the returned `tracker` object. Once the transfer is completed, make sure to deprovision the dataplane to avoid cloud costs:
+```
+# tear down the dataplane
+dp.deprovision()
+```
+You can also have Skyplane deprovision automatically with `dp.auto_deprovision()`:
+```
+with dp.auto_deprovision():
+    dp.provision()
+    dp.queue_copy(...)
+    tracker = dp.run_async()
+```
+Now you can programmatically transfer terabytes of data across clouds! To see some examples of applications you can build with the API, check out our tutorials on how to [load training data from another region](tutorial_dataloader.md) and [build an Airflow operator](tutorial_airflow.md).

diff --git a/docs/tutorial_dataloader.md b/docs/tutorial_dataloader.md
index 983ca2ef9..6fa034d90 100644
--- a/docs/tutorial_dataloader.md
+++ b/docs/tutorial_dataloader.md
@@ -1,6 +1,6 @@
 # Loading Data from S3 for Model Training
 
-This tutorial explains how you can use the Skyplane API to quickly download data from an object store located in a different region or cloud than your training instance. See full workflow here https://github.com/skyplane-project/skyplane/tree/main/examples.
+This tutorial explains how you can use the Skyplane API to quickly download data from an object store located in a different region or cloud than your training instance. See the full workflow [here](https://github.com/skyplane-project/skyplane/tree/main/examples).
 
 Large-scale machine learning (ML) training typically includes a step for acquiring training data. The following example illustrates an ML workflow where the original ImageNet data is stored in an S3 bucket in the US-East-1 region.
 
 In many cases, datasets and virtual machines (VMs) are located in different regions. This can lead to slow data transfer speeds and high costs for data egress fees when using cloud provider tools, such as aws s3 cp, to download data to the VM's local disk. Skyplane offers a solution by allowing a fast and more cost-effective transfer of the dataset to an S3 bucket in the same region as the VM (e.g. US-West-2), with direct streaming of the data to the model without the need for downloading it to the local folder.
 
 ![imagenet_training](_static/api/imagenet.png)
 This process is as simple as adding just two lines of code, similar to the demonstration of the Skyplane simple copy.
 
-## Remove vs. Local Regions
+## Remote vs. Local Regions
 Say that you have a VM for running training jobs in an AWS region, `us-west-2`. Reading data from a same-region S3 bucket will be very fast and free. However, if your data is in another region or cloud provider, reading the data will be much slower and will also incur per-GB egress fees. In this tutorial, we assume that our data is in a bucket in `us-east-1` (the remote region), but we are running training from another region, `us-west-2` (the local region).
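To put those per-GB egress fees in perspective, here is a back-of-the-envelope comparison of repeatedly reading from the remote bucket versus a one-time Skyplane copy followed by free same-region reads. The dataset size and the $0.02/GB rate are assumptions for illustration only (roughly representative of AWS inter-region egress; check your provider's current pricing):

```python
dataset_gb = 150       # assumed ImageNet-scale dataset size
egress_per_gb = 0.02   # assumed inter-region egress rate, $/GB
epochs = 10

# Streaming directly from the remote bucket pays egress on every epoch.
direct_cost = dataset_gb * egress_per_gb * epochs

# Copying once, then reading from the same-region bucket, pays egress once.
copy_once_cost = dataset_gb * egress_per_gb

print(f"direct remote reads: ${direct_cost:.2f}")   # $30.00
print(f"copy once + local:   ${copy_once_cost:.2f}")  # $3.00
```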
From 7d4b3fbe49d230b5e30459f237b5a10dd00d06a5 Mon Sep 17 00:00:00 2001
From: Sarah Wooders
Date: Sat, 4 Mar 2023 13:10:22 -0800
Subject: [PATCH 07/10] rename data loader tutorial

---
 docs/tutorial_dataloader.md | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/docs/tutorial_dataloader.md b/docs/tutorial_dataloader.md
index 6fa034d90..eb337f294 100644
--- a/docs/tutorial_dataloader.md
+++ b/docs/tutorial_dataloader.md
@@ -1,13 +1,10 @@
-# Loading Data from S3 for Model Training
+# Faster Training Data Loading
 
 This tutorial explains how you can use the Skyplane API to quickly download data from an object store located in a different region or cloud than your training instance. See the full workflow [here](https://github.com/skyplane-project/skyplane/tree/main/examples).
 
 Large-scale machine learning (ML) training typically includes a step for acquiring training data. The following example illustrates an ML workflow where the original ImageNet data is stored in an S3 bucket in the US-East-1 region.
 
-In many cases, datasets and virtual machines (VMs) are located in different regions. This can lead to slow data transfer speeds and high costs for data egress fees when using cloud provider tools, such as aws s3 cp, to download data to the VM's local disk. Skyplane offers a solution by allowing a fast and more cost-effective transfer of the dataset to an S3 bucket in the same region as the VM (e.g. US-West-2), with direct streaming of the data to the model without the need for downloading it to the local folder.
-
 ![imagenet_training](_static/api/imagenet.png)
-This process is as simple as adding just two lines of code, similar to the demonstration of the Skyplane simple copy.
 
 ## Remote vs. Local Regions
 Say that you have a VM for running training jobs in an AWS region, `us-west-2`. Reading data from a same-region S3 bucket will be very fast and free. However, if your data is in another region or cloud provider, reading the data will be much slower and will also incur per-GB egress fees. In this tutorial, we assume that our data is in a bucket in `us-east-1` (the remote region), but we are running training from another region, `us-west-2` (the local region).

From 40a536b5907f067996f15f7e5a37f2248a57a222 Mon Sep 17 00:00:00 2001
From: Asim Biswal
Date: Tue, 7 Mar 2023 22:17:22 +0000
Subject: [PATCH 08/10] updated airflow tutorial docs

---
 docs/tutorial_airflow.md | 35 ++++++++++++++++++++++-------------
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/docs/tutorial_airflow.md b/docs/tutorial_airflow.md
index 505fcb89b..8c8827e68 100644
--- a/docs/tutorial_airflow.md
+++ b/docs/tutorial_airflow.md
@@ -6,6 +6,7 @@ Skyplane can be easily incorporated into an Airflow DAG using a SkyplaneOperator
 
 In this tutorial, we extend Airflow's `BaseOperator` object to create a custom Skyplane operator, called `SkyplaneOperator`. We first define the fields of the `SkyplaneOperator`:
 ```
+from typing import Optional
+
+import skyplane
 from airflow.models import BaseOperator  # type: ignore
 
 class SkyplaneOperator(BaseOperator):
     template_fields = (
         "src_provider",
         "src_bucket",
         "src_region",
         "dst_provider",
         "dst_bucket",
         "dst_region",
         "config_path",
     )
 
     def __init__(
         self,
         *,
         src_provider: str,
         src_bucket: str,
         src_region: str,
         dst_provider: str,
         dst_bucket: str,
         dst_region: str,
-        config_path: str,
+        aws_config: Optional[skyplane.AWSConfig] = None,
+        gcp_config: Optional[skyplane.GCPConfig] = None,
+        azure_config: Optional[skyplane.AzureConfig] = None,
         **kwargs,
     ) -> None:
         super().__init__(**kwargs)
         self.src_provider = src_provider
         self.src_bucket = src_bucket
         self.src_region = src_region
         self.dst_provider = dst_provider
         self.dst_bucket = dst_bucket
         self.dst_region = dst_region
-        self.config_path = config_path
+        self.aws_config = aws_config
+        self.gcp_config = gcp_config
+        self.azure_config = azure_config
 
     def execute(self, context):
         pass
 ```
-Inside the `execute` function, we can instantiate and call the Skyplane API to execute transfers:
+Inside the `execute` function, we can instantiate a Skyplane client to create a dataplane and execute transfers:
 ```
-import skyplane
-
 def execute(self, context):
-    aws_config, gcp_config, azure_config = skyplane.SkyplaneAuth.load_from_config_file(self.config_path)
-    client = skyplane.SkyplaneClient(aws_config=aws_config, gcp_config=gcp_config, azure_config=azure_config)
+    client = skyplane.SkyplaneClient(aws_config=self.aws_config, gcp_config=self.gcp_config, azure_config=self.azure_config)
     dp = client.dataplane(self.src_provider, self.src_region, self.dst_provider, self.dst_region, n_vms=1)
     with dp.auto_deprovision():
         dp.provision()
         dp.queue_copy(self.src_bucket, self.dst_bucket, recursive=True)
         tracker = dp.run_async()
 ```
-We can also add reporting on the transfer by adding a few lines:
+We can also add reporting on the transfer:
 ```
     with dp.auto_deprovision():
         ...
-        reporter = skyplane.SimpleReporter(tracker)
-
-        # monitor the transfer
-        while reporter.update():
-            time.sleep(1)
+        print("Waiting for transfer to complete...")
+        while True:
+            bytes_remaining = tracker.query_bytes_remaining()
+            if bytes_remaining is None:
+                print("Transfer not yet started")
+            elif bytes_remaining > 0:
+                print(f"{(bytes_remaining / (2 ** 30)):.2f}GB left")
+            else:
+                break
+            time.sleep(1)
+        tracker.join()
+        print("Transfer complete!")
 ```
\ No newline at end of file
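With the updated constructor, a DAG no longer passes `config_path`; it passes credential config objects instead. A minimal instantiation sketch, assuming `SkyplaneOperator` is importable as before (the bucket names and regions are placeholders, and the no-argument config constructors are assumed to pick up credentials from the environment):

```python
import skyplane

from skyplane_operator import SkyplaneOperator  # assumed local module

transfer = SkyplaneOperator(
    task_id="skyplane_s3_to_gcs",
    src_provider="aws",
    src_bucket="s3://example-src-bucket",
    src_region="us-east-1",
    dst_provider="gcp",
    dst_bucket="gs://example-dst-bucket",
    dst_region="us-central1",
    # Config objects are assumed to default to environment credentials.
    aws_config=skyplane.AWSConfig(),
    gcp_config=skyplane.GCPConfig(),
)
```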
From 22d62ccc7e6862045c17c2aa0e5438decfbb0346 Mon Sep 17 00:00:00 2001
From: Sarah Wooders
Date: Wed, 8 Mar 2023 16:29:20 -0800
Subject: [PATCH 09/10] add back base.html

---
 docs/_templates/base.html | 6 ++++++
 1 file changed, 6 insertions(+)
 create mode 100644 docs/_templates/base.html

diff --git a/docs/_templates/base.html b/docs/_templates/base.html
new file mode 100644
index 000000000..bcee7304d
--- /dev/null
+++ b/docs/_templates/base.html
@@ -0,0 +1,6 @@
+{%- extends "!base.html" %}
+{% block extrahead %}
+
+
+{{ super() }}
+{% endblock %}

From bb7333b18f1adca4e3bf53e77a5eab16dc1f6bec Mon Sep 17 00:00:00 2001
From: Asim Biswal
Date: Thu, 9 Mar 2023 07:35:26 +0000
Subject: [PATCH 10/10] fixing warning readthedocs

---
 docs/index.rst        | 2 +-
 docs/installation.rst | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/index.rst b/docs/index.rst
index 551710378..6d6ad0066 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -13,7 +13,7 @@ Welcome to Skyplane!
    :parser: myst_parser.sphinx_
 
 Contents
---------
+---------
 
 .. toctree::

diff --git a/docs/installation.rst b/docs/installation.rst
index 9c8a152aa..513727a6f 100644
--- a/docs/installation.rst
+++ b/docs/installation.rst
@@ -24,7 +24,7 @@ We're ready to install Skyplane. It's as easy as:
 
     $ GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=1 GRPC_PYTHON_BUILD_SYSTEM_ZLIB=1 pip install skyplane[all]
 
 Setting up Cloud Credentials
------------------------
+-----------------------------
 Skyplane needs access to cloud credentials to perform transfers. To get started with setting up credentials, make sure you have cloud provider CLI tools installed:
 
 .. code-block:: bash