[runtime env] [Doc] Add concepts and basic workflows #20222

Merged 23 commits on Nov 19, 2021.
12 changes: 10 additions & 2 deletions doc/examples/doc_code/runtime_env_example.py
@@ -49,10 +49,18 @@ def f():
     pass
 
 @ray.remote
-class Actor:
+class A:
     pass
 
 # __per_task_per_actor_start__
 f.options(runtime_env=runtime_env).remote()
-Actor.options(runtime_env=runtime_env).remote()
+actor = A.options(runtime_env=runtime_env).remote()
+
+@ray.remote(runtime_env=runtime_env)
+def g():
+    pass
+
+@ray.remote(runtime_env=runtime_env)
+class B:
+    pass
 # __per_task_per_actor_end__
2 changes: 1 addition & 1 deletion doc/source/actors.rst
@@ -313,7 +313,7 @@ sufficient CPU resources and the relevant custom resources.
 .. tip::
 
     Besides compute resources, you can also specify an environment for an actor to run in,
-    which can include Python packages, local files, environment variables, and more--see :ref:`Runtime Environments <runtime-environments>` for details.
+    which can include Python packages, local files, environment variables, and more---see :ref:`Runtime Environments <runtime-environments>` for details.
 
 
 Terminating Actors
1 change: 1 addition & 0 deletions doc/source/cluster/commands.rst
@@ -155,6 +155,7 @@ run ``ray attach --help``.
     # Attach to tmux session on cluster (creates a new one if none available)
     $ ray attach cluster.yaml --tmux
 
+.. _ray-rsync:
 
 Synchronizing files from the cluster (``ray rsync-up/down``)
 ------------------------------------------------------------
127 changes: 0 additions & 127 deletions doc/source/dependency-management.rst

This file was deleted.

216 changes: 216 additions & 0 deletions doc/source/handling-dependencies.rst
@@ -0,0 +1,216 @@
.. _handling_dependencies:

Handling Dependencies
=====================

Your Ray application may depend on environment variables, files, and Python packages.
Ray provides two features to specify these dependencies when working with a remote cluster: runtime environments, and the Ray cluster launcher commands.
[Review comment from @richardliaw, Nov 13, 2021] Suggested change:

- Ray provides two features to specify these dependencies when working with a remote cluster: Runtime environments, and the Ray cluster launcher commands
+ Ray provides two features to specify these dependencies when working with a Ray cluster: :ref:`runtime Environments<runtime-environments>`, and the :ref:`Ray cluster launcher commands <INSERT THE RIGHT LINK>`.
With these features, you no longer need to manually SSH into your cluster and set up your environment.
[Review comment from a Contributor] A little confused by this sentence. We don't need to manually SSH into the cluster for the existing solution now, right? (It is handled by the setup commands.)

[Reply from the author] Yeah, I meant to include setup commands in "Ray cluster launcher commands", which this doc describes as an existing feature. Let me make this more clear.

[Review comment from a Contributor] I think we should start from the highest-level problem and work down to low-level options here. Maybe we can describe it this way instead:

1. What's the environment in Ray?
2. Why does the environment matter in Ray?

And then we can say: there are 2 ways to set up your Ray environment (e.g., files, environment variables, Python package dependencies, system dependencies, etc.):

1. Set up the same environment across machines. This is the most common way to configure environments in Ray. You can use the autoscaler's setup commands or Docker container deployment. All Ray tasks and actors will use the same environment, since all machines are configured identically. Pro is X, con is Y (e.g., all jobs have to use the same environment).

2. Set up a per-job/task/actor environment. This is useful when X (e.g., Serve or a multi-tenant cluster). In this case you can use the runtime environment API. Pro is X, con is Y.

[Review comment from a Contributor] I think we might need a section on how to set up the environment when Ray Client is used; a runtime environment can be a good solution there as well (or you should mention that the local machine and the remote cluster should have the same environment).

[Reply from the author] I agree this is useful to have in the docs. Maybe we can put them in a top-level page under "Multi-Node Ray" which then links to this runtime env page.

**Option 1.** You can specify dependencies dynamically at runtime in Python using :ref:`Runtime Environments<runtime-environments>`, described below.
This can be useful for

- quickly iterating on a project with changing dependencies and files, and

- running jobs, tasks, and actors with different dependencies, all on the same Ray cluster.

**Option 2.** Alternatively, you can prepare your Ray cluster's environment when your cluster nodes start up, and modify it later from the command line.
Packages can be installed using ``setup_commands`` in the Ray Cluster configuration file (:ref:`docs<cluster-configuration-setup-commands>`) and files can be pushed to the cluster using ``ray rsync_up`` (:ref:`docs<ray-rsync>`).
[Review comment from the author] @rkooo567 I know we still need more here but I'm not quite sure what to put, do you have any ideas?

[Reply from a Contributor] I think we should ask the autoscaler team to fill it up.

[Review comment from a Contributor] I think the common problems are:

Manual:

- Link to the autoscaler section that describes how to set up deps
- env variables (setup commands)
- System deps (setup commands)
- Files (rsync up, or manually copy and paste; make sure they are all synced)
- Python packages (setup commands)

Container:

- Same things (link to container deployment)
Concepts
--------

- **Local machine** and **Cluster**. The recommended way to connect to a remote Ray cluster is to use :ref:`Ray Client<ray-client>`, and we will call the machine running Ray Client your *local machine*. Note: you can also start a single-node Ray cluster on your local machine---in this case your Ray cluster is not really “remote”, but any comments in this documentation referring to a “remote cluster” will also apply to this setup.
[Review comment from a Contributor] Is it true that Ray Client is the recommended way? AFAIK, it is a lot less stable to use Ray Client now than directly submitting the driver.

[Reply from the author] I got that from https://docs.ray.io/en/latest/cluster/guide.html#deploying-an-application: "The recommended way of connecting to a Ray cluster is to use the ray.init("ray://:") API and connect via the Ray Client." I'm not sure which is more stable, but you're right that we should be clear about which one is recommended.
- **Files**: These are the files that your Ray application needs to run. These can include code files or data files. For a development workflow, these might live on your local machine, but when it comes time to run things at scale, you will need to get them to your remote cluster. For how to do this, see :ref:`Workflow: Local Files<workflow-local-files>` below.

- **Packages**: These are external libraries or executables required by your Ray application, often installed via ``pip`` or ``conda``.

.. _runtime-environments:

Runtime Environments
--------------------

.. note::

    This feature requires a full installation of Ray using ``pip install "ray[default]>=1.4"``, and is currently only supported on macOS and Linux.

A **runtime environment** describes the dependencies your Ray application needs to run, including files, packages, environment variables, and more. It is specified by a Python ``dict``.

By using runtime environments to specify all your dependencies, you can seamlessly move from running your Ray application on your local machine to running it on a remote cluster, without any code changes or manual setup.
[Review comment from a Contributor] Maybe this is due to my lack of understanding of runtime environments, but if you run the driver on a head node, isn't it going to be the same (like specifying the runtime env on the job config)?

[Reply from the author] You're right, it's the same, but you would need to manually set up the files and dependencies on all the worker nodes. I'll make this more clear.
Here are a few examples of runtime environments (for full details, see the :ref:`API Reference<runtime-environments-api-ref>` below):

..
  TODO(architkulkarni): run working_dir doc example in CI

.. code-block:: python

    runtime_env = {"working_dir": "/data/my_files", "pip": ["requests", "pendulum==2.1.2"]}

.. literalinclude:: ../examples/doc_code/runtime_env_example.py
  :language: python
  :start-after: __runtime_env_conda_def_start__
  :end-before: __runtime_env_conda_def_end__

Specifying a Runtime Environment Per-Job
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can specify a runtime environment for your whole job, whether running a script directly on the cluster or using :ref:`Ray Client<ray-client>`:

.. literalinclude:: ../examples/doc_code/runtime_env_example.py
  :language: python
  :start-after: __ray_init_start__
  :end-before: __ray_init_end__

..
  TODO(architkulkarni): run Ray Client doc example in CI

.. code-block:: python

    # Running on a local machine, connecting to remote cluster using Ray Client
    ray.init("ray://123.456.7.89:10001", runtime_env=runtime_env)

.. note::

    This will eagerly install the environment when ``ray.init()`` is called. To disable this behavior, add ``"eager_install": False`` to the ``runtime_env``; the environment will then only be installed when a task is invoked or an actor is created.
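For example (a sketch: the pip package and cluster address are illustrative):

```python
# A runtime_env that defers installation until first use.
runtime_env = {
    "pip": ["requests"],
    # Install only when a task is invoked or an actor is created,
    # instead of eagerly at ray.init() time.
    "eager_install": False,
}
# ray.init("ray://123.456.7.89:10001", runtime_env=runtime_env)  # hypothetical address
```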

Specifying a Runtime Environment Per-Task or Per-Actor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can specify different runtime environments per-actor or per-task using ``.options()`` or the ``@ray.remote()`` decorator:

.. literalinclude:: ../examples/doc_code/runtime_env_example.py
  :language: python
  :start-after: __per_task_per_actor_start__
  :end-before: __per_task_per_actor_end__

This allows you to have actors and tasks running in their own environments, independent of the surrounding environment. (The surrounding environment could be the job's runtime environment, or the base environment of the cluster.)

.. _workflow-local-files:

Workflow: Local files
^^^^^^^^^^^^^^^^^^^^^
Your Ray application might depend on source files or data files.
The following simple example explains how to get your local files on the cluster.

.. code-block:: python

    import ray

    # /path/to/files is a directory on the local machine.
    # /path/to/files/hello.txt contains the string "Hello World!"
    ray.init(runtime_env={"working_dir": "/path/to/files"})

    @ray.remote
    def f():
        return open("hello.txt").read()

    print(ray.get(f.remote()))  # Hello World!

The example above starts a single-node Ray cluster on your local machine, but by specifying an address (e.g. ``ray.init("ray://123.456.7.89:10001", runtime_env=...)``) to connect to a remote cluster using Ray Client, the same code will still work, and ``f`` will run on the remote cluster.

The specified local directory will automatically be pushed to the cluster nodes when ``ray.init()`` is called.

Workflow: Ray Library development
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Suppose you are developing a library ``my_module`` on Ray.
A typical iteration cycle will involve making some changes to the source code of ``my_module`` and then running a Ray script to test the changes. Perhaps your test requires a remote cluster. To ensure your local changes show up on the cluster, you can use the ``py_modules`` field of runtime environments.

.. code-block:: python

    import ray
    import my_module

    ray.init("ray://123.456.7.89:10001", runtime_env={"py_modules": [my_module]})

    @ray.remote
    def test_my_module():
        # No need to import my_module inside this function.
        my_module.test()

    ray.get(test_my_module.remote())

.. _runtime-environments-api-ref:

API Reference
^^^^^^^^^^^^^

The ``runtime_env`` is a Python dictionary including one or more of the following keys:

- ``working_dir`` (str): Specifies the working directory for the Ray workers. This must either be an existing directory on the local machine with total size at most 100 MiB, or a path to a zip file stored in Amazon S3.
  If a local directory is specified, it will be uploaded to each node on the cluster.
  Ray workers will be started in their node's copy of this directory.

  - Examples

    - ``"."  # cwd``

    - ``"/src/my_project"``

    - ``"s3://path/to/my_dir.zip"``

  Note: Setting a local directory per-task or per-actor is currently unsupported.

  Note: If your local directory contains a ``.gitignore`` file, the files and paths specified therein will not be uploaded to the cluster.

- ``py_modules`` (List[str|module]): Specifies Python modules to import in the Ray workers. (For more ways to specify packages, see also the ``pip`` and ``conda`` fields below.)
  Each entry must be either (1) a path to a local directory, (2) a path to a zip file stored in Amazon S3, or (3) a Python module object.

  - Examples

    - ``"."``

    - ``"/src/my_module"``

    - ``"s3://path/to/my_module.zip"``

    - ``my_module  # Assumes my_module has already been imported, e.g. via 'import my_module'``

  Note: Setting options (1) and (3) per-task or per-actor is currently unsupported.
[Review comment from @rkooo567, Nov 14, 2021] Maybe have a separate section to explain which APIs are supported per-job and which per-task/actor? Like:

Supported APIs:

Jobs:

- working dir
- conda env
- py_modules...

Per-task/actor:

- conda env
  Note: For option (1), if your local directory contains a ``.gitignore`` file, the files and paths specified therein will not be uploaded to the cluster.

- ``excludes`` (List[str]): When used with ``working_dir`` or ``py_modules``, specifies a list of files or paths to exclude from being uploaded to the cluster.
  This field also supports the pattern-matching syntax used by ``.gitignore`` files: see `<https://git-scm.com/docs/gitignore>`_ for details.

  - Example: ``["my_file.txt", "path/to/dir", "*.log"]``
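A sketch combining ``working_dir`` with ``excludes`` (the paths and patterns here are illustrative, not from the source):

```python
# Upload the current directory, but skip log files and a local data folder.
runtime_env = {
    "working_dir": ".",
    "excludes": ["*.log", "data/", "tests/my_file.txt"],
}
```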

- ``pip`` (List[str] | str): Either a list of pip `requirements specifiers <https://pip.pypa.io/en/stable/cli/pip_install/#requirement-specifiers>`_, or a string containing the path to a pip
  `"requirements.txt" <https://pip.pypa.io/en/stable/user_guide/#requirements-files>`_ file.
  This will be installed in the Ray workers at runtime.
  To use a library like Ray Serve or Ray Tune, you will need to include ``"ray[serve]"`` or ``"ray[tune]"`` here.

  - Example: ``["requests==1.0.0", "aiohttp", "ray[serve]"]``

  - Example: ``"./requirements.txt"``

- ``conda`` (dict | str): Either (1) a dict representing the conda environment YAML, (2) a string containing the path to a
  `conda "environment.yml" <https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually>`_ file,
  or (3) the name of a local conda environment already installed on each node in your cluster (e.g., ``"pytorch_p36"``).
  In the first two cases, the Ray and Python dependencies will be automatically injected into the environment to ensure compatibility, so there is no need to manually include them.
  Note that the ``conda`` and ``pip`` keys of ``runtime_env`` cannot both be specified at the same time---to use them together, please use ``conda`` and add your pip dependencies in the ``"pip"`` field in your conda ``environment.yaml``.

  - Example: ``{"dependencies": ["pytorch", "torchvision", "pip", {"pip": ["pendulum"]}]}``

  - Example: ``"./environment.yml"``

  - Example: ``"pytorch_p36"``


- ``env_vars`` (Dict[str, str]): Environment variables to set.

  - Example: ``{"OMP_NUM_THREADS": "32", "TF_WARNINGS": "none"}``

- ``eager_install`` (bool): Indicates whether to install the runtime env at ``ray.init()`` time, before the workers are leased. This flag is set to ``True`` by default.
  Currently, specifying this option per-actor or per-task is not supported.

  - Example: ``{"eager_install": False}``

Once set, the runtime environment is inheritable: it applies to all tasks/actors within a job and to all child tasks/actors of a task or actor.

If a child actor or task specifies a new ``runtime_env``, it will be merged with the parent's ``runtime_env`` via a simple dict update.
For example, if ``runtime_env["pip"]`` is specified, it will override the ``runtime_env["pip"]`` field of the parent.
The one exception is the field ``runtime_env["env_vars"]``, which will be `merged` with the ``runtime_env["env_vars"]`` dict of the parent.
This allows environment variables set in the parent's runtime environment to be automatically propagated to the child, even if new environment variables are set in the child's runtime environment.
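These merging semantics can be modeled in plain Python (a simplified sketch of the behavior described above, not Ray's actual implementation):

```python
def merge_runtime_envs(parent, child):
    """Simplified model: child fields override parent fields wholesale,
    except env_vars, which are merged key-by-key (child wins on conflicts)."""
    result = {**parent, **child}
    env_vars = {**parent.get("env_vars", {}), **child.get("env_vars", {})}
    if env_vars:
        result["env_vars"] = env_vars
    return result

parent = {"pip": ["requests"], "env_vars": {"A": "1", "B": "2"}}
child = {"pip": ["pendulum"], "env_vars": {"B": "3"}}
merged = merge_runtime_envs(parent, child)
# The child's pip list replaces the parent's entirely,
# while env_vars are combined: {"A": "1", "B": "3"}
```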

Here are some examples of runtime environments combining multiple options:
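For instance, a runtime environment combining files, packages, and environment variables might look like the following (a sketch: the paths and package versions are placeholders):

```python
# Combine a working directory, pip packages, and environment variables.
runtime_env = {
    "working_dir": "/data/my_files",
    "pip": ["requests", "pendulum==2.1.2"],
    "env_vars": {"OMP_NUM_THREADS": "32"},
}
```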

2 changes: 1 addition & 1 deletion doc/source/serve/core-apis.rst
@@ -279,7 +279,7 @@ is set. In particular, it's also called when new replicas are created in the
 future if scale up your deployment later. The `reconfigure` method is also called
 each time `user_config` is updated.
 
-Dependency Management
+Handling Dependencies
 =====================
 
 Ray Serve supports serving deployments with different (possibly conflicting)