Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOCS-#2720: Add documentation for partition API #2721

Closed
wants to merge 7 commits into from
Closed
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
79 changes: 79 additions & 0 deletions docs/developer/pandas/partition_api.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
Pandas Partition API in Modin
YarShev marked this conversation as resolved.
Show resolved Hide resolved
=============================

If you are working with a Modin Dataframe, you can unwrap its remote partitions
to get the raw future objects compatible with the execution engine (e.g. ``ray.ObjectRef`` for Ray).
YarShev marked this conversation as resolved.
Show resolved Hide resolved
You can use this API to get the IPs of the nodes that hold these objects as well. In that case you can pass the partitions
YarShev marked this conversation as resolved.
Show resolved Hide resolved
having needed IPs to your function. It can help with minimazing of data movement between nodes. In addition to
unwrapping of the remote partitions we also provide API to construct Modin DataFrame from them.

Ray engine
----------
However, it is worth noting that for Modin on ``Ray`` engine with ``pandas`` backend IPs of the remote partitions may not match
YarShev marked this conversation as resolved.
Show resolved Hide resolved
actual locations if the partitions are lower than 100 kB. Ray saves such objects (<= 100 kB, by default) in in-process store
of the calling process (please, refer to `Ray documentation`_ for more information). We can't get IPs for such objects while maintaining good performance.
So, you should keep in mind this for unwrapping of the remote partitions with their IPs. Several options are provided to handle the case in
``How to handle Ray objects that are lower 100 kB`` section.

Dask engine
-----------
There is no mentioned above issue for Modin on ``Dask`` engine with ``pandas`` backend because ``Dask`` saves any objects
in the worker process that processes a function (please, refer to `Dask documentation`_ for more information).

Install Modin Pandas Partition API
YarShev marked this conversation as resolved.
Show resolved Hide resolved
----------------------------------

Modin now comes with all the dependencies for pandas partition API functionality by default! See
the :doc:`installation page </installation>` for more information on installing Modin.

How to handle Ray objects that are lower than 100 kB
----------------------------------------------------

* If you are sure that each of the remote partitions being unwrapped is higher than 100 kB, you can just import Modin or perform ``ray.init()`` manually.

* If you don't know partition sizes you can pass the option ``_system_config={"max_direct_call_object_size": <nbytes>,}``, where ``nbytes`` is threshold for objects that will be stored in in-process store, to ``ray.init()`` or export the following environment variable:

.. code-block:: bash

export MODIN_ON_RAY_PARTITION_THRESHOLD=<nbytes>

When specifying ``nbytes`` equal to 0, all the objects will be saved to shared-memory object store (plasma).

* You can also start Ray as follows: ``ray start --head --system-config='{"max_direct_call_object_size":<nbytes>}'``.

Note that when specifying the threshold the performance of some Modin operations may change.

API
---

It is currently supported the following API:

.. automodule:: modin.distributed.dataframe.pandas
:noindex:
:members: unwrap_partitions

YarShev marked this conversation as resolved.
Show resolved Hide resolved
.. automodule:: modin.distributed.dataframe.pandas
:noindex:
:members: from_partitions
YarShev marked this conversation as resolved.
Show resolved Hide resolved

Running an example with pandas partition API
--------------------------------------------

Before you run this, please make sure you follow the instructions listed above.

.. code-block:: python

import modin.pandas as pd
from modin.distributed.dataframe.pandas import unwrap_partitions, from_partitions
import numpy as np
data = np.random.randint(0, 100, size=(2 ** 10, 2 ** 8))
df = pd.DataFrame(data)
partitions = unwrap_partitions(df, axis=0, get_ip=True)
print(partitions)
# Also, you can create Modin DataFrame from remote partitions including their IPs
new_df = from_partitions(partitions, axis=0)
print(new_df)


.. _`Ray documentation`: https://docs.ray.io/en/master/index.html#
.. _`Dask documentation`: https://distributed.dask.org/en/latest/index.html
1 change: 1 addition & 0 deletions docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -156,6 +156,7 @@ nature, you get a fast DataFrame at 1MB and 1TB+.

contributing
developer/architecture
developer/pandas/partition_api

.. toctree::
:caption: Engines, Backends, and APIs
Expand Down