[Data] Add partitioning classes to Data API reference (ray-project#24203)
bveeramani authored and Ubuntu committed May 23, 2022
1 parent 71443f8 commit 2ce0b2d
Showing 3 changed files with 43 additions and 8 deletions.
19 changes: 18 additions & 1 deletion doc/source/data/package-ref.rst
@@ -123,6 +123,23 @@ Custom Datasource API
 .. autoclass:: ray.data.ReadTask
     :members:

+Datasource Partitioning API
+---------------------------
+
+.. autoclass:: ray.data.datasource.PartitionStyle
+    :members:
+
+.. autoclass:: ray.data.datasource.PathPartitionScheme
+    :members:
+
+.. autoclass:: ray.data.datasource.PathPartitionEncoder
+    :members:
+
+.. autoclass:: ray.data.datasource.PathPartitionParser
+    :members:
+
+.. autoclass:: ray.data.datasource.PathPartitionFilter
+
 Built-in Datasources
 --------------------

@@ -146,7 +163,7 @@ Built-in Datasources

.. autoclass:: ray.data.datasource.RangeDatasource
:members:

.. autoclass:: ray.data.datasource.SimpleTensorFlowDatasource
:members:

2 changes: 2 additions & 0 deletions python/ray/data/datasource/__init__.py
@@ -31,6 +31,7 @@
     PathPartitionEncoder,
     PathPartitionFilter,
     PathPartitionParser,
+    PathPartitionScheme,
 )
 from ray.data.datasource.tensorflow_datasource import SimpleTensorFlowDatasource
 from ray.data.datasource.torch_datasource import SimpleTorchDatasource
@@ -57,6 +58,7 @@
     "PathPartitionEncoder",
     "PathPartitionFilter",
     "PathPartitionParser",
+    "PathPartitionScheme",
     "RandomIntRowDatasource",
     "RangeDatasource",
     "ReadTask",
30 changes: 23 additions & 7 deletions python/ray/data/datasource/partitioning.py
@@ -22,10 +22,12 @@ class PartitionStyle(str, Enum):
     Examples:
         >>> # Serialize to JSON text.
-        >>> json.dumps(PartitionStyle.HIVE) # "hive"
+        >>> json.dumps(PartitionStyle.HIVE) # doctest: +SKIP
+        '"hive"'
         >>> # Deserialize from JSON text.
-        >>> PartitionStyle(json.loads('"hive"')) # PartitionStyle.HIVE
+        >>> PartitionStyle(json.loads('"hive"')) # doctest: +SKIP
+        <PartitionStyle.HIVE: 'hive'>
     """

     HIVE = "hive"
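The doctest outputs in this hunk follow from `PartitionStyle` mixing `str` into `Enum`: the JSON encoder treats each member as a plain string. The sketch below reproduces that round trip with a stand-in enum so it runs without Ray installed. The stand-in class and the `DIRECTORY = "dir"` value are assumptions for illustration; only `HIVE = "hive"` appears in the diff.

```python
import json
from enum import Enum


class PartitionStyle(str, Enum):
    """Stand-in for ray.data.datasource.PartitionStyle (not the real class)."""

    HIVE = "hive"
    DIRECTORY = "dir"  # assumed value; the hunk above only shows HIVE


# Serialize to JSON text: a str-mixin Enum member is encoded as its string value.
serialized = json.dumps(PartitionStyle.HIVE)
print(serialized)  # "hive"

# Deserialize from JSON text: look the member up by its value.
restored = PartitionStyle(json.loads(serialized))
print(restored is PartitionStyle.HIVE)  # True
```

Because the lookup is by value, the round trip returns the same enum member rather than a bare string, which is what the updated doctest output `<PartitionStyle.HIVE: 'hive'>` shows.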
@@ -151,6 +153,7 @@ def of(
         filesystem: Optional["pyarrow.fs.FileSystem"] = None,
     ) -> "PathPartitionEncoder":
         """Creates a new partition path encoder.
+
         Args:
             style: The partition style - may be either HIVE or DIRECTORY.
             base_dir: "/"-delimited base directory that all partition paths will be
@@ -426,13 +429,26 @@ def of(
                 partition or `False` to skip it. Partition keys and values are always
                 strings read from the filesystem path. For example, this removes all
                 unpartitioned files:
-                ``lambda d: True if d else False``
+
+                .. code:: python
+
+                    lambda d: True if d else False
+
                 This raises an assertion error for any unpartitioned file found:
-                ``def do_assert(val, msg):
-                    assert val, msg
-                lambda d: do_assert(d, "Expected all files to be partitioned!")``
+
+                .. code:: python
+
+                    def do_assert(val, msg):
+                        assert val, msg
+
+                    lambda d: do_assert(d, "Expected all files to be partitioned!")
+
                 And this only reads files from January, 2022 partitions:
-                ``lambda d: d["month"] == "January" and d["year"] == "2022"``
+
+                .. code:: python
+
+                    lambda d: d["month"] == "January" and d["year"] == "2022"
+
             style: The partition style - may be either HIVE or DIRECTORY.
             base_dir: "/"-delimited base directory to start searching for partitions
                 (exclusive). File paths outside of this directory will be considered
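The three filter callbacks in the docstring above can be exercised as plain Python functions, without Ray. The sketch below assumes only what the docstring states: the callback receives a dictionary of string partition keys and values and returns a truthy value to read a path. The function names and sample dictionaries are hypothetical. Two details differ from the docstring on purpose: the assertion variant returns `True` explicitly (the docstring's `do_assert` returns `None`, which is falsy), and the month/year variant uses `dict.get` so that an empty dictionary does not raise `KeyError`.

```python
from typing import Dict


def keep_partitioned(d: Dict[str, str]) -> bool:
    """Skip unpartitioned files: an empty dict means no partition keys were found."""
    return bool(d)


def require_partitioned(d: Dict[str, str]) -> bool:
    """Fail loudly on an unpartitioned file, otherwise keep the path."""
    assert d, "Expected all files to be partitioned!"
    return True  # explicit, since a bare assert helper would return None


def january_2022(d: Dict[str, str]) -> bool:
    """Keep only January 2022 partitions; values are strings, not ints."""
    return d.get("month") == "January" and d.get("year") == "2022"


# Hypothetical parsed partition dictionaries, one per file path.
samples = [
    {},  # unpartitioned file
    {"year": "2022", "month": "January"},
    {"year": "2021", "month": "January"},
]

kept_partitioned = [d for d in samples if keep_partitioned(d)]
kept_january = [d for d in samples if january_2022(d)]
print(len(kept_partitioned), len(kept_january))  # 2 1
```

Whether Ray treats a `None` return as "skip" is not shown in this diff, so returning an explicit boolean is the safer reading of the `True`/`False` contract the docstring describes.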
