[Datasets] Add Path Partitioning Support for All Content Types #23624

pdames · 2022-03-31T01:26:37Z

Why are these changes needed?

Adds a content-type-agnostic partition parser with support for filtering files. Also adds some corner-case bug fixes and usability improvements for supporting more robust input path types. This is the first PR of a series originally proposed in #23179.

The primary differences from #23179 are that this PR (1) only includes changes related to path-based partitioning and extended input path type support, (2) includes unit tests for path-based partitioning and corner-case path type support, and (3) refactors the single PathPartitioning class into PathPartitionBase, PathPartitionGenerator, and PathPartitionParser.
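
For readers unfamiliar with path-based partitioning, the following is a minimal sketch of hive-style (`key=value`) partition parsing and filtering. It is not the code added by this PR: `parse_hive_partitions` and `filter_paths` are hypothetical helper names, and the example paths are made up for illustration.

```python
import posixpath
from typing import Callable, Dict, List


def parse_hive_partitions(path: str, base_dir: str = "") -> Dict[str, str]:
    """Extract `key=value` directory segments found under `base_dir`."""
    # Hypothetical helper for illustration only; not part of Ray's API.
    base = base_dir.strip("/")
    rel = path.strip("/")
    if base:
        if rel != base and not rel.startswith(base + "/"):
            return {}  # path is not under the base directory
        rel = rel[len(base):].strip("/")
    # Only directory segments can be partitions, so drop the file name.
    dirs = posixpath.dirname(rel).split("/")
    partitions = {}
    for segment in dirs:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


def filter_paths(
    paths: List[str],
    filter_fn: Callable[[Dict[str, str]], bool],
    base_dir: str = "",
) -> List[str]:
    """Return the paths whose parsed partitions satisfy `filter_fn`."""
    return [p for p in paths if filter_fn(parse_hive_partitions(p, base_dir))]


paths = [
    "data/year=2021/month=12/a.csv",
    "data/year=2022/month=01/b.csv",
    "data/unpartitioned.csv",
]
selected = filter_paths(paths, lambda kv: kv.get("year") == "2022", base_dir="data")
```

Because the parser looks only at the path string, the same logic applies to CSV, JSON, Parquet, or any other file type, which is the sense in which the approach is content-type-agnostic.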

Related issue number

Partially resolves #22910.

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

pdames · 2022-04-04T17:20:56Z

@jianoaix @ericl @clarkzinzow This should be ready for review - it doesn't look like the failing tests are related.

ericl · 2022-04-04T22:41:42Z

@clarkzinzow @jianoaix can you review?

jianoaix

Thank you Patrick for splitting the PR and adding unit tests!
Overall looking good to me, just some small comments.

jianoaix · 2022-04-05T00:44:29Z

python/ray/data/datasource/partitioning.py

+        paths: List[str],
+        filesystem: "pyarrow.fs.FileSystem",
+    ) -> List[str]:
+        """Removes all paths that don't pass this partition scheme's partition filter.


This one-liner reads like it's going to mutate the state of this class. Maybe just say "Returns ...."

jianoaix · 2022-04-05T21:34:51Z

python/ray/data/datasource/file_based_datasource.py

@@ -114,7 +117,7 @@ def _get_write_path_for_block(
        # Use forward slashes for cross-filesystem compatibility, since PyArrow


Update to reflect that this is choosing POSIX format for cross-fs compatibility?

jianoaix · 2022-04-05T21:37:20Z

python/ray/data/tests/conftest.py

@@ -93,7 +110,116 @@ def _get_write_path_for_block(
            suffix = (
                f"{block_index:06}_{num_rows:02}_{dataset_uuid}" f".test.{file_format}"
            )
-            print(f"Writing to: {base_path}/{suffix}")
            return f"{base_path}/{suffix}"


Use posix util?
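
One possible reading of this suggestion, sketched with the standard library's `posixpath` module, which always joins with forward slashes regardless of the host OS. The function name `write_path_for_block` and its arguments are illustrative assumptions, not the signature used in the PR.

```python
import posixpath


def write_path_for_block(base_path: str, block_index: int, file_format: str) -> str:
    # Hypothetical helper: build a block file name and join it POSIX-style,
    # so the separator is "/" even when running on Windows.
    suffix = f"{block_index:06}.{file_format}"
    return posixpath.join(base_path, suffix)


path = write_path_for_block("bucket/output", 3, "csv")
```

Unlike string formatting with a hard-coded `/`, `posixpath.join` does not produce a doubled separator when `base_path` already ends with a slash.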

pdames · 2022-04-08T16:03:23Z

Hmm, seems like the latest change to just python/ray/data/datasource/partitioning.py didn't trigger the CI datasets tests. Do we need to change a CI script somewhere to ensure that changes to this file automatically run dataset tests?

clarkzinzow · 2022-04-08T16:08:22Z

@pdames CI was broken yesterday due to a TypeScript dependency change causing the dashboard compilation to fail, if you rebase onto latest master it should be fixed!

pdames · 2022-04-11T05:09:03Z

> @pdames CI was broken yesterday due to a TypeScript dependency change causing the dashboard compilation to fail, if you rebase onto latest master it should be fixed!

Tests are passing after rebase. Failing checks appear to be unrelated.

pdames · 2022-04-13T05:46:39Z

@jianoaix @clarkzinzow This should be ready for a final review.

jianoaix · 2022-04-13T16:47:31Z

> @jianoaix @clarkzinzow This should be ready for a final review.

Thank you for your patience. I'll get to it soon after the 1.12 release (the last push for it is happening now, and it should land in the next 1-2 days).

jianoaix · 2022-04-18T00:11:25Z

python/ray/data/tests/conftest.py

+    def _assert_base_partitioned_ds(
+        ds,
+        count=6,
+        input_files=2,


nit: num_input_files

jianoaix · 2022-04-18T00:16:40Z

python/ray/data/tests/test_partitioning.py

+    assert path_partition_generator.normalized_base_dir is None
+    partition_values = ["1", "2"]
+    partition_path = path_partition_generator(partition_values, fs)
+    assert path_partition_generator.normalized_base_dir is not None


Can this assert its actual content?

jianoaix · 2022-04-18T00:31:43Z

python/ray/data/tests/test_partitioning.py

+)
+from ray.data.tests.conftest import *  # noqa
+
+


Thanks for adding unit tests, the coverage looks good!

jianoaix · 2022-04-18T00:39:55Z

python/ray/data/datasource/partitioning.py

+        """Gets the partition key field names."""
+        return self._field_names
+
+    def _normalize_base_dir(self, filesystem: "pyarrow.fs.FileSystem"):


Could it make more sense to have filesystem in the constructor?
It looks like a path or partition is always in the context of a filesystem. Is there any need to use the same PathPartitionBase object for different filesystems?

pdames · 2022-04-21

Agreed - I've moved the filesystem to the constructor. We just have to make sure that FileBasedDatasource and PathPartitionScheme keep their filesystem resolution methods in-sync going forward. Passing the filesystem in post-construction made more sense in an earlier iteration when filesystem resolution wasn't built into this class.

jianoaix · 2022-04-18T00:46:33Z

python/ray/data/datasource/partitioning.py

+
+
+@DeveloperAPI
+class PathPartitionGenerator(PathPartitionBase):


My understanding of the roles of these 3 classes is:

PathPartitionBase (or its subclass - do we need that?) describes a path-based partition.

PathPartitionGenerator encodes the path-based partition (into a string), so it does not seem to be an IS-A relationship and hence is not a good fit to model as a subclass.

PathPartitionParser decodes a path-based partition (out of a string), which can then be used for use cases like filtering, so similarly it's not an IS-A relationship.

So it looks like we should model them with composition, not inheritance. WDYT?

jianoaix · 2022-04-18

We can discuss this, but building on the above, I think we can create the following abstractions, with these roles:

PathPartition: describes a path-based partition

PathPartitionEncoder: encodes the path-based partition into a string, so we can pass it around and use it when writing a Dataset out to file systems

PathPartitionParser (or PathPartitionDecoder): decodes a path-based partition out of a string, so we can easily access each field/value and use them for e.g. filtering

PathPartitionSelector (or PathPartitionFilter): given the PathPartition, the selector/filter function, and the fields/values, it produces a subset of partitions (to read into a Dataset)

Btw, we can have an offline meeting if this is a bit complex or inefficient to discuss over GitHub :)

pdames · 2022-04-21

I definitely like the cleaner separation of responsibilities that composition provides vs. inheritance. Let me know what you think of the latest refactor here. One naming difference is that I explicitly renamed PathPartitionBase to PathPartitionScheme instead of PathPartition, since I felt like the latter gave the impression that it was only referring to one partition, when in fact it holds the spec for an arbitrarily large number of partitions.

One thing I didn't like initially was that constructing something like a PathPartitionFilter required first constructing a PathPartitionParser and a PathPartitionScheme, but I decided to strike a balance here with static factories that provide alternate constructors with flattened arguments.

jianoaix · 2022-04-21

Thanks! For the naming, PathPartitionScheme looks great!

What do you think about making PathPartitionFilter a function vs. a class?
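
The composition-based design discussed in this thread can be sketched roughly as follows. This is a toy illustration and not the implementation merged in this PR: the `Toy*` classes, the `from_flat_args` factory name, and the hive-style encoding are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ToyScheme:
    """Describes the layout shared by all partitions (illustrative only)."""
    base_dir: str
    field_names: List[str]


class ToyEncoder:
    """Encodes partition values into a path; HAS-A scheme, not IS-A scheme."""

    def __init__(self, scheme: ToyScheme):
        self.scheme = scheme

    def __call__(self, partition_values: List[str]) -> str:
        names = self.scheme.field_names
        if len(partition_values) != len(names):
            raise ValueError("expected one value per partition field")
        dirs = [f"{k}={v}" for k, v in zip(names, partition_values)]
        return "/".join([self.scheme.base_dir.rstrip("/"), *dirs])


class ToyParser:
    """Decodes partition key/value pairs out of a path; HAS-A scheme."""

    def __init__(self, scheme: ToyScheme):
        self.scheme = scheme

    def __call__(self, path: str) -> Dict[str, str]:
        # Skip the last segment, which is the file name.
        pairs = (seg.partition("=") for seg in path.split("/")[:-1])
        return {k: v for k, sep, v in pairs if sep and k in self.scheme.field_names}


class ToyFilter:
    """Selects paths by their parsed partitions; HAS-A parser."""

    def __init__(self, parser: ToyParser, filter_fn: Callable[[Dict[str, str]], bool]):
        self.parser = parser
        self._filter_fn = filter_fn

    @classmethod
    def from_flat_args(cls, filter_fn, base_dir: str, field_names: List[str]) -> "ToyFilter":
        # Flattened-argument factory: builds the scheme and parser for the caller.
        return cls(ToyParser(ToyScheme(base_dir, field_names)), filter_fn)

    def __call__(self, paths: List[str]) -> List[str]:
        return [p for p in paths if self._filter_fn(self.parser(p))]


scheme = ToyScheme("data", ["year", "month"])
encoded = ToyEncoder(scheme)(["2022", "04"])
keep_2022 = ToyFilter.from_flat_args(
    lambda kv: kv.get("year") == "2022", "data", ["year", "month"]
)
kept = keep_2022([f"{encoded}/part-0.csv", "data/year=2021/month=12/part-0.csv"])
```

In this sketch the filter holds a parser and the parser holds a scheme, so each piece can be tested on its own, while the factory spares callers from constructing the intermediate objects by hand.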

jianoaix · 2022-04-18T00:52:20Z

python/ray/data/datasource/partitioning.py

+            )
+        self._filter_fn = filter_fn
+
+    def filter_paths(


Thinking more about abstractions we are building here, how about we separate this out from Parser as a Filter?
I mean whether this will make sense depends on what you think about the roles of those classes as mentioned above :)

pdames · 2022-04-21

I've separated the filter out from the parser in the latest revision. One tangential benefit is the ability to now test the path partition parser independent of partition filtering.

jianoaix

Thank you for making the change, looks nice!

jianoaix · 2022-04-21T17:14:38Z

python/ray/data/datasource/partitioning.py

+        )
+        return {field_names[i]: d for i, d in enumerate(dirs)} if dirs else {}
+
+    def __call__(self, path: str) -> Dict[str, str]:


Nit: move this up before private methods and other public methods.
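
For context, the dictionary comprehension in the snippet above pairs each directory level with a field name by position. A minimal standalone sketch of that directory-style mapping, assuming a hypothetical helper `parse_dir_partitions` that receives the directory portion of a path relative to the base directory:

```python
from typing import Dict, List


def parse_dir_partitions(rel_dir: str, field_names: List[str]) -> Dict[str, str]:
    # Hypothetical helper: pair each directory level with a field name, in order.
    dirs = [d for d in rel_dir.split("/") if d]
    if dirs and len(dirs) != len(field_names):
        raise ValueError(f"expected {len(field_names)} directories, got {len(dirs)}")
    return {field_names[i]: d for i, d in enumerate(dirs)} if dirs else {}


parsed = parse_dir_partitions("2022/04", ["year", "month"])
```

An empty relative directory yields an empty dictionary in this sketch, mirroring the `if dirs else {}` guard in the diff.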

jianoaix · 2022-04-21T17:14:53Z

python/ray/data/datasource/partitioning.py

+            )
+        return self._encoder_fn(values)
+
+    def __call__(self, partition_values: List[str]) -> str:


nit: move magic method up - I think it's the core of this class.

jianoaix · 2022-04-21T17:17:03Z

python/ray/data/datasource/partitioning.py

+    ) -> "PathPartitionParser":
+        """Creates a path-based partition parser using a flattened argument list.
+
+        Args:


Add "filesystem" arg to this section?

jianoaix · 2022-04-21T19:46:57Z

python/ray/data/datasource/partitioning.py

+        self._filter_fn = filter_fn
+
+    @property
+    def parser(self) -> PathPartitionParser:


Do we need to expose parser? It looks like it's never used; if so, can we remove this method?

pdames · 2022-04-21

It's currently only used by tests, but I'd expect it to also be useful to anyone constructing a filter since it provides the only path back to the properties of the underlying partition scheme (e.g. via filter.parser.scheme.* to retrieve the base directory, partition field names, etc.).

jianoaix · 2022-04-21T19:58:52Z

python/ray/data/datasource/partitioning.py

+    ) -> "PathPartitionFilter":
+        """Creates a path-based partition filter using a flattened argument list.
+
+        Args:


Add 'filesystem' to Args as well here.

jianoaix

Nice work, LGTM

clarkzinzow

LGTM! @pdames awesome tests and documentation, and @jianoaix great reviewing! 🙌

clarkzinzow · 2022-04-14T20:47:51Z

python/ray/data/datasource/file_based_datasource.py

-    return parsed.netloc + parsed.path
+    parsed = urllib.parse.urlparse(path, allow_fragments=False)  # support '#' in path
+    query = "?" + parsed.query if parsed.query else ""  # support '?' in path
+    return parsed.netloc + parsed.path + query
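
The effect of this change can be checked in isolation with the standard library. The sketch below wraps the three added lines in a hypothetical function named `strip_scheme`; the name and the example URI are assumptions for illustration.

```python
import urllib.parse


def strip_scheme(path: str) -> str:
    # allow_fragments=False keeps '#' inside the path instead of splitting
    # it off into a separate fragment component.
    parsed = urllib.parse.urlparse(path, allow_fragments=False)
    # urlparse still splits on '?', so re-attach the query text if present.
    query = "?" + parsed.query if parsed.query else ""
    return parsed.netloc + parsed.path + query


uri = "s3://bucket/dir#1/file?v=2.csv"
with_fix = strip_scheme(uri)

# For contrast: the default parse drops everything from '#' onward.
default = urllib.parse.urlparse(uri)
without_fix = default.netloc + default.path
```

Here `with_fix` keeps the full `bucket/dir#1/file?v=2.csv`, while `without_fix` is truncated to `bucket/dir`. Note that a trailing `?` with nothing after it would still be dropped, since the query string is empty in that case.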


clarkzinzow · 2022-04-22T22:48:03Z

Test failures appear to be unrelated, merging.

…ource (#24094) Adds a fast file metadata provider that trades comprehensive file metadata collection for speed, and which also disables directory path expansion, which can be very slow on some cloud storage service providers. This PR also refactors the Parquet datasource to take advantage of both these changes and the content-type-agnostic partitioning support from #23624. This is the second PR of a series originally proposed in #23179.

pdames requested a review from jianoaix March 31, 2022 01:26

pdames requested a review from ericl as a code owner March 31, 2022 01:26

pdames assigned ericl Mar 31, 2022

pdames requested review from scv119, clarkzinzow and jjyao as code owners March 31, 2022 01:26

pdames assigned clarkzinzow and jianoaix Mar 31, 2022

pdames force-pushed the path-partitioning branch from 9c7e654 to aaf28c6 Compare March 31, 2022 20:24

pdames mentioned this pull request Apr 1, 2022

[Datasets] Add bulk Parquet file reader API #23179

Merged

ericl removed their assignment Apr 4, 2022

jianoaix reviewed Apr 5, 2022

pdames requested a review from jianoaix April 7, 2022 09:09

pdames force-pushed the path-partitioning branch from 6032cbf to ea0bca8 Compare April 7, 2022 09:12

pdames force-pushed the path-partitioning branch from 7400daf to a7e66a7 Compare April 10, 2022 07:01

jianoaix reviewed Apr 18, 2022

pdames added 7 commits April 20, 2022 22:37

[Datasets] Add content-type-agnostic path-based partition parser and filter.

99b1770

[Datasets] Add partition path generator and unit tests.

91f02b9

[Datasets] Add tests for corner-case characters in paths.

e7addc3

[Datasets] Minor refactoring and failing S3 unit test fixes.

1d01a54

[Datasets] Add partitioning API tests. Minor fixes, refactoring, and docstring updates.

1a9ccf4

[Datasets] Allow unpartitioned paths under base dir when using hive partitioning w/ field names.

6db7665

[Datasets] Refactoring. Separate filter class from parser.

4051126

pdames force-pushed the path-partitioning branch from a7e66a7 to 4051126 Compare April 21, 2022 05:37

[Datasets] Fix broken tests.

ca5f522

jianoaix reviewed Apr 21, 2022

pdames force-pushed the path-partitioning branch from b956ceb to 01afd9d Compare April 21, 2022 21:37

[Datasets] Minor refactoring and docstring updates.

1e7a1b8

pdames force-pushed the path-partitioning branch from 01afd9d to 1e7a1b8 Compare April 21, 2022 21:43

jianoaix approved these changes Apr 21, 2022

pdames mentioned this pull request Apr 21, 2022

[Datasets] Add fast file metadata provider and refactor Parquet datasource #24094

Merged

[Datasets] Give updated unit test a more appropriate name.

0c49452

clarkzinzow approved these changes Apr 22, 2022

clarkzinzow merged commit 9f4cb9b into ray-project:master Apr 22, 2022

bveeramani mentioned this pull request Apr 26, 2022

[Data] Add partitioning classes to Data API reference #24203

Merged
