
[data] Add pickle support for PyArrow CSV WriteOptions #19378

Merged · 2 commits into ray-project:master on Oct 21, 2021

Conversation

@pdames (Member) commented Oct 14, 2021

Why are these changes needed?

Users cannot specify custom Dataset write options when writing output to CSV via dataset.write_csv(), because the pyarrow.csv.WriteOptions instance retrieved from writer_args["write_options"] in CSVDatasource._write_block() cannot be pickled by Ray.

This PR presents a solution to this problem via a _WriteOptionsWrapper class that wraps WriteOptions while implementing the required __getstate__ and __setstate__ methods.
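
A minimal sketch of that approach, assuming the wrapper only needs to capture the include_header and batch_size fields (the actual wrapper in this PR may capture more of WriteOptions' state):

    import pyarrow.csv

    class _WriteOptionsWrapper:
        """Pickle-friendly wrapper around pyarrow.csv.WriteOptions."""

        def __init__(self, write_options: pyarrow.csv.WriteOptions):
            self.write_options = write_options

        def __getstate__(self):
            # Capture only plain-Python constructor kwargs, since the
            # WriteOptions object itself cannot be pickled.
            return {
                "include_header": self.write_options.include_header,
                "batch_size": self.write_options.batch_size,
            }

        def __setstate__(self, state):
            # Rebuild the unpicklable WriteOptions on deserialization.
            self.write_options = pyarrow.csv.WriteOptions(**state)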

Related issue number

Closes #19366

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ericl (Contributor) commented Oct 14, 2021

Hmm, what if we also accepted a lambda that returned write args, in addition to the write args dict? That might be a lighter-weight interface than adding a new wrapper class.

@ericl added the @external-author-action-required label (alternate tag for PRs where the author doesn't have labeling permission) on Oct 14, 2021
@pdames (Member, Author) commented Oct 19, 2021

> Hmm, what if we also accepted a lambda that returned write args, in addition to the write args dict? That might be a lighter-weight interface than adding a new wrapper class.

I've added a new commit that updates the relevant write-side APIs to accept a write kwargs supplier lambda (i.e. a lambda that takes no input arguments and returns Dict[str, Any]). The current implementation merges the two dictionaries via a simple kwargs.update(lambda_supplied_kwargs) call, which gives precedence to the supplier lambda's kwargs over conflicting **kwargs, and leverages the _resolve_kwargs() helper function for consistency.
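
In other words, the merge behaves like the following sketch (_merge_write_args is an illustrative name; the PR routes this through _resolve_kwargs()):

    from typing import Any, Callable, Dict

    def _merge_write_args(write_args_fn: Callable[[], Dict[str, Any]],
                          **write_args: Any) -> Dict[str, Any]:
        # Start from the plain **kwargs, then overlay the supplier's kwargs
        # so that lambda-supplied values win on conflicting keys.
        kwargs = dict(write_args)
        kwargs.update(write_args_fn())
        return kwargs

    # The supplier's include_header overrides the conflicting keyword value:
    merged = _merge_write_args(lambda: {"include_header": False},
                               include_header=True, batch_size=1024)
    assert merged == {"include_header": False, "batch_size": 1024}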

If we're happy with the current changes, then I could also update the read-side APIs to accept a read kwargs supplier lambda that works in a similar way. This would help both with code consistency and with heading off similar read-kwarg pickling problems that may pop up on the read side, now or in the future.

@ericl removed the @external-author-action-required label on Oct 19, 2021
@ericl (Contributor) left a comment

Looks pretty good. Could we also rename the functions to be more consistent with the kwargs (e.g., arrow_parquet_args => arrow_parquet_args_fn, writer_args => writer_args_fn, etc.)? I made a suggestion.

    filesystem: Optional["pyarrow.fs.FileSystem"] = None,
    try_create_dir: bool = True,
    arrow_open_stream_args: Optional[Dict[str, Any]] = None,
    write_args_provider: Optional[Callable[[], Dict[str, Any]]] = None,
@ericl (Contributor):

Suggested change:

-   write_args_provider: Optional[Callable[[], Dict[str, Any]]] = None,
+   arrow_parquet_args_fn: Callable[[], Dict[str, Any]] = lambda: {},

@pdames (Member, Author):

Updated in the latest commit to use names consistent with the kwargs and to default to an empty-dict supplier everywhere.

@ericl (Contributor) commented Oct 19, 2021

> If we're happy with the current changes, then I could also update the read-side APIs to accept a read kwargs supplier lambda that works in a similar way. This would help both with code consistency and with heading off similar read-kwarg pickling problems that may pop up on the read side, now or in the future.

Also sounds good.

@ericl added the @external-author-action-required label on Oct 19, 2021
@ericl (Contributor) commented Oct 20, 2021

Some test failures:

(pid=3787) 2021-10-20 05:08:21,870 INFO worker.py:425 -- Task failed with retryable exception: TaskID(a42cc390bccfe367ffffffffffffffffffffffff01000000).
(pid=3787) Traceback (most recent call last):
(pid=3787)   File "python/ray/_raylet.pyx", line 565, in ray._raylet.execute_task
(pid=3787)   File "python/ray/_raylet.pyx", line 569, in ray._raylet.execute_task
(pid=3787)   File "/ray/python/ray/data/datasource/file_based_datasource.py", line 150, in write_block
(pid=3787)     write_args_fn, **write_args)
(pid=3787) TypeError: _write_block() got multiple values for argument 'column'
[the same TypeError repeats for TaskIDs 6f77e31f..., 11361861..., and 3fc9bc8b...]
(pid=5462) 2021-10-20 05:09:39,883 INFO worker.py:425 -- Task failed with retryable exception: TaskID(5caaf4fa024cdd1dffffffffffffffffffffffff01000000).
::test_json_write[local_fs-local_path-None] FAILED [ 84%]
::test_json_write[s3_fs-s3_path-s3_server] FAILED [ 85%]
::test_json_roundtrip[None-local_path] -- Test timed out at 2021-10-20 05:10:32 UTC --
================================================================================
 


@ericl (Contributor) commented Oct 20, 2021

Still timing out on that JSON datasource test.

@clarkzinzow (Contributor) left a comment

Nice, thanks for adding this!

I personally find the "args provider" route less appealing since:

  1. It adds an extra API arg that duplicates behavior of an existing API arg in a slightly confusing way for the user, who will ask: why are there two ways to give args, and when do I have to use the lambda?
  2. The user having to package up their unserializable args into a provider lambda is worse UX than being able to pass them directly, with us taking care of making them serializable under the hood.
  3. Once Arrow fixes the serializability issues upstream, we'd need to break the API in order to remove this now unnecessary arg provider, compared to the wrapper class route which is an internal change.
  4. The code impact is much larger overall compared to the wrapper, which only needed to wrap/unwrap at serialization boundaries.

But I don't think this is worth blocking the PR from getting merged.

@@ -1019,6 +1021,9 @@ def write_parquet(self,
if True. Does nothing if all directories already exist.
arrow_open_stream_args: kwargs passed to
pyarrow.fs.FileSystem.open_output_stream
arrow_parquet_args_fn: Callable that returns a dictionary of write
arguments to use when writing each block to a file. Overrides
any duplicate keys from arrow_parquet_args.
@clarkzinzow (Contributor):

Nit: Maybe mention why this exists in addition to arrow_parquet_args? Ditto for the other functions as well.

@pdames (Member, Author) replied Oct 21, 2021:

I agree with your arguments against the args provider lambda approach, since I had many of the same thoughts while mulling over these changes. Still, I wound up warming to the lambda approach the more I thought about it, for the following reasons:

> 1. It adds an extra API arg that duplicates behavior of an existing API arg in a slightly confusing way for the user, who will ask: why are there two ways to give args, and when do I have to use the lambda?

Agreed. In the presence of both options, docs can help clarify when to use each, but I think the worry-free approach here is to simply recommend always using the lambda instead of kwargs. To make the APIs clearer, we could either (1) remove the existing dataset write API **kwargs, or (2) rename them and document them as reserved for future (or internal/developer) use. The main con of the lambda is that it is less idiomatic than **kwargs, but given that it covers us against any args that can't be pickled, now or in the future, without having to exhaustively test and/or wrap all of them, I think it is worth pursuing for long-term clarity and maintainability. I would also err slightly on the side of option (2) for backwards compatibility.

> 2. The user having to package up their unserializable args into a provider lambda is worse UX than being able to pass them directly, with us taking care of making them serializable under the hood.

Agreed. I'd like to keep these details hidden from the user, and the wrapper is an effective way of doing so, but I also worry about the long-term maintainability of wrappers or custom serializers. If there are any corner cases that we miss, now or in the future, then waiting for a new wrapper to be developed is also a bad user experience. The odds of missing cases will also continue to grow as new datasources are developed and as the file IO APIs we depend on continue to evolve. If we make the above changes to stop documenting (or allowing) the use of write option kwargs, then there's no longer any question about which args go where, since the lambda is the only option.

> 3. Once Arrow fixes the serializability issues upstream, we'd need to break the API in order to remove this now unnecessary arg provider, compared to the wrapper class route which is an internal change.

Even if we ignore the pickling problem, I've been thinking that a lambda's ability to lazily resolve distinct write options for each block is quite powerful. For example, I've been thinking that we should actually change the lambda signature from Callable[[], Dict[str, Any]] to Callable[[BlockAccessor], Dict[str, Any]]. What do you and @ericl think? This would allow advanced users to intelligently determine per-block write arguments based on the properties of the block being written (e.g. adjusting parquet data_page_size or compression_level based on the size of the block, or dictionary encoding based on column cardinalities), instead of being locked into one static set of write arguments for the entire dataset.
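
For illustration, a per-block supplier under that proposed signature might look like the sketch below (the size threshold and option values are hypothetical, not from this PR):

    from typing import Any, Dict

    from ray.data.block import BlockAccessor

    def per_block_parquet_args(block: BlockAccessor) -> Dict[str, Any]:
        # Hypothetical policy: spend more CPU compressing large blocks.
        is_large = block.size_bytes() > 128 * 1024 * 1024
        return {
            "compression": "zstd",
            "compression_level": 9 if is_large else 3,
        }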

> 4. The code impact is much larger overall compared to the wrapper, which only needed to wrap/unwrap at serialization boundaries.

Agreed. The immediate code impact of this PR spread across multiple API layers and more than doubled the LOC changed. However, over the long term, I wouldn't be surprised to see the code and developer-time impact of adding/removing wrappers end up being greater. Granted, this is purely speculation on my part, but it is a risk.

@clarkzinzow (Contributor) replied Oct 21, 2021:

@pdames All great counter-points, thank you for taking the time to write those up! The fact that the provider (1) future-proofs against future unserializable args and (2) gives a path to per-block write args and per-file read args makes the UX hit more reasonable.

> To make the APIs clearer, we could either (1) remove the existing dataset write API **kwargs, or (2) rename them and document them as reserved for future (or internal/developer) use. The main con of the lambda is that it is less idiomatic than **kwargs, but given that it covers us against any args that can't be pickled, now or in the future, without having to exhaustively test and/or wrap all of them, I think it is worth pursuing for long-term clarity and maintainability. I would also err slightly on the side of option (2) for backwards compatibility.

I think a happy path here, from a UX perspective, would be a third option: remove the catch-all **kwargs and have a single arrow_parquet_args arg that can be either a plain options dictionary or a provider callable:

from typing import Any, Callable, Dict, Optional, Union

from ray.data.block import BlockAccessor

# ...

    def write_parquet(
            self,
            path: str,
            *,
            filesystem: Optional["pyarrow.fs.FileSystem"] = None,
            try_create_dir: bool = True,
            arrow_open_stream_args: Optional[Dict[str, Any]] = None,
            arrow_parquet_args: Optional[
                Union[Dict[str, Any],
                      Callable[[BlockAccessor], Dict[str, Any]]]] = None
    ) -> None:
        """Write the dataset to parquet.
        ...
        Args:
            ...
            arrow_open_stream_args: kwargs passed to
                pyarrow.fs.FileSystem.open_output_stream
            arrow_parquet_args: A dictionary (or callable that creates a
                dictionary) of write options to use when writing each block
                to a file. Use a callable if you want to vary the write
                options based on the block, or if one or more of the write
                options are not serializable.
        """

# ...

    def _write_block(self,
                     f: "pyarrow.NativeFile",
                     block: BlockAccessor,
                     writer_args: Optional[
                         Union[Dict[str, Any],
                               Callable[[BlockAccessor],
                                        Dict[str, Any]]]] = None) -> None:
        import pyarrow.parquet as pq

        writer_args = _resolve_kwargs(writer_args, block)
        pq.write_table(block.to_arrow(), f, **writer_args)

# ...

def _resolve_kwargs(
    kwargs: Optional[Union[Dict[str, Any],
                           Callable[[BlockAccessor], Dict[str, Any]]]],
    block: BlockAccessor,
) -> Dict[str, Any]:
    if kwargs is None:
        kwargs = {}
    elif callable(kwargs):
        kwargs = kwargs(block)
    return kwargs

You have a more complicated type for arrow_parquet_args, but all three of the following options are available to the user without exposing multiple arguments that step on each other (see the usage sketch after this list):

  1. Happy/easy path: The common case of a plain dict of write options can be passed.
  2. Unserializable args: Unserializable args can be generated by a callable.
  3. Per-block options: The options can vary for each block via a callable.
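
As a usage sketch under this proposed (not merged) signature, with illustrative paths and option values:

    import ray

    ds = ray.data.range(1000)

    # 1. Happy/easy path: a plain dict of write options.
    ds.write_parquet("/tmp/out", arrow_parquet_args={"compression": "snappy"})

    # 2./3. Callable: built lazily on the worker (so nothing unserializable
    # crosses the pickle boundary) and free to vary per block.
    ds.write_parquet(
        "/tmp/out",
        arrow_parquet_args=lambda block: {
            "compression": "zstd",
            "compression_level": 9 if block.size_bytes() > 128 * 1024 * 1024
            else 3,
        },
    )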

@ericl removed the @external-author-action-required label on Oct 21, 2021
@ericl merged commit 20d4787 into ray-project:master on Oct 21, 2021

@pdames deleted the dataset-csv-write-options branch on October 21, 2021