[data] Add pickle support for PyArrow CSV WriteOptions #19378
Conversation
Hmm what if we also accepted a lambda that returned write args in addition to the write args dict? That might be a lighter-weight interface than adding a new wrapper class.
I've added a new commit that updates the relevant write-side APIs to accept a write kwargs supplier lambda (i.e. a lambda that takes no input arguments and returns a dictionary of write kwargs). If we're happy with the current changes, then I could also update the read-side APIs to accept a read kwargs supplier lambda that works in a similar way, which could help us both with code consistency and to head off similar read kwarg pickling problems that may pop up on the read side now or in the future.
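As a rough illustration of the supplier approach (a minimal sketch, not code from this PR: `ds` stands for an existing tabular Dataset, and the `write_args_provider` keyword name mirrors the `write_parquet` signature reviewed below; the exact keyword on `write_csv` may differ), the unpicklable WriteOptions object is only constructed inside the zero-argument lambda, so it never has to cross a serialization boundary:

from pyarrow import csv

# `ds` is an existing tabular ray.data.Dataset. The WriteOptions object is
# constructed lazily inside the zero-argument supplier, so only the
# (picklable) lambda is shipped to the remote write tasks.
ds.write_csv(
    "/tmp/out",
    write_args_provider=lambda: {
        "write_options": csv.WriteOptions(include_header=False)
    },
)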
Looks pretty good. Could we also rename the functions to be more consistent with the kwargs (e.g., arrow_parquet_args => arrow_parquet_args_fn, writer_args => writer_args_fn etc.)? I made a suggestion.
python/ray/data/dataset.py
Outdated
filesystem: Optional["pyarrow.fs.FileSystem"] = None,
try_create_dir: bool = True,
arrow_open_stream_args: Optional[Dict[str, Any]] = None,
write_args_provider: Optional[Callable[[], Dict[str, Any]]] = None,
write_args_provider: Optional[Callable[[], Dict[str, Any]]] = None,
arrow_parquet_args_fn: Callable[[], Dict[str, Any]] = lambda: {},
Updated to use consistent names with kwargs and default to an empty dict supplier everywhere in the latest commit.
Also sounds good.
Some test failures: (pid=3787) 2021-10-20 05:08:21,870 INFO worker.py:425 -- Task failed with retryable exception: TaskID(a42cc390bccfe367ffffffffffffffffffffffff01000000).
Still timing out on that JSON datasource test.
Nice, thanks for adding this!
I personally find the "args provider" route less appealing since:
- It adds an extra API arg that duplicates behavior of an existing API arg in a slightly confusing way for the user, who will ask: why are there two ways to give args, and when do I have to use the lambda?
- The user having to package up their unserializable args into a provider lambda is worse UX than being able to pass it directly with us taking care of making it serializable under-the-hood.
- Once Arrow fixes the serializability issues upstream, we'd need to break the API in order to remove this now unnecessary arg provider, compared to the wrapper class route which is an internal change.
- The code impact is much larger overall compared to the wrapper, which only needed to wrap/unwrap at serialization boundaries.
But I don't think this is worth blocking this from getting merged.
python/ray/data/dataset.py
Outdated
@@ -1019,6 +1021,9 @@ def write_parquet(self,
             if True. Does nothing if all directories already exist.
         arrow_open_stream_args: kwargs passed to
             pyarrow.fs.FileSystem.open_output_stream
+        arrow_parquet_args_fn: Callable that returns a dictionary of write
+            arguments to use when writing each block to a file. Overrides
+            any duplicate keys from arrow_parquet_args.
Nit: Maybe mention why this exists in addition to arrow_parquet_args? Ditto for the other functions as well.
I agree with your arguments against the args provider lambda approach, since I had many of the same thoughts while mulling over these changes. However, I wound up warming to the lambda approach the more I thought about it, for the following reasons:
- "It adds an extra API arg that duplicates behavior of an existing API arg in a slightly confusing way for the user, who will ask: why are there two ways to give args, and when do I have to use the lambda?"
Agreed. In the presence of both options, docs can help clarify when to use each, but I think the worry-free approach here is to simply recommend always using the lambda instead of kwargs. To make the APIs more clear, I think we should either (1) remove the existing dataset write API **kwargs or (2) change their name and docs to say they are reserved for future (or internal/developer) use. The main con of the lambda is that it is less idiomatic than **kwargs, but given that it covers us against any args that can't be pickled now or in the future without having to exhaustively test and/or wrap all of them, I think this is worth pursuing for long-term clarity and maintainability. I would also err slightly on the side of option (2) for backwards compatibility.
- "The user having to package up their unserializable args into a provider lambda is worse UX than being able to pass it directly with us taking care of making it serializable under-the-hood."
Agreed. I'd like to keep these details hidden from the user, and the wrapper is an effective way of doing so, but I also worry about the long-term maintainability of wrappers or custom serializers. If there are any corner cases that we miss now or in the future, then waiting for a new wrapper to be developed is also a bad user experience. The odds of missing cases will also continue to grow as new datasources are developed and as the file IO APIs we depend on continue to evolve. If we make the above changes to stop documenting (or allowing) the use of write option kwargs, then there's no longer any question about which args go where, since the lambda is the only option.
- "Once Arrow fixes the serializability issues upstream, we'd need to break the API in order to remove this now unnecessary arg provider, compared to the wrapper class route which is an internal change."
Even if we ignore the pickling problem, I've been thinking that a lambda's ability to lazily resolve distinct write options for each block is quite powerful. For example, I think we should actually change the lambda signature from Callable[[], Dict[str, Any]] to Callable[[BlockAccessor], Dict[str, Any]]. What do you and @ericl think? This would allow advanced users to intelligently determine per-block write arguments based on the properties of the block being written (e.g. adjusting parquet data_page_size or compression_level based on the size of the block being written, dictionary encoding based on column cardinalities, etc.), instead of being locked into one static set of write arguments for the entire dataset. A sketch of such a per-block callable follows this list.
- "The code impact is much larger overall compared to the wrapper, which only needed to wrap/unwrap at serialization boundaries."
Agreed. The immediate code impact of this PR spread across multiple API layers and more than doubled the LOC changed. However, over the long term, I wouldn't be surprised to see the code and developer time impact of adding/removing wrappers being greater. Granted, this is purely speculation on my part, but it is a risk.
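To make the per-block idea concrete, here is a minimal sketch of what a write args callable could look like under the proposed Callable[[BlockAccessor], Dict[str, Any]] signature; the function name and size thresholds are hypothetical, not part of this PR:

from typing import Any, Dict

from ray.data.block import BlockAccessor


def parquet_args_for_block(block: BlockAccessor) -> Dict[str, Any]:
    # Pick pyarrow.parquet.write_table kwargs based on the block being written.
    args: Dict[str, Any] = {"compression": "zstd"}
    if block.size_bytes() > 128 * 1024 * 1024:
        # Hypothetical heuristic: larger blocks get bigger data pages and
        # heavier compression.
        args["data_page_size"] = 2 * 1024 * 1024
        args["compression_level"] = 7
    else:
        args["compression_level"] = 3
    return args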
@pdames All great counter-points, thank you for taking the time to write those up! The fact that the provider (1) future-proofs against future unserializable args and (2) gives a path to per-block write args and per-file read args makes the UX hit more reasonable.
To make the APIs more clear, I think we either (1) remove existing dataset write API **kwargs or (2) change their name and docs to say they are reserved for future (or internal/developer) use. The main con of the lambda is that it is less idiomatic than **kwargs but, given that it covers us against any args that can't be pickled now or in the future without having to exhaustively test and/or wrap all of them, I think this is worth pursuing for long-term clarity and maintainability. I would also err slightly on the side of option (2) for backwards compatibility,
I think that a happy path here, from a UX perspective, would be a third option: remove the catch-all **kwargs and have a single arrow_parquet_args arg that can either be a plain options dictionary or a provider callable:
def write_parquet(
        self,
        path: str,
        *,
        filesystem: Optional["pyarrow.fs.FileSystem"] = None,
        try_create_dir: bool = True,
        arrow_open_stream_args: Optional[Dict[str, Any]] = None,
        arrow_parquet_args: Optional[
            Union[Dict[str, Any], Callable[[BlockAccessor], Dict[str, Any]]]] = None
) -> None:
    """Write the dataset to parquet.
    ...
    Args:
        ...
        arrow_open_stream_args: kwargs passed to
            pyarrow.fs.FileSystem.open_output_stream
        arrow_parquet_args: A dictionary (or callable that creates a dictionary)
            of write options to use when writing each block to a file. Use a
            callable if you want to vary the write options based on the block,
            or if one or more of the write options are not serializable.
    """
    # ...

def _write_block(self,
                 f: "pyarrow.NativeFile",
                 block: BlockAccessor,
                 writer_args: Optional[
                     Union[Dict[str, Any],
                           Callable[[BlockAccessor], Dict[str, Any]]]] = None) -> None:
    import pyarrow.parquet as pq
    writer_args = _resolve_kwargs(writer_args, block)
    pq.write_table(block.to_arrow(), f, **writer_args)

# ...

def _resolve_kwargs(
        kwargs: Optional[
            Union[Dict[str, Any], Callable[[BlockAccessor], Dict[str, Any]]]],
        block: BlockAccessor,
) -> Dict[str, Any]:
    if kwargs is None:
        kwargs = {}
    elif callable(kwargs):
        kwargs = kwargs(block)
    return kwargs
You have a more complicated type for arrow_parquet_args, but all three of the following options are available to the user without exposing multiple arguments that step on each other:
- Happy/easy path: The common case of a plain dict of write options can be passed.
- Unserializable args: Unserializable args can be generated by a callable.
- Per-block options: The options can vary for each block via a callable.
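As a usage sketch of the unified argument (not code from this PR; `ds` stands for an existing tabular Dataset and the keyword name follows the snippet above), the same arg covers both the simple and the advanced cases:

# Happy/easy path: a plain dict of write options.
ds.write_parquet("/tmp/out", arrow_parquet_args={"compression": "zstd"})

# Unserializable and/or per-block options: a callable resolved at write time.
ds.write_parquet(
    "/tmp/out",
    arrow_parquet_args=lambda block: {
        "compression": "zstd",
        "compression_level": 7 if block.size_bytes() > 10**8 else 3,
    },
)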
Hmm, the idea to add a more powerful API to configure per-block/file accesses is interesting. We could pass a context dictionary and it could return extra args... Not sure if such an API is overkill, but it could certainly justify having a separate lambda option provider.
+1 on removing the ** and making it explicit, I think that's probably a usability improvement even by itself.
Why are these changes needed?
Users cannot specify custom Dataset write options when writing output to CSV via dataset.write_csv(), because the pyarrow.csv.WriteOptions class retrieved from writer_args["write_options"] in CSVDatasource._write_block() cannot be pickled by Ray. This PR presents a solution to this problem via a _WriteOptionsWrapper class that wraps WriteOptions while implementing the required __getstate__ and __setstate__ methods.
Related issue number
Closes #19366
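For reference, a minimal sketch of the wrapper idea described above (assuming pyarrow.csv.WriteOptions exposes include_header and batch_size as readable attributes; the class actually implemented in this PR may capture more or different state):

from pyarrow import csv


class _WriteOptionsWrapper:
    # Picklable stand-in for pyarrow.csv.WriteOptions.

    def __init__(self, write_options: csv.WriteOptions):
        self.write_options = write_options

    def __getstate__(self):
        # Capture only plain-Python fields so the state can be pickled.
        return {
            "include_header": self.write_options.include_header,
            "batch_size": self.write_options.batch_size,
        }

    def __setstate__(self, state):
        # Rebuild the unpicklable WriteOptions object from its captured fields.
        self.write_options = csv.WriteOptions(**state)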
Checks
I've run scripts/format.sh to lint the changes in this PR.