[Data] Auto increasing block size for read_json #42357
Changes from 10 commits
```diff
@@ -1,11 +1,15 @@
 import logging
 from typing import TYPE_CHECKING, Any, Dict, List, Optional, Union

+from ray.data.context import DataContext
 from ray.data.datasource.file_based_datasource import FileBasedDatasource
 from ray.util.annotations import PublicAPI

 if TYPE_CHECKING:
     import pyarrow

 logger = logging.getLogger(__name__)


 @PublicAPI
 class JSONDatasource(FileBasedDatasource):
```
```diff
@@ -34,6 +38,36 @@ def __init__(

     # TODO(ekl) The PyArrow JSON reader doesn't support streaming reads.
     def _read_stream(self, f: "pyarrow.NativeFile", path: str):
-        from pyarrow import json
+        from io import BytesIO
+
+        from pyarrow import ArrowInvalid, json

-        yield json.read_json(f, read_options=self.read_options, **self.arrow_json_args)
+        buffer = f.read_buffer()
+        block_size = self.read_options.block_size
+        use_threads = self.read_options.use_threads
+        max_block_size = DataContext.get_current().target_max_block_size
+        while True:
+            try:
+                yield json.read_json(
+                    BytesIO(buffer),
+                    read_options=json.ReadOptions(
+                        use_threads=use_threads, block_size=block_size
+                    ),
```

> **Review comment:** Since `ReadOptions` can have other attributes/parameters as well, can we pass those or create + modify a copy?
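The reviewer's suggestion — preserve all of the caller's other option fields and override only `block_size` — can be sketched generically. Below, a plain dataclass stands in for `pyarrow.json.ReadOptions`; the class and field names are illustrative, not the actual PyArrow API surface:

```python
from dataclasses import dataclass, replace


@dataclass
class FakeReadOptions:
    """Illustrative stand-in for pyarrow.json.ReadOptions."""

    use_threads: bool = True
    block_size: int = 1 << 20
    # Other fields would be preserved by the copy as well.


# The caller's options, with a non-default setting that must survive.
opts = FakeReadOptions(use_threads=False)

# Copy-and-override: only block_size changes; everything else is kept.
retry_opts = replace(opts, block_size=opts.block_size * 2)
```

With this pattern, retrying with a larger block size no longer silently drops any other options the caller configured.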
```diff
+                    **self.arrow_json_args,
+                )
+                break
+            except ArrowInvalid as e:
+                if (
+                    isinstance(e, ArrowInvalid)
+                    and "straddling" not in str(e)
+                    or block_size > max_block_size
+                ):
+                    raise e
```

> **Review comment:** Let's modify the error message of the exception that is raised to include the largest block size that was tried, and let's also include a link to this GH issue for more details: apache/arrow#25674
>
> **Reply:** Updated the exception to this:

```diff
+                else:
+                    # Increase the block size in case it was too small.
+                    logger.info(
+                        f"JSONDatasource read failed with "
+                        f"block_size={block_size}. Retrying with "
+                        f"block_size={block_size * 2}."
+                    )
+                    block_size *= 2
```
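The retry loop in the diff can be exercised in isolation. The sketch below reproduces the same doubling strategy in pure Python; `StraddleError`, `parse`, and `MAX_BLOCK_SIZE` are stand-ins for `pyarrow.lib.ArrowInvalid`, `json.read_json`, and `target_max_block_size` — they are not real Ray or PyArrow APIs:

```python
MAX_BLOCK_SIZE = 128 * 1024 * 1024  # stand-in for target_max_block_size


class StraddleError(ValueError):
    """Stand-in for pyarrow.lib.ArrowInvalid's 'straddling' failure."""


def parse(data, block_size):
    # Fails the same way PyArrow does when a record straddles two blocks.
    if block_size < len(data):
        raise StraddleError("straddling object straddles two block boundaries")
    return data


def read_with_retry(data, block_size):
    while True:
        try:
            return parse(data, block_size), block_size
        except StraddleError as e:
            # Only retry on the "straddling" error, and only below the cap.
            if "straddling" not in str(e) or block_size > MAX_BLOCK_SIZE:
                raise
            block_size *= 2  # geometric growth, as in the PR


# A 5000-byte "record" with a 1 KiB starting block:
# block_size doubles 1024 -> 2048 -> 4096 -> 8192 before the parse succeeds.
result, final = read_with_retry(b"x" * 5000, 1024)
```

Because the block size doubles on each attempt, the number of retries is logarithmic in the size of the largest record, so the loop terminates quickly even for very large records.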
@@ -1045,9 +1045,10 @@ def read_json( | |
When reading large files, the default block size configured in PyArrow can be too small, | ||
resulting in the following error: | ||
``pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries | ||
(try to increase block size?)``. | ||
|
||
To resolve this, use the ``read_options`` parameter to set a larger block size: | ||
(try to increase block size?)``. The read will be retried with geometrically | ||
increasing block size until the size reaches `DataContext.get_current().target_max_block_size`. | ||
The initial block size will start at the PyArrow default block size or it can be | ||
manually set through the ``read_options`` parameter as follows: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. with this PR, ideally the user shouldn't be seeing this error message any more. |
||
|
||
>>> import pyarrow.json as pajson | ||
>>> block_size = 10 << 20 # Set block size to 10MB | ||
|
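As a rough illustration of the geometric growth the new docstring describes, the number of doublings needed before a record of a given size fits in one block can be computed directly. The 1 MB starting size below is an assumed illustration, not necessarily PyArrow's actual default:

```python
def doublings_needed(start_block, record_size):
    """Count how many times a block size must double to hold one record."""
    n = 0
    size = start_block
    while size < record_size:
        size *= 2
        n += 1
    return n


# A 10 MB record with a 1 MB starting block needs 4 doublings
# (1 MB -> 2 -> 4 -> 8 -> 16 MB).
retries = doublings_needed(1 << 20, 10 << 20)
```

This is why the retry loop is cheap in practice: even a record thousands of times larger than the starting block needs only a handful of attempts.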
> **Review comment:** Let's use the `DatasetLogger` instead: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/dataset_logger.py