-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data] Auto increasing block size for read_json #42357
[Data] Auto increasing block size for read_json #42357
Conversation
Signed-off-by: Matthew Owen <[email protected]>
Signed-off-by: Matthew Owen <[email protected]>
Signed-off-by: Matthew Owen <[email protected]>
Signed-off-by: Matthew Owen <[email protected]>
Signed-off-by: Matthew Owen <[email protected]>
Signed-off-by: Matthew Owen <[email protected]>
Signed-off-by: Matthew Owen <[email protected]>
Signed-off-by: Matthew Owen <[email protected]>
…lines Signed-off-by: Matthew Owen <[email protected]>
Signed-off-by: Matthew Owen <[email protected]>
from ray.data.datasource.file_based_datasource import FileBasedDatasource | ||
from ray.util.annotations import PublicAPI | ||
|
||
if TYPE_CHECKING: | ||
import pyarrow | ||
|
||
logger = logging.getLogger(__name__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's use the DatasetLogger
instead
https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/dataset_logger.py
python/ray/data/read_api.py
Outdated
When reading large files, the default block size configured in PyArrow can be too small, | ||
resulting in the following error: | ||
``pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries | ||
(try to increase block size?)``. | ||
|
||
To resolve this, use the ``read_options`` parameter to set a larger block size: | ||
(try to increase block size?)``. The read will be retried with geometrically | ||
increasing block size until the size reaches `DataContext.get_current().target_max_block_size`. | ||
The initial block size will start at the PyArrow default block size or it can be | ||
manually set through the ``read_options`` parameter as follows: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with this PR, ideally the user shouldn't be seeing this error message any more.
let's instead move this info (related to the "try to increase block size?") into the new code you implemented in _read_stream()
read_options=json.ReadOptions( | ||
use_threads=use_threads, block_size=block_size | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since ReadOptions can have other attributes/parameters as well, can we pass those or create + modify a copy?
and "straddling" not in str(e) | ||
or block_size > max_block_size | ||
): | ||
raise e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's modify the error message of the exception that is raised to include the largest block size that was tried. and let's also include a link to this GH issue for more details: apache/arrow#25674
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the exception to this: pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?) - Auto-increasing block size to 4B failed. More information on this issue can be found here: https://github.com/apache/arrow/issues/25674
Signed-off-by: Matthew Owen <[email protected]>
if isinstance(e, ArrowInvalid) and "straddling" in str(e): | ||
if self.read_options.block_size < max_block_size: | ||
# Increase the block size in case it was too small. | ||
logger.get_logger(log_to_stdout=False).info( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's use True for this since it is important, so that the user always sees this message in stdout logs
self.read_options.block_size = init_block_size | ||
break | ||
except ArrowInvalid as e: | ||
if isinstance(e, ArrowInvalid) and "straddling" in str(e): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't need this check inside the except block right? let's also compare to a longer string in case there are other error messages which use the word straddling:
if isinstance(e, ArrowInvalid) and "straddling" in str(e): | |
if "straddling object straddles two block boundaries" in str(e): |
else: | ||
raise ArrowInvalid( | ||
f"{e} - Auto-increasing block size to " | ||
f"{self.read_options.block_size}B failed. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit
f"{self.read_options.block_size}B failed. " | |
f"{self.read_options.block_size} bytes failed. " |
# When reading large files, the default block size configured in PyArrow can be | ||
# too small, resulting in the following error: `pyarrow.lib.ArrowInvalid: | ||
# straddling object straddles two block boundaries (try to increase block | ||
# size?)`. The read will be retried with geometrically increasing block size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we also include the arrow issue link here?
Signed-off-by: Matthew Owen <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great fix @omatthew98! Just one minor comment.
if "straddling object straddles two block boundaries" in str(e): | ||
if self.read_options.block_size < max_block_size: | ||
# Increase the block size in case it was too small. | ||
logger.get_logger(log_to_stdout=True).info( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to log to stdout, as this would confuse users if there's something wrong.
Signed-off-by: Matthew Owen <[email protected]>
Why are these changes needed?
This PR adds logic to dynamically increase
block_size
used in Arrow's JSON loader, if the initialblock_size
resulting in a more graceful handling of these cases.Related issue number
Closes #41196
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.