[Data] Auto increasing block size for read_json #42357

Merged
merged 13 commits into from
Jan 22, 2024

@omatthew98 omatthew98 commented Jan 12, 2024

Why are these changes needed?

This PR adds logic to dynamically increase the block_size used in Arrow's JSON loader when the initial block_size is too small, resulting in more graceful handling of these cases.
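For illustration, here is a minimal sketch of the retry idea, using only the standard library. The parser, the exception class, and the function names are hypothetical stand-ins, not the code in this PR: the toy parser fails whenever a record is longer than the block size, and the wrapper doubles the block size until the parse succeeds or the cap is reached.

```python
class StraddlingError(ValueError):
    """Hypothetical stand-in for pyarrow.lib.ArrowInvalid."""


def parse_with_block_size(data: bytes, block_size: int) -> list:
    """Toy parser: fails if any newline-delimited record exceeds block_size."""
    records = data.splitlines()
    if any(len(record) > block_size for record in records):
        raise StraddlingError(
            "straddling object straddles two block boundaries "
            "(try to increase block size?)"
        )
    return records


def read_with_auto_block_size(data: bytes, init_block_size: int, max_block_size: int):
    """Retry the parse, doubling block_size after each straddling failure."""
    block_size = init_block_size
    while True:
        try:
            return parse_with_block_size(data, block_size), block_size
        except StraddlingError:
            if block_size >= max_block_size:
                # The cap has been reached; give up and surface the error.
                raise
            block_size *= 2  # geometric increase


if __name__ == "__main__":
    sample = b'{"a": 1}\n{"b": "' + b"x" * 100 + b'"}\n'
    rows, used = read_with_auto_block_size(sample, 16, 1024)
    print(len(rows), used)
```

Doubling keeps the number of retries logarithmic in the size of the largest record, at the cost of possibly overshooting the smallest block size that would have worked.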

Related issue number

Closes #41196

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

from ray.data.datasource.file_based_datasource import FileBasedDatasource
from ray.util.annotations import PublicAPI

if TYPE_CHECKING:
    import pyarrow

logger = logging.getLogger(__name__)

Comment on lines 1045 to 1051
When reading large files, the default block size configured in PyArrow can be too small,
resulting in the following error:
``pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries
(try to increase block size?)``.

To resolve this, use the ``read_options`` parameter to set a larger block size:
(try to increase block size?)``. The read will be retried with geometrically
increasing block size until the size reaches `DataContext.get_current().target_max_block_size`.
The initial block size will start at the PyArrow default block size or it can be
manually set through the ``read_options`` parameter as follows:

with this PR, ideally the user shouldn't be seeing this error message any more.
let's instead move this info (related to the "try to increase block size?") into the new code you implemented in _read_stream()

Comment on lines 53 to 55
read_options=json.ReadOptions(
    use_threads=use_threads, block_size=block_size
),

since ReadOptions can have other attributes/parameters as well, can we pass those or create + modify a copy?
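One way to picture the "create + modify a copy" option is the sketch below. It uses a hypothetical dataclass in place of the real options object, so it makes no claim about which copy mechanisms PyArrow's `ReadOptions` supports; the point is only that every user-supplied field survives while `block_size` changes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ReadOptionsSketch:
    """Hypothetical stand-in for a reader options object with several fields."""

    use_threads: bool = True
    block_size: int = 1 << 20


def with_block_size(options: ReadOptionsSketch, block_size: int) -> ReadOptionsSketch:
    # replace() copies every field and overrides only block_size, so the
    # caller's object is left untouched and no other setting is dropped.
    return replace(options, block_size=block_size)


if __name__ == "__main__":
    user_options = ReadOptionsSketch(use_threads=False, block_size=1024)
    print(with_block_size(user_options, 2048))
```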

    and "straddling" not in str(e)
    or block_size > max_block_size
):
    raise e

let's modify the error message of the exception that is raised to include the largest block size that was tried. and let's also include a link to this GH issue for more details: apache/arrow#25674

@omatthew98 omatthew98 Jan 19, 2024

Updated the exception to this: pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?) - Auto-increasing block size to 4B failed. More information on this issue can be found here: https://github.com/apache/arrow/issues/25674
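As a rough sketch of how such a message could be assembled (the helper name and constant are hypothetical, and the unit is spelled out as "bytes" in line with a suggestion made later in this review):

```python
ARROW_ISSUE_URL = "https://github.com/apache/arrow/issues/25674"


def describe_failed_retry(original_error: Exception, last_block_size: int) -> str:
    """Append the largest block size tried and the issue link to the error text."""
    return (
        f"{original_error} - Auto-increasing block size to "
        f"{last_block_size} bytes failed. "
        f"More information on this issue can be found here: {ARROW_ISSUE_URL}"
    )


if __name__ == "__main__":
    err = ValueError(
        "straddling object straddles two block boundaries "
        "(try to increase block size?)"
    )
    print(describe_failed_retry(err, 4))
```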

Signed-off-by: Matthew Owen <[email protected]>
if isinstance(e, ArrowInvalid) and "straddling" in str(e):
    if self.read_options.block_size < max_block_size:
        # Increase the block size in case it was too small.
        logger.get_logger(log_to_stdout=False).info(

let's use True for this since it is important, so that the user always sees this message in stdout logs

        self.read_options.block_size = init_block_size
        break
    except ArrowInvalid as e:
        if isinstance(e, ArrowInvalid) and "straddling" in str(e):

we don't need this check inside the except block right? let's also compare to a longer string in case there are other error messages which use the word straddling:

Suggested change
if isinstance(e, ArrowInvalid) and "straddling" in str(e):
if "straddling object straddles two block boundaries" in str(e):
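To make the difference concrete, here is a small sketch (the helper and constant names are hypothetical) that matches the full phrase, so an unrelated error that merely contains the word "straddling" is not retried:

```python
STRADDLING_MESSAGE = "straddling object straddles two block boundaries"


def is_straddling_error(exc: Exception) -> bool:
    """Return True only when the error text contains the full phrase."""
    return STRADDLING_MESSAGE in str(exc)


if __name__ == "__main__":
    block_error = ValueError(
        "straddling object straddles two block boundaries "
        "(try to increase block size?)"
    )
    unrelated_error = ValueError("straddling column names are not supported")
    print(is_straddling_error(block_error), is_straddling_error(unrelated_error))
```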

else:
    raise ArrowInvalid(
        f"{e} - Auto-increasing block size to "
        f"{self.read_options.block_size}B failed. "

nit

Suggested change
f"{self.read_options.block_size}B failed. "
f"{self.read_options.block_size} bytes failed. "

# When reading large files, the default block size configured in PyArrow can be
# too small, resulting in the following error: `pyarrow.lib.ArrowInvalid:
# straddling object straddles two block boundaries (try to increase block
# size?)`. The read will be retried with geometrically increasing block size

nit: can we also include the arrow issue link here?

Signed-off-by: Matthew Owen <[email protected]>
@c21 c21 changed the title Auto increasing block size for read_json [Data] Auto increasing block size for read_json Jan 22, 2024

@c21 c21 left a comment

Great fix @omatthew98! Just one minor comment.

if "straddling object straddles two block boundaries" in str(e):
    if self.read_options.block_size < max_block_size:
        # Increase the block size in case it was too small.
        logger.get_logger(log_to_stdout=True).info(

We don't need to log to stdout, as this would confuse users if there's something wrong.

Signed-off-by: Matthew Owen <[email protected]>
@c21 c21 merged commit 3d91fb7 into ray-project:master Jan 22, 2024
9 checks passed

Successfully merging this pull request may close these issues.

[Data] Improve block size selection when reading large jsonl data chunks
3 participants