
[FEAT] [JSON Reader] Add native streaming + parallel JSON reader. #1679

Merged: 6 commits into main on Dec 6, 2023

Conversation

@clarkzinzow clarkzinzow (Contributor) commented Nov 29, 2023

This PR adds a streaming + parallel JSON reader, with full support for all fundamental dtypes except decimal and binary types, arbitrary nesting via JSON lists and objects, and nulls at all levels of the JSON object tree.
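
For context, a minimal usage sketch of the reader from the Python API (hypothetical example: read_json is the entry point added in daft/io/_json.py, and the use_native_downloader flag name is assumed from the wiring reviewed below, not a documented API):

import daft

# Hypothetical usage sketch of the new native streaming + parallel JSON reader.
# The `use_native_downloader` kwarg is assumed from the storage_config wiring
# discussed in this PR.
df = daft.read_json("data/records.jsonl", use_native_downloader=True)
df.show()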

TODOs

  • Add schema inference unit test for dtype coverage (i.e. reading the dtypes.jsonl file).
  • Add temporal type inference + parsing test coverage.
  • Benchmarking + performance audit: this reader follows the same general concurrency + parallelism model as the streaming CSV reader, which performs relatively well for cloud reads, but there's bound to be a lot of low-hanging fruit around unnecessary copies.
  • (Follow-up?) Add thorough parsing and dtype inference unit tests on in-memory defined JSON strings.
  • (Follow-up) Support for decimal and (large) binary types.
  • (Follow-up) Add support for strict parsing, i.e. returning an error instead of falling back to a null value when parsing fails.
  • (Follow-up) Fix misc. bugs in Arrow2 and upstream the fixes.
  • (Follow-up) Deflate compression support.

@clarkzinzow clarkzinzow changed the title from "[JSON Reader] Add native streaming + parallel JSON reader." to "[FEAT] [JSON Reader] Add native streaming + parallel JSON reader." Nov 29, 2023
@github-actions github-actions bot added the enhancement (New feature or request) label Nov 29, 2023

codecov bot commented Nov 29, 2023

Codecov Report

Merging #1679 (ce4ac30) into main (a53cd51) will decrease coverage by 3.22%.
The diff coverage is 96.66%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1679      +/-   ##
==========================================
- Coverage   85.11%   81.89%   -3.22%     
==========================================
  Files          55       55              
  Lines        5368     5391      +23     
==========================================
- Hits         4569     4415     -154     
- Misses        799      976     +177     
Files                              Coverage Δ
daft/execution/execution_step.py   85.15% <ø> (-7.82%) ⬇️
daft/io/_json.py                   100.00% <100.00%> (ø)
daft/logical/schema.py             89.47% <100.00%> (-1.44%) ⬇️
daft/table/micropartition.py       89.74% <100.00%> (+0.16%) ⬆️
daft/table/schema_inference.py     98.27% <100.00%> (+0.12%) ⬆️
daft/table/table_io.py             95.67% <100.00%> (+0.16%) ⬆️
daft/table/table.py                59.39% <80.00%> (-24.62%) ⬇️

... and 9 files with indirect coverage changes

@clarkzinzow clarkzinzow force-pushed the clark/json-reader branch 3 times, most recently from 0da989c to 9b1b830 on December 1, 2023 17:03

@samster25 samster25 (Member) left a comment:

Looks really clean! Just a few minor bugs and issues!

file_format_config = FileFormatConfig.from_json_config(json_config)
storage_config = StorageConfig.python(PythonStorageConfig(io_config=io_config))
if use_native_downloader:
    storage_config = StorageConfig.native(NativeStorageConfig(True, io_config))

Member:

Suggested change
-    storage_config = StorageConfig.native(NativeStorageConfig(True, io_config))
+    multithreaded_io = not context.get_context().is_ray_runner
+    storage_config = StorageConfig.native(NativeStorageConfig(multithreaded_io, io_config))

@clarkzinzow (Contributor, Author):

We're not doing that for the CSV reader at the moment; I'm not sure whether the CSV and JSON readers suffer from the same issues as the Parquet reader here, and I'd want to ensure that benchmarks aren't negatively affected before disabling multithreaded reads on the Ray runner!

daft/table/table_io.py (outdated review thread, resolved)

        read_options=json_read_options,
        io_config=config.io_config,
    )
    return _cast_table_to_schema(tbl, read_options=read_options, schema=schema)

Member:

What's the rationale for passing the schema both in the read options and then to _cast_table_to_schema?

@clarkzinzow (Contributor, Author):

The schema is used by the JSON reader for deserialization, and then _cast_table_to_schema() is a file-format-agnostic utility for ensuring that the table read from each file (1) has its dtypes coerced to the inferred global schema, (2) has a column ordering that matches the inferred global schema, and (3) has column pruning applied as imposed by projections.

def _cast_table_to_schema(table: MicroPartition, read_options: TableReadOptions, schema: Schema) -> pa.Table:
    """Performs a cast of a Daft MicroPartition to the requested Schema/Data. This is required because:
    1. Data read from the datasource may have types that do not match the inferred global schema
    2. Data read from the datasource may have columns that are out-of-order with the inferred schema
    3. We may need only a subset of columns, or differently-ordered columns, in `read_options`
    This helper function takes care of all that, ensuring that the resulting MicroPartition has all column types matching
    their corresponding dtype in `schema`, and column ordering/inclusion matches `read_options.column_names` (if provided).
    """
    pruned_schema = schema
    # If reading only a subset of fields, prune the schema
    if read_options.column_names is not None:
        pruned_schema = Schema._from_fields([schema[name] for name in read_options.column_names])
    table = table.cast_to_schema(pruned_schema)
    return table

cc @jaychia for that utility
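
For intuition, here is a tiny standalone analogue of what that utility does, using plain Python dicts to stand in for MicroPartition and Schema (purely illustrative, not Daft's API):

# Illustrative stand-in: a "table" is a dict of column name -> list of values,
# and a "schema" is a dict of column name -> Python type.
def cast_to_schema(table, schema, column_names=None):
    # (3) Prune to the requested subset/ordering of columns, if provided.
    if column_names is not None:
        schema = {name: schema[name] for name in column_names}
    # (1) Coerce each column's values to the schema dtype, (2) emitting columns
    # in schema order regardless of the input ordering.
    return {
        name: [dtype(v) if v is not None else None for v in table[name]]
        for name, dtype in schema.items()
    }

table = {"b": ["1", "2"], "a": [3.0, None]}
schema = {"a": int, "b": str}
print(cast_to_schema(table, schema, column_names=["a"]))  # {'a': [3, None]}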

src/daft-decoding/src/compression.rs (outdated review thread, resolved)
@@ -6,6 +6,7 @@ daft-core = {path = "../daft-core", default-features = false}
daft-csv = {path = "../daft-csv", default-features = false}
daft-dsl = {path = "../daft-dsl", default-features = false}
daft-io = {path = "../daft-io", default-features = false}
daft-json = {path = "../daft-json", default-features = false}

Member:

you may also need to add daft-json/python to the python feature in this toml

@clarkzinzow (Contributor, Author):
Hmm, I don't think so; none of the daft_json/python APIs should be used from daft_micropartition, and all pyo3-exposed classes (e.g. the parse and read options) should be usable from Rust without the pyo3 wrapping.

Let me know if I'm misunderstanding this dependency!

src/daft-json/src/inference.rs (review thread resolved)
src/daft-json/src/inference.rs (outdated review thread, resolved)
src/daft-json/src/inference.rs (review thread resolved)
src/daft-json/src/read.rs (outdated review thread, resolved)
fn deserialize_into<'a, A: Borrow<Value<'a>>>(target: &mut Box<dyn MutableArray>, rows: &[A]) {
    match target.data_type() {
        DataType::Null => {
            // TODO(Clark): Return an error if any of rows are not Value::Null.

Member:

is this TODO important?

@clarkzinzow (Contributor, Author):

Nah, we still need to add a "strict" parsing mode for both the CSV reader and JSON reader, so this TODO can be addressed when we add that parsing mode; for now, the readers are very forgiving, falling back to UTF8 when parsing fails and trusting the inferred dtype (e.g. here with the DataType::Null dtype).
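
For illustration, a conceptual sketch (plain Python, not the actual Arrow2/Daft deserialization code) of the difference between today's forgiving behavior and the strict mode described above:

def parse_value(raw, target_dtype, strict=False):
    """Conceptual sketch only: coerce a parsed JSON value to the inferred dtype.

    Forgiving mode (current behavior): fall back to a null value when the
    coercion fails. Strict mode (the follow-up TODO): surface an error instead.
    """
    try:
        return target_dtype(raw)
    except (TypeError, ValueError):
        if strict:
            raise ValueError(f"could not parse {raw!r} as {target_dtype.__name__}")
        return None  # forgiving: fall back to null

print(parse_value("42", int))            # 42
print(parse_value("not-a-number", int))  # None (forgiving fallback)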

@clarkzinzow clarkzinzow merged commit 3693c22 into main Dec 6, 2023
39 of 40 checks passed
@clarkzinzow clarkzinzow deleted the clark/json-reader branch December 6, 2023 02:25