[BUG] Requested types ignored if prune_schema is enabled for JSON reading #16797

revans2 · 2024-09-11T19:29:34Z

Describe the bug
I noticed this in some unit tests for the Java APIs when I tried to enable schema pruning in CUDF by default for the Java JSON read APIs that explicitly do column pruning.

cudf/java/src/test/java/ai/rapids/cudf/TableTest.java

Lines 664 to 714 in 0b32f55

    
    @Test
    void testReadJSONNestedTypes() {
      Schema.Builder root = Schema.builder();
      Schema.Builder a = root.addColumn(DType.STRUCT, "a");
      a.addColumn(DType.STRING, "b");
      a.addColumn(DType.STRING, "c");
      a.addColumn(DType.STRING, "missing");
      Schema.Builder d = root.addColumn(DType.LIST, "d");
      d.addColumn(DType.INT64, "ignored");
      root.addColumn(DType.INT64, "also_missing");
      Schema.Builder e = root.addColumn(DType.LIST, "e");
      Schema.Builder eChild = e.addColumn(DType.STRUCT, "ignored");
      eChild.addColumn(DType.INT64, "f");
      eChild.addColumn(DType.STRING, "missing_in_list");
      eChild.addColumn(DType.INT64, "g");
      Schema schema = root.build();
      JSONOptions opts = JSONOptions.builder()
          .withLines(true)
          .build();
      StructType aStruct = new StructType(true,
          new BasicType(true, DType.STRING),
          new BasicType(true, DType.STRING),
          new BasicType(true, DType.STRING));
      ListType dList = new ListType(true, new BasicType(true, DType.INT64));
      StructType eChildStruct = new StructType(true,
          new BasicType(true, DType.INT64),
          new BasicType(true, DType.STRING),
          new BasicType(true, DType.INT64));
      ListType eList = new ListType(true, eChildStruct);
      try (Table expected = new Table.TestBuilder()
          .column(aStruct,
              new StructData(null, "C1", null),
              new StructData("B2", "C2", null),
              null,
              null)
          .column(dList,
              null,
              null,
              Arrays.asList(1L,2L,3L),
              new ArrayList<Long>())
          .column((Long)null, null, null, null) // also_missing
          .column(eList,
              null,
              null,
              null,
              Arrays.asList(new StructData(null, null, 1L), new StructData(2L, null, null), new StructData(3L, null, 4L)))
          .build();
          Table table = Table.readJSON(schema, opts, NESTED_JSON_DATA_BUFFER)) {
        assertTablesAreEqual(expected, table);
      }
    }

This test fails because column d is returned as a LIST<INT8> instead of the requested LIST<INT64>, which is what is returned for column d when pruning is disabled.

cudf/java/src/test/java/ai/rapids/cudf/TableTest.java

Lines 743 to 790 in 0b32f55

    
    @Test
    void testReadJSONNestedTypesDataSource() {
      Schema.Builder root = Schema.builder();
      Schema.Builder a = root.addColumn(DType.STRUCT, "a");
      a.addColumn(DType.STRING, "b");
      a.addColumn(DType.STRING, "c");
      a.addColumn(DType.STRING, "missing");
      Schema.Builder d = root.addColumn(DType.LIST, "d");
      d.addColumn(DType.INT64, "ignored");
      root.addColumn(DType.INT64, "also_missing");
      Schema.Builder e = root.addColumn(DType.LIST, "e");
      Schema.Builder eChild = e.addColumn(DType.STRUCT, "ignored");
      eChild.addColumn(DType.INT64, "g");
      Schema schema = root.build();
      JSONOptions opts = JSONOptions.builder()
          .withLines(true)
          .build();
      StructType aStruct = new StructType(true,
          new BasicType(true, DType.STRING),
          new BasicType(true, DType.STRING),
          new BasicType(true, DType.STRING));
      ListType dList = new ListType(true, new BasicType(true, DType.INT64));
      StructType eChildStruct = new StructType(true,
          new BasicType(true, DType.INT64));
      ListType eList = new ListType(true, eChildStruct);
      try (Table expected = new Table.TestBuilder()
          .column(aStruct,
              new StructData(null, "C1", null),
              new StructData("B2", "C2", null),
              null,
              null)
          .column(dList,
              null,
              null,
              Arrays.asList(1L,2L,3L),
              new ArrayList<Long>())
          .column((Long)null, null, null, null) // also_missing
          .column(eList,
              null,
              null,
              null,
              Arrays.asList(new StructData(1L), new StructData((Long)null), new StructData(4L)))
          .build();
          MultiBufferDataSource source = sourceFrom(NESTED_JSON_DATA_BUFFER);
          Table table = Table.readJSON(schema, opts, source)) {
        assertTablesAreEqual(expected, table);
      }
    }

This test fails for the same reason as the one above: column d is the wrong type.

cudf/java/src/test/java/ai/rapids/cudf/TableTest.java

Lines 716 to 741 in 0b32f55

    
    @Test
    void testReadJSONNestedTypesVerySmallChanges() {
      Schema.Builder root = Schema.builder();
      Schema.Builder e = root.addColumn(DType.LIST, "e");
      Schema.Builder eChild = e.addColumn(DType.STRUCT, "ignored");
      eChild.addColumn(DType.INT64, "g");
      eChild.addColumn(DType.INT64, "f");
      Schema schema = root.build();
      JSONOptions opts = JSONOptions.builder()
          .withLines(true)
          .build();
      StructType eChildStruct = new StructType(true,
          new BasicType(true, DType.INT64),
          new BasicType(true, DType.INT64));
      ListType eList = new ListType(true, eChildStruct);
      try (Table expected = new Table.TestBuilder()
          .column(eList,
              null,
              null,
              null,
              Arrays.asList(new StructData(1L, null), new StructData(null, 2L), new StructData(4L, 3L)))
          .build();
          Table table = Table.readJSON(schema, opts, NESTED_JSON_DATA_BUFFER)) {
        assertTablesAreEqual(expected, table);
      }
    }

This test fails because column e was requested as a LIST<STRUCT> but was returned as a LIST<INT8> column.

Steps/Code to reproduce bug
To reproduce this, take #16796 and enable column pruning for the tests listed above as failing. The third test is the scariest one: it appears to return totally invalid results, where the data column is empty despite there being offsets pointing into it.

If I need to create a C++ repro case, I am happy to do it.
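
To make the "offsets pointing into an empty data column" symptom concrete, here is a minimal sketch of the consistency check that such a column would fail. It models a list column as a plain offsets array plus a child length; `ListOffsetsCheck` and `validate` are hypothetical names for illustration and are not part of the cudf Java API.

```java
// Hypothetical helper: models a list column as plain arrays, not cudf objects.
final class ListOffsetsCheck {
    // Returns null when the offsets are consistent with the child length,
    // otherwise a short description of the first problem found.
    static String validate(int[] offsets, int childLength) {
        if (offsets.length == 0) {
            return "offsets must contain at least one entry";
        }
        if (offsets[0] < 0) {
            return "first offset is negative: " + offsets[0];
        }
        // Offsets must never decrease from one row to the next.
        for (int i = 1; i < offsets.length; i++) {
            if (offsets[i] < offsets[i - 1]) {
                return "offsets decrease at index " + i;
            }
        }
        // The final offset must not point past the end of the child data.
        int last = offsets[offsets.length - 1];
        if (last > childLength) {
            return "last offset " + last + " exceeds child length " + childLength;
        }
        return null;
    }
}
```

Assuming the usual rows-plus-one offsets layout, the four-row column e from the third test (three nulls followed by one list of three structs) would have offsets `{0, 0, 0, 0, 3}`. With a child of length 3 the check passes; with an empty child it reports that the last offset exceeds the child length.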

Expected behavior
I would expect the types in the schema to be honored, at least in the same way that they are for the non-pruning use case.
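
As an illustration of what "honored" means here, the following sketch compares a requested nested type against the type that came back and reports each position where they differ. `TypeNode` and `SchemaCheck` are hypothetical, simplified stand-ins written only for this example; they are not the cudf `Schema` or `DType` classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for a nested column type; not the cudf API.
final class TypeNode {
    final String kind;              // e.g. "LIST", "STRUCT", "INT64", "INT8"
    final List<TypeNode> children;

    TypeNode(String kind, TypeNode... children) {
        this.kind = kind;
        this.children = Arrays.asList(children);
    }
}

final class SchemaCheck {
    // Collects one message for every position where the actual type tree
    // differs from the requested one.
    static List<String> mismatches(String path, TypeNode requested, TypeNode actual) {
        List<String> out = new ArrayList<>();
        collect(path, requested, actual, out);
        return out;
    }

    private static void collect(String path, TypeNode requested, TypeNode actual, List<String> out) {
        if (!requested.kind.equals(actual.kind)) {
            out.add(path + ": requested " + requested.kind + " but got " + actual.kind);
            return; // children are not comparable once the kinds differ
        }
        if (requested.children.size() != actual.children.size()) {
            out.add(path + ": requested " + requested.children.size()
                + " children but got " + actual.children.size());
            return;
        }
        for (int i = 0; i < requested.children.size(); i++) {
            collect(path + "/" + i, requested.children.get(i), actual.children.get(i), out);
        }
    }
}
```

For column d in the first test, comparing a requested `LIST` of `INT64` against a returned `LIST` of `INT8` yields a single mismatch at the list's child, while comparing the requested type against itself yields none.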


This adds in the options to enable column_pruning when reading JSON using the Java APIs. This is still in draft because there are test failures if this is turned on for those tests (#16797). That said, the performance impact of enabling column pruning on some queries is huge. For one query in particular the current code takes 161.5 seconds, and with CUDF column pruning it takes just 16.5 seconds. That is roughly a 10x speedup for something that is fairly real world.

Authors:
- Robert (Bobby) Evans (https://github.com/revans2)

Approvers:
- Alessandro Bellina (https://github.com/abellina)
- Nghia Truong (https://github.com/ttnghia)

URL: #16796

revans2 · 2024-10-07T20:17:01Z

I think this is fixed if the experimental feature is enabled.

ttnghia · 2024-11-12T19:17:01Z

Recent changes in JNI always enable column pruning by default if a schema is given. Thus, this is already fixed.

revans2 added the bug (Something isn't working) and Spark (Functionality that helps Spark RAPIDS) labels Sep 11, 2024

revans2 mentioned this issue Sep 11, 2024

Add in option for Java JSON APIs to do column pruning in CUDF #16796

Merged


mattahrens mentioned this issue Sep 17, 2024

[FEA] enable prune_columns for from_json NVIDIA/spark-rapids#11458

Closed

karthikeyann mentioned this issue Oct 21, 2024

JSON spark reader plan for 24.12 #17138

Open

karthikeyann added this to the Nested JSON reader milestone Nov 12, 2024

github-project-automation bot added this to cuDF/Dask/Numba/UCX Nov 12, 2024

github-project-automation bot moved this to In Progress in cuDF/Dask/Numba/UCX Nov 12, 2024

ttnghia closed this as completed Nov 12, 2024

github-project-automation bot moved this from In Progress to Done in cuDF/Dask/Numba/UCX Nov 12, 2024
