
[Kernel] Add widening type conversions to Kernel default parquet reader #3541

Conversation

johanl-db
Collaborator

Which Delta project/connector is this regarding?

  - [ ] Spark
  - [ ] Standalone
  - [ ] Flink
  - [x] Kernel
  - [ ] Other (fill in here)

Description

Add a set of conversions to the default Parquet reader provided by Kernel to allow reading columns using a wider type than the type actually stored in the Parquet file.
This is needed to support the type widening table feature; see https://github.com/delta-io/delta/blob/master/protocol_rfcs/type-widening.md.

Conversions added:

  • INT32 -> long
  • FLOAT -> double
  • decimal precision/scale increase
  • DATE -> timestamp_ntz
  • INT32 -> double
  • integers -> decimal
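For illustration, a minimal sketch of what a widened read looks like through the default reader, reusing the Kernel types and the `readParquetFilesUsingKernel` test helper that appears in the review discussion below; the file path and column name are placeholders:

```scala
import io.delta.kernel.types.{LongType, StructType}

// The Parquet file stores the column as INT32; the read schema requests LONG,
// so the default reader widens each value to a 64-bit integer.
// Path and column name are placeholders; `readParquetFilesUsingKernel` is the
// helper used in the test suites.
val readSchema = new StructType().add("id", LongType.LONG)
val result = readParquetFilesUsingKernel("/path/to/parquet-files", readSchema)
```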

How was this patch tested?

Added tests covering all conversions in ParquetColumnReaderSuite

Does this PR introduce any user-facing changes?

This change alone doesn't allow reading Delta tables that use the type widening table feature; that feature is still unsupported.
It does allow reading Delta tables that somehow have Parquet files containing types different from the table schema, but that should never happen for tables that don't support type widening.

@johanl-db johanl-db changed the title [Kernel] Add widening type conversions to Kernel default parquet reader [WIP][Kernel] Add widening type conversions to Kernel default parquet reader Aug 13, 2024
@johanl-db johanl-db self-assigned this Aug 13, 2024
@johanl-db johanl-db force-pushed the type-widening-kernel-default-parquet-handler branch 2 times, most recently from 4627545 to 58f8c86 on August 15, 2024 06:54
@johanl-db johanl-db changed the title [WIP][Kernel] Add widening type conversions to Kernel default parquet reader [Kernel] Add widening type conversions to Kernel default parquet reader Aug 15, 2024
@johanl-db johanl-db force-pushed the type-widening-kernel-default-parquet-handler branch from 58f8c86 to ea6dcb2 on August 15, 2024 07:02
Collaborator

@vkorukanti left a comment

Looks great!

/**
* Suite covering reading Parquet columns with different types.
*/
class ParquetColumnReaderSuite extends AnyFunSuite with ParquetSuiteBase {
Collaborator

rename to ParquetTypeWideningSuite or merge these tests into ParquetFileReaderSuite.scala?

Collaborator Author

I moved the tests to ParquetFileReaderSuite

private val wideningTestCases: Seq[TestCase] = Seq(
TestCase("ByteType", ShortType.SHORT, i => if (i % 72 != 0) i.toByte.toShort else null),
TestCase("ByteType", IntegerType.INTEGER, i => if (i % 72 != 0) i.toByte.toInt else null),
TestCase("ByteType", LongType.LONG, i => if (i % 72 != 0) i.toByte.toLong else null),
Collaborator

is byte to float or double not allowed?

Collaborator Author

Byte to double is allowed, byte to float isn't. I added more test cases
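For reference, an added case in the same style as the `TestCase` entries above might look like the following (a sketch, not necessarily the exact line added in the PR; the null interval mirrors the byte cases above):

```scala
// Byte widens to short/int/long and to double, but byte -> float is not a
// supported widening, so no float case is added for ByteType.
TestCase("ByteType", DoubleType.DOUBLE, i => if (i % 72 != 0) i.toByte.toDouble else null),
```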

test(s"parquet widening conversion - ${testCase.columnName} -> ${testCase.toType.toString}") {
val inputLocation = goldenTablePath("parquet-all-types")
val readSchema = new StructType().add(testCase.columnName, testCase.toType)
val result = readParquetFilesUsingKernel(inputLocation, readSchema)
Collaborator

Can we also read using Spark and verify? Or is that not possible because we depend on Spark 3.5.x and not Spark 4.0.0?

Collaborator Author

parquet-mr supports most of these conversions in 3.5 (the vectorized reader supports almost none, though).

I'm adding a check against results produced by Spark + parquet-mr
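A sketch of what such a cross-check could look like, reusing `result` and `inputLocation` from the test above and the `TestRow`/`checkAnswer` helpers shown in the surrounding code; the column name and Spark schema here are placeholders rather than the exact code added:

```scala
import org.apache.spark.sql.types.{LongType => SparkLongType, StructType => SparkStructType}

// Force Spark onto the row-based parquet-mr reader (the vectorized reader
// supports almost none of these conversions in 3.5).
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

// Read the same files with an equivalent widened Spark schema and compare
// against the Kernel result. Column name "id" is a placeholder.
val sparkSchema = new SparkStructType().add("id", SparkLongType)
val expected = spark.read.schema(sparkSchema).parquet(inputLocation)
  .collect().toSeq.map(row => TestRow(row.get(0)))
checkAnswer(result, expected)
```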

checkAnswer(result, Seq(TestRow(1577836800000000L)))
}
}
}
Collaborator

Also, can we add negative tests where the type change is not valid, e.g. a long read as int? I am not sure if we throw any errors at the moment.

Collaborator Author

I added a relatively exhaustive list of tests to ensure we fail for unsupported conversions. Found a few where we don't properly fail and documented them in a separate list of test cases.

Not fixing them here though; that goes beyond the scope of this PR.
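A sketch of what one of those negative tests might look like, assuming the column naming follows the widening cases above and that the unsupported conversion surfaces as an exception (the exact exception type is not asserted here):

```scala
test("parquet unsupported conversion - LongType -> integer") {
  val inputLocation = goldenTablePath("parquet-all-types")
  // Reading a 64-bit column with a narrower 32-bit read schema is not a valid
  // widening and should be rejected by the default reader.
  val readSchema = new StructType().add("LongType", IntegerType.INTEGER)
  assertThrows[Throwable] {
    readParquetFilesUsingKernel(inputLocation, readSchema)
  }
}
```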

@johanl-db johanl-db force-pushed the type-widening-kernel-default-parquet-handler branch from 4fad52e to d6fa06c on August 21, 2024 17:32
Collaborator

@vkorukanti left a comment

lgtm

@johanl-db
Collaborator Author

@vkorukanti I addressed your remaining comments, you can merge the PR once the tests finish running

@vkorukanti vkorukanti merged commit dcf9ea9 into delta-io:master Aug 26, 2024
16 checks passed
longvu-db pushed a commit to longvu-db/delta that referenced this pull request Aug 28, 2024
[Kernel] Add widening type conversions to Kernel default parquet reader (delta-io#3541)

vkorukanti pushed a commit that referenced this pull request Sep 26, 2024
## Description
Allow reading and writing to tables that have the type widening table
feature enabled (both the preview and the stable table feature).

Reading:
- The default kernel parquet reader supports widening conversions since
#3541. Engines may also choose to
implement type widening natively in their parquet reader if they wish.
Writing:
- Nothing to do, type widening doesn't impact the write path - writing
data always uses the latest data schema.

## How was this patch tested?
Added read integration tests.

Tests are based on golden tables. Generating the tables requires Spark
4.0; since Spark master cross-compilation is broken, the table
generation code is not included here.
The following steps were used to generate the tables (sketched below):
1. Create a table with initial data types and insert initial data
2. Enable type widening and schema evolution
3. Insert data with wider type for each column. Column types are
automatically widened during schema evolution.
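For context, a minimal sketch of those three steps on Spark 4.0 with Delta; the table name, column, and values are placeholders, and schema evolution is enabled per-write via `mergeSchema`:

```scala
// 1. Create a table with the initial (narrow) type and insert initial data.
spark.sql("CREATE TABLE type_widening (int_long INT) USING delta")
spark.sql("INSERT INTO type_widening VALUES (1), (2)")

// 2. Enable the type widening table feature on the table.
spark.sql(
  "ALTER TABLE type_widening SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true')")

// 3. Append data with a wider type; with schema evolution (mergeSchema) the
//    column type is widened automatically from INT to LONG.
spark.range(3, 5).selectExpr("id AS int_long") // `id` is already LONG
  .write.format("delta").mode("append")
  .option("mergeSchema", "true")
  .saveAsTable("type_widening")
```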

`type-widening` table:
| Column | Initial type | Widened Type |
| - | - | - |
| byte_long | byte | long |
| int_long | int | long |
| float_double | float | double |
| byte_double | byte | double |
| short_double | short | double |
| int_double | int | double |
| decimal_decimal_same_scale | decimal(10, 2) | decimal(20, 2) |
| decimal_decimal_greater_scale | decimal(10, 2) | decimal(20, 5) |
| byte_decimal | byte | decimal(11, 1) |
| short_decimal | short | decimal(11, 1) |
| int_decimal | int | decimal(11, 1) |
| long_decimal | long | decimal(21, 1) |
| date_timestamp_ntz | date | timestamp_ntz |

`type-widening-nested` table:
| Column | Initial type | Widened Type |
| - | - | - |
| struct | struct<a: int> | struct<a: long> |
| map | map<int, int> | map<long, long> |
| array | array<int> | array<long> |
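A sketch of how the widened read schema for the nested table might be expressed with Kernel types when reading through the default Parquet handler (the constructor shapes and nullability flags here are assumptions about the Kernel types API, not code from this PR):

```scala
import io.delta.kernel.types.{ArrayType, LongType, MapType, StructType}

// Read schema requesting the widened nested types:
// struct<a: long>, map<long, long>, array<long>.
val readSchema = new StructType()
  .add("struct", new StructType().add("a", LongType.LONG))
  .add("map", new MapType(LongType.LONG, LongType.LONG, true))
  .add("array", new ArrayType(LongType.LONG, true))
```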


## Does this PR introduce _any_ user-facing changes?
Yes, it's now possible to read from and write to Delta tables with type
widening enabled using Kernel.