
[SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning #31993

Closed
wants to merge 9 commits

Conversation

Member

@wangyum wangyum commented Mar 29, 2021

What changes were proposed in this pull request?

Nested column pruning currently removes unreferenced top-level `StructField`s from the data schema. For example:

```scala
spark.sql(
  """
    |CREATE TABLE t1 (
    |  _col0 INT,
    |  _col1 STRING,
    |  _col2 STRUCT<c1: STRING, c2: STRING, c3: STRING, c4: BIGINT>)
    |USING ORC
    |""".stripMargin)

spark.sql("INSERT INTO t1 values(1, '2', struct('a', 'b', 'c', 10L))")

spark.sql("SELECT _col0, _col2.c1 FROM t1").show
```

Before this PR, the returned schema is `_col0` INT,`_col2` STRUCT<`c1`: STRING> and it will throw an exception:

```
java.lang.AssertionError: assertion failed: The given data schema struct<_col0:int,_col2:struct<c1:string>> has less fields than the actual ORC physical schema, no idea which columns were dropped, fail to read.
	at scala.Predef$.assert(Predef.scala:223)
	at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:160)
```

After this PR, the returned schema is `_col0` INT,`_col1` STRING,`_col2` STRUCT<`c1`: STRING>.

The final schema is `_col0` INT,`_col2` STRUCT<`c1`: STRING> after the complete column pruning:

```scala
val readDataColumns =
  dataColumns
    .filter(requiredAttributes.contains)
    .filterNot(partitionColumns.contains)
val outputSchema = readDataColumns.toStructType
logInfo(s"Output Data Schema: ${outputSchema.simpleString(5)}")
```

```scala
val neededFieldNames = neededOutput.map(_.name).toSet
r.pruneColumns(StructType(prunedSchema.filter(f => neededFieldNames.contains(f.name))))
```
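The name-based final pruning quoted above can be sketched with a minimal Python model (illustrative only; the real logic is the quoted Scala, and the dict-based schema and the function name `final_prune` are hypothetical):

```python
# Minimal Python model (not Spark code) of the final name-based pruning:
# keep only the top-level fields whose names appear in the required output.
def final_prune(pruned_schema, needed_names):
    needed = set(needed_names)
    return {name: typ for name, typ in pruned_schema.items() if name in needed}

# Schema returned by nested pruning after this PR: every top-level field
# survives, and _col2 is narrowed to its requested inner field c1.
pruned_schema = {"_col0": "int", "_col1": "string", "_col2": {"c1": "string"}}

print(final_prune(pruned_schema, ["_col0", "_col2"]))
# -> {'_col0': 'int', '_col2': {'c1': 'string'}}
```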

Why are the changes needed?

Fix bug.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

@wangyum wangyum requested a review from cloud-fan March 29, 2021 09:21
@github-actions github-actions bot added the SQL label Mar 29, 2021
@cloud-fan
Contributor

isn't it a bug? cc @viirya

@SparkQA

SparkQA commented Mar 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41226/

@SparkQA

SparkQA commented Mar 29, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41226/

@SparkQA

SparkQA commented Mar 29, 2021

Test build #136644 has finished for PR 31993 at commit 2a3f136.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@viirya viirya left a comment

Hmm, this seems to be a special case where nested column pruning doesn't work for ORC. For this case, we need to send the entire unpruned data schema to ORC.

@viirya
Member

viirya commented Mar 30, 2021

As the nested column pruning rule runs far from the point where we get the physical information of the ORC files, and this should be a narrow case, it looks okay to me to inform users of a possible workaround here.

@wangyum
Member Author

wangyum commented Mar 30, 2021

It is a Hive ORC table in our production environment.

@cloud-fan
Contributor

Can we automatically disable nested column pruning on the executor side when we find that the ORC file schema is in the by-position style?

@wangyum
Member Author

wangyum commented Mar 31, 2021

Can we disable column pruning when it is a Hive ORC table?

```scala
private def canPruneRelation(fsRelation: HadoopFsRelation) =
  fsRelation.fileFormat.isInstanceOf[ParquetFileFormat] ||
    fsRelation.fileFormat.isInstanceOf[OrcFileFormat]
```

Update canPruneRelation to:

```scala
private def canPruneRelation(fsRelation: HadoopFsRelation) = {
  fsRelation.fileFormat match {
    case _: ParquetFileFormat => true
    case _: OrcFileFormat =>
      fsRelation.location match {
        case c: CatalogFileIndex =>
          !c.table.provider.contains(DDLUtils.HIVE_PROVIDER)
        case _ => true
      }
    case _ => false // keep the match exhaustive for other file formats
  }
}
```
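For illustration, the proposed gate can be modeled in a few lines of Python (a sketch under the assumption that Hive-provider ORC tables are the by-position case; the names here are hypothetical, and the real check compares against `DDLUtils.HIVE_PROVIDER`):

```python
# Hypothetical Python model of the proposed canPruneRelation gate:
# always allow nested pruning for Parquet, and for ORC only when the
# table does not come from the Hive provider (whose ORC files may use
# by-position field names like _col0, _col1, ...).
def can_prune_relation(file_format, table_provider=None):
    if file_format == "parquet":
        return True
    if file_format == "orc":
        return table_provider != "hive"
    return False  # other formats do not support nested column pruning here

print(can_prune_relation("parquet"))            # -> True
print(can_prune_relation("orc", "hive"))        # -> False
print(can_prune_relation("orc", "datasource"))  # -> True
```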

@cloud-fan
Contributor

Sorry, I may be missing something. Why is it only a problem in nested column pruning but not in column pruning?

@wangyum wangyum changed the title [SPARK-34897][SQL] Add workaround to error message when OrcUtils.requestedColumnIds fails [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning Apr 11, 2021
@wangyum
Member Author

wangyum commented Apr 11, 2021

Sorry, I may be missing something. Why is it only a problem in nested column pruning but not in column pruning?

Nested column pruning removes the field:

```scala
def pruneDataSchema(
    dataSchema: StructType,
    requestedRootFields: Seq[RootField]): StructType = {
  // Merge the requested root fields into a single schema. Note the ordering of the fields
  // in the resulting schema may differ from their ordering in the logical relation's
  // original schema
  val mergedSchema = requestedRootFields
    .map { case root: RootField => StructType(Array(root.field)) }
    .reduceLeft(_ merge _)
  val dataSchemaFieldNames = dataSchema.fieldNames.toSet
  val mergedDataSchema =
    StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
  // Sort the fields of mergedDataSchema according to their order in dataSchema,
  // recursively. This makes mergedDataSchema a pruned schema of dataSchema
  sortLeftFieldsByRight(mergedDataSchema, dataSchema).asInstanceOf[StructType]
}
```
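The merge step above can be illustrated with a small Python model (a sketch, not Spark code; schemas are modeled as plain dicts and `merge` only loosely mimics `StructType.merge`):

```python
# Loose Python model of merging the requested root fields: each requested
# field becomes a one-field schema, and schemas are merged left-to-right,
# unioning the inner fields of structs. Not the real StructType.merge.
from functools import reduce

def merge(left, right):
    out = dict(left)
    for name, typ in right.items():
        if name in out and isinstance(out[name], dict) and isinstance(typ, dict):
            out[name] = merge(out[name], typ)  # union nested struct fields
        else:
            out[name] = typ
    return out

# Requested root fields for something like SELECT _col2.c1, _col0, _col2.c2:
requested_root_fields = [
    {"_col2": {"c1": "string"}},
    {"_col0": "int"},
    {"_col2": {"c2": "string"}},
]
merged_schema = reduce(merge, requested_root_fields)
print(merged_schema)
# -> {'_col2': {'c1': 'string', 'c2': 'string'}, '_col0': 'int'}
# Note the field order differs from the original schema, which is why the
# Scala code sorts the result with sortLeftFieldsByRight afterwards.
```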

@SparkQA

SparkQA commented Apr 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41754/

@SparkQA

SparkQA commented Apr 11, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41754/

@SparkQA

SparkQA commented Apr 11, 2021

Test build #137176 has finished for PR 31993 at commit e64eb75.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
@@ -89,14 +93,12 @@ object PushDownUtils extends PredicateHelper {
     } else {
       new StructType()
     }
-    r.pruneColumns(prunedSchema)
+    val neededFieldNames = neededOutput.map(_.name).toSet
+    r.pruneColumns(StructType(prunedSchema.filter(f => neededFieldNames.contains(f.name))))
```
Member Author

Move the filter logic from SchemaPruning to PushDownUtils to support data source V2 column pruning.

@wangyum wangyum requested a review from viirya April 13, 2021 10:38
```diff
     val mergedDataSchema =
-      StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
+      StructType(dataSchema.map(s => mergedSchema.find(_.name.equals(s.name)).getOrElse(s)))
```
Contributor

what's the actual difference? can you give a simple example?

Contributor

It seems we don't prune anything from the root fields now.

Contributor

If this is the case, please update the documentation of this method.

Member Author

@wangyum wangyum Apr 14, 2021

```scala
spark.sql(
  """
    |CREATE TABLE t1 (
    |  _col0 INT,
    |  _col1 STRING,
    |  _col2 STRUCT<c1: STRING, c2: STRING, c3: STRING, c4: BIGINT>)
    |USING ORC
    |""".stripMargin)

spark.sql("SELECT _col0, _col2.c1 FROM t1").show
```

The original schema is:

`_col0` INT,`_col1` STRING,`_col2` STRUCT<`c1`: STRING, `c2`: STRING, `c3`: STRING, `c4`: BIGINT> 

Before this PR, the pruneDataSchema returns:

`_col0` INT,`_col2` STRUCT<`c1`: STRING>

After this PR, the pruneDataSchema returns:

`_col0` INT,`_col1` STRING,`_col2` STRUCT<`c1`: STRING>

It only prunes nested fields.
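The difference can be sketched with a minimal Python model (illustrative only; schemas are plain dicts, and both function names are hypothetical):

```python
# Minimal Python model (not Spark code) of pruneDataSchema's behavior
# before and after this PR. A schema is a dict: field name -> type,
# where a struct type is itself a dict of inner fields.

def prune_before(data_schema, requested):
    """Old behavior: keep only the requested top-level fields."""
    return {name: typ for name, typ in requested.items()
            if name in data_schema}

def prune_after(data_schema, requested):
    """New behavior: keep every top-level field, but replace the
    requested ones with their pruned (nested-only) versions."""
    return {name: requested.get(name, typ)
            for name, typ in data_schema.items()}

data_schema = {
    "_col0": "int",
    "_col1": "string",
    "_col2": {"c1": "string", "c2": "string", "c3": "string", "c4": "bigint"},
}
# SELECT _col0, _col2.c1 requests _col0 and only c1 inside _col2.
requested = {"_col0": "int", "_col2": {"c1": "string"}}

print(prune_before(data_schema, requested))
# -> {'_col0': 'int', '_col2': {'c1': 'string'}}   (drops _col1)
print(prune_after(data_schema, requested))
# -> {'_col0': 'int', '_col1': 'string', '_col2': {'c1': 'string'}}
```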

Contributor

What's wrong with the previous behavior? We can't sacrifice performance for all cases only because the ORC by-ordinal case is problematic.

Member Author

is it because column pruning will be done by other rules so we don't need to consider it here?

Yes.

```scala
val readDataColumns =
  dataColumns
    .filter(requiredAttributes.contains)
    .filterNot(partitionColumns.contains)
val outputSchema = readDataColumns.toStructType
logInfo(s"Output Data Schema: ${outputSchema.simpleString(5)}")
```

```scala
val neededFieldNames = neededOutput.map(_.name).toSet
r.pruneColumns(StructType(prunedSchema.filter(f => neededFieldNames.contains(f.name))))
```

Contributor

Can you provide the full code workflow to explain why this causes issues in ORC? I'm still not very sure.

Member Author

1. Prune the nested schema:

```scala
def pruneDataSchema(
    dataSchema: StructType,
    requestedRootFields: Seq[RootField]): StructType = {
  // Merge the requested root fields into a single schema. Note the ordering of the fields
  // in the resulting schema may differ from their ordering in the logical relation's
  // original schema
  val mergedSchema = requestedRootFields
    .map { case root: RootField => StructType(Array(root.field)) }
    .reduceLeft(_ merge _)
  val dataSchemaFieldNames = dataSchema.fieldNames.toSet
  val mergedDataSchema =
    StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
  // Sort the fields of mergedDataSchema according to their order in dataSchema,
  // recursively. This makes mergedDataSchema a pruned schema of dataSchema
  sortLeftFieldsByRight(mergedDataSchema, dataSchema).asInstanceOf[StructType]
}
```

2. Use this pruned nested schema to build the `dataSchema` in the relation:

```scala
if (countLeaves(dataSchema) > countLeaves(prunedDataSchema)) {
  val prunedRelation = leafNodeBuilder(prunedDataSchema)
  val projectionOverSchema = ProjectionOverSchema(prunedDataSchema)
  Some(buildNewProjection(normalizedProjects, normalizedFilters, prunedRelation,
    projectionOverSchema))
```

3. `readDataColumns` does the complete column pruning:

```scala
val readDataColumns =
  dataColumns
    .filter(requiredAttributes.contains)
    .filterNot(partitionColumns.contains)
val outputSchema = readDataColumns.toStructType
logInfo(s"Output Data Schema: ${outputSchema.simpleString(5)}")
val outputAttributes = readDataColumns ++ partitionColumns
val scan =
  FileSourceScanExec(
    fsRelation,
    outputAttributes,
    outputSchema,
    partitionKeyFilters.toSeq,
    bucketSet,
    None,
    dataFilters,
    table.map(_.identifier))
```

4. `dataSchema` comes from `relation.dataSchema`; it is the pruned nested schema:

```scala
lazy val inputRDD: RDD[InternalRow] = {
  val readFile: (PartitionedFile) => Iterator[InternalRow] =
    relation.fileFormat.buildReaderWithPartitionValues(
      sparkSession = relation.sparkSession,
      dataSchema = relation.dataSchema,
      partitionSchema = relation.partitionSchema,
      requiredSchema = requiredSchema,
      filters = pushedDownFilters,
      options = relation.options,
      hadoopConf = relation.sparkSession.sessionState.newHadoopConfWithOptions(relation.options))
```

5. `OrcUtils.requestedColumnIds` uses this pruned nested schema:

```scala
val resultedColPruneInfo =
  Utils.tryWithResource(OrcFile.createReader(filePath, readerOptions)) { reader =>
    OrcUtils.requestedColumnIds(
      isCaseSensitive, dataSchema, requiredSchema, reader, conf)
  }
```

Member

It is because requestedColumnIds checks whether the given data schema has fewer fields than the physical schema in the ORC file.

Under nested column pruning, Spark lets the data source use the pruned schema as the data schema to read files; e.g. Spark prunes _col1 in the above example. But the ORC file has three top-level fields _col0, _col1, and _col2, so the check in requestedColumnIds fails in this case.
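A rough Python model of the failing check (a sketch of the assertion's logic only, not the actual Spark/ORC code; the function name and list-based schemas are hypothetical):

```python
# Rough Python model of the assertion in OrcUtils.requestedColumnIds.
# For by-position ORC schemas Spark matches data-schema fields to physical
# fields by index, so the data schema must cover every top-level physical
# field or there is no way to tell which columns were dropped.

def requested_column_ids(data_fields, physical_fields):
    if len(data_fields) < len(physical_fields):
        raise AssertionError(
            "The given data schema has less fields than the actual ORC "
            "physical schema, no idea which columns were dropped, fail to read.")
    # Match by position: the i-th data field reads the i-th physical column.
    return list(range(len(data_fields)))

physical = ["_col0", "_col1", "_col2"]

# Before the PR: nested pruning dropped _col1 from the data schema.
try:
    requested_column_ids(["_col0", "_col2"], physical)
except AssertionError as e:
    print("fails:", e)

# After the PR: all top-level fields survive, so the check passes.
print(requested_column_ids(["_col0", "_col1", "_col2"], physical))  # -> [0, 1, 2]
```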

Member

is it because column pruning will be done by other rules so we don't need to consider it here?

Yes.

Hmm? In PushDownUtils.pruneColumns, if nested column pruning is enabled, Spark only runs the nested column pruning path, not the quoted L96-97.

@SparkQA

SparkQA commented Apr 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42024/

@SparkQA

SparkQA commented Apr 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42024/

@SparkQA

SparkQA commented Apr 16, 2021

Test build #137449 has finished for PR 31993 at commit a966bac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42235/

@SparkQA

SparkQA commented Apr 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42235/

@SparkQA

SparkQA commented Apr 21, 2021

Test build #137707 has finished for PR 31993 at commit 6112c9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
* and given requested field are "a", the field "b" is pruned in the returned schema.
* Note that schema field ordering at original schema is still preserved in pruned schema.
* Prunes the nested schema by the requested fields. For example, if the schema is:
* `id int, struct<a:int, b:int>`, and given requested field are "a", the field "b" is pruned
```
Contributor

top-level columns need to have a name: `id int, s struct<a:int, b:int>`

Contributor

and given requested field are "a" -> and given requested field "s.a"

Contributor

the field "b" is pruned -> the inner field "b" ...

```
* Note that schema field ordering at original schema is still preserved in pruned schema.
* Prunes the nested schema by the requested fields. For example, if the schema is:
* `id int, struct<a:int, b:int>`, and given requested field are "a", the field "b" is pruned
* in the returned schema: `id int, struct<a:int>`.
```
Contributor

ditto, `id int, s struct<a:int>`

@cloud-fan
Contributor

@wangyum there are conflicts

```
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala
```
Comment on lines +89 to +130
```scala
val upperCaseSchema = StructType.fromDDL("A struct<A:int, B:int>, B int")
val lowerCaseSchema = StructType.fromDDL("a struct<a:int, b:int>, b int")
val upperCaseRequestedFields = Seq(StructField("A", StructType.fromDDL("A int")))
val lowerCaseRequestedFields = Seq(StructField("a", StructType.fromDDL("a int")))

Seq(true, false).foreach { isCaseSensitive =>
  withSQLConf(CASE_SENSITIVE.key -> isCaseSensitive.toString) {
    if (isCaseSensitive) {
      // Schema is case-sensitive
      val requestedFields = getRootFields(StructField("id", IntegerType))
      val prunedSchema = SchemaPruning.pruneDataSchema(
        StructType.fromDDL("ID int, name String"), requestedFields)
      assert(prunedSchema == StructType(Seq.empty))
      // Root fields are case-sensitive
      val rootFieldsSchema = SchemaPruning.pruneDataSchema(
        StructType.fromDDL("id int, name String"),
        getRootFields(StructField("ID", IntegerType)))
      assert(rootFieldsSchema == StructType(StructType(Seq.empty)))
      testPrunedSchema(
        upperCaseSchema,
        upperCaseRequestedFields,
        StructType.fromDDL("A struct<A:int>, B int"))
      testPrunedSchema(
        upperCaseSchema,
        lowerCaseRequestedFields,
        upperCaseSchema)

      testPrunedSchema(
        lowerCaseSchema,
        upperCaseRequestedFields,
        lowerCaseSchema)
      testPrunedSchema(
        lowerCaseSchema,
        lowerCaseRequestedFields,
        StructType.fromDDL("a struct<a:int>, b int"))
    } else {
      // Schema is case-insensitive
      val prunedSchema = SchemaPruning.pruneDataSchema(
        StructType.fromDDL("ID int, name String"),
        getRootFields(StructField("id", IntegerType)))
      assert(prunedSchema == StructType(StructField("ID", IntegerType) :: Nil))
      // Root fields are case-insensitive
      val rootFieldsSchema = SchemaPruning.pruneDataSchema(
        StructType.fromDDL("id int, name String"),
        getRootFields(StructField("ID", IntegerType)))
      assert(rootFieldsSchema == StructType(StructField("id", IntegerType) :: Nil))
      Seq(upperCaseRequestedFields, lowerCaseRequestedFields).foreach { requestedFields =>
        testPrunedSchema(
          upperCaseSchema,
          requestedFields,
          StructType.fromDDL("A struct<A:int>, B int"))
      }

      Seq(upperCaseRequestedFields, lowerCaseRequestedFields).foreach { requestedFields =>
        testPrunedSchema(
          lowerCaseSchema,
          requestedFields,
          StructType.fromDDL("a struct<a:int>, b int"))
      }
    }
  }
})
}
```
Member Author

Contributor

Tests LGTM, thanks for adding more scenarios.

@wangyum
Member Author

wangyum commented Apr 21, 2021

@wangyum there are conflicts

Fixed.

@SparkQA

SparkQA commented Apr 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42257/

@SparkQA

SparkQA commented Apr 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42257/

@cloud-fan
Contributor

how far shall we backport? to 3.0?

@wangyum
Member Author

wangyum commented Apr 21, 2021

Yes, to 3.0.

@SparkQA

SparkQA commented Apr 21, 2021

Test build #137730 has finished for PR 31993 at commit 4d0b510.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@viirya viirya left a comment

lgtm

@viirya
Member

viirya commented Apr 21, 2021

Thanks! Merging to master.

@viirya viirya closed this in e609395 Apr 21, 2021
@viirya
Member

viirya commented Apr 21, 2021

@wangyum There are conflicts in 3.1/3.0. Can you create backport PRs? Thanks.

@wangyum wangyum deleted the SPARK-34897 branch April 22, 2021 03:16
wangyum added a commit that referenced this pull request Apr 23, 2021
…r nested column pruning

This PR backports #31993 to branch-3.1; the original PR description is the same as above.

Closes #32279 from wangyum/SPARK-34897-3.1.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
wangyum added a commit that referenced this pull request Apr 24, 2021
…r nested column pruning

This PR backports #31993 to branch-3.0; the original PR description is the same as above.

Closes #32310 from wangyum/SPARK-34897-3.0.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
…r nested column pruning

This PR backports apache#31993 to branch-3.1; the original PR description is the same as above.

Closes apache#32279 from wangyum/SPARK-34897-3.1.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
…r nested column pruning

This PR backports apache#31993 to branch-3.1; the original PR description is the same as above.

Closes apache#32279 from wangyum/SPARK-34897-3.1.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>