fixed binary serde issue #252

Open · wants to merge 1 commit into master
Conversation

@comphead commented:

Fixed a serde issue when reading/writing a DataFrame in binary mode. Consider the following example:

```scala
import org.apache.spark.sql.{Dataset, SaveMode}

case class Outer(
    arr0: Array[Inner],
    str0: String,
    str1: String,
    arr1: Array[String],
    str2: String)

case class Inner(str0: String, id0: Int)

def testDF[T](df: Dataset[T]): Unit = {
  df.printSchema()
  val schema = df.schema
  df.write
    .mode(SaveMode.Overwrite)
    .format("org.apache.spark.sql.redis")
    .option("table", "t")
    .option("model", "binary")
    .save()

  val df0 = session.read.format("org.apache.spark.sql.redis")
    .schema(schema)
    .option("table", "t")
    .option("model", "binary")
    .load()

  df0.printSchema()
  df0.show(false)
}

// session is a SparkSession; toDS() requires its implicits in scope
import session.implicits._

testDF(Seq(
  Outer(
    arr0 = Array(Inner("str0", 0)),
    str0 = "str0",
    str1 = "str1",
    arr1 = Array("arr1"),
    str2 = "str2"
  )
).toDS())
```

That fails with:

```
Caused by: java.lang.IllegalArgumentException: The value (1) of the type (java.lang.String) cannot be converted to an array of struct<str0:string,id0:int>
```

The reason for this is:

  • In Redis, the object is already stored with the attribute order arr0, str0, str1, arr1, str2.
  • buildScan, however, receives the requiredColumns in a different order: str0, arr1, str1, arr0, str2.
  • The binary decoder didn't apply the attribute positions; it only set the updated schema, which is not enough.
    The proposed fix restores the correct attribute order for the binary-deserialized value (see the sketch below).

Also please note that without a provided schema it is difficult to deserialize the binary value, since we don't have the initial attribute order. A warning has been added for that case.
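
For illustration, a minimal sketch of the reordering idea, assuming the values were serialized in the persisted attribute order (the helper name `reorderValues` is hypothetical, not the PR's actual code):

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical sketch: permute values from the persisted attribute order
// into the requiredColumns order that Catalyst asked for.
def reorderValues(persistedSchema: StructType,
                  requiredColumns: Seq[String],
                  valuesArray: Array[Any]): Array[Any] =
  requiredColumns.map(col => valuesArray(persistedSchema.fieldIndex(col))).toArray
```

With the example above, a value serialized as (arr0, str0, str1, arr1, str2) gets permuted so that position 0 holds arr0 again rather than a string, which is exactly what the error message complains about.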

@fe2s (Collaborator) left a comment:
Hi @comphead,
Thanks a lot for this PR, and sorry for the delay reviewing it.
The fix makes sense. Could you please take a look at my comment? It would also be good to add a test that currently fails.

```diff
@@ -34,6 +35,10 @@ class BinaryRedisPersistence extends RedisPersistence[Array[Byte]] {
   override def decodeRow(keyMap: (String, String), value: Array[Byte], schema: StructType,
                          requiredColumns: Seq[String]): Row = {
     val valuesArray: Array[Any] = SerializationUtils.deserialize(value)
-    new GenericRowWithSchema(valuesArray, schema)
+    // Align column positions with what Catalyst expects
+    val alignedSchema = SparkUtils.alignSchemaWithCatalyst(schema, requiredColumns)
```
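
For context, a plausible shape for that helper (a hedged sketch; the actual `SparkUtils.alignSchemaWithCatalyst` added in this PR may differ):

```scala
import org.apache.spark.sql.types.StructType

// Sketch: rebuild the persisted schema's fields in the order Catalyst
// requested, so field positions line up with requiredColumns.
def alignSchemaWithCatalyst(schema: StructType, requiredColumns: Seq[String]): StructType =
  StructType(requiredColumns.map(name => schema(name)))
```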
@fe2s (Collaborator) commented on this diff:

We can optimize this by creating `alignedSchema` once for all rows. To achieve that, I guess we will need to change the `decodeRow()` signature by adding a new parameter, so that it takes two schemas: `requiredSchema` and `persistedSchema`.
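
A rough sketch of what that suggested signature could look like (the parameter names `requiredSchema` and `persistedSchema` come from the comment above; everything else, including the commons-lang3 `SerializationUtils` import, is an assumption rather than code from this PR):

```scala
import org.apache.commons.lang3.SerializationUtils
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.StructType

// Sketch: the caller computes the aligned schema once per scan and passes
// both schemas in, so the alignment is not recomputed for every row.
def decodeRow(keyMap: (String, String), value: Array[Byte],
              persistedSchema: StructType, requiredSchema: StructType): Row = {
  val valuesArray: Array[Any] = SerializationUtils.deserialize(value)
  // Permute persisted values into the required column order.
  val alignedValues: Array[Any] =
    requiredSchema.fieldNames.map(col => valuesArray(persistedSchema.fieldIndex(col)))
  new GenericRowWithSchema(alignedValues, requiredSchema)
}
```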
