
Adding support for upserts of nested arrays #1838

Merged
masseyke merged 7 commits into elastic:master from fix/nested-fields-upsert on Jan 20, 2022

Conversation

masseyke
Member

This commit adds support for upserts of nested array fields.
Closes #1190
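For context, a minimal sketch of the kind of write this enables through the Spark SQL integration. The index name, field name, script, and parameter name are taken from the test discussed further down in this review; the schema, the struct field name, and the exact option values are illustrative assumptions rather than the merged test verbatim.

import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types._
import org.elasticsearch.spark.sql._

// Assumes an existing SparkContext `sc`.
val sqlContext = new SQLContext(sc)
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("samples", ArrayType(StructType(Seq(StructField("text", StringType)))))
))
val data = Seq(Row("1", List(Row("hello"), Row("world"))))
val df = sqlContext.createDataFrame(sc.parallelize(data), schema)

// Upsert the nested array via a script, passing the array as a script parameter.
df.saveToEs("nested_fields_upsert_test", Map(
  "es.mapping.id"           -> "id",
  "es.write.operation"      -> "upsert",
  "es.update.script.inline" -> "ctx._source.samples = params.new_samples",
  "es.update.script.params" -> "new_samples:samples"
))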

@masseyke
Member Author

I'm marking this as a draft because it fixes the specific case described in #1190, but I'm not very familiar with this part of Elasticsearch and I'm not sure it's general enough.

@jbaiera (Member) left a comment

LGTM, small question about testing.

}
Result.SUCCESFUL()
}
Result.FAILED()
Member

Is an empty array a failure scenario here for sure?

Member Author

I can't remember for sure at this point, but I think my reasoning was that in this case we had been asked to write something but hadn't actually written an array, so that counts as a failure. Maybe that doesn't make sense, though. I'll take a look at what's done with that result. Also, I can't remember why I'm not just writing an empty array here.

Member Author

It actually turns out that we ignore this result. See https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/serialization/bulk/AbstractBulkFactory.java#L153. We probably ought to be throwing an exception there. But an empty array actually is a failure (I think). I'm about to write more in the comment below.

"es.update.script.inline" -> update_script
)
val sqlContext = new SQLContext(sc)
var data = Seq(Row("1", List(Row("hello"), Row("world"))))
Member

Can we test this with an empty array in the samples field?

Member Author

I just tried this. If I use the code as-is, it doesn't write anything. So something gets tripped up when Elasticsearch tries to apply the script:

13:49:26.766 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] ERROR org.apache.spark.TaskContextImpl - Error in TaskCompletionListener
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: org.elasticsearch.hadoop.rest.EsHadoopRemoteException: x_content_parse_exception: [1:74] [script] failed to parse field [params]
{"update":{"_id":"1"}}
{"script":{"source":"ctx._source.samples = params.new_samples","params":{"new_samples":}},"upsert":{"samples":[]}}

	at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:487) ~[elasticsearch-hadoop-mr-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:444) ~[elasticsearch-hadoop-mr-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:438) ~[elasticsearch-hadoop-mr-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:418) ~[elasticsearch-hadoop-mr-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:236) ~[elasticsearch-hadoop-mr-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at org.elasticsearch.hadoop.rest.bulk.BulkProcessor.tryFlush(BulkProcessor.java:215) ~[elasticsearch-hadoop-mr-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at org.elasticsearch.hadoop.rest.bulk.BulkProcessor.flush(BulkProcessor.java:518) ~[elasticsearch-hadoop-mr-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at org.elasticsearch.hadoop.rest.bulk.BulkProcessor.close(BulkProcessor.java:560) ~[elasticsearch-hadoop-mr-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:219) ~[elasticsearch-hadoop-mr-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at org.elasticsearch.hadoop.rest.RestService$PartitionWriter.close(RestService.java:122) ~[elasticsearch-hadoop-mr-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at org.elasticsearch.spark.rdd.EsRDDWriter$$anon$1.onTaskCompletion(EsRDDWriter.scala:74) ~[elasticsearch-spark_2.12-8.1.0-SNAPSHOT-spark30scala212.jar:8.1.0-SNAPSHOT]
	at org.apache.spark.TaskContextImpl.$anonfun$markTaskCompleted$1(TaskContextImpl.scala:124) ~[spark-core_2.12-3.2.0.jar:3.2.0]
	at org.apache.spark.TaskContextImpl.$anonfun$markTaskCompleted$1$adapted(TaskContextImpl.scala:124) ~[spark-core_2.12-3.2.0.jar:3.2.0]
	at org.apache.spark.TaskContextImpl.$anonfun$invokeListeners$1(TaskContextImpl.scala:137) ~[spark-core_2.12-3.2.0.jar:3.2.0]
	at org.apache.spark.TaskContextImpl.$anonfun$invokeListeners$1$adapted(TaskContextImpl.scala:135) ~[spark-core_2.12-3.2.0.jar:3.2.0]
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) ~[scala-library-2.12.15.jar:?]
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) ~[scala-library-2.12.15.jar:?]
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) ~[scala-library-2.12.15.jar:?]
	at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:135) ~[spark-core_2.12-3.2.0.jar:3.2.0]
	at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:124) ~[spark-core_2.12-3.2.0.jar:3.2.0]
	at org.apache.spark.scheduler.Task.run(Task.scala:147) ~[spark-core_2.12-3.2.0.jar:3.2.0]
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) ~[spark-core_2.12-3.2.0.jar:3.2.0]
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462) [spark-core_2.12-3.2.0.jar:3.2.0]
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) [spark-core_2.12-3.2.0.jar:3.2.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_292]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_292]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]

If instead I write an empty array in DataFrameValueWriter, then when I read the data back out the returned DataFrame is kind of useless: it has no columns. If you try to do something like resultDf.select("samples") you get:

'Project ['samples]
+- Relation [] ElasticsearchRelation(Map(es.read.field.as.array.include -> samples, es.resource -> nested_fields_upsert_test),org.apache.spark.sql.SQLContext@5cfeb239,None)

org.apache.spark.sql.AnalysisException: cannot resolve 'samples' given input columns: [];
'Project ['samples]
+- Relation [] ElasticsearchRelation(Map(es.read.field.as.array.include -> samples, es.resource -> nested_fields_upsert_test),org.apache.spark.sql.SQLContext@5cfeb239,None)

Still trying to track down why that is.
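For reference, roughly how the result is read back (a sketch: the es.read.field.as.array.include option and the resource name come from the relation printed above, while the reader calls themselves are my assumption):

val resultDf = sqlContext.read
  .format("org.elasticsearch.spark.sql")
  .option("es.read.field.as.array.include", "samples")
  .load("nested_fields_upsert_test")
resultDf.select("samples").show()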

Member Author

OK, I added support for empty arrays (and added them to the test). My initial problems writing empty arrays were because they were the first and only thing I was writing, so Elasticsearch did not infer the correct mappings.
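Roughly what the extended test data looks like (a hypothetical sketch reusing the sample row shown earlier; the actual test may differ):

import org.apache.spark.sql.Row

// One row with a populated nested array and one with an empty nested array,
// so the empty-array path is exercised alongside the populated one.
val data = Seq(
  Row("1", List(Row("hello"), Row("world"))),
  Row("2", List.empty[Row])
)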

@jbaiera
Member

jbaiera commented Jan 6, 2022

In terms of draft status: I think this is decently written. There might be a case where we're returning a MapType and this can be tripped up. Perhaps it makes even more sense to simply delegate any unknown typed values to the primitives method?

In the case of parameters and field extraction, most primitive values are handled in the FieldWriter class (https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/serialization/bulk/AbstractBulkFactory.java#L135-L142) instead of in each integration's value writer class. This is because most of the time the data we're working with doesn't need to be unwrapped any further (at that point in the code we usually just have an int, a String, etc.). Anything that doesn't match that linked conditional is passed on to the integration-specific serialization code to be converted to JSON and added to the bulk request header.
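In other words, something like this (a loose Scala paraphrase of that Java conditional, not the actual FieldWriter code; serializeWithValueWriter is a made-up stand-in for the integration-specific serialization path):

// Hypothetical stand-in for the integration-specific value writer (e.g. DataFrameValueWriter).
def serializeWithValueWriter(value: Any): String = ???

// Primitive-looking values are written directly into the bulk request header;
// anything else is handed off to the integration's serialization code as JSON.
def renderParam(value: Any): String = value match {
  case s: String => "\"" + s + "\""
  case _: Int | _: Long | _: Float | _: Double | _: Boolean => value.toString
  case other => serializeWithValueWriter(other) // e.g. a Row, a Map, or an Array[Byte]
}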

@jbaiera
Member

jbaiera commented Jan 6, 2022

Looking at the expected primitives in the DataFrameValueWriter at https://github.com/elastic/elasticsearch-hadoop/blob/master/spark/sql-30/src/main/scala/org/elasticsearch/spark/sql/DataFrameValueWriter.scala#L164-L174 and comparing them to what we handle in the FieldWriter (above), it looks like we might need special logic for the following values (testing will be needed to confirm whether they blow up or whether they're already managed; see the sketch after this list):

  • BinaryType fields with Array[Byte] values
  • MapType fields with scala Map[_, _] values or java Map<?, ?> values
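A rough Scala sketch of how those two cases could be folded into a value-to-DataType inference step like the inferType snippet later in this review (assumed names and behavior, not the code that was eventually merged):

import org.apache.spark.sql.types._

// Hypothetical handling of the two cases above; assumes maps are non-empty so the
// key and value types can be sampled from the first entry.
def inferExtraType(value: Any): DataType = value match {
  case _: Array[Byte] => BinaryType
  case m: scala.collection.Map[_, _] if m.nonEmpty =>
    MapType(inferExtraType(m.head._1), inferExtraType(m.head._2))
  case m: java.util.Map[_, _] if !m.isEmpty =>
    val e = m.entrySet().iterator().next()
    MapType(inferExtraType(e.getKey), inferExtraType(e.getValue))
  case _: String => StringType
  case _: Int => IntegerType
  case other => throw new IllegalArgumentException(s"Unhandled type: ${other.getClass}")
}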

@masseyke
Member Author

OK I've updated it so that writes of maps don't fail. They don't exactly behave intuitively, but I've tried to show what they do in an itest. It's not perfect, but I think it's better than complete failure for now.

@masseyke masseyke marked this pull request as ready for review January 14, 2022 21:53
@jbaiera (Member) left a comment

Quick question, but otherwise LGTM

private def inferType(value: Any): DataType = {
  value match {
    case _: String => StringType
    case Int => IntegerType
Member

Scala question: Should this be _: Int ?

Member Author

Whoops -- good catch! I'll fix all of those.

Member Author

And I'm pretty sure the Java ones are redundant, but I'll leave them in because they don't hurt anything.
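For reference, the corrected shape of the match (a sketch, not the merged code; _: Int is a type pattern, whereas the bare Int above refers to the companion object and does not test the value's type):

import org.apache.spark.sql.types.{DataType, IntegerType, StringType}

def inferType(value: Any): DataType = value match {
  case _: String => StringType
  case _: Int => IntegerType
  case _: java.lang.Integer => IntegerType // likely redundant on the JVM: boxed Ints already match the case above
  case other => throw new IllegalArgumentException(s"Unhandled type: ${other.getClass}")
}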

@masseyke masseyke merged commit 3c805a9 into elastic:master Jan 20, 2022
@masseyke masseyke deleted the fix/nested-fields-upsert branch January 20, 2022 14:46
Successfully merging this pull request may close these issues:

Nested fields upsert with Spark not working