Adding support for upserts of nested arrays #1838
Conversation
I'm marking this as a draft because it fixes the specific case described in #1190, but I'm not very familiar with this part of Elasticsearch, and I'm not sure it's general enough.
LGTM, small question about testing.
}
Result.SUCCESFUL()
}
Result.FAILED()
Is an empty array a failure scenario here for sure?
I can't remember for sure at this point, but I think my reasoning was that in this case we had been asked to write something but hadn't actually written an array, so that counts as a failure. Maybe that doesn't make sense, though. I'll take a look at what's done with that result. Also, I can't remember why I'm not just writing an empty array here.
It actually turns out that we ignore this result. See https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/serialization/bulk/AbstractBulkFactory.java#L153. We probably ought to be throwing an exception there. But an empty array actually is a failure (I think). I'm about to write more in the comment below.
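To illustrate the idea of surfacing that failure instead of ignoring it, here is a rough, self-contained Scala sketch; the trait and method names are hypothetical stand-ins, not the actual elasticsearch-hadoop types used in AbstractBulkFactory:

// Hypothetical sketch only: the real logic lives in AbstractBulkFactory.java and
// uses the project's own Result type; these names are stand-ins for illustration.
sealed trait WriteResult
case object Successful extends WriteResult
final case class Failed(value: Any) extends WriteResult

def checkWrite(result: WriteResult): Unit = result match {
  case Successful    => () // write succeeded, nothing to do
  case Failed(value) =>
    // surface the failure rather than silently dropping the result
    throw new IllegalStateException(s"Failed to serialize value for bulk request: $value")
}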
"es.update.script.inline" -> update_script | ||
) | ||
val sqlContext = new SQLContext(sc) | ||
var data = Seq(Row("1", List(Row("hello"), Row("world")))) |
Can we test this with an empty array in the samples field?
I just tried this. If I use the code as-is, it doesn't write anything. So something gets tripped up when Elasticsearch tries to apply the script:
13:49:26.766 [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] ERROR org.apache.spark.TaskContextImpl - Error in TaskCompletionListener
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: org.elasticsearch.hadoop.rest.EsHadoopRemoteException: x_content_parse_exception: [1:74] [script] failed to parse field [params]
{"update":{"_id":"1"}}
{"script":{"source":"ctx._source.samples = params.new_samples","params":{"new_samples":}},"upsert":{"samples":[]}}
at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:487) ~[elasticsearch-hadoop-mr-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:444) ~[elasticsearch-hadoop-mr-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:438) ~[elasticsearch-hadoop-mr-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:418) ~[elasticsearch-hadoop-mr-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:236) ~[elasticsearch-hadoop-mr-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
at org.elasticsearch.hadoop.rest.bulk.BulkProcessor.tryFlush(BulkProcessor.java:215) ~[elasticsearch-hadoop-mr-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
at org.elasticsearch.hadoop.rest.bulk.BulkProcessor.flush(BulkProcessor.java:518) ~[elasticsearch-hadoop-mr-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
at org.elasticsearch.hadoop.rest.bulk.BulkProcessor.close(BulkProcessor.java:560) ~[elasticsearch-hadoop-mr-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:219) ~[elasticsearch-hadoop-mr-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
at org.elasticsearch.hadoop.rest.RestService$PartitionWriter.close(RestService.java:122) ~[elasticsearch-hadoop-mr-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
at org.elasticsearch.spark.rdd.EsRDDWriter$$anon$1.onTaskCompletion(EsRDDWriter.scala:74) ~[elasticsearch-spark_2.12-8.1.0-SNAPSHOT-spark30scala212.jar:8.1.0-SNAPSHOT]
at org.apache.spark.TaskContextImpl.$anonfun$markTaskCompleted$1(TaskContextImpl.scala:124) ~[spark-core_2.12-3.2.0.jar:3.2.0]
at org.apache.spark.TaskContextImpl.$anonfun$markTaskCompleted$1$adapted(TaskContextImpl.scala:124) ~[spark-core_2.12-3.2.0.jar:3.2.0]
at org.apache.spark.TaskContextImpl.$anonfun$invokeListeners$1(TaskContextImpl.scala:137) ~[spark-core_2.12-3.2.0.jar:3.2.0]
at org.apache.spark.TaskContextImpl.$anonfun$invokeListeners$1$adapted(TaskContextImpl.scala:135) ~[spark-core_2.12-3.2.0.jar:3.2.0]
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) ~[scala-library-2.12.15.jar:?]
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) ~[scala-library-2.12.15.jar:?]
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:135) ~[spark-core_2.12-3.2.0.jar:3.2.0]
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:124) ~[spark-core_2.12-3.2.0.jar:3.2.0]
at org.apache.spark.scheduler.Task.run(Task.scala:147) ~[spark-core_2.12-3.2.0.jar:3.2.0]
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) ~[spark-core_2.12-3.2.0.jar:3.2.0]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462) [spark-core_2.12-3.2.0.jar:3.2.0]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) [spark-core_2.12-3.2.0.jar:3.2.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_292]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_292]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
If instead I write an empty array in DataFrameValueWriter, then when I read the data out the returned DataFrame is kind of useless. It has no columns. If you try to do something like resultDf.select("samples") you get:
'Project ['samples]
+- Relation [] ElasticsearchRelation(Map(es.read.field.as.array.include -> samples, es.resource -> nested_fields_upsert_test),org.apache.spark.sql.SQLContext@5cfeb239,None)
org.apache.spark.sql.AnalysisException: cannot resolve 'samples' given input columns: [];
'Project ['samples]
+- Relation [] ElasticsearchRelation(Map(es.read.field.as.array.include -> samples, es.resource -> nested_fields_upsert_test),org.apache.spark.sql.SQLContext@5cfeb239,None)
Still trying to track down why that is.
OK I added support for empty arrays (and added them to the test). My initial problems writing out empty arrays were due to the fact that they were the first and only thing I was writing, so Elasticsearch did not infer the correct mappings.
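For reference, a minimal sketch of how the empty-array case might look in the sample data; the field names and the shape of the nested struct are assumptions based on the snippet above, not the exact integration test:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Sketch only: schema and field names are assumptions, not the actual itest.
val schema = StructType(Seq(
  StructField("id", StringType, nullable = false),
  StructField("samples", ArrayType(StructType(Seq(
    StructField("text", StringType, nullable = true)
  ))), nullable = true)
))

val data = Seq(
  Row("1", Seq(Row("hello"), Row("world"))),  // non-empty nested array
  Row("2", Seq.empty[Row])                    // empty nested array case
)
// e.g. sqlContext.createDataFrame(sc.parallelize(data), schema)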
In terms of draft status: I think this is decently written. There might be a case where we're returning a
In the case of parameters and field extraction, most primitive values are handled in the
Looking at the expected primitives in the
OK I've updated it so that writes of maps don't fail. They don't exactly behave intuitively, but I've tried to show what they do in an itest. It's not perfect, but I think it's better than complete failure for now.
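For illustration, a short sketch of the kind of map-valued input being discussed; the field names and values are hypothetical, not the actual itest:

import org.apache.spark.sql.Row

// Hypothetical data only: a row whose second column is a Map, the case whose
// upsert behavior the itest demonstrates.
val mapData = Seq(Row("1", Map("label" -> "hello", "other" -> "world")))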
Quick question, but otherwise LGTM
private def inferType(value: Any): DataType = {
  value match {
    case _: String => StringType
    case Int => IntegerType
Scala question: Should this be _: Int?
Whoops -- good catch! I'll fix all of those.
And I'm pretty sure the java ones are redundant, but I'll leave them in because they don't hurt anything.
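For reference, a minimal sketch of the corrected pattern match; only the String and Int cases come from the diff above, and the remaining cases and the wrapping object are illustrative assumptions:

import org.apache.spark.sql.types._

object TypeInferenceSketch {
  // `case Int` would match the Int companion object rather than integer values;
  // `case _: Int` tests the runtime type of the value.
  def inferType(value: Any): DataType = value match {
    case _: String => StringType
    case _: Int    => IntegerType  // was `case Int`, which never matches an Int value
    case _: Long   => LongType
    case _: Double => DoubleType
    // A boxed java.lang.Integer also matches `_: Int` on the JVM, which is why
    // separate Java cases would be redundant.
    case _         => NullType
  }
}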
This commit adds support for upserts of nested array fields.
Closes #1190