SPARK-1630: Make PythonRDD handle NULL elements and strings gracefully #554

Closed
wants to merge 3 commits

Conversation

@kalpit commented Apr 25, 2014

Have added a unit test that validates the fix. We no longer NPE.

@@ -301,15 +301,25 @@ private[spark] object PythonRDD {
          throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
        }
      case other =>
-       throw new SparkException("Unexpected element type " + first.getClass)
+       Option(other) match {

It's more obvious to just do

if (other == null) {
} else {
}

than pattern matching.
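For illustration, a minimal, self-contained sketch of the shape the reviewer is suggesting, with the surrounding match reduced to a single method (the method name and messages are hypothetical, and RuntimeException/println stand in for Spark's SparkException/logDebug so the snippet compiles on its own):

```scala
// Hypothetical reduction of the element-type dispatch in PythonRDD:
// a plain null check instead of wrapping the element in Option.
def checkElementType(other: Any): Unit = {
  if (other == null) {
    // The behavior under review: skip NULL elements instead of
    // NPE-ing on other.getClass.
    println("Encountered NULL element. Skipping.") // logDebug in Spark
  } else {
    throw new RuntimeException("Unexpected element type " + other.getClass)
  }
}
```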

@AmplabJenkins

Can one of the admins verify this patch?

      val bytes = str.getBytes(UTF8)
      dataOut.writeInt(bytes.length)
      dataOut.write(bytes)
+     Option(str) match {

here also

@rxin commented Apr 25, 2014

Thanks, @kalpit. This looks pretty good. I left a couple comments on style.

if (str == null) {
logDebug("Encountered NULL string. We skip writing NULL to stream.")
} else {
val bytes = str.getBytes(UTF8)

the indent is off here (2-space indent)


Can you update this one also?
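To make the change concrete, a runnable sketch of what the patched writeUTF could look like with the null guard and the 2-space indent applied (StandardCharsets.UTF_8 stands in for PythonRDD's UTF8 constant, and println for logDebug, so the snippet is self-contained):

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}
import java.nio.charset.StandardCharsets

// Sketch of the patched writeUTF: skip null strings instead of
// NPE-ing on str.getBytes.
def writeUTF(str: String, dataOut: DataOutputStream): Unit = {
  if (str == null) {
    println("Encountered NULL string. We skip writing NULL to stream.")
  } else {
    val bytes = str.getBytes(StandardCharsets.UTF_8)
    dataOut.writeInt(bytes.length) // length-prefixed framing
    dataOut.write(bytes)
  }
}

// Usage: the null is silently dropped from the stream.
val out = new DataOutputStream(new ByteArrayOutputStream())
writeUTF("hello", out)
writeUTF(null, out)
```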

@rxin commented Apr 25, 2014

Thanks - just one more tiny thing about indent ...

@mateiz commented Apr 25, 2014

Jenkins, test this please

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@mateiz commented Apr 25, 2014

I'm curious, when did you get nulls in practice? Wouldn't it be better to pass a null to Python and have it display as None?

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14500/

@kalpit commented Apr 26, 2014

@mateiz I ran into this when my custom RDD produced nulls for some elements within a partition/split (during compute()).

It would indeed be better to pass a null to Python and have it display as None. One solution is to pick a TOKEN that we write into the tmp file and then translate to "None" during read. This, however, is not failsafe, because there is a remote possibility of string data being identical to the TOKEN. Perhaps we could address that by fencing regular data with a special character and treating data lacking that fence as tokens.

In any case, the above solution (or an alternative) would be a relatively larger change, and I preferred fixing at least the NPEs in PythonRDD in the short term (the stack trace is in the JIRA ticket).

What do you think?

@mateiz commented Apr 26, 2014

But that means that the NPEs are only happening with your custom RDD, right? They won't happen for regular Spark users.

I think we should pass None here. One way to do it is to select a negative length (e.g. -3) to represent null, and pass that to Python. We already use other negative lengths for other special flags.
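A hedged sketch of that idea in the style of the snippets above: reserve one more negative length as a null marker and write it in place of the payload (the flag name and value here are hypothetical; the real reserved values live in PythonRDD.SpecialLengths):

```scala
import java.io.DataOutputStream
import java.nio.charset.StandardCharsets

// Hypothetical addition: one more flag alongside the existing
// negative lengths reserved in PythonRDD.SpecialLengths.
object SpecialLengths {
  val NULL = -4 // illustrative value; must not collide with -1 to -3
}

// Instead of skipping nulls, write the marker so the Python reader
// can surface the element as None.
def writeUTF(str: String, dataOut: DataOutputStream): Unit = {
  if (str == null) {
    dataOut.writeInt(SpecialLengths.NULL)
  } else {
    val bytes = str.getBytes(StandardCharsets.UTF_8)
    dataOut.writeInt(bytes.length)
    dataOut.write(bytes)
  }
}
```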

@kalpit commented Apr 27, 2014

I suspect that the NPEs will happen for any PySpark user whose RDD returns null for some input "x" via the lambda/transform. Check out the test case I added to "PythonRDDSuite.scala" to reproduce the NPE.

I considered the idea of using a negative length (-4) to pass "None" to Python (PythonRDD.SpecialLengths -1 to -3 are taken). The tricky part, however, is that the read() method returns an array of bytes based on the length, and existing code treats an empty array as end of data/stream. So I am not sure how we would communicate "None" to Python. Thoughts?
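To illustrate the collision being described, a Scala stand-in for the length-prefixed read loop (the real reader lives in the PySpark worker and is written in Python; the frame types and the -4 flag are hypothetical):

```scala
import java.io.DataInputStream

// Frame types a reader could distinguish once null has its own flag.
sealed trait Frame
case object EndOfData extends Frame   // today signaled by an empty payload
case object NullElement extends Frame // would map to None on the Python side
final case class Element(bytes: Array[Byte]) extends Frame

def readFrame(in: DataInputStream): Frame = in.readInt() match {
  case -4 => NullElement // hypothetical NULL flag written by the Scala side
  case 0  => EndOfData   // existing convention: empty array ends the stream
  case n =>
    val bytes = new Array[Byte](n)
    in.readFully(bytes)
    Element(bytes)
}
```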

@mateiz commented Apr 28, 2014

Lambdas in Python that return None will work fine because we use pickling for all data after that. The only way this problem can happen is if a Java RDD has null in it. Do you have an example in Python only (with the current PySpark) where this happens?

@kalpit commented Apr 28, 2014

I see your point. I don't have a Python-only use-case that can trigger the NPE.

My custom RDD implementation had a corner case in which the RDD's compute() method returned a null in the iterator stream. I have fixed my implementation to not do that, so I don't run into this NPE anymore. However, should anyone else implement a custom RDD of a similar nature (with nulls for some elements in a partition's iterator stream) and try accessing it from PySpark, they would run into the NPE, so I thought it would be nicer if we handled nulls in the stream gracefully.

@mateiz commented Apr 28, 2014

Yeah, but in that case I think we have to figure out a way with the lengths. I haven't had time to look into it, but basically the UTF decoder in Python needs to deal with negative lengths sent from Scala.

@kanzhang commented May 8, 2014

> I considered the idea of using a negative length (-4) to pass "None" to Python (PythonRDD.SpecialLengths -1 to -3 are taken). The tricky part, however, is that the read() method returns an array of bytes based on the length, and existing code treats an empty array as end of data/stream. So I am not sure how we would communicate "None" to Python. Thoughts?

@kalpit please take a look at #644, where I propose to use null to signal end of stream instead of an empty array.

pwendell pushed a commit to pwendell/spark that referenced this pull request on May 12, 2014:

SPARK-1056. Fix header comment in Executor to not imply that it's only used for Mesos and Standalone.

Author: Sandy Ryza <[email protected]>

Closes apache#554.

== Merge branch commits ==

commit 1f2443d902a26365a5c23e4af9077e1539ed2eab
Author: Sandy Ryza <[email protected]>
Date:   Thu Feb 6 15:03:50 2014 -0800

    SPARK-1056. Fix header comment in Executor to not imply that it's only used for Mesos and Standalone
@JoshRosen

Hi @kalpit,

Since this PR has been superseded by #644, do you mind closing it? Thanks!

@AmplabJenkins

Can one of the admins verify this patch?

@mateiz commented Sep 5, 2014

I've closed this since it was fixed separately. Thanks for sending a patch here.

@SparkQA commented Sep 5, 2014

Can one of the admins verify this patch?

asfgit closed this in d112a6c on Sep 21, 2014
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request on Sep 11, 2019:

Perform apt-get update before install