[SPARK-1630] Turn Null of Java/Scala into None of Python #1551

Closed
wants to merge 8 commits

Conversation

@davies (Contributor) commented Jul 23, 2014

Serializing a PythonRDD will cause an NPE if it contains a null. This patch handles null as Python's None.

This PR is based on #554, thanks to @kalpit.
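
For readers skimming the thread, here is a minimal sketch of the framing idea behind the patch, assuming each element crosses the JVM-to-Python pipe as a 4-byte length header followed by the payload, with a negative sentinel header standing in for null. The names NullFraming and NULL_MARKER (and the value -5) are illustrative assumptions, not the actual Spark code:

import java.io.DataOutputStream
import java.nio.charset.StandardCharsets.UTF_8

object NullFraming {
  val NULL_MARKER: Int = -5 // assumed sentinel; the patch calls this SpecialLengths.NULL

  // Write one string element: a fixed 4-byte header, then the payload.
  // A null element is just the sentinel header with no payload, which the
  // Python side can turn into None.
  def writeUTF(str: String, dataOut: DataOutputStream): Unit = {
    if (str == null) {
      dataOut.writeInt(NULL_MARKER)
    } else {
      val bytes = str.getBytes(UTF_8)
      dataOut.writeInt(bytes.length)
      dataOut.write(bytes)
    }
  }
}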

@SparkQA commented Jul 23, 2014

QA tests have started for PR 1551. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17056/consoleFull

if (str == null) {
  dataOut.writeInt(SpecialLengths.NULL)
} else {
  val bytes = str.getBytes(UTF8)
Contributor commented:

alignment is off here

@SparkQA commented Jul 23, 2014

QA results for PR 1551:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17056/consoleFull

@SparkQA commented Jul 23, 2014

QA tests have started for PR 1551. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17065/consoleFull

@davies (Contributor, Author) commented Jul 25, 2014

@rxin @mateiz, could you take a look at this?

@@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
              throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
          }
        case other =>
-         throw new SparkException("Unexpected element type " + first.getClass)
+         if (other == null) {
+           dataOut.writeInt(SpecialLengths.NULL)
Contributor commented:

maybe it doesn't matter much here, but would it make sense to write a byte instead of an int?

Contributor (Author) commented:

It's the header of a variable-length field. It's better to keep this header at a fixed length; otherwise you would need to deal with a special variable-length encoding.
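
A sketch of the matching read side under the same assumptions (NullFramingReader, NULL_MARKER, and readElement are hypothetical names): because the header is always exactly 4 bytes, the reader knows up front how much to consume next, whereas a 1-byte null flag would force a second, variable-width length encoding for the non-null case.

import java.io.DataInputStream

object NullFramingReader {
  val NULL_MARKER: Int = -5 // assumed sentinel matching the writer side

  def readElement(dataIn: DataInputStream): Option[Array[Byte]] = {
    val length = dataIn.readInt() // the header is always exactly 4 bytes
    if (length == NULL_MARKER) {
      None // becomes None on the Python side
    } else {
      val buf = new Array[Byte](length)
      dataIn.readFully(buf)
      Some(buf)
    }
  }
}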

@matei commented Jul 26, 2014

Hi guys,
Could you please use the full username (e.g. @mateiz instead of @matei) when referring to someone? I keep getting subscribed to various conversations under this project :) Thanks a lot!

@rxin (Contributor) commented Jul 26, 2014

Jenkins, retest this please.

@SparkQA commented Jul 26, 2014

QA tests have started for PR 1551. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17225/consoleFull

@SparkQA commented Jul 26, 2014

QA results for PR 1551:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17225/consoleFull

@JoshRosen (Contributor) commented:
We aren't passing completely arbitrary iterators of Java objects to writeIteratorToStream; instead, we only handle iterators of strings and byte arrays. Nulls in data read from Hadoop input formats should already be converted to None by the Java pickling code. Do you have an example where PythonRDD receives a null element and it's not due to a bug? I'm worried that this patch will mask the presence of bugs.
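
For context, the Java-side conversion JoshRosen mentions can be sketched with Pyrolite, the pickling library Spark uses on the JVM; the snippet below is illustrative usage, not the actual converter code:

import net.razorvine.pickle.Pickler

object PickleNullDemo extends App {
  // A JVM null pickles to Python's None, so the Hadoop input-format path
  // hands PythonRDD pickled bytes, never a raw null.
  val pickled: Array[Byte] = new Pickler().dumps(null)
  println(pickled.length) // a few bytes encoding the None opcode
}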

-         throw new SparkException("Unexpected element type " + first.getClass)
+         if (other == null) {
+           dataOut.writeInt(SpecialLengths.NULL)
+           writeIteratorToStream(iter, dataOut)
Contributor commented:

This method isn't tail-recursive, so this will cause a StackOverflow if you try to write an iterator with thousands of consecutive nulls.

Contributor commented:

It looks like we only have to worry about nulls when writing iterators from user-defined RDDs of strings. So, if we see an iterator that begins with null, we can assume that the remainder of the iterator contains only nulls or strings. Therefore, I think you can write out the first null followed by

iter.asInstanceOf[Iterator[String]].foreach { str =>
  writeUTF(str, dataOut)
}

to process the remainder of the stream.

Contributor commented:

I was wrong; this is tail-recursive. If we only expect nulls to occur in iterators of strings, then I think we should be able to remove the null checking here.
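
A minimal sketch of why the recursion is safe, assuming a simplified writer (TailRecWriter and NULL_MARKER are hypothetical names, not the actual Spark method): the self-call is the last expression on its path, so scalac compiles it to a loop, and @tailrec makes the compiler verify that.

import java.io.DataOutputStream
import java.nio.charset.StandardCharsets.UTF_8
import scala.annotation.tailrec

object TailRecWriter {
  val NULL_MARKER: Int = -5 // assumed null sentinel

  @tailrec
  def writeAll(iter: Iterator[Any], dataOut: DataOutputStream): Unit = {
    if (iter.hasNext) {
      iter.next() match {
        case null =>
          dataOut.writeInt(NULL_MARKER)
        case s: String =>
          val bytes = s.getBytes(UTF_8)
          dataOut.writeInt(bytes.length)
          dataOut.write(bytes)
        case other =>
          throw new IllegalArgumentException(s"Unexpected element type ${other.getClass}")
      }
      writeAll(iter, dataOut) // tail position: compiled to a jump, no stack growth
    }
  }
}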

Contributor (Author) commented:

I think it's better to handle possible NPEs as much as we can, until we can prove that an NPE will not happen.

Contributor commented:

But this is what I didn't understand about the whole PR: user code is not meant to call PythonRDD directly. Note that the whole PythonRDD object is private[spark]. So where in the codebase today can we get nulls there?

Contributor commented:

Right, but that's a private API, so it doesn't matter. Does our own code do it?

Basically I'm worried that this significantly complicates our code for something that shouldn't happen. I'd rather have an NPE if our own code later passes nulls here (because it really shouldn't be doing that, since we control everything we pass in).

Contributor (Author) commented:

If users want to call Java/Scala UDFs from PySpark, they have to use this private API to do it, so it's possible to have nulls in an RDD[String] or RDD[Array[Byte]].

BTW, it would be helpful if we could skip some bad rows during map/reduce, as mentioned in the MapReduce paper. This is not a must-have feature, but it really improves the robustness of the whole framework, which is very useful for large-scale jobs.

This PR tries to improve the stability of PySpark, letting users feel safer and happier hacking in PySpark.

Contributor commented:

Again, sorry, I don't think this improves stability:

  1. Users are not supposed to call private APIs. In fact even Scala code can't call PythonRDD because that is private[spark] -- it's just an artifact of the way Scala implements package-private that the class becomes public in Java. If you'd like support for UDFs we need to add that as a separate, top-level feature.
  2. This change would mask bugs in the current way we write Python converters. Our current converters only pass in Strings and arrays of bytes, which shouldn't be null. (For datasets that contain nulls, they already convert them to a pickled form of None.) This means that if someone introduces a bug in one of our existing code paths, that bug will be harder to fix because instead of being an NPE, it will be some weird value coming out in Python.

Contributor commented:

BTW apart from the stability issue above with catching our own bugs, the reason I'm commenting is that this change also adds some moderately tricky code in a fairly important code path, increasing the chance of adding new bugs. That doesn't seem worth it to me.

Contributor (Author) commented:

OK, let's hold it.

@SparkQA commented Jul 29, 2014

QA tests have started for PR 1551. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17380/consoleFull

def writeIteratorToStream[T](iter: Iterator[T], dataOut: DataOutputStream) {
  // The right way to implement this would be to use TypeTags to get the full
  // type of T. Since I don't want to introduce breaking changes throughout the
  // entire Spark API, I have to use this hacky approach:
  def writeBytes(bytes: Array[Byte]) {
Contributor commented:

Is there a legitimate case where an Iterator[Array[Byte]] will contain a null? I was hoping we'd only have to worry about nulls in Iterator[String].

Contributor (Author) commented:

Array[Byte] is similar to String: nulls can be generated by users' functions or RDDs, for example:

rdd.map(x => if (x != null) x.getBytes("UTF-8") else null) // RDD[String] => RDD[Array[Byte]], nulls preserved
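
A self-contained sketch of that scenario, with plain Scala collections standing in for RDDs (NullBytesDemo is a hypothetical name):

import java.nio.charset.StandardCharsets.UTF_8

object NullBytesDemo extends App {
  // A null-preserving user function turns strings with nulls into byte
  // arrays with nulls, which is exactly the shape of data that
  // writeIteratorToStream would then receive.
  val strings = Iterator("a", null, "b")
  val bytes: Iterator[Array[Byte]] =
    strings.map(x => if (x != null) x.getBytes(UTF_8) else null)
  bytes.foreach(b => println(if (b == null) "null element" else s"${b.length} bytes"))
}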

@SparkQA commented Jul 29, 2014

QA results for PR 1551:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17380/consoleFull

@davies (Contributor, Author) commented Jul 29, 2014

The failed test cases are not related to this PR. How do I retest it?

@JoshRosen (Contributor) commented:
Jenkins, retest this please.

@SparkQA commented Jul 29, 2014

QA tests have started for PR 1551. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17388/consoleFull

@SparkQA commented Jul 29, 2014

QA results for PR 1551:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17388/consoleFull

@davies (Contributor, Author) commented Jul 30, 2014

Closing this PR now; will reopen if needed.

@davies closed this Jul 30, 2014
@davies deleted the null branch September 15, 2014
@davies (Contributor, Author) commented Jan 8, 2015

@mateiz We hit this issue when working on the Python API for Kafka: it's a DStream[(Array[Byte], Array[Byte])], but the key in the DStream is null. I will fix this in #3715.
