[SPARK-1630] Turn Null of Java/Scala into None of Python #1551

Closed
wants to merge 8 commits

Conversation

@davies (Contributor) commented Jul 23, 2014

Serializing a PythonRDD will cause an NPE if it contains a null. This patch handles null as Python's None.

This PR is based on #554, thanks to @kalpit.
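
For readers skimming the thread, here is a minimal sketch of the framing idea behind the patch, assuming each element crosses the JVM-to-Python pipe as a 4-byte length header followed by the payload, with a negative sentinel header standing in for null. The names NullFraming and NULL_MARKER (and the value -5) are illustrative assumptions, not the actual Spark code:

import java.io.DataOutputStream
import java.nio.charset.StandardCharsets.UTF_8

object NullFraming {
  val NULL_MARKER: Int = -5 // assumed sentinel; the patch calls this SpecialLengths.NULL

  // Write one string element: a fixed 4-byte header, then the payload.
  // A null element is just the sentinel header with no payload, which the
  // Python side can turn into None.
  def writeUTF(str: String, dataOut: DataOutputStream): Unit = {
    if (str == null) {
      dataOut.writeInt(NULL_MARKER)
    } else {
      val bytes = str.getBytes(UTF_8)
      dataOut.writeInt(bytes.length)
      dataOut.write(bytes)
    }
  }
}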

@SparkQA commented Jul 23, 2014

QA tests have started for PR 1551. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17056/consoleFull

if (str == null) {
  dataOut.writeInt(SpecialLengths.NULL)
} else {
  val bytes = str.getBytes(UTF8)
Contributor commented:

alignment is off here

@SparkQA commented Jul 23, 2014

QA results for PR 1551:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17056/consoleFull

@SparkQA commented Jul 23, 2014

QA tests have started for PR 1551. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17065/consoleFull

@davies (Contributor, Author) commented Jul 25, 2014

@rxin @mateiz, could you take a look at this?

@@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
              throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
          }
        case other =>
-         throw new SparkException("Unexpected element type " + first.getClass)
+         if (other == null) {
+           dataOut.writeInt(SpecialLengths.NULL)
Contributor commented:

maybe it doesn't matter much here, but would it make sense to write a byte instead of an int?

Contributor (Author) commented:

It's the header of a variable-length field. It's better to keep this header at a fixed length; otherwise you would need to deal with a special variable-length encoding.
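
A sketch of the matching read side under the same assumptions (NullFramingReader, NULL_MARKER, and readElement are hypothetical names): because the header is always exactly 4 bytes, the reader knows up front how much to consume next, whereas a 1-byte null flag would force a second, variable-width length encoding for the non-null case.

import java.io.DataInputStream

object NullFramingReader {
  val NULL_MARKER: Int = -5 // assumed sentinel matching the writer side

  def readElement(dataIn: DataInputStream): Option[Array[Byte]] = {
    val length = dataIn.readInt() // the header is always exactly 4 bytes
    if (length == NULL_MARKER) {
      None // becomes None on the Python side
    } else {
      val buf = new Array[Byte](length)
      dataIn.readFully(buf)
      Some(buf)
    }
  }
}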

@matei commented Jul 26, 2014

Hi guys,
Could you please use the full username (e.g. @mateiz instead of @matei) when referring to someone? I keep getting subscribed to various conversations under this project :) Thanks a lot!

@rxin (Contributor) commented Jul 26, 2014

Jenkins, retest this please.

@SparkQA commented Jul 26, 2014

QA tests have started for PR 1551. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17225/consoleFull

@SparkQA commented Jul 26, 2014

QA results for PR 1551:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17225/consoleFull

@JoshRosen (Contributor) commented:
We aren't passing completely arbitrary iterators of Java objects to writeIteratorToStream; instead, we only handle iterators of strings and byte arrays. Nulls in data read from Hadoop input formats should already be converted to None by the Java pickling code. Do you have an example where PythonRDD receives a null element and it's not due to a bug? I'm worried that this patch will mask the presence of bugs.
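
For context, the Java-side conversion JoshRosen mentions can be sketched with Pyrolite, the pickling library Spark uses on the JVM; the snippet below is illustrative usage, not the actual converter code:

import net.razorvine.pickle.Pickler

object PickleNullDemo extends App {
  // A JVM null pickles to Python's None, so the Hadoop input-format path
  // hands PythonRDD pickled bytes, never a raw null.
  val pickled: Array[Byte] = new Pickler().dumps(null)
  println(pickled.length) // a few bytes encoding the None opcode
}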

-         throw new SparkException("Unexpected element type " + first.getClass)
+         if (other == null) {
+           dataOut.writeInt(SpecialLengths.NULL)
+           writeIteratorToStream(iter, dataOut)
Contributor commented:

This method isn't tail-recursive, so this will cause a StackOverflow if you try to write an iterator with thousands of consecutive nulls.

Contributor commented:

It looks like we only have to worry about nulls when writing iterators from user-defined RDDs of strings. So, if we see an iterator that begins with null, we can assume that the remainder of the iterator contains only nulls or strings. Therefore, I think you can write out the first null followed by

iter.asInstanceOf[Iterator[String]].foreach { str =>
  writeUTF(str, dataOut)
}

to process the remainder of the stream.

Contributor commented:

I was wrong; this is tail-recursive. If we only expect nulls to occur in iterators of strings, then I think we should be able to remove the null checking here.
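
A minimal sketch of why the recursion is safe, assuming a simplified writer (TailRecWriter and NULL_MARKER are hypothetical names, not the actual Spark method): the self-call is the last expression on its path, so scalac compiles it to a loop, and @tailrec makes the compiler verify that.

import java.io.DataOutputStream
import java.nio.charset.StandardCharsets.UTF_8
import scala.annotation.tailrec

object TailRecWriter {
  val NULL_MARKER: Int = -5 // assumed null sentinel

  @tailrec
  def writeAll(iter: Iterator[Any], dataOut: DataOutputStream): Unit = {
    if (iter.hasNext) {
      iter.next() match {
        case null =>
          dataOut.writeInt(NULL_MARKER)
        case s: String =>
          val bytes = s.getBytes(UTF_8)
          dataOut.writeInt(bytes.length)
          dataOut.write(bytes)
        case other =>
          throw new IllegalArgumentException(s"Unexpected element type ${other.getClass}")
      }
      writeAll(iter, dataOut) // tail position: compiled to a jump, no stack growth
    }
  }
}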

Contributor (Author) commented:

I think it's better to handle possible NPEs as much as we can, until we can prove that an NPE will not happen.

Contributor commented:

But this is what I didn't understand about the whole PR: user code is not meant to call PythonRDD directly. Note that the whole PythonRDD object is private[spark]. So where in the codebase today can we get nulls there?

Contributor commented:

Right, but that's a private API, so it doesn't matter. Does our own code do it?

Basically I'm worried that this significantly complicates our code for something that shouldn't happen. I'd rather have an NPE if our own code later passes nulls here (because it really shouldn't be doing that, since we control everything we pass in).

Contributor (Author) commented:

If users want to call Java/Scala UDFs from PySpark, they have to use this private API to do it, so it's possible to have nulls in an RDD[String] or RDD[Array[Byte]].

BTW, it would be helpful if we could skip some bad rows during map/reduce, as mentioned in the MapReduce paper. This is not a must-have feature, but it really improves the robustness of the whole framework, which is very useful for large-scale jobs.

This PR tries to improve the stability of PySpark, letting users feel safer and happier hacking in PySpark.

Contributor commented:

Again, sorry, I don't think this improves stability:

  1. Users are not supposed to call private APIs. In fact even Scala code can't call PythonRDD because that is private[spark] -- it's just an artifact of the way Scala implements package-private that the class becomes public in Java. If you'd like support for UDFs we need to add that as a separate, top-level feature.
  2. This change would mask bugs in the current way we write Python converters. Our current converters only pass in Strings and arrays of bytes, which shouldn't be null. (For datasets that contain nulls, they already convert them to a pickled form of None.) This means that if someone introduces a bug in one of our existing code paths, that bug will be harder to fix because instead of being an NPE, it will be some weird value coming out in Python.

Contributor commented:

BTW apart from the stability issue above with catching our own bugs, the reason I'm commenting is that this change also adds some moderately tricky code in a fairly important code path, increasing the chance of adding new bugs. That doesn't seem worth it to me.

Contributor (Author) commented:

OK, let's hold it.

@SparkQA commented Jul 29, 2014

QA tests have started for PR 1551. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17380/consoleFull

def writeIteratorToStream[T](iter: Iterator[T], dataOut: DataOutputStream) {
  // The right way to implement this would be to use TypeTags to get the full
  // type of T. Since I don't want to introduce breaking changes throughout the
  // entire Spark API, I have to use this hacky approach:
  def writeBytes(bytes: Array[Byte]) {
Contributor commented:

Is there a legitimate case where an Iterator[Array[Byte]] will contain a null? I was hoping we'd only have to worry about nulls in Iterator[String].

Contributor (Author) commented:

Array[Byte] is similar to String: nulls can be generated by users' functions or RDDs, for example:

rdd.map(x => if (x != null) x.getBytes("UTF-8") else null) // RDD[String] => RDD[Array[Byte]], nulls preserved
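
A self-contained sketch of that scenario, with plain Scala collections standing in for RDDs (NullBytesDemo is a hypothetical name):

import java.nio.charset.StandardCharsets.UTF_8

object NullBytesDemo extends App {
  // A null-preserving user function turns strings with nulls into byte
  // arrays with nulls, which is exactly the shape of data that
  // writeIteratorToStream would then receive.
  val strings = Iterator("a", null, "b")
  val bytes: Iterator[Array[Byte]] =
    strings.map(x => if (x != null) x.getBytes(UTF_8) else null)
  bytes.foreach(b => println(if (b == null) "null element" else s"${b.length} bytes"))
}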

@SparkQA commented Jul 29, 2014

QA results for PR 1551:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17380/consoleFull

@davies (Contributor, Author) commented Jul 29, 2014

The failed test cases are not related to this PR. How do I retest it?

@JoshRosen (Contributor) commented:
Jenkins, retest this please.

@SparkQA commented Jul 29, 2014

QA tests have started for PR 1551. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17388/consoleFull

@SparkQA commented Jul 29, 2014

QA results for PR 1551:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17388/consoleFull

@davies (Contributor, Author) commented Jul 30, 2014

Closing this PR now; will reopen if needed.

@davies closed this Jul 30, 2014
@davies deleted the null branch September 15, 2014
@davies (Contributor, Author) commented Jan 8, 2015

@mateiz We hit this issue when working on the Python API for Kafka: it's a DStream[(Array[Byte], Array[Byte])], but the key in the DStream is null. I will fix this in #3715.
