Example pyspark-inputformat for Avro file format #1536
Conversation
Can one of the admins verify this patch?
@ericgarcia thanks for submitting this, looks good and will be helpful for others wanting to use Avro from PySpark. Would you mind adding the python example code to the PR itself (perhaps just in the comments to the
```scala
case v: java.lang.Float   => value.asInstanceOf[java.lang.Float]
case v: java.lang.Integer => value.asInstanceOf[java.lang.Integer]
case v: java.lang.Long    => value.asInstanceOf[java.lang.Long]
case v: java.lang.String  => value.asInstanceOf[java.lang.String]
```
How about adding `case null => null` for null handling?
@kanzhang I couldn't get a Java/Scala null passed into Python. It crashed when I tried, so I encoded it as an empty string instead. If someone else can figure this out, that would be cool.
One thing you could try to avoid a NullPointerException is to wrap any intermediate value that could be null in `Option(value)`, and only call `.orNull` on it at the end of the transformation to get the value back.
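The `Option(value)`/`.orNull` suggestion above is Scala-specific; as a rough illustration of the same null-safe chaining idea, here is a minimal Python sketch (the `safe_chain` helper is hypothetical and not part of this PR):

```python
def safe_chain(value, *funcs):
    """Apply each function in turn, short-circuiting to None if any
    intermediate result is None -- analogous to wrapping a value in
    Option(value), mapping over it, and calling .orNull at the end."""
    for f in funcs:
        if value is None:
            return None
        value = f(value)
    return value

# A value that survives the whole chain:
print(safe_chain("42", int, lambda x: x + 1))   # 43
# A null anywhere in the chain short-circuits safely:
print(safe_chain(None, int, lambda x: x + 1))   # None
```

The point of the pattern is that no step in the chain ever has to check for null itself, which is what makes the Scala version attractive for converter code.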
I think that #1551 will address this issue with passing nulls from Java/Scala to Python. I'll update this thread once I've merged that PR.
Is there a way to hand this off to anyone interested in finishing it and cleaning it up? My goal was merely to get it working for my application (accomplished!) and share the (unfinished) solution as a starting point.
@ericgarcia If you open a new issue on the Spark JIRA and link to this pull request, I'm sure somebody will be able to pick this up and finish it. Thanks!
Great! If you don't plan to work on this anytime soon, could you close this PR? Thanks!
@kanzhang @JoshRosen do you mind if I give it a try?
@kanzhang Go ahead; I've assigned the JIRA to you. |
This is an example showing how to map an Avro file to a PySpark RDD.

Start PySpark with:

```
SPARK_CLASSPATH=examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop1.0.4.jar IPYTHON=1 bin/pyspark
```

Note that not all data types are implemented. Note also that null values from `UNION(NULL, ...)` fields will show up as the empty string `""` in Python.
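Because nulls arrive as empty strings, a consumer of this example may want to map them back to `None` on the Python side after loading the RDD. A minimal, hypothetical post-processing sketch (the field names are illustrative, not from the PR):

```python
def restore_nulls(record):
    """Replace the empty-string placeholders produced by the Scala
    converter with Python None, approximating the original
    Avro UNION(NULL, ...) semantics."""
    return {k: (None if v == "" else v) for k, v in record.items()}

record = {"name": "alice", "nickname": ""}
print(restore_nulls(record))  # {'name': 'alice', 'nickname': None}
```

Note the obvious caveat: this cannot distinguish a genuine empty string from an encoded null, which is exactly the limitation discussed in the review comments above.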