
Example pyspark-inputformat for Avro file format #1536

Closed · wants to merge 2 commits

Conversation

@ericgarcia

This is an example showing how to load an Avro file into a PySpark RDD.

Start PySpark with:
SPARK_CLASSPATH=examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop1.0.4.jar IPYTHON=1 bin/pyspark

avroRdd = sc.newAPIHadoopFile("/tmp/data.avro", 
  "org.apache.avro.mapreduce.AvroKeyInputFormat", 
  "org.apache.avro.mapred.AvroKey", 
  "org.apache.hadoop.io.NullWritable",
  keyConverter="org.apache.spark.examples.pythonconverters.AvroGenericConverter")

Note that not all data types are implemented.

Note also that null values from UNION(NULL, ...) fields will show up as the empty string "" in Python.
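Since the converter encodes nulls as empty strings, a small post-processing step on the Python side can restore real `None` values for fields known to be nullable. A minimal sketch, assuming records come back as plain dicts and that you know which fields are `UNION(NULL, ...)` (the helper name and shape are hypothetical, not part of this PR):

```python
def restore_nulls(record, nullable_fields):
    """Undo the converter's ""-for-null encoding.

    `record` is a dict as produced by collecting the converted RDD;
    `nullable_fields` is the set of UNION(NULL, ...) field names.
    (Hypothetical helper -- illustrates the workaround only.)
    """
    return {
        key: (None if key in nullable_fields and value == "" else value)
        for key, value in record.items()
    }

# Example with a record shaped like the converter's output:
rec = {"name": "alice", "nickname": ""}
print(restore_nulls(rec, {"nickname"}))
# {'name': 'alice', 'nickname': None}
```

Note this cannot distinguish a genuine empty string from an encoded null, which is exactly why the reviewers below push for passing real nulls through instead.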

@AmplabJenkins

Can one of the admins verify this patch?

@MLnick (Contributor)

MLnick commented Jul 22, 2014

@ericgarcia thanks for submitting this, looks good and will be helpful for others wanting to use Avro from PySpark.

Would you mind adding the Python example code to the PR itself? (Perhaps just in the comments of AvroGenericConverter would be fine.)

case v:java.lang.Float => value.asInstanceOf[java.lang.Float]
case v:java.lang.Integer => value.asInstanceOf[java.lang.Integer]
case v:java.lang.Long => value.asInstanceOf[java.lang.Long]
case v:java.lang.String => value.asInstanceOf[java.lang.String]
Contributor

How about adding case null => null for null handling?

Author

@kanzhang I couldn't get a Java/Scala null passed into Python; it crashed when I tried, so I encoded it as an empty string instead. If someone else can figure this out, that would be cool.

Contributor

One thing you could try, to avoid a NullPointerException, is to wrap any intermediate value that could be null in Option(value), and only call .orNull on it at the end of the transformation to get the value out.
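The suggestion above is Scala-specific (Option absorbs null so chained operations never dereference it), but the same shape can be sketched in Python, where None plays the role of Scala's None/null. A rough analog, with hypothetical helper names, just to illustrate the pattern being proposed:

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
U = TypeVar("U")

def map_opt(value: Optional[T], f: Callable[[T], U]) -> Optional[U]:
    """Python sketch of Scala's Option(value).map(f): apply f only when
    a value is present, so a null/None intermediate never raises."""
    return None if value is None else f(value)

# A conversion pipeline where any intermediate may be missing; the final
# return of the Optional value is the ".orNull" moment.
raw = None
print(map_opt(map_opt(raw, str), str.upper))
# None

print(map_opt(map_opt("hi", str), str.upper))
# HI
```

The design point is the same in both languages: absorb the missing value at every intermediate step instead of letting it escape as an exception mid-conversion.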

Contributor

I think that #1551 will address this issue with passing nulls from Java/Scala to Python. I'll update this thread once I've merged that PR.

@ericgarcia (Author)

Is there a way to hand this off to anyone interested in finishing and cleaning it up? My goal was merely to get it working for my application (accomplished!) and to share the (unfinished) solution as a starting point.

@JoshRosen (Contributor)

@ericgarcia If you open a new issue on the Spark JIRA and link to this pull request, I'm sure somebody will be able to pick this up and finish it. Thanks!

@ericgarcia (Author)

@JoshRosen (Contributor)

Great! If you don't plan to work on this anytime soon, could you close this PR? Thanks!

ericgarcia closed this Jul 30, 2014
@kanzhang (Contributor)

@ericgarcia @JoshRosen do you mind if I give it a try?

@JoshRosen (Contributor)

@kanzhang Go ahead; I've assigned the JIRA to you.

sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023