
Example pyspark-inputformat for Avro file format #1536

Closed · wants to merge 2 commits

Conversation

@ericgarcia

This is an example showing how to load an Avro file into a PySpark RDD.

Start PySpark with:
SPARK_CLASSPATH=examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop1.0.4.jar IPYTHON=1 bin/pyspark

avroRdd = sc.newAPIHadoopFile("/tmp/data.avro", 
  "org.apache.avro.mapreduce.AvroKeyInputFormat", 
  "org.apache.avro.mapred.AvroKey", 
  "org.apache.hadoop.io.NullWritable",
  keyConverter="org.apache.spark.examples.pythonconverters.AvroGenericConverter")

Note that not all data types are implemented.

Note also that null values from UNION(NULL, ...) fields will show up as the empty string "" in Python.
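Since the converter encodes nulls as empty strings, a small post-processing step on the Python side can restore real `None` values for fields known to be nullable. A minimal sketch, assuming records come back as plain dicts and that you know which fields are `UNION(NULL, ...)` (the helper name and shape are hypothetical, not part of this PR):

```python
def restore_nulls(record, nullable_fields):
    """Undo the converter's ""-for-null encoding.

    `record` is a dict as produced by collecting the converted RDD;
    `nullable_fields` is the set of UNION(NULL, ...) field names.
    (Hypothetical helper -- illustrates the workaround only.)
    """
    return {
        key: (None if key in nullable_fields and value == "" else value)
        for key, value in record.items()
    }

# Example with a record shaped like the converter's output:
rec = {"name": "alice", "nickname": ""}
print(restore_nulls(rec, {"nickname"}))
# {'name': 'alice', 'nickname': None}
```

Note this cannot distinguish a genuine empty string from an encoded null, which is exactly why the reviewers below push for passing real nulls through instead.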

@AmplabJenkins

Can one of the admins verify this patch?

@MLnick (Contributor)

MLnick commented Jul 22, 2014

@ericgarcia thanks for submitting this, looks good and will be helpful for others wanting to use Avro from PySpark.

Would you mind adding the Python example code to the PR itself? (Perhaps just in the comments of AvroGenericConverter would be fine.)

case v:java.lang.Float => value.asInstanceOf[java.lang.Float]
case v:java.lang.Integer => value.asInstanceOf[java.lang.Integer]
case v:java.lang.Long => value.asInstanceOf[java.lang.Long]
case v:java.lang.String => value.asInstanceOf[java.lang.String]
Contributor

How about adding case null => null for null handling?

Author

@kanzhang I couldn't get a Java/Scala null passed into Python; it crashed when I tried, so I encoded it as an empty string instead. If someone else can figure this out, that would be cool.

Contributor

One thing you could try, to avoid a NullPointerException, is to wrap any intermediate value that could be null in Option(value), and only call .orNull on it at the end of the transformation to get the value out.
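The suggestion above is Scala-specific (Option absorbs null so chained operations never dereference it), but the same shape can be sketched in Python, where None plays the role of Scala's None/null. A rough analog, with hypothetical helper names, just to illustrate the pattern being proposed:

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
U = TypeVar("U")

def map_opt(value: Optional[T], f: Callable[[T], U]) -> Optional[U]:
    """Python sketch of Scala's Option(value).map(f): apply f only when
    a value is present, so a null/None intermediate never raises."""
    return None if value is None else f(value)

# A conversion pipeline where any intermediate may be missing; the final
# return of the Optional value is the ".orNull" moment.
raw = None
print(map_opt(map_opt(raw, str), str.upper))
# None

print(map_opt(map_opt("hi", str), str.upper))
# HI
```

The design point is the same in both languages: absorb the missing value at every intermediate step instead of letting it escape as an exception mid-conversion.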

Contributor

I think that #1551 will address this issue with passing nulls from Java/Scala to Python. I'll update this thread once I've merged that PR.

@ericgarcia (Author)

Is there a way to hand this off to anyone interested in finishing and cleaning it up? My goal was merely to get it working for my application (accomplished!) and to share the (unfinished) solution as a starting point.

@JoshRosen (Contributor)

@ericgarcia If you open a new issue on the Spark JIRA and link to this pull request, I'm sure somebody will be able to pick this up and finish it. Thanks!

@ericgarcia (Author)

@JoshRosen (Contributor)

Great! If you don't plan to work on this anytime soon, could you close this PR? Thanks!

ericgarcia closed this Jul 30, 2014
@kanzhang (Contributor)

@ericgarcia @JoshRosen do you mind if I give it a try?

@JoshRosen (Contributor)

@kanzhang Go ahead; I've assigned the JIRA to you.

sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023