[SPARK-2736] PySpark converter and example script for reading Avro files #1916

Closed
wants to merge 7 commits into from

kanzhang
Contributor

JIRA: https://issues.apache.org/jira/browse/SPARK-2736

This patch includes:

  1. An Avro converter that converts Avro data types to Python. It handles all 3 Avro data mappings (Generic, Specific and Reflect).
  2. An example Python script for reading Avro files using AvroKeyInputFormat and the converter.
  3. Fixing a classloading issue.
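
For readers who want a feel for item 2, here is a minimal sketch of reading Avro records from PySpark through AvroKeyInputFormat and the new converter. It is not the script from this patch. The helper name `read_avro_records` is hypothetical, the converter's package prefix is assumed from the Spark examples layout, and the `avro.schema.input.key` configuration key is assumed to be the way a reader schema is supplied. `sc` is an existing SparkContext, and the examples jar has to be visible to the JVM.

```python
# Converter class added by this PR; the package prefix is an assumption.
AVRO_KEY_CONVERTER = (
    "org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter"
)


def read_avro_records(sc, path, reader_schema_json=None):
    """Return an RDD of Python objects decoded from the Avro file(s) at `path`."""
    conf = None
    if reader_schema_json is not None:
        # Optional reader schema, passed through the Hadoop job configuration
        # (assumed key name).
        conf = {"avro.schema.input.key": reader_schema_json}
    pairs = sc.newAPIHadoopFile(
        path,
        "org.apache.avro.mapreduce.AvroKeyInputFormat",
        "org.apache.avro.mapred.AvroKey",
        "org.apache.hadoop.io.NullWritable",
        keyConverter=AVRO_KEY_CONVERTER,
        conf=conf,
    )
    # Each element is a (key, value) pair: the record is the key and the
    # value is a null placeholder, so keep only the key.
    return pairs.map(lambda kv: kv[0])
```

Calling `read_avro_records(sc, "users.avro").collect()` would then return the decoded records as Python dicts, assuming the file holds Avro records.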

cc @MLnick @JoshRosen @mateiz

@SparkQA

SparkQA commented Aug 13, 2014

QA tests have started for PR 1916. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18407/consoleFull

@SparkQA

SparkQA commented Aug 13, 2014

QA results for PR 1916:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class AvroWrapperToJavaConverter extends Converter[Any, Any] {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18407/consoleFull

@kanzhang
Contributor Author

@MLnick I put together a draft for our blog at https://github.com/kanzhang/pyspark-converter-examples. Maybe we could use it as a starting point?

@JoshRosen @mateiz Let us know if you have any comments or suggestions on the above draft.

@kanzhang
Contributor Author

SparkQA complains that examples/src/main/resources/user.avsc doesn't have an Apache license header. Other files in the same dir don't have one either. How can I exclude user.avsc from license checking?

@kanzhang
Contributor Author

@ericgarcia it would be great if you could check whether this patch works for you.

@mateiz
Contributor

mateiz commented Aug 13, 2014

Thanks for writing this up! Do you guys want to include this in 1.1? It's somewhat late in that the merge window for that has closed, but if you think it will be useful, we can probably still add it. The one part I'd like explained then is why Utils.classForName had to be used -- is it because some user classes are added at runtime and come through the Spark class loader?

@mateiz
Contributor

mateiz commented Aug 13, 2014

BTW the blog post itself looks good but perhaps more technical than you might want for a general audience. I'd suggest starting it off with some simple examples that use SequenceFiles, then adding the Avro stuff later. It's also kind of unfortunate that understanding the Avro example requires understanding Avro and its various intricacies (i.e. the generic/specific/reflect stuff, the various data types in your match statement, etc). If you wanted a simpler example you could use HBase or Cassandra there, but Avro is fine too.
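
To make the "match statement" point concrete: a converter of this kind dispatches on the Avro schema type of each value. The sketch below is a rough Python analogue and not the Scala converter from this patch. It assumes the schema has already been parsed from JSON into dicts, lists and strings, and it ignores named-type references and logical types. The names `unpack` and `_matches` are hypothetical.

```python
def _matches(datum, schema):
    """Return True if `datum` is acceptable for this (non-union) schema."""
    kind = schema if isinstance(schema, str) else schema["type"]
    if kind == "null":
        return datum is None
    if kind == "boolean":
        return isinstance(datum, bool)
    if kind in ("int", "long"):
        return isinstance(datum, int) and not isinstance(datum, bool)
    if kind in ("float", "double"):
        return isinstance(datum, float)
    if kind in ("string", "enum"):
        return isinstance(datum, str)
    if kind in ("bytes", "fixed"):
        return isinstance(datum, (bytes, bytearray))
    if kind in ("record", "map"):
        return isinstance(datum, dict)
    if kind == "array":
        return isinstance(datum, list)
    return False


def unpack(datum, schema):
    """Convert a decoded Avro datum into plain Python values, guided by its schema."""
    if isinstance(schema, list):
        # Union: use the first branch that accepts the datum.
        for branch in schema:
            if _matches(datum, branch):
                return unpack(datum, branch)
        raise ValueError(f"no union branch matches {datum!r}")
    kind = schema if isinstance(schema, str) else schema["type"]
    if kind == "record":
        return {f["name"]: unpack(datum[f["name"]], f["type"])
                for f in schema["fields"]}
    if kind == "array":
        return [unpack(item, schema["items"]) for item in datum]
    if kind == "map":
        return {str(k): unpack(v, schema["values"]) for k, v in datum.items()}
    if kind in ("string", "enum"):
        return str(datum)
    if kind in ("bytes", "fixed"):
        return bytes(datum)
    # null, boolean, int, long, float and double pass through unchanged.
    return datum
```

Most of the branches exist only because Avro has many schema types, which is the intricacy a general-audience post would have to explain first.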

@MLnick
Contributor

MLnick commented Aug 13, 2014

@kanzhang thanks for putting this together, and for the 1st draft of the blog post.

@mateiz I'd say this is not necessary for 1.1 as it's intended to be a more advanced example rather than core code - however if the Utils.classForName is a bug or issue we should perhaps fix that in a separate PR before 1.1?

I will go through the blog post and make some suggestions - from a quick read through it looks good though I agree with Matei that it could perhaps be made slightly more general.

@mateiz
Contributor

mateiz commented Aug 13, 2014

Yeah, if Utils.classForName is fixing a bug then we should definitely add it into 1.1.

@SparkQA

SparkQA commented Aug 13, 2014

QA tests have started for PR 1916. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18464/consoleFull

@SparkQA

SparkQA commented Aug 13, 2014

QA results for PR 1916:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class AvroWrapperToJavaConverter extends Converter[Any, Any] {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18464/consoleFull

@kanzhang
Contributor Author

@mateiz the class loading issue I encountered came up when I was testing with the PySpark shell. I used bin/pyspark --jars examples-jar to add the examples jar, but it still couldn't find AvroWrapperToJavaConverter. When I changed the lookup to use Utils.classForName, it found the class. Note that this issue doesn't appear if I use the bin/spark-submit --driver-class-path examples-jar ./examples/src/main/python/avro_inputformat.py command.
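
One way to picture the behaviour described above is a toy model of parent-first class loader delegation. It assumes (this is not confirmed in the thread) that jars passed with --jars end up on a child class loader, while --driver-class-path puts them on the classpath the JVM started with. `ToyClassLoader`, `can_load` and the loader names are hypothetical and only illustrate the lookup order; they are not Spark or JVM APIs.

```python
class ToyClassLoader:
    """Toy stand-in for a JVM class loader with parent-first delegation."""

    def __init__(self, name, class_names, parent=None):
        self.name = name
        self.class_names = set(class_names)
        self.parent = parent

    def load(self, class_name):
        # Ask the parent first, then fall back to this loader's own classes.
        if self.parent is not None:
            try:
                return self.parent.load(class_name)
            except LookupError:
                pass
        if class_name in self.class_names:
            return f"{class_name} loaded by {self.name}"
        raise LookupError(f"class not found: {class_name}")


def can_load(loader, class_name):
    """Return True if `loader` (or one of its ancestors) can find the class."""
    try:
        loader.load(class_name)
        return True
    except LookupError:
        return False


# Hypothetical setup: Spark's own classes sit on the startup classpath,
# while the examples jar is attached later to a child loader.
startup = ToyClassLoader("startup-classpath", {"PythonRDD"})
runtime = ToyClassLoader(
    "runtime-jars", {"AvroWrapperToJavaConverter"}, parent=startup
)
```

In this model a lookup that starts from `startup` cannot see the converter, while a lookup that starts from `runtime` finds both classes. That would be consistent with the observation that switching the lookup to Utils.classForName fixed the shell case.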

I personally would like to see this patch in 1.1, but I'm ok if it's not.

Thanks for your feedback on the draft. @MLnick feel free to check out the draft and modify it any way you like.

Btw, any suggestion on what I can do to get SparkQA going?

@kanzhang
Contributor Author

FYI, this class loading issue has already been reported by users; see http://apache-spark-user-list.1001560.n3.nabble.com/Using-Hadoop-InputFormat-in-Python-tp12067p12092.html

We should try to squeeze this fix into 1.1 if possible.

@mateiz
Contributor

mateiz commented Aug 14, 2014

Sure, let's actually put all of this in 1.1. I didn't realize that the rest was in examples -- that's much better than putting it as an official API. Of course later we might make it an official API.

@mateiz
Contributor

mateiz commented Aug 14, 2014

BTW Jenkins had some issues recently, let me retry it

@mateiz
Contributor

mateiz commented Aug 14, 2014

Jenkins, add to whitelist and test this please

@JoshRosen
Contributor

FWIW, I don't think our test suites actually exercise our examples. We should probably manually test this, along with the other examples, during our release QA process.

@SparkQA

SparkQA commented Aug 14, 2014

QA tests have started for PR 1916. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18569/consoleFull

@SparkQA

SparkQA commented Aug 14, 2014

QA results for PR 1916:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class AvroWrapperToJavaConverter extends Converter[Any, Any] {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18569/consoleFull

@JoshRosen
Contributor

This test failed because RAT couldn't find license headers in your example avro file:

Could not find Apache license headers in the following files:
 !????? /home/jenkins/workspace/SparkPullRequestBuilder@2/examples/src/main/resources/user.avsc

I'd add *.avsc to .rat-excludes so that RAT ignores avro files.

@kanzhang
Contributor Author

> I'd add *.avsc to .rat-excludes so that RAT ignores avro files.

Ah, I see. I was trying to locate this exclude file but didn't find it. Thx.

> We should probably manually test this, along with the other examples, during our release QA process.

+1. Unfortunately, we don't have a framework for adding unit tests for examples right now. When that happens, I'm happy to add a couple of tests.

@JoshRosen
Contributor

Jenkins, retest this please.

@SparkQA

SparkQA commented Aug 14, 2014

QA tests have started for PR 1916. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18578/consoleFull

@SparkQA

SparkQA commented Aug 15, 2014

QA results for PR 1916:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class AvroWrapperToJavaConverter extends Converter[Any, Any] {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18578/consoleFull

@mateiz
Contributor

mateiz commented Aug 15, 2014

Alright, going to merge this. Thanks for putting this together!

@asfgit asfgit closed this in 9422a9b Aug 15, 2014
asfgit pushed a commit that referenced this pull request Aug 15, 2014
JIRA: https://issues.apache.org/jira/browse/SPARK-2736

This patch includes:
1. An Avro converter that converts Avro data types to Python. It handles all 3 Avro data mappings (Generic, Specific and Reflect).
2. An example Python script for reading Avro files using AvroKeyInputFormat and the converter.
3. Fixing a classloading issue.

cc @MLnick @JoshRosen @mateiz

Author: Kan Zhang <[email protected]>

Closes #1916 from kanzhang/SPARK-2736 and squashes the following commits:

02443f8 [Kan Zhang] [SPARK-2736] Adding .avsc files to .rat-excludes
f74e9a9 [Kan Zhang] [SPARK-2736] nit: clazz -> className
82cc505 [Kan Zhang] [SPARK-2736] Update data sample
0be7761 [Kan Zhang] [SPARK-2736] Example pyspark script and data files
c8e5881 [Kan Zhang] [SPARK-2736] Trying to work with all 3 Avro data models
2271a5b [Kan Zhang] [SPARK-2736] Using the right class loader to find Avro classes
536876b [Kan Zhang] [SPARK-2736] Adding Avro to Java converter

(cherry picked from commit 9422a9b)
Signed-off-by: Matei Zaharia <[email protected]>
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
@kanzhang kanzhang deleted the SPARK-2736 branch December 12, 2014 01:35