[SPARK-2736] PySpark converter and example script for reading Avro files #1916
QA tests have started for PR 1916. This patch merges cleanly. |
QA results for PR 1916: |
@MLnick I put together a draft for our blog at https://github.com/kanzhang/pyspark-converter-examples. Maybe we could use it as a starting point? @JoshRosen @mateiz Let us know if you have any comments or suggestions on the draft. |
SparkQA complains examples/src/main/resources/user.avsc doesn't have an Apache license. I saw other files in the same dir don't have it either. How can I exclude user.avsc from license checking? |
@ericgarcia would be great if you could check if this patch works for you? |
Thanks for writing this up! Do you guys want to include this in 1.1? It's somewhat late in that the merge window for that has closed, but if you think it will be useful, we can probably still add it. The one part I'd like explained then is why Utils.classForName had to be used -- is it because some user classes are added at runtime and come through the Spark class loader? |
BTW the blog post itself looks good but perhaps more technical than you might want for a general audience. I'd suggest starting it off with some simple examples that use SequenceFiles, then adding the Avro stuff later. It's also kind of unfortunate that understanding the Avro example requires understanding Avro and its various intricacies (i.e. the generic/specific/reflect stuff, the various data types in your match statement, etc). If you wanted a simpler example you could use HBase or Cassandra there, but Avro is fine too. |
@kanzhang thanks for putting this together, and for the 1st draft of the blog post. @mateiz I'd say this is not necessary for 1.1 as it's intended to be a more advanced example rather than core code - however if the I will go through the blog post and make some suggestions - from a quick read through it looks good though I agree with Matei that it could perhaps be made slightly more general. |
Yeah, if Utils.classForName is fixing a bug then we should definitely add it into 1.1. |
@mateiz the class loading issue I encountered was when I was testing with PySpark shell. I used I personally would like to see this patch in 1.1, but I'm ok if it's not. Thanks for your feedback on the draft. @MLnick feel free to checkout the draft and modify it any way you like. Btw, any suggestion on what I can do to get SparkQA going? |
FYI, this class loading issue has already been reported by users; see http://apache-spark-user-list.1001560.n3.nabble.com/Using-Hadoop-InputFormat-in-Python-tp12067p12092.html. We should try to squeeze this fix into 1.1 if possible. |
Sure, let's actually put all of this in 1.1. I didn't realize that the rest was in examples -- that's much better than putting it as an official API. Of course later we might make it an official API. |
BTW Jenkins had some issues recently, let me retry it |
Jenkins, add to whitelist and test this please |
FWIW, I don't think our test suites actually exercise our examples. We should probably manually test this, along with the other examples, during our release QA process. |
This test failed because RAT couldn't find license headers in your example avro file:
I'd add |
Ah, I see. I was trying to locate this exclude file but didn't find it. Thx.
+1. Unfortunately, we don't have a framework for adding unit tests for examples right now. When that happens, I'm happy to add a couple of tests. |
Jenkins, retest this please. |
Alright, going to merge this. Thanks for putting this together! |
JIRA: https://issues.apache.org/jira/browse/SPARK-2736

This patch includes:
1. An Avro converter that converts Avro data types to Python. It handles all 3 Avro data mappings (Generic, Specific and Reflect).
2. An example Python script for reading Avro files using AvroKeyInputFormat and the converter.
3. Fixing a classloading issue.

cc @MLnick @JoshRosen @mateiz

Author: Kan Zhang <[email protected]>

Closes #1916 from kanzhang/SPARK-2736 and squashes the following commits:

02443f8 [Kan Zhang] [SPARK-2736] Adding .avsc files to .rat-excludes
f74e9a9 [Kan Zhang] [SPARK-2736] nit: clazz -> className
82cc505 [Kan Zhang] [SPARK-2736] Update data sample
0be7761 [Kan Zhang] [SPARK-2736] Example pyspark script and data files
c8e5881 [Kan Zhang] [SPARK-2736] Trying to work with all 3 Avro data models
2271a5b [Kan Zhang] [SPARK-2736] Using the right class loader to find Avro classes
536876b [Kan Zhang] [SPARK-2736] Adding Avro to Java converter

(cherry picked from commit 9422a9b)
Signed-off-by: Matei Zaharia <[email protected]>
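
For readers who want a feel for item 2, here is a minimal sketch of how an Avro file might be read from PySpark through AvroKeyInputFormat and a JVM-side key converter. It is not the script from this patch. The helper names `avro_read_kwargs` and `read_avro` are hypothetical, and the converter class name and the `avro.schema.input.key` setting are assumptions about how the example is wired up; only `SparkContext.newAPIHadoopFile` and its keyword names are taken from the PySpark API.

```python
# Sketch only: class names are passed to the JVM as plain strings, so this
# module does not need to import pyspark itself.
AVRO_INPUT_FORMAT = "org.apache.avro.mapreduce.AvroKeyInputFormat"
AVRO_KEY_CLASS = "org.apache.avro.mapred.AvroKey"
NULL_WRITABLE_CLASS = "org.apache.hadoop.io.NullWritable"

# Assumption: the converter added to the Spark examples module has this name.
AVRO_KEY_CONVERTER = (
    "org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter"
)

# Assumption: Hadoop configuration key that carries an optional reader schema.
READER_SCHEMA_KEY = "avro.schema.input.key"


def avro_read_kwargs(path, reader_schema_json=None):
    """Build keyword arguments for SparkContext.newAPIHadoopFile (hypothetical helper)."""
    conf = None
    if reader_schema_json is not None:
        # The reader schema is handed over as a JSON string.
        conf = {READER_SCHEMA_KEY: reader_schema_json}
    return {
        "path": path,
        "inputFormatClass": AVRO_INPUT_FORMAT,
        "keyClass": AVRO_KEY_CLASS,
        "valueClass": NULL_WRITABLE_CLASS,
        "keyConverter": AVRO_KEY_CONVERTER,
        "conf": conf,
    }


def read_avro(sc, path, reader_schema_json=None):
    """Return an RDD of converted Avro records (hypothetical helper).

    `sc` is an existing SparkContext. Each element coming back from the
    input format is a (key, value) pair; the value is only a NullWritable
    placeholder, so just the converted key is kept.
    """
    rdd = sc.newAPIHadoopFile(**avro_read_kwargs(path, reader_schema_json))
    return rdd.map(lambda kv: kv[0])
```

Because the converter runs on the JVM side, the jar that contains it has to be on the classpath when the script is submitted; otherwise the class name string cannot be resolved. Calling `read_avro(sc, "users.avro").collect()` with a hypothetical `users.avro` file would then return the records as plain Python objects.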