[SPARK-2953] Allow using short names for io compression codecs #1873

rxin · 2014-08-10T06:29:06Z

Instead of requiring "org.apache.spark.io.LZ4CompressionCodec", it is easier for users if Spark just accepts "lz4", "lzf", "snappy".
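As a rough illustration of the idea (not the code in this patch), short-name resolution can be sketched as a lookup table that falls back to the input when no short name matches, so fully qualified class names keep working. The class name `CodecNames`, the method `resolve`, and the case-insensitive matching are assumptions made for this sketch; only the three short names and their class names come from this PR.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Spark's API.
public final class CodecNames {
    // Short names from the PR description, mapped to the codec class names
    // listed in docs/configuration.md.
    private static final Map<String, String> SHORT_NAMES = Map.of(
        "lz4", "org.apache.spark.io.LZ4CompressionCodec",
        "lzf", "org.apache.spark.io.LZFCompressionCodec",
        "snappy", "org.apache.spark.io.SnappyCompressionCodec");

    private CodecNames() {}

    // Returns the class name for a known short name; any other input is
    // returned unchanged so that a fully qualified class name still works.
    // Lower-casing the lookup key is an assumption of this sketch.
    public static String resolve(String name) {
        return SHORT_NAMES.getOrDefault(name.toLowerCase(Locale.ROOT), name);
    }

    public static void main(String[] args) {
        System.out.println(resolve("lz4"));
        System.out.println(resolve("org.apache.spark.io.LZFCompressionCodec"));
    }
}
```

Falling back to the unchanged input means existing configurations that already name a class are unaffected.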

SparkQA · 2014-08-10T06:34:47Z

QA tests have started for PR 1873. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18273/consoleFull

SparkQA · 2014-08-10T07:23:54Z

QA results for PR 1873:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18273/consoleFull

mateiz · 2014-08-10T23:03:18Z

docs/configuration.md

  <td>
    The codec used to compress internal data such as RDD partitions and shuffle outputs.
    By default, Spark provides three codecs:  <code>org.apache.spark.io.LZ4CompressionCodec</code>,
    <code>org.apache.spark.io.LZFCompressionCodec</code>,
-    and <code>org.apache.spark.io.SnappyCompressionCodec</code>.
+    and <code>org.apache.spark.io.SnappyCompressionCodec</code>. You can also use the short form: <code>lz4</code>, <code>lzf</code>, and <code>snappy</code>.


You should probably just list the short names first, and then say "you can alternatively list a class name".


rxin · 2014-08-13T04:48:46Z

I updated the documentation to put the short forms first before fully qualified class names.
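For illustration, a configuration entry using the two forms might look like the fragment below. The property name `spark.io.compression.codec` is not quoted anywhere in this thread; it is assumed here to be the setting whose documentation row is being edited.

```
# spark-defaults.conf (illustrative fragment)
# Short form, listed first in the updated documentation:
spark.io.compression.codec    lz4
# Fully qualified class name, still accepted as an alternative:
# spark.io.compression.codec  org.apache.spark.io.LZ4CompressionCodec
```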

SparkQA · 2014-08-13T04:54:59Z

QA tests have started for PR 1873. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18411/consoleFull

SparkQA · 2014-08-13T05:47:11Z

QA results for PR 1873:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18411/consoleFull

mateiz · 2014-08-13T05:50:05Z

Looks good to me

rxin · 2014-08-13T05:50:15Z

Thanks. Merging in master & branch-1.1.

Instead of requiring "org.apache.spark.io.LZ4CompressionCodec", it is easier for users if Spark just accepts "lz4", "lzf", "snappy".

Author: Reynold Xin <[email protected]>

Closes #1873 from rxin/compressionCodecShortForm and squashes the following commits:

9f50962 [Reynold Xin] Specify short-form compression codec names first.
63f78ee [Reynold Xin] Updated configuration documentation.
47b3848 [Reynold Xin] [SPARK-2953] Allow using short names for io compression codecs

(cherry picked from commit 676f982)
Signed-off-by: Reynold Xin <[email protected]>

…ting ParquetFile in SQLContext

There are 4 different compression codecs available for `ParquetOutputFormat`; in Spark SQL, the codec was set as a hard-coded value in `ParquetRelation.defaultCompression`. Original discussion: #195 (diff)

I added a new config property in SQLConf to allow users to change this compression codec, and I used a short-name syntax similar to the one described in SPARK-2953 #1873 (https://github.com/apache/spark/pull/1873/files#diff-0).

By the way, which codec should we use as the default? It was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but I think we should change it to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469, #1415), and parquet-mr supports the Snappy codec natively (https://github.com/Parquet/parquet-mr/commit/e440108de57199c12d66801ca93804086e7f7632).

Author: chutium <[email protected]>

Closes #2039 from chutium/parquet-compression and squashes the following commits:

2f44964 [chutium] [SPARK-3131][SQL] parquet compression default codec set to snappy, also in test suite
e578e21 [chutium] [SPARK-3131][SQL] compression codec config property name and default codec set to snappy
21235dc [chutium] [SPARK-3131][SQL] Allow user to set parquet compression codec for writing ParquetFile in SQLContext

…ting ParquetFile in SQLContext

There are 4 different compression codecs available for `ParquetOutputFormat`; in Spark SQL, the codec was set as a hard-coded value in `ParquetRelation.defaultCompression`. Original discussion: #195 (diff)

I added a new config property in SQLConf to allow users to change this compression codec, and I used a short-name syntax similar to the one described in SPARK-2953 #1873 (https://github.com/apache/spark/pull/1873/files#diff-0).

By the way, which codec should we use as the default? It was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but I think we should change it to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469, #1415), and parquet-mr supports the Snappy codec natively (https://github.com/Parquet/parquet-mr/commit/e440108de57199c12d66801ca93804086e7f7632).

Author: chutium <[email protected]>

Closes #2039 from chutium/parquet-compression and squashes the following commits:

2f44964 [chutium] [SPARK-3131][SQL] parquet compression default codec set to snappy, also in test suite
e578e21 [chutium] [SPARK-3131][SQL] compression codec config property name and default codec set to snappy
21235dc [chutium] [SPARK-3131][SQL] Allow user to set parquet compression codec for writing ParquetFile in SQLContext

(cherry picked from commit 8856c3d)
Signed-off-by: Michael Armbrust <[email protected]>


mateiz reviewed Aug 10, 2014

rxin added 3 commits August 12, 2014 21:46

[SPARK-2953] Allow using short names for io compression codecs

47b3848

Instead of requiring "org.apache.spark.io.LZ4CompressionCodec", it is easier for users if Spark just accepts "lz4", "lzf", "snappy".

Updated configuration documentation.

63f78ee

Specify short-form compression codec names first.

9f50962

asfgit closed this in 676f982 Aug 13, 2014

rxin deleted the compressionCodecShortForm branch August 13, 2014 05:54

chutium mentioned this pull request Aug 21, 2014

[SPARK-3131][SQL] Allow user to set parquet compression codec for writing ParquetFile in SQLContext #2039

Closed
