
[SPARK-20694][DOCS][SQL] Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide #17938

Closed
wants to merge 20 commits into from

Conversation

zero323
Member

@zero323 zero323 commented May 10, 2017

What changes were proposed in this pull request?

  • Add Scala, Python and Java examples for partitionBy, sortBy and bucketBy.
  • Add Bucketing, Sorting and Partitioning section to SQL Programming Guide
  • Remove bucketing from Unsupported Hive Functionalities.

How was this patch tested?

Manual tests, docs build.

@SparkQA

SparkQA commented May 10, 2017

Test build #76748 has finished for PR 17938 at commit 20c7ca6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

@zero323, what do you think about opening a JIRA or turning this into a follow-up for your previous PR? I know it is a doc fix but it sounds like a pretty important and non-trivial fix.

@zero323 zero323 changed the title [DOCS][SQL] Document bucketing and partitioning in SQL guide [SPARK-20694][DOCS][SQL] Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide May 10, 2017
@zero323
Member Author

zero323 commented May 10, 2017

@HyukjinKwon Sounds good. SPARK-20694.

Should we document the difference between buckets (metastore based) and partitions (file system based)? The latter could be done by referencing Partition Discovery.
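The contrast being asked about here can be sketched in plain Python. The helpers and file names below are hypothetical, for illustration only; this is not Spark's actual writer code.

```python
# Sketch: partitionBy is file-system based (one directory per distinct
# value), while bucketBy is metastore based (rows hashed into a fixed
# number of buckets). Helper names and paths are hypothetical.

def partition_paths(rows, partition_col):
    """partitionBy-style layout: one key=value directory per distinct value."""
    return sorted({f"{partition_col}={r[partition_col]}/part-00000" for r in rows})

def bucket_names(rows, bucket_col, num_buckets):
    """bucketBy-style layout: a bounded number of bucket files."""
    return sorted({f"part-00000_{hash(r[bucket_col]) % num_buckets:05d}" for r in rows})

rows = [{"name": "alice", "city": "NYC"},
        {"name": "bob", "city": "SF"},
        {"name": "carol", "city": "NYC"}]

# Directory count tracks column cardinality...
assert partition_paths(rows, "city") == ["city=NYC/part-00000", "city=SF/part-00000"]
# ...while the number of bucket files is bounded by the bucket count.
assert len(bucket_names(rows, "name", 4)) <= 4
```

This is also why partitioning suits low-cardinality columns and bucketing suits high-cardinality ones, a point picked up later in this review.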

@HyukjinKwon
Member

(I think I am not supposed to decide this and probably the best is confirmation from a committer)

**Major Hive Features**

* Tables with buckets: bucket is the hash partitioning within a Hive table partition. Spark SQL
doesn't support buckets yet.
Member

We do support buckets, but it is slightly different from Hive. See the ongoing PR: #17644

Could you document the difference too? Thanks!

Contributor

Let's keep this until SPARK-19256 gets resolved.

Contributor

+1

Member

+1

@gatorsmile
Member

@cloud-fan @tejasapatil Could you please help review this PR?

@@ -581,6 +581,46 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat

Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.

### Bucketing, Sorting and Partitioning

For file-based data source it is also possible to bucket and and sort or partition the output.
Contributor

nit: bucket and and sort : double and

@@ -581,6 +581,46 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat

Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.

### Bucketing, Sorting and Partitioning
Contributor

I feel that examples are missing writing to partitioned + bucketed table. eg.

my_dataframe.write.format("orc").partitionBy("i").bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("my_table")

There could be multiple possible orderings of partitionBy, bucketBy and sortBy calls. Not all of them are supported, and not all of them would produce correct outputs. I have not done any exhaustive study of this, but I think it should be documented to guide people using these APIs.

Contributor

shall we emphasize partitioning? I think it's more widely used than bucketing.

Member Author

@tejasapatil

There could be multiple possible orderings of partitionBy, bucketBy and sortBy calls. Not all of them are supported, not all of them would produce correct outputs.

Shouldn't the output be the same no matter the order? sortBy is not applicable for partitionBy and takes precedence over bucketBy, if both are present. This is Hive's behaviour if I am not mistaken, and at first glance Spark is doing the same thing. Is there any gotcha here?

Member Author

@cloud-fan I think we can redirect to partition discovery here. But explaining the difference and possible applications (low vs. high cardinality) could be a good idea.

Contributor

Shouldn't the output be the same no matter the order?

Theoretically yes. Practically I don't know what happens. Since you are documenting, it will be worthwhile to check that and record if it works as expected (or if there is any weirdness).

Member Author

Oh, I thought you were implying there are some known issues. This actually behaves sensibly: all supported options seem to work independently of the order, and unsupported ones (partitionBy + sortBy without bucketBy, or overlapping bucketBy and partitionBy columns) give enough feedback to diagnose the issue.

I haven't tested this with large datasets though, so there can be hidden issues.
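The order-independence observed here can be illustrated with a minimal builder sketch. The `Writer` class below is hypothetical, not Spark's `DataFrameWriter`; it only shows why recording options in setters and validating once at save time makes call order irrelevant.

```python
# Minimal builder sketch: setters only record options, and validation runs
# once at save time, so call order cannot affect the result.
# Hypothetical class, not Spark's DataFrameWriter implementation.

class Writer:
    def __init__(self):
        self._partition = None
        self._bucket = None
        self._sort = None

    def partitionBy(self, *cols):
        self._partition = cols
        return self

    def bucketBy(self, n, *cols):
        self._bucket = (n, cols)
        return self

    def sortBy(self, *cols):
        self._sort = cols
        return self

    def saveAsTable(self, name):
        # Unsupported combinations fail here with a diagnosable error,
        # regardless of the order the setters ran in.
        if self._sort is not None and self._bucket is None:
            raise ValueError("sortBy must be used together with bucketBy")
        if (self._bucket is not None and self._partition is not None
                and set(self._bucket[1]) & set(self._partition)):
            raise ValueError("bucketBy and partitionBy columns must not overlap")
        return (name, self._partition, self._bucket, self._sort)

a = Writer().partitionBy("i").bucketBy(8, "j").sortBy("j").saveAsTable("t")
b = Writer().sortBy("j").bucketBy(8, "j").partitionBy("i").saveAsTable("t")
assert a == b  # same recorded plan regardless of call order
```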



while partitioning can be used with both `save` and `saveAsTable`:
Contributor

like @tejasapatil suggested, we should give one more example about partitioned and bucketed table, so that users know they can use bucketing and partitioning at the same time
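The combined example being requested could look like the following (a sketch in Scala, one of the languages this PR adds examples in; `usersDF` and the column and table names are illustrative, not necessarily the PR's final wording):

```scala
// Sketch: partitioning and bucketing on the same table. partitionBy must
// use a different column set than bucketBy, and sortBy is only valid
// together with bucketBy.
usersDF.write
  .partitionBy("favorite_color")   // low-cardinality column -> directories
  .bucketBy(42, "name")            // fixed number of buckets in the metastore
  .sortBy("name")                  // requires bucketBy
  .saveAsTable("users_partitioned_bucketed")
```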

@zero323 zero323 force-pushed the DOCS-BUCKETING-AND-PARTITIONING branch from 20c7ca6 to a14296a Compare May 11, 2017 14:32
@SparkQA

SparkQA commented May 11, 2017

Test build #76813 has finished for PR 17938 at commit a14296a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323 zero323 force-pushed the DOCS-BUCKETING-AND-PARTITIONING branch from a14296a to 7bf4bbc Compare May 11, 2017 19:21
@SparkQA

SparkQA commented May 11, 2017

Test build #76825 has finished for PR 17938 at commit 7bf4bbc.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 11, 2017

Test build #76830 has finished for PR 17938 at commit 606f1e3.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 11, 2017

Test build #76828 has finished for PR 17938 at commit cc1bfcf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

When you omit USING, it's the Hive-style CREATE TABLE syntax, which is very different from Spark's. We should encourage users to use the Spark-style CREATE TABLE syntax and only document that (with the USING statement).
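For reference, the Spark-style syntax being recommended here looks like the following sketch (table and column names are illustrative):

```sql
-- Spark-style CREATE TABLE: the USING clause selects the Spark syntax.
-- Omitting USING falls back to the Hive-style syntax instead.
CREATE TABLE people_bucketed (
  name STRING,
  favorite_color STRING
) USING parquet
CLUSTERED BY (name) SORTED BY (name) INTO 42 BUCKETS;
```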

@SparkQA

SparkQA commented May 13, 2017

Test build #76899 has finished for PR 17938 at commit b5babf6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323
Member Author

zero323 commented May 13, 2017

@cloud-fan Thanks for the clarification. Just a thought: shouldn't we either support it consistently or not support it at all? The current behaviour is quite confusing and I don't think that documentation alone will cut it.

@SparkQA

SparkQA commented May 13, 2017

Test build #76900 has finished for PR 17938 at commit 92fb3b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

We are going to support bucketing in the Hive-style CREATE TABLE syntax soon.

@gatorsmile
Member

In the current 2.2 docs, we have already updated all the syntax to `CREATE TABLE ... USING ...`. This is the new change delivered in 2.2.

Thus, it is OK to document like what you just committed. Let me review them carefully now. Thanks for your work!

favorite_color STRING,
favorite_NUMBERS array<integer>
) USING parquet
CLUSTERED BY(name) INTO 42 BUCKETS;
Member

To be consistent with the example in the other APIs, it is missing the SORTED BY clause.

Member

Could you please use the same table names people_bucketed with the same column names in the example? Thanks!

Member

@zero323 Could you also resolve this? Thanks!

@@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat

Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.

### Bucketing, Sorting and Partitioning

For file-based data source it is also possible to bucket and sort or partition the output.
Member

Nit, For file-based data source it -> For file-based data source, it

### Bucketing, Sorting and Partitioning

For file-based data source it is also possible to bucket and sort or partition the output.
Bucketing and sorting is applicable only to persistent tables:
Member

is applicable -> are applicable



It is possible to use both partitions and buckets for a single table:
Member

partitions and buckets -> partitioning and bucketing



while partitioning can be used with both `save` and `saveAsTable`:
Member

Nit:

both `save` and `saveAsTable`

->

both `save` and `saveAsTable` when using the Dataset APIs. 


`partitionBy` creates a directory structure as described in the [Partition Discovery](#partition-discovery) section.
Because of that it has limited applicability to columns with high cardinality. In contrast `bucketBy` distributes
Member

@gatorsmile gatorsmile May 14, 2017

Because of that it has -> Thus, it has

Member

In contrast bucketBy distributes -> In contrast, bucketBy distributes


`partitionBy` creates a directory structure as described in the [Partition Discovery](#partition-discovery) section.
Because of that it has limited applicability to columns with high cardinality. In contrast `bucketBy` distributes
data across fixed number of buckets and can be used if a number of unique values is unbounded.
Member

used if -> used when

@gatorsmile
Member

LGTM except a few minor comments.

cc @tejasapatil @cloud-fan

@SparkQA

SparkQA commented May 14, 2017

Test build #76905 has finished for PR 17938 at commit 65ac310.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 14, 2017

Test build #76911 has finished for PR 17938 at commit 3a8b6e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

`partitionBy` creates a directory structure as described in the [Partition Discovery](#partition-discovery) section.
Thus, it has limited applicability to columns with high cardinality. In contrast
`bucketBy` distributes
data across fixed number of buckets and can be used when a number of unique values is unbounded.
Member

Nit: fixed number of -> a fixed number of

@gatorsmile
Member

Will merge it when my minor comment is resolved.

Thanks for working on it!

@SparkQA

SparkQA commented May 26, 2017

Test build #77436 has finished for PR 17938 at commit bea0676.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request May 26, 2017
…By and sortBy in SQL guide

## What changes were proposed in this pull request?

- Add Scala, Python and Java examples for `partitionBy`, `sortBy` and `bucketBy`.
- Add _Bucketing, Sorting and Partitioning_ section to SQL Programming Guide
- Remove bucketing from Unsupported Hive Functionalities.

## How was this patch tested?

Manual tests, docs build.

Author: zero323 <[email protected]>

Closes #17938 from zero323/DOCS-BUCKETING-AND-PARTITIONING.

(cherry picked from commit ae33abf)
Signed-off-by: Xiao Li <[email protected]>
@asfgit asfgit closed this in ae33abf May 26, 2017
@zero323
Member Author

zero323 commented May 26, 2017

Thanks @gatorsmile

@zero323 zero323 deleted the DOCS-BUCKETING-AND-PARTITIONING branch February 2, 2020 17:45