
[SPARK-20694][DOCS][SQL] Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide #17938

Closed
wants to merge 20 commits into from

Conversation

zero323
Member

@zero323 zero323 commented May 10, 2017

What changes were proposed in this pull request?

  • Add Scala, Python and Java examples for partitionBy, sortBy and bucketBy.
  • Add Bucketing, Sorting and Partitioning section to SQL Programming Guide
  • Remove bucketing from Unsupported Hive Functionalities.

How was this patch tested?

Manual tests, docs build.

@SparkQA

SparkQA commented May 10, 2017

Test build #76748 has finished for PR 17938 at commit 20c7ca6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

@zero323, what do you think about opening a JIRA or turning this into a follow-up for your previous PR? I know it is a doc fix but it sounds like a pretty important and non-trivial fix.

@zero323 zero323 changed the title [DOCS][SQL] Document bucketing and partitioning in SQL guide [SPARK-20694][DOCS][SQL] Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide May 10, 2017
@zero323
Member Author

zero323 commented May 10, 2017

@HyukjinKwon Sounds good. SPARK-20694.

Should we document the difference between buckets (metastore based) and partitions (file system based)? The latter could be done by referencing Partition Discovery.
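The contrast being asked about here can be sketched in plain Python. The helpers and file names below are hypothetical, for illustration only; this is not Spark's actual writer code.

```python
# Sketch: partitionBy is file-system based (one directory per distinct
# value), while bucketBy is metastore based (rows hashed into a fixed
# number of buckets). Helper names and paths are hypothetical.

def partition_paths(rows, partition_col):
    """partitionBy-style layout: one key=value directory per distinct value."""
    return sorted({f"{partition_col}={r[partition_col]}/part-00000" for r in rows})

def bucket_names(rows, bucket_col, num_buckets):
    """bucketBy-style layout: a bounded number of bucket files."""
    return sorted({f"part-00000_{hash(r[bucket_col]) % num_buckets:05d}" for r in rows})

rows = [{"name": "alice", "city": "NYC"},
        {"name": "bob", "city": "SF"},
        {"name": "carol", "city": "NYC"}]

# Directory count tracks column cardinality...
assert partition_paths(rows, "city") == ["city=NYC/part-00000", "city=SF/part-00000"]
# ...while the number of bucket files is bounded by the bucket count.
assert len(bucket_names(rows, "name", 4)) <= 4
```

This is also why partitioning suits low-cardinality columns and bucketing suits high-cardinality ones, a point picked up later in this review.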

@HyukjinKwon
Member

(I think I am not supposed to decide this and probably the best is confirmation from a committer)

**Major Hive Features**

* Tables with buckets: bucket is the hash partitioning within a Hive table partition. Spark SQL
doesn't support buckets yet.
Member

We do support buckets, but it is slightly different from Hive. See the ongoing PR: #17644

Could you document the difference too? Thanks!

Contributor

Let's keep this until SPARK-19256 gets resolved.

Contributor

+1

Member

+1

@gatorsmile
Member

@cloud-fan @tejasapatil Could you please help review this PR?

@@ -581,6 +581,46 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat

Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.

### Bucketing, Sorting and Partitioning

For file-based data source it is also possible to bucket and and sort or partition the output.
Contributor

nit: bucket and and sort : double and

@@ -581,6 +581,46 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat

Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.

### Bucketing, Sorting and Partitioning
Contributor

I feel that examples are missing writing to partitioned + bucketed table. eg.

my_dataframe.write.format("orc").partitionBy("i").bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("my_table")

There could be multiple possible orderings of partitionBy, bucketBy and sortBy calls. Not all of them are supported, and not all of them would produce correct outputs. I have not done any exhaustive study of this, but I think it should be documented to guide people using these APIs.

Contributor

shall we emphasize partitioning? I think it's more widely used than bucketing.

Member Author

@tejasapatil

There could be multiple possible orderings of partitionBy, bucketBy and sortBy calls. Not all of them are supported, not all of them would produce correct outputs.

Shouldn't the output be the same no matter the order? sortBy is not applicable for partitionBy and takes precedence over bucketBy, if both are present. This is Hive's behaviour if I am not mistaken, and at first glance Spark is doing the same thing. Is there any gotcha here?

Member Author

@cloud-fan I think we can redirect to partition discovery here. But explaining the difference and possible applications (low vs. high cardinality) could be a good idea.

Contributor

Shouldn't the output be the same no matter the order?

Theoretically yes. Practically I don't know what happens. Since you are documenting, it will be worthwhile to check that and record if it works as expected (or if there is any weirdness).

Member Author

Oh, I thought you were implying there are some known issues. This actually behaves sensibly: all supported options seem to work independently of the order, and unsupported ones (partitionBy + sortBy without bucketBy, or overlapping bucketBy and partitionBy columns) give enough feedback to diagnose the issue.

I haven't tested this with large datasets though, so there can be hidden issues.
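The order-independence observed here can be illustrated with a minimal builder sketch. The `Writer` class below is hypothetical, not Spark's `DataFrameWriter`; it only shows why recording options in setters and validating once at save time makes call order irrelevant.

```python
# Minimal builder sketch: setters only record options, and validation runs
# once at save time, so call order cannot affect the result.
# Hypothetical class, not Spark's DataFrameWriter implementation.

class Writer:
    def __init__(self):
        self._partition = None
        self._bucket = None
        self._sort = None

    def partitionBy(self, *cols):
        self._partition = cols
        return self

    def bucketBy(self, n, *cols):
        self._bucket = (n, cols)
        return self

    def sortBy(self, *cols):
        self._sort = cols
        return self

    def saveAsTable(self, name):
        # Unsupported combinations fail here with a diagnosable error,
        # regardless of the order the setters ran in.
        if self._sort is not None and self._bucket is None:
            raise ValueError("sortBy must be used together with bucketBy")
        if (self._bucket is not None and self._partition is not None
                and set(self._bucket[1]) & set(self._partition)):
            raise ValueError("bucketBy and partitionBy columns must not overlap")
        return (name, self._partition, self._bucket, self._sort)

a = Writer().partitionBy("i").bucketBy(8, "j").sortBy("j").saveAsTable("t")
b = Writer().sortBy("j").bucketBy(8, "j").partitionBy("i").saveAsTable("t")
assert a == b  # same recorded plan regardless of call order
```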



while partitioning can be used with both `save` and `saveAsTable`:
Contributor

like @tejasapatil suggested, we should give one more example about partitioned and bucketed table, so that users know they can use bucketing and partitioning at the same time
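The combined example being requested could look like the following (a sketch in Scala, one of the languages this PR adds examples in; `usersDF` and the column and table names are illustrative, not necessarily the PR's final wording):

```scala
// Sketch: partitioning and bucketing on the same table. partitionBy must
// use a different column set than bucketBy, and sortBy is only valid
// together with bucketBy.
usersDF.write
  .partitionBy("favorite_color")   // low-cardinality column -> directories
  .bucketBy(42, "name")            // fixed number of buckets in the metastore
  .sortBy("name")                  // requires bucketBy
  .saveAsTable("users_partitioned_bucketed")
```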

@zero323 zero323 force-pushed the DOCS-BUCKETING-AND-PARTITIONING branch from 20c7ca6 to a14296a Compare May 11, 2017 14:32
@SparkQA

SparkQA commented May 11, 2017

Test build #76813 has finished for PR 17938 at commit a14296a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323 zero323 force-pushed the DOCS-BUCKETING-AND-PARTITIONING branch from a14296a to 7bf4bbc Compare May 11, 2017 19:21
@SparkQA

SparkQA commented May 11, 2017

Test build #76825 has finished for PR 17938 at commit 7bf4bbc.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 11, 2017

Test build #76830 has finished for PR 17938 at commit 606f1e3.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 11, 2017

Test build #76828 has finished for PR 17938 at commit cc1bfcf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

When you omit USING, it's the Hive-style CREATE TABLE syntax, which is very different from Spark's. We should encourage users to use the Spark-style CREATE TABLE syntax and only document that (with the USING statement).
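For reference, the Spark-style syntax being recommended here looks like the following sketch (table and column names are illustrative):

```sql
-- Spark-style CREATE TABLE: the USING clause selects the Spark syntax.
-- Omitting USING falls back to the Hive-style syntax instead.
CREATE TABLE people_bucketed (
  name STRING,
  favorite_color STRING
) USING parquet
CLUSTERED BY (name) SORTED BY (name) INTO 42 BUCKETS;
```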

@SparkQA

SparkQA commented May 13, 2017

Test build #76899 has finished for PR 17938 at commit b5babf6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323
Member Author

zero323 commented May 13, 2017

@cloud-fan Thanks for the clarification. Just a thought: shouldn't we either support it consistently or not support it at all? The current behaviour is quite confusing and I don't think that documentation alone will cut it.

@SparkQA

SparkQA commented May 13, 2017

Test build #76900 has finished for PR 17938 at commit 92fb3b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

We are going to support bucketing in the Hive-style CREATE TABLE syntax soon.

@gatorsmile
Member

In the current 2.2 docs, we have already updated all the syntax to `CREATE TABLE ... USING ...`. This is the new change delivered in 2.2.

Thus, it is OK to document like what you just committed. Let me review them carefully now. Thanks for your work!

favorite_color STRING,
favorite_NUMBERS array<integer>
) USING parquet
CLUSTERED BY(name) INTO 42 BUCKETS;
Member

To be consistent with the example in the other APIs, it is missing the SORTED BY clause.

Member

Could you please use the same table names people_bucketed with the same column names in the example? Thanks!

Member

@zero323 Could you also resolve this? Thanks!

@@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat

Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.

### Bucketing, Sorting and Partitioning

For file-based data source it is also possible to bucket and sort or partition the output.
Member

Nit, For file-based data source it -> For file-based data source, it

### Bucketing, Sorting and Partitioning

For file-based data source it is also possible to bucket and sort or partition the output.
Bucketing and sorting is applicable only to persistent tables:
Member

is applicable -> are applicable



It is possible to use both partitions and buckets for a single table:
Member

partitions and buckets -> partitioning and bucketing



while partitioning can be used with both `save` and `saveAsTable`:
Member

Nit:

both `save` and `saveAsTable`

->

both `save` and `saveAsTable` when using the Dataset APIs. 


`partitionBy` creates a directory structure as described in the [Partition Discovery](#partition-discovery) section.
Because of that it has limited applicability to columns with high cardinality. In contrast `bucketBy` distributes
Member

@gatorsmile gatorsmile May 14, 2017

Because of that it has -> Thus, it has

Member

In contrast bucketBy distributes -> In contrast, bucketBy distributes


`partitionBy` creates a directory structure as described in the [Partition Discovery](#partition-discovery) section.
Because of that it has limited applicability to columns with high cardinality. In contrast `bucketBy` distributes
data across fixed number of buckets and can be used if a number of unique values is unbounded.
Member

used if -> used when

@gatorsmile
Member

LGTM except a few minor comments.

cc @tejasapatil @cloud-fan

@SparkQA

SparkQA commented May 14, 2017

Test build #76905 has finished for PR 17938 at commit 65ac310.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 14, 2017

Test build #76911 has finished for PR 17938 at commit 3a8b6e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

`partitionBy` creates a directory structure as described in the [Partition Discovery](#partition-discovery) section.
Thus, it has limited applicability to columns with high cardinality. In contrast
`bucketBy` distributes
data across fixed number of buckets and can be used when a number of unique values is unbounded.
Member

Nit: fixed number of -> a fixed number of

@gatorsmile
Member

Will merge it when my minor comment is resolved.

Thanks for working on it!

@SparkQA

SparkQA commented May 26, 2017

Test build #77436 has finished for PR 17938 at commit bea0676.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request May 26, 2017
…By and sortBy in SQL guide

## What changes were proposed in this pull request?

- Add Scala, Python and Java examples for `partitionBy`, `sortBy` and `bucketBy`.
- Add _Bucketing, Sorting and Partitioning_ section to SQL Programming Guide
- Remove bucketing from Unsupported Hive Functionalities.

## How was this patch tested?

Manual tests, docs build.

Author: zero323 <[email protected]>

Closes #17938 from zero323/DOCS-BUCKETING-AND-PARTITIONING.

(cherry picked from commit ae33abf)
Signed-off-by: Xiao Li <[email protected]>
@asfgit asfgit closed this in ae33abf May 26, 2017
@zero323
Member Author

zero323 commented May 26, 2017

Thanks @gatorsmile

@zero323 zero323 deleted the DOCS-BUCKETING-AND-PARTITIONING branch February 2, 2020 17:45