-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-20694][DOCS][SQL] Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide #17938
Conversation
Test build #76748 has finished for PR 17938 at commit
|
@zero323, what do you think about opening a JIRA or turning this as a followup for your previous PR? I know it is a doc fix but it sounds pretty important and non-trivial fix. |
@HyukjinKwon Sounds good. SPARK-20694. Should we document the difference between buckets (metastore based) and partitions (file system based)? The latter one could by done by referencing Partition Discover. |
(I think I am not supposed to decide this and probably the best is the confirmation from a commiter) |
docs/sql-programming-guide.md
Outdated
**Major Hive Features** | ||
|
||
* Tables with buckets: bucket is the hash partitioning within a Hive table partition. Spark SQL | ||
doesn't support buckets yet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do support buckets, but it is slightly different from Hive. See the ongoing PR: #17644
Could you document the difference too? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets keep this until SPARK-19256 gets resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
@cloud-fan @tejasapatil Could you please help review this PR? |
docs/sql-programming-guide.md
Outdated
@@ -581,6 +581,46 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat | |||
|
|||
Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`. | |||
|
|||
### Bucketing, Sorting and Partitioning | |||
|
|||
For file-based data source it is also possible to bucket and and sort or partition the output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: bucket and and sort
: double and
@@ -581,6 +581,46 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat | |||
|
|||
Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`. | |||
|
|||
### Bucketing, Sorting and Partitioning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel that examples are missing writing to partitioned + bucketed table. eg.
my_dataframe.write.format("orc").partitionBy("i").bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("my_table")
There could be multiple possible orderings of partitionBy
, bucketBy
and sortBy
calls. Not all of them are supported, not all of them would produce correct outputs. I have not done any exhaustive study of the same but I think this should be documented to guide people while using these APIs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we emphasize partitioning? I think it's more widely used than bucketing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There could be multiple possible orderings of
partitionBy,
bucketBy
andsortBy
calls. Not all of them are supported, not all of them would produce correct outputs.
Shouldn't the output be the same no matter the order? sortBy
is not applicable for partitionBy
and takes precedence over bucketBy
, if both are present. This is Hive's behaviour if I am not mistaken, and at first glance Spark is doing the same thing. It there any gotcha here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan I think we can redirect to partition discovery here. But explaining the difference and possible applications (low vs. high cardinality) could be a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the output be the same no matter the order?
Theoretically yes. Practically I don't know what happens. Since you are documenting, it will be worthwhile to check that and record if it works as expected (or if there is any weirdness).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I thought you are implying there are some known issues. This actually behaves sensibly - all supported options seem to work independent of the order, and unsupported ones (partitionBy
+ sortBy
without bucketBy
or overlapping bucketBy
and partitionBy
columns) give enough feedback to diagnose the issue.
I haven't tested this with large datasets though, so there can be hidden issues.
docs/sql-programming-guide.md
Outdated
|
||
</div> | ||
|
||
while partitioning can be used with both `save` and `saveAsTable`: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like @tejasapatil suggested, we should give one more example about partitioned and bucketed table, so that users know they can use bucketing and partitioning at the same time
20c7ca6
to
a14296a
Compare
Test build #76813 has finished for PR 17938 at commit
|
a14296a
to
7bf4bbc
Compare
Test build #76825 has finished for PR 17938 at commit
|
Test build #76830 has finished for PR 17938 at commit
|
Test build #76828 has finished for PR 17938 at commit
|
When you omit |
Test build #76899 has finished for PR 17938 at commit
|
@cloud-fan Thanks for the clarification. Just a thought - shouldn't we either support it consistently or don't support at all? Current behaviour is quite confusing and I don't think that documentation alone will cut it. |
Test build #76900 has finished for PR 17938 at commit
|
we are going to support bucketing in hive style CREATE TABLE syntax soon. |
In the current 2.2 docs, we already updated all the syntax to Thus, it is OK to document like what you just committed. Let me review them carefully now. Thanks for your work! |
docs/sql-programming-guide.md
Outdated
favorite_color STRING, | ||
favorite_NUMBERS array<integer> | ||
) USING parquet | ||
CLUSTERED BY(name) INTO 42 BUCKETS; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be consistent with the example in the other APIs, it is missing the SORTED BY
clause.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please use the same table names people_bucketed
with the same column names in the example? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zero323 Could you also resolve this? Thanks!
docs/sql-programming-guide.md
Outdated
@@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat | |||
|
|||
Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`. | |||
|
|||
### Bucketing, Sorting and Partitioning | |||
|
|||
For file-based data source it is also possible to bucket and sort or partition the output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, For file-based data source it
-> For file-based data source, it
docs/sql-programming-guide.md
Outdated
### Bucketing, Sorting and Partitioning | ||
|
||
For file-based data source it is also possible to bucket and sort or partition the output. | ||
Bucketing and sorting is applicable only to persistent tables: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is applicable
-> are applicable
docs/sql-programming-guide.md
Outdated
|
||
</div> | ||
|
||
It is possible to use both partitions and buckets for a single table: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
partitions and buckets
-> partitioning and bucketing
docs/sql-programming-guide.md
Outdated
|
||
</div> | ||
|
||
while partitioning can be used with both `save` and `saveAsTable`: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
both `save` and `saveAsTable`
->
both `save` and `saveAsTable` when using the Dataset APIs.
docs/sql-programming-guide.md
Outdated
</div> | ||
|
||
`partitionBy` creates a directory structure as described in the [Partition Discovery](#partition-discovery) section. | ||
Because of that it has limited applicability to columns with high cardinality. In contrast `bucketBy` distributes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of that it has
-> Thus, it has
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In contrast
bucketBy distributes
-> In contrast,
bucketBy distributes
docs/sql-programming-guide.md
Outdated
|
||
`partitionBy` creates a directory structure as described in the [Partition Discovery](#partition-discovery) section. | ||
Because of that it has limited applicability to columns with high cardinality. In contrast `bucketBy` distributes | ||
data across fixed number of buckets and can be used if a number of unique values is unbounded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
used if
-> used when
LGTM except a few minor comments. |
Test build #76905 has finished for PR 17938 at commit
|
Test build #76911 has finished for PR 17938 at commit
|
docs/sql-programming-guide.md
Outdated
`partitionBy` creates a directory structure as described in the [Partition Discovery](#partition-discovery) section. | ||
Thus, it has limited applicability to columns with high cardinality. In contrast | ||
`bucketBy` distributes | ||
data across fixed number of buckets and can be used when a number of unique values is unbounded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: fixed number of
-> a fixed number of
Will merge it when my minor comment is resolved. Thanks for working on it! |
Test build #77436 has finished for PR 17938 at commit
|
…By and sortBy in SQL guide ## What changes were proposed in this pull request? - Add Scala, Python and Java examples for `partitionBy`, `sortBy` and `bucketBy`. - Add _Bucketing, Sorting and Partitioning_ section to SQL Programming Guide - Remove bucketing from Unsupported Hive Functionalities. ## How was this patch tested? Manual tests, docs build. Author: zero323 <[email protected]> Closes #17938 from zero323/DOCS-BUCKETING-AND-PARTITIONING. (cherry picked from commit ae33abf) Signed-off-by: Xiao Li <[email protected]>
Thanks @gatorsmile |
What changes were proposed in this pull request?
partitionBy
,sortBy
andbucketBy
.How was this patch tested?
Manual tests, docs build.