[SPARK-16931][PYTHON][SQL] Add Python wrapper for bucketBy #17077
Conversation
python/pyspark/sql/readwriter.py
Outdated
@@ -545,6 +545,55 @@ def partitionBy(self, *cols):
        self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
        return self

    @since(2.1)
Maybe it should be 2.2 :)
python/pyspark/sql/tests.py
Outdated
# Test write with one bucketing column
df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
self.assertEqual(
    len([c for c in self.spark.catalog.listColumns("pyspark_bucket") if c.name == "x" and c.isBucket]),
Oh, BTW, I assume this exceeds the 100-character length limit?
Indeed.
Test build #73504 has finished for PR 17077 at commit
Test build #73522 has finished for PR 17077 at commit
Test build #73523 has finished for PR 17077 at commit
Test build #73527 has finished for PR 17077 at commit
Force-pushed from e2bad95 to ae93166.
Test build #73535 has finished for PR 17077 at commit
This looks like an important improvement that might make sense to try to get in for 2.2, so I'll try to get some reviewing in.
First quick pass through, thanks for working on this :)
python/pyspark/sql/readwriter.py
Outdated
@@ -545,6 +545,57 @@ def partitionBy(self, *cols):
        self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
        return self

    @since(2.2)
    def bucketBy(self, numBuckets, *cols):
        """Buckets the output by the given columns on the file system.
So the bucketBy description in the scaladoc is a bit more in-depth; you might just want to copy that.
Also, generally our style for multi-line docstrings is to have the opening """ on its own line.
Regarding style, I had a similar exchange with @jkbradley lately (#17218 (review)). If a single convention is desired, I believe it should be documented and the remaining docstrings should be adjusted. Personally I am indifferent, though PEP 8 and PEP 257 seem to prefer this convention over placing the opening quotes on a separate line.

"you might just want to copy that."

Do you mean this? I wonder if we should rather document that it is allowed only with saveAsTable. What do you think?
Both

"""
...
"""

and

"""...
"""

comply with PEP 8 for multi-line docstrings to my knowledge, although I don't think a specific way has been preferred in this case. (Just as a personal taste, I prefer the first one.)
I think just copying it from Scala doc is good enough to prevent overhead of sweeping the documentation when we start to support other operations later.
python/pyspark/sql/readwriter.py
Outdated
    @since(2.2)
    def sortBy(self, *cols):
        """Sorts the output in each bucket by the given columns on the file system.
Same comment as above with regards to the docstring
python/pyspark/sql/tests.py
Outdated
df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket") | ||
self.assertEqual( | ||
len([c for c in self.spark.catalog.listColumns("pyspark_bucket") | ||
if c.name == "x" and c.isBucket]), |
BTW, maybe we should break this into multiple lines (or simply use another variable for this list comprehension) if more commits are pushed. It seems hard to read.
We can simplify this to

catalog = self.spark.catalog
sum(c.name == "x" and c.isBucket for c in catalog.listColumns("pyspark_bucket"))

if you think this is more readable, but I am not convinced that it makes sense to use a separate variable here. We have a few tests like this, they don't care about the sequence itself, and I think it would only pollute the scope. But if you have strong feelings about it I am happy to adjust it.
Regarding the comment style... Right now in readwriter (excluding bucketBy and sortBy) we have:

- 23 docstrings that open with """ followed by text on the same line
- 7 docstrings that put the opening """ on its own line

As you said both are valid, but if we want to keep only one convention it would be a good idea to adjust the whole module.
I am fine with it. I don't feel strongly about either.
Thanks for taking a look at the related ones and trying it out.
Test build #75658 has finished for PR 17077 at commit
Test build #75659 has finished for PR 17077 at commit
Those are all from me, @zero323. It looks quite good to me except for a few comments I left.
python/pyspark/sql/tests.py
Outdated
df.write.bucketBy(3, "x", "y").mode("overwrite").saveAsTable("pyspark_bucket") | ||
self.assertEqual( | ||
len([c for c in self.spark.catalog.listColumns("pyspark_bucket") | ||
if c.name in ("x", "y") and c.isBucket]), |
I am sorry. What do you think about something like the one below?
cols = self.spark.catalog.listColumns("pyspark_bucket")
num = len([c for c in cols if c.name in ("x", "y") and c.isBucket])
self.assertEqual(num, 2)
If you think it is better I'll trust your judgment. But let's keep it DRY and use a helper.
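For illustration, a minimal sketch of such a helper (the name count_bucketed_cols and the default table name are hypothetical; it reuses the listColumns/isBucket check already used in these tests and would live on the test class):

def count_bucketed_cols(self, names, table="pyspark_bucket"):
    """Return how many of the given column names are bucketing columns of the table."""
    cols = self.spark.catalog.listColumns(table)
    return len([c for c in cols if c.name in names and c.isBucket])

A test could then assert, e.g., self.assertEqual(self.count_bucketed_cols(["x", "y"]), 2).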
Copying docs from the Scala docs directly could be confusing since we won't support this in 2.0 and 2.1, and changes since 2.0 don't really affect us here.
Thank you for taking my opinion into account. Yea, we should remove or change the version. I meant following the rest of the contents.
Generally, the documentation contents have been kept matched among APIs in different languages, to my knowledge. I don't think this is a must, but I think it is safer to avoid any blame in the future and confusion for the users.
I have seen several minor PRs fixing documentation (e.g., typos) that had to be fixed identically for the APIs in other languages, and I also made some PRs to match the documentation, e.g., #17429
# Test write with bucket and sort with multiple columns
(df.write.bucketBy(2, "x")
    .sortBy("y", "z")
    .mode("overwrite").saveAsTable("pyspark_bucket"))
@zero323, should we drop the table before or after this test?
I don't think that dropping before is necessary. We overwrite on each write and name clashes are unlikely.
We can drop it after the tests, but I am not sure how to do it right. SQLTests is overgrown and I am not sure if we should add tearDown only for this, but adding DROP TABLE in the test itself doesn't look right either.
Test build #75666 has finished for PR 17077 at commit
Test build #75667 has finished for PR 17077 at commit
python/pyspark/sql/tests.py
Outdated
.mode("overwrite").saveAsTable("pyspark_bucket")) | ||
self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect())) | ||
|
||
self.spark.sql("DROP TABLE IF EXISTS pyspark_bucket") |
Yea, I think this is a correct way to drop the table.
If we're going to drop the table here we should probably put it in a finally block.
@holdenk Do you suggest adding tearDown? I thought about it, but right now the tests are so inflated (sadly not much support for SPARK-19224) that it would be completely detached from the context.
On the other hand, adding an artificial try ... finally seems wrong.
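If a tearDown were added, a minimal sketch could look like the following (assuming SQLTests follows the usual unittest pattern and that pyspark_bucket is the only table these tests create; whether this belongs in SQLTests at all is exactly the open question above):

def tearDown(self):
    super(SQLTests, self).tearDown()
    # Clean up the table written by the bucketing tests, if it exists.
    self.spark.sql("DROP TABLE IF EXISTS pyspark_bucket")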
(I think we need @holdenk's sign-off and further review.)
Thanks for helping with the review @HyukjinKwon :)
One minor comment but otherwise looking in very good shape.
@holdenk, @HyukjinKwon Do we retarget this to 2.3?
I think we should, because branch-2.2 has been cut.
🙁
Sure, but I'll need some guidance here. Somewhere in the Generic Load/Save Functions, right? But I guess we'll need a separate section for that. And should probably document
Test build #76545 has finished for PR 17077 at commit
:param cols: additional names (optional). If `col` is a list it should be empty.

.. note:: Applicable for file-based data sources in combination with
          :py:meth:`DataFrameWriter.saveAsTable`.
This is not accurate. We can also use save to store the bucketed tables without saving their metadata in the metastore.
@gatorsmile Can we?
➜ spark git:(master) git rev-parse HEAD
2cf83c47838115f71419ba5b9296c69ec1d746cd
➜ spark git:(master) bin/spark-shell
Spark context Web UI available at http://192.168.1.101:4041
Spark context available as 'sc' (master = local[*], app id = local-1494184109262).
Spark session available as 'spark'.
Welcome to Spark version 2.3.0-SNAPSHOT
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_121)

scala> Seq(("a", 1, 3)).toDF("x", "y", "z").write.bucketBy(3, "x", "y").format("parquet").save("/tmp/foo")
org.apache.spark.sql.AnalysisException: 'save' does not support bucketing right now;
  at org.apache.spark.sql.DataFrameWriter.assertNotBucketed(DataFrameWriter.scala:305)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:231)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
  ... 48 elided
Uh, yes. Bucket info is not part of the file/directory names, unlike partitioning info. You can create a new section to explain how to create bucketed tables.
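For reference, a PySpark sketch of the same distinction (run in a pyspark shell where spark is defined; the table name is arbitrary, and the exception message comes from the Scala transcript above):

df = spark.createDataFrame([("a", 1, 3)], ["x", "y", "z"])

# Supported: bucketing metadata is recorded in the table's metastore entry.
df.write.bucketBy(3, "x", "y").mode("overwrite").saveAsTable("bucketed_table")

# Not supported: bucket info is not part of the file/directory names, and plain
# save() has no metastore entry to record it in, so this raises
# AnalysisException: 'save' does not support bucketing right now;
df.write.bucketBy(3, "x", "y").format("parquet").save("/tmp/foo")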
@@ -563,6 +563,63 @@ def partitionBy(self, *cols):
        self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
        return self

    @since(2.3)
    def bucketBy(self, numBuckets, col, *cols):
        """Buckets the output by the given columns.If specified,
Nit: columns.If -> columns. If
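For reference, a minimal sketch of how the final wrappers could look, following the partitionBy plumbing shown earlier and the signature in this diff; the exact docstring wording and argument validation in the merged code may differ:

@since(2.3)
def bucketBy(self, numBuckets, col, *cols):
    """Buckets the output by the given columns. If specified,
    the output is laid out on the file system similar to Hive's bucketing scheme.

    :param numBuckets: the number of buckets to save
    :param col: a name of a column, or a list of names.
    :param cols: additional names (optional). If `col` is a list it should be empty.

    .. note:: Applicable for file-based data sources in combination with
              :py:meth:`DataFrameWriter.saveAsTable`.
    """
    # Accept either bucketBy(n, "x", "y") or bucketBy(n, ["x", "y"]).
    if isinstance(col, (list, tuple)):
        if cols:
            raise ValueError("col is a {0} but cols are not empty".format(type(col)))
        col, cols = col[0], col[1:]
    # Delegate to the JVM writer, as partitionBy does, and return self for chaining.
    self._jwrite = self._jwrite.bucketBy(numBuckets, col, _to_seq(self._spark._sc, cols))
    return self

@since(2.3)
def sortBy(self, col, *cols):
    """Sorts the output in each bucket by the given columns on the file system.

    :param col: a name of a column, or a list of names.
    :param cols: additional names (optional). If `col` is a list it should be empty.
    """
    if isinstance(col, (list, tuple)):
        if cols:
            raise ValueError("col is a {0} but cols are not empty".format(type(col)))
        col, cols = col[0], col[1:]
    self._jwrite = self._jwrite.sortBy(col, _to_seq(self._spark._sc, cols))
    return self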
LGTM
The SQL document update can be a separate PR. Thanks for your work!
LGTM too.
thanks, merging to master!
What changes were proposed in this pull request?

Adds Python wrappers for DataFrameWriter.bucketBy and DataFrameWriter.sortBy (SPARK-16931).

How was this patch tested?

Unit tests covering new feature.

Note: Based on work of @GregBowyer (f49b9a2)

CC @HyukjinKwon
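A short usage sketch of the new API, mirroring the tests above (run in a PySpark shell where spark is defined; the table name is arbitrary):

df = spark.createDataFrame([("a", 1, 3), ("b", 2, 4)], ["x", "y", "z"])

# Bucket by x into 2 buckets, sort each bucket by y and z, and persist as a table.
(df.write.bucketBy(2, "x")
    .sortBy("y", "z")
    .mode("overwrite")
    .saveAsTable("pyspark_bucket"))

# The bucketing columns are visible through the catalog.
print([c.name for c in spark.catalog.listColumns("pyspark_bucket") if c.isBucket])  # ['x']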