[SPARK-7324][SQL] Add DataFrame.dropDuplicates #5870

kaka1992 · 2015-05-03T06:15:21Z

Adds DataFrame.dropDuplicates, similar to pandas' DataFrame.drop_duplicates (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html):
def dropDuplicates(): DataFrame
def dropDuplicates(subset: Seq[String]): DataFrame

AmplabJenkins · 2015-05-03T06:17:10Z

Can one of the admins verify this patch?

viirya · 2015-05-04T07:12:20Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala

+   * Returns a new [[DataFrame]] without duplicates under the given columns.
+   * @group dfops
+   */
+  def dropDuplicates(subset: Seq[String]): DataFrame = {


Isn't distinct the same as dropDuplicates for removing duplicate rows? If they are the same, which implementation is better: GroupedData or the Distinct node?

@viirya No. dropDuplicates removes rows that are duplicates on a subset of columns, or on all columns (the default). The default version is the same as distinct.

Couldn't you also select the subset of columns and then call distinct?

If you do that, the result contains only the selected columns, not all columns.
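To make the distinction in this exchange concrete, here is a minimal sketch that models the three behaviours on plain Scala collections rather than on a real DataFrame. The object name, the Row alias, and the sample data are all hypothetical, and the model keeps the first row it encounters per key, which is an assumption of the sketch: an aggregate such as first in Spark does not in general promise a particular row.

```scala
// Plain-collections model of the semantics discussed above; no Spark required.
object DedupSemantics {
  // Hypothetical stand-in for a DataFrame row: column name -> value.
  type Row = Map[String, Any]

  val rows: Seq[Row] = Seq(
    Map("name" -> "a", "dept" -> "x", "score" -> 1),
    Map("name" -> "a", "dept" -> "x", "score" -> 2),
    Map("name" -> "b", "dept" -> "y", "score" -> 3),
    Map("name" -> "b", "dept" -> "y", "score" -> 3)
  )

  // distinct: only rows that are equal on every column collapse.
  def distinctRows(rs: Seq[Row]): Seq[Row] = rs.distinct

  // select(subset) followed by distinct: deduplicates, but the other columns are lost.
  def selectThenDistinct(rs: Seq[Row], subset: Seq[String]): Seq[Row] =
    rs.map(r => r.filter { case (k, _) => subset.contains(k) }).distinct

  // dropDuplicates(subset): one full row per distinct key over the subset columns.
  def dropDuplicates(rs: Seq[Row], subset: Seq[String]): Seq[Row] = {
    val seen = scala.collection.mutable.LinkedHashMap.empty[Seq[Any], Row]
    rs.foreach { r =>
      val key = subset.map(r) // values of the subset columns form the key
      if (!seen.contains(key)) seen(key) = r
    }
    seen.values.toSeq
  }
}
```

With this data, distinctRows keeps three rows, selectThenDistinct on Seq("name") returns two rows that contain only the name column, and dropDuplicates on Seq("name") returns two rows that still carry dept and score.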

kaka1992 · 2015-05-05T02:44:21Z

@viirya Please test this.

kaka1992 · 2015-05-06T01:58:58Z

@rxin Please test this.

rxin · 2015-05-06T02:04:17Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala

+   */
+  def dropDuplicates(subset: Seq[String]): DataFrame = {
+    import org.apache.spark.sql.functions.{first => columnFirst}
+    new GroupedData(this, subset.map(colName => resolve(colName))).agg(columns.map(columnFirst))


This is just:

groupBy(subset : _*).agg(columns.map(columnFirst) : _*)

(You might need to pass the head explicitly and then expand the tail as varargs.)
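The head-and-tail remark refers to methods declared with one required parameter followed by a repeated parameter, which is the shape of the String-based groupBy and of agg on DataFrame. A small sketch with a hypothetical stand-in method (groupByNames is not a Spark API) shows why a Seq cannot be expanded into such a signature directly:

```scala
object VarargsSketch {
  // Hypothetical method with the same shape as groupBy(col1: String, cols: String*).
  def groupByNames(col1: String, cols: String*): Seq[String] = col1 +: cols

  val subset: Seq[String] = Seq("name", "dept")

  // groupByNames(subset: _*) does not compile: `: _*` can only fill the
  // repeated parameter, and col1 would be left without an argument.
  // Passing the head explicitly and expanding the tail works:
  val grouped: Seq[String] = groupByNames(subset.head, subset.tail: _*)
}
```

Note that subset.head throws on an empty sequence, so callers using this pattern need to handle the empty case separately.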

rxin · 2015-05-06T02:05:44Z

Jenkins, test this please.

AmplabJenkins · 2015-05-06T02:07:10Z

Merged build triggered.

AmplabJenkins · 2015-05-06T02:07:16Z

Merged build started.

SparkQA · 2015-05-06T02:09:13Z

Test build #31935 has started for PR 5870 at commit b6f1879.

SparkQA · 2015-05-06T04:00:35Z

Test build #31935 has finished for PR 5870 at commit b6f1879.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-05-06T04:00:39Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-05-06T04:00:40Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31935/
Test PASSed.

kaka1992 · 2015-05-06T05:44:27Z

@rxin Please retest this.

rxin · 2015-05-06T06:28:42Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala

+  def dropDuplicates(subset: Seq[String]): DataFrame = {
+    import org.apache.spark.sql.functions.{first => columnFirst}
+    val columnFirsts = columns.map(columnFirst)
+    groupBy(subset.head, subset.tail : _*).agg(columnFirsts.head, columnFirsts.tail : _*)


We should also check whether subset.size == 0 or columns.size == 0 and, if so, simply return an empty DataFrame (there is one in SQLContext).
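The guard matters because subset.head and columnFirsts.head both throw on an empty sequence. A rough sketch of the suggested check, again modelled on plain collections rather than on DataFrame (the object name, Row alias, and the use of an empty Seq in place of the empty DataFrame are assumptions of the sketch):

```scala
object EmptyGuardSketch {
  // Hypothetical stand-in for a DataFrame row: column name -> value.
  type Row = Map[String, Any]

  def dropDuplicates(rows: Seq[Row], columns: Seq[String], subset: Seq[String]): Seq[Row] =
    if (subset.isEmpty || columns.isEmpty) {
      // Stands in for returning the empty DataFrame suggested in the review.
      Seq.empty
    } else {
      // Group on the subset columns and keep one row per group.
      rows.groupBy(r => subset.map(r)).values.map(_.head).toSeq
    }
}
```

Whether an empty subset should yield an empty result or fall back to deduplicating on all columns is a design choice; the sketch follows the review comment as written.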

adrian-wang · 2015-05-07T03:04:59Z

I think keeping the takeFirst parameter would make this easier to understand.

This should also close #5870

Author: Reynold Xin <[email protected]>

Closes #6066 from rxin/dropDups and squashes the following commits:

130692f [Reynold Xin] [SPARK-7324][SQL] DataFrame.dropDuplicates

(cherry picked from commit b6bf4f7)
Signed-off-by: Michael Armbrust <[email protected]>


云峤 added 8 commits May 1, 2015 23:50

[SPARK-7294] ADD BETWEEN

d11d5b9

[SPARK-7294] ADD BETWEEN

baf839b

[SPARK-7294] ADD BETWEEN

7d62368

Merge remote-tracking branch 'remotes/upstream/master'

76f0c51

update pep8

f080f8d

undo

c6e49bc

undo

d6cc28d

update

aab51ef

update

b6f1879

viirya reviewed May 4, 2015

rxin reviewed May 6, 2015

Remove useless code.

571869e

rxin reviewed May 6, 2015

rxin mentioned this pull request May 11, 2015

[SPARK-7324][SQL] DataFrame.dropDuplicates #6066

Closed

Update

1de8791

asfgit closed this in b6bf4f7 May 12, 2015
