[SPARK-7324][SQL] Add DataFrame.dropDuplicates #5870

Closed
wants to merge 11 commits into from

Conversation

kaka1992
Contributor

@kaka1992 kaka1992 commented May 3, 2015

Adds DataFrame.dropDuplicates, similar to pandas' drop_duplicates: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
def dropDuplicates(): DataFrame
def dropDuplicates(subset: Seq[String]): DataFrame

@AmplabJenkins

Can one of the admins verify this patch?

* Returns a new [[DataFrame]] without duplicates under the given columns.
* @group dfops
*/
def dropDuplicates(subset: Seq[String]): DataFrame = {
Member

Isn't distinct the same as dropDuplicates for removing duplicate rows? If they are the same, which implementation is better: GroupedData or the Distinct node?

Contributor Author

@viirya No. dropDuplicates removes rows that are duplicates with respect to a subset of columns, or with respect to all columns by default. The default version is the same as distinct.

Member

Couldn't you also select a subset of columns and then call distinct?

Contributor Author

If you do that, the result contains only the selected columns, not all of them.
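
To make the distinction in this thread concrete, here is a minimal sketch on plain Scala collections rather than Spark. The `Row` alias, the sample data, and the helper names `selectDistinct` and `dropDuplicates` are assumptions for illustration only. Keeping the first row seen per key is a modelling choice and not a statement about which row the patch keeps.

```scala
object DropDuplicatesSketch {
  // A row is modelled as a column-name -> value map (an assumption for
  // illustration; Spark's Row type is not used here).
  type Row = Map[String, Any]

  val rows: Seq[Row] = Seq(
    Map("name" -> "alice", "age" -> 30, "city" -> "NYC"),
    Map("name" -> "alice", "age" -> 30, "city" -> "NYC"), // exact duplicate
    Map("name" -> "alice", "age" -> 31, "city" -> "LA"),  // same name only
    Map("name" -> "bob", "age" -> 25, "city" -> "NYC")
  )

  // distinct: removes rows that are equal in every column.
  val distinctRows: Seq[Row] = rows.distinct

  // select(subset) followed by distinct: deduplicates on the subset,
  // but the other columns are gone from the result.
  def selectDistinct(data: Seq[Row], subset: Seq[String]): Seq[Row] =
    data.map(row => row.filter { case (k, _) => subset.contains(k) }).distinct

  // dropDuplicates(subset): one full row per distinct key,
  // keeping the first row seen for each key.
  def dropDuplicates(data: Seq[Row], subset: Seq[String]): Seq[Row] = {
    val seen = scala.collection.mutable.LinkedHashMap.empty[Seq[Any], Row]
    data.foreach { row =>
      val key = subset.map(row) // values of the subset columns
      if (!seen.contains(key)) seen(key) = row
    }
    seen.values.toSeq
  }
}
```

With `Seq("name")` as the subset, `selectDistinct` returns two single-column rows, while `dropDuplicates` returns two rows that still carry `name`, `age`, and `city`.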

@kaka1992
Contributor Author

kaka1992 commented May 5, 2015

@viirya Please test this.

@kaka1992
Contributor Author

kaka1992 commented May 6, 2015

@rxin Please test this.

*/
def dropDuplicates(subset: Seq[String]): DataFrame = {
  import org.apache.spark.sql.functions.{first => columnFirst}
  new GroupedData(this, subset.map(colName => resolve(colName))).agg(columns.map(columnFirst))
Contributor

This is just:

groupBy(subset : _*).agg(columns.map(columnFirst) : _*)

(You might need to take the head and then pass the tail as varargs.)
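
The head/tail remark can be illustrated with a small sketch in plain Scala. `groupByNames` is a hypothetical stand-in, assumed to be declared in `(first, rest*)` form as the head/tail call in the later revision of this diff suggests. It is not the Spark method itself.

```scala
object VarargSketch {
  // Hypothetical stand-in for a method that requires at least one column
  // name and accepts more as varargs: (first, rest*).
  def groupByNames(col1: String, cols: String*): Seq[String] = col1 +: cols

  val subset: Seq[String] = Seq("name", "age")

  // groupByNames(subset: _*) does not compile, because the `: _*` expansion
  // can only fill the repeated parameter, not the required first one.
  // Splitting the sequence into head and tail satisfies both parameters.
  val grouped: Seq[String] = groupByNames(subset.head, subset.tail: _*)
}
```

Note that `subset.head` fails on an empty sequence, so this form assumes at least one column name.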

@rxin
Contributor

rxin commented May 6, 2015

Jenkins, test this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 6, 2015

Test build #31935 has started for PR 5870 at commit b6f1879.

@SparkQA

SparkQA commented May 6, 2015

Test build #31935 has finished for PR 5870 at commit b6f1879.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31935/
Test PASSed.

@kaka1992
Contributor Author

kaka1992 commented May 6, 2015

@rxin Please retest this.

def dropDuplicates(subset: Seq[String]): DataFrame = {
  import org.apache.spark.sql.functions.{first => columnFirst}
  val columnFirsts = columns.map(columnFirst)
  groupBy(subset.head, subset.tail : _*).agg(columnFirsts.head, columnFirsts.tail : _*)
Contributor

We should also check whether subset.size == 0 or columns.size == 0 and, if so, simply return an empty DataFrame (there is one in SQLContext).
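
A minimal sketch of the guard suggested here, again modelled on plain Scala collections rather than Spark. The `Row` alias and the `groupRows` helper are hypothetical, and returning `Seq.empty` stands in for returning an empty DataFrame.

```scala
object EmptySubsetSketch {
  // Rows modelled as column-name -> value maps (an assumption for illustration).
  type Row = Map[String, Any]

  // Hypothetical (first, rest*) grouping helper standing in for the
  // groupBy(...).agg(...) call in the diff: keeps one row per key.
  def groupRows(data: Seq[Row], col1: String, cols: String*): Seq[Row] = {
    val keyCols = col1 +: cols
    data.groupBy(row => keyCols.map(row)).values.map(_.head).toSeq
  }

  def dropDuplicates(data: Seq[Row], subset: Seq[String]): Seq[Row] =
    // Guard: without it, subset.head throws NoSuchElementException
    // when subset is empty.
    if (subset.isEmpty || data.forall(_.isEmpty)) Seq.empty
    else groupRows(data, subset.head, subset.tail: _*)
}
```

Here `data.forall(_.isEmpty)` plays the role of the `columns.size == 0` check: it is true when there are no rows or when the rows have no columns.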

@adrian-wang
Contributor

I think keeping the takeFirst parameter would make this easier to understand.

@asfgit asfgit closed this in b6bf4f7 May 12, 2015
asfgit pushed a commit that referenced this pull request May 12, 2015
This should also close #5870

Author: Reynold Xin <[email protected]>

Closes #6066 from rxin/dropDups and squashes the following commits:

130692f [Reynold Xin] [SPARK-7324][SQL] DataFrame.dropDuplicates

(cherry picked from commit b6bf4f7)
Signed-off-by: Michael Armbrust <[email protected]>
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015