[SPARK-7324][SQL] Add DataFrame.dropDuplicates #5870
Can one of the admins verify this patch?
 * Returns a new [[DataFrame]] without duplicates under the given columns.
 * @group dfops
 */
def dropDuplicates(subset: Seq[String]): DataFrame = {
Isn't `distinct` the same as `dropDuplicates` for removing duplicate rows? If they are the same, which implementation is better: `GroupedData` or the `Distinct` node?
@viirya No, dropDuplicates removes rows that are duplicates in some columns, or in all columns (the default). The default version is the same as distinct.
Couldn't you also select a subset of columns and then call distinct?
If you do this, you can't get all columns.
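To illustrate the point above, here is a minimal sketch of the difference between the two approaches. It uses plain Scala collections as a stand-in for a DataFrame, so it is not Spark code; the `Person` type, the sample data, and the choice to keep the first row seen per key are all assumptions made for the example.

```scala
// Plain Scala collections stand in for a DataFrame here; this is not Spark code.
// Person and the sample rows are hypothetical.
final case class Person(name: String, age: Int, city: String)

object DropDuplicatesSketch {
  val people: Seq[Person] = Seq(
    Person("Alice", 30, "Paris"),
    Person("Alice", 30, "Berlin"),
    Person("Bob", 25, "Paris")
  )

  // Analogue of select("name", "age").distinct: duplicates are removed,
  // but the city column is no longer available.
  val selectedThenDistinct: Seq[(String, Int)] =
    people.map(p => (p.name, p.age)).distinct

  // Analogue of dropDuplicates(Seq("name", "age")): one full row per key.
  // Keeping the first row encountered per key is an assumption of this sketch.
  val deduplicated: Seq[Person] =
    people.groupBy(p => (p.name, p.age)).values.map(_.head).toSeq
}
```

With select-then-distinct only the key columns survive, whereas the grouped version keeps every column of one representative row per key.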
@viirya Please test this.
@rxin Please test this.
 */
def dropDuplicates(subset: Seq[String]): DataFrame = {
  import org.apache.spark.sql.functions.{first => columnFirst}
  new GroupedData(this, subset.map(colName => resolve(colName))).agg(columns.map(columnFirst))
This is just `groupBy(subset: _*).agg(columns.map(columnFirst): _*)` (you might need to take the head and then vararg the tail).
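The head-and-tail remark can be sketched as follows. `groupKeys` is a hypothetical stand-in for any method that takes one required argument followed by varargs; it is not a Spark API. The point is only that a `Seq` cannot be passed directly to such a signature and has to be split first.

```scala
object VarargsSketch {
  // Hypothetical stand-in: one required argument followed by varargs.
  def groupKeys(first: String, rest: String*): Seq[String] = first +: rest

  // Adapts a Seq to that shape by passing the head as the required argument
  // and expanding the tail as varargs. An empty Seq has no head, so it is
  // rejected up front.
  def groupKeysFromSeq(subset: Seq[String]): Seq[String] = {
    require(subset.nonEmpty, "subset must contain at least one column name")
    groupKeys(subset.head, subset.tail: _*)
  }
}
```

Note that `subset.head` throws on an empty sequence, which is why the sketch checks for that case explicitly.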
Jenkins, test this please.
Merged build triggered.
Merged build started.
Test build #31935 has started for PR 5870 at commit
Test build #31935 has finished for PR 5870 at commit
Merged build finished. Test PASSed.
Test PASSed.
@rxin Please retest this.
def dropDuplicates(subset: Seq[String]): DataFrame = {
  import org.apache.spark.sql.functions.{first => columnFirst}
  val columnFirsts = columns.map(columnFirst)
  groupBy(subset.head, subset.tail : _*).agg(columnFirsts.head, columnFirsts.tail : _*)
We should also check whether subset.size == 0 or columns.size == 0 and, if so, simply return an empty data frame (there is one in SQLContext).
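A rough sketch of the suggested guard, again using a plain in-memory model rather than Spark. `Table`, `empty`, and this `dropDuplicates` are hypothetical names introduced for the example; keeping the first row per key is an assumption, and the early return mirrors the empty-subset and empty-columns checks proposed above.

```scala
import scala.collection.mutable

object GuardedDedupSketch {
  // Hypothetical in-memory table: column names plus positional row values.
  final case class Table(columns: Seq[String], rows: Seq[Seq[Any]])

  // Stand-in for the empty data frame mentioned in the review comment.
  val empty: Table = Table(Seq.empty, Seq.empty)

  def dropDuplicates(table: Table, subset: Seq[String]): Table =
    if (subset.isEmpty || table.columns.isEmpty) empty // guard before using head/tail
    else {
      // Resolve each subset column name to its position in the row.
      val keyIndexes = subset.map { name =>
        val i = table.columns.indexOf(name)
        require(i >= 0, s"unknown column: $name")
        i
      }
      // Keep the first row seen for each key, preserving input order.
      val seen = mutable.LinkedHashMap.empty[Seq[Any], Seq[Any]]
      table.rows.foreach { row =>
        val key = keyIndexes.map(row)
        if (!seen.contains(key)) seen(key) = row
      }
      Table(table.columns, seen.values.toSeq)
    }
}
```

Without the guard, the head-and-tail form would fail on an empty subset, so checking first keeps the method total.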
I think keep the |
This should also close #5870

Author: Reynold Xin <[email protected]>

Closes #6066 from rxin/dropDups and squashes the following commits:

130692f [Reynold Xin] [SPARK-7324][SQL] DataFrame.dropDuplicates

(cherry picked from commit b6bf4f7)
Signed-off-by: Michael Armbrust <[email protected]>
Similar to http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
def dropDuplicates(): DataFrame
def dropDuplicates(subset: Seq[String]): DataFrame