[SPARK-40012][PYTHON][DOCS] Make pyspark.sql.dataframe examples self-contained (Part 1) #37444
Conversation
I set the tag [WIP] because dataframe.py needs a lot of updates. I will add some additional changes to this PR in the upcoming days. Please review and provide some feedback.
Can one of the admins verify this patch?
@Transurgeon is this still WIP? If there are too many to fix, feel free to split into multiple PRs.
No, it's not WIP anymore. I wanted to get some feedback to see whether I was making good changes before continuing to work on it. Should I remove the WIP tag?
>>> df.drop(df.age).collect()
[Row(name='Alice'), Row(name='Bob')]
[Row(name='Tom'), Row(name='Alice'), Row(name='Bob')]

>>> df.join(df2, df.name == df2.name, 'inner').drop(df.name).collect()
I am not sure what these three inner joins do exactly. I don't see an instantiation of df2 anywhere.
What should I do with these three examples?
I think it's showing a common pattern: joining two DataFrames and then dropping the join key.
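For context, a minimal sketch of what a self-contained version of that example could look like once df2 is instantiated; the data values and the df2 schema here are illustrative assumptions, not taken from the patch (sort is added only to make the doctest output deterministic):
>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> df2 = spark.createDataFrame([(80, "Tom"), (85, "Bob")], ["height", "name"])
>>> df.join(df2, df.name == df2.name, 'inner').drop(df.name).sort("age").collect()
[Row(age=14, height=80, name='Tom'), Row(age=16, height=85, name='Bob')]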
@HyukjinKwon I made some additional changes. I think we can start by merging this PR, and then I will make another one for the rest of the changes. I have a list of all the functions I changed in this PR; should I add it to the JIRA ticket to avoid duplicate changes?
I think you can reuse the same JIRA and make a follow-up.
python/pyspark/sql/dataframe.py
Outdated
>>> df
DataFrame[age: int, name: string]
>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"),
... (16, "Bob")], ["age", "name"])
ditto for indentation
python/pyspark/sql/dataframe.py
Outdated
Fill all null values with 50 when the data type of the column is an integer

>>> df = spark.createDataFrame([(10, 80, "Alice"), (5, None, "Bob"),
... (None, None, "Tom"), (None, None, None)], ["age", "height", "name"])
indentation
We should make the tests pass before merging this in (https://github.com/Transurgeon/spark/runs/7981691501). cc @dcoliversun @khalidmammadov FYI, if you find some time, please review and work on the rest of the API.
python/pyspark/sql/dataframe.py
Outdated
>>> df4.na.replace(['Alice', 'Bob'], ['A', 'B'], 'name').show()
Replace all instances of Alice to 'A' and Bob to 'B' under the name column

>>> df = spark.createDataFrame([(10, 80, "Alice"), (5, None, "Bob"),
Why do we need to create a duplicate DataFrame here? If we don't need it, it's better to delete it.
>>> df = spark.createDataFrame([(10, 80, "Alice"), (5, None, "Bob"),(None, None, "Tom"), (None, None, None)], ["age", "height", "name"])
>>> df.show()
+----+------+-----+
| age|height| name|
+----+------+-----+
| 10| 80|Alice|
| 5| null| Bob|
|null| null| Tom|
|null| null| null|
+----+------+-----+
>>> df.na.replace('Alice', None).show()
+----+------+----+
| age|height|name|
+----+------+----+
| 10| 80|null|
| 5| null| Bob|
|null| null| Tom|
|null| null|null|
+----+------+----+
>>> df.show()
+----+------+-----+
| age|height| name|
+----+------+-----+
| 10| 80|Alice|
| 5| null| Bob|
|null| null| Tom|
|null| null| null|
+----+------+-----+
>>> df.na.replace(['Alice', 'Bob'], ['A', 'B'], 'name').show()
+----+------+----+
| age|height|name|
+----+------+----+
| 10| 80| A|
| 5| null| B|
|null| null| Tom|
|null| null|null|
+----+------+----+
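Note that the example above also demonstrates that na.replace returns a new DataFrame rather than mutating in place: the second df.show() still prints the original, unmodified data.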
@Transurgeon Hi. Looks like CI is disabled in your fork repo.
I'm going to take this over if the PR author stays inactive for a few more days; this is the last task left for the umbrella task.
Co-authored-by: Hyukjin Kwon <[email protected]> Co-authored-by: Qian.Sun <[email protected]>
Hi Hyukjin and Oliver, thanks for all your feedback. I have created a commit with all your suggestions and allowed all jobs to run in GitHub Actions for my fork. I will make one last commit for further minor changes.
python/pyspark/sql/dataframe.py
Outdated
@@ -4100,7 +4256,7 @@ def toDF(self, *cols: "ColumnOrName") -> "DataFrame":
Parameters
----------
cols : str
new column names
new column names. The length of the list needs to be the same as the number of columns in the initial :class:`DataFrame`
@Transurgeon mind running the ./dev/lint-python script and fixing the line length, etc.?
Yes, will do, sorry about that.
python/pyspark/sql/dataframe.py
Outdated
>>> df.schema
StructType([StructField('age', IntegerType(), True),
StructField('name', StringType(), True)])
StructType([StructField('age', IntegerType(), True),
Let's remove the trailing space.
python/pyspark/sql/dataframe.py
Outdated
+---+-----+
only showing top 2 rows

Show DataFrame where the maximum number of characters is 3.
:class:`DataFrame`
python/pyspark/sql/dataframe.py
Outdated
>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"),
... (16, "Bob")], ["age", "name"])
>>> df.show()
+---+-----+
|age| name|
+---+-----+
| 14| Tom|
| 23|Alice|
| 16| Bob|
+---+-----+
I think we don't need to create a new DataFrame here, since drop() doesn't remove the column in-place.
e.g.
>>> df.drop('age').show()
+-----+
| name|
+-----+
| Tom|
|Alice|
| Bob|
+-----+
>>> df.drop(df.age).show()
+-----+
| name|
+-----+
| Tom|
|Alice|
| Bob|
+-----+
python/pyspark/sql/dataframe.py
Outdated
@@ -4100,7 +4256,7 @@ def toDF(self, *cols: "ColumnOrName") -> "DataFrame":
Parameters
----------
cols : str
new column names
new column names. The length of the list needs to be the same as the number of columns in the initial :class:`DataFrame`
Seems like it exceeds 100 characters, which violates the flake8 rule.
starting flake8 test...
flake8 checks failed:
./python/pyspark/sql/dataframe.py:4250:101: E501 line too long (128 > 100 characters)
"""
Returns a best-effort snapshot of the files that compose this :class:`DataFrame`.
This method simply asks each constituent BaseRelation for its respective files and
takes the union of all results. Depending on the source relations, this may not find
all input files. Duplicates are removed.
new column names. The length of the list needs to be the same as the number of columns in the initial :class:`DataFrame`
.. versionadded:: 3.1.0
Returns
-------
list
List of file paths.
Examples
--------
>>> df = spark.read.load("examples/src/main/resources/people.json", format="json")
>>> len(df.inputFiles())
1
"""
^
1 E501 line too long (128 > 100 characters)
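For what it's worth, one way to satisfy E501 here is to wrap the description onto an indented continuation line in the docstring. A minimal sketch; the exact wording and wrapping point are assumptions, not the patch's final text:
cols : str
    new column names. The length of the list needs to be the same as
    the number of columns in the initial :class:`DataFrame`.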
You can run dev/lint-python to check whether the static analysis passes.
done
python/pyspark/sql/dataframe.py
Outdated
StructType([StructField('age', IntegerType(), True),
StructField('name', StringType(), True)])
StructType([StructField('age', IntegerType(), True),
StructField('name', StringType(), True)])
Please avoid using tabs
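For reference, a sketch of the same expected doctest output indented with spaces instead of tabs. This assumes the doctest run enables NORMALIZE_WHITESPACE (as PySpark's doctest harness typically does), so the wrapped expected output still matches the single-line repr:
>>> df.schema
StructType([StructField('age', IntegerType(), True),
            StructField('name', StringType(), True)])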
Hey, let's co-author this change. I will create another PR on top of this PR to speed this up.
What changes were proposed in this pull request?
This PR proposes to improve the examples in pyspark.sql.dataframe by making each example self-contained and more realistic.
Why are the changes needed?
To make the documentation more readable and easy to copy and paste directly into the PySpark shell.
Does this PR introduce any user-facing change?
Yes, documentation changes only.
How was this patch tested?
Built the documentation locally.