
[SPARK-40012][PYTHON][DOCS] Make pyspark.sql.dataframe examples self-contained (Part 1) #37444

Closed
wants to merge 8 commits into from

Conversation

Transurgeon
Contributor

What changes were proposed in this pull request?

This PR proposes to improve the examples in pyspark.sql.dataframe by making each one self-contained, with more realistic data.

Why are the changes needed?

To make the documentation more readable, and the examples copy-and-pasteable directly into the PySpark shell.

Does this PR introduce any user-facing change?

Yes, documentation changes only.

How was this patch tested?

Built the documentation locally.

@Transurgeon
Contributor Author

I set the [WIP] tag because dataframe.py needs a lot of updates. I will add some additional changes to this PR in the upcoming days.

Please review and provide some feedback.
Thanks

@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon
Member

@Transurgeon is this still WIP? If there are too many to fix, feel free to split into multiple PRs.

@Transurgeon
Contributor Author

@HyukjinKwon

No, it's not WIP anymore. I wanted to get some feedback to see whether I was making good changes before continuing to work on it.

Should I remove the WIP tag?

>>> df.drop(df.age).collect()
[Row(name='Alice'), Row(name='Bob')]
[Row(name='Tom'), Row(name='Alice'), Row(name='Bob')]

>>> df.join(df2, df.name == df2.name, 'inner').drop(df.name).collect()
Contributor Author

I am not sure what these 3 inner joins do exactly. I don't see an instantiation of df2 anywhere.

What should I do with these 3 examples?

Member

I think it's showing a common pattern: joining two DataFrames and then dropping the join key.

@Transurgeon Transurgeon changed the title [SPARK-40012][PYTHON][DOCS][WIP] Make pyspark.sql.dataframe examples self-contained [SPARK-40012][PYTHON][DOCS] Make pyspark.sql.dataframe examples self-contained (Part 1) Aug 23, 2022
@Transurgeon
Contributor Author

@HyukjinKwon I made some additional changes. I think we can start by merging this PR, then I will make another one for the rest of the changes.

I have a list of all the functions I made changes to in this PR, should I add it to the JIRA ticket to avoid duplicate changes?

@HyukjinKwon
Member

I think you can reuse the same JIRA, and make a followup.

>>> df
DataFrame[age: int, name: string]
>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"),
... (16, "Bob")], ["age", "name"])
Member

ditto for indentation

Fill all null values with 50 when the data type of the column is an integer

>>> df = spark.createDataFrame([(10, 80, "Alice"), (5, None, "Bob"),
... (None, None, "Tom"), (None, None, None)], ["age", "height", "name"])
Member

indentation

@HyukjinKwon
Member

We should make the tests pass before merging it in (https://github.com/Transurgeon/spark/runs/7981691501).

cc @dcoliversun @khalidmammadov FYI if you guys find some time to review, and work on the rest of API.

>>> df4.na.replace(['Alice', 'Bob'], ['A', 'B'], 'name').show()
Replace all instances of Alice with 'A' and Bob with 'B' under the name column

>>> df = spark.createDataFrame([(10, 80, "Alice"), (5, None, "Bob"),
Contributor

Why do we need to create a duplicate DataFrame here? If we don't need it, it's better to delete it.

>>> df = spark.createDataFrame([(10, 80, "Alice"), (5, None, "Bob"),(None, None, "Tom"), (None, None, None)], ["age", "height", "name"])
>>> df.show()
+----+------+-----+                                                             
| age|height| name|
+----+------+-----+
|  10|    80|Alice|
|   5|  null|  Bob|
|null|  null|  Tom|
|null|  null| null|
+----+------+-----+

>>> df.na.replace('Alice', None).show()
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|null|
|   5|  null| Bob|
|null|  null| Tom|
|null|  null|null|
+----+------+----+

>>> df.show()
+----+------+-----+
| age|height| name|
+----+------+-----+
|  10|    80|Alice|
|   5|  null|  Bob|
|null|  null|  Tom|
|null|  null| null|
+----+------+-----+

>>> df.na.replace(['Alice', 'Bob'], ['A', 'B'], 'name').show()
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|   A|
|   5|  null|   B|
|null|  null| Tom|
|null|  null|null|
+----+------+----+

@dcoliversun
Contributor

dcoliversun commented Aug 24, 2022

@Transurgeon Hi. It looks like CI is disabled in your fork repo.
Maybe the doc can help you :)

@HyukjinKwon
Member

I'm going to take this over if the PR author stays inactive for a few more days; this is the last task left for the umbrella task.

Co-authored-by: Hyukjin Kwon <[email protected]>
Co-authored-by: Qian.Sun <[email protected]>
@Transurgeon
Contributor Author

Hi Hyukjin and Oliver, thanks all for your feedback.

I have created a commit with all your suggestions and enabled all jobs to run in GitHub Actions for my fork.

I will make one last commit for further minor changes.

@@ -4100,7 +4256,7 @@ def toDF(self, *cols: "ColumnOrName") -> "DataFrame":
Parameters
----------
cols : str
new column names
new column names. The length of the list needs to be the same as the number of columns in the initial :class:`DataFrame`
Member

@Transurgeon mind running the ./dev/lint-python script and fixing the line length, etc.?

Contributor Author

Yes, will do. Sorry about that.

>>> df.schema
StructType([StructField('age', IntegerType(), True),
StructField('name', StringType(), True)])
StructType([StructField('age', IntegerType(), True),
Member

Let's remove the trailing space.

+---+-----+
only showing top 2 rows

Show DataFrame where the maximum number of characters is 3.
Member

:class:`DataFrame`

Comment on lines 4216 to 4225
>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"),
... (16, "Bob")], ["age", "name"])
>>> df.show()
+---+-----+
|age| name|
+---+-----+
| 14| Tom|
| 23|Alice|
| 16| Bob|
+---+-----+
Contributor

I think we don't need to create a new DataFrame here, since drop() doesn't remove the column in place.

e.g.

>>> df.drop('age').show()
+-----+
| name|
+-----+
|  Tom|
|Alice|
|  Bob|
+-----+

>>> df.drop(df.age).show()
+-----+
| name|
+-----+
|  Tom|
|Alice|
|  Bob|
+-----+

@@ -4100,7 +4256,7 @@ def toDF(self, *cols: "ColumnOrName") -> "DataFrame":
Parameters
----------
cols : str
new column names
new column names. The length of the list needs to be the same as the number of columns in the initial :class:`DataFrame`
Contributor

Seems like it exceeds the 100-character limit, which violates the flake8 rule.

starting flake8 test...
flake8 checks failed:
./python/pyspark/sql/dataframe.py:4250:101: E501 line too long (128 > 100 characters)
        """
        Returns a best-effort snapshot of the files that compose this :class:`DataFrame`.
        This method simply asks each constituent BaseRelation for its respective files and
        takes the union of all results. Depending on the source relations, this may not find
        all input files. Duplicates are removed.

        new column names. The length of the list needs to be the same as the number of columns in the initial :class:`DataFrame`

        .. versionadded:: 3.1.0

        Returns
        -------
        list
            List of file paths.

        Examples
        --------
        >>> df = spark.read.load("examples/src/main/resources/people.json", format="json")
        >>> len(df.inputFiles())
        1
        """

                                                                                        ^
1     E501 line too long (128 > 100 characters)
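One way to satisfy the 100-character limit is simply to wrap the parameter description across lines in the numpydoc block, e.g.:

```
cols : str
    new column names. The length of the list needs to be the same as
    the number of columns in the initial :class:`DataFrame`.
```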

Contributor

You can run dev/lint-python to check if the static analysis is passed.

Contributor Author

done

StructType([StructField('age', IntegerType(), True),
StructField('name', StringType(), True)])
StructType([StructField('age', IntegerType(), True),
StructField('name', StringType(), True)])
Member

Please avoid using tabs

@HyukjinKwon
Member

Hey, let's co-author this change. I will create another PR on the top of this PR to speed this up.
