[SPARK-34887][PYTHON] Port Koalas dependencies into PySpark #32386

Closed · wants to merge 6 commits
python/docs/source/getting_started/install.rst (6 additions, 3 deletions)

@@ -152,14 +152,17 @@ To install PySpark from source, refer to |building_spark|_.

 Dependencies
 ------------
-============= ========================= ================
+============= ========================= ============================
 Package       Minimum supported version Note
-============= ========================= ================
+============= ========================= ============================
 `pandas`      0.23.2                    Optional for SQL
 `NumPy`       1.7                       Required for ML
 `pyarrow`     1.0.0                     Optional for SQL
 `Py4J`        0.10.9.2                  Required
-============= ========================= ================
+`pandas`      0.23.2                    Required for pandas-on-Spark
+`pyarrow`     1.0.0                     Required for pandas-on-Spark
+`NumPy`       1.14 (<1.20.0)            Required for pandas-on-Spark
+============= ========================= ============================

 Note that PySpark requires Java 8 or later with ``JAVA_HOME`` properly set.
 If using JDK 11, set ``-Dio.netty.tryReflectionSetAccessible=true`` for Arrow related features and refer
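
As a quick, illustrative check (not part of this PR), the installed versions can be compared against the pandas-on-Spark minimums pinned in the table above:

# Illustrative snippet only: print installed versions to compare against
# the minimums documented in the table above.
import numpy
import pandas
import pyarrow

print('pandas :', pandas.__version__)   # table requires >= 0.23.2
print('pyarrow:', pyarrow.__version__)  # table requires >= 1.0.0
print('numpy  :', numpy.__version__)    # table requires >= 1.14, < 1.20.0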
python/setup.py (13 additions, 1 deletion)

@@ -220,6 +220,13 @@ def run(self):
         'pyspark.bin',
         'pyspark.sbin',
         'pyspark.jars',
+        'pyspark.pandas',
+        'pyspark.pandas.indexes',
+        'pyspark.pandas.missing',
+        'pyspark.pandas.plot',
+        'pyspark.pandas.spark',
+        'pyspark.pandas.typedef',
+        'pyspark.pandas.usage_logging',
         'pyspark.python.pyspark',
         'pyspark.python.lib',
         'pyspark.data',
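
For anyone trying the branch locally, a minimal sanity check (a sketch, assuming a build with these packaging changes is installed) is to import each newly listed subpackage:

# Sketch only: confirm the subpackages added to setup.py above are importable.
import importlib

for name in [
    'pyspark.pandas',
    'pyspark.pandas.indexes',
    'pyspark.pandas.missing',
    'pyspark.pandas.plot',
    'pyspark.pandas.spark',
    'pyspark.pandas.typedef',
    'pyspark.pandas.usage_logging',
]:
    importlib.import_module(name)
    print(name, 'OK')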
@@ -257,7 +264,12 @@ def run(self):
         'sql': [
             'pandas>=%s' % _minimum_pandas_version,
             'pyarrow>=%s' % _minimum_pyarrow_version,

Member:

nit: The pyarrow minimum version is introduced in _minimum_pyarrow_version, but if we introduce pyarrow>=0.10 for pandas_on_spark here, maybe we could change _minimum_pyarrow_version to something like _minimum_sql_pyarrow_version.

[1] https://github.com/apache/spark/pull/32386/files#diff-eb8b42d9346d0a5d371facf21a8bfa2d16fb49e213ae7c21f03863accebe0fcfR115

Member (Author):

Good catch! There is only one usage of _minimum_pyarrow_version. How about removing the variable and using the value instead?

Member:

If we can use the same lower bound, I prefer to just use _minimum_pyarrow_version directly, as suggested in #32386 (comment):

'pandas_on_spark': [
    'pandas>=%s' % _minimum_pandas_version,
    'pyarrow>=%s' % _minimum_pyarrow_version,
]

Member (Author):

Sounds good.

-        ]
+        ],
+        'pandas_on_spark': [
+            'pandas>=%s' % _minimum_pandas_version,
+            'pyarrow>=%s' % _minimum_pyarrow_version,
+            'numpy>=1.14,<1.20.0',
+        ],
     },
     python_requires='>=3.6',
     classifiers=[
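
To make the extras expansion concrete, here is a small standalone sketch (version values copied from the install.rst table above, variable names mirroring setup.py) of the requirement strings pip resolves for pip install pyspark[pandas_on_spark]:

# Standalone sketch: how the '%s' placeholders in extras_require expand.
# Version values mirror the minimums documented in install.rst above.
_minimum_pandas_version = '0.23.2'
_minimum_pyarrow_version = '1.0.0'

extras_require = {
    'pandas_on_spark': [
        'pandas>=%s' % _minimum_pandas_version,
        'pyarrow>=%s' % _minimum_pyarrow_version,
        'numpy>=1.14,<1.20.0',
    ],
}

print(extras_require['pandas_on_spark'])
# -> ['pandas>=0.23.2', 'pyarrow>=1.0.0', 'numpy>=1.14,<1.20.0']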