[SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect #40525
Conversation
The remaining task at hand is to address numerous mypy annotation issues. If you have any good ideas for resolving the linter issues, please feel free to let me know at any time :-)
CI passed. cc @HyukjinKwon @ueshin @xinrong-meng @zhengruifeng PTAL when you find some time. I summarized the key changes in the PR description for review.
Merged to master.
@itholic Thank you, great work :)

After this PR, importing `pyspark.pandas` fails when `grpcio` is not installed:

```
ModuleNotFoundError                       Traceback (most recent call last)
ModuleNotFoundError: No module named 'grpc'

The above exception was the direct cause of the following exception:

ImportError                               Traceback (most recent call last)
File /opt/spark/python/pyspark/pandas/__init__.py:59
File /opt/spark/python/pyspark/pandas/frame.py:88
File /opt/spark/python/pyspark/pandas/_typing.py:25
File /opt/spark/python/pyspark/sql/connect/column.py:19
File /opt/spark/python/pyspark/sql/connect/utils.py:35, in check_dependencies(mod_name)
File /opt/spark/python/pyspark/sql/connect/utils.py:47, in require_minimum_grpc_version()

ImportError: grpc >= 1.48.1 must be installed; however, it was not found.
```

And `pip install grpc` fails as well:

```
Collecting grpc
× python setup.py egg_info did not run successfully.
note: This error originates from a subprocess, and is likely not a problem with pip.
× Encountered error while generating package metadata.
note: This is an issue with the package mentioned above, not pip.
```

I don't think every pandas user that tries the pandas API on Spark will use Spark Connect. So can we change this back?
Thank you for the feedback, @bjornjorgensen! IMHO, it seems more reasonable to add `grpcio` as a dependency.

From my understanding, the purpose of Spark Connect is to allow users to use an existing PySpark project without any code changes through a remote client. Therefore, if a user's existing PySpark project uses the pandas API on Spark, it should work through Spark Connect as well. So I think we should support as much of the functionality of PySpark as possible, including the pandas API on Spark, since at this point nobody can be sure whether it will be existing PySpark users or existing pandas users who use the pandas API on Spark through Spark Connect.

Alternatively, we might be able to create a completely separate package path for the pandas API on Spark for Spark Connect. This would allow the existing pandas API on Spark to be used without installing `grpcio`.

WDYT? Also cc @HyukjinKwon @grundprinzip @ueshin @zhengruifeng FYI
The problem is that you don't use Spark Connect, but it complains that it needs `grpcio`.
For example, you can't just `import pyspark.pandas` anymore without `grpcio` installed.
Got it. Then I think we need to modify the current code so that the Spark Connect modules are imported only when a remote session is actually in use.

**Before**

```python
from pyspark.sql.column import Column as PySparkColumn
from pyspark.sql.connect.column import Column as ConnectColumn
from pyspark.sql.utils import is_remote


def example():
    Column = ConnectColumn if is_remote() else PySparkColumn
    return Column.__eq__
```

**After**

```python
from pyspark.sql.column import Column as PySparkColumn
from pyspark.sql.utils import is_remote


def example():
    if is_remote():
        from pyspark.sql.connect.column import Column as ConnectColumn

        Column = ConnectColumn
    else:
        Column = PySparkColumn
    return Column.__eq__
```

Also, we should remove the remaining module-level imports of `pyspark.sql.connect` in the same way. Can you think of a better solution? As of now, I haven't come up with a better way than this. Thanks!
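With this deferred import, `pyspark.sql.connect.column` (and therefore `grpcio`) is only loaded once `is_remote()` actually reports a remote session, so a plain `import pyspark.pandas` against a regular Spark session no longer pulls in the Connect dependencies.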
"The problem is that you don't use Spark Connect but it complains that it needs grpcio" The is_remote() can probably work. |
We need to be careful about changing the imports too early or too late. The way the is-remote functionality works is by checking an environment variable; if it's not set during the import, setting it later yields undesired consequences.

Is it so hard to add the dependency on gRPC when using the pandas API? What are we achieving with this? gRPC is a stable protocol and not a random library, and it's available on all platforms. What's the benefit of trying this pure approach?
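For context, the check mentioned above boils down to an environment-variable lookup along these lines (a minimal sketch; the real helper is `pyspark.sql.utils.is_remote`, and the exact variable handling may differ):

```python
import os


def is_remote() -> bool:
    # Sketch: a remote (Spark Connect) session is signalled by the
    # SPARK_REMOTE environment variable, set e.g. by `pyspark --remote`.
    return "SPARK_REMOTE" in os.environ
```

Because the variable is read at call time, code that captures the result at import time (rather than calling `is_remote()` inside the function) would miss a session that only becomes remote later, which is the concern raised above.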
One of the key things about the pandas API on Spark is that users only have to change the import.

> Does this PR introduce any user-facing change?
That's only partially true: they still have to install PySpark. If they `pip install pyspark[connect]`, the dependencies are properly resolved.
Actually, even today pandas-on-Spark users have to install specific dependencies that are not part of the default PySpark installation. See https://github.com/apache/spark/blob/master/python/setup.py

Now, the quickest fix is to simply add `grpcio` to the pandas-on-Spark dependency list. Given that we're already forcing pandas users into specific Java versions etc., adding `grpcio` does not make it worse in my opinion. The user surface remains the same.
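To make that concrete, extras in a `setup.py` are declared roughly like this (an illustrative sketch only: the package name and version pins are placeholders, not the actual apache/spark `setup.py`):

```python
from setuptools import setup

setup(
    name="pyspark-extras-sketch",  # placeholder name
    version="0.0.0",
    extras_require={
        # The pandas API on Spark already needs extras beyond base PySpark:
        "pandas_on_spark": ["pandas", "pyarrow", "numpy"],
        # Spark Connect pulls in the gRPC stack; 1.48.1 matches the
        # minimum version from the ImportError above:
        "connect": ["grpcio>=1.48.1"],
    },
)
```

The "quickest fix" described above would amount to adding `grpcio>=1.48.1` to the pandas-on-Spark extra as well.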
The modules that the pandas API on Spark needs are pandas, pyarrow, and numpy, which every pandas user already has installed. Why does the pandas API on Spark need such high coupling to the connect module?
IIUC only `grpcio` is needed. The rest is localized to the Spark Connect client.
It's not super hard, but it's a bit odd to add this alone to the pandas API on Spark. We should probably think about adding `grpcio` as a hard dependency for the whole PySpark project, but definitely not for the pandas API on Spark alone.
For the current status, we're trying to add, per module, the dependencies that the module needs, so users won't need to install unnecessary dependencies. In addition, adding the dependency breaks existing applications when they migrate from 3.4 to 3.5. It matters when PySpark is installed without the optional Connect dependencies.
So my proposal is to keep it consistent for now (so that plain PySpark does not require `grpcio`).
I think this point is particularly crucial, even though all of your opinions make sense to me.
```
@@ -778,6 +778,68 @@ def __hash__(self):
    # ml unittests
    "pyspark.ml.tests.connect.test_connect_function",
    "pyspark.ml.tests.connect.test_parity_torch_distributor",
    # pandas-on-Spark unittests
```
Since `pyspark-connect` takes 2~3 hours after adding the PS-related unittests, I guess we can add new modules `pyspark-connect-pandas` and `pyspark-connect-pandas-slow`. @itholic
Let me address this in a follow-up PR. Thanks!
### What changes were proposed in this pull request?

Add missing UTs.

### Why are the changes needed?

The two UTs were missing in #40525.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Updated CI.

Closes #42262 from zhengruifeng/ps_test_ewm.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
### What changes were proposed in this pull request?

This PR proposes to support the pandas API on Spark for Spark Connect. It includes minimal changes to support the basic functionality of the pandas API on Spark Connect, and sets up a testing environment in `pyspark/pandas/tests/connect`, reusing all existing pandas-API-on-Spark test bases to test the functionality in a remote Spark session.

Here is a summary of the key tasks:

- Tests in the `python/pyspark/pandas/tests/` directory can now be performed on Spark Connect by adding corresponding tests to the `python/pyspark/pandas/tests/connect/` directory.
- Rather than adding a separate path such as `python/pyspark/sql/connect` for Spark Connect, I modified the existing files of `pyspark.pandas`. This allows users to use the existing pandas-on-Spark code as it is on Spark Connect.
- `python/pyspark/pandas/_typing.py` addresses both the PySpark Column and the Spark Connect Column in a single path (see the sketch below):
  - `GenericColumn` for typing both PySpark Column and Spark Connect Column.
  - `GenericDataFrame` for typing both PySpark DataFrame and Spark Connect DataFrame.
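As a rough illustration of the generic-typing idea (a sketch only; the actual definitions in `_typing.py` may differ, e.g. they could be `TypeVar`s rather than plain unions):

```python
from typing import TYPE_CHECKING, Union

# Imported only for type checking, so importing this module does not
# require grpcio at runtime.
if TYPE_CHECKING:
    from pyspark.sql.column import Column as PySparkColumn
    from pyspark.sql.connect.column import Column as ConnectColumn

# One annotation that covers both the classic and the Connect classes.
GenericColumn = Union["PySparkColumn", "ConnectColumn"]
```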
### Why are the changes needed?

By supporting the pandas API on Spark Connect, we can significantly improve usability for existing PySpark and pandas users.
### Does this PR introduce any user-facing change?

No, because it is designed to allow existing code written for regular Spark sessions to be used without any user-facing change other than switching the regular Spark session to a remote Spark session. However, since some features of the existing pandas API on Spark are not fully supported yet, some functionality may be limited.
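For example, existing pandas-on-Spark code can target a remote cluster with only the session creation changed (a minimal sketch; it assumes a Spark Connect server is reachable at `sc://localhost` and that the `pyspark[connect]` dependencies such as `grpcio` are installed):

```python
from pyspark.sql import SparkSession
import pyspark.pandas as ps

# The only change versus a regular session: connect to a remote
# Spark Connect server instead of starting a local JVM.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

# Existing pandas-on-Spark code runs unchanged; execution happens remotely.
psdf = ps.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(psdf.sum())
```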
### How was this patch tested?

A testing bed has been set up to reproduce all existing pandas-on-Spark tests for Spark Connect, ensuring that the existing tests can be replicated on Spark Connect. The current results for all tests are as below: