Spark 615 map partitions with index callable from java #16

holdenk
Contributor

@holdenk holdenk commented Feb 27, 2014

No description provided.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12894/

@mateiz
Contributor

mateiz commented Mar 4, 2014

Hey Holden, wait on this a bit until #17 is merged. Then we'll also want to make sure it works with Java 8 (you'll need to make the class an interface and such).

@pwendell
Contributor

pwendell commented Mar 8, 2014

@holdenk mind bumping this now that #17 is in? You'll have to change `extends` to `with`, since the function classes are now interfaces rather than abstract classes.
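
For context on the two review comments above, here is a minimal Java sketch of why moving from abstract classes to interfaces matters for Java 8: a lambda can only target an interface with a single abstract method, never an abstract class. The names `IndexedFunctionSketch`, `IndexedPartitionFunction`, and `applyToPartition` are hypothetical and exist only for illustration; they are not Spark's actual API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class IndexedFunctionSketch {

    // Hypothetical single-method interface (an assumption, not Spark's API).
    // Because it is an interface with exactly one abstract method, a Java 8
    // lambda can implement it; an abstract class could not be a lambda target.
    @FunctionalInterface
    interface IndexedPartitionFunction<T, R> {
        Iterator<R> call(int partitionIndex, Iterator<T> elements);
    }

    // Hypothetical helper: applies f to one partition and collects the results.
    public static <T, R> List<R> applyToPartition(
            int index, List<T> partition, IndexedPartitionFunction<T, R> f) {
        List<R> out = new ArrayList<>();
        Iterator<R> results = f.call(index, partition.iterator());
        while (results.hasNext()) {
            out.add(results.next());
        }
        return out;
    }

    // Java 8 style: the function is written as a lambda.
    public static List<String> tagWithLambda(int index, List<String> partition) {
        IndexedPartitionFunction<String, String> f = (idx, it) -> {
            List<String> tagged = new ArrayList<>();
            while (it.hasNext()) {
                tagged.add(idx + ":" + it.next());
            }
            return tagged.iterator();
        };
        return applyToPartition(index, partition, f);
    }

    // Pre-Java 8 style: the same function as an anonymous class.
    public static List<String> tagWithAnonymousClass(int index, List<String> partition) {
        return applyToPartition(index, partition,
            new IndexedPartitionFunction<String, String>() {
                @Override
                public Iterator<String> call(int idx, Iterator<String> it) {
                    List<String> tagged = new ArrayList<>();
                    while (it.hasNext()) {
                        tagged.add(idx + ":" + it.next());
                    }
                    return tagged.iterator();
                }
            });
    }

    public static void main(String[] args) {
        List<String> partition = Arrays.asList("a", "b");
        System.out.println(tagWithLambda(1, partition));
        System.out.println(tagWithAnonymousClass(1, partition));
    }
}
```

Both helpers produce the same result; the interface form simply allows the shorter lambda syntax. On the Scala side, a class that already extends another class mixes in an interface using `with`, which is presumably the `extends` to `with` change requested above.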

@holdenk
Contributor Author

holdenk commented Mar 8, 2014

Sure, I'll give this a shot today :)

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

One or more automated tests failed
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13075/

@JoshRosen
Contributor

Sorry to necro the oldest open PR, but do you mind closing this now that mapPartitionsWithIndex has been fixed? Thanks!

@holdenk holdenk closed this Aug 24, 2014
jackylk pushed a commit to jackylk/spark that referenced this pull request Nov 8, 2014
JasonMWhite pushed a commit to JasonMWhite/spark that referenced this pull request Dec 2, 2015
Fix java.util.MissingFormatArgumentException in statsd module
asfgit pushed a commit that referenced this pull request Mar 29, 2016
## What changes were proposed in this pull request?

This PR brings the support for chained Python UDFs, for example

```sql
select udf1(udf2(a))
select udf1(udf2(a) + 3)
select udf1(udf2(a) + udf3(b))
```

Directly chained unary Python UDFs are also evaluated in a single batch of Python UDFs; other combinations may require multiple batches.

For example,
```python
>>> sqlContext.sql("select double(double(1))").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [pythonUDF#10 AS double(double(1))#9]
:     +- INPUT
+- !BatchPythonEvaluation double(double(1)), [pythonUDF#10]
   +- Scan OneRowRelation[]
>>> sqlContext.sql("select double(double(1) + double(2))").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [pythonUDF#19 AS double((double(1) + double(2)))#16]
:     +- INPUT
+- !BatchPythonEvaluation double((pythonUDF#17 + pythonUDF#18)), [pythonUDF#17,pythonUDF#18,pythonUDF#19]
   +- !BatchPythonEvaluation double(2), [pythonUDF#17,pythonUDF#18]
      +- !BatchPythonEvaluation double(1), [pythonUDF#17]
         +- Scan OneRowRelation[]
```

TODO: support multiple unrelated Python UDFs in one batch (in another PR).

## How was this patch tested?

Added new unit tests for chained UDFs.

Author: Davies Liu <[email protected]>

Closes #12014 from davies/py_udfs.
Parth-Brahmbhatt pushed a commit to Parth-Brahmbhatt/spark that referenced this pull request Jul 25, 2016
…-1656 to netflix/1.6.1

* commit '5b54d2fbb11b45298440d77deb06514f12c47b40':
  [DSEPLAT-1656] Upgrade the version of metacat client, benjamin and bdurl.
AnthonyTruchet added a commit to AnthonyTruchet/spark that referenced this pull request Dec 12, 2016
Fix dev tools and add some new, Criteo specific ones.
lins05 pushed a commit to lins05/spark that referenced this pull request Jan 17, 2017
* Documentation for the current state of the world.

* Adding navigation links from other pages

* Address comments, add TODO for things that should be fixed

* Address comments, mostly making images section clearer

* Virtual runtime -> container runtime
icexelloss added a commit to icexelloss/spark that referenced this pull request Apr 28, 2017
Move column writers to Arrow.scala

Add support for more types; Switch to arrow NullableVector

closes apache#16
erikerlandson pushed a commit to erikerlandson/spark that referenced this pull request Jul 28, 2017
* Documentation for the current state of the world.

* Adding navigation links from other pages

* Address comments, add TODO for things that should be fixed

* Address comments, mostly making images section clearer

* Virtual runtime -> container runtime
sven0726 pushed a commit to sven0726/spark that referenced this pull request Dec 3, 2018
hn5092 added a commit to hn5092/spark that referenced this pull request Apr 25, 2019
 upgrade spark version to 2.4.1-kylin-r5
hn5092 added a commit to hn5092/spark that referenced this pull request Jul 17, 2019
 upgrade spark version to 2.4.1-kylin-r5
hn5092 added a commit to hn5092/spark that referenced this pull request Jul 18, 2019
SirOibaf added a commit to SirOibaf/spark that referenced this pull request Jun 11, 2020
ringtail added a commit to ringtail/spark that referenced this pull request Jan 21, 2021
redsanket pushed a commit to redsanket/spark that referenced this pull request Feb 16, 2021
[YSPARK-1523] Cleanup hbaseread.py
HyukjinKwon pushed a commit that referenced this pull request Apr 22, 2023
…onnect

### What changes were proposed in this pull request?
Implement Arrow-optimized Python UDFs in Spark Connect.

Please see #39384 for the motivation and performance improvements of Arrow-optimized Python UDFs.

### Why are the changes needed?
Parity with vanilla PySpark.

### Does this PR introduce _any_ user-facing change?
Yes. In Spark Connect Python Client, users can:

1. Set the `useArrow` parameter to `True` to enable Arrow optimization for a specific Python UDF.

```python
>>> df = spark.range(2)
>>> df.select(udf(lambda x : x + 1, useArrow=True)('id')).show()
+------------+
|<lambda>(id)|
+------------+
|           1|
|           2|
+------------+

# ArrowEvalPython indicates Arrow optimization
>>> df.select(udf(lambda x : x + 1, useArrow=True)('id')).explain()
== Physical Plan ==
*(2) Project [pythonUDF0#18 AS <lambda>(id)#16]
+- ArrowEvalPython [<lambda>(id#14L)#15], [pythonUDF0#18], 200
   +- *(1) Range (0, 2, step=1, splits=1)
```

2. Enable the `spark.sql.execution.pythonUDF.arrow.enabled` Spark configuration to make all Python UDFs Arrow-optimized.

```python
>>> spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", True)
>>> df.select(udf(lambda x : x + 1)('id')).show()
+------------+
|<lambda>(id)|
+------------+
|           1|
|           2|
+------------+

# ArrowEvalPython indicates Arrow optimization
>>> df.select(udf(lambda x : x + 1)('id')).explain()
== Physical Plan ==
*(2) Project [pythonUDF0#30 AS <lambda>(id)#28]
+- ArrowEvalPython [<lambda>(id#26L)#27], [pythonUDF0#30], 200
   +- *(1) Range (0, 2, step=1, splits=1)

```

### How was this patch tested?
Parity unit tests.

Closes #40725 from xinrong-meng/connect_arrow_py_udf.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
risyomei pushed a commit to risyomei/spark that referenced this pull request Jun 26, 2023