SPARK-1652: Remove incorrect deprecation warning in spark-submit #578

Closed
pwendell wants to merge 3 commits into apache:master from pwendell:spark-submit-yarn

Conversation

pwendell
Contributor

This is a straightforward fix.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14525/

@@ -124,6 +124,8 @@ object SparkSubmit {
       }
     } else if (clusterManager == YARN) {
       childMainClass = "org.apache.spark.deploy.yarn.Client"
+      // Special flag to avoid deprecation warnings at the client
+      sysProps("SPARK_SUBMIT_YARN") = "true"
@sryza
Contributor

Nit: is YARN relevant? Could call it SPARK_SUBMIT and use it for standalone with cluster deploy mode as well.

@pwendell
Contributor Author

One problem we have with existing configs and envs is that someone creates one for a single purpose, then it gets used for a zillion other purposes, and it becomes really hard to understand what it's for.

So in general I'm biased at this point towards creating fairly narrowly scoped internal options whenever possible and generalizing them (lazily) when it's necessary. In this case I'd actually hope we can just remove this thing if/when we decide to remove the YARN clients.

@pwendell
Contributor Author

@sryza if you feel really strongly I could make it more general. I'm a bit scarred from some cases where options like this have taken on lives of their own :)

@sryza
Contributor

Even if it's a hack / workaround, I think it's probably better to work with the assumption that it'll end up sticking around for a while. With that in mind, a system property that signifies we started something from spark-submit makes more sense to me than a system property that signifies we started from spark-submit AND happen to be running YARN, especially if we'll need a similar one for standalone mode. Do you think SPARK_SUBMIT_YARN would be less likely to be abused than SPARK_SUBMIT?

That said, it's a tiny issue, and I don't feel super strongly, so if you feel differently, go ahead with the current patch.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@pwendell
Contributor Author

Alright @sryza I updated it. I think the fact we'll need this for standalone mode justifies generalizing it.
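
For readers following along, here is a minimal Scala sketch of the mechanism being discussed: spark-submit sets an internal system property before launching the cluster-manager client, and the client checks for it before printing its deprecation warning. The object names, the generalized property name (`SPARK_SUBMIT`), and the warning text below are illustrative assumptions based on this thread, not the exact code in the patch.

```scala
// Sketch only: how a spark-submit-set flag can suppress a client-side
// deprecation warning. Names here are assumptions, not the patch's exact code.
object SparkSubmitSketch {
  def launchViaYarn(sysProps: scala.collection.mutable.Map[String, String]): Unit = {
    // Mark that this process was launched through spark-submit.
    sysProps("SPARK_SUBMIT") = "true"
  }
}

object YarnClientSketch {
  def maybeWarnDeprecated(): Unit = {
    // Warn only when the client is invoked directly, not via spark-submit.
    if (!sys.props.contains("SPARK_SUBMIT")) {
      System.err.println(
        "WARNING: Invoking the YARN client directly is deprecated; use spark-submit instead.")
    }
  }
}
```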

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14532/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14536/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14543/

@pwendell
Contributor Author

Jenkins, retest this pleas.e

@pwendell
Contributor Author

Jenkins, retest this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14547/

asfgit closed this in 9f7a095 Apr 29, 2014
asfgit pushed a commit that referenced this pull request Apr 29, 2014
This is a straightforward fix.

Author: Patrick Wendell <[email protected]>

This patch had conflicts when merged, resolved by
Committer: Patrick Wendell <[email protected]>

Closes #578 from pwendell/spark-submit-yarn and squashes the following commits:

96027c7 [Patrick Wendell] Test fixes
b5be173 [Patrick Wendell] Review feedback
4ac9cac [Patrick Wendell] SPARK-1652: spark-submit for yarn prints warnings even though calling as expected
(cherry picked from commit 9f7a095)

Signed-off-by: Patrick Wendell <[email protected]>
pwendell pushed a commit to pwendell/spark that referenced this pull request May 12, 2014
SPARK-1076: zipWithIndex and zipWithUniqueId to RDD

Assigning ranks to an ordered or unordered data set is a common operation. This could be done by first counting records in each partition and then assigning ranks in parallel.

The purpose of assigning ranks to an unordered set is usually to get a unique id for each item, e.g., to map feature names to feature indices. In such cases, the assignment could be done without counting records, saving one Spark job.

https://spark-project.atlassian.net/browse/SPARK-1076

== update ==
Because assigning ranks is very similar to Scala's zipWithIndex, I changed the method name to zipWithIndex and put the index in the value field.

Author: Xiangrui Meng <[email protected]>

Closes apache#578 and squashes the following commits:

52a05e1 [Xiangrui Meng] changed assignRanks to zipWithIndex changed assignUniqueIds to zipWithUniqueId minor updates
756881c [Xiangrui Meng] simplified RankedRDD by implementing assignUniqueIds separately moved couting iterator size to Utils do not count items in the last partition and skip counting if there is only one partition
630868c [Xiangrui Meng] newline
21b434b [Xiangrui Meng] add assignRanks and assignUniqueIds to RDD
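
Since the commit message describes the two operations only in prose, a small usage sketch may help; the method names match what landed in the RDD API, while the local-mode setup and sample data are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ZipWithIndexExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("zip-example").setMaster("local[2]"))
    val rdd = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)

    // zipWithIndex assigns consecutive indices; it runs one extra job to count
    // the records in each partition so per-partition offsets can be computed.
    rdd.zipWithIndex().collect().foreach(println)    // (a,0), (b,1), (c,2), (d,3)

    // zipWithUniqueId derives an id from the position within the partition and
    // the partition index alone, so no counting job is needed, but the ids are
    // merely unique, not consecutive.
    rdd.zipWithUniqueId().collect().foreach(println)

    sc.stop()
  }
}
```
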
pwendell pushed a commit to pwendell/spark that referenced this pull request May 12, 2014
SPARK-1076: Convert Int to Long to avoid overflow

Patch for PR apache#578.

Author: Xiangrui Meng <[email protected]>

Closes apache#589 and squashes the following commits:

98c435e [Xiangrui Meng] cast Int to Long to avoid Int overflow
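
A tiny, self-contained snippet illustrates the overflow this follow-up guards against (the per-partition counts below are made up):

```scala
// Each partition count fits in an Int, but the running total across
// partitions can exceed Int.MaxValue, so accumulation must happen in Long.
val countsPerPartition: Array[Int] = Array.fill(2)(1500000000) // 1.5 billion each

val wrong: Int    = countsPerPartition.sum               // wraps around to -1294967296
val correct: Long = countsPerPartition.map(_.toLong).sum // 3000000000
```
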
pwendell pushed a commit to pwendell/spark that referenced this pull request May 12, 2014
SPARK-1076: [Fix apache#578] add @transient to some vals

I'll try to be more careful next time.

Author: Xiangrui Meng <[email protected]>

Closes apache#591 and squashes the following commits:

2b4f044 [Xiangrui Meng] add @transient to prev in ZippedWithIndexRDD add @transient to seed in PartitionwiseSampledRDD
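
For context on why the annotation matters, here is a simplified, illustrative sketch (the class and field names stand in for the real ZippedWithIndexRDD; this is not the actual Spark code): marking the parent RDD reference @transient keeps it out of the serialized form of the derived object when it is shipped to executors.

```scala
import org.apache.spark.rdd.RDD

// Simplified stand-in for an RDD wrapper class: `prev` is only needed on the
// driver (e.g. to compute partition metadata), so it is marked @transient and
// will not be serialized along with instances of this class.
class DerivedRddSketch[T](@transient private val prev: RDD[T]) extends Serializable {
  // Computed eagerly on the driver; cheap to serialize.
  val numParentPartitions: Int = prev.partitions.length
}
```
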
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
gzm55 pushed a commit to MediaV/spark that referenced this pull request Jul 17, 2014
SPARK-1076: zipWithIndex and zipWithUniqueId to RDD
gzm55 pushed a commit to MediaV/spark that referenced this pull request Jul 17, 2014
SPARK-1076: Convert Int to Long to avoid overflow
gzm55 pushed a commit to MediaV/spark that referenced this pull request Jul 17, 2014
SPARK-1076: [Fix apache#578] add @transient to some vals
helenyugithub pushed a commit to helenyugithub/spark that referenced this pull request Aug 20, 2019
…taFrame to R DataFrame (apache#578)

## What changes were proposed in this pull request?

This PR targets to support Arrow optimization for conversion from Spark DataFrame to R DataFrame.
Like PySpark side, it falls back to non-optimization code path when it's unable to use Arrow optimization.

This can be tested as below:

```bash
$ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
```

```r
collect(createDataFrame(mtcars))
```

### Requirements
  - R 3.5.x
  - Arrow package 0.12+
    ```bash
    Rscript -e 'remotes::install_github("apache/arrow@apache-arrow-0.12.0", subdir = "r")'
    ```

**Note:** currently, the Arrow R package is not on CRAN. Please take a look at ARROW-3204.
**Note:** currently, the Arrow R package does not seem to support Windows. Please take a look at ARROW-3204.

### Benchmarks

**Shell**

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=false --driver-memory 4g
```

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=true --driver-memory 4g
```

**R code**

```r
df <- cache(createDataFrame(read.csv("500000.csv")))
count(df)

test <- function() {
  options(digits.secs = 6) # milliseconds
  start.time <- Sys.time()
  collect(df)
  end.time <- Sys.time()
  time.taken <- end.time - start.time
  print(time.taken)
}

test()
```

**Data (350 MB):**

```r
object.size(read.csv("500000.csv"))
350379504 bytes
```

"500000 Records"  http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/

**Results**

```
Time difference of 221.32014 secs
```

```
Time difference of 15.51145 secs
```

The Arrow-optimized run was roughly **14x** faster.

### Limitations:

- For now, Arrow optimization with R is not supported when the data is `raw` or when the user explicitly gives a float type in the schema; these cases produce corrupt values, so we fall back to the non-optimized code path.

- Due to ARROW-4512, batches cannot be sent and received one at a time; all batches have to be sent at once in Arrow stream format. This needs improvement later.

## How was this patch tested?

Existing tests related with Arrow optimization cover this change. Also, manually tested.
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
1. Remove deprecated role "install-docker-ce"
2. Apply role "install-openjdk" instead of installing OpenJDK in run.yaml
3. Add option "--r" to the "make-distribution.sh" command in the minikube job
   to avoid a Docker build error caused by the R code not being found in the
   Dockerfile.

Related-Bug: theopenlab/openlab#310
Related-Bug: theopenlab/openlab#308
arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020