SPARK-1652: Remove incorrect deprecation warning in spark-submit #578

Closed
pwendell wants to merge 3 commits into apache:master from pwendell:spark-submit-yarn

Conversation

pwendell
Contributor

This is a straightforward fix.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14525/

@@ -124,6 +124,8 @@ object SparkSubmit {
       }
     } else if (clusterManager == YARN) {
       childMainClass = "org.apache.spark.deploy.yarn.Client"
+      // Special flag to avoid deprecation warnings at the client
+      sysProps("SPARK_SUBMIT_YARN") = "true"
@sryza
Contributor

Nit: is YARN relevant? Could call it SPARK_SUBMIT and use it for standalone with cluster deploy mode as well.

@pwendell
Contributor Author

One problem we have with existing configs and envs is that someone creates one for a single purpose, then it gets used for a zillion other purposes, and it becomes really hard to understand what it's for.

So in general I'm biased at this point towards creating fairly narrowly scoped internal options whenever possible and generalizing them (lazily) when it's necessary. In this case I'd actually hope we can just remove this thing if/when we decide to remove the YARN clients.

@pwendell
Contributor Author

@sryza if you feel really strongly I could make it more general. I'm a bit scarred from some cases where options like this have taken on lives of their own :)

@sryza
Contributor

Even if it's a hack / workaround, I think it's probably better to work with the assumption that it'll end up sticking around for a while. With that in mind, a system property that signifies we started something from spark-submit makes more sense to me than a system property that signifies we started from spark-submit AND happen to be running YARN, especially if we'll need a similar one for standalone mode. Do you think SPARK_SUBMIT_YARN would be less likely to be abused than SPARK_SUBMIT?

That said, it's a tiny issue, and I don't feel super strongly, so if you feel differently, go ahead with the current patch.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@pwendell
Contributor Author

Alright @sryza I updated it. I think the fact we'll need this for standalone mode justifies generalizing it.
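
For readers following along, here is a minimal Scala sketch of the mechanism being discussed: spark-submit sets an internal system property before launching the cluster-manager client, and the client checks for it before printing its deprecation warning. The object names, the generalized property name (`SPARK_SUBMIT`), and the warning text below are illustrative assumptions based on this thread, not the exact code in the patch.

```scala
// Sketch only: how a spark-submit-set flag can suppress a client-side
// deprecation warning. Names here are assumptions, not the patch's exact code.
object SparkSubmitSketch {
  def launchViaYarn(sysProps: scala.collection.mutable.Map[String, String]): Unit = {
    // Mark that this process was launched through spark-submit.
    sysProps("SPARK_SUBMIT") = "true"
  }
}

object YarnClientSketch {
  def maybeWarnDeprecated(): Unit = {
    // Warn only when the client is invoked directly, not via spark-submit.
    if (!sys.props.contains("SPARK_SUBMIT")) {
      System.err.println(
        "WARNING: Invoking the YARN client directly is deprecated; use spark-submit instead.")
    }
  }
}
```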

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14532/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14536/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14543/

@pwendell
Contributor Author

Jenkins, retest this pleas.e

@pwendell
Contributor Author

Jenkins, retest this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14547/

asfgit closed this in 9f7a095 Apr 29, 2014
asfgit pushed a commit that referenced this pull request Apr 29, 2014
This is a straightforward fix.

Author: Patrick Wendell <[email protected]>

This patch had conflicts when merged, resolved by
Committer: Patrick Wendell <[email protected]>

Closes #578 from pwendell/spark-submit-yarn and squashes the following commits:

96027c7 [Patrick Wendell] Test fixes
b5be173 [Patrick Wendell] Review feedback
4ac9cac [Patrick Wendell] SPARK-1652: spark-submit for yarn prints warnings even though calling as expected
(cherry picked from commit 9f7a095)

Signed-off-by: Patrick Wendell <[email protected]>
pwendell pushed a commit to pwendell/spark that referenced this pull request May 12, 2014
SPARK-1076: zipWithIndex and zipWithUniqueId to RDD

Assigning ranks to an ordered or unordered data set is a common operation. This could be done by first counting records in each partition and then assigning ranks in parallel.

The purpose of assigning ranks to an unordered set is usually to get a unique id for each item, e.g., to map feature names to feature indices. In such cases, the assignment could be done without counting records, saving one Spark job.

https://spark-project.atlassian.net/browse/SPARK-1076

== update ==
Because assigning ranks is very similar to Scala's zipWithIndex, I changed the method name to zipWithIndex and put the index in the value field.

Author: Xiangrui Meng <[email protected]>

Closes apache#578 and squashes the following commits:

52a05e1 [Xiangrui Meng] changed assignRanks to zipWithIndex changed assignUniqueIds to zipWithUniqueId minor updates
756881c [Xiangrui Meng] simplified RankedRDD by implementing assignUniqueIds separately moved couting iterator size to Utils do not count items in the last partition and skip counting if there is only one partition
630868c [Xiangrui Meng] newline
21b434b [Xiangrui Meng] add assignRanks and assignUniqueIds to RDD
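
Since the commit message describes the two operations only in prose, a small usage sketch may help; the method names match what landed in the RDD API, while the local-mode setup and sample data are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ZipWithIndexExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("zip-example").setMaster("local[2]"))
    val rdd = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)

    // zipWithIndex assigns consecutive indices; it runs one extra job to count
    // the records in each partition so per-partition offsets can be computed.
    rdd.zipWithIndex().collect().foreach(println)    // (a,0), (b,1), (c,2), (d,3)

    // zipWithUniqueId derives an id from the position within the partition and
    // the partition index alone, so no counting job is needed, but the ids are
    // merely unique, not consecutive.
    rdd.zipWithUniqueId().collect().foreach(println)

    sc.stop()
  }
}
```
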
pwendell pushed a commit to pwendell/spark that referenced this pull request May 12, 2014
SPARK-1076: Convert Int to Long to avoid overflow

Patch for PR apache#578.

Author: Xiangrui Meng <[email protected]>

Closes apache#589 and squashes the following commits:

98c435e [Xiangrui Meng] cast Int to Long to avoid Int overflow
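
A tiny, self-contained snippet illustrates the overflow this follow-up guards against (the per-partition counts below are made up):

```scala
// Each partition count fits in an Int, but the running total across
// partitions can exceed Int.MaxValue, so accumulation must happen in Long.
val countsPerPartition: Array[Int] = Array.fill(2)(1500000000) // 1.5 billion each

val wrong: Int    = countsPerPartition.sum               // wraps around to -1294967296
val correct: Long = countsPerPartition.map(_.toLong).sum // 3000000000
```
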
pwendell pushed a commit to pwendell/spark that referenced this pull request May 12, 2014
SPARK-1076: [Fix apache#578] add @transient to some vals

I'll try to be more careful next time.

Author: Xiangrui Meng <[email protected]>

Closes apache#591 and squashes the following commits:

2b4f044 [Xiangrui Meng] add @transient to prev in ZippedWithIndexRDD add @transient to seed in PartitionwiseSampledRDD
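
For context on why the annotation matters, here is a simplified, illustrative sketch (the class and field names stand in for the real ZippedWithIndexRDD; this is not the actual Spark code): marking the parent RDD reference @transient keeps it out of the serialized form of the derived object when it is shipped to executors.

```scala
import org.apache.spark.rdd.RDD

// Simplified stand-in for an RDD wrapper class: `prev` is only needed on the
// driver (e.g. to compute partition metadata), so it is marked @transient and
// will not be serialized along with instances of this class.
class DerivedRddSketch[T](@transient private val prev: RDD[T]) extends Serializable {
  // Computed eagerly on the driver; cheap to serialize.
  val numParentPartitions: Int = prev.partitions.length
}
```
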
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
gzm55 pushed a commit to MediaV/spark that referenced this pull request Jul 17, 2014
SPARK-1076: zipWithIndex and zipWithUniqueId to RDD
gzm55 pushed a commit to MediaV/spark that referenced this pull request Jul 17, 2014
SPARK-1076: Convert Int to Long to avoid overflow
gzm55 pushed a commit to MediaV/spark that referenced this pull request Jul 17, 2014
SPARK-1076: [Fix apache#578] add @transient to some vals
helenyugithub pushed a commit to helenyugithub/spark that referenced this pull request Aug 20, 2019
…taFrame to R DataFrame (apache#578)

## What changes were proposed in this pull request?

This PR targets to support Arrow optimization for conversion from Spark DataFrame to R DataFrame.
Like PySpark side, it falls back to non-optimization code path when it's unable to use Arrow optimization.

This can be tested as below:

```bash
$ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
```

```r
collect(createDataFrame(mtcars))
```

### Requirements
  - R 3.5.x
  - Arrow package 0.12+
    ```bash
    Rscript -e 'remotes::install_github("apache/arrow@apache-arrow-0.12.0", subdir = "r")'
    ```

**Note:** currently, the Arrow R package is not on CRAN. Please take a look at ARROW-3204.
**Note:** currently, the Arrow R package does not seem to support Windows. Please take a look at ARROW-3204.

### Benchmarks

**Shell**

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=false --driver-memory 4g
```

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=true --driver-memory 4g
```

**R code**

```r
df <- cache(createDataFrame(read.csv("500000.csv")))
count(df)

test <- function() {
  options(digits.secs = 6) # milliseconds
  start.time <- Sys.time()
  collect(df)
  end.time <- Sys.time()
  time.taken <- end.time - start.time
  print(time.taken)
}

test()
```

**Data (350 MB):**

```r
object.size(read.csv("500000.csv"))
350379504 bytes
```

"500000 Records"  http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/

**Results**

```
Time difference of 221.32014 secs
```

```
Time difference of 15.51145 secs
```

The Arrow-optimized run was roughly **14x** faster.

### Limitations:

- For now, Arrow optimization with R is not supported when the data is `raw` or when the user explicitly gives a float type in the schema; these cases produce corrupt values, so we fall back to the non-optimized code path.

- Due to ARROW-4512, batches cannot be sent and received one at a time; all batches have to be sent at once in Arrow stream format. This needs improvement later.

## How was this patch tested?

Existing tests related with Arrow optimization cover this change. Also, manually tested.
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
1. Remove deprecated role "install-docker-ce"
2. Apply role "install-openjdk" instead of installing OpenJDK in run.yaml
3. Add option "--r" to the "make-distribution.sh" command in the minikube job
   to avoid a Docker build error caused by the R code not being found in the
   Dockerfile.

Related-Bug: theopenlab/openlab#310
Related-Bug: theopenlab/openlab#308
arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020