
[SPARK-10500][SPARKR] sparkr.zip cannot be created if /R/lib is unwritable #9390

Closed
wants to merge 5 commits

Conversation

@sun-rui (Contributor) commented Oct 31, 2015

The basic idea is:
The archive of the SparkR package itself, sparkr.zip, is created during the build process and is included in the Spark binary distribution. It is not changed after the distribution is installed, because the directory it resides in ($SPARK_HOME/R/lib) may not be writable.

When R source code is contained in jars or in Spark packages specified with the "--jars" or "--packages" command-line options, a temporary directory is created by calling Utils.createTempDir(), and the R packages built from that R source code are installed into it. The temporary directory is writable, does not interfere with other concurrent SparkR sessions, and is deleted when the SparkR session ends. The R binary packages installed in the temporary directory are then packed into an archive named rpkg.zip.

sparkr.zip and rpkg.zip are distributed to the cluster in YARN modes.

Distribution of rpkg.zip in Standalone mode is not supported in this PR and will be addressed in another PR.

Various R files are updated to accept multiple lib paths (one for the SparkR package, the other for additional R packages) so that these packages can be accessed in R.
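
To make this concrete, here is a minimal R sketch of the lib-path handling, assuming the comma-separated list is carried in the existing SPARKR_PACKAGE_DIR environment variable; the code below is illustrative, not the PR's actual implementation:

```r
# Illustrative sketch: consume a comma-separated list of library directories and
# put them on R's library search path before loading SparkR.
libDirs <- strsplit(Sys.getenv("SPARKR_PACKAGE_DIR"), ",", fixed = TRUE)[[1]]
libDirs <- libDirs[nzchar(libDirs)]     # drop empty entries
.libPaths(c(libDirs, .libPaths()))      # first entry: SparkR; remaining: extra R packages
library(SparkR)
```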

@SparkQA commented Oct 31, 2015

Test build #44734 has finished for PR 9390 at commit 3eb8743.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor)

cc @brkyvz @andrewor14

@SparkQA commented Nov 1, 2015

Test build #44753 has finished for PR 9390 at commit 3fbb2db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor) commented Nov 2, 2015

@sun-rui Thanks for the PR. It looks like this changes a lot of parts of the code though, so we'll need to review it carefully.

Are there any alternate designs you have in mind that might not change our JVM-to-R protocol, etc.?

@sun-rui (Contributor, Author) commented Nov 3, 2015

@shivaram, there are two changes to the JVM-to-R protocol:

  1. The env variable format used to convey the SparkR package path to R. Previously a single path to the SparkR package was conveyed; in this PR, a comma-separated path list is conveyed. The first element is the path to the SparkR package, and the other is for additional R packages specified via spark-submit command-line options.
  2. Within sparkR.init(), after launching a JVM backend, the path for additional R packages is passed from the JVM backend to R. That path is then added to .libPaths() so that the additional R packages can be loaded within the R environment (sketched after this comment).

So the basic change is that we have separate paths for the SparkR package itself and for additional R packages. An alternative design would be: when there are additional R packages, zip them together with the SparkR package into a single archive in a temporary directory. Then only one file needs to be distributed to the cluster and change No. 1 is not needed.
However, I think this alternative design has disadvantages:
a. Each time there are additional R packages, the SparkR package has to be re-zipped.
b. With a standalone-mode cluster, only the additional R packages actually need to be distributed, but in this design the SparkR package would be redistributed with them, which is unnecessary and wastes network traffic.

So I think the current PR is more efficient and potentially more flexible.

Whichever design is chosen, change No. 2 is necessary for the SparkR shell, because there needs to be a way for the SparkR shell to access the additional R packages (the location of these packages only becomes available after the R shell is launched).
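
A rough sketch of change No. 2 follows; the helper name and the placeholder path are hypothetical, and in practice the directory would be obtained from the JVM backend rather than hard-coded:

```r
# Hypothetical helper: the JVM backend reports where the additional R packages
# were installed, and R prepends that directory to its library search path.
addExtraPackageDir <- function(extraPkgDir) {
  if (nzchar(extraPkgDir) && dir.exists(extraPkgDir)) {
    .libPaths(c(extraPkgDir, .libPaths()))
  }
}

# e.g. inside sparkR.init(), after the backend connection is established:
addExtraPackageDir("/tmp/spark-rpkg-example")  # placeholder; real value comes from the backend
```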

@felixcheung (Member)

Why not keep the existing SPARKR_PACKAGE_DIR and add another one for additional packages?

@sun-rui (Contributor, Author) commented Nov 3, 2015

I could, but I think it is simpler to pass a path list through the existing env var instead of adding a new env var.

}

// TODO: Support distributing R packages with standalone cluster
if (args.isR && clusterManager == STANDALONE && !RUtils.rPackages.isEmpty) {

what about MESOS?

@sun-rui (Contributor, Author) commented Nov 5, 2015

submitted https://issues.apache.org/jira/browse/SPARK-11525: Support spark packages containing R source code in Standalone mode

@sun-rui changed the title from "[SPARK-10500][SPARKR][WIP] sparkr.zip cannot be created if /R/lib is unwritable" to "[SPARK-10500][SPARKR] sparkr.zip cannot be created if /R/lib is unwritable" on Nov 5, 2015
@SparkQA commented Nov 5, 2015

Test build #45105 has finished for PR 9390 at commit 865471c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor) commented Nov 5, 2015

@sun-rui Do R packages not work in standalone mode right now, or is this something we are breaking as a result of this change?

Ping @brkyvz to take a look at this PR

@sun-rui (Contributor, Author) commented Nov 6, 2015

@shivaram, R packages not working in standalone mode is an existing bug, not a result of this PR.

@sun-rui (Contributor, Author) commented Nov 6, 2015

I will test this PR on a YARN cluster when I am free.

@@ -25,3 +25,9 @@ set SPARK_HOME=%~dp0..
MKDIR %SPARK_HOME%\R\lib

R.exe CMD INSTALL --library="%SPARK_HOME%\R\lib" %SPARK_HOME%\R\pkg\

rem Zip the SparkR package so that it can be distributed to worker nodes on YARN
pushd %SPARK_HOME%\R\lib

If R/lib is not writable, how are you going to write it here?


This happens during the build process, not at runtime.

@shivaram (Contributor)

@brkyvz could you take another look? It'll be good to get this PR in for 1.6. I'll also try to do a pass tonight.

@sun-rui (Contributor, Author) commented Nov 11, 2015

I tested this PR on yarn-client/yarn-cluster with/without R source packages. All work fine.

Let's merge it. I may later investigate adding test cases for YARN mode in another PR.

@brkyvz (Contributor) commented Nov 11, 2015

@sun-rui I tested this package locally. I failed to use an R Spark Package. The problem is that if that package depends on SparkR, it can't find it during R CMD INSTALL since they don't reside in the same folder anymore.

@brkyvz (Contributor) commented Nov 11, 2015

@sun-rui With this patch, you should be able to enable this test:


This way you can make sure that you don't break any existing behavior.
Regarding libraries not being able to find SparkR, one option would be to copy SparkR under R/lib to the temporary folder you create for building all other packages. Then you can zip everything except SparkR in that folder for Yarn support, and continue

@sun-rui (Contributor, Author) commented Nov 12, 2015

Thanks to @brkyvz for catching this issue. I think a better solution is as documented at https://stat.ethz.ch/R-manual/R-devel/library/base/html/libPaths.html: "The library search path is initialized at startup from the environment variable R_LIBS (which should be a colon-separated list of directories at which R library trees are rooted) followed by those in environment variable R_LIBS_USER. Only directories which exist at the time will be included."

Before installing the R source packages, R_LIBS can be set to point to <SPARK_HOME>/R/lib.
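
For illustration, the idea might be applied as in the sketch below; in the PR the installation is actually driven from the JVM side, and the library directory and package path shown here are placeholders:

```r
# Sketch: expose the pre-built SparkR library to the child R process that
# installs an R source package depending on SparkR.
sparkHome <- Sys.getenv("SPARK_HOME")
Sys.setenv(R_LIBS = file.path(sparkHome, "R", "lib"))   # colon-separate multiple directories
system2("R", c("CMD", "INSTALL",
               "--library=/tmp/spark-rpkg",             # placeholder temporary lib dir
               "path/to/extracted/R/pkg"))              # placeholder R source package
```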

@felixcheung (Member)

I don't fully understand the issue, but why do we have to use .libPaths() and not library(..., lib.loc = )?

@sun-rui (Contributor, Author) commented Nov 12, 2015

@felixcheung, "R CMD INSTALL" is used to install R packages. Now if any R package depends on SparkR package, the installation will fail as the SparkR package path is not put into the R search library paths.

@sun-rui (Contributor, Author) commented Nov 12, 2015

@brkyvz, setting an env variable like R_LIBS before running "R CMD INSTALL" can fix this. But I will try to be consistent with the existing style, which uses the R_PROFILE_USER env variable to specify an R script that is automatically executed when R starts. That is, use <SPARK_HOME>/R/lib/SparkR/profile/general.R to add the SparkR binary package path to .libPaths().
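
A minimal sketch of what such a profile script could contain (not necessarily the exact contents of general.R):

```r
# When R starts with R_PROFILE_USER pointing at this file, add the directory
# holding the pre-installed SparkR binary package to the library search path.
.First <- function() {
  sparkHome <- Sys.getenv("SPARK_HOME")
  .libPaths(c(file.path(sparkHome, "R", "lib"), .libPaths()))
}
```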

@SparkQA commented Nov 12, 2015

Test build #45741 has finished for PR 9390 at commit 4530e7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 13, 2015

Test build #45818 has finished for PR 9390 at commit 65b3d16.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sun-rui (Contributor, Author) commented Nov 13, 2015

Jenkins, retest this please

@SparkQA commented Nov 13, 2015

Test build #45836 has finished for PR 9390 at commit 65b3d16.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sun-rui (Contributor, Author) commented Nov 13, 2015

Jenkins, retest this please

@SparkQA commented Nov 13, 2015

Test build #45849 has finished for PR 9390 at commit 65b3d16.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sun-rui (Contributor, Author) commented Nov 14, 2015

Jenkins, retest this please

@sun-rui (Contributor, Author) commented Nov 14, 2015

@brkyvz, the infinite loop when the environment is not cleared is caused by inheritance of the R_PROFILE_USER env var that is set for launching SparkR, which results in recursive execution of sparkR.init(). By resetting R_PROFILE_USER to point to $SPARK_HOME/R/lib/SparkR/profile/general.R, the SparkR lib path is added to the R library search paths and the infinite loop is eliminated. "One stone, two birds" :)
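
As an illustration of the fix (a hypothetical R-level sketch; in the PR the environment handling happens on the JVM side before the child R process is launched):

```r
# Before launching the child R process, point R_PROFILE_USER at the minimal
# profile so the child only adjusts .libPaths() instead of re-running the
# SparkR shell profile (which would call sparkR.init() again).
sparkHome <- Sys.getenv("SPARK_HOME")
Sys.setenv(R_PROFILE_USER = file.path(sparkHome, "R", "lib", "SparkR", "profile", "general.R"))
```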

@SparkQA commented Nov 14, 2015

Test build #45914 has finished for PR 9390 at commit 65b3d16.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sun-rui (Contributor, Author) commented Nov 14, 2015

Don't know what's going wrong with Jenkins.

@shivaram (Contributor)

Jenkins, retest this please

@SparkQA commented Nov 14, 2015

Test build #45921 has finished for PR 9390 at commit 65b3d16.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor)

Thanks @sun-rui and @brkyvz -- Sorry for the delay in getting to this. LGTM. Merging this to master and branch-1.6

asfgit pushed a commit that referenced this pull request Nov 16, 2015
[SPARK-10500][SPARKR] sparkr.zip cannot be created if /R/lib is unwritable


Author: Sun Rui <[email protected]>

Closes #9390 from sun-rui/SPARK-10500.

(cherry picked from commit 835a79d)
Signed-off-by: Shivaram Venkataraman <[email protected]>
@asfgit closed this in 835a79d Nov 16, 2015
@JoshRosen (Contributor)

It looks like the "org.apache.spark.deploy.SparkSubmitSuite.correctly builds R packages included in a jar with --packages" test is now extremely flaky in the Maven builds: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-Maven-with-YARN/4146/HADOOP_PROFILE=hadoop-2.4,label=spark-test/testReport/junit/org.apache.spark.deploy/SparkSubmitSuite/correctly_builds_R_packages_included_in_a_jar_with___packages/

@shivaram (Contributor)

@JoshRosen I'm taking a look. If I can't find anything soon I'll send a PR to disable the test again.

@shivaram (Contributor)

Ok I got the error log from log4j and saved it to https://gist.github.com/shivaram/3a2fecce60768a603dac

@brkyvz @sun-rui From the logs I see ERROR: dependency 'SparkR' is not available for package 'sparkPackageTest' -- Looks like the basic error is that the SparkR package is not available to be listed as a dependency.

I'm not sure of a fix, so I'll send a PR to disable the test and we can investigate this further

asfgit pushed a commit that referenced this pull request Nov 17, 2015
See #9390 (comment) and https://gist.github.com/shivaram/3a2fecce60768a603dac for more information

Author: Shivaram Venkataraman <[email protected]>

Closes #9744 from shivaram/sparkr-package-test-disable.

(cherry picked from commit ea6f53e)
Signed-off-by: Andrew Or <[email protected]>
asfgit pushed a commit that referenced this pull request Nov 17, 2015
See #9390 (comment) and https://gist.github.com/shivaram/3a2fecce60768a603dac for more information

Author: Shivaram Venkataraman <[email protected]>

Closes #9744 from shivaram/sparkr-package-test-disable.
@sun-rui (Contributor, Author) commented Nov 17, 2015

This is probably because, when the test is executed, the SparkR library is not available under $SPARK_HOME/R/lib. Does the build have "-Psparkr" specified?

@shivaram (Contributor)

Great catch @sun-rui -- I think it doesn't have -Psparkr -- Looking at the config xml file I see the command being run as

 build/mvn --force -DzincPort=$ZINC_PORT -DskipTests -P${HADOOP_PROFILE} -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl clean package

@JoshRosen Could you add -Psparkr to that and the other QA builds?

asfgit pushed a commit that referenced this pull request Dec 18, 2015
[SPARK-10500][SPARKR] sparkr.zip cannot be created if /R/lib is unwritable

Backport #9390 and #9744 to branch-1.5.

Author: Sun Rui <[email protected]>
Author: Shivaram Venkataraman <[email protected]>

Closes #10372 from sun-rui/SPARK-10500-branch-1.5.
kiszk pushed a commit to kiszk/spark-gpu that referenced this pull request Dec 26, 2015
See apache/spark#9390 (comment) and https://gist.github.com/shivaram/3a2fecce60768a603dac for more information

Author: Shivaram Venkataraman <[email protected]>

Closes #9744 from shivaram/sparkr-package-test-disable.