
[SPARK-10500][SPARKR] sparkr.zip cannot be created if /R/lib is unwritable #9390

Closed
wants to merge 5 commits

Conversation

@sun-rui (Contributor) commented Oct 31, 2015

The basic idea is:
The archive of the SparkR package itself, sparkr.zip, is created during the build process and is included in the Spark binary distribution. It is not changed after the distribution is installed, because the directory it resides in ($SPARK_HOME/R/lib) may not be writable.

When R source code is contained in jars or in Spark packages specified with the "--jars" or "--packages" command-line options, a temporary directory is created by calling Utils.createTempDir(), and the R packages built from that R source code are installed into it. The temporary directory is writable, does not interfere with other concurrent SparkR sessions, and is deleted when the SparkR session ends. The R binary packages installed in the temporary directory are then packed into an archive named rpkg.zip.

sparkr.zip and rpkg.zip are distributed to the cluster in YARN modes.

Distribution of rpkg.zip in Standalone mode is not supported in this PR and will be addressed in another PR.

Various R files are updated to accept multiple lib paths (one for the SparkR package, the other for additional R packages) so that these packages can be accessed in R.
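
To make this concrete, here is a minimal R sketch of the lib-path handling, assuming the comma-separated list is carried in the existing SPARKR_PACKAGE_DIR environment variable; the code below is illustrative, not the PR's actual implementation:

```r
# Illustrative sketch: consume a comma-separated list of library directories and
# put them on R's library search path before loading SparkR.
libDirs <- strsplit(Sys.getenv("SPARKR_PACKAGE_DIR"), ",", fixed = TRUE)[[1]]
libDirs <- libDirs[nzchar(libDirs)]     # drop empty entries
.libPaths(c(libDirs, .libPaths()))      # first entry: SparkR; remaining: extra R packages
library(SparkR)
```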

@SparkQA commented Oct 31, 2015

Test build #44734 has finished for PR 9390 at commit 3eb8743.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor)

cc @brkyvz @andrewor14

@SparkQA commented Nov 1, 2015

Test build #44753 has finished for PR 9390 at commit 3fbb2db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor) commented Nov 2, 2015

@sun-rui Thanks for the PR. It looks like this changes a lot of parts of the code though, so we'll need to review it carefully.

Are there any alternate designs you have in mind that might not change our JVM-to-R protocol, etc.?

@sun-rui (Contributor, Author) commented Nov 3, 2015

@shivaram, there are two changes to the JVM-to-R protocol:

  1. The env variable format used to convey the SparkR package path to R. Previously a single path to the SparkR package was conveyed; in this PR, a comma-separated path list is conveyed. The first element is the path to the SparkR package, and the other is for additional R packages specified via spark-submit command-line options.
  2. Within sparkR.init(), after launching a JVM backend, the path for additional R packages is passed from the JVM backend to R. That path is then added to .libPaths() so that the additional R packages can be loaded within the R environment (sketched after this comment).

So the basic change is that we have separate paths for the SparkR package itself and for additional R packages. An alternative design would be: when there are additional R packages, zip them together with the SparkR package into a single archive in a temporary directory. Then only one file needs to be distributed to the cluster and change No. 1 is not needed.
However, I think this alternative design has disadvantages:
a. Each time there are additional R packages, the SparkR package has to be re-zipped.
b. With a standalone-mode cluster, only the additional R packages actually need to be distributed, but in this design the SparkR package would be redistributed with them, which is unnecessary and wastes network traffic.

So I think the current PR is more efficient and potentially more flexible.

Whichever design is chosen, change No. 2 is necessary for the SparkR shell, because there needs to be a way for the SparkR shell to access the additional R packages (the location of these packages only becomes available after the R shell is launched).
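
A rough sketch of change No. 2 follows; the helper name and the placeholder path are hypothetical, and in practice the directory would be obtained from the JVM backend rather than hard-coded:

```r
# Hypothetical helper: the JVM backend reports where the additional R packages
# were installed, and R prepends that directory to its library search path.
addExtraPackageDir <- function(extraPkgDir) {
  if (nzchar(extraPkgDir) && dir.exists(extraPkgDir)) {
    .libPaths(c(extraPkgDir, .libPaths()))
  }
}

# e.g. inside sparkR.init(), after the backend connection is established:
addExtraPackageDir("/tmp/spark-rpkg-example")  # placeholder; real value comes from the backend
```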

@felixcheung (Member)

Why not keep the existing SPARKR_PACKAGE_DIR and add another one for additional packages?

@sun-rui (Contributor, Author) commented Nov 3, 2015

I could, but I think it is simpler to pass a path list through the existing env var instead of adding a new env var.

}

// TODO: Support distributing R packages with standalone cluster
if (args.isR && clusterManager == STANDALONE && !RUtils.rPackages.isEmpty) {

what about MESOS?

@sun-rui (Contributor, Author) commented Nov 5, 2015

submitted https://issues.apache.org/jira/browse/SPARK-11525: Support spark packages containing R source code in Standalone mode

@sun-rui changed the title from "[SPARK-10500][SPARKR][WIP] sparkr.zip cannot be created if /R/lib is unwritable" to "[SPARK-10500][SPARKR] sparkr.zip cannot be created if /R/lib is unwritable" on Nov 5, 2015
@SparkQA commented Nov 5, 2015

Test build #45105 has finished for PR 9390 at commit 865471c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor) commented Nov 5, 2015

@sun-rui Do R packages not work in standalone mode right now, or is this something we are breaking as a result of this change?

Ping @brkyvz to take a look at this PR

@sun-rui (Contributor, Author) commented Nov 6, 2015

@shivaram, R packages not working in standalone mode is an existing bug, not a result of this PR.

@sun-rui (Contributor, Author) commented Nov 6, 2015

I will test this PR on a YARN cluster when I am free.

@@ -25,3 +25,9 @@ set SPARK_HOME=%~dp0..
MKDIR %SPARK_HOME%\R\lib

R.exe CMD INSTALL --library="%SPARK_HOME%\R\lib" %SPARK_HOME%\R\pkg\

rem Zip the SparkR package so that it can be distributed to worker nodes on YARN
pushd %SPARK_HOME%\R\lib

If R/lib is not writable, how are you going to write it here?


This happens during the build process, not at runtime.

@shivaram (Contributor)

@brkyvz could you take another look? It'll be good to get this PR in for 1.6. I'll also try to do a pass tonight.

@sun-rui (Contributor, Author) commented Nov 11, 2015

I tested this PR on yarn-client/yarn-cluster with/without R source packages. All work fine.

Let's merge it. I may later investigate adding test cases for YARN mode in another PR.

@brkyvz (Contributor) commented Nov 11, 2015

@sun-rui I tested this package locally. I failed to use an R Spark Package. The problem is that if that package depends on SparkR, it can't find it during R CMD INSTALL since they don't reside in the same folder anymore.

@brkyvz (Contributor) commented Nov 11, 2015

@sun-rui With this patch, you should be able to enable this test:


This way you can make sure that you don't break any existing behavior.
Regarding libraries not being able to find SparkR, one option would be to copy SparkR under R/lib to the temporary folder you create for building all other packages. Then you can zip everything except SparkR in that folder for Yarn support, and continue

@sun-rui (Contributor, Author) commented Nov 12, 2015

Thanks to @brkyvz for catching this issue. I think a better solution is as documented at https://stat.ethz.ch/R-manual/R-devel/library/base/html/libPaths.html: "The library search path is initialized at startup from the environment variable R_LIBS (which should be a colon-separated list of directories at which R library trees are rooted) followed by those in environment variable R_LIBS_USER. Only directories which exist at the time will be included."

Before installing the R source packages, R_LIBS can be set to point to <SPARK_HOME>/R/lib.
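
For illustration, the idea might be applied as in the sketch below; in the PR the installation is actually driven from the JVM side, and the library directory and package path shown here are placeholders:

```r
# Sketch: expose the pre-built SparkR library to the child R process that
# installs an R source package depending on SparkR.
sparkHome <- Sys.getenv("SPARK_HOME")
Sys.setenv(R_LIBS = file.path(sparkHome, "R", "lib"))   # colon-separate multiple directories
system2("R", c("CMD", "INSTALL",
               "--library=/tmp/spark-rpkg",             # placeholder temporary lib dir
               "path/to/extracted/R/pkg"))              # placeholder R source package
```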

@felixcheung (Member)

I don't fully understand the issue, but why do we have to use .libPaths() and not library(..., lib.loc = )?

@sun-rui (Contributor, Author) commented Nov 12, 2015

@felixcheung, "R CMD INSTALL" is used to install R packages. Now if any R package depends on SparkR package, the installation will fail as the SparkR package path is not put into the R search library paths.

@sun-rui (Contributor, Author) commented Nov 12, 2015

@brkyvz, setting an env variable like R_LIBS before running "R CMD INSTALL" can fix this. But I will try to be consistent with the existing style, which uses the R_PROFILE_USER env variable to specify an R script that is automatically executed when R starts. That is, use <SPARK_HOME>/R/lib/SparkR/profile/general.R to add the SparkR binary package path to .libPaths().
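
A minimal sketch of what such a profile script could contain (not necessarily the exact contents of general.R):

```r
# When R starts with R_PROFILE_USER pointing at this file, add the directory
# holding the pre-installed SparkR binary package to the library search path.
.First <- function() {
  sparkHome <- Sys.getenv("SPARK_HOME")
  .libPaths(c(file.path(sparkHome, "R", "lib"), .libPaths()))
}
```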

@SparkQA commented Nov 12, 2015

Test build #45741 has finished for PR 9390 at commit 4530e7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 13, 2015

Test build #45818 has finished for PR 9390 at commit 65b3d16.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sun-rui (Contributor, Author) commented Nov 13, 2015

Jenkins, retest this please

@SparkQA commented Nov 13, 2015

Test build #45836 has finished for PR 9390 at commit 65b3d16.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sun-rui (Contributor, Author) commented Nov 13, 2015

Jenkins, retest this please

@SparkQA commented Nov 13, 2015

Test build #45849 has finished for PR 9390 at commit 65b3d16.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sun-rui (Contributor, Author) commented Nov 14, 2015

Jenkins, retest this please

@sun-rui (Contributor, Author) commented Nov 14, 2015

@brkyvz, the infinite loop when the environment is not cleared is caused by inheritance of the R_PROFILE_USER env var that is set for launching SparkR, which results in recursive execution of sparkR.init(). By resetting R_PROFILE_USER to point to $SPARK_HOME/R/lib/SparkR/profile/general.R, the SparkR lib path is added to the R library search paths and the infinite loop is eliminated. "One stone, two birds" :)
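
As an illustration of the fix (a hypothetical R-level sketch; in the PR the environment handling happens on the JVM side before the child R process is launched):

```r
# Before launching the child R process, point R_PROFILE_USER at the minimal
# profile so the child only adjusts .libPaths() instead of re-running the
# SparkR shell profile (which would call sparkR.init() again).
sparkHome <- Sys.getenv("SPARK_HOME")
Sys.setenv(R_PROFILE_USER = file.path(sparkHome, "R", "lib", "SparkR", "profile", "general.R"))
```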

@SparkQA commented Nov 14, 2015

Test build #45914 has finished for PR 9390 at commit 65b3d16.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sun-rui (Contributor, Author) commented Nov 14, 2015

Don't know what's going wrong with Jenkins.

@shivaram (Contributor)

Jenkins, retest this please

@SparkQA commented Nov 14, 2015

Test build #45921 has finished for PR 9390 at commit 65b3d16.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor)

Thanks @sun-rui and @brkyvz -- Sorry for the delay in getting to this. LGTM. Merging this to master and branch-1.6

asfgit pushed a commit that referenced this pull request Nov 16, 2015
[SPARK-10500][SPARKR] sparkr.zip cannot be created if /R/lib is unwritable


Author: Sun Rui <[email protected]>

Closes #9390 from sun-rui/SPARK-10500.

(cherry picked from commit 835a79d)
Signed-off-by: Shivaram Venkataraman <[email protected]>
@asfgit closed this in 835a79d Nov 16, 2015
@JoshRosen (Contributor)

It looks like the "org.apache.spark.deploy.SparkSubmitSuite.correctly builds R packages included in a jar with --packages" test is now extremely flaky in the Maven builds: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-Maven-with-YARN/4146/HADOOP_PROFILE=hadoop-2.4,label=spark-test/testReport/junit/org.apache.spark.deploy/SparkSubmitSuite/correctly_builds_R_packages_included_in_a_jar_with___packages/

@shivaram (Contributor)

@JoshRosen I'm taking a look. If I can't find anything soon I'll send a PR to disable the test again.

@shivaram (Contributor)

Ok I got the error log from log4j and saved it to https://gist.github.com/shivaram/3a2fecce60768a603dac

@brkyvz @sun-rui From the logs I see ERROR: dependency 'SparkR' is not available for package 'sparkPackageTest' -- Looks like the basic error is that the SparkR package is not available to be listed as a dependency.

I'm not sure of a fix, so I'll send a PR to disable the test and we can investigate this further

asfgit pushed a commit that referenced this pull request Nov 17, 2015
See #9390 (comment) and https://gist.github.com/shivaram/3a2fecce60768a603dac for more information

Author: Shivaram Venkataraman <[email protected]>

Closes #9744 from shivaram/sparkr-package-test-disable.

(cherry picked from commit ea6f53e)
Signed-off-by: Andrew Or <[email protected]>
asfgit pushed a commit that referenced this pull request Nov 17, 2015
See #9390 (comment) and https://gist.github.com/shivaram/3a2fecce60768a603dac for more information

Author: Shivaram Venkataraman <[email protected]>

Closes #9744 from shivaram/sparkr-package-test-disable.
@sun-rui (Contributor, Author) commented Nov 17, 2015

This is probably because, when the test is executed, the SparkR library is not available under $SPARK_HOME/R/lib. Does the build have "-Psparkr" specified?

@shivaram (Contributor)

Great catch @sun-rui -- I think it doesn't have -Psparkr -- Looking at the config xml file I see the command being run as

 build/mvn --force -DzincPort=$ZINC_PORT -DskipTests -P${HADOOP_PROFILE} -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl clean package

@JoshRosen Could you add -Psparkr to that and the other QA builds?

asfgit pushed a commit that referenced this pull request Dec 18, 2015
[SPARK-10500][SPARKR] sparkr.zip cannot be created if /R/lib is unwritable

Backport #9390 and #9744 to branch-1.5.

Author: Sun Rui <[email protected]>
Author: Shivaram Venkataraman <[email protected]>

Closes #10372 from sun-rui/SPARK-10500-branch-1.5.
kiszk pushed a commit to kiszk/spark-gpu that referenced this pull request Dec 26, 2015
See apache/spark#9390 (comment) and https://gist.github.com/shivaram/3a2fecce60768a603dac for more information

Author: Shivaram Venkataraman <[email protected]>

Closes #9744 from shivaram/sparkr-package-test-disable.