[SPARK-10500][SPARKR] sparkr.zip cannot be created if /R/lib is unwritable #9390
Conversation
Test build #44734 has finished for PR 9390 at commit
Test build #44753 has finished for PR 9390 at commit
@sun-rui Thanks for the PR. It looks like this changes a lot of parts of the code, though, so we'll need to review it carefully. Are there any alternate designs you have in mind that might not change our JVM-to-R protocol, etc.?
@shivaram, there are two changes to the JVM-to-R protocol:
So the basic change is that we have separate paths for the SparkR package itself and for additional R packages. An alternative design is: So I think the current PR is more efficient and potentially more flexible. No matter which design, change No. 2 is necessary for the SparkR shell, because there needs to be a way for the SparkR shell to access the additional R packages (the location of these packages is available only after the R shell is launched).
Why not keep the existing
I could. But I think it is simpler to pass a list of paths using the existing env var instead of adding a new one.
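A small sketch of the idea discussed above, as a Python analogue rather than the actual SparkR launcher code (the variable name `SPARKR_LIB_DIRS` and the comma separator are made up for illustration): one environment variable carries a list of library paths, SparkR's own lib dir plus the per-session directory holding additional R packages, instead of introducing a second variable.

```python
import os

def split_lib_paths(value):
    """Split a comma-separated list of lib paths, dropping empty entries."""
    return [p for p in value.split(",") if p]

# Hypothetical variable name, for illustration only.
os.environ["SPARKR_LIB_DIRS"] = "/opt/spark/R/lib,/tmp/spark-rpkgs"
paths = split_lib_paths(os.environ["SPARKR_LIB_DIRS"])
print(paths)
```

The consumer side simply splits the value, so an empty or single-entry list degrades gracefully and no protocol change is needed to add more paths later.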
}

// TODO: Support distributing R packages with standalone cluster
if (args.isR && clusterManager == STANDALONE && !RUtils.rPackages.isEmpty) {
What about MESOS?
Submitted https://issues.apache.org/jira/browse/SPARK-11525: Support Spark packages containing R source code in Standalone mode.
Test build #45105 has finished for PR 9390 at commit
@shivaram, R packages not working in standalone mode is an existing bug, not a result of this PR.
I will test this PR on a YARN cluster when I am free.
@@ -25,3 +25,9 @@ set SPARK_HOME=%~dp0..
MKDIR %SPARK_HOME%\R\lib

R.exe CMD INSTALL --library="%SPARK_HOME%\R\lib" %SPARK_HOME%\R\pkg\

rem Zip the SparkR package so that it can be distributed to worker nodes on YARN
pushd %SPARK_HOME%\R\lib
If R/lib is not writable, how are you going to write it here?
This happens during the build process, not at runtime.
@brkyvz could you take another look? It'll be good to get this PR in for 1.6. I'll also try to do a pass tonight.
I tested this PR on yarn-client/yarn-cluster, with and without R source packages. All work fine. Let's merge it. I may later investigate adding test cases for YARN mode in another PR.
@sun-rui I tested this package locally. I failed to use an R Spark package. The problem is that if the package depends on SparkR, it can't find it during R CMD INSTALL, since they don't reside in the same folder anymore.
@sun-rui With this patch, you should be able to enable this test: This way you can make sure that you don't break any existing behavior. Regarding libraries not being able to find SparkR, one option would be to copy SparkR under R/lib to the temporary folder you create for building all the other packages. Then you can zip everything except SparkR in that folder for YARN support, and continue
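The suggestion above (copy SparkR into the temporary build folder so dependent packages can find it, then zip everything except SparkR for YARN) can be sketched as follows. This is a Python analogue with made-up paths, not the actual build scripts; stand-in files simulate installed R packages.

```python
import os
import shutil
import tempfile
import zipfile

def zip_all_but_sparkr(build_dir, out_zip):
    """Zip the contents of build_dir, skipping the SparkR package tree."""
    with zipfile.ZipFile(out_zip, "w") as zf:
        for root, dirs, files in os.walk(build_dir):
            dirs[:] = [d for d in dirs if d != "SparkR"]  # prune SparkR itself
            for f in files:
                full = os.path.join(root, f)
                zf.write(full, os.path.relpath(full, build_dir))

# Simulate an installed SparkR plus one user package in a temp build dir.
parent = tempfile.mkdtemp(prefix="rpkg-build-")
build_dir = os.path.join(parent, "pkgs")
for pkg in ("SparkR", "mypkg"):
    os.makedirs(os.path.join(build_dir, pkg))
    open(os.path.join(build_dir, pkg, "DESCRIPTION"), "w").close()

out = os.path.join(parent, "rpkg.zip")  # keep the archive outside build_dir
zip_all_but_sparkr(build_dir, out)
names = sorted(zipfile.ZipFile(out).namelist())
print(names)
shutil.rmtree(parent)
```

Keeping the output archive outside the directory being walked avoids the archive trying to include itself.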
Thanks to @brkyvz for catching this issue. I think a better solution is as documented at https://stat.ethz.ch/R-manual/R-devel/library/base/html/libPaths.html:
I don't fully understand the issue, but why do we have to use
@felixcheung, "R CMD INSTALL" is used to install R packages. Currently, if any R package depends on the SparkR package, the installation will fail because the SparkR package path is not added to the R library search paths.
@brkyvz, setting an env variable like R_LIBS before running "R CMD INSTALL" can fix this. But I will try to be consistent with the existing style, which uses the R_PROFILE_USER env variable to specify an R script that is automatically executed when R starts. That is, use <SPARK_HOME>/R/lib/SparkR/profile/general.R to add the SparkR binary package path to .libPaths().
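The fix described above prepends the SparkR installation directory to R's library search path (.libPaths()) from a profile script run at startup. Here is a small Python analogue of that mechanism (the paths are illustrative), prepending a path without introducing duplicates:

```python
def prepend_lib_path(search_paths, new_path):
    """Return a new search list with new_path first and any duplicate removed."""
    return [new_path] + [p for p in search_paths if p != new_path]

# Default library path, then the SparkR lib dir is prepended at startup.
paths = ["/usr/lib/R/library"]
paths = prepend_lib_path(paths, "/opt/spark/R/lib")
print(paths)
```

Prepending (rather than appending) matters: the SparkR directory is consulted first, mirroring how .libPaths() resolution picks the first matching library.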
Test build #45741 has finished for PR 9390 at commit
Test build #45818 has finished for PR 9390 at commit
Jenkins, retest this please.
Test build #45836 has finished for PR 9390 at commit
Jenkins, retest this please.
Test build #45849 has finished for PR 9390 at commit
Jenkins, retest this please.
@brkyvz, the infinite loop when the environment is not cleared is caused by inheritance of the R_PROFILE_USER env var set for launching SparkR, which leads to recursive execution of sparkR.init(). By resetting R_PROFILE_USER to point to $SPARK_HOME/R/lib/profile/general.R, the SparkR lib path is added to the R library search paths, and the infinite loop is eliminated. "One stone, two birds" :)
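The loop described above can be simulated with a toy Python model (this is not Spark code; file paths are illustrative): if R_PROFILE_USER is inherited unchanged and that profile calls sparkR.init() again, R relaunches itself forever, while resetting the variable to the benign general.R profile breaks the cycle.

```python
def launch_r(env, depth=0):
    """Simulate starting R, which runs the R_PROFILE_USER script at startup."""
    if depth > 5:
        raise RecursionError("infinite relaunch loop")
    if env.get("R_PROFILE_USER", "").endswith("shell.R"):
        # The shell profile calls sparkR.init(), which launches R again
        # with the environment inherited unchanged.
        launch_r(env, depth + 1)
    return "started"

buggy = {"R_PROFILE_USER": "/opt/spark/R/SparkR/profile/shell.R"}
try:
    launch_r(buggy)
    outcome = "ok"
except RecursionError:
    outcome = "loop"
print(outcome)

# The fix: reset R_PROFILE_USER to general.R, which only adds the SparkR
# lib dir to .libPaths() and never calls init again.
fixed = {"R_PROFILE_USER": "/opt/spark/R/lib/SparkR/profile/general.R"}
print(launch_r(fixed))
```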
Test build #45914 has finished for PR 9390 at commit
Don't know what's going wrong with Jenkins.
Jenkins, retest this please.
Test build #45921 has finished for PR 9390 at commit
[SPARK-10500][SPARKR] sparkr.zip cannot be created if /R/lib is unwritable

The basic idea is that:

The archive of the SparkR package itself, sparkr.zip, is created during the build process and is included in the Spark binary distribution. It is not changed after the distribution is installed, because the directory it resides in ($SPARK_HOME/R/lib) may not be writable.

When there is R source code contained in jars or in Spark packages specified with the "--jars" or "--packages" command line options, a temporary directory is created by calling Utils.createTempDir(), and the R packages built from that source code are installed there. The temporary directory is writable, won't interfere with other concurrent SparkR sessions, and is deleted when the SparkR session ends. The R binary packages installed in the temporary directory are then packed into an archive named rpkg.zip.

sparkr.zip and rpkg.zip are distributed to the cluster in YARN modes. Distribution of rpkg.zip in Standalone mode is not supported in this PR and will be addressed in another PR.

Various R files are updated to accept multiple lib paths (one for the SparkR package, the other for additional R packages) so that these packages can be accessed in R.

Author: Sun Rui <[email protected]>

Closes #9390 from sun-rui/SPARK-10500.

(cherry picked from commit 835a79d)
Signed-off-by: Shivaram Venkataraman <[email protected]>
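The per-session flow in the commit message above can be sketched as a Python analogue (paths and helper names are made up; the real code uses Utils.createTempDir() and R CMD INSTALL): additional R packages go into a fresh writable temp dir, get zipped into rpkg.zip, and the whole dir is removed at session end, leaving the read-only sparkr.zip untouched.

```python
import os
import shutil
import tempfile
import zipfile

def build_rpkg_zip(package_names):
    """Install stand-in packages into a fresh temp dir and zip them."""
    tmp = tempfile.mkdtemp(prefix="sparkr-session-")  # like Utils.createTempDir()
    for name in package_names:                        # stand-in for R CMD INSTALL
        os.makedirs(os.path.join(tmp, name))
        open(os.path.join(tmp, name, "DESCRIPTION"), "w").close()
    out = os.path.join(tmp, "rpkg.zip")
    with zipfile.ZipFile(out, "w") as zf:
        for root, _, files in os.walk(tmp):
            for f in files:
                if f == "rpkg.zip":                   # don't zip the archive itself
                    continue
                full = os.path.join(root, f)
                zf.write(full, os.path.relpath(full, tmp))
    return tmp, out

tmp, out = build_rpkg_zip(["mypkg"])
print(sorted(zipfile.ZipFile(out).namelist()))
shutil.rmtree(tmp)                                    # session teardown
```

Because each session gets its own mkdtemp directory, concurrent SparkR sessions cannot interfere with each other, which is the property the commit message calls out.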
It looks like the "org.apache.spark.deploy.SparkSubmitSuite.correctly builds R packages included in a jar with --packages" test is now extremely flaky in the Maven builds: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-Maven-with-YARN/4146/HADOOP_PROFILE=hadoop-2.4,label=spark-test/testReport/junit/org.apache.spark.deploy/SparkSubmitSuite/correctly_builds_R_packages_included_in_a_jar_with___packages/
@JoshRosen I'm taking a look. If I can't find anything soon I'll send a PR to disable the test again.
Ok, I got the error log from log4j and saved it to https://gist.github.com/shivaram/3a2fecce60768a603dac. From the logs I see the error, but I'm not sure of a fix, so I'll send a PR to disable the test and we can investigate this further.
See #9390 (comment) and https://gist.github.com/shivaram/3a2fecce60768a603dac for more information.

Author: Shivaram Venkataraman <[email protected]>

Closes #9744 from shivaram/sparkr-package-test-disable.

(cherry picked from commit ea6f53e)
Signed-off-by: Andrew Or <[email protected]>
This is probably because when the test is being executed, the SparkR library is not available under $SPARK_HOME/R/lib. Does the build have "-Psparkr" specified?
Great catch @sun-rui -- I think it doesn't have
@JoshRosen Could you add
[SPARK-10500][SPARKR] sparkr.zip cannot be created if /R/lib is unwritable

Backport #9390 and #9744 to branch-1.5.

Author: Sun Rui <[email protected]>
Author: Shivaram Venkataraman <[email protected]>

Closes #10372 from sun-rui/SPARK-10500-branch-1.5.