[SPARK-18590][SPARKR] build R source package when making distribution #16014
Conversation
Version: 2.1.0
Date: 2016-11-06
The Date field is removed - I tried but haven't found a way to update it automatically (I guess this could be done in the release-tag script though).
But more importantly, it seems many (most?) packages do not have this field in their DESCRIPTION file.
In any case, the release date is stamped when releasing to CRAN.
@@ -3,7 +3,7 @@
importFrom("methods", "setGeneric", "setMethod", "setOldClass")
importFrom("methods", "is", "new", "signature", "show")
importFrom("stats", "gaussian", "setNames")
importFrom("utils", "download.file", "object.size", "packageVersion", "untar")
importFrom("utils", "download.file", "object.size", "packageVersion", "tail", "untar")
This regressed in a recent commit. check-cran.sh actually flags this, by appending to an existing NOTE, but we only check the number of NOTEs (which is still 1), so the regression went in undetected.
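A minimal sketch of why a count-based gate misses this: a new problem appended to the body of an existing NOTE leaves the NOTE count unchanged. The log excerpts and helper name below are made up for illustration.

```shell
# Hypothetical R CMD check excerpts; only the "NOTE" heading line matters to a
# count-based gate, so extra lines appended under it go unnoticed.
log_before='* checking R code for possible problems ... NOTE
Undocumented code objects: tail'
log_after='* checking R code for possible problems ... NOTE
Undocumented code objects: tail
  also flagged: object.size'

# Count lines containing NOTE in a check log.
count_notes() {
  printf '%s\n' "$1" | grep -c 'NOTE'
}

echo "before: $(count_notes "$log_before") NOTE(s)"   # 1
echo "after:  $(count_notes "$log_after") NOTE(s)"    # 1 -- same count, so it slips through
```

Both logs count as one NOTE, which is exactly how the regression above went undetected.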
@@ -189,6 +189,9 @@ if [[ "$1" == "package" ]]; then
SHA512 $PYTHON_DIST_NAME > \
$PYTHON_DIST_NAME.sha

echo "Copying R source package"
cp spark-$SPARK_VERSION-bin-$NAME/R/SparkR_$SPARK_VERSION.tar.gz .
this is the source package we should release to CRAN
For clarity, is this the heart of the change? We were including the R source in releases before, right - at least in the source release? Does this add something different?
Thanks @srowen for asking. I've updated the PR description above to clarify this:
"This PR has 2 key changes. One, we are building a source package (aka bundle package) for SparkR which could be released on CRAN. Two, we should include in the official Spark binary distributions SparkR installed from this source package instead (which would have the help/vignettes rds files needed for those to work when the SparkR package is loaded in R, whereas the earlier approach with devtools does not)."
@shivaram Can you please have a look?
Test build #69175 has finished for PR 16014 at commit
Thanks @felixcheung - the change mostly looks good. The one thing I want to do is check whether this works locally (we should ideally also check this on Jenkins to avoid any surprises while cutting the release, cc @rxin).
Also, do you think there are any changes in terms of what goes into the actual release zip / tar.gz built?
@@ -82,4 +83,20 @@ else
# This will run tests and/or build vignettes, and require SPARK_HOME
SPARK_HOME="${SPARK_HOME}" "$R_SCRIPT_PATH/"R CMD check $CRAN_CHECK_OPTIONS SparkR_"$VERSION".tar.gz
fi

# Install source package to get it to generate vignettes rds files, etc.
if [ -n "$CLEAN_INSTALL" ]
Isn't this already done by install-dev.sh? I'm a bit confused as to why we need to call install again.
This is as mentioned above:
"include in the official Spark binary distributions SparkR installed from this source package instead (which would have help/vignettes rds needed for those to work when the SparkR package is loaded in R, whereas earlier approach with devtools does not)"
"R CMD INSTALL on the source package (this is the only way to generate doc/vignettes rds files correctly, not in step 1) (the output of this step is what we package into Spark dist and sparkr.zip)"
Apparently the output is different with R CMD INSTALL versus what devtools is doing. I'll dig through the content and list the differences here.
So I did the diff. Here are the new files in the output of make-distribution in the master branch with this change vs. 2.0.0.
Files added:
- R/lib/SparkR/Meta/vignette.rds
- R/lib/SparkR/doc/
- R/lib/SparkR/doc/index.html
- R/lib/SparkR/doc/sparkr-vignettes.R
- R/lib/SparkR/doc/sparkr-vignettes.Rmd
- R/lib/SparkR/doc/sparkr-vignettes.html
Files removed: a bunch of HTML files, from
R/lib/SparkR/html/AFTSurvivalRegressionModel-class.html
...
R/lib/SparkR/html/year.html
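A file-level diff like the one above can be reproduced by comparing the R/lib trees of two extracted distributions. The directory and file names below are toy stand-ins created just for the demonstration, not the real distribution layout.

```shell
# Build two toy trees standing in for the 2.0.0 and patched distributions.
old=old-dist/R/lib/SparkR
new=new-dist/R/lib/SparkR
mkdir -p "$old/html" "$new/Meta" "$new/doc"
touch "$old/html/year.html"
touch "$new/Meta/vignette.rds" "$new/doc/sparkr-vignettes.Rmd"

# -r recurses into subdirectories; -q reports only which files differ or
# exist on one side, which is the file-level view used above.
diff -rq "$old" "$new" || true
```

Each "Only in ..." line corresponds to an added or removed file in the comparison above.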
So it looks like we lost the knitted HTML files in the SparkR package with this change. FWIW this may not be bad: the HTML files are not usually used locally, only for the website, and I think the docs creation part of the build should pick that up. (Verifying that now.)
# Build source package and run full checks
# Install source package to get it to generate vignettes, etc.
# Do not source the check-cran.sh - it should be run from where it is for it to set SPARK_HOME
NO_TESTS=1 CLEAN_INSTALL=1 "$SPARK_HOME/"R/check-cran.sh
It's a little awkward that we use check-cran.sh to build and install the package. I think it points to the fact that we can refactor the scripts more, but that can be done in a future PR.
I agree. I think it is somewhat debatable whether we should run R CMD check in make-distribution.sh - but I feel there are enough gaps in what we check in Jenkins that it is worthwhile to repeat it here.
For everything else it's just convenient to call R from here. We could factor out the R environment stuff and have a separate install.sh (possibly replacing install-dev.sh, since this does more with the source package). What do you think?
Yeah longer term that sounds like a good idea.
I thought we don't run ...
Just to clarify the Jenkins comment: we use Jenkins to test PRs, and that doesn't run make-distribution.sh. But the main release builds (say 2.0.2) are also built on Jenkins, by a separate job that calls release-build.sh [1].
[1] https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L149
Why is this a blocker thing?
@shivaram Cool, I did know about release-build but I didn't know it's running on Jenkins. I think we should be ok, but we might want to check that Jenkins has "e1071" and "survival", which are optional for compatibility tests.
@rxin This PR updates what goes into the Spark binary release to match what we (intend to) release on CRAN for the R package.
As for the diff, this is the delta between this PR and Spark 2.0.2 under the R/lib/SparkR directory. It turns out what's additional - the new files - is what is required for vignettes to work. If you recall, this is the conversation that prompted this change. What's omitted is what it used to have.
Test build #69315 has finished for PR 16014 at commit
Any more thoughts on this? Without this we don't really have a signed tarball in the official release to release to CRAN...
Sorry, I was caught up with some other stuff today. Will take a final look tomorrow morning.
Sorry for the delay @felixcheung - there was a bunch of other stuff I got caught up with. I think this change itself looks good. The only minor things I found were the issue with the html files and the need to change release-build.sh.
I can merge this to master and branch-2.1 unless you think we should fix those issues in this change.
@@ -71,6 +72,9 @@ while (( "$#" )); do
--pip)
MAKE_PIP=true
;;
--r)
MAKE_R=true
FWIW if you want this to get picked up by the official release building procedure, we also need to edit release-build.sh [1]. Can you coordinate this with @rxin?
[1] spark/dev/create-release/release-build.sh, line 220 in d9eb4c7:
make_binary_release "hadoop2.3" "-Phadoop-2.3 $FLAGS" "3033" &
As for release-build.sh - I had the change in there, but I've changed it to make it clearer.
Test build #69655 has finished for PR 16014 at commit
Test build #69712 has finished for PR 16014 at commit
Thanks, updated. @rxin could you please review release-build.sh?
FLAGS="-Psparkr -Phive -Phive-thriftserver -Pyarn -Pmesos"
make_binary_release "hadoop2.3" "-Phadoop-2.3 $FLAGS" "3033" &
make_binary_release "hadoop2.4" "-Phadoop-2.4 $FLAGS" "3034" &
make_binary_release "hadoop2.6" "-Phadoop-2.6 $FLAGS" "3035" &
make_binary_release "hadoop2.7" "-Phadoop-2.7 $FLAGS" "3036" "withpip" &
make_binary_release "hadoop2.4-without-hive" "-Psparkr -Phadoop-2.4 -Pyarn -Pmesos" "3037" &
make_binary_release "without-hadoop" "--r -Psparkr -Phadoop-provided -Pyarn -Pmesos" "3038" &
make_binary_release "without-hadoop" "-Psparkr -Phadoop-provided -Pyarn -Pmesos" "3038" "withr" &
Any specific reason to use the without-hadoop build for the R package? Just wondering if this will affect users in any fashion.
It was mostly to use a "separate profile" from "withpip".
Running R CMD build would run some Spark code (mainly in vignettes, since we turn off tests in R CMD check), but nothing that depends on the file system etc.
Also the Spark jar, while loaded and called into during that process, will not be packaged into the resulting R source package, so I thought it didn't matter which build profile we ran this in.
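One way to sanity-check that claim is to list the contents of the built source tarball and grep for jars. The tarball below is a toy stand-in created just for the demonstration; a real SparkR_<version>.tar.gz would be inspected the same way.

```shell
# Build a toy "source package" tarball to demonstrate the listing check.
mkdir -p SparkR/R
printf 'Package: SparkR\nVersion: 2.1.0\n' > SparkR/DESCRIPTION
tar -czf SparkR_2.1.0.tar.gz SparkR

# List the archive contents and look for any bundled jar.
if tar -tzf SparkR_2.1.0.tar.gz | grep -q '\.jar$'; then
  echo "jar found in source package"
else
  echo "no jar in source package"
fi
```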
I think it sounds fine. I was waiting to see if @rxin (or @JoshRosen?) would take a look, because I have not reviewed changes to this file before. Let me take another closer look and then we can merge it to branch-2.1 - we'll see what happens to the RC process after that.
cd ..

echo "Copying and signing R source package"
R_DIST_NAME=SparkR_$SPARK_VERSION.tar.gz
Just to clarify, this is the tgz that we will upload to CRAN, right?
LGTM. I took another look at ...
This PR has 2 key changes. One, we are building source package (aka bundle package) for SparkR which could be released on CRAN. Two, we should include in the official Spark binary distributions SparkR installed from this source package instead (which would have help/vignettes rds needed for those to work when the SparkR package is loaded in R, whereas earlier approach with devtools does not). But, because of various differences in how R performs different tasks, this PR is a fair bit more complicated. More details below. This PR also includes a few minor fixes.
These are the additional steps in make-distribution; please see [here](https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md) on what goes into a CRAN release, which is now run during make-distribution.sh:
1. The package needs to be installed because the first code block in vignettes is `library(SparkR)` without a lib path.
2. `R CMD build` will build vignettes (this process runs Spark/SparkR code and captures outputs into pdf documentation).
3. `R CMD check` on the source package will install the package and build vignettes again (this time from the source package) - this is a key step required to release an R package on CRAN (tests are skipped here, but tests will need to pass for the CRAN release process to succeed - ideally, during release signoff we should install from the R source package and run tests).
4. `R CMD INSTALL` on the source package (this is the only way to generate doc/vignettes rds files correctly, not in step 1); the output of this step is what we package into the Spark dist and sparkr.zip.
Alternatively, R CMD build should already be installing the package in a temp directory, though it might just be finding this location and setting it as the lib.loc parameter; another approach is perhaps to try calling `R CMD INSTALL --build pkg` instead. But in any case, despite installing the package multiple times, this is relatively fast. Building vignettes takes a while though.
Tested manually and via CI.
Author: Felix Cheung <[email protected]>
Closes #16014 from felixcheung/rdist.
(cherry picked from commit c3d3a9d)
Signed-off-by: Shivaram Venkataraman <[email protected]>
@felixcheung I triggered a nightly build on branch-2.1 and this doesn't work correctly in the no-hadoop build. While building the vignettes we run into an error - I'm going to change this to use another of the hadoop builds.
This PR changes the SparkR source release tarball to be built using the Hadoop 2.6 profile. Previously it was using the without-hadoop profile, which leads to an error as discussed in #16014 (comment).
Author: Shivaram Venkataraman <[email protected]>
Closes #16218 from shivaram/fix-sparkr-release-build.
(cherry picked from commit 202fcd2)
Signed-off-by: Shivaram Venkataraman <[email protected]>
Hmm, still doesn't work.
Hmm - the problem is that the directory we want to get the file from gets deleted in the previous step? I'm going to debug this locally now.
This just means the source tgz file is not created?
Copying and signing R source package
cp: cannot stat `spark-2.1.1-SNAPSHOT-bin-hadoop2.6/R/SparkR_2.1.1-SNAPSHOT.tar.gz': No such file or directory
I'm guessing maybe the path needs to be a full path?
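A small guard around the cp step would at least turn this into a clearer error and show what actually exists. This is a hypothetical sketch, not what release-build.sh does; the names mirror the failing log line above but are illustrative only.

```shell
# Names taken from the failing log line above, for illustration.
DIST_DIR="spark-2.1.1-SNAPSHOT-bin-hadoop2.6"
R_DIST_NAME="SparkR_2.1.1-SNAPSHOT.tar.gz"

if [ -f "$DIST_DIR/R/$R_DIST_NAME" ]; then
  cp "$DIST_DIR/R/$R_DIST_NAME" .
else
  # Fail loudly and show what is actually present, instead of a bare cp error.
  echo "ERROR: $R_DIST_NAME not found under $DIST_DIR/R; contents are:" >&2
  ls "$DIST_DIR/R" 2>/dev/null || echo "(directory missing)" >&2
fi
```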
No, I found the problem - it's twofold:
[1] spark/dev/make-distribution.sh, line 244 in 202fcd2
For the 2nd point, I thought release-tag.sh should update the version in the DESCRIPTION file. Was that not the case when we made preparations for 2.1.1?
I think release-tag.sh is only run while building RCs and final releases - I don't think we run it for nightly builds, so that's not getting run as a part of my test.
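For reference, a release script can stamp the version into DESCRIPTION with a one-line sed. This is a hypothetical sketch of the idea, not the exact command release-tag.sh uses, and the DESCRIPTION file here is a toy one created for the demonstration.

```shell
# Toy DESCRIPTION file to demonstrate the substitution.
printf 'Package: SparkR\nVersion: 2.1.0\n' > DESCRIPTION
NEW_VERSION="2.1.1"

# Rewrite the Version: line in place.
sed -i -e "s/^Version: .*/Version: ${NEW_VERSION}/" DESCRIPTION
grep '^Version:' DESCRIPTION
```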
Hmm, looking more closely, I don't think the problem is the directory being deleted / being in
spark/dev/create-release/release-build.sh, line 184 in 202fcd2
New PR in #16221
…emove pip tar.gz from distribution
Fixes the name of the R source package so that the `cp` in release-build.sh works correctly. Issue discussed in #16014 (comment).
Author: Shivaram Venkataraman <[email protected]>
Closes #16221 from shivaram/fix-sparkr-release-build-name.
(cherry picked from commit 4ac8b20)
Signed-off-by: Shivaram Venkataraman <[email protected]>
What changes were proposed in this pull request?
This PR has 2 key changes. One, we are building a source package (aka bundle package) for SparkR which could be released on CRAN. Two, we should include in the official Spark binary distributions SparkR installed from this source package instead (which would have help/vignettes rds needed for those to work when the SparkR package is loaded in R, whereas the earlier approach with devtools does not).
But, because of various differences in how R performs different tasks, this PR is a fair bit more complicated. More details below.
This PR also includes a few minor fixes.
more details
These are the additional steps in make-distribution; please see here on what goes into a CRAN release, which is now run during make-distribution.sh.
1. The package needs to be installed because the first code block in vignettes is library(SparkR) without a lib path.
2. R CMD build will build vignettes (this process runs Spark/SparkR code and captures outputs into pdf documentation).
3. R CMD check on the source package will install the package and build vignettes again (this time from the source package) - this is a key step required to release an R package on CRAN (tests are skipped here, but tests will need to pass for the CRAN release process to succeed - ideally, during release signoff we should install from the R source package and run tests).
4. R CMD INSTALL on the source package (this is the only way to generate doc/vignettes rds files correctly, not in step 1); the output of this step is what we package into the Spark dist and sparkr.zip.
Alternatively, R CMD build should already be installing the package in a temp directory, though it might just be finding this location and setting it as the lib.loc parameter; another approach is perhaps to try calling R CMD INSTALL --build pkg instead.
But in any case, despite installing the package multiple times, this is relatively fast. Building vignettes takes a while though.
How was this patch tested?
Manually, CI.