
[SPARK-18590][SPARKR] build R source package when making distribution #16014

Closed · wants to merge 4 commits

Conversation

@felixcheung (Member) commented Nov 25, 2016

What changes were proposed in this pull request?

This PR has two key changes. One, we build the source package (aka bundle package) for SparkR, which can then be released on CRAN. Two, the official Spark binary distributions should include SparkR installed from this source package instead (which has the help/vignettes rds files needed for those features to work when the SparkR package is loaded in R, whereas the earlier devtools-based approach does not).

But, because of various differences in how R performs different tasks, this PR is a fair bit more complicated. More details below.

This PR also includes a few minor fixes.

more details

These are the additional steps in make-distribution; please see [CRAN_RELEASE.md](https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md) for what goes into a CRAN release, which is now run during make-distribution.sh. A shell sketch of these steps follows the list below.

  1. The package needs to be installed first, because the first code block in the vignettes is `library(SparkR)` without a lib path
  2. `R CMD build` builds the vignettes (this process runs Spark/SparkR code and captures the outputs into the pdf documentation)
  3. `R CMD check` on the source package installs the package and builds the vignettes again (this time from the source package) - this is a key step required to release an R package on CRAN
    (tests are skipped here, but they will need to pass for the CRAN release process to succeed - ideally, during release signoff we should install from the R source package and run the tests)
  4. `R CMD INSTALL` on the source package (this is the only way to generate the doc/vignettes rds files correctly; step 1 does not)
    (the output of this step is what we package into the Spark dist and sparkr.zip)
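To make this concrete, here is a minimal shell sketch of the four steps. The paths and the 2.1.0 version are illustrative assumptions, not the exact make-distribution.sh code:

```sh
# Make the install target visible to R so `library(SparkR)` works without a lib path
export R_LIBS="$SPARK_HOME/R/lib"

# 1. Install the package so the first vignette code block can load it
R CMD INSTALL --library="$SPARK_HOME/R/lib" "$SPARK_HOME/R/pkg"

# 2. Build the source package; this knits the vignettes by running Spark/SparkR code
R CMD build "$SPARK_HOME/R/pkg"

# 3. CRAN-style checks on the source package, skipping tests as described above
R CMD check --no-tests SparkR_2.1.0.tar.gz

# 4. Install from the source package; this is what generates the doc/vignettes rds files
R CMD INSTALL --library="$SPARK_HOME/R/lib" SparkR_2.1.0.tar.gz
```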

Alternatively, R CMD build already installs the package into a temp directory, though it may just be locating that directory and passing it as the lib.loc parameter; another approach might be to call `R CMD INSTALL --build pkg` instead (see the sketch below).
But in any case, despite installing the package multiple times, this is relatively fast.
Building vignettes takes a while, though.
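For reference, a sketch of that alternative: `R CMD INSTALL --build` installs the package and then bundles the installed result in one step (hypothetical usage, not what this PR does):

```sh
# Install from the pkg/ sources and also emit a binary bundle of the installed package
R CMD INSTALL --build --library="$SPARK_HOME/R/lib" "$SPARK_HOME/R/pkg"
```

Note that `--build` produces a binary package rather than the source package CRAN expects, so it would complement rather than replace the `R CMD build` step.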

How was this patch tested?

Manually, CI.

 Version: 2.1.0
-Date: 2016-11-06
@felixcheung (Member, Author) commented Nov 25, 2016

This is removed - I tried but haven't found a way to update it automatically (I guess this could go in the release-tag script though). But more importantly, it seems that many (most?) packages do not have this in their DESCRIPTION file.

In any case, the release date is stamped when releasing to CRAN.

@@ -3,7 +3,7 @@
 importFrom("methods", "setGeneric", "setMethod", "setOldClass")
 importFrom("methods", "is", "new", "signature", "show")
 importFrom("stats", "gaussian", "setNames")
-importFrom("utils", "download.file", "object.size", "packageVersion", "untar")
+importFrom("utils", "download.file", "object.size", "packageVersion", "tail", "untar")
@felixcheung (Member, Author) commented Nov 25, 2016

This regressed in a recent commit. check-cran.sh actually flags this by appending to an existing NOTE, but we only check the number of NOTEs (which is still 1), so it went in undetected.
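To illustrate how an appended NOTE can slip through, here is a sketch of a NOTE-counting gate of the kind described; the logic is illustrative, not the actual check-cran.sh contents:

```sh
# Run CRAN-style checks and capture the log
CHECK_OUTPUT=$(R CMD check --as-cran SparkR_2.1.0.tar.gz 2>&1)

# Count lines mentioning NOTE; R appends new findings (like the missing
# importFrom for 'tail') to an existing NOTE, so the count can stay at 1
# and a regression passes undetected
NOTE_LINES=$(echo "$CHECK_OUTPUT" | grep -c "NOTE")
if [ "$NOTE_LINES" -gt 1 ]; then
  echo "R CMD check reported unexpected NOTEs"
  exit 1
fi
```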

@@ -189,6 +189,9 @@ if [[ "$1" == "package" ]]; then
 SHA512 $PYTHON_DIST_NAME > \
 $PYTHON_DIST_NAME.sha
 
+echo "Copying R source package"
+cp spark-$SPARK_VERSION-bin-$NAME/R/SparkR_$SPARK_VERSION.tar.gz .
@felixcheung (Member, Author) commented:

This is the source package we should release to CRAN.

@srowen (Member) commented:

For clarity, is this the heart of the change? We were including the R source in releases before, right - at least in the source release? Does this add something different?

@felixcheung (Member, Author) commented Nov 26, 2016

Thanks @srowen for asking. I've updated the PR description above to clarify this.

"
This PR has 2 key changes. One, we are building source package (aka bundle package) for SparkR which could be released on CRAN. Two, we should include in the official Spark binary distributions SparkR installed from this source package instead (which would have help/vignettes rds needed for those to work when the SparkR package is loaded in R, whereas earlier approach with devtools does not)
"

@felixcheung (Member, Author) commented Nov 26, 2016

@shivaram Can you please have a look?

@SparkQA commented Nov 26, 2016

Test build #69175 has finished for PR 16014 at commit 7977139.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor) left a comment

Thanks @felixcheung - the change mostly looks good. The one thing I want to do is check that this works locally (we should ideally also check this on Jenkins to avoid any surprises while cutting the release, cc @rxin).

Also, do you think there are any changes in terms of what goes into the actual release zip / tar.gz?

@@ -82,4 +83,20 @@ else
 # This will run tests and/or build vignettes, and require SPARK_HOME
 SPARK_HOME="${SPARK_HOME}" "$R_SCRIPT_PATH/"R CMD check $CRAN_CHECK_OPTIONS SparkR_"$VERSION".tar.gz
 fi
 
+# Install source package to get it to generate vignettes rds files, etc.
+if [ -n "$CLEAN_INSTALL" ]
@shivaram (Contributor) commented:

Isn't this already done by install-dev.sh? I'm a bit confused as to why we need to call install again.

@felixcheung (Member, Author) commented Nov 28, 2016

This is as mentioned above:

include in the official Spark binary distributions SparkR installed from this source package (which has the help/vignettes rds files needed for those features to work when the SparkR package is loaded in R, whereas the earlier devtools-based approach does not)

R CMD INSTALL on the source package (this is the only way to generate the doc/vignettes rds files correctly; step 1 does not)
(the output of this step is what we package into the Spark dist and sparkr.zip)

Apparently the output differs between R CMD INSTALL and what devtools does. I'll dig through the contents and list them here.

@shivaram (Contributor) commented:

So I did the diff. Here are the new files in the output of make-distribution on the master branch with this change, vs. 2.0.0.
Files Added:

- R/lib/SparkR/Meta/vignette.rds
- /R/lib/SparkR/doc/
- /R/lib/SparkR/doc/index.html
- /R/lib/SparkR/doc/sparkr-vignettes.R
- /R/lib/SparkR/doc/sparkr-vignettes.Rmd
- /R/lib/SparkR/doc/sparkr-vignettes.html

Files removed: A bunch of HTML files starting from

/R/lib/SparkR/html/AFTSurvivalRegressionModel-class.html
...
/R/lib/SparkR/html/year.html
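For anyone wanting to reproduce this kind of comparison, a sketch (the directory names are illustrative):

```sh
# List the installed SparkR files in each build and diff the listings
diff <(cd spark-2.0.0-bin-hadoop2.7 && find R/lib/SparkR -type f | sort) \
     <(cd dist && find R/lib/SparkR -type f | sort)
```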

@shivaram (Contributor) commented:

So it looks like we lost the knitted HTML files in the SparkR package with this change. FWIW this may not be bad, as the html files are not usually used locally, only for the website, and I think the docs creation part of the build should pick that up. (Verifying that now.)

# Build source package and run full checks
# Install source package to get it to generate vignettes, etc.
# Do not source the check-cran.sh - it should be run from where it is for it to set SPARK_HOME
NO_TESTS=1 CLEAN_INSTALL=1 "$SPARK_HOME/"R/check-cran.sh
@shivaram (Contributor) commented:

It's a little awkward that we use check-cran.sh to build and install the package. I think it points to the fact that we could refactor the scripts more, but that can be done in a future PR.

@felixcheung (Member, Author) commented:

I agree. I think it is somewhat debatable whether we should run R CMD check in make-distribution.sh - but I feel there are enough gaps in what we check on Jenkins that it is worthwhile to repeat it here.

For everything else it's just convenient to call R from here. We could factor out the R environment setup and have a separate install.sh (possibly replacing install-dev.sh, since this does more with the source package). What do you think?

@shivaram (Contributor) commented:

Yeah longer term that sounds like a good idea.

@felixcheung (Member, Author) commented:

I thought we don't run make-distribution.sh on Jenkins, so I have been testing it manually/locally.
As noted here, the output of the install is different - I'll come back and list that in a bit.

@shivaram (Contributor) commented:

Just to clarify the Jenkins comment: we use Jenkins to test PRs, and that doesn't run make-distribution.sh. But the main release builds (say 2.0.2 [2]) are also built on Jenkins, by a separate job that just calls release-build.sh package [1] with the appropriate environment variables.

[1] https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L149
[2]https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/job/spark-release-package/57/consoleFull

@rxin (Contributor) commented Nov 29, 2016

Why is this a blocker thing?

@felixcheung (Member, Author) commented Nov 29, 2016

@shivaram Cool - I knew about release-build, but I didn't know it runs on Jenkins. I think we should be ok, but we might want to check that Jenkins has "e1071" and "survival", which are optional for the compatibility tests but which R CMD check enforces/requires.

@rxin This PR updates what goes into the Spark binary release to match what we (intend to) release on CRAN for the R package.

As for the diff, this is the delta between this PR and Spark 2.0.2 under the R/lib/SparkR directory. It turns out R CMD check also depends on the Rd file generation in install-dev.sh (i.e. `devtools::document(pkg="./pkg", roclets=c("rd"))`) - it is going to take more time to untangle this in a follow-up.
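For reference, the quoted install-dev.sh step can be run standalone like this (a sketch; it assumes devtools is installed and that it is run from the R/ directory):

```sh
# Generate the Rd help sources from roxygen comments, as install-dev.sh does
Rscript -e 'devtools::document(pkg = "./pkg", roclets = c("rd"))'
```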

what's additional

SparkR/
-rw-r--r--   INDEX
drwxr-xr-x   doc

SparkR/Meta/
-rw-r--r--   vignette.rds

SparkR/doc/
-rw-r--r--   sparkr-vignettes.Rmd
-rw-r--r--   sparkr-vignettes.R
-rw-r--r--   sparkr-vignettes.html
-rw-r--r--   index.html

These new files are required for vignettes to work. If you recall, this is the conversation that prompted this change.

what's omitted

SparkR/html/
-rw-r--r--  R.css
-rw-r--r--  00Index.html

Where it used to have year.html, write.parquet.html, sparkR.session.html, etc., the html directory now only has 2 files. My understanding is that these knitr html output files are not actually used at runtime, only as static pages we serve at http://spark.apache.org/docs/latest/api/R/index.html. I checked that ?sparkR.session in the sparkR shell still works correctly; a sketch of that sanity check follows.
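The check can also be scripted against the installed package (illustrative paths, run from the dist directory):

```sh
# Confirm the help topic and vignette metadata resolve from the installed SparkR
R_LIBS="$PWD/R/lib" Rscript -e 'library(SparkR); print(help("sparkR.session")); print(vignette(package = "SparkR"))'
```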

@SparkQA commented Nov 29, 2016

Test build #69315 has finished for PR 16014 at commit c9c9802.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member, Author) commented:

Any more thoughts on this? Without this we don't really have a signed tarball in the official release to release to CRAN...

@shivaram (Contributor) commented:

Sorry, I was caught up with some other stuff today. Will take a final look tomorrow morning.

@shivaram (Contributor) left a comment

Sorry for the delay @felixcheung - there was a bunch of other stuff I got caught up with. I think this change itself looks good. The only minor things I found were the issue with the html files and the need to change release-build.sh.

I can merge this to master and branch-2.1 unless you think we should fix those issues in this change.


@@ -71,6 +72,9 @@ while (( "$#" )); do
 --pip)
 MAKE_PIP=true
 ;;
+--r)
+MAKE_R=true
@shivaram (Contributor) commented:

FWIW, if you want this to get picked up by the official release building procedure, we also need to edit release-build.sh [1]. Can you coordinate this with @rxin?

[1]

make_binary_release "hadoop2.3" "-Phadoop-2.3 $FLAGS" "3033" &

@felixcheung (Member, Author) commented:

@shivaram yes, that is correct and is what I found here.

Basically I don't think it is the R source package process's job to include the html files (they are still getting created here), and, as explained and as you pointed out, they are not actually used by R.

@felixcheung (Member, Author) commented:

As for release-build.sh - I had the change in there, but I've since changed it to make it clearer.

@shivaram (Contributor) commented Dec 5, 2016

Ah, got it - I didn't notice that. BTW, the way @rxin handled this for pip is a bit different, in 37e52f8.

Might be good to merge with upstream and handle pip and R similarly?

@SparkQA commented Dec 5, 2016

Test build #69655 has finished for PR 16014 at commit c17601b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 6, 2016

Test build #69712 has finished for PR 16014 at commit 6ef26fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member, Author) commented:

Thanks, updated. @rxin could you please review release-build.sh?

 FLAGS="-Psparkr -Phive -Phive-thriftserver -Pyarn -Pmesos"
 make_binary_release "hadoop2.3" "-Phadoop-2.3 $FLAGS" "3033" &
 make_binary_release "hadoop2.4" "-Phadoop-2.4 $FLAGS" "3034" &
 make_binary_release "hadoop2.6" "-Phadoop-2.6 $FLAGS" "3035" &
 make_binary_release "hadoop2.7" "-Phadoop-2.7 $FLAGS" "3036" "withpip" &
 make_binary_release "hadoop2.4-without-hive" "-Psparkr -Phadoop-2.4 -Pyarn -Pmesos" "3037" &
-make_binary_release "without-hadoop" "--r -Psparkr -Phadoop-provided -Pyarn -Pmesos" "3038" &
+make_binary_release "without-hadoop" "-Psparkr -Phadoop-provided -Pyarn -Pmesos" "3038" "withr" &
@shivaram (Contributor) commented:

Any specific reason to use the without-hadoop build for the R package? Just wondering if this will affect users in any fashion.

@felixcheung (Member, Author) commented Dec 6, 2016

It was mostly to use a "separate profile" from "withpip".

Running R CMD build does run some Spark code (mainly in the vignettes, since we turn off tests in R CMD check), but nothing that depends on the file system etc.

Also, the Spark jar, while loaded and called into during that process, is not packaged into the resulting R source package, so I thought it didn't matter which build profile we ran this in.


@felixcheung (Member, Author) commented:

@shivaram, what do you think about this?

I'd like to merge this to branch-2.1 to see if we can make it into 2.1.0 if at all possible.

@shivaram (Contributor) commented:

I think it sounds fine. I was waiting to see if @rxin (or @JoshRosen?) would take a look, because I have not reviewed changes to this file before. Let me take another closer look and then we can merge it to branch-2.1 -- we'll see what happens to the RC process after that.

 cd ..
 
+echo "Copying and signing R source package"
+R_DIST_NAME=SparkR_$SPARK_VERSION.tar.gz
@shivaram (Contributor) commented:

Just to clarify, this is the tgz that we will upload to CRAN, right?

@shivaram (Contributor) commented Dec 8, 2016

LGTM. I took another look at release-build.sh and I think it looks fine. Merging this into master and branch-2.1. I'll also see if I can test this out somehow on Jenkins.

@asfgit closed this in c3d3a9d Dec 8, 2016
asfgit pushed a commit that referenced this pull request Dec 8, 2016

Author: Felix Cheung <[email protected]>

Closes #16014 from felixcheung/rdist.

(cherry picked from commit c3d3a9d)
Signed-off-by: Shivaram Venkataraman <[email protected]>
@shivaram (Contributor) commented Dec 8, 2016

@felixcheung I triggered a nightly build on branch-2.1 and this doesn't work correctly in the no-hadoop build. While building the vignettes we run into an error - I'm going to change this to use another of the hadoop builds:

~/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-without-hadoop/R
* checking for file '/home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-without-hadoop/R/pkg/DESCRIPTION' ... OK
* preparing 'SparkR':
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... ERROR

Attaching package: 'SparkR'

The following objects are masked from 'package:stats':

    cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from 'package:base':

    as.data.frame, colnames, colnames<-, drop, intersect, rank,
    rbind, sample, subset, summary, transform, union

Spark package found in SPARK_HOME: /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-without-hadoop
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
        at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:118)
        at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:118)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.deploy.SparkSubmitArguments.mergeDefaultSparkProperties(SparkSubmitArguments.scala:118)
        at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:104)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        ... 7 more
Quitting from lines 31-32 (sparkr-vignettes.Rmd)
Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
JVM is not ready after 10 seconds
Execution halted
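For context, sparkR.session() launches spark-submit under the hood, which is what fails here without Hadoop classes on the classpath. The failing vignette chunk does roughly the following (a sketch; the exact chunk contents are in sparkr-vignettes.Rmd):

```sh
# Roughly what building the vignette triggers under the hood
Rscript -e 'library(SparkR); sparkR.session()'
```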

asfgit pushed a commit that referenced this pull request Dec 8, 2016
This PR changes the SparkR source release tarball to be built using the Hadoop 2.6 profile. Previously it was using the without hadoop profile which leads to an error as discussed in #16014 (comment)

Author: Shivaram Venkataraman <[email protected]>

Closes #16218 from shivaram/fix-sparkr-release-build.

(cherry picked from commit 202fcd2)
Signed-off-by: Shivaram Venkataraman <[email protected]>
asfgit pushed a commit that referenced this pull request Dec 8, 2016

Author: Shivaram Venkataraman <[email protected]>

Closes #16218 from shivaram/fix-sparkr-release-build.
@shivaram (Contributor) commented Dec 8, 2016

Hmm, still doesn't work:

* DONE (SparkR)
+ popd
+ mkdir /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/dist/conf
+ cp /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/conf/docker.properties.template /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/conf/fairscheduler.xml.template /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/conf/log4j.properties.template /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/conf/metrics.properties.template /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/conf/slaves.template /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/conf/spark-defaults.conf.template /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/conf/spark-env.sh.template /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/dist/conf
+ cp /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/README.md /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/dist
+ cp -r /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/bin /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/dist
+ cp -r /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/python /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/dist
+ cp -r /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/sbin /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/dist
+ '[' -d /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/R/lib/SparkR ']'
+ mkdir -p /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/dist/R/lib
+ cp -r /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/R/lib/SparkR /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/dist/R/lib
+ cp /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/R/lib/sparkr.zip /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/dist/R/lib
+ '[' true == true ']'
+ TARDIR_NAME=spark-2.1.1-SNAPSHOT-bin-hadoop2.6
+ TARDIR=/home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/spark-2.1.1-SNAPSHOT-bin-hadoop2.6
+ rm -rf /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/spark-2.1.1-SNAPSHOT-bin-hadoop2.6
+ cp -r /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/dist /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/spark-2.1.1-SNAPSHOT-bin-hadoop2.6
+ tar czf spark-2.1.1-SNAPSHOT-bin-hadoop2.6.tgz -C /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6 spark-2.1.1-SNAPSHOT-bin-hadoop2.6
+ rm -rf /home/jenkins/workspace/spark-branch-2.1-package/spark-2.1.1-SNAPSHOT-bin-hadoop2.6/spark-2.1.1-SNAPSHOT-bin-hadoop2.6
Copying and signing R source package
cp: cannot stat `spark-2.1.1-SNAPSHOT-bin-hadoop2.6/R/SparkR_2.1.1-SNAPSHOT.tar.gz': No such file or directory

@shivaram (Contributor) commented Dec 8, 2016

Hmm - the problem is that the directory we want to get the file from gets deleted in the previous step? I'm going to debug this locally now.

@felixcheung (Member, Author) commented Dec 8, 2016 via email

@felixcheung (Member, Author) commented Dec 8, 2016 via email

@shivaram (Contributor) commented Dec 8, 2016

No, I found the problem - it's twofold:

  • The tgz is not copied into the dist directory in make-distribution.sh [1]. This is relatively easy to fix. It only works for Python because the entire python directory is copied into the dist directory.

  • The second problem is one of naming - our tarball is called SparkR_2.1.0.tar.gz, as that is the version in the DESCRIPTION file in R. I think we need to do a copy at some point to create SparkR_2.1.1-SNAPSHOT.tar.gz

[1]

mkdir -p "$DISTDIR"/R/lib
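A sketch of the kind of copy that was missing (illustrative only; the actual fix landed later in #16221):

```sh
# In make-distribution.sh, after check-cran.sh builds the source package:
# copy the tarball into dist/ under the Spark version's name so that
# release-build.sh can find SparkR_$SPARK_VERSION.tar.gz
cp "$SPARK_HOME"/R/SparkR_*.tar.gz "$DISTDIR/R/SparkR_$SPARK_VERSION.tar.gz"
```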

@felixcheung (Member, Author) commented Dec 8, 2016 via email

@felixcheung (Member, Author) commented Dec 8, 2016 via email

@shivaram (Contributor) commented Dec 8, 2016

I think release-tag.sh is only run while building RCs and final releases - I don't think we run it for nightly builds - so that's not getting run as part of my test.

@shivaram (Contributor) commented Dec 8, 2016

Hmm, looking more closely, I don't think the directory being deleted / being in dist/ is a real issue. This is because in release-build.sh we do cd .. [1] before trying to look for the file. I think just fixing the file naming should be sufficient. I'm going to try that first.
[1]

@shivaram (Contributor) commented Dec 8, 2016

New PR in #16221

asfgit pushed a commit that referenced this pull request Dec 9, 2016
…emove pip tar.gz from distribution

## What changes were proposed in this pull request?

Fixes name of R source package so that the `cp` in release-build.sh works correctly.

Issue discussed in #16014 (comment)

Author: Shivaram Venkataraman <[email protected]>

Closes #16221 from shivaram/fix-sparkr-release-build-name.

(cherry picked from commit 4ac8b20)
Signed-off-by: Shivaram Venkataraman <[email protected]>
ghost pushed a commit to dbtsai/spark that referenced this pull request Dec 9, 2016
…emove pip tar.gz from distribution


Author: Shivaram Venkataraman <[email protected]>

Closes apache#16221 from shivaram/fix-sparkr-release-build-name.
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016

Author: Felix Cheung <[email protected]>

Closes apache#16014 from felixcheung/rdist.
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016

Author: Shivaram Venkataraman <[email protected]>

Closes apache#16218 from shivaram/fix-sparkr-release-build.
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
…emove pip tar.gz from distribution


Author: Shivaram Venkataraman <[email protected]>

Closes apache#16221 from shivaram/fix-sparkr-release-build-name.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017

Author: Felix Cheung <[email protected]>

Closes apache#16014 from felixcheung/rdist.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017

Author: Shivaram Venkataraman <[email protected]>

Closes apache#16218 from shivaram/fix-sparkr-release-build.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…emove pip tar.gz from distribution


Author: Shivaram Venkataraman <[email protected]>

Closes apache#16221 from shivaram/fix-sparkr-release-build-name.