[SPARK-18590][SPARKR] build R source package when making distribution #16014
Conversation
Version: 2.1.0
Date: 2016-11-06
The Date field is removed - I tried but haven't found a way to update it automatically (I guess this could be done in the release-tag script though).
But more importantly, it seems many (most?) packages do not have this field in their DESCRIPTION file.
In any case, the release date is stamped when releasing to CRAN.
@@ -3,7 +3,7 @@
importFrom("methods", "setGeneric", "setMethod", "setOldClass")
importFrom("methods", "is", "new", "signature", "show")
importFrom("stats", "gaussian", "setNames")
importFrom("utils", "download.file", "object.size", "packageVersion", "untar")
importFrom("utils", "download.file", "object.size", "packageVersion", "tail", "untar")
This regressed in a recent commit. check-cran.sh actually flags this, by appending to an existing NOTE, but we only check the number of NOTEs (which is still 1), so the regression went in undetected.
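A minimal sketch of why a count-based gate misses this: a new problem appended to the body of an existing NOTE leaves the NOTE count unchanged. The log excerpts and helper name below are made up for illustration.

```shell
# Hypothetical R CMD check excerpts; only the "NOTE" heading line matters to a
# count-based gate, so extra lines appended under it go unnoticed.
log_before='* checking R code for possible problems ... NOTE
Undocumented code objects: tail'
log_after='* checking R code for possible problems ... NOTE
Undocumented code objects: tail
  also flagged: object.size'

# Count lines containing NOTE in a check log.
count_notes() {
  printf '%s\n' "$1" | grep -c 'NOTE'
}

echo "before: $(count_notes "$log_before") NOTE(s)"   # 1
echo "after:  $(count_notes "$log_after") NOTE(s)"    # 1 -- same count, so it slips through
```

Both logs count as one NOTE, which is exactly how the regression above went undetected.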
@@ -189,6 +189,9 @@ if [[ "$1" == "package" ]]; then
SHA512 $PYTHON_DIST_NAME > \
$PYTHON_DIST_NAME.sha

echo "Copying R source package"
cp spark-$SPARK_VERSION-bin-$NAME/R/SparkR_$SPARK_VERSION.tar.gz .
this is the source package we should release to CRAN
For clarity, is this the heart of the change? We were including the R source in releases before, right - at least in the source release? Does this add something different?
Thanks @srowen for asking. I've updated the PR description above to clarify this:
"This PR has 2 key changes. One, we are building a source package (aka bundle package) for SparkR which could be released on CRAN. Two, we should include in the official Spark binary distributions SparkR installed from this source package instead (which would have the help/vignettes rds files needed for those to work when the SparkR package is loaded in R, whereas the earlier approach with devtools does not)."
@shivaram Can you please have a look?
Test build #69175 has finished for PR 16014 at commit
Thanks @felixcheung - the change mostly looks good. The one thing I want to do is check whether this works locally (we should ideally also check this on Jenkins to avoid any surprises while cutting the release, cc @rxin).
Also, do you think there are any changes in terms of what goes into the actual release zip / tar.gz built?
@@ -82,4 +83,20 @@ else
# This will run tests and/or build vignettes, and require SPARK_HOME
SPARK_HOME="${SPARK_HOME}" "$R_SCRIPT_PATH/"R CMD check $CRAN_CHECK_OPTIONS SparkR_"$VERSION".tar.gz
fi

# Install source package to get it to generate vignettes rds files, etc.
if [ -n "$CLEAN_INSTALL" ]
Isn't this already done by install-dev.sh? I'm a bit confused as to why we need to call install again.
This is as mentioned above:
"include in the official Spark binary distributions SparkR installed from this source package instead (which would have help/vignettes rds needed for those to work when the SparkR package is loaded in R, whereas earlier approach with devtools does not)"
"R CMD INSTALL on the source package (this is the only way to generate doc/vignettes rds files correctly, not in step 1) (the output of this step is what we package into Spark dist and sparkr.zip)"
Apparently the output is different with R CMD INSTALL versus what devtools is doing. I'll dig through the content and list the differences here.
So I did the diff. Here are the new files in the output of make-distribution in the master branch with this change vs. 2.0.0.
Files added:
- R/lib/SparkR/Meta/vignette.rds
- R/lib/SparkR/doc/
- R/lib/SparkR/doc/index.html
- R/lib/SparkR/doc/sparkr-vignettes.R
- R/lib/SparkR/doc/sparkr-vignettes.Rmd
- R/lib/SparkR/doc/sparkr-vignettes.html
Files removed: a bunch of HTML files, from
R/lib/SparkR/html/AFTSurvivalRegressionModel-class.html
...
R/lib/SparkR/html/year.html
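A file-level diff like the one above can be reproduced by comparing the R/lib trees of two extracted distributions. The directory and file names below are toy stand-ins created just for the demonstration, not the real distribution layout.

```shell
# Build two toy trees standing in for the 2.0.0 and patched distributions.
old=old-dist/R/lib/SparkR
new=new-dist/R/lib/SparkR
mkdir -p "$old/html" "$new/Meta" "$new/doc"
touch "$old/html/year.html"
touch "$new/Meta/vignette.rds" "$new/doc/sparkr-vignettes.Rmd"

# -r recurses into subdirectories; -q reports only which files differ or
# exist on one side, which is the file-level view used above.
diff -rq "$old" "$new" || true
```

Each "Only in ..." line corresponds to an added or removed file in the comparison above.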
So it looks like we lost the knitted HTML files in the SparkR package with this change. FWIW this may not be bad: the HTML files are not usually used locally, only for the website, and I think the docs creation part of the build should pick that up. (Verifying that now.)
# Build source package and run full checks
# Install source package to get it to generate vignettes, etc.
# Do not source the check-cran.sh - it should be run from where it is for it to set SPARK_HOME
NO_TESTS=1 CLEAN_INSTALL=1 "$SPARK_HOME/"R/check-cran.sh
It's a little awkward that we use check-cran.sh to build and install the package. I think it points to the fact that we can refactor the scripts more, but that can be done in a future PR.
I agree. I think it is somewhat debatable whether we should run R CMD check in make-distribution.sh - but I feel there are enough gaps in what we check in Jenkins that it is worthwhile to repeat it here.
For everything else it's just convenient to call R from here. We could factor out the R environment stuff and have a separate install.sh (possibly replacing install-dev.sh, since this does more with the source package). What do you think?
Yeah longer term that sounds like a good idea.
I thought we don't run ...
Just to clarify the Jenkins comment: we use Jenkins to test PRs, and that doesn't run make-distribution.sh. But the main release builds (say 2.0.2) are also built on Jenkins, by a separate job that calls release-build.sh [1].
[1] https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L149
Why is this a blocker thing?
@shivaram Cool, I did know about release-build but I didn't know it's running on Jenkins. I think we should be ok, but we might want to check that Jenkins has "e1071" and "survival", which are optional for compatibility tests.
@rxin This PR updates what goes into the Spark binary release to match what we (intend to) release on CRAN for the R package.
As for the diff, this is the delta between this PR and Spark 2.0.2 under the R/lib/SparkR directory. It turns out what's additional - the new files - is what is required for vignettes to work. If you recall, this is the conversation that prompted this change. What's omitted is what it used to have.
Test build #69315 has finished for PR 16014 at commit
Any more thoughts on this? Without this we don't really have a signed tarball in the official release to release to CRAN...
Sorry, I was caught up with some other stuff today. Will take a final look tomorrow morning.
Sorry for the delay @felixcheung - there was a bunch of other stuff I got caught up with. I think this change itself looks good. The only minor things I found were the issue with the html files and the need to change release-build.sh.
I can merge this to master and branch-2.1 unless you think we should fix those issues in this change.
@@ -71,6 +72,9 @@ while (( "$#" )); do
--pip)
MAKE_PIP=true
;;
--r)
MAKE_R=true
FWIW if you want this to get picked up by the official release building procedure, we also need to edit release-build.sh [1]. Can you coordinate this with @rxin?
[1] spark/dev/create-release/release-build.sh, line 220 in d9eb4c7:
make_binary_release "hadoop2.3" "-Phadoop-2.3 $FLAGS" "3033" &
As for release-build.sh - I had the change in there, but I've changed it to make it clearer.
Test build #69655 has finished for PR 16014 at commit
Test build #69712 has finished for PR 16014 at commit
Thanks, updated. @rxin could you please review release-build.sh?
FLAGS="-Psparkr -Phive -Phive-thriftserver -Pyarn -Pmesos"
make_binary_release "hadoop2.3" "-Phadoop-2.3 $FLAGS" "3033" &
make_binary_release "hadoop2.4" "-Phadoop-2.4 $FLAGS" "3034" &
make_binary_release "hadoop2.6" "-Phadoop-2.6 $FLAGS" "3035" &
make_binary_release "hadoop2.7" "-Phadoop-2.7 $FLAGS" "3036" "withpip" &
make_binary_release "hadoop2.4-without-hive" "-Psparkr -Phadoop-2.4 -Pyarn -Pmesos" "3037" &
make_binary_release "without-hadoop" "--r -Psparkr -Phadoop-provided -Pyarn -Pmesos" "3038" &
make_binary_release "without-hadoop" "-Psparkr -Phadoop-provided -Pyarn -Pmesos" "3038" "withr" &
Any specific reason to use the without-hadoop build for the R package? Just wondering if this will affect users in any fashion.
It was mostly to use a "separate profile" from "withpip".
Running R CMD build would run some Spark code (mainly in vignettes, since we turn off tests in R CMD check), but nothing that depends on the file system etc.
Also the Spark jar, while loaded and called into during that process, will not be packaged into the resulting R source package, so I thought it didn't matter which build profile we ran this in.
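One way to sanity-check that claim is to list the contents of the built source tarball and grep for jars. The tarball below is a toy stand-in created just for the demonstration; a real SparkR_<version>.tar.gz would be inspected the same way.

```shell
# Build a toy "source package" tarball to demonstrate the listing check.
mkdir -p SparkR/R
printf 'Package: SparkR\nVersion: 2.1.0\n' > SparkR/DESCRIPTION
tar -czf SparkR_2.1.0.tar.gz SparkR

# List the archive contents and look for any bundled jar.
if tar -tzf SparkR_2.1.0.tar.gz | grep -q '\.jar$'; then
  echo "jar found in source package"
else
  echo "no jar in source package"
fi
```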
I think it sounds fine. I was waiting to see if @rxin (or @JoshRosen?) would take a look, because I have not reviewed changes to this file before. Let me take another closer look and then we can merge it to branch-2.1 - we'll see what happens to the RC process after that.
cd ..

echo "Copying and signing R source package"
R_DIST_NAME=SparkR_$SPARK_VERSION.tar.gz
Just to clarify, this is the tgz that we will upload to CRAN, right?
LGTM. I took another look at ...
This PR has 2 key changes. One, we are building source package (aka bundle package) for SparkR which could be released on CRAN. Two, we should include in the official Spark binary distributions SparkR installed from this source package instead (which would have help/vignettes rds needed for those to work when the SparkR package is loaded in R, whereas earlier approach with devtools does not). But, because of various differences in how R performs different tasks, this PR is a fair bit more complicated. More details below. This PR also includes a few minor fixes.
These are the additional steps in make-distribution; please see [here](https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md) on what goes into a CRAN release, which is now run during make-distribution.sh:
1. The package needs to be installed because the first code block in vignettes is `library(SparkR)` without a lib path.
2. `R CMD build` will build vignettes (this process runs Spark/SparkR code and captures outputs into pdf documentation).
3. `R CMD check` on the source package will install the package and build vignettes again (this time from the source package) - this is a key step required to release an R package on CRAN (tests are skipped here, but tests will need to pass for the CRAN release process to succeed - ideally, during release signoff we should install from the R source package and run tests).
4. `R CMD INSTALL` on the source package (this is the only way to generate doc/vignettes rds files correctly, not in step 1); the output of this step is what we package into the Spark dist and sparkr.zip.
Alternatively, R CMD build should already be installing the package in a temp directory, though it might just be finding this location and setting it as the lib.loc parameter; another approach is perhaps to try calling `R CMD INSTALL --build pkg` instead. But in any case, despite installing the package multiple times, this is relatively fast. Building vignettes takes a while though.
Tested manually and via CI.
Author: Felix Cheung <[email protected]>
Closes #16014 from felixcheung/rdist.
(cherry picked from commit c3d3a9d)
Signed-off-by: Shivaram Venkataraman <[email protected]>
@felixcheung I triggered a nightly build on branch-2.1 and this doesn't work correctly in the no-hadoop build. While building the vignettes we run into an error - I'm going to change this to use another of the hadoop builds.
This PR changes the SparkR source release tarball to be built using the Hadoop 2.6 profile. Previously it was using the without-hadoop profile, which leads to an error as discussed in #16014 (comment).
Author: Shivaram Venkataraman <[email protected]>
Closes #16218 from shivaram/fix-sparkr-release-build.
(cherry picked from commit 202fcd2)
Signed-off-by: Shivaram Venkataraman <[email protected]>
Hmm, still doesn't work.
Hmm - the problem is that the directory we want to get the file from gets deleted in the previous step? I'm going to debug this locally now.
This just means the source tgz file is not created?
Copying and signing R source package
cp: cannot stat `spark-2.1.1-SNAPSHOT-bin-hadoop2.6/R/SparkR_2.1.1-SNAPSHOT.tar.gz': No such file or directory
I'm guessing maybe the path needs to be a full path?
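A small guard around the cp step would at least turn this into a clearer error and show what actually exists. This is a hypothetical sketch, not what release-build.sh does; the names mirror the failing log line above but are illustrative only.

```shell
# Names taken from the failing log line above, for illustration.
DIST_DIR="spark-2.1.1-SNAPSHOT-bin-hadoop2.6"
R_DIST_NAME="SparkR_2.1.1-SNAPSHOT.tar.gz"

if [ -f "$DIST_DIR/R/$R_DIST_NAME" ]; then
  cp "$DIST_DIR/R/$R_DIST_NAME" .
else
  # Fail loudly and show what is actually present, instead of a bare cp error.
  echo "ERROR: $R_DIST_NAME not found under $DIST_DIR/R; contents are:" >&2
  ls "$DIST_DIR/R" 2>/dev/null || echo "(directory missing)" >&2
fi
```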
No, I found the problem - it's twofold:
[1] spark/dev/make-distribution.sh, line 244 in 202fcd2
For the 2nd point, I thought release-tag.sh should update the version in the DESCRIPTION file. Was that not the case when we made preparations for 2.1.1?
I think release-tag.sh is only run while building RCs and final releases - I don't think we run it for nightly builds, so that's not getting run as a part of my test.
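For reference, a release script can stamp the version into DESCRIPTION with a one-line sed. This is a hypothetical sketch of the idea, not the exact command release-tag.sh uses, and the DESCRIPTION file here is a toy one created for the demonstration.

```shell
# Toy DESCRIPTION file to demonstrate the substitution.
printf 'Package: SparkR\nVersion: 2.1.0\n' > DESCRIPTION
NEW_VERSION="2.1.1"

# Rewrite the Version: line in place.
sed -i -e "s/^Version: .*/Version: ${NEW_VERSION}/" DESCRIPTION
grep '^Version:' DESCRIPTION
```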
Hmm, looking more closely, I don't think the problem is the directory being deleted / being in
spark/dev/create-release/release-build.sh, line 184 in 202fcd2
New PR in #16221
…emove pip tar.gz from distribution
Fixes the name of the R source package so that the `cp` in release-build.sh works correctly. Issue discussed in #16014 (comment).
Author: Shivaram Venkataraman <[email protected]>
Closes #16221 from shivaram/fix-sparkr-release-build-name.
(cherry picked from commit 4ac8b20)
Signed-off-by: Shivaram Venkataraman <[email protected]>
What changes were proposed in this pull request?
This PR has 2 key changes. One, we are building a source package (aka bundle package) for SparkR which could be released on CRAN. Two, we should include in the official Spark binary distributions SparkR installed from this source package instead (which would have help/vignettes rds needed for those to work when the SparkR package is loaded in R, whereas the earlier approach with devtools does not).
But, because of various differences in how R performs different tasks, this PR is a fair bit more complicated. More details below.
This PR also includes a few minor fixes.
more details
These are the additional steps in make-distribution; please see here on what goes into a CRAN release, which is now run during make-distribution.sh.
1. The package needs to be installed because the first code block in vignettes is library(SparkR) without a lib path.
2. R CMD build will build vignettes (this process runs Spark/SparkR code and captures outputs into pdf documentation).
3. R CMD check on the source package will install the package and build vignettes again (this time from the source package) - this is a key step required to release an R package on CRAN (tests are skipped here, but tests will need to pass for the CRAN release process to succeed - ideally, during release signoff we should install from the R source package and run tests).
4. R CMD INSTALL on the source package (this is the only way to generate doc/vignettes rds files correctly, not in step 1); the output of this step is what we package into the Spark dist and sparkr.zip.
Alternatively, R CMD build should already be installing the package in a temp directory, though it might just be finding this location and setting it as the lib.loc parameter; another approach is perhaps to try calling R CMD INSTALL --build pkg instead.
But in any case, despite installing the package multiple times, this is relatively fast. Building vignettes takes a while though.
How was this patch tested?
Manually, CI.