Implement a Docker image generator gradle plugin for Spark applications #381
Conversation
dev/publish_functions.sh
Outdated
@@ -22,6 +22,9 @@ publish_artifacts() {
  echo "</server></servers></settings>" >> $tmp_settings

  ./build/mvn -T 1C --settings $tmp_settings -DskipTests "${PALANTIR_FLAGS[@]}" deploy
  cd resource-managers/kubernetes/docker
We might want this to be done independently, e.g. I want to be able to retry only this part of the publish and not rerun the rest of the publish.
Do you actually need to cd here (would be less error prone if you don't need to)?
Yes we need to cd (because the git root isn't a gradle project, whereas resource-managers/kubernetes/docker is), and also I'd love to do this independently @mcheah.
@@ -479,6 +495,10 @@ workflows:
      requires:
        - build-sbt
      <<: *all-branches-and-tags
  - run-spark-docker-gradle-plugin-tests:
you need to wire publishing as well
Should be included because we put the gradle publish call in publish_functions.sh
set -euo pipefail
ROOT=$(git rev-parse --show-toplevel)
cd "$ROOT/resource-managers/kubernetes/docker"
./gradlew --info compileJava compileTestJava
ideally you can just call resource-managers/kubernetes/docker/gradlew?
why not build?
Re calling resource-managers/kubernetes/docker/gradlew, you have to be in the gradle build's root project, otherwise you'd need to set it with -p.
build runs the tests too, which is redundant given that we run tests in a separate phase. But it's probably fine to just have one build step that runs both compile and test.
Regarding running without changing directories, you get this:
FAILURE: Build failed with an exception.
* What went wrong:
Task 'compileJava' not found in root project 'spark'.
Gradle infers the project from the current working directory.
@mccheah looks good - mostly would like to figure out if we can remove some boilerplate gradle scripts
 * limitations under the License.
 */

rootProject.name = 'spark-docker-image-generator_2.11'
don't add 2.11 - this isn't a scala project?
        new TarArchiveInputStream(dockerResourcesGZipped)) {
    TarArchiveEntry nextEntry = dockerResourcesTar.getNextTarEntry();
    while (nextEntry != null) {
        boolean extracted = maybeCopyTarEntry(dockerResourcesTar,
can we extract once and copy?
This whole function can be described using a Sync task (or, if we need dockerBuildDir to be defined lazily, then can use an inline action project.sync(spec -> ...)).
Picking selective files from a tar is easy, though there is a bit more logic necessary if you wanna rename paths... as an example to get started:
// Not 100% sure if you can give it a URL but from looking into the code it should work
// Alternatively you can implement a `ReadableResource` and pass that to tarTree
FileTree tree = project.tarTree(getClass().getResource("/docker-resources.tgz"));
project.sync(spec -> {
spec.from(tree, s2 -> {
s2.include("docker-resources/bin", "docker-resources/sbin");
});
// TODO implement stripFirstDirectoryFrom - in groovy you'd do { it.relativePath.segments.toList()[1..-1].join("/") }
spec.eachFile(fcd -> fcd.setRelativePath(stripFirstDirectoryFrom(fcd.getRelativePath())));
spec.into(outputDir);
});
Then you don't need deleteAndMkDirs either; Sync automatically copies just what was included by the from blocks into the into directory, removing everything else that's not supposed to be there.
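The Sync semantics described above can be illustrated with a plain-Java sketch (a hypothetical helper, not the plugin's or Gradle's actual code): copy the included files into the destination, then prune anything that wasn't part of the selection.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

// Hypothetical analogue of Gradle's Sync task: copy the selected files into
// the destination, then delete anything in the destination that was not
// part of the selection.
public final class SyncSketch {
    public static void sync(Path sourceDir, Path destDir, Set<String> includes) throws IOException {
        Files.createDirectories(destDir);
        // Copy every included file, overwriting stale copies.
        for (String relative : includes) {
            Path target = destDir.resolve(relative);
            Files.createDirectories(target.getParent());
            Files.copy(sourceDir.resolve(relative), target, StandardCopyOption.REPLACE_EXISTING);
        }
        // Remove files that are not supposed to be there.
        try (Stream<Path> walk = Files.walk(destDir)) {
            List<Path> stale = walk
                    .filter(Files::isRegularFile)
                    .filter(p -> !includes.contains(destDir.relativize(p).toString()))
                    .collect(Collectors.toList());
            for (Path p : stale) {
                Files.delete(p);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempDirectory("src");
        Path dest = Files.createTempDirectory("dest");
        Files.write(src.resolve("keep.txt"), "keep".getBytes());
        Files.write(dest.resolve("stale.txt"), "stale".getBytes());
        sync(src, dest, new HashSet<>(Collections.singleton("keep.txt")));
        System.out.println(Files.exists(dest.resolve("keep.txt")));  // true
        System.out.println(Files.exists(dest.resolve("stale.txt"))); // false
    }
}
```

The real Sync task also tracks inputs/outputs for up-to-date checks, which this sketch omits.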
Already had to implement this somewhere else and can confirm it works. Implementation for stripFirstDirectoryFrom:
RelativePath stripFirstDirectoryFrom(RelativePath relativePath) {
String[] segments = relativePath.getSegments();
return new RelativePath(relativePath.isFile(), Arrays.copyOfRange(segments, 1, segments.length));
}
We actually don't want to extract the tgz with just the top level directory removed - we want to also be dropping the files in specific locations so that the docker build can work with the Dockerfile properly when adding files. Thus something like this:
else if (sourcePathSegments[1].equals("entrypoint")) {
copyDetails.setRelativePath(
RelativePath.parse(
true,
"kubernetes/dockerfiles/spark/" + withoutTopLevel));
}
The alternative is building the tgz differently such that it's easier to extract on the task execution side.
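The remapping being described (drop the top-level directory, then relocate certain subtrees to where the Dockerfile expects them) can be sketched as a pure function. The helper name and the target prefix here mirror the quoted fragment but are illustrative, not the plugin's actual API.

```java
import java.util.Arrays;

// Hypothetical sketch of the tar-entry path remapping discussed above:
// strip the top-level "docker-resources" directory, and relocate the
// "entrypoint" subtree to the path the Dockerfile's ADD statements expect.
public final class TarPathRemapSketch {
    static String remapTarEntryPath(String entryPath) {
        String[] segments = entryPath.split("/");
        // Drop the top-level directory from the path.
        String withoutTopLevel = String.join("/",
                Arrays.copyOfRange(segments, 1, segments.length));
        if (segments.length > 1 && segments[1].equals("entrypoint")) {
            return "kubernetes/dockerfiles/spark/" + withoutTopLevel;
        }
        return withoutTopLevel;
    }

    public static void main(String[] args) {
        System.out.println(remapTarEntryPath("docker-resources/entrypoint/entrypoint.sh"));
        // kubernetes/dockerfiles/spark/entrypoint/entrypoint.sh
    }
}
```

In the real task this function would feed spec.eachFile(...) via a RelativePath, as in the earlier Sync example.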
    return String.format("FROM %s", getBaseImage());
} else if (line.equals(
        "    apk add --no-cache bash tini libc6-compat && \\")) {
    return line.replace("libc6-compat ", "");
why not just remove this from image?
Good question, we'll discuss this in the community - probably we want custom image authors to be providing this component themselves. For now I think it's fine to fork the Dockerfile.
Wait - this is our fork, let's just remove it from dockerfile here
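The line rewriting in the diff above amounts to a pure function over Dockerfile lines. A minimal sketch, assuming a configurable base image (the example FROM line and the parameter are illustrative, not the plugin's API):

```java
// Hedged sketch of the Dockerfile rewriting discussed above: swap the FROM
// line for the user's base image and drop the libc6-compat package from the
// apk install line.
public final class DockerfileLineRewriteSketch {
    static String rewriteLine(String line, String baseImage) {
        if (line.startsWith("FROM ")) {
            // Replace the base image with the one the plugin user configured.
            return String.format("FROM %s", baseImage);
        } else if (line.contains("libc6-compat ")) {
            // Remove libc6-compat; custom image authors would provide it themselves.
            return line.replace("libc6-compat ", "");
        }
        return line;
    }

    public static void main(String[] args) {
        System.out.println(rewriteLine("FROM openjdk:8-alpine", "my-registry/my-base:1.0"));
        // FROM my-registry/my-base:1.0
    }
}
```

As the thread concludes, editing the forked Dockerfile directly makes most of this rewriting unnecessary.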
        BufferedReader sourceDockerFileBuffered =
                new BufferedReader(sourceDockerFileReader)) {
    List<String> fileLines = Collections.unmodifiableList(
            sourceDockerFileBuffered.lines()
Files.lines()? Just remember to close the stream
The source docker file here is embedded inside the tarball. You can't get a Path or a File object directly from the archive entry, at least not one which can be used to open a stream.
How about not embedding a tgz? Not sure how many files are actually there. Then you could use https://google.github.io/guava/releases/21.0/api/docs/com/google/common/io/Resources.html#readLines-java.net.URL-java.nio.charset.Charset-
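If the Dockerfile were available as a plain file on disk (which the suggestion above would enable by not embedding a tgz), the Files.lines approach would look like this. The try-with-resources is the "remember to close the stream" part: the returned Stream holds an open file handle until closed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch of reading a Dockerfile's lines with Files.lines, closing the
// underlying stream deterministically via try-with-resources.
public final class ReadLinesSketch {
    static List<String> readAllLines(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.collect(Collectors.toList());
        }
    }
}
```

Guava's Resources.readLines linked above achieves the same for classpath URLs without manual stream management.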
            + dockerBuildDirectory.getAbsolutePath());
}

project.afterEvaluate(evaluatedProject -> {
This shouldn't be using afterEvaluate - dependencies should be setup via task inputs and outputs
In sync. If any of the task inputs need to be computed lazily, declare a property field in the task which you can create with project.getObjects().property<...>(), and instantiate it with a provider like task.theProperty.set(project.provider(() -> ...))
Have a look at https://docs.gradle.org/current/userguide/lazy_configuration.html which describes how to use these in more detail
Some of the wiring here relies on the jar task being set up already - are we saying that the plugin user should be passing that output jar in specifically? I'm not sure how we get the output of the jar task injected into the image without using afterEvaluate to verify that the jar task exists in the first place.
The jar task comes from JavaPlugin, you can do something like:
project.getPluginManager().withPlugin("java", javaPlugin -> {
// access the jar task here, knowing it has been created
});
set -euo pipefail
ROOT=$(git rev-parse --show-toplevel)
cd "$ROOT/resource-managers/kubernetes/docker"
./gradlew --info test
Why do these scripts just assume they'll be executing in kubernetes/docker? I'd rather be explicit about this in circle, something like ./build/gradlew -p resource-managers/kubernetes/docker --info test
task prepareResources(type: Exec) {
    workingDir "$rootDir"
    commandLine 'scripts/copy-docker-items-to-resources.sh'
This will always execute and can't take advantage of caching.
Looking at the script, it feels like this can be accomplished with two gradle tasks:
1. Sync a bunch of stuff into DOCKER_BUNDLE_DIR. There are a bunch of examples in this working with files document about how to create file trees, use them as inputs, filter them etc.
2. Tar (see creating archives, but use Tar instead of Zip & can enable compression too) that directory into an archive. There is still one gotcha here that you can only
Note: MODULE_ROOT_DIR is just (project.)rootDir etc and I'd recommend creating outputs under the build directory as is customary/default in gradle, as opposed to target.
So DOCKER_BUNDLE_DIR could also just be file("${buildDir}/docker-resources")
@Override
public void apply(Project project) {
    project.getPluginManager().apply(getClass());
Don't do this - this is how to apply another plugin to a project. This plugin is already being applied by virtue of being in this method.
"-t", | ||
extension.getImageName(), | ||
buildDirectory.getAbsolutePath()); | ||
dockerBuild.setDependsOn( |
dependsOn(...)
        extension.getImageName(),
        buildDirectory.getAbsolutePath());
dockerBuild.setDependsOn(
        Arrays.asList(evaluatedProject.getTasks().getByName("sparkDockerPrepare")));
can give it just strings - dependsOn("sparkDockerPrepare")
"docker", | ||
"tag", | ||
extension.getImageName(), | ||
String.format("%s:%s", extension.getImageName(), tag)); |
Sadly the Exec task doesn't readily accept lazy arguments, so I can see why this wouldn't work unless you did afterEvaluate.
The better way to do it though would be to create a custom class extending DefaultTask that declares Property<...> for any lazy arguments it needs, and do the work manually in a @TaskAction-annotated method (can make use of project.exec(spec -> ...), where things like setCommandLine would be called on the spec).
@dansanduleac do you have any advice for how to create tasks given the setup of the extension, without using the afterEvaluate method? I looked at https://github.com/palantir/gradle-docker/blob/develop/src/main/groovy/com/palantir/gradle/docker/PalantirDockerPlugin.groovy#L117 and in particular it generated a separate docker tag task for every tag to be generated, but it had to use afterEvaluate to do so. Taking a step back though I wouldn't be entirely opposed to the idea of only allowing for a single tag per Spark application image, so if accomplishing this would require an overly complex setup we can make this only assign one tag.
Hm, we could define a task
Actually
Yep if you want to create a different number of tasks based on runtime values, then you need afterEvaluate (and wire them up via dependencies to a task that you explicitly create outside of afterEvaluate). But perhaps you can do only that bit in afterEvaluate, that would simplify the flow a lot.
@dansanduleac @robert3005 think I've addressed all the comments. Will add some more tests but more feedback is appreciated.
@mccheah how about instead of putting docker files as a tgz in the jar we publish them separately? This would be akin to how gradle-baseline does things where configuration is an input to the plugin. This would let us remove some of the weird file reading and in general be more gradle friendly.
            dockerBuildDirectory.getAbsolutePath());
    }
}
missing newline character
import java.io.File;
import java.io.IOException;

public final class DockerBuildTaskSuite {
prefer use of MockitoJUnitRunner since initMocks no longer works in junit5 and it's more obvious to migrate since the runner no longer exists there
Still there's weirdness here - you shouldn't need after evaluate. You want to string everything together with providers/properties and leave getting values to gradle
Agreed that properties should be used in the extension. We can move the project up to the root directory, I mostly didn't want to make the root directory too different from stock Spark's (i.e. isolate the added code to be in the resource-managers/kubernetes/docker subdirectory). The use of
Makes sense - one more question though. Why do we need tags? They're not useful and it's easier to just publish snapshot and release versions. Internally we stopped relying on latest and snapshot a long time ago and I don't see a good reason why you'd want them.
The cue is mainly taken from gradle-docker: https://github.com/palantir/gradle-docker/blob/develop/src/main/groovy/com/palantir/gradle/docker/DockerExtension.groovy#L35. But this is more by convention and consistency with that plugin. In practice the tag should more or less be the git tag, or the git hash if it's a snapshot.
- Move Gradle project to the root directory
- Use Gradle 4.9
- More properties
- Use baseline
<!-- Generated code should not be subjected to checkstyle. -->
<suppress files="[/\\].*[/\\]generated.*[/\\]" checks="." />
<suppress files="[/\\].*[/\\]docker-resources.*[/\\]" checks="." />
Avoid invoking checkstyle on shell scripts that are placed in the resources folder
</module>
<module name="Header">
    <!--<property name="header" value="^/\*$\n^ \* Licensed to the Apache Software Foundation (ASF) under one or more$\n"/>-->
    <property name="headerFile" value="${config_loc}/header.txt"/>
Make the header use only the Apache license
<module name="Indentation"> <!-- Java Style Guide: Block indentation: +2 spaces -->
    <property name="arrayInitIndent" value="4"/>
    <property name="lineWrappingIndentation" value="4"/>
    <property name="basicOffset" value="2"/>
All indentation is 2 spaces to be consistent with the rest of the Spark codebase
    <message key="illegal.type" value="Do not use Guava caches, they are outperformed by and harder to use than Caffeine caches"/>
</module>
<!-- Use Spark's built-in import ordering -->
<module name="ImportOrder">
Import ordering matches the rest of Spark, although from what we can tell this hasn't been enforced consistently. https://spark.apache.org/contributing.html
@@ -113,3 +113,8 @@ structured-streaming/*
kafka-source-initial-offset-version-2.1.0.bin
kafka-source-initial-offset-future-version.bin
vote.tmpl
gradle-build/*
Avoid inspecting files built by Gradle for license checks
@robert3005 addressed comments. I've noted the differences to stock baseline in comments here. If you feel strongly about not supporting multiple tags then we can remove it.
allprojects {
    group 'org.apache.spark'
    version System.env.CIRCLE_TAG ?: gitVersion()
    buildDir = 'gradle-build'
Because Spark's root directory already contains a folder called build, presumably for sbt, and we can't delete its contents freely with ./gradlew clean.
had some comments from before
    testCompile gradleTestKit()
}

def getGitRepoRoot = { ->
this is a closure whether you add -> or not, can remove it
This function isn't in the latest version.
    into 'kubernetes/dockerfiles/spark'
}

from(fileTree("$getGitRepoRoot/bin")) {
can we just use a relative path here
I believe this doesn't apply anymore, given how we've rearranged the Gradle project.
We use rootDir instead of getGitRepoRoot. If we were to use relative paths, we'd have to refer to the parent path, e.g. from(fileTree("../bin")). I'm generally against referencing the parent path this way because it requires one to know where the script is running from. Referencing the absolute path gives us a path that's consistent regardless of where the script is running from or if we modify the directory structure of this project later.
# limitations under the License.
#
set -euo pipefail
./gradlew --info --stacktrace test
we shouldn't need this - just call the gradle command, you gain nothing by wrapping it
@mccheah let's remove the wrapping scripts; after that it's LGTM from me.
    commandLine.add(extension.getImageName());
    task.setCommandLine(commandLine);
});
project.afterEvaluate(evaluatedProject -> {
I think this is not necessary as long as we make configuration of the extension lazy. If you look at how the MavenPublish plugin works from 4.8 onwards, it creates tasks in the all action on the publication container. Hence, we can assume that any time you'd need an afterEvaluate is an error, since it means the user forced project evaluation at some point.
Simple thing to do here would be to make tags a NamedDomainObjectContainer and then you can lazily register things by name, which would be the tag.
This can be done separately as long as it's done
@dansanduleac @robert3005 going to merge after latest CI passes if there are no more comments.
Cool with me, just remember to squash.
Merge is complete, thanks!
Whoops it broke master - will fix
Allows specifying the jars for the application's classpath as well as a custom base image.