Implement a Docker image generator gradle plugin for Spark applications #381

Merged: 30 commits, Aug 31, 2018

Conversation

@mccheah commented Jun 9, 2018

Allows specifying the jars for the application's classpath as well as a custom base image.

```
@@ -22,6 +22,9 @@ publish_artifacts() {
echo "</server></servers></settings>" >> $tmp_settings

./build/mvn -T 1C --settings $tmp_settings -DskipTests "${PALANTIR_FLAGS[@]}" deploy
cd resource-managers/kubernetes/docker
```
mccheah (author):

We might want this to be done independently, e.g. I want to be able to retry only this part of the publish and not rerun the rest of the publish.


Do you actually need to cd here (would be less error prone if you don't need to)?


Yes, we need to cd (because the git root isn't a gradle project, whereas resource-managers/kubernetes/docker is), and I'd also love to do this independently @mccheah.


```
@@ -479,6 +495,10 @@ workflows:
requires:
- build-sbt
<<: *all-branches-and-tags
- run-spark-docker-gradle-plugin-tests:
```


you need to wire publishing as well

mccheah (author):

Should be included because we put the gradle publish call in publish_functions.sh

```sh
set -euo pipefail
ROOT=$(git rev-parse --show-toplevel)
cd "$ROOT/resource-managers/kubernetes/docker"
./gradlew --info compileJava compileTestJava
```


ideally you can just call resource-managers/kubernetes/docker/gradlew?


why not build?


Re calling resource-managers/kubernetes/docker/gradlew, you have to be in the gradle build's root project, otherwise you'd need to set it with -p.

mccheah (author):

build runs the tests too, which is redundant given that we run tests in a separate phase. But it's probably fine to just have one build step that runs both compile and test.

Regarding running without changing directories, you get this:

```
FAILURE: Build failed with an exception.

* What went wrong:
Task 'compileJava' not found in root project 'spark'.
```

Gradle infers the project from the current working directory.


@robert3005:

@mccheah looks good - mostly would like to figure out if we can remove some boilerplate gradle scripts

```groovy
 * limitations under the License.
 */

rootProject.name = 'spark-docker-image-generator_2.11'
```


don't add 2.11 - this isn't a scala project?

```java
        new TarArchiveInputStream(dockerResourcesGZipped)) {
    TarArchiveEntry nextEntry = dockerResourcesTar.getNextTarEntry();
    while (nextEntry != null) {
        boolean extracted = maybeCopyTarEntry(dockerResourcesTar,
```


can we extract once and copy?

@dansanduleac commented Jun 19, 2018:

This whole function can be described using a Sync task (or, if we need dockerBuildDir to be defined lazily, we can use an inline action project.sync(spec -> ...)).
Picking selective files from a tar is easy, though there is a bit more logic necessary if you want to rename paths. As an example to get started:

```java
// Not 100% sure if you can give it a URL but from looking into the code it should work
// Alternatively you can implement a `ReadableResource` and pass that to tarTree
FileTree tree = project.tarTree(getClass().getResource("/docker-resources.tgz"));
project.sync(spec -> {
    spec.from(tree, s2 -> {
        s2.include("docker-resources/bin", "docker-resources/sbin");
    });
    // TODO implement stripFirstDirectoryFrom - in groovy you'd do { it.relativePath.segments.toList()[1..-1].join("/") }
    spec.eachFile(fcd -> fcd.setRelativePath(stripFirstDirectoryFrom(fcd.getRelativePath())));
    spec.into(outputDir);
});
```

Then you don't need deleteAndMkDirs either; Sync automatically copies just what was included by the from blocks into the into directory, removing everything else that's not supposed to be there.


Already had to implement this somewhere else and can confirm it works. Implementation for stripFirstDirectoryFrom:

```java
RelativePath stripFirstDirectoryFrom(RelativePath relativePath) {
    String[] segments = relativePath.getSegments();
    return new RelativePath(relativePath.isFile(), Arrays.copyOfRange(segments, 1, segments.length));
}
```

mccheah (author):

We actually don't want to extract the tgz with just the top-level directory removed - we also want to drop the files in specific locations so that the docker build can work with the Dockerfile properly when adding files. Thus something like this:

```java
} else if (sourcePathSegments[1].equals("entrypoint")) {
    copyDetails.setRelativePath(
        RelativePath.parse(
            true,
            "kubernetes/dockerfiles/spark/" + withoutTopLevel));
}
```

The alternative is building the tgz differently such that it's easier to extract on the task execution side.

```java
    return String.format("FROM %s", getBaseImage());
} else if (line.equals(
        "  apk add --no-cache bash tini libc6-compat && \\")) {
    return line.replace("libc6-compat ", "");
```


why not just remove this from the image?

mccheah (author):

Good question - we'll discuss this in the community; probably we want custom image authors to provide this component themselves. For now I think it's fine to fork the Dockerfile.


Wait - this is our fork, let's just remove it from the Dockerfile here

```java
    BufferedReader sourceDockerFileBuffered =
        new BufferedReader(sourceDockerFileReader)) {
    List<String> fileLines = Collections.unmodifiableList(
        sourceDockerFileBuffered.lines()
```


Files.lines()? Just remember to close the stream
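
For reference, a minimal sketch of that suggestion, to run inside a method that may throw IOException (the path is hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// try-with-resources closes the file handle that Files.lines keeps open.
Path sourceDockerFile = Paths.get("Dockerfile");  // hypothetical path
try (Stream<String> lines = Files.lines(sourceDockerFile, StandardCharsets.UTF_8)) {
    List<String> fileLines = lines.collect(Collectors.toList());
}
```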

mccheah (author):

The source docker file here is embedded inside the tarball. You can't get a Path or a File object directly from the archive entry, at least not one which can be used to open a stream.


How about not embedding a tgz? Not sure how many files are actually there. Then you could use https://google.github.io/guava/releases/21.0/api/docs/com/google/common/io/Resources.html#readLines-java.net.URL-java.nio.charset.Charset-
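
For illustration, a sketch of that Guava call with a hypothetical resource path (inside a method that may throw IOException):

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

import com.google.common.io.Resources;

// Reads every line of a classpath resource in one call; no stream to close.
List<String> dockerfileLines = Resources.readLines(
        Resources.getResource("docker-resources/Dockerfile"),  // hypothetical path
        StandardCharsets.UTF_8);
```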

```java
            + dockerBuildDirectory.getAbsolutePath());
}

project.afterEvaluate(evaluatedProject -> {
```


This shouldn't be using afterEvaluate - dependencies should be set up via task inputs and outputs


In sync. If any of the task inputs need to be computed lazily, declare a property field in the task which you can create with project.getObjects().property<...>(), and instantiate it with a provider like task.theProperty.set(project.provider(() -> ...))


Have a look at https://docs.gradle.org/current/userguide/lazy_configuration.html which describes how to use these in more detail
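
A minimal sketch of that pattern, assuming a hypothetical task class and property name:

```java
import org.gradle.api.DefaultTask;
import org.gradle.api.provider.Property;
import org.gradle.api.tasks.Input;
import org.gradle.api.tasks.TaskAction;

public class PrintImageNameTask extends DefaultTask {
    // The property object is created eagerly, but its value can come from a lazy provider.
    private final Property<String> imageName =
            getProject().getObjects().property(String.class);

    @Input
    public Property<String> getImageName() {
        return imageName;
    }

    @TaskAction
    public void printImageName() {
        // The provider chain is only resolved here, at execution time.
        getLogger().lifecycle("Image name: " + imageName.get());
    }
}
```

The plugin would then wire it at configuration time with task.getImageName().set(project.provider(() -> extension.getImageName())), so the extension value is read only when the task actually runs.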

mccheah (author):

Some of the wiring here relies on the jar task being set up already - are we saying that the plugin user should be passing that output jar in specifically? I'm not sure how we get the output of the jar task injected into the image without using afterEvaluate to verify that the jar task exists in the first place.


The jar task comes from JavaPlugin; you can do something like:

```java
project.getPluginManager().withPlugin("java", javaPlugin -> {
    // access the jar task here, knowing it has been created
});
```
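
As a sketch of how that could wire the jar output into the image build, inside apply(Project project) (the consuming task name is hypothetical, not from this PR):

```java
import org.gradle.api.tasks.bundling.Jar;

project.getPluginManager().withPlugin("java", appliedPlugin -> {
    // Safe: this action only runs once the java plugin has been applied.
    Jar jar = (Jar) project.getTasks().getByName("jar");
    // Declaring the jar's outputs as inputs also wires the task dependency.
    project.getTasks().getByName("sparkDockerPrepare")  // hypothetical task
            .getInputs().files(jar.getOutputs().getFiles());
});
```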

```sh
set -euo pipefail
ROOT=$(git rev-parse --show-toplevel)
cd "$ROOT/resource-managers/kubernetes/docker"
./gradlew --info test
```


Why do these scripts just assume they'll be executing in kubernetes/docker? I'd rather be explicit about this in circle, something like `./build/gradlew -p resource-managers/kubernetes/docker --info test`


```groovy
task prepareResources(type: Exec) {
    workingDir "$rootDir"
    commandLine 'scripts/copy-docker-items-to-resources.sh'
```


This will always execute and can't take advantage of caching.
Looking at the script, it feels like this can be accomplished with two gradle tasks (see the sketch after this list):

1. Sync a bunch of stuff into DOCKER_BUNDLE_DIR. There are a bunch of examples in the working with files document about how to create file trees, use them as inputs, filter them, etc.
2. Tar that directory into an archive (see creating archives, but use Tar instead of Zip; you can enable compression too). There is still one gotcha here that you can only

Note: MODULE_ROOT_DIR is just (project.)rootDir etc., and I'd recommend creating outputs under the build directory as is customary/default in gradle, as opposed to target. So DOCKER_BUNDLE_DIR could also just be file("${buildDir}/docker-resources").
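
A rough sketch of those two tasks in plugin code, inside apply(Project project) (the input paths and task names are illustrative, not from this PR):

```java
import java.io.File;

import org.gradle.api.tasks.Sync;
import org.gradle.api.tasks.bundling.Compression;
import org.gradle.api.tasks.bundling.Tar;

// 1. Sync the docker items into a directory under buildDir.
Sync prepareDockerBundle = project.getTasks().create("prepareDockerBundle", Sync.class, sync -> {
    sync.from(project.fileTree("bin"), project.fileTree("sbin"));  // illustrative inputs
    sync.into(new File(project.getBuildDir(), "docker-resources"));
});

// 2. Tar that directory; using the Sync task as an input also wires the dependency.
project.getTasks().create("tarDockerBundle", Tar.class, tar -> {
    tar.from(prepareDockerBundle);
    tar.setCompression(Compression.GZIP);
    tar.setBaseName("docker-resources");  // Gradle 4.x archive naming API
});
```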


```java
@Override
public void apply(Project project) {
    project.getPluginManager().apply(getClass());
```


Don't do this - this is how to apply another plugin to a project. This plugin is already being applied by virtue of being in this method.

"-t",
extension.getImageName(),
buildDirectory.getAbsolutePath());
dockerBuild.setDependsOn(


dependsOn(...)

```java
        extension.getImageName(),
        buildDirectory.getAbsolutePath());
dockerBuild.setDependsOn(
        Arrays.asList(evaluatedProject.getTasks().getByName("sparkDockerPrepare")));
```


can give it just strings - dependsOn("sparkDockerPrepare")

"docker",
"tag",
extension.getImageName(),
String.format("%s:%s", extension.getImageName(), tag));


Sadly the Exec task doesn't readily accept lazy arguments, so I can see why this wouldn't work unless you did afterEvaluate.
The better way to do it, though, would be to create a custom class extending DefaultTask that declares Property<...> for any lazy arguments it needs, and do the work manually in a @TaskAction-annotated method (it can make use of project.exec(spec -> ...), where things like setCommandLine would be called on the spec).
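
A minimal sketch of that suggestion (the class and property names are hypothetical):

```java
import org.gradle.api.DefaultTask;
import org.gradle.api.provider.Property;
import org.gradle.api.tasks.Input;
import org.gradle.api.tasks.TaskAction;

public class SparkDockerTagTask extends DefaultTask {
    private final Property<String> imageName =
            getProject().getObjects().property(String.class);
    private final Property<String> tag =
            getProject().getObjects().property(String.class);

    @Input
    public Property<String> getImageName() { return imageName; }

    @Input
    public Property<String> getTag() { return tag; }

    @TaskAction
    public void tagImage() {
        // Properties are resolved here, at execution time, so no afterEvaluate
        // is needed to wait for the extension to be configured.
        getProject().exec(spec -> spec.commandLine(
                "docker", "tag",
                imageName.get(),
                String.format("%s:%s", imageName.get(), tag.get())));
    }
}
```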



mccheah (author) commented Jun 19, 2018:

@dansanduleac do you have any advice for how to create tasks given the setup of the extension, without using the afterEvaluate method?

I looked at https://github.com/palantir/gradle-docker/blob/develop/src/main/groovy/com/palantir/gradle/docker/PalantirDockerPlugin.groovy#L117 and in particular it generated a separate docker tag task for every tag to be generated. But it had to use afterEvaluate to do that, and I'm wondering if we can do something more idiomatic.

Taking a step back though I wouldn't be entirely opposed to the idea of only allowing for a single tag per Spark application image, so if accomplishing this would require an overly complex setup we can make this only assign one tag.

mccheah (author) commented Jun 19, 2018:

Hm, we could define a task Rule which checks whether the extension defined the given tag and constructs the task on the fly. Would that work? https://www.oreilly.com/library/view/gradle-beyond-the/9781449373801/ch03.html. I'll give this a try.
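
For reference, a sketch of what such a rule could look like, inside apply(Project project) (the extension lookup and task-name pattern are hypothetical):

```java
// `extension` is this plugin's extension object (hypothetical API below).
project.getTasks().addRule("Pattern: sparkDockerTag<Tag>", taskName -> {
    String prefix = "sparkDockerTag";
    if (taskName.startsWith(prefix)) {
        String tag = taskName.substring(prefix.length());
        // Only materialize a task if the extension actually declared this tag.
        if (extension.getTags().contains(tag)) {  // hypothetical extension API
            project.getTasks().create(taskName);  // configure the `docker tag` exec here
        }
    }
});
```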

mccheah (author) commented Jun 20, 2018:

Actually, Rule and Model based Gradle development is a pretty significant paradigm shift altogether. It might be correct for this situation though. Will do some experimentation with the APIs.

@dansanduleac:

Yep, if you want to create a different number of tasks based on runtime values, then you need afterEvaluate (and wire them up via dependencies to a task that you explicitly create outside of afterEvaluate). But perhaps you can do only that bit in afterEvaluate; that would simplify the flow a lot.
Not sure if it's worth complicating matters with the Rule API, which seems to me as powerful as scala implicits but with the added black-boxiness of doing everything via reflection.

mccheah (author) commented Jun 25, 2018:

@dansanduleac @robert3005 think I've addressed all the comments. Will add some more tests but more feedback is appreciated.

@robert3005:

@mccheah how about instead of putting the docker files as a tgz in the jar, we publish them separately? This would be akin to how gradle-baseline does things, where configuration is an input to the plugin. This would let us remove some of the weird file reading and in general be more gradle friendly.

```java
        dockerBuildDirectory.getAbsolutePath());
    }

}
```


missing newline character

```java
import java.io.File;
import java.io.IOException;

public final class DockerBuildTaskSuite {
```


prefer use of MockitoJUnitRunner, since initMocks no longer works in JUnit 5 and the migration is more obvious because the runner no longer exists there
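
A minimal sketch of the runner-based setup (assuming Mockito 2 and JUnit 4; the mocked type and test are illustrative):

```java
import org.gradle.api.Project;
import org.junit.Assert;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.mockito.Mock;
import org.mockito.junit.MockitoJUnitRunner;

@RunWith(MockitoJUnitRunner.class)
public final class DockerBuildTaskSuite {
    // The runner initializes @Mock fields; no MockitoAnnotations.initMocks call needed.
    @Mock
    private Project project;

    @Test
    public void mockIsInjected() {
        Assert.assertNotNull(project);
    }
}
```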

robert3005 left a comment:

Still, there's weirdness here - you shouldn't need afterEvaluate. You want to string everything together with providers/properties and leave getting values to Gradle.

mccheah (author) commented Aug 16, 2018:

Agreed that properties should be used in the extension. We can move the project up to the root directory; I mostly didn't want to make the root directory too different from stock Spark's (i.e. isolate the added code in the resource-managers/kubernetes/docker subdirectory). The use of afterEvaluate as well as custom exec tasks is a necessary evil - please refer to the earlier discussion in the PR.

@robert3005:

Makes sense - one more question though. Why do we need tags? They're not useful, and it's easier to just publish snapshot and release versions. Internally we stopped relying on latest and snapshot a long time ago, and I don't see a good reason why you'd want them.

mccheah (author) commented Aug 16, 2018:

The cue is mainly taken from gradle-docker: https://github.com/palantir/gradle-docker/blob/develop/src/main/groovy/com/palantir/gradle/docker/DockerExtension.groovy#L35. But this is more by convention and for consistency with that plugin. In practice the tag should more or less be the git tag, or the git hash if it's a snapshot.

- Move Gradle project to the root directory
- Use Gradle 4.9
- More properties
- Use baseline

```xml
<!-- Generated code should not be subjected to checkstyle. -->
<suppress files="[/\\].*[/\\]generated.*[/\\]" checks="." />
<suppress files="[/\\].*[/\\]docker-resources.*[/\\]" checks="." />
```
mccheah (author):

Avoid invoking checkstyle on shell scripts that are placed in the resources folder

```xml
</module>
<module name="Header">
    <!--<property name="header" value="^/\*$\n^ \* Licensed to the Apache Software Foundation (ASF) under one or more$\n"/>-->
    <property name="headerFile" value="${config_loc}/header.txt"/>
```
mccheah (author):

Make the header use only the Apache license

```xml
<module name="Indentation"> <!-- Java Style Guide: Block indentation: +2 spaces -->
    <property name="arrayInitIndent" value="4"/>
    <property name="lineWrappingIndentation" value="4"/>
    <property name="basicOffset" value="2"/>
```
mccheah (author):

All indentation is 2 spaces to be consistent with the rest of the Spark codebase

```xml
<message key="illegal.type" value="Do not use Guava caches, they are outperformed by and harder to use than Caffeine caches"/>
</module>
<!-- Use Spark's built-in import ordering -->
<module name="ImportOrder">
```
mccheah (author):

Import ordering matches the rest of Spark, although from what we can tell this hasn't been enforced consistently. https://spark.apache.org/contributing.html

```
@@ -113,3 +113,8 @@ structured-streaming/*
kafka-source-initial-offset-version-2.1.0.bin
kafka-source-initial-offset-future-version.bin
vote.tmpl
gradle-build/*
```
mccheah (author):

Avoid inspecting files built by Gradle for license checks

mccheah (author) commented Aug 18, 2018:

@robert3005 addressed comments. I've noted the differences from stock baseline in comments here.

If you feel strongly about not supporting multiple tags, then we can remove it.

```groovy
allprojects {
    group 'org.apache.spark'
    version System.env.CIRCLE_TAG ?: gitVersion()
    buildDir = 'gradle-build'
```
mccheah (author):

Because Spark's root directory already contains a folder called build, presumably for sbt, and we can't freely delete its contents with ./gradlew clean.

dansanduleac left a comment:

had some comments from before

```groovy
    testCompile gradleTestKit()
}

def getGitRepoRoot = { ->
```


this is a closure whether you add -> or not, can remove it

mccheah (author):

This function isn't in the latest version.

```groovy
    into 'kubernetes/dockerfiles/spark'
}

from(fileTree("$getGitRepoRoot/bin")) {
```


can we just use a relative path here

mccheah (author):

I believe this doesn't apply anymore, given how we've rearranged the Gradle project.

mccheah (author):

We use rootDir instead of getGitRepoRoot. If we were to use relative paths, we'd have to refer to the parent path, e.g. from(fileTree("../bin")). I'm generally against referencing the parent path this way because it requires one to know where the script is running from. Referencing the absolute path gives us a path that's consistent regardless of where the script is running from, or whether we modify the directory structure of this project later.

```sh
# limitations under the License.
#
set -euo pipefail
./gradlew --info --stacktrace test
```


we shouldn't need this - just call the gradle command, you gain nothing by wrapping it

@robert3005:

@mccheah let's remove the wrapping scripts; after that it's LGTM from me

```java
    commandLine.add(extension.getImageName());
    task.setCommandLine(commandLine);
});
project.afterEvaluate(evaluatedProject -> {
```


I think this is not necessary as long as we make configuration of the extension lazy. If you look at how the MavenPublish plugin works from 4.8 onwards, it creates tasks in an all action on the publication container. Hence, we can assume that any time you'd need an afterEvaluate is an error, since it means the user forced project evaluation at some point.

A simple thing to do here would be to make tags a NamedDomainObjectContainer; then you can lazily register things by name, where the name would be the tag, as sketched below.
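
A sketch of that idea (the ImageTag type and the task naming are hypothetical, not from this PR):

```java
import org.gradle.api.NamedDomainObjectContainer;
import org.gradle.api.Plugin;
import org.gradle.api.Project;

public class SparkDockerTagsPlugin implements Plugin<Project> {
    // Minimal named element; the container keys elements by getName().
    public static class ImageTag {
        private final String name;
        public ImageTag(String name) { this.name = name; }
        public String getName() { return name; }
    }

    @Override
    public void apply(Project project) {
        NamedDomainObjectContainer<ImageTag> tags =
                project.container(ImageTag.class, ImageTag::new);
        project.getExtensions().add("sparkDockerTags", tags);
        // `all` fires for current and future elements, so each declared tag
        // gets its own task registered lazily, without any afterEvaluate.
        tags.all(tag -> project.getTasks().create("sparkDockerTag-" + tag.getName()));
    }
}
```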


This can be done separately as long as it's done

mccheah (author) commented Aug 31, 2018:

@dansanduleac @robert3005 going to merge after latest CI passes if there are no more comments.

@robert3005:

Cool with me, just remember to squash

mccheah merged commit 036259f into master on Aug 31, 2018.
mccheah (author) commented Aug 31, 2018:

Merge is complete, thanks!

mccheah deleted the add-docker-image-publish-scripts branch on August 31, 2018 at 18:19.
mccheah (author) commented Aug 31, 2018:

Whoops it broke master - will fix
