
use per WORKSPACE / host tools keyed bazel cache(s) #6767

Merged
merged 8 commits into kubernetes:master on Feb 13, 2018

Conversation

BenTheElder
Member

@BenTheElder BenTheElder commented Feb 9, 2018

We can solve a lot of problems by just using one cache per (WORKSPACE [org/repo], image tool versions [gcc, python, etc.]) pair.
We do this by setting:

CACHE_URL="http://${CACHE_HOST}:${CACHE_PORT}/${CACHE_ID}"
echo "build --remote_http_cache=${CACHE_URL}"

Where CACHE_ID currently comes from CACHE_ID="${WORKSPACE_NAME},$(hash_toolchains)"

  • Simple per-repo cache
  • No incorrect cache sharing (in theory, since we key each setup to its own storage)
  • If incorrect cache sharing somehow happens by accident, we just change the cache keying logic so a new cache is used; no mucking with the cache node required
  • If we add a new host toolchain, we can add it to the cache keying trivially
  • Doesn't depend on any special bazel features
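
For illustration, here's a minimal sketch of the keying idea, re-rendered in Go (the actual implementation is bash in the job image's setup scripts; the exact tool list and hash choice here are assumptions):

package main

import (
	"crypto/md5"
	"fmt"
	"os/exec"
)

// hashToolchains hashes the version strings of host tools that can affect
// build outputs; exactly which tools to include is an assumption here.
func hashToolchains() string {
	h := md5.New()
	for _, tool := range []string{"gcc", "g++", "python"} {
		out, err := exec.Command(tool, "--version").CombinedOutput()
		if err != nil {
			continue // tool not installed; skip it
		}
		h.Write(out)
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	workspaceName := "kubernetes/test-infra" // normally from ${REPO_NAME}
	cacheID := fmt.Sprintf("%s,%s", workspaceName, hashToolchains())
	fmt.Printf("build --remote_http_cache=http://localhost:8080/%s\n", cacheID)
}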

Right now I've implemented the keying and tested it against nginx with webdav enabled locally.
TODO:

  • update nursery deployment to use something else that supports storing multiple directories
    • evaluate caches:
      • Hazelcast is more complex than what we want, nginx with WebDAV looks a bit simplistic, etc.; nothing seems to support many repos while being operationally minimal (i.e. not Hazelcast: we don't need a huge, complex distributed cache, JVM tuning, etc.)
    • implement our own server that meets our needs (mainly multiple/arbitrary individual caches)

Follow-ups:

  • start using experimentally for test-infra
  • Track cache usage / add metrics
  • Implement eviction for stale / unused entries
  • integrate bazelrcs into image(s)

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 9, 2018
@BenTheElder
Member Author

/cc @ixdy @krzyzacy

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 9, 2018
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 9, 2018
@ixdy ixdy changed the title [WIP] use per WORKSPACe / host tools keyed bazel cache(s) [WIP] use per WORKSPACE / host tools keyed bazel cache(s) Feb 9, 2018
echo "build --action_env=CACHE_PYTHON_VERSION=${PYTHON_VERSION}"
# point it at our http cache ...
# NOTE our cache is versioned by the first path segment
# TODO(bentheelder): update the nursery deployment to something that supports this
Member

what does this mean?

Member Author

the cache server deployment "nursery" is currently actually https://github.com/buchgr/bazel-remote, which only supports a single cache directory

Member Author

this is done now

# point it at our http cache ...
# NOTE our cache is versioned by the first path segment
# TODO(bentheelder): update the nursery deployment to something that supports this
WORKSPACE_NAME="${REPO_NAME:-$(basename "$PWD")}"
Member

is REPO_NAME just test-infra, or is it kubernetes/test-infra?

Member Author
@BenTheElder BenTheElder Feb 9, 2018

this is just test-infra: https://github.com/kubernetes/test-infra/tree/master/prow#job-environment-variables

It's probably better to use owner as well, though implementing the bash fallback for CI jobs may be a bit messy.

Member Author

I fixed this and cleaned up the bash a bit

@ixdy
Member

ixdy commented Feb 9, 2018

another thought: most of our images include an IMAGE environment variable which is updated each time we push a new image. could we use that for hashing instead?

it'd result in slightly less caching, since we probably don't change most of the underlying dependencies that often. OTOH it should be pretty safe.

@BenTheElder
Member Author

another thought: most of our images include an IMAGE environment variable which is updated each time we push a new image. could we use that for hashing instead?

We can, but I'm not really concerned; we can track GCC versions correctly (this part is easy), and the images are pushed fairly frequently for other changes (kubetest) anyway.

Either way we need to key off the repo/WORKSPACE as well.

@BenTheElder
Member Author

Since we're based on Debian Jessie now, host tools like python and gcc are unlikely to change until we switch the base image to something else someday (stretch?).

@BenTheElder
Member Author

Also, if we do run into an issue with the toolchain hashing, we can switch out the cache identifier generation logic to use $IMAGE instead, as long as the backend supports arbitrary URL/filesystem-safe names for separate caches. I'd like to maximize the actual caching if we can, though. 🤷‍♂️

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 10, 2018
@BenTheElder BenTheElder force-pushed the cache-of-caches branch 4 times, most recently from 6a2e1d9 to 159ab63 Compare February 10, 2018 01:29
@BenTheElder
Member Author

the server should work now; I need to clean up the logging and test it some more, then package it up for deployment.

@BenTheElder
Member Author

BenTheElder commented Feb 10, 2018

To test the server locally, add this to .bazelrc in test-infra:

startup --host_jvm_args=-Dbazel.DigestFunction=sha256
build --spawn_strategy=remote --genrule_strategy=remote
build --strategy=Javac=remote --strategy=Closure=remote
build --remote_local_fallback
build --remote_http_cache=http://localhost:8080/k8s.io/test-infra

build and run:

bazel build //experiment/nursery
bazel-bin/experiment/nursery/darwin_amd64_stripped/nursery --dir=$HOME/bazel-remote-cache

build something in another shell:

bazel build //...

@BenTheElder BenTheElder force-pushed the cache-of-caches branch 2 times, most recently from 5733ed5 to aa6d81c Compare February 10, 2018 02:10
@BenTheElder
Member Author

/test pull-test-infra-bazel-canary
/test pull-test-infra-bazel
[using new nursery deployment]

@BenTheElder
Member Author

/area bazel

}

// Get provides your readHandler with the contents at key
func (c *Cache) Get(key string, readHandler ReadHandler) error {
Member

Instead of using a ReadHandler you could just return the ReadSeeker and an error. That would be more idiomatic.

Member Author

Yes, but this allows me to track who currently holds a cache key which may be very useful in the future.
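
For context, a minimal Go sketch of the callback-shaped API being discussed (the type names mirror the snippet above, but the body is an assumption, not the PR's actual code):

package diskcache

import (
	"io"
	"os"
	"path/filepath"
)

// ReadHandler is called back with the contents stored at a key (if any).
type ReadHandler func(exists bool, contents io.ReadSeeker) error

// Cache is a minimal disk-backed cache rooted at dir.
type Cache struct {
	dir string
}

// Get invokes readHandler with the contents at key. Compared to returning
// an io.ReadSeeker directly, the callback keeps the file's lifetime inside
// the cache, so it can track which keys are currently being read.
func (c *Cache) Get(key string, readHandler ReadHandler) error {
	f, err := os.Open(filepath.Join(c.dir, key))
	if err != nil {
		if os.IsNotExist(err) {
			return readHandler(false, nil)
		}
		return err
	}
	defer f.Close()
	// a real implementation could record "key is in use" here before calling back
	return readHandler(true, f)
}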

}

func main() {
	// TODO(bentheelder): bound cache size / convert to LRU
Member

This should be completed before we use this anywhere so that the cache doesn't grow uncontrollably.

Member Author

It only takes up about 4 GB to host test-infra; the disk is 375 GB. We won't use this in prod yet, just experimentally. I'm going to collect some more data from experimental usage before deciding on an eviction strategy.

Filling the disk doesn't cause failures either; you just won't be able to insert new things into the cache.

Member Author
@BenTheElder BenTheElder Feb 12, 2018

Also worth noting that this runs on a dedicated node, so it doesn't affect the rest of CI. A previous iteration of this already runs and is used by the pull-.*bazel.*-canary jobs.
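
Since eviction keeps coming up, here is a rough sketch of one possible strategy (delete the least-recently-modified files until the cache fits under a size target; this is purely an assumption — the PR deliberately defers the eviction decision):

package diskcache

import (
	"os"
	"path/filepath"
	"sort"
)

// EvictUnderLimit walks the cache directory and removes the oldest files
// until the total size is at most maxBytes. A naive LRU-by-mtime sketch.
func EvictUnderLimit(dir string, maxBytes int64) error {
	type entry struct {
		path string
		info os.FileInfo
	}
	var files []entry
	var total int64
	err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		files = append(files, entry{path, info})
		total += info.Size()
		return nil
	})
	if err != nil {
		return err
	}
	// oldest first
	sort.Slice(files, func(i, j int) bool {
		return files[i].info.ModTime().Before(files[j].info.ModTime())
	})
	for _, f := range files {
		if total <= maxBytes {
			break
		}
		if err := os.Remove(f.path); err != nil {
			return err
		}
		total -= f.info.Size()
	}
	return nil
}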

	}
	// unknown error
	log.WithError(err).Error("error getting key")
	http.Error(w, err.Error(), http.StatusNotFound)
Member

This should be a 5xx status code instead of a 404.

Member Author

sure, I didn't pay too much attention to the error codes since the only client should be bazel, which should only be concerned with 200 or not

Member Author

done

	// handle unsupported methods...
	default:
		log.Warnf("received an invalid request method: %v", r.Method)
		http.Error(w, "unsupported method", http.StatusBadRequest)
Member

405 is more specific than 400.

Member Author

done

	case http.MethodPut:
		// only hash CAS, not action cache
		// the action cache is hash -> metadata
		// the CAS is, well, a CAS, which we can hash...
Member

I don't understand what the action cache (ac) variant is supposed to do. Where is the hash -> metadata mapping mentioned in the comment and how is it used?

Member Author

See the protocol linked at the top of the file and the comment below: https://docs.bazel.build/versions/master/remote-caching.html#http-caching-protocol

Member Author

Bazel does PUT $HTTP_CACHE_URL/ac/$HASH (the PUT body is action metadata) and GETs the same way.
It also does PUT $HTTP_CACHE_URL/cas/$HASH (the PUT body is the content-addressed object) and GETs the same way.
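
As a rough client-side illustration of that protocol (the server URL, payloads, and keys below are placeholders):

package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"net/http"
)

func put(url string, body []byte) error {
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	fmt.Println("PUT", url, "->", resp.Status)
	return nil
}

func main() {
	base := "http://localhost:8080/k8s.io/test-infra" // cache URL, keyed by repo

	// CAS entry: the key is the sha256 of the content itself
	blob := []byte("compiled output bytes")
	casKey := fmt.Sprintf("%x", sha256.Sum256(blob))
	put(base+"/cas/"+casKey, blob)

	// AC entry: the key is a hash of the serialized action; the body is
	// execution metadata, so the server can't verify it by hashing the body
	meta := []byte("action result metadata (a protobuf in practice)")
	acKey := fmt.Sprintf("%x", sha256.Sum256([]byte("serialized action")))
	put(base+"/ac/"+acKey, meta)

	// retrieval is a plain GET on the same URLs
	resp, err := http.Get(base + "/cas/" + casKey)
	if err == nil {
		fmt.Println("GET ->", resp.Status)
		resp.Body.Close()
	}
}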

		if acOrCAS != "cas" {
			hash = ""
		}
		err := cache.Put(r.URL.Path, r.Body, hash)
Member

I'm a little confused about the file retrieval process.
If we do a PUT to /org/repo/cas/15e2b0d and create a file cache-dir/org/repo/cas/15e2b0d, we won't be able to GET the file without knowing its content digest (15e2b0d), which we would never know unless we already had the file... Is this where action caches come into play?

Member Author

Yes. Caching works like this:
Bazel computes the entire build as a graph of Actions; each Action has inputs, etc., and is hashed to an action-cache key. That key is then looked up in the cache for metadata stored from actually executing the action. If there is a cache hit, the output objects are looked up in the CAS using this metadata.

The CAS is just a content-addressed store of the files used / produced by actions.

Member Author
@BenTheElder BenTheElder Feb 12, 2018

We can't verify action cache entries by hashing their contents, because the key is a hash of the action (which is cheap to compute each time), while the value is metadata from actually executing it. The caches are something like:

Action Cache (../ac/..): Hash(Action.serialize()) -> Action.execute().metadata()

Action Metadata -> CAS Keys

CAS (.../cas...): Hash(OutputFile.getBytes()) -> OutputFile.getBytes()
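
That's also why the handler snippet above blanks the hash for non-CAS paths. A sketch of a Put that only verifies content digests for CAS entries (an assumed shape reusing the Cache type from the Get sketch earlier, not the PR's exact code):

package diskcache

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// Put stores contents at key. expectedHash is non-empty only for CAS
// entries, whose key is the sha256 of the content; action cache entries
// pass "" because their key is a hash of the action, not of the bytes.
func (c *Cache) Put(key string, contents io.Reader, expectedHash string) error {
	path := filepath.Join(c.dir, key)
	if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
		return err
	}
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	hasher := sha256.New()
	// write to disk and hash in one pass
	if _, err := io.Copy(io.MultiWriter(f, hasher), contents); err != nil {
		os.Remove(path)
		return err
	}
	if expectedHash != "" {
		if actual := fmt.Sprintf("%x", hasher.Sum(nil)); actual != expectedHash {
			os.Remove(path) // reject corrupt uploads
			return fmt.Errorf("hash mismatch: got %s, expected %s", actual, expectedHash)
		}
	}
	return nil
}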


func cacheHandler(cache *diskcache.Cache) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// parse and validate path
Member

It might be helpful to create a logrus.Entry here that has a path and a method field and use that to log warnings and errors throughout this function.

Member Author

SGTM

Member Author
@BenTheElder BenTheElder Feb 12, 2018

done, also set "component"
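
For reference, a minimal sketch of that logrus pattern (the field names follow the comments above; the handler body itself is illustrative):

package main

import (
	"net/http"

	log "github.com/sirupsen/logrus"
)

func handler(w http.ResponseWriter, r *http.Request) {
	// one Entry carries shared fields for every log line in this request
	logger := log.WithFields(log.Fields{
		"component": "nursery",
		"method":    r.Method,
		"path":      r.URL.Path,
	})
	logger.Info("handling request")
	if r.URL.Path == "/" {
		logger.Warn("no cache key in path") // fields attach automatically
	}
}

func main() {
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}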

@cjwagner
Member

Why aren't we just using GCS as the cache (https://docs.bazel.build/versions/master/remote-caching.html#google-cloud-storage)?

@BenTheElder
Member Author

@cjwagner because this way the bandwidth is all in-cluster, which should be faster and cheaper; we also don't want to give jobs unrestricted access to stuff data into GCS. This way the storage is private, because it can only be accessed from within the cluster, and our costs are bounded by the cost of the cache node.

@BenTheElder
Member Author

Serving files from a local SSD in-cluster is very fast; pull-test-infra-bazel-canary jobs are now bound to the cost of the pip install and pylint, and the time to bazel build / test is otherwise negligible using the cache node.

@krzyzacy
Member

I'm fine with experimenting with this on a few canary jobs to start with, as the performance improvement already looks promising. Punt to @cjwagner if he has more comments.

Member

@cjwagner cjwagner left a comment

Thanks for explaining everything. Maybe add a short comment explaining why we have our own bazel cache implementation instead of using an existing one.
/lgtm
/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Feb 13, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder, cjwagner

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these OWNERS Files:
  • OWNERS [BenTheElder,cjwagner]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -26,15 +27,38 @@ package_to_version () {

 # look up a binary with which and return the debian package it belongs to
 command_to_package () {
-  BINARY_PATH=$(readlink -f $(which $1))
+  local BINARY_PATH=$(readlink -f $(which $1))
Member

Member Author

done, and elsewhere

@krzyzacy
Member

/lint
/joke

@k8s-ci-robot
Contributor

@krzyzacy: What do you call a cow with two legs? Lean beef.

In response to this:

/lint
/joke


Contributor

@k8s-ci-robot k8s-ci-robot left a comment

@krzyzacy: 1 warning.

In response to this:

/lint
/joke


limitations under the License.
*/

// cache implements disk backed cache storage for use in nursery
Contributor

Golint comments: package comment should be of the form "Package diskcache ...". More info.

Member Author

fixed

@BenTheElder
Member Author

Thanks for explaining everything. Maybe add a short comment explaining why we have our own bazel cache implementation instead of using an existing one.

👍

If everything goes well with the next round of canary experiments I plan to:

  • add an eviction strategy
  • rename to greenhouse, add a more detailed README, and graduate it from "experiment/"
  • integrate creating the bazelrcs into our images (behind some feature gate)
  • instrument / add metrics, create a velodrome dashboard, alert on low disk

@BenTheElder
Member Author

/hold cancel

Will continue to improve this in other PRs; this is already XL. Thanks for all the review @cjwagner @krzyzacy @ixdy 😄

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 13, 2018
@k8s-ci-robot k8s-ci-robot merged commit 4d96d02 into kubernetes:master Feb 13, 2018
@BenTheElder
Member Author

BenTheElder commented Feb 13, 2018

Small follow-up:
8m11s: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/test-infra/5137/pull-test-infra-bazel-canary/115/
After the cache is warm:
3m6s: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/test-infra/5137/pull-test-infra-bazel-canary/117/

bentheelder@gke-prow-bazel-cache-70c4c5ed-61kg ~ $ df -h /mnt/disks/ssd0
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        369G  2.1G  348G   1% /mnt/disks/ssd0
bentheelder@gke-prow-bazel-cache-70c4c5ed-61kg ~ $ sudo du -h /mnt/disks/ssd0/cache
2.1G    /mnt/disks/ssd0/cache/kubernetes/test-infra,b68e017af9a3ae39e87829bd5fe5bfaa/cas
7.4M    /mnt/disks/ssd0/cache/kubernetes/test-infra,b68e017af9a3ae39e87829bd5fe5bfaa/ac
2.1G    /mnt/disks/ssd0/cache/kubernetes/test-infra,b68e017af9a3ae39e87829bd5fe5bfaa
2.1G    /mnt/disks/ssd0/cache/kubernetes
2.1G    /mnt/disks/ssd0/cache

@BenTheElder BenTheElder deleted the cache-of-caches branch February 13, 2018 02:21
@BenTheElder BenTheElder added the area/greenhouse Issues or PRs related to code in /greenhouse (our remote bazel cache) label Sep 5, 2018