-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use per WORKSPACE / host tools keyed bazel cache(s) #6767
Conversation
dc61d9e
to
126be60
Compare
echo "build --action_env=CACHE_PYTHON_VERSION=${PYTHON_VERSION}" | ||
# point it at our http cache ... | ||
# NOTE our cache is versioned by the first path segment | ||
# TODO(bentheelder): update the nursery deployment to something that supports this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does this mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the cache server deployment "nursery" is actually https://github.com/buchgr/bazel-remote currently which only supports a single cache directory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is done now
# point it at our http cache ... | ||
# NOTE our cache is versioned by the first path segment | ||
# TODO(bentheelder): update the nursery deployment to something that supports this | ||
WORKSPACE_NAME="${REPO_NAME:-$(basename "$PWD")}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is reponame just like test-infra
or is it kubernetes/test-infra
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is just test-infra
https://github.com/kubernetes/test-infra/tree/master/prow#job-environment-variables
It's probably better to use owner as well, though implementing the bash fallback for CI jobs may be a bit messy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed this and cleaned up the bash a bit
another thought: most of our images include an it'd result in slightly less caching, since we probably don't change most of the underlying dependencies that often. OTOH it should be pretty safe. |
We can, but I'm not really concerned if we can track GCC versions correctly (this part is easy) and the images are pushed fairly frequently for other changes (kubetest). Either way we need to key off the repo/WORKSPACE as well. |
Since we're based on debian jessie now host tools like python and gcc are unlikely to change until we switch the base image to something else someday (stretch?). |
Also if we do run into an issue with the toolchain hashing this we can switch out the cache identifier generation logic to use |
6a2e1d9
to
159ab63
Compare
the server should work now, need to clean up the logging and test it some more then package it up for deployment. |
159ab63
to
8b52fdb
Compare
To test the server locally, add this to .bazelrc in test-infra:
build and run:
build something in another shell:
|
5733ed5
to
aa6d81c
Compare
/test pull-test-infra-bazel-canary |
a127ac6
to
2991eb1
Compare
2991eb1
to
62f2414
Compare
f2e343a
to
fc60337
Compare
/area bazel |
} | ||
|
||
// Get provides your readHandler with the contents at key | ||
func (c *Cache) Get(key string, readHandler ReadHandler) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using a ReadHandler
you could just return the ReadSeeker
and an error. That would be more idiomatic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but this allows me to track who currently holds a cache key which may be very useful in the future.
} | ||
|
||
func main() { | ||
// TODO(bentheelder): bound cache size / convert to LRU |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be completed before we use this anywhere so that the cache doesn't grow uncontrollably.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It only takes up about 4gb to host test infra, the disk is 375gb. We won't use this in prod yet, just experimentally. I'm going to collect some more data from experimental usage before deciding on an eviction strategy.
Filling the disk doesn't cause failures though either, you just won't be able to insert new things into the cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also worth noting that this runs on a dedicated node, so it doesn't affect the rest of CI. A previous iteration of this already runs and is used by the pull-.*bazel.*-canary
jobs.
experiment/nursery/main.go
Outdated
} | ||
// unknown error | ||
log.WithError(err).Error("error getting key") | ||
http.Error(w, err.Error(), http.StatusNotFound) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a 5xx status code instead of a 404.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, I didn't pay too much attention to the error codes since the only client should be bazel which should only concerned with 200 or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
experiment/nursery/main.go
Outdated
// handle unsupported methods... | ||
default: | ||
log.Warn("received an invalid request method: %v", r.Method) | ||
http.Error(w, "unsupported method", http.StatusBadRequest) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
405 is more specific than 400.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
case http.MethodPut: | ||
// only hash CAS, not action cache | ||
// the action cache is hash -> metadata | ||
// the CAS is well, a CAS, which we can hash... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what the action cache (ac
) variant is supposed to do. Where is the hash -> metadata
mapping mentioned in the comment and how is it used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the protocol linked at the top of the file and the comment below: https://docs.bazel.build/versions/master/remote-caching.html#http-caching-protocol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bazel does PUT $HTTP_CACHE_URL/ac/$HASH
(bytes of PUT are action metadata) and GET
the same way
It also does PUT $HTTP_CACHE_URL/cas/$HASH
(bytes of PUT are content-address object storage) and GET
the same way.
if acOrCAS != "cas" { | ||
hash = "" | ||
} | ||
err := cache.Put(r.URL.Path, r.Body, hash) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused about the file retrieval process.
If we do a PUT to /org/repo/cas/15e2b0d
and create a file cache-dir/org/repo/cas/15e2b0d
we wont be able to GET the file without knowing its content digest (15e2b0d
) which we would never know unless we already had the file... Is this where action caches come into play?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Caching works like:
Bazel computes the entire build as a graph of Action
s, each Action
has inputs etc and is hashed to some action-cache key. The action cache key is then looked up in the cache for metadata stored from actually executing an action. If there is a cache hit, then the objects are looked up in the CAS from this metadata.
The CAS is just a content addressed store of files used / produced by actions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't hash the actions because they are computed each time and map to metadata from execution instead of the action itself, which is much cheaper to compute. The caches are something like:
Action Cache (../ac/..
): Hash(Action.serialize())
-> Action.execute().metadata()
Action Metadata -> CAS Keys
CAS (.../cas...
): Hash(OutputFile.getBytes())
-> OutputFile.getBytes()
experiment/nursery/main.go
Outdated
|
||
func cacheHandler(cache *diskcache.Cache) http.Handler { | ||
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { | ||
// parse and validate path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be helpful to create a logrus.Entry
here that has a path
and a method
field and use that to log warnings and errors throughout this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, also set "component"
d447767
to
a9c164e
Compare
Why aren't we just using GCS as the cache (https://docs.bazel.build/versions/master/remote-caching.html#google-cloud-storage)? |
@cjwagner because this way the bandwidth is all in-cluster which should be faster and cheaper, we also don't want to provide jobs with unrestricted access to stuff data in GCS. This way the storage is private because it can only be accessed within the cluster and our costs are bound to the cost of the cache node. |
Serving files from a local SSD in cluster is very fast, pull-test-infra-bazel-canary jobs are now bound to the cost of the pip install and pylint, the time to bazel build / test is otherwise neglible using the cache node. |
I'm fine with experimenting this with a few canary jobs to start with, as the performance improvement already looks promising. punt to @cjwagner if he has more comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining everything. Maybe add a short comment explaining why we have our own bazel cache implementation instead of using an existing one.
/lgtm
/hold
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: BenTheElder, cjwagner The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these OWNERS Files:
Approvers can indicate their approval by writing |
@@ -26,15 +27,38 @@ package_to_version () { | |||
|
|||
# look up a binary with which and return the debian package it belongs to | |||
command_to_package () { | |||
BINARY_PATH=$(readlink -f $(which $1)) | |||
local BINARY_PATH=$(readlink -f $(which $1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: per https://google.github.io/styleguide/shell.xml?showone=Variable_Names#Variable_Names, variable names should be lowercase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, and elsewhere
/lint |
@krzyzacy: What do you call a cow with two legs? Lean beef. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@krzyzacy: 1 warning.
In response to this:
/lint
/joke
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
limitations under the License. | ||
*/ | ||
|
||
// cache implements disk backed cache storage for use in nursery |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Golint comments: package comment should be of the form "Package diskcache ...". More info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
👍 If everything goes well with the next round of canary experiments I plan to:
|
Small follow-up:
|
We can solve a lot of problems by just using one cache per
(WORKSPACE [org/repo], (Image Tool Versions [gcc, python, etc.]))
We do this by setting:
Where
CACHE_ID
currently comes fromCACHE_ID="${WORKSPACE_NAME},$(hash_toolchains)"
Right now I've implemented the keying and tested it against nginx with webdav enabled locally.
TODO:
Follow-ups:
bazelrc
s into image(s)