RFC: Re-design cache keys (filesystem paths) #983
Comments
Interestingly, I left out I believe
I believe we can do away with |
Assuming we use the |
As for errors, I'm imagining something like this: Assume we are trying to create a symcache from a debug file (required) and an il2cpp file (optional).
Does that make sense?
I don’t think that would be necessary anymore and would thus remove all that machinery.
IMO we should not be persisting anything on the file-system if the primary source file was never found. We could indeed have the cache key/json file be something like this:
We could then have trait methods for Not quite sure how to factor in the versioning here. I would suggest to make the version part of the path, but it might also be possible to make it part of the key itself. I’m also not sure if we even want to consider the version of source files? Maybe imagine a |
This introduces a new `CacheKeyBuilder` which can be used to construct a combined `CacheKey` that uses information from different sources that feed into one cache item. It also refactors both the file-system and the shared-cache layers to use straight `&str` relative paths, though thus far they are still using the legacy cache keys. The last missing TODO item here would be to properly hook up the combined cache key for symcaches. So far the code is used in neither the read nor the write path, though this PR prepares most of the code that would get us there. This takes care of most of what is discussed in #983, though it does not yet implement "debuggability" (aka writing a manifest file that would save all the metadata in a separate .json file).
Problem Statement
There are currently a couple of issues with cache keys:
`\0` characters. They seem to come from invalid/garbage minidump module `code_file` names. This is a separate issue though that we should address separately.
`secondary_sources`, but just their existence.
The current naming of the files is quite convenient when trying to locally inspect cache contents for very specific files.
Another thing which is not a problem in itself, but an observation: The Sentry source is mutable, while all others are considered immutable. This means you can re-upload new file contents using the same uuid (kinda defeating the purpose of uuids).
Proposal
I propose to structure the on-disk path like this:
Where `$key` itself is some form of hash consisting of all the sources contributing to that cache.
We want to have a unique cache key that changes whenever one of these contributing source files does.
This should be the case when someone uploads a bcsymbolmap or il2cpp file after the fact.
Same as when someone uploads a sourcemap directly to sentry artifacts which would have previously been scraped from the web.
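A minimal sketch of such a combined key, assuming SHA-256 as the hash and a JSON serialization of each source descriptor. The function name `cache_key`, the descriptor fields and the ids are hypothetical and are not symbolicator's actual implementation; the point is only that adding a contributing source yields a different key.

```python
import hashlib
import json


def cache_key(sources):
    """Return a hex SHA-256 digest over all contributing source descriptors."""
    hasher = hashlib.sha256()
    for source in sources:
        # Serialize with sorted keys so the same descriptor always hashes the same.
        hasher.update(json.dumps(source, sort_keys=True).encode("utf-8"))
        # Separator so that adjacent descriptors cannot run together.
        hasher.update(b"\n")
    return hasher.hexdigest()


# Hypothetical descriptors: a required debug file and an optional il2cpp file.
debug_file = {"source": "sentry:project", "id": "debug-file-id"}
il2cpp_file = {"source": "sentry:project", "id": "il2cpp-file-id"}

key_before = cache_key([debug_file])
key_after = cache_key([debug_file, il2cpp_file])
```

With this scheme, uploading the il2cpp file after the fact changes the key, so the previously written cache entry is simply no longer looked up.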
The `$key` part can be a `sha256` hash, grouped into byte-prefixes such as `.../aa/bb/cc/ddeeff...` to reduce the number of files in one directory.
Metadata
One advantage of the "legacy" naming format is that the scope (aka project id), source id and location are encoded in the filename. This is a useful property to have to diagnose problems related to caching.
An accompanying `.json` file can be written to the cache as well (and probably touched alongside the main cache file).
This file is only written to, it is not being read by symbolicator. Its purpose is only to help inspect cache contents and where they come from for developers. The format is not specified and can change freely over time.
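As a rough illustration, the sketch below writes a cache file under a byte-prefixed path and puts a `.json` file next to it. The directory layout below the cache root, the function names and the manifest fields are all assumptions made for this example; as noted above, the manifest format is deliberately unspecified.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sharded_path(key):
    """Split a hex digest into aa/bb/cc/rest to keep directories small."""
    return Path(key[0:2]) / key[2:4] / key[4:6] / key[6:]


def write_cache_entry(root, key, contents, sources):
    """Write the cache file and a sidecar .json file listing its sources."""
    path = Path(root) / sharded_path(key)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(contents)
    # The manifest is write-only debugging metadata; nothing reads it back.
    manifest = path.with_name(path.name + ".json")
    manifest.write_text(json.dumps({"sources": sources}, indent=2))
    return path, manifest
```

A developer inspecting the cache directory can then open the `.json` file next to any entry to see which sources it was derived from, which recovers the diagnostic value of the legacy file names.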
Examples
An `objects` cache is derived from only one source:
{ source: sentry:project, $id }
{ source: sentry:microsoft, url: $url }
A `symcache` can be derived from up to 4 different sources:
{ source: sentry:project, $id }
{ source: custom-http-source, $url }
{ source: custom-s3-source, $bucket, $path }
{ source: custom-gcs-source, $bucket, $path }
A `sourcemapcache` can be derived from two different sources:
{ source: sentry:artifact, $id }
{ source: scraped, $url }
Legacy Key / Migration path
The current cache keys use the following scheme:
That is implemented here:
We need to support those file names for a longer time still, which means we have to carry that around anyway.
To migrate forward to a new naming scheme, we could use a `sha256` hash on a stable version of the legacy path as the primary hash, and then update that hash with more parts if a specific cache file is derived from more than one source file. Depending on the specific cache, the hash should also be updated with the kind of file that was used.
As for the rollout strategy, I propose the following:
Open Questions
`$key.json` file alongside the cache which lists all the contributing sources?)