Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Re-design cache keys (filesystem paths) #983

Closed
Swatinem opened this issue Jan 19, 2023 · 4 comments · Fixed by #1043
Closed

RFC: Re-design cache keys (filesystem paths) #983

Swatinem opened this issue Jan 19, 2023 · 4 comments · Fixed by #1043
Assignees

Comments

@Swatinem
Copy link
Member

Swatinem commented Jan 19, 2023

Problem Statement

There is currently a couple of issues with cache keys:

  • They encode a filename pattern in their name, which can grow quite large up to causing Cache Keys can exceed maximum file name length #965
  • Similar to the issue above, we have also seen internal errors about file names with internal \0 characters. They seem to come from invalid/garbage minidump module code_file names. This is a separate issue though that we should address separately.
  • They are not grouped by source id, which is easily fixable by making the source id a directory rather than a file prefix.
  • As mentioned in fix: Use unique Cache files for symcaches with markers #982, the cache key does not take into account the actual content of secondary_sources, but just their existence.
  • In general, the cache keys do not make sense for caches derived from multiple files, which is also relevant for sourcemaps.

The current naming of the files is quite convenient when trying to locally inspect cache contents for very specific files.

Another thing which is not a problem in itself, but an observation: The Sentry source is mutable, while all others are considered immutable. This means you can re-upload new file contents using the same uuid (kinda defeating the purpose of uuids).

Proposal

I propose to structure the on-disk path like this:

caches/$type/$version/$key

Where $key itself is some form of hash consisting of all the sources contributing to that cache.
We want to have a unique cache key that changes whenever one of these contributing source files does.
This should be the case when someone uploads a bcsymbolmap or il2cpp file after the fact.
Same as when someone uploads a sourcemap directly to sentry artifacts which would have previously been scraped from the web.

The $key part can be a sha256 hash, grouped into byte-prefixes such as .../aa/bb/cc/ddeeff... to reduce the number of files in one directory.

Metadata

One advantage of the "legacy" naming format is that the scope (aka project id), source id and location are encoded in the filename. This is a useful property to have to diagnose problems related to caching.
An accompanying .json file can be written to the cache as well (and probably touched alongside the main cache file).
This file is only written to, it is not being read by symbolicator. Its purpose is only to help inspect cache contents and where they come from for developers. The format is not specified and can change freely over time.

Examples

A objects cache is derived from only one source:

  • { source: sentry:project, $id }
  • OR: { source: sentry:microsoft, url: $url }

A symcache can be derived from up to 4 different sources:

  • debug file: { source: sentry:project, $id }
  • plist file: { source: custom-http-source, $url }
  • bcsymbolmap file: { source: custom-s3-source, $bucket, $path }
  • il2cpp file: { source: custom-gcs-source, $bucket, $path }

A sourcemapcache can be derived from two different sources:

  • minified js: { source: sentry:artifact, $id }
  • sourcemap: { source: scraped, $url }

Legacy Key / Migration path

The current cache keys use a the following scheme:

.../$scope/$sourceId_$location

That is implemented here:

We need to support those file names for a longer time still, which means we have to carry that around anyway.

To migrate forward to a new naming scheme, we could use a sha256 hash on a stable version of the legacy path as the primary hash, and then update that hash with more parts if a specific cache file is derived from more than one source file. Depending on the specific cache, the hash should also be updated with the kind of file that was used.

struct CacheKey {
  legacy: String, // essentially `"{scope}/{path_safe(RemoteFile::cache_key)}"`
  hash: [u8; 32], // essentially `sha256::digest(legacy)` for caches that have a single source file
}

As for the rollout strategy, I propose the following:

  • Try both filenames when reading the cache
  • Write new type of filenames on cache creation
  • Use the background recomputation feature to populate the new cache files
  • ☝🏻 monitor the progress via metrics
  • eventually drop the legacy cache key part when it is safe to do

Open Questions

  • How can we improve the debug-ability of persisted cache files? (we could also persist a $key.json file alongside the cache which lists all the contributing sources?)
  • How should we do the transition and backwards compatibility between the two cache formats?
  • Should we also re-think how we persist not-found / errors?
  • Related to errors: How should we deal with not-found / errors when we have multiple sources, some of which are mandatory or optional. eg. SourceMaps need both minified and sourcemap file, whereas for symcache the mapping files are optional.
@Swatinem
Copy link
Member Author

Interestingly, I left out scope completely from the proposal.

I believe scope is really a property of the symbol source: It literally is if source.is_public { Scoped(request.scope) } else { Global }. The purpose if it is to guarantee cache isolation for private sources.

  • The public sources are obviously public symbol servers, but also our "kinda-private" apple symbols which are supposed to be shared among requests.
  • Private (scoped) sources are custom symbol servers, which get assigned a new uuid (for cache busting purposes no less) on every update.
  • Another private symbol source is the sentry source, which gets a per-project url that makes it sufficiently isolated.

I believe we can do away with scope altogether for the reasons outlined.

@loewenheim
Copy link
Contributor

Assuming we use the $key.json file (which I think is a good idea), would we still put markers in symcaches to denote whether they were created with auxiliary files?

@loewenheim
Copy link
Contributor

As for errors, I'm imagining something like this:

Assume we are trying to create a symcache from a debug file (required) and an il2cpp file (optional).

  • Case 1: Debug file wasn't found
    caches/symcache/$version/$key contains a "not found" marker
    caches/symcache/$version/$key.json contains
{
  "debug_file": {
   "source": "…",
   "required": true,
   "found": false
  },
  "il2cpp_file": {…}
}
  • Case 2: Debug file wasn't found, il2cpp file was
    caches/symcache/$version/$key contains the symcache generated without the il2cpp file
    caches/symcache/$version/$key.json contains
{
  "debug_file": {
   "source": "…",
   "required": true,
   "found": true
  },
  "il2cpp_file": {
    "source": "...",
    "required": false,
    "found": false
  }
}

Does that make sense?

@Swatinem
Copy link
Member Author

Assuming we use the $key.json file (which I think is a good idea), would we still put markers in symcaches to denote whether they were created with auxiliary files?

I don’t think that would be necessary anymore and would thus remove all that machinery.

Assume we are trying to create a symcache from a debug file (required) and an il2cpp file (optional).

IMO we should not be persisting anything on the file-system if the primary source file was never found.
We implemented this short-circuit in our caching work so far (almost by accident), but I am quite happy with that behavior tbh.

We could indeed have the cache key/json file be something like this:

struct SymCacheKey {
  object: Key,
  bcsymbolmap: Option<Key>,
  il2cpp: Option<Key>,
}

We could then have trait methods for .path_for_version(&self, version) -> String and .write_metadata_json(&self, &mut Write) which would write the json snippet / metadata into the file.

Not quite sure how to factor in the versioning here. I would suggest to make the version part of the path, but it might also be possible to make it part of the key itself.

I’m also not sure if we even want to consider the version of source files? Maybe imagine a SymCache gets constructed from a il2cpp cache version X, and a different one from cache version Y. Do we want to support such cases?

Swatinem added a commit that referenced this issue Feb 14, 2023
This introduces a new `CacheKeyBuilder` which can be used to construct a combined `CacheKey` that uses information from different sources that feed into one cache item.

It also refactors both the file-system and the shared-cache layers to use straight `&str` relative paths, though thus far they are still using the legacy cache keys.

The last missing TODO item here would be to properly hook up the combined cache key for symcaches.

So far the code is not being used neither in the read nor the write path, though this PR prepares most of the code that would get us there.

This takes care of most of what is discussed in #983, though it does not yet implement "debuggability" (aka writing a manifest file that would save all the metadata in a separate .json file).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants