RFC: Re-design cache keys (filesystem paths) #983

Swatinem · 2023-01-19T12:04:22Z

Problem Statement

There is currently a couple of issues with cache keys:

They encode a filename pattern in their name, which can grow quite large up to causing Cache Keys can exceed maximum file name length #965
Similar to the issue above, we have also seen internal errors about file names with internal \0 characters. They seem to come from invalid/garbage minidump module code_file names. This is a separate issue though that we should address separately.
They are not grouped by source id, which is easily fixable by making the source id a directory rather than a file prefix.
As mentioned in fix: Use unique Cache files for symcaches with markers #982, the cache key does not take into account the actual content of secondary_sources, but just their existence.
In general, the cache keys do not make sense for caches derived from multiple files, which is also relevant for sourcemaps.

The current naming of the files is quite convenient when trying to locally inspect cache contents for very specific files.

Another thing which is not a problem in itself, but an observation: The Sentry source is mutable, while all others are considered immutable. This means you can re-upload new file contents using the same uuid (kinda defeating the purpose of uuids).

Proposal

I propose to structure the on-disk path like this:

caches/$type/$version/$key

Where $key itself is some form of hash consisting of all the sources contributing to that cache.
We want to have a unique cache key that changes whenever one of these contributing source files does.
This should be the case when someone uploads a bcsymbolmap or il2cpp file after the fact.
Same as when someone uploads a sourcemap directly to sentry artifacts which would have previously been scraped from the web.

The $key part can be a sha256 hash, grouped into byte-prefixes such as .../aa/bb/cc/ddeeff... to reduce the number of files in one directory.

Metadata

One advantage of the "legacy" naming format is that the scope (aka project id), source id and location are encoded in the filename. This is a useful property to have to diagnose problems related to caching.
An accompanying .json file can be written to the cache as well (and probably touched alongside the main cache file).
This file is only written to, it is not being read by symbolicator. Its purpose is only to help inspect cache contents and where they come from for developers. The format is not specified and can change freely over time.

Examples

A objects cache is derived from only one source:

{ source: sentry:project, $id }
OR: { source: sentry:microsoft, url: $url }

A symcache can be derived from up to 4 different sources:

debug file: { source: sentry:project, $id }
plist file: { source: custom-http-source, $url }
bcsymbolmap file: { source: custom-s3-source, $bucket, $path }
il2cpp file: { source: custom-gcs-source, $bucket, $path }

A sourcemapcache can be derived from two different sources:

minified js: { source: sentry:artifact, $id }
sourcemap: { source: scraped, $url }

Legacy Key / Migration path

The current cache keys use a the following scheme:

.../$scope/$sourceId_$location

That is implemented here:

We need to support those file names for a longer time still, which means we have to carry that around anyway.

To migrate forward to a new naming scheme, we could use a sha256 hash on a stable version of the legacy path as the primary hash, and then update that hash with more parts if a specific cache file is derived from more than one source file. Depending on the specific cache, the hash should also be updated with the kind of file that was used.

struct CacheKey {
  legacy: String, // essentially `"{scope}/{path_safe(RemoteFile::cache_key)}"`
  hash: [u8; 32], // essentially `sha256::digest(legacy)` for caches that have a single source file
}

As for the rollout strategy, I propose the following:

Try both filenames when reading the cache
Write new type of filenames on cache creation
Use the background recomputation feature to populate the new cache files
☝🏻 monitor the progress via metrics
eventually drop the legacy cache key part when it is safe to do

Open Questions

How can we improve the debug-ability of persisted cache files? (we could also persist a $key.json file alongside the cache which lists all the contributing sources?)
How should we do the transition and backwards compatibility between the two cache formats?
Should we also re-think how we persist not-found / errors?
Related to errors: How should we deal with not-found / errors when we have multiple sources, some of which are mandatory or optional. eg. SourceMaps need both minified and sourcemap file, whereas for symcache the mapping files are optional.

The text was updated successfully, but these errors were encountered:

Swatinem · 2023-01-19T12:19:04Z

Interestingly, I left out scope completely from the proposal.

I believe scope is really a property of the symbol source: It literally is if source.is_public { Scoped(request.scope) } else { Global }. The purpose if it is to guarantee cache isolation for private sources.

The public sources are obviously public symbol servers, but also our "kinda-private" apple symbols which are supposed to be shared among requests.
Private (scoped) sources are custom symbol servers, which get assigned a new uuid (for cache busting purposes no less) on every update.
Another private symbol source is the sentry source, which gets a per-project url that makes it sufficiently isolated.

I believe we can do away with scope altogether for the reasons outlined.

loewenheim · 2023-01-19T12:26:25Z

Assuming we use the $key.json file (which I think is a good idea), would we still put markers in symcaches to denote whether they were created with auxiliary files?

loewenheim · 2023-01-19T12:38:38Z

As for errors, I'm imagining something like this:

Assume we are trying to create a symcache from a debug file (required) and an il2cpp file (optional).

Case 1: Debug file wasn't found
caches/symcache/$version/$key contains a "not found" marker
caches/symcache/$version/$key.json contains

{
  "debug_file": {
   "source": "…",
   "required": true,
   "found": false
  },
  "il2cpp_file": {…}
}

Case 2: Debug file wasn't found, il2cpp file was
caches/symcache/$version/$key contains the symcache generated without the il2cpp file
caches/symcache/$version/$key.json contains

{
  "debug_file": {
   "source": "…",
   "required": true,
   "found": true
  },
  "il2cpp_file": {
    "source": "...",
    "required": false,
    "found": false
  }
}

Does that make sense?

Swatinem · 2023-01-19T13:41:31Z

Assuming we use the $key.json file (which I think is a good idea), would we still put markers in symcaches to denote whether they were created with auxiliary files?

I don’t think that would be necessary anymore and would thus remove all that machinery.

Assume we are trying to create a symcache from a debug file (required) and an il2cpp file (optional).

IMO we should not be persisting anything on the file-system if the primary source file was never found.
We implemented this short-circuit in our caching work so far (almost by accident), but I am quite happy with that behavior tbh.

We could indeed have the cache key/json file be something like this:

struct SymCacheKey {
  object: Key,
  bcsymbolmap: Option<Key>,
  il2cpp: Option<Key>,
}

We could then have trait methods for .path_for_version(&self, version) -> String and .write_metadata_json(&self, &mut Write) which would write the json snippet / metadata into the file.

Not quite sure how to factor in the versioning here. I would suggest to make the version part of the path, but it might also be possible to make it part of the key itself.

I’m also not sure if we even want to consider the version of source files? Maybe imagine a SymCache gets constructed from a il2cpp cache version X, and a different one from cache version Y. Do we want to support such cases?

This introduces a new `CacheKeyBuilder` which can be used to construct a combined `CacheKey` that uses information from different sources that feed into one cache item. It also refactors both the file-system and the shared-cache layers to use straight `&str` relative paths, though thus far they are still using the legacy cache keys. The last missing TODO item here would be to properly hook up the combined cache key for symcaches. So far the code is not being used neither in the read nor the write path, though this PR prepares most of the code that would get us there. This takes care of most of what is discussed in #983, though it does not yet implement "debuggability" (aka writing a manifest file that would save all the metadata in a separate .json file).

This was referenced Jan 20, 2023

fix: Use unique Cache files for symcaches with markers #982

Closed

ref: Move CacheKey into its own file #988

Merged

ashwoods assigned Swatinem Jan 27, 2023

This was referenced Jan 27, 2023

Add in-memory caching #994

Closed

Clean up internal Error / Cache propagation #922

Closed

ref: Reorganize and document Caching functionality #998

Merged

Swatinem mentioned this issue Feb 13, 2023

feat: Introduce a CacheKeyBuilder #1033

Merged

This was referenced Feb 14, 2023

ref: Use new CacheKey for writes and shared-cache #1038

Merged

ref: Clean up deprecated Cache features #1043

Merged

Swatinem closed this as completed in #1043 Feb 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: Re-design cache keys (filesystem paths) #983

RFC: Re-design cache keys (filesystem paths) #983

Swatinem commented Jan 19, 2023 •

edited

Loading

Swatinem commented Jan 19, 2023

loewenheim commented Jan 19, 2023

loewenheim commented Jan 19, 2023

Swatinem commented Jan 19, 2023

RFC: Re-design cache keys (filesystem paths) #983

RFC: Re-design cache keys (filesystem paths) #983

Comments

Swatinem commented Jan 19, 2023 • edited Loading

Problem Statement

Proposal

Metadata

Examples

Legacy Key / Migration path

Open Questions

Swatinem commented Jan 19, 2023

loewenheim commented Jan 19, 2023

loewenheim commented Jan 19, 2023

Swatinem commented Jan 19, 2023

Swatinem commented Jan 19, 2023 •

edited

Loading