Swing-store export should have a "replay" artifact level #8105

Closed
mhofman opened this issue Jul 27, 2023 · 5 comments · Fixed by #8170
Assignees
Labels
cosmic-swingset package: cosmic-swingset enhancement New feature or request swing-store SwingSet package: SwingSet

Comments

@mhofman
Member

mhofman commented Jul 27, 2023

What is the Problem Being Solved?

The current export mode of the swing-store exporter allows including either some historical data (archival / debug) or only the current artifacts.
The former is not deterministic, as it depends on whether some artifacts were pruned in the past (e.g. through state-sync).

While current artifacts are sufficient for operational purposes today, we know we'll need to replay historical transcript spans of the current incarnation of vats to support major XS upgrades.

We thus need a new deterministic mode that includes such artifacts so that every validator has sufficient data when replaying from state-sync.

Related: #8025 (comment)

Description of the Design

Change the export options to replace the "export mode" option with an "artifact level" option as follows:

  • operational (the minimal set of artifacts to start a node, deterministic, same as current export level)
  • replay (includes all artifacts needed to replay the latest incarnation of vats, deterministic but may fail if some needed artifacts are missing, new level)
  • archival (includes all available transcript artifacts, non-deterministic, same as today's archival)
  • debug (includes all available artifacts, non-deterministic, same as today's debug)

Security Considerations

We should switch state-sync to the replay level. Nodes that have pruned artifacts will fail to create state-sync snapshots until they have repopulated the missing artifacts; however, such failures are not consensus-affecting.

Similarly, genesis export should use that mode, with the same considerations as state-sync.

Scaling Considerations

This will increase the size of state-sync snapshots (and genesis exports). To mitigate this, we need to restart vats (null upgrade) more often.

Genesis export, and possibly state-sync, suffer from a memory usage issue in cosmos, where all data is held in memory before being written to disk. This could raise the resources necessary to operate such nodes, but as for security considerations, these operations are not part of the normal validating operations.

Test Plan

TBD

Upgrade Considerations

See security considerations. We should provide the ability for validators to re-populate missing artifacts required for the "replay" level.

@mhofman mhofman added enhancement New feature or request SwingSet package: SwingSet cosmic-swingset package: cosmic-swingset swing-store labels Jul 27, 2023
@JimLarson JimLarson assigned warner and unassigned mhofman Aug 2, 2023
@Agoric Agoric deleted a comment from aj-agoric Aug 4, 2023
@warner
Member

warner commented Aug 9, 2023

We estimate that this will grow the state-sync snapshot data size (sum of the compressed chunks that a new validator must download) from 335MB to 700-ish MB. The time to generate the snapshot will grow too (maybe double, maybe more.. my follower node took 3-4min to generate the 335MB one, @mhofman 's took 15min to generate the 700MB one, but they're on different hardware). That also increases the time to download and decompress.

I'm thinking artifactMode instead of artifactLevel. And as I started to write the docs, I hit a weirdness. Currently, artifactMode is a filter: it might reduce the number of artifacts you get, but there's no setting that might cause the export to fail, regardless of what is or is not in your DB. But the proposal/need is for something that will fail unless a certain set of artifacts can be created.

Would we want that for all the levels? E.g. would archival fail unless we had every old incarnation? Would debug fail unless we had every old snapshot? That doesn't seem useful, or at least it would preclude something that does seem useful ("give me everything you've got, complete or not").

I suppose one approach would be to have two args/options, one for the maximum level (behaving like the old exportMode), and a second for the minimum required level (to prevent state-sync exports that lack the full current incarnation).

If we do that, I'd probably want to change the second argument to an options bag (perhaps with backwards compatibility for a single string). And then I guess the option names would be include and required, or includedArtifacts and requiredArtifacts.
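
The two-option idea floated above could look roughly like the following. This is a hypothetical sketch, not swing-store code: the helper name `normalizeExportOptions` is invented, the option names `include` and `required` are the ones suggested in this comment, and treating the legacy `'current'` string as `'operational'` is an assumption about how backwards compatibility might work.

```javascript
// Hypothetical helper, not part of swing-store. Level names follow the
// issue description; the defaults are assumptions for illustration.
const LEVELS = ['operational', 'replay', 'archival', 'debug'];

function normalizeExportOptions(arg = {}) {
  // Backwards compatibility: a bare string is treated like the old
  // positional exportMode, i.e. a best-effort maximum level.
  const bag = typeof arg === 'string' ? { include: arg } : arg;
  const rawInclude = bag.include ?? 'operational';
  // Assumption: the old 'current' value maps to 'operational'.
  const include = rawInclude === 'current' ? 'operational' : rawInclude;
  const required = bag.required ?? 'operational';

  for (const level of [include, required]) {
    if (!LEVELS.includes(level)) {
      throw Error(`unknown artifact level ${level}`);
    }
  }
  // The required minimum cannot ask for more than the included maximum.
  if (LEVELS.indexOf(required) > LEVELS.indexOf(include)) {
    throw Error(`required level ${required} exceeds included level ${include}`);
  }
  return { include, required };
}

// Example: state-sync wants everything available, but must have 'replay'.
const stateSyncOptions = normalizeExportOptions({
  include: 'debug',
  required: 'replay',
});
```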

@mhofman
Member Author

mhofman commented Aug 9, 2023

would archival fail unless we had every old incarnation?

Yes

Would debug fail unless we had every old snapshot?

No, I was thinking it could enforce 'archival', but that's not necessary.

See what I wrote here https://github.com/Agoric/agoric-sdk/pull/8143/files#diff-165a82d94d7114e699229abed73f9f5739be922caa7bafb8a44aff5930c31d2bR28-R33:

 * @typedef {'none'  // No artifacts included
 *  | 'operational'  // Minimum artifacts for running a production node (latest transcript spans and snapshots)
 *  | 'replay'       // Artifacts needed to replay to the current state (full transcripts of latest vat incarnation)
 *  | 'archival'     // All replay related artifacts, including historical (full transcripts of all vat incarnations)
 *  | 'debug'        // All artifacts available (may include historical heap snapshots)
 * } SwingStoreArtifactLevel

I suppose one approach would be to have two args/options, one for the maximum level (behaving like the old exportMode), and a second for the minimum required level (to prevent state-sync exports that lack the full current incarnation).

I think we should not overcomplicate this. I called it a level as I was thinking about it in terms of the enforced level of data I want in the export (or import). I don't see a reason to have "at least this but maybe give me this much more". We can tack on a 'debug' that doesn't enforce anything and just dumps everything if needed.

I'd probably want to change the second argument to an options bag

I sure hope we do that regardless of the shape. The fewer positional params, the better.

@warner
Member

warner commented Aug 10, 2023

@mhofman and I just settled on makeSwingStoreExporter(dirPath, { artifactMode }), with modes as listed above. For a given database, the four levels will generate increasing supersets of artifacts:

  • replay gets you every artifact from operational plus additional spans for the current incarnation (if any)
  • archival gets you every artifact from replay plus additional spans from previous incarnations (if any)
  • debug gets you every artifact from archival plus old snapshots (if any, which requires that your openSwingStore was done with a non-default options.keepSnapshots = true)

However replay is special in that it might fail, even though the swingstore is capable of supporting normal operations. operational will never fail (because swingstores must always have the current spans). archival and debug will never fail because they're best-effort: you get everything available (with/without historical snapshots) with no claims about completeness.
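
The superset relationship between the four modes can be sketched as a cumulative filter. This is only an illustration: the `kind` tags below are invented for the example and are not swing-store artifact names, and the real exporter works from database records rather than an in-memory list.

```javascript
// Illustrative only: `kind` values are hypothetical labels, not real
// swing-store artifact names.
const MODES = ['operational', 'replay', 'archival', 'debug'];

// What each mode adds on top of the previous one.
const ADDED_BY_MODE = {
  operational: ['current-snapshot', 'current-span'],
  replay: ['current-incarnation-span'],
  archival: ['old-incarnation-span'],
  debug: ['old-snapshot'],
};

// Every mode includes the kinds of all earlier modes plus its own.
function kindsFor(mode) {
  const index = MODES.indexOf(mode);
  if (index < 0) {
    throw Error(`unknown artifactMode ${mode}`);
  }
  return MODES.slice(0, index + 1).flatMap(m => ADDED_BY_MODE[m]);
}

function selectArtifacts(artifacts, mode) {
  const kinds = new Set(kindsFor(mode));
  return artifacts.filter(artifact => kinds.has(artifact.kind));
}

// Usage: the same database yields growing sets of artifacts per mode.
const exampleArtifacts = [
  { name: 'a', kind: 'current-snapshot' },
  { name: 'b', kind: 'current-span' },
  { name: 'c', kind: 'current-incarnation-span' },
  { name: 'd', kind: 'old-incarnation-span' },
  { name: 'e', kind: 'old-snapshot' },
];
for (const mode of MODES) {
  const names = selectArtifacts(exampleArtifacts, mode).map(a => a.name);
  console.log(mode, names.join(','));
}
```

The "(if any)" qualifiers in the list above correspond to the filter simply finding nothing of that kind: a database with no old snapshots yields the same artifacts for archival and debug.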

We're changing the old positional exportMode argument into an options bag. We're not going to provide backwards compatibility with callers who continue to pass a string here: we'll land these swingstore changes in the same PR where @mhofman updates the cosmic-swingset -side caller to pass an options bag.

If I have time, I'll also add some short command-line tools, maybe in packages/swing-store/tools/ or bin/, to write out a directory of artifacts (for repopulation), and to repopulate a swingstore from such a directory. The former is trivial, the latter might require some more real code.

Implementation notes:

  • when the exporter is told artifactMode: 'replay', it needs to check all current transcript span records, use that to build a table of vatID to incarnation, for each vatID find the earliest transcript span record for that same incarnation, use that to establish a startPos and endPos for each vat's current incarnation, then do a query per vatID to count the number of transcript items for the current incarnation. If this reveals any missing items, makeSwingStoreExporter should throw, rather than providing a SwingStoreExporter object.
  • on the importer side, artifactMode: 'replay' needs to perform a similar check after all the artifacts are processed, and throw rather than doing a commit()
  • the "generate artifacts for repopulation" tool can just run makeSwingStoreExporter(dirPath, { artifactMode: 'replay' }), then walk getArtifactNames() and write each one into a separate file. It can finish with instructions to zip up the directory or something. Or maybe the tool should import a zip library and do that step too.
  • the "repopulate from bunch of artifacts" needs new code which processes the filename (as an artifact name) and dispatches to importBundle (except not really) or populateSnapshot or populateTranscriptSpan. This will be a copy of the last third of importSwingStore, but needs to live in a new file, and a new function named repopulateSwingStore
    • I don't know if this should be a standalone function, like importSwingStore, or a method, like hostStorage.repairMetadata. Actually, given that it needs to be used by a standalone tool, I think I'll start by making it a standalone function: if we find some use case for invoking it directly from cosmic-swingset, we can change that later. The biggest consequence is in how/when commit() gets called.
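
The replay completeness check from the first implementation note can be sketched over in-memory records. This is a minimal sketch under stated assumptions, not the swing-store implementation: the record shapes (`vatID`, `incarnation`, `startPos`, `endPos`, `isCurrent`, `position`) are modelled loosely on the names used in this thread, spans are assumed to be half-open (`startPos` inclusive, `endPos` exclusive), and the function names are hypothetical. The real code would run SQL queries per vatID instead of filtering arrays.

```javascript
// Sketch only: record shapes and function names are assumptions.
// Spans are treated as half-open ranges [startPos, endPos).
function findMissingReplayItems(spans, items) {
  const missing = [];
  for (const current of spans.filter(span => span.isCurrent)) {
    const { vatID, incarnation, endPos } = current;
    // Earliest span record of the same incarnation gives the start
    // of the range that a replay needs. The current span itself is
    // always in this list, so Math.min never sees an empty list.
    const startPos = Math.min(
      ...spans
        .filter(s => s.vatID === vatID && s.incarnation === incarnation)
        .map(s => s.startPos),
    );
    // Count the transcript items actually present in that range.
    const present = items.filter(
      item =>
        item.vatID === vatID &&
        item.position >= startPos &&
        item.position < endPos,
    ).length;
    const expected = endPos - startPos;
    if (present !== expected) {
      missing.push({ vatID, startPos, endPos, expected, present });
    }
  }
  return missing;
}

// Throwing here mirrors "makeSwingStoreExporter should throw" above.
function assertReplayComplete(spans, items) {
  const missing = findMissingReplayItems(spans, items);
  if (missing.length > 0) {
    const vats = missing.map(m => m.vatID).join(', ');
    throw Error(`artifactMode 'replay': missing transcript items for ${vats}`);
  }
}
```

The importer-side check in the second note could reuse the same function after all artifacts are processed, skipping the commit if it throws.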

@mhofman
Member Author

mhofman commented Aug 10, 2023

However replay is special in that it might fail, even though the swingstore is capable of supporting normal operations. operational will never fail (because swingstores must always have the current spans)

The operational import can fail if we don't provide all the operational artifacts. The fact that operational export doesn't fail is an artifact of the invariants we enforce on the swing-store. Also I'd like archival to fail if not all previous incarnations' transcript artifacts are present, to support scenarios like getting the data necessary for a full Manchurian replay.

I'll also add some short command-line tools, maybe in packages/swing-store/tools/ or bin/, to write out a directory of artifacts (for repopulation), and to repopulate a swingstore from such a directory.

No need, that's what cosmic-swingset/src/{export,import}-kernel-db.js does.

  • the "generate artifacts for repopulation" tool can just run makeSwingStoreExporter(dirPath, { artifactMode: 'replay' }), then walk getArtifactNames() and write each one into a separate file. It can finish with instructions to zip up the directory or something. Or maybe the tool should import a zip library and do that step too.

First part is what the export tool already does. Second part is up to the user, they just need to provide a directory to the import tool (as options)

  • Actually, given that it needs to be used by a standalone tool, I think I'll start by making it a standalone function: if we find some use case for invoking it directly from cosmic-swingset, we can change that later.

Import tool already "supports" this combination, but just throws since it has nothing to call. I can go either standalone or hostStorage methods. One interesting question is whether we want to support concurrent repopulation while swingset is running in another connection. I don't think so, but anyway that's not the issue for that.

@warner
Member

warner commented Aug 10, 2023

However replay is special in that it might fail, even though the swingstore is capable of supporting normal operations. operational will never fail (because swingstores must always have the current spans)

The operational import can fail if we don't provide all the operational artifacts. The fact that operational export doesn't fail is an artifact of the invariants we enforce on the swing-store. Also I'd like archival to fail if not all previous incarnations transcript artifacts are present, to support scenarios like getting the data necessary for a full Manchurian replay.

Ah, yeah, I was only talking about export failing. For sure any import can fail for a variety of reasons.

I can make import(archival) fail. I guess I'll just have it check that all transcript items starting from pos=0 are present (i.e. SELECT COUNT(*) FROM transcriptItems WHERE vatID=? must equal SELECT endPos FROM transcriptSpans WHERE vatID=? AND isCurrent=1, plus or minus a fencepost), rather than making assumptions about incarnation starting at 0.
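
As a rough illustration of that count comparison, here is an in-memory version. It is a sketch under assumptions, not the actual check: it assumes item positions start at 0, are unique per vat, and that `endPos` is exclusive (which is where the fencepost caveat would bite); the function name and record shapes are hypothetical.

```javascript
// Sketch only: assumes positions start at 0 and endPos is exclusive,
// so a complete vat has exactly `endPos` transcript items.
function isArchivalComplete(vatID, spans, items) {
  const current = spans.find(span => span.vatID === vatID && span.isCurrent);
  if (!current) {
    return false; // no current span record: nothing to compare against
  }
  // Equivalent of SELECT COUNT(*) FROM transcriptItems WHERE vatID=?
  const count = items.filter(item => item.vatID === vatID).length;
  // Equivalent of SELECT endPos FROM transcriptSpans
  //   WHERE vatID=? AND isCurrent=1
  return count === current.endPos;
}
```

Comparing a count against `endPos` avoids any assumption about where each incarnation begins, at the cost of not saying which positions are missing.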

No need, that's what cosmic-swingset/src/{export,import}-kernel-db.js does.

Awesome, less work for me :).

One interesting question is whether we want to support concurrent repopulation while swingset is running in another connection. I don't think so, but anyway that's not the issue for that.

Yeah, I think concurrent repopulation would make it easier to use (don't need to halt your validator while you do the repopulation), but it can wait for another day.

warner added a commit that referenced this issue Aug 10, 2023
Previously, `makeSwingStoreExporter()` took a positional argument
named `exportMode`, with values of 'current', 'archival', or
'debug'. This controlled how many artifacts were included in the
export, on a best-effort basis (e.g. a DB whose old spans were pruned
would emit the same artifacts with either 'current' or 'archival').

`importSwingStore()` took an options bag with both the
`makeSwingStore` options (like `keepSnapshots` and `keepTranscripts`),
and an import-specific `includeHistorical` boolean, which controlled
which artifacts were processed by the import. This was also on a
best-effort basis: `includeHistorical: true` on an export dataset that
lacked old spans would produce the same (pruned) DB as `false`.

This commit changes both APIs to take an options bag with a common
`artifactMode` option, with values of `operational`, `replay`,
`archival`, or `debug`. The `operational` choice replaces `current`
and behaves the same way: just enough data for normal operations. The
new `replay` choice sits between 'operational' and 'archival', and selects
transcript spans for the current incarnation of each vat, but omits
transcript spans for old incarnations: enough to perform a full
vat-replay of the latest incarnation.

Note: `makeSwingStoreExporter` was changed from a positional argument
to an options bag, and no attempt was made to be compatible with
old-style callers.

During export, the mode is now strict: if the DB lacks the artifacts
requested by the given mode, `makeSwingStoreExporter()` will throw an
error, rather than emit fewer artifacts than desired. This means
`artifactMode: 'replay'` will fail unless the DB being exported has
all those old (current-incarnation) transcript items. And `archival`
will fail unless the DB has the old incarnation spans too. The `debug`
mode is best-effort, and emits everything available without the
additional completeness checks.

During import, the mode applies both an import filter and a
completeness check. So exporting with `archival` but importing with
`operational` will get you a pruned DB, lacking anything
historical. Exporting with `operational` and importing with `replay`
or `archival` will fail, because the newly-populated DB does not
contain any historical artifacts.

closes #8105
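
The export/import interaction described in this commit message can be summarized as a small lookup. This is a simplified sketch, not swing-store code: the function name `importOutcome` and its return strings are invented, it assumes the source DB actually held historical artifacts that a lower export mode omitted, and it assumes that importing with `debug` performs no completeness check (the commit message only states that for export).

```javascript
// Simplified sketch: names and return values are hypothetical.
// Strict modes in increasing order of what they demand.
const STRICT_MODES = ['operational', 'replay', 'archival'];

function importOutcome(exportMode, importMode) {
  const known = [...STRICT_MODES, 'debug'];
  if (!known.includes(exportMode) || !known.includes(importMode)) {
    throw Error(`unknown artifactMode ${exportMode} / ${importMode}`);
  }
  // Assumption: 'debug' import is best-effort, like 'debug' export.
  if (importMode === 'debug') {
    return 'ok';
  }
  // A 'debug' export is best-effort, so anything beyond the
  // operational set depends on what the source DB happened to hold.
  if (exportMode === 'debug') {
    return importMode === 'operational' ? 'ok' : 'depends-on-data';
  }
  // Otherwise the import filter prunes down to importMode, and the
  // completeness check fails if the export was made at a lower mode.
  return STRICT_MODES.indexOf(importMode) <= STRICT_MODES.indexOf(exportMode)
    ? 'ok'
    : 'fails';
}

console.log(importOutcome('archival', 'operational')); // pruned DB
console.log(importOutcome('operational', 'replay')); // completeness check fails
```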
warner added a commit that referenced this issue Aug 15, 2023
warner added a commit that referenced this issue Aug 15, 2023
mhofman pushed a commit that referenced this issue Aug 15, 2023
mhofman pushed a commit that referenced this issue Aug 15, 2023
mhofman pushed a commit that referenced this issue Aug 16, 2023
mhofman pushed a commit that referenced this issue Aug 16, 2023