Improved Atlas shutdown and crash tolerance (#3082) #3086

kantai · 2022-04-08T18:06:06Z

Description

This PR replaces the rx/tx synchronized channel used to broadcast AttachmentInstances from the chain-coordinator thread to the p2p thread with direct writes to the AtlasDB. This PR preserves the p2p thread's behavior -- new attachment instances are written as "queued" by the coordinator, but aren't checked against extant attachment data until the p2p thread passes over the queued instances. After this, the behavior is exactly the same as before, with new attachment notices being passed through the network result object.

Applicable issues

Fixes #3082 (Improve AtlasDB crash and shutdown tolerance)

Additional info (benefits, drawbacks, caveats)

The number of queued instances checked is limited by a new constant.
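
To illustrate the limit, here is a minimal sketch with an in-memory stand-in for the AtlasDB. The names `MAX_QUEUED_PER_PASS`, `FakeDb`, `queue`, and `check_queued`, and the value 3, are assumptions made for this example; the PR text does not name the constant or give its value.

```rust
/// Hypothetical cap: the PR only says "a new const" and gives no name or value.
const MAX_QUEUED_PER_PASS: usize = 3;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Queued,
    Checked,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Instance {
    id: u64,
    status: Status,
}

/// In-memory stand-in for the attachment instances table (not the real AtlasDB).
struct FakeDb {
    rows: Vec<Instance>,
}

impl FakeDb {
    /// Coordinator side: record a new instance as queued.
    fn queue(&mut self, id: u64) {
        self.rows.push(Instance { id, status: Status::Queued });
    }

    /// Downloader side: mark at most MAX_QUEUED_PER_PASS queued rows as
    /// checked and return their ids. Remaining rows stay queued for a
    /// later pass.
    fn check_queued(&mut self) -> Vec<u64> {
        let mut checked = Vec::new();
        for row in self.rows.iter_mut() {
            if checked.len() == MAX_QUEUED_PER_PASS {
                break;
            }
            if row.status == Status::Queued {
                row.status = Status::Checked;
                checked.push(row.id);
            }
        }
        checked
    }
}
```

With five queued rows and a cap of three, the first pass handles three rows and the second pass handles the remaining two, so a large backlog is spread over several passes instead of being processed at once.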

Checklist

Test coverage for the modified code paths should be provided by the existing atlas stress and atlas integration tests.
Changelog is updated
Along with the changes, I've added some rustdocs to portions of the atlas code I touched, which explain the intended life cycle of attachment instances

codecov · 2022-04-08T18:42:34Z

Codecov Report

Merging #3086 (2b70498) into develop (dac322d) will increase coverage by 29.16%.
The diff coverage is 67.63%.

@@             Coverage Diff              @@
##           develop    #3086       +/-   ##
============================================
+ Coverage    31.26%   60.42%   +29.16%     
============================================
  Files          298      240       -58     
  Lines       275135   131041   -144094     
============================================
- Hits         86012    79178     -6834     
+ Misses      189123    51863   -137260

Impacted Files	Coverage Δ
src/net/atlas/mod.rs	`86.44% <ø> (ø)`
src/net/mod.rs	`49.75% <ø> (+31.10%)`	⬆️
src/net/atlas/download.rs	`7.56% <44.28%> (-70.84%)`	⬇️
src/net/atlas/db.rs	`68.54% <70.92%> (-8.95%)`	⬇️
testnet/stacks-node/src/node.rs	`84.31% <72.72%> (+0.05%)`	⬆️
src/chainstate/coordinator/mod.rs	`84.29% <84.21%> (+2.94%)`	⬆️
src/net/p2p.rs	`65.48% <100.00%> (+11.75%)`	⬆️
testnet/stacks-node/src/neon_node.rs	`83.86% <100.00%> (-0.19%)`	⬇️
testnet/stacks-node/src/run_loop/helium.rs	`94.96% <100.00%> (+0.06%)`	⬆️
testnet/stacks-node/src/run_loop/neon.rs	`82.80% <100.00%> (-1.34%)`	⬇️
... and 223 more


jcnelson · 2022-04-13T01:06:59Z

src/net/atlas/db.rs

+const ATLASDB_SCHEMA_2: &'static [&'static str] = &[
+    r#"
+    ALTER TABLE attachment_instances
+    ADD    status INTEGER NOT NULL


I'm surprised sqlite doesn't force you to provide a default value here.

It doesn't. I removed "NOT NULL", because the default value defeats the point of NOT NULL here anyway. I also added a test where the database actually undergoes migration.

src/net/atlas/download.rs

jcnelson · 2022-04-13T01:26:35Z

src/net/atlas/tests.rs

+            .queue_attachment_instance(attachment_instance)
+            .unwrap();
+        atlas_db
+            .mark_attachment_instance_checked(attachment_instance, true)


Is there test coverage for simulating a crash and recovery? i.e. Is there a test somewhere that convinces us that if the node crashed and was restarted, the attachment system would discover that there were attachments queued but not checked?

No, but I can try to add something along those lines. It won't be a "crash", but instead a stop-and-restart.

Added in d3bf291: chainstate::coordinator::tests::atlas_stop_start

jcnelson · 2022-04-13T01:29:14Z

src/chainstate/coordinator/mod.rs

-                            Ok(_) => {}
-                            Err(e) => {
-                                error!("Atlas: error dispatching attachments {}", e);
+                        if let Some(atlas_db) = self.atlas_db.as_mut() {


Correct me if I'm wrong, but doesn't this call to queue_attachment_instance() happen outside of block-processing? Like, if a node processed a block and then crashed before reaching this code path, would it be able to enqueue the block's attachment data on restart?

Would it be possible to commit to the attachment data from the event stream from within StacksChainState::append_block(), right before we commit to the Stacks block? Enqueuing attachment instances is idempotent, so re-doing them in the event the node crashes in-between saving them and committing the block should be safe.

Yes, but I'd like to do that in a separate PR. It's true that this change doesn't make the atlas queue entirely crash proof, but it removes the 2-thread dependency for the atlas queue, which makes the shutdown behavior much better already.

This PR is already changing a lot of how Atlas works, so I'd prefer to not also move atlas queueing logic into append_block as part of the same changeset.
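
For reference, the ordering proposed above (queue the attachment instances first, then commit the block, and rely on idempotent enqueueing after a crash) can be sketched as follows. This is only an illustration under stated assumptions: `FakeAtlasDb`, `FakeNode`, the `(block id, content hash)` key, and the `crash_before_commit` flag are all hypothetical and do not reflect the actual `StacksChainState::append_block()` code.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Queued,
    Checked,
}

/// Stand-in for the attachment instances table, keyed by a hypothetical
/// (block id, content hash) pair.
#[derive(Default)]
struct FakeAtlasDb {
    instances: BTreeMap<(u64, String), Status>,
}

impl FakeAtlasDb {
    /// Idempotent enqueue: inserting the same key again leaves the
    /// existing row and its status untouched.
    fn queue_instance(&mut self, block_id: u64, content_hash: &str) {
        self.instances
            .entry((block_id, content_hash.to_string()))
            .or_insert(Status::Queued);
    }

    /// Mark an existing instance as checked; unknown keys are ignored.
    fn mark_checked(&mut self, block_id: u64, content_hash: &str) {
        if let Some(status) = self.instances.get_mut(&(block_id, content_hash.to_string())) {
            *status = Status::Checked;
        }
    }
}

#[derive(Default)]
struct FakeNode {
    atlas: FakeAtlasDb,
    committed_blocks: Vec<u64>,
}

impl FakeNode {
    /// Queue the block's attachments, then commit the block.
    /// `crash_before_commit` simulates the node dying between the two
    /// steps; the function returns whether the block was committed.
    fn append_block(&mut self, block_id: u64, hashes: &[&str], crash_before_commit: bool) -> bool {
        for hash in hashes {
            self.atlas.queue_instance(block_id, hash);
        }
        if crash_before_commit {
            return false;
        }
        if !self.committed_blocks.contains(&block_id) {
            self.committed_blocks.push(block_id);
        }
        true
    }
}
```

In this model a crash leaves queued rows for a block that is not yet committed, and reprocessing the block after restart neither duplicates those rows nor resets one that was already checked. With the opposite ordering (commit first, queue afterwards), the same crash would leave a committed block with no queued rows, which is the window described in the review.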

jcnelson

Did an initial pass. My biggest question is whether or not this PR allows the node to recover if it crashes while in the middle of the block. I can't convince myself that it does -- it seems this saves attachment events after the block that produced them is committed, which I think means that there's a window of time in which the block is saved, but the attachments are not. This would make it possible for the node to crash in such a way that it loses attachment events.

pavitthrap

lgtm except +1 on Jude's comment on append_block

src/net/atlas/db.rs

jcnelson · 2023-02-07T16:22:27Z

This needs to be looked at.


kantai · 2023-02-23T21:54:01Z

Okay -- this PR should be ready for re-review!

jcnelson · 2023-02-28T18:41:03Z

src/net/atlas/db.rs

-    pub fn insert_uninstantiated_attachment_instance(
+    /// Queue a new attachment instance, status will be set to "queued",
+    /// and the is_available field set to false
+    pub fn queue_attachment_instance(


Can you document which thread calls this function? Same for insert_initial_attachment_instance.

Sure, however, I'm somewhat hesitant to document the caller from the called method. If atlas event processing is moved from the coordinator's purview to db::blocks or whatever, that move can occur without ever touching this function, which could easily lead to a stale rustdoc here.

Added in cb9f8dc

jcnelson · 2023-02-28T18:44:58Z

src/net/atlas/download.rs

+    /// Check any queued attachment instances to see if we already have data for them,
+    ///  returning a vector of (instance, attachment) pairs for any of the queued attachments
+    ///  which already had the associated data
+    /// Marks any processed attachments as checked


Which thread calls this function?

This method is invoked in the thread managing the AttachmentsDownloader. This is currently the P2P thread. I'll add that to the comment.

Added in cb9f8dc

testnet/stacks-node/src/node.rs

jcnelson · 2023-02-28T18:50:09Z

src/net/atlas/db.rs

+/// is completed, any checked instances are updated to `Checked`.
+pub enum AttachmentInstanceStatus {
+    /// This variant indicates that the attachments instance has been written,
+    /// but the downloader has not yet checked that the attachment matched


By "downloader" you mean the "downloader in the p2p thread", right?

Yes, the AttachmentsDownloader. I'll update the comment to make that clearer.

Added in cb9f8dc

jcnelson · 2023-02-28T19:02:19Z

src/net/p2p.rs

@@ -5393,7 +5392,7 @@ impl PeerNetwork {
        // enqueue them.
        PeerNetwork::with_attachments_downloader(self, |network, attachments_downloader| {
            let mut known_attachments = attachments_downloader
-                .enqueue_new_attachments(attachment_requests, &mut network.atlasdb, false)
+                .check_queued_attachment_instances(&mut network.atlasdb)


Can this be moved to the relayer thread, as part of the process_network_result() call? In general, I'd keep as much unbounded disk I/O out of the p2p thread as possible.

It's fine if you don't do this in this PR. If not, then please open an issue and assign it to me to do this.

I'll open an issue for that -- it would entail moving the atlas downloader from the PeerNetwork struct into the Relayer struct.

jcnelson · 2023-02-28T19:04:20Z

src/net/atlas/db.rs

@@ -47,7 +52,16 @@ use crate::types::chainstate::StacksBlockId;

 use super::{AtlasConfig, Attachment, AttachmentInstance};

-pub const ATLASDB_VERSION: &'static str = "1";
+pub const ATLASDB_VERSION: &'static str = "2";


Do you think you could add some file-level documentation about the states an attachment goes through before it's either committed or rejected from the DB? I'm having a hard time understanding from the current comments what states an attachment can be in. At maximum, it looks like there are four, but not all of them make semantic sense:

(unavailable, queued): The node got the attachment event, but has not yet received any data for it

(available, queued): The node got the data, but not the corresponding attachment event?

(unavailable, checked): ???

(available, checked): The node saw the attachment event, has the data, and has authenticated it

If I'm understanding this right, you've just moved the attachment channel to the DB. The chains coordinator logs all attachment events it encounters straight into the DB, with the Queued status. Later on, when the p2p thread receives attachment data, it authenticates the data against "queued" attachment events and if the data matches one or more of them, it (1) marks the attachment as checked, (2) stores the attachment data, and (3) marks it as available. Is that correct?

Added module-level rustdocs (and more information to the other rustdocs) that talk more about this (cb9f8dc)

"Availability" and "checked" have little to do with each other (except that the availability field has no meaning until an attachment has been checked). "is available" means that the atlas content is already stored in the stacks-node's atlas database. "checked" means that the atlas downloader has acknowledged that a new content hash has been added to the system (i.e., a new attachment instance). So, "(unavailable, checked)" is indeed possible and part of the normal flow of the system: it just means that the atlas downloader hasn't downloaded the corresponding content yet.

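As a rough model of the life cycle described in this thread, the sketch below tracks the status and availability of one attachment instance. `AttachmentInstanceStatus` and `Checked` appear in the diff; the `Queued` variant name, the `Instance` struct, and the methods `queued`, `check`, and `content_stored` are assumptions made for this example, not the code in `src/net/atlas/db.rs`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AttachmentInstanceStatus {
    /// Written by the coordinator, not yet seen by the downloader.
    Queued,
    /// Acknowledged by the downloader.
    Checked,
}

/// Hypothetical in-memory view of one attachment instance row.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Instance {
    pub status: AttachmentInstanceStatus,
    /// Only meaningful once `status` is `Checked`.
    pub is_available: bool,
}

impl Instance {
    /// New instances start as queued with `is_available` set to false.
    pub fn queued() -> Self {
        Instance {
            status: AttachmentInstanceStatus::Queued,
            is_available: false,
        }
    }

    /// Downloader pass: acknowledge a queued instance and record whether
    /// its content is already stored. Checked instances are left alone.
    pub fn check(&mut self, content_in_db: bool) {
        if self.status == AttachmentInstanceStatus::Queued {
            self.status = AttachmentInstanceStatus::Checked;
            self.is_available = content_in_db;
        }
    }

    /// Called once the content has been downloaded and stored; it has no
    /// effect on an instance that has not been checked yet.
    pub fn content_stored(&mut self) {
        if self.status == AttachmentInstanceStatus::Checked {
            self.is_available = true;
        }
    }
}
```

In this model, "(unavailable, checked)" is the ordinary intermediate state between `check(false)` and `content_stored()`, and "(available, checked)" is reached directly when the content was already present at check time.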

pavitthrap

lgtm

src/net/atlas/db.rs

src/net/atlas/tests.rs

jcnelson

Thanks @kantai! LGTM

blockstack-devops · 2024-11-13T00:21:26Z

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

kantai added 3 commits April 6, 2022 17:47

feat: replace atlas sync channel with direct db insert

bdac625

add limit to queue result size, clean out attachments_rx/tx instances

3a7c700

update CHANGELOG add more rustdocs

6bac33f

kantai linked an issue Apr 8, 2022 that may be closed by this pull request

Improve AtlasDB crash and shutdown tolerance #3082

Closed

kantai self-assigned this Apr 8, 2022

kantai requested a review from jcnelson April 8, 2022 18:06

jcnelson added the release 2.05.0.2.0 label Apr 11, 2022

jcnelson reviewed Apr 13, 2022


jcnelson reviewed Apr 13, 2022


jcnelson removed the release 2.05.0.2.0 label Apr 18, 2022

pavitthrap approved these changes Apr 22, 2022


jcnelson added the frozen PRs that are on hold label Jul 11, 2022

kantai and others added 3 commits February 8, 2023 11:40

Merge and fix conflicts from remote-tracking branch 'origin/develop' …

f823c3f

…into feat/atlas-tolerance-3082

cleanup warns

97ee5f6

Merge branch 'develop' into feat/atlas-tolerance-3082

fa63a49

jcnelson mentioned this pull request Feb 21, 2023

node halts [mainnet] #3456

Open

kantai added 4 commits February 22, 2023 13:40

Merge branch 'develop' into feat/atlas-tolerance-3082

04514c4

test: add unit test for schema migration

6eb017d

chore: add comments/docstrings

d9a5e18

test: atlas instance queuing through restart

d3bf291

kantai requested a review from jcnelson February 23, 2023 21:52

kantai added enhancement Iterations on existing features or infrastructure. deployment BNS and removed frozen PRs that are on hold labels Feb 23, 2023

jcnelson reviewed Feb 28, 2023


jcnelson reviewed Feb 28, 2023


kantai added 2 commits February 28, 2023 17:48

docs: add more rustdocs

cb9f8dc

Merge remote-tracking branch 'origin/develop' into feat/atlas-toleran…

9b9d227

…ce-3082

kantai mentioned this pull request Mar 1, 2023

Move Atlas attachment storage I/O out of p2p thread #3595

Open

pavitthrap approved these changes Mar 1, 2023


chore: address PR feedback

9b97a39

kantai requested a review from jcnelson March 2, 2023 20:47

jcnelson approved these changes Mar 6, 2023


Merge branch 'develop' into feat/atlas-tolerance-3082

2b70498

kantai merged commit 154b316 into develop Mar 6, 2023

kantai mentioned this pull request Mar 6, 2023

Improve AtlasDB crash and shutdown tolerance #3082

Closed

kantai mentioned this pull request Mar 23, 2023

Queue new atlas instances before committing block operations #3634

Open

blockstack-devops added the locked label Nov 13, 2024

stacks-network locked as resolved and limited conversation to collaborators Nov 13, 2024
