Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved Atlas shutdown and crash tolerance (#3082) #3086

Merged
merged 14 commits into from
Mar 6, 2023

Conversation

kantai
Copy link
Member

@kantai kantai commented Apr 8, 2022

Description

This PR replaces the rx/tx synchronized channel used to broadcast AttachmentInstances from the chain-coordinator thread to the p2p thread with direct writes to the AtlasDB. This PR preserves the p2p thread's behavior -- new attachment instances are written as "queued" by the coordinator, but aren't checked against extant attachment data until the p2p thread passes over the queued instances. After this, the behavior is exactly the same as before, with new attachment notices being passed through the network result object.

Applicable issues

Additional info (benefits, drawbacks, caveats)

  • The amount of queued instances checked is limited by a new const

Checklist

  • Test coverage for the modified code paths should be provided by the existing atlas stress and atlas integration tests.
  • Changelog is updated
  • Along with the changes, I've added some rustdocs to portions of the atlas code I touched, which explain the intended life cycle of attachment instances

@kantai kantai linked an issue Apr 8, 2022 that may be closed by this pull request
@kantai kantai self-assigned this Apr 8, 2022
@kantai kantai requested a review from jcnelson April 8, 2022 18:06
@codecov
Copy link

codecov bot commented Apr 8, 2022

Codecov Report

Merging #3086 (2b70498) into develop (dac322d) will increase coverage by 29.16%.
The diff coverage is 67.63%.

@@             Coverage Diff              @@
##           develop    #3086       +/-   ##
============================================
+ Coverage    31.26%   60.42%   +29.16%     
============================================
  Files          298      240       -58     
  Lines       275135   131041   -144094     
============================================
- Hits         86012    79178     -6834     
+ Misses      189123    51863   -137260     
Impacted Files Coverage Δ
src/net/atlas/mod.rs 86.44% <ø> (ø)
src/net/mod.rs 49.75% <ø> (+31.10%) ⬆️
src/net/atlas/download.rs 7.56% <44.28%> (-70.84%) ⬇️
src/net/atlas/db.rs 68.54% <70.92%> (-8.95%) ⬇️
testnet/stacks-node/src/node.rs 84.31% <72.72%> (+0.05%) ⬆️
src/chainstate/coordinator/mod.rs 84.29% <84.21%> (+2.94%) ⬆️
src/net/p2p.rs 65.48% <100.00%> (+11.75%) ⬆️
testnet/stacks-node/src/neon_node.rs 83.86% <100.00%> (-0.19%) ⬇️
testnet/stacks-node/src/run_loop/helium.rs 94.96% <100.00%> (+0.06%) ⬆️
testnet/stacks-node/src/run_loop/neon.rs 82.80% <100.00%> (-1.34%) ⬇️
... and 223 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

const ATLASDB_SCHEMA_2: &'static [&'static str] = &[
r#"
ALTER TABLE attachment_instances
ADD status INTEGER NOT NULL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised sqlite doesn't force you to provide a default value here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't. I removed "NOT NULL", because the default value defeats the point of not null here anyways. Also, added a test where the database actually undergoes migration.

.queue_attachment_instance(attachment_instance)
.unwrap();
atlas_db
.mark_attachment_instance_checked(attachment_instance, true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there test coverage for simulating a crash and recovery? i.e. Is there a test somewhere that convinces us that if the node crashed and was restarted, the attachment system would discover that there were attachments queued but not checked?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but I can try to add something along those lines. It won't be a "crash", but instead a stop-and-restart.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in d3bf291: chainstate::coordinator::tests::atlas_stop_start

Ok(_) => {}
Err(e) => {
error!("Atlas: error dispatching attachments {}", e);
if let Some(atlas_db) = self.atlas_db.as_mut() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong, but doesn't this call to queue_attachment_instance() happen outside of block-processing? Like, if a node processed a block and then crashed before reaching this code path, would it be able to enqueue the block's attachment data on restart?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to commit to the attachment data from the event stream from within StacksChainState::append_block(), right before we commit to the Stacks block? Enqueuing attachment instances is idempotent, so re-doing them in the event the node crashes in-between saving them and committing the block should be safe.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I'd like to do that in a separate PR. It's true that this change doesn't make the atlas queue entirely crash proof, but it removes the 2-thread dependency for the atlas queue, which makes the shutdown behavior much better already.

This PR is already changing a lot of how Atlas works, so I'd prefer to not also move atlas queueing logic into append_block as part of the same changeset.

Copy link
Member

@jcnelson jcnelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did an initial pass. My biggest question is whether or not this PR allows the node to recover if it crashes while in the middle of the block. I can't convince myself that it does -- it seems this saves attachment events after the block that produced them is committed, which I think means that there's a window of time in which the block is saved, but the attachments are not. This would make it possible for the node to crash in such a way that it loses attachment events.

Copy link
Contributor

@pavitthrap pavitthrap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm except +1 on Jude's comment on append_block

src/net/atlas/db.rs Show resolved Hide resolved
@jcnelson jcnelson added the frozen PRs that are on hold label Jul 11, 2022
@jcnelson
Copy link
Member

jcnelson commented Feb 7, 2023

This needs to be looked at.

@kantai kantai requested a review from jcnelson February 23, 2023 21:52
@kantai kantai added enhancement Iterations on existing features or infrastructure. deployment BNS and removed frozen PRs that are on hold labels Feb 23, 2023
@kantai
Copy link
Member Author

kantai commented Feb 23, 2023

Okay -- this PR should be ready for re-review!

pub fn insert_uninstantiated_attachment_instance(
/// Queue a new attachment instance, status will be set to "queued",
/// and the is_available field set to false
pub fn queue_attachment_instance(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you document which thread calls this function? Same for insert_initial_attachment_instance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, however, I'm somewhat hesitant to document the caller from the called method. If atlas event processing is moved from the coordinator's purview to db::blocks or whatever, that move can occur without ever touching this function, which could easily lead to a stale rustdoc here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in cb9f8dc

/// Check any queued attachment instances to see if we already have data for them,
/// returning a vector of (instance, attachment) pairs for any of the queued attachments
/// which already had the associated data
/// Marks any processed attachments as checked
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which thread calls this function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is invoked in the thread managing the AttachmentDownloader. This is currently the P2P thread. Will add to the comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in cb9f8dc

/// is completed, any checked instances are updated to `Checked`.
pub enum AttachmentInstanceStatus {
/// This variant indicates that the attachments instance has been written,
/// but the downloader has not yet checked that the attachment matched
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By "downloader" you mean the "downloader in the p2p thread", right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the AttachmentsDownloader, will update the comment to make more clear

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in cb9f8dc

@@ -5393,7 +5392,7 @@ impl PeerNetwork {
// enqueue them.
PeerNetwork::with_attachments_downloader(self, |network, attachments_downloader| {
let mut known_attachments = attachments_downloader
.enqueue_new_attachments(attachment_requests, &mut network.atlasdb, false)
.check_queued_attachment_instances(&mut network.atlasdb)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be moved to the relayer thread, as part of the process_network_result() call? In general, I'd keep as much unbound disk I/O out of the p2p thread as possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine if you don't do this in this PR. If not, then please open an issue and assign it to me to do this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll open an issue for that -- it would entail moving the atlas downloader from the PeerNetwork struct into the Relayer struct.

@@ -47,7 +52,16 @@ use crate::types::chainstate::StacksBlockId;

use super::{AtlasConfig, Attachment, AttachmentInstance};

pub const ATLASDB_VERSION: &'static str = "1";
pub const ATLASDB_VERSION: &'static str = "2";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think you could add some file-level documentation about the states an attachment goes through before it's either committed or rejected from the DB? I'm having a hard time understanding from the current comments what states an attachment can be in. At maximum, it looks like there are four, but not all of them make semantic sense:

  • (unavailable, queued): The node got the attachment event, but has not yet received any data for it
  • (available, queued): The node got the data, but not the corresponding attachment event?
  • (unavailable, checked): ???
  • (available, checked): The node saw the attachment event, has the data, and has authenticated it

If I'm understanding this right, you've just moved the attachment channel to the DB. The chains coordinator logs all attachment events it encounters straight into the DB, with the Queued status. Later on, when the p2p thread receives attachment data, it authenticates the data against "queued" attachment events and if the data matches one or more of them, it (1) marks the attachment as checked, (2) stores the attachment data, and (3) marks it as available. Is that correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added module level rustdocs (and more information to the other rustdocs) that talks more about this (cb9f8dc)

"Availability" and "checked" have little to do with each other (except that the availability field has no meaning until an attachment has been checked). "is available" means that the atlas content is already stored in the stacks-node's atlas database. "checked" means that the atlas downloader has acknowledged that a new content hash has been added to the system (i.e., a new attachment instance). So, "(unavailable, checked)" is indeed possible and part of the normal flow of the system: it just means that the atlas downloader hasn't downloaded the corresponding content yet.

Copy link
Contributor

@pavitthrap pavitthrap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

src/net/atlas/db.rs Outdated Show resolved Hide resolved
src/net/atlas/db.rs Outdated Show resolved Hide resolved
src/net/atlas/db.rs Outdated Show resolved Hide resolved
src/net/atlas/tests.rs Show resolved Hide resolved
@kantai kantai requested a review from jcnelson March 2, 2023 20:47
Copy link
Member

@jcnelson jcnelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kantai! LGTM

@blockstack-devops
Copy link
Contributor

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@stacks-network stacks-network locked as resolved and limited conversation to collaborators Nov 13, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
BNS enhancement Iterations on existing features or infrastructure. locked
Projects
Status: Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

Improve AtlasDB crash and shutdown tolerance
4 participants