Saga pantry attach/detach: fail if pantry is removed from DNS #6866

Open
wants to merge 7 commits into
base: main

jgallagher
Contributor

Removal from DNS should correspond with the pantry zone's expungement, which should result in a saga node failure.

I'm marking this as a draft because it's completely untested. @jmpesp any suggestions for testing? I skimmed over the Nexus integration tests for a test to expand or emulate and didn't see anything in this area, but definitely could've missed it (they are legion!). I could construct a local unit test of the attach/detach functions, but it would require a fair bit of setup machinery (a fake crucible pantry, modifying the DNS records, ...). Maybe worth it if there's no easy way to test this?
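
As a rough illustration of the condition described above (not the code in this PR), the "gone" decision can be thought of as a membership test against the addresses currently resolved from internal DNS. The helper names `pantry_is_gone` and `check_pantry_still_present` are hypothetical, and the error text is borrowed from the Nexus log output later in this conversation; the sketch assumes the caller has already performed the DNS lookup.

```rust
use std::net::SocketAddrV6;

/// Hypothetical helper: treat the pantry as "gone" when its address is no
/// longer among the addresses resolved from internal DNS.
fn pantry_is_gone(target: SocketAddrV6, resolved: &[SocketAddrV6]) -> bool {
    !resolved.contains(&target)
}

/// Hypothetical saga-node step: turn "gone" into a node failure instead of
/// retrying. The message mirrors the one seen in the Nexus log.
fn check_pantry_still_present(
    target: SocketAddrV6,
    resolved: &[SocketAddrV6],
) -> Result<(), String> {
    if pantry_is_gone(target, resolved) {
        Err("pantry attach failed: remote server is gone".to_string())
    } else {
        Ok(())
    }
}
```

Keeping the check separate from the call itself means the same predicate could be reused by the attach and detach paths, though whether the real code is factored that way is an assumption here.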

jgallagher added a commit that referenced this pull request Oct 14, 2024
This builds on #6866. After its changes, there was only one caller of
`retry_until_known_result` left; this PR removes it. We keep the retry
loop, but instead of retrying forever, we bail out if the sled we're
trying to reach is "gone", as determined by "is it no longer
in-service", which in practice means it's been expunged.
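
A minimal sketch of the retry behavior this commit message describes, written synchronously with only the standard library. The names `retry_until_gone`, `CallError`, and `RetryError` are hypothetical and are not the helpers in the omicron tree; the real code is async and has its own error types. Checking for "gone" before each attempt is also an assumption of this sketch.

```rust
use std::time::Duration;

/// Hypothetical classification of a failed remote call.
#[derive(Debug, PartialEq)]
enum CallError {
    /// The request may not have reached the server; retrying is reasonable.
    Transient(String),
    /// The server gave a definitive failure; retrying will not help.
    Permanent(String),
}

/// Hypothetical outcome when the retry loop gives up.
#[derive(Debug, PartialEq)]
enum RetryError {
    /// The target is no longer in service (for example, it was expunged).
    Gone,
    /// The operation failed permanently.
    Failed(String),
}

/// Retry `operation` on transient errors, but stop as soon as `is_gone`
/// reports that the target no longer exists.
fn retry_until_gone<T>(
    mut operation: impl FnMut() -> Result<T, CallError>,
    mut is_gone: impl FnMut() -> bool,
    backoff: Duration,
) -> Result<T, RetryError> {
    loop {
        // Check first so we never call a target already known to be gone.
        if is_gone() {
            return Err(RetryError::Gone);
        }
        match operation() {
            Ok(value) => return Ok(value),
            Err(CallError::Permanent(msg)) => {
                return Err(RetryError::Failed(msg));
            }
            // Transient failure: wait, then go around the loop again.
            Err(CallError::Transient(_)) => std::thread::sleep(backoff),
        }
    }
}
```

The point of the design is that the loop terminates on one of three events (success, permanent failure, or the target disappearing) rather than only on the first two.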
Contributor

@jmpesp jmpesp left a comment

I'll have to come back to your testing question, but the code looks correct to me

Comment on lines 1287 to 1292
// Importantly, _do not use `call_pantry_attach_for_disk`_! That retries
// as long as `pantry_address` is still resolvable in DNS, which we _do not
// want here_. The Pantry attach can fail if there's a racing Volume
// checkout to be sent to Propolis. Additionally, that call uses `attach`
// instead of `attach_activate_background`, which means it will hang on the
// activation.
Contributor

I would update this message to only

Importantly, _do not use `call_pantry_attach_for_disk`_! That call uses `attach` instead of `attach_activate_background`, which means it will hang on the activation.

At some point, some sort of retry should be added here, but that shouldn't block this PR.

Contributor Author

Updated in fb21fb9

@davepacheco davepacheco added this to the 12 milestone Oct 15, 2024
Collaborator

@davepacheco davepacheco left a comment

I'm not super familiar with this code, so I'm not sure how much confidence my +1 adds, but it looks good to me!

nexus/src/app/sagas/common_storage.rs (outdated, resolved)
@jgallagher
Contributor Author

After several false starts and finding legit bugs in the pantry qorb implementation (fixed in 62d3608), I was able to successfully test this in a4x2.

After the qorb pool did its every-30-second health checks to the pantry, I shut down the pantry service and then started a disk snapshot. The saga progressed up to node 13, then got stuck in the retry loop trying to attach, as expected; e.g.,

19:06:06.499Z WARN d928fa26-ac1c-4cc1-a686-6a165c60c4b8 (ServerContext): saw transient communication error error sending request for url (http://[fd00:1122:3344:102::7]:17000/crucible/pantry/0/volume/f27b642d-3979-476e-a7ae-5733f90d4455), retrying...
    file = common/src/progenitor_operation_retry.rs:124
    saga_id = b9936011-1c9a-4041-9abb-734481f6338f
    saga_name = snapshot-create
19:06:06.499Z WARN d928fa26-ac1c-4cc1-a686-6a165c60c4b8 (ServerContext): failed external call (ProgenitorError(Communication Error: error sending request for url (http://[fd00:1122:3344:102::7]:17000/crucible/pantry/0/volume/f27b642d-3979-476e-a7ae-5733f90d4455))), will retry in 19.24638114s
    file = common/src/progenitor_operation_retry.rs:177
    saga_id = b9936011-1c9a-4041-9abb-734481f6338f
    saga_name = snapshot-create

Several minutes later, I expunged the pantry, which removed it from internal DNS, at which point the saga node failed:

root@[fd00:1122:3344:102::4]:32221/omicron> select node_id,event_type,event_time from saga_node_event where saga_id = 'b9936011-1c9a-4041-9abb-734481f6338f' order by event_time asc;
... snip ...
       13 | started       | 2024-11-12 19:05:50.444103+00
       13 | failed        | 2024-11-12 19:23:20.29314+00
... snip ...

and Nexus logged that the failure was due to the target pantry no longer being present in DNS:

19:23:20.287Z DEBG d928fa26-ac1c-4cc1-a686-6a165c60c4b8 (ServerContext): recording saga event
    event_type = Failed(ActionFailed { source_error: String("pantry attach failed: remote server is gone") })
    node_id = 13
    saga_id = b9936011-1c9a-4041-9abb-734481f6338f
19:23:37.024Z WARN d928fa26-ac1c-4cc1-a686-6a165c60c4b8 (ServerContext): saga finished
    action_error_node_name = "call_pantry_attach_for_disk"
    action_error_source = ActionFailed { source_error: String("pantry attach failed: remote server is gone") }
    file = /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/steno-0.4.1/src/sec.rs:1044
    result = failure
    saga_id = b9936011-1c9a-4041-9abb-734481f6338f
    saga_name = snapshot-create
    sec_id = d928fa26-ac1c-4cc1-a686-6a165c60c4b8
    undo_result = success

I did not test the changed detach / snapshot codepaths, but I believe they're similar enough that testing attach is sufficient.

@jgallagher jgallagher marked this pull request as ready for review November 12, 2024 19:34