Saga pantry attach/detach: fail if pantry is removed from DNS #6866

jgallagher · 2024-10-14T19:56:02Z

Removal from DNS should correspond with the pantry zone's expungement, which should result in a saga node failure.

I'm marking this as a draft because it's completely untested. @jmpesp any suggestions for testing? I skimmed the Nexus integration tests looking for a test to expand or emulate and didn't see anything in this area, but I definitely could've missed it (they are legion!). I could construct a local unit test of the attach/detach functions, but it would require a fair bit of setup machinery (a fake crucible pantry, modifying the DNS records, ...). Maybe worth it if there's no easy way to test this?

This builds on #6866. After its changes, there was only one caller of `retry_until_known_result` left; this PR removes it. We keep the retry loop, but instead of retrying forever, we bail out if the sled we're trying to reach is "gone", meaning it is no longer in service, which in practice means it has been expunged.
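As a rough illustration of the "retry until gone" idea described above, here is a minimal synchronous sketch. It is not the code from this PR: `retry_until_gone`, `AttemptError`, `RetryError`, and the closure parameters are all hypothetical names, and the real implementation is async and lives elsewhere in the tree. The only point is the loop shape: check whether the target is gone before every attempt, and stop retrying once it is.

```rust
use std::time::Duration;

/// Result of a single attempt at the remote call (hypothetical type).
#[derive(Debug, PartialEq)]
enum AttemptError {
    /// Could not reach the server; worth retrying.
    Transient(String),
    /// The server returned an error that retrying will not fix.
    Permanent(String),
}

/// Why the loop stopped without a successful result (hypothetical type).
#[derive(Debug, PartialEq)]
enum RetryError {
    /// The target is no longer present (e.g. removed from DNS).
    Gone,
    /// The "is it gone?" check itself failed.
    GoneCheckFailed(String),
    /// The operation failed permanently.
    Failed(String),
}

/// Retry `operation` on transient errors, but give up as soon as
/// `is_gone` reports that the target no longer exists.
fn retry_until_gone<T>(
    mut operation: impl FnMut() -> Result<T, AttemptError>,
    mut is_gone: impl FnMut() -> Result<bool, String>,
    mut backoff: impl FnMut(u32) -> Duration,
) -> Result<T, RetryError> {
    let mut attempt = 0u32;
    loop {
        // Check before each attempt so a removed target fails promptly
        // instead of being retried forever.
        if is_gone().map_err(RetryError::GoneCheckFailed)? {
            return Err(RetryError::Gone);
        }
        match operation() {
            Ok(value) => return Ok(value),
            Err(AttemptError::Permanent(msg)) => {
                return Err(RetryError::Failed(msg));
            }
            Err(AttemptError::Transient(_)) => {
                // Wait, then go around again (including the gone check).
                std::thread::sleep(backoff(attempt));
                attempt = attempt.saturating_add(1);
            }
        }
    }
}
```

In a saga action, the `Gone` case would be mapped to a node failure so the saga unwinds, while transient errors keep the node in the retry loop.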

jmpesp

I'll have to come back to your testing question, but the code looks correct to me

jmpesp · 2024-10-15T16:37:14Z

nexus/src/app/sagas/region_replacement_drive.rs

+    // Importantly, _do not use `call_pantry_attach_for_disk`_! That retries
+    // as long as `pantry_address` is still resolvable in DNS, which we _do not
+    // want here_. The Pantry attach can fail if there's a racing Volume
+    // checkout to be sent to Propolis. Additionally, that call uses `attach`
+    // instead of `attach_activate_background`, which means it will hang on the
+    // activation.


I would update this comment to say only:

Importantly, _do not use `call_pantry_attach_for_disk`_! That call uses `attach` instead of `attach_activate_background`, which means it will hang on the activation.

At some point, some sort of retry should be added here, but that shouldn't block this PR.

Updated in fb21fb9

davepacheco

I'm not super familiar with this code so I'm not sure how much confidence my +1 adds but it looks good to me!

nexus/src/app/sagas/common_storage.rs

jgallagher · 2024-11-12T19:34:07Z

After several false starts and finding legit bugs in the pantry qorb implementation (fixed in 62d3608), I was able to successfully test this in a4x2.

After the qorb pool did its every-30-second health checks to the pantry, I shut down the pantry service and then started a disk snapshot. The saga progressed up to node 13, then got stuck in the retry loop trying to attach, as expected; e.g.,

19:06:06.499Z WARN d928fa26-ac1c-4cc1-a686-6a165c60c4b8 (ServerContext): saw transient communication error error sending request for url (http://[fd00:1122:3344:102::7]:17000/crucible/pantry/0/volume/f27b642d-3979-476e-a7ae-5733f90d4455), retrying...
    file = common/src/progenitor_operation_retry.rs:124
    saga_id = b9936011-1c9a-4041-9abb-734481f6338f
    saga_name = snapshot-create
19:06:06.499Z WARN d928fa26-ac1c-4cc1-a686-6a165c60c4b8 (ServerContext): failed external call (ProgenitorError(Communication Error: error sending request for url (http://[fd00:1122:3344:102::7]:17000/crucible/pantry/0/volume/f27b642d-3979-476e-a7ae-5733f90d4455))), will retry in 19.24638114s
    file = common/src/progenitor_operation_retry.rs:177
    saga_id = b9936011-1c9a-4041-9abb-734481f6338f
    saga_name = snapshot-create

Several minutes later, I expunged the pantry, which removed it from internal DNS, at which point the saga node failed:

root@[fd00:1122:3344:102::4]:32221/omicron> select node_id,event_type,event_time from saga_node_event where saga_id = 'b9936011-1c9a-4041-9abb-734481f6338f' order by event_time asc;
... snip ...
       13 | started       | 2024-11-12 19:05:50.444103+00
       13 | failed        | 2024-11-12 19:23:20.29314+00
... snip ...

and Nexus logged that the failure was due to the target pantry no longer being present in DNS:

19:23:20.287Z DEBG d928fa26-ac1c-4cc1-a686-6a165c60c4b8 (ServerContext): recording saga event
    event_type = Failed(ActionFailed { source_error: String("pantry attach failed: remote server is gone") })
    node_id = 13
    saga_id = b9936011-1c9a-4041-9abb-734481f6338f
19:23:37.024Z WARN d928fa26-ac1c-4cc1-a686-6a165c60c4b8 (ServerContext): saga finished
    action_error_node_name = "call_pantry_attach_for_disk"
    action_error_source = ActionFailed { source_error: String("pantry attach failed: remote server is gone") }
    file = /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/steno-0.4.1/src/sec.rs:1044
    result = failure
    saga_id = b9936011-1c9a-4041-9abb-734481f6338f
    saga_name = snapshot-create
    sec_id = d928fa26-ac1c-4cc1-a686-6a165c60c4b8
    undo_result = success

I did not test the changed detach / snapshot codepaths, but I believe they're similar enough that testing attach is sufficient.

Saga pantry attach/detach: fail if pantry is removed from DNS

2d24652

jgallagher requested review from davepacheco and jmpesp October 14, 2024 19:56

missed a pantry retry_until_known_result

4012296

jgallagher mentioned this pull request Oct 14, 2024

Remove retry_until_known_result() #6868

Open

jmpesp approved these changes Oct 15, 2024

davepacheco added this to the 12 milestone Oct 15, 2024

davepacheco assigned jgallagher Oct 15, 2024

davepacheco approved these changes Oct 16, 2024

jgallagher added 4 commits November 7, 2024 11:50

Merge branch 'main' into john/fallible-pantry-attach-detach

6f59a1c

PR feedback

fb21fb9

fixes to pantry qorb connector

62d3608

Merge branch 'main' into john/fallible-pantry-attach-detach

6a16f1d

jgallagher marked this pull request as ready for review November 12, 2024 19:34

snapshot_create undo: treat pantry gone as successful detach

b94c1fc
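The commit title above ("treat pantry gone as successful detach") suggests an undo step shaped roughly like the sketch below. This is an assumption-laden illustration, not the code in b94c1fc: `DetachError` and `undo_detach` are hypothetical names, and the rationale in the comment (an expunged pantry has nothing left to detach) is my reading of the commit title rather than something stated in this thread.

```rust
/// Hypothetical error from a pantry detach attempt.
#[derive(Debug, PartialEq)]
enum DetachError {
    /// The pantry is no longer in DNS (presumably it was expunged).
    Gone,
    /// Any other failure.
    Other(String),
}

/// Undo step: if the pantry is gone, there is nothing left to detach
/// from, so report success rather than leaving the saga undo stuck.
fn undo_detach(result: Result<(), DetachError>) -> Result<(), String> {
    match result {
        Ok(()) | Err(DetachError::Gone) => Ok(()),
        Err(DetachError::Other(msg)) => {
            Err(format!("pantry detach failed: {msg}"))
        }
    }
}
```

Only the `Gone` case is folded into success here; other errors still surface so that a real detach failure is not silently ignored.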
