-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Saga pantry attach/detach: fail if pantry is removed from DNS #6866
base: main
Are you sure you want to change the base?
Conversation
Removal from DNS should correspond with the pantry zone's expungement, which should result in a saga node failure.
This builds on #6866. After its changes, there was only caller of `retry_until_known_result` left; this PR removes it. We keep the retry loop, but instead of retrying for ever, we bail out if the sled we're trying to reach is "gone", as determined by "is it no longer in-service", which in practice means it's been expunged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have to come back to your testing question, but the code looks correct to me
// Importantly, _do not use `call_pantry_attach_for_disk`_! That retries | ||
// as long as `pantry_address` is still resolvable in DNS, which we _do not | ||
// want here_. The Pantry attach can fail if there's a racing Volume | ||
// checkout to be sent to Propolis. Additionally, that call uses `attach` | ||
// instead of `attach_activate_background`, which means it will hang on the | ||
// activation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would update this message to only
Importantly, _do not use `call_pantry_attach_for_disk`_! That call uses `attach` instead of `attach_activate_background`, which means it will hang on the activation.
At some point, some sort of retry should be added here, but that shouldn't block this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in fb21fb9
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super familiar with this code so I'm not sure how much confidence my +1 adds but it looks good to me!
After several false starts and finding legit bugs in the pantry qorb implementation (fixed in 62d3608), I was able to successfully test this in a4x2. After the qorb pool did its every-30-second health checks to the pantry, I shut down the pantry service and then started a disk snapshot. The saga progressed up to node 13, then got stuck in the retry loop trying to attach, as expected; e.g.,
Several minutes later, I expunged the pantry, which removed it from internal DNS, at which point the saga node failed:
and Nexus logged that the failure was due to the target pantry no longer being present in DNS:
I did not also test the changed detach / snapshot codepaths, but believe they're similar enough that testing attach is sufficient. |
Removal from DNS should correspond with the pantry zone's expungement, which should result in a saga node failure.
I'm marking this as a draft because it's completely untested. @jmpesp any suggestions for testing? I skipped over the Nexus integration tests for a test to expand or emulate and didn't see anything in this area, but definitely could've missed it (they are legion!). I could construct a local unit test of the attach/detach functions, but it would require a fair bit of setup machinery (a fake crucible pantry, modifying the DNS records, ...). Maybe worth it if there's no easy way to test this?