
cli: prevent double initialization in cases where an error was mistakenly retried #1404

Merged
2 commits merged into main on Mar 20, 2023

Conversation

@Nirusu (Contributor) commented Mar 11, 2023

Proposed change(s)

  • Add "Connecting" spinner state for "constellation init"
  • Show "Initializing cluster" only when our gRPC connection was successful

This is based on @malt3's gRPC debug logging functions.
I figured we can also just use them for the normal user output :)
I'm not sure this is the best place to put it, though, since I don't know whether the state can theoretically change to READY multiple times.
I hope I can get some feedback on that; I'm not super familiar with gRPC connection states and how they behave in our init.
Otherwise it might make sense to turn this into a channel and let the caller handle it.

Second commit also adds a note that it can take a few minutes.
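
A rough sketch of the intended flow, assuming the CLI already has the dialed *grpc.ClientConn; the spinner interface, function name, and package name below are illustrative stand-ins, not the actual Constellation code:

package initflow // illustrative package name

import (
  "context"

  "google.golang.org/grpc"
  "google.golang.org/grpc/connectivity"
)

// spinner is a stand-in for the CLI's spinner type.
type spinner interface {
  Start(text string, showDots bool)
  Stop()
}

// waitUntilReady shows "Connecting" until the gRPC connection reaches READY,
// then switches the spinner to "Initializing cluster".
func waitUntilReady(ctx context.Context, conn *grpc.ClientConn, s spinner) {
  s.Start("Connecting ", false)
  state := conn.GetState()
  for state != connectivity.Ready {
    // WaitForStateChange returns false if ctx is done before the state changes.
    if !conn.WaitForStateChange(ctx, state) {
      return
    }
    state = conn.GetState()
  }
  s.Stop()
  s.Start("Initializing cluster ", false)
}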

Checklist

  • Update docs
  • Add labels (e.g., for changelog category)
  • Link to Milestone

@Nirusu Nirusu requested a review from katexochen as a code owner March 11, 2023 08:52
@Nirusu Nirusu added the "feature" label (This introduces new functionality) Mar 11, 2023
@Nirusu Nirusu requested a review from malt3 March 11, 2023 08:54
@edgelesssys edgelesssys deleted a comment from netlify bot Mar 11, 2023
@malt3 (Contributor) commented Mar 13, 2023

I am not sure if the state can theoretically change multiple times to READY.

Currently it certainly can change to READY multiple times. This is something we do want to prevent in the future by disallowing further init retries once we see that the connection became ready once.

Comment on lines +268 to +278
d.spinner.Stop()
d.spinner.Start("Initializing cluster ", false)
Contributor
I think we should either prevent this spinner change from happening more than once in this func, or prevent multiple init calls altogether (see my other comment).

@daniel-weisse @3u13r any thoughts on preventing multiple init calls after the aTLS connection was established at least once?
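
One way to guarantee the spinner switch happens at most once would be to guard it with sync.Once; a minimal sketch (field and method names are made up, not the PR's code):

package initflow

import "sync"

// spinner is a stand-in for the CLI's spinner type.
type spinner interface {
  Start(text string, showDots bool)
  Stop()
}

type initDoer struct {
  spinner     spinner
  spinnerOnce sync.Once
}

// onConnectionReady switches the spinner text exactly once, even if the
// connection state reaches READY multiple times.
func (d *initDoer) onConnectionReady() {
  d.spinnerOnce.Do(func() {
    d.spinner.Stop()
    d.spinner.Start("Initializing cluster ", false)
  })
}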

Member
Dropping from an established connection back to CONNECTING should result in a failure, imo.
We need to refactor the whole init logic anyway. Currently, a failed attestation results in a warning that claims the error is non-recoverable (which is not true).

Contributor (author)
Should we make a double connectivity.Ready a failure for now and discuss a larger refactor to init in an upcoming dev sync?

I would also +1 a refactor, because proper fixes for #1403 might also require refactoring init (on the bootstrapper side) to remove or better mitigate the pesky timeout logic.

@daniel-weisse (Member) commented Mar 14, 2023

2023-03-14T08:39:56+01:00       DEBUG   cmd/init.go:246 Created protoClient
2023-03-14T08:39:56+01:00       DEBUG   cmd/init.go:259 Connection state started as CONNECTING
2023-03-14T08:39:56+01:00       DEBUG   cloudcmd/validators.go:170      Validating attestation document
2023-03-14T08:39:57+01:00       DEBUG   cloudcmd/validators.go:170      Successfully validated attestation document
2023-03-14T08:39:57+01:00       DEBUG   cmd/init.go:261 Connection state changed to CONNECTING
2023-03-14T08:39:57+01:00       DEBUG   cmd/init.go:264 Connection ready
// Init actually starts

Connection changes to CONNECTING twice, so we should definitely not cancel here. But aborting with an unretriable error after "Connection ready" seems OK to me.
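
A hedged sketch of the "abort after Connection ready" idea: keep watching the connection state and return a non-retriable error once an established connection drops. The error value, function name, and package name are assumptions, not the merged code:

package initflow

import (
  "context"
  "errors"

  "google.golang.org/grpc"
  "google.golang.org/grpc/connectivity"
)

var errConnectionLost = errors.New("lost connection to cluster after it was established")

// watchConnection returns errConnectionLost if the connection was READY once
// and then left that state; the caller should treat that as non-retriable.
func watchConnection(ctx context.Context, conn *grpc.ClientConn) error {
  wasReady := false
  state := conn.GetState()
  for {
    if state == connectivity.Ready {
      wasReady = true
    } else if wasReady {
      // The connection was established once and then dropped: don't retry.
      return errConnectionLost
    }
    if !conn.WaitForStateChange(ctx, state) {
      return ctx.Err()
    }
    state = conn.GetState()
  }
}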

@Nirusu (Contributor, author) commented Mar 14, 2023

Alright, I gave it a try with channels in the new commit. Check it out; this definitely makes things more complicated, though 😄

Note that I couldn't really test this yet, since in my current attempts (yanking the cable, temporarily disabling WiFi) the connection survived. Seems it's a bit more resilient than I thought (or macOS keeps WiFi on for a bit after disabling it).

@Nirusu Nirusu force-pushed the feat/spinner-connecting branch 2 times, most recently from 1e4d7cf to 0f85bf5 Compare March 14, 2023 17:09
@katexochen katexochen removed their request for review March 14, 2023 17:21
@daniel-weisse (Member) left a comment

Overall implementation idea looks good to me, though I would like this to be extensively tested before merging.
Is it possible to write a unit test for this?

Review thread on cli/internal/cmd/init.go (outdated, resolved)
@Nirusu (Contributor, author) commented Mar 15, 2023

I think so far we don't have a test for initCall. I can try writing one that at least tests the goroutine logic here.

@Nirusu (Contributor, author) commented Mar 15, 2023

Okay, now the unit test actually complains about data races with the use of the logger in the goroutine. Will investigate.

Review thread on cli/internal/cmd/init.go (outdated, resolved)
@Nirusu (Contributor, author) commented Mar 15, 2023

Added a test and removed that ugly connectedTwice variable by just using context.WithCancelCause.

There are still a few open issues:

  1. The data race with the logger still exists (effectively only reachable in the test, I guess). I don't know how to fix this yet.
  2. The test is still slightly racy without a time.Sleep to let the context cancellation propagate after the sender goroutine has fired off the event over the channel and the test checks the context status. This should be fixable somehow, though.
  3. Channel sends block by default, so in a very unfortunate case logGRPCStateChanges could block trying to send on the channel if the state events change rapidly. I don't think that's really likely to happen, though.

This cancelling on multiple events is way more painful than I thought it would be...

It might actually be easier to move this down to logGRPCStateChanges rather than using channels and dealing with this mess of synchronization. What do you think about this?
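
For reference, a minimal sketch of the context.WithCancelCause pattern mentioned above (the channel shape, counter, error value, and names are illustrative, not the PR's actual code):

package initflow

import (
  "context"
  "errors"
)

var errDoubleReady = errors.New("connection became ready a second time")

// run cancels the shared context with a dedicated cause when a second READY
// event arrives, so the caller can distinguish it from a normal cancellation.
func run(parent context.Context, readyEvents <-chan struct{}) error {
  ctx, cancel := context.WithCancelCause(parent)
  defer cancel(nil)

  // Watcher: cancel with a cause once the second READY event is seen.
  go func() {
    readyCount := 0
    for {
      select {
      case <-ctx.Done():
        return
      case <-readyEvents:
        readyCount++
        if readyCount > 1 {
          cancel(errDoubleReady)
          return
        }
      }
    }
  }()

  // ... run the init attempt with ctx; it returns when the work finishes or ctx is cancelled ...

  if errors.Is(context.Cause(ctx), errDoubleReady) {
    return errDoubleReady // non-retriable
  }
  return ctx.Err()
}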

@Nirusu (Contributor, author) commented Mar 16, 2023

Bye channels and bye data races, this should be easier if everything is just in the same function 😅

Comment on lines 263 to 267
func (d *initDoer) handleGRPCStateChanges(ctx context.Context, wg *sync.WaitGroup, conn *grpc.ClientConn) {
go func() {
Member

Why does this function start a new goroutine instead of letting the caller handle this? @malt3

Contributor

This is almost in line with how the rest of our code handles goroutines. CC @katexochen

The general pattern we currently follow:

func foo(ctx context.Context) {
  var wg sync.WaitGroup
  defer wg.Wait()

  // pass the WaitGroup by pointer so the callee's Add/Done affect this wg
  usesAGoRoutine(ctx, &wg)
  // do something ....
}

func usesAGoRoutine(ctx context.Context, wg *sync.WaitGroup) {
  wg.Add(1)
  go func() {
    defer wg.Done()
    // do something in the background
  }()
}

@Nirusu (Contributor, author) commented Mar 17, 2023

Actually, thanks to the linter warning I see that this doesn't work as expected, given that the goroutine is shut down when the state changes to READY. Will still need to fix this.

@Nirusu (Contributor, author) commented Mar 17, 2023

After discussing with @malt3, it seems I misunderstood the requirement. If I understood correctly, we don't want to abort the current attempt when the connection changes to READY twice, but rather cancel any upcoming retry attempts once the first connection is dropped.

In this case I believe the new commit achieves this in a much simpler way.
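
A hedged sketch of that simpler approach: remember that a connection was established once, and have the retry check refuse any further attempts after that point. Field and method names are guesses, not the merged code:

package initflow

import "sync/atomic"

type initDoer struct {
  connectedOnce atomic.Bool
}

// markConnected is called once the aTLS/gRPC connection has been established.
func (d *initDoer) markConnected() {
  d.connectedOnce.Store(true)
}

// retriable reports whether another init attempt may be started after err.
func (d *initDoer) retriable(err error) bool {
  if d.connectedOnce.Load() {
    // A connection was established before: retrying could initialize the
    // cluster a second time, so give up instead of retrying.
    return false
  }
  return isTransient(err) // placeholder for the real per-error classification
}

func isTransient(err error) bool { return false } // illustrative stub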

Review thread on cli/internal/cmd/init.go (outdated, resolved)
@Nirusu Nirusu changed the title cli: split spinner into "Connecting" and "Initializing cluster" state for init cli: split spinner into "Connecting" and "Initializing cluster" state for init (and prevent double initialization cases when an error is mistakenly retried) Mar 20, 2023
@Nirusu Nirusu added the "bug fix" label (Fixing a bug) Mar 20, 2023
@Nirusu Nirusu merged commit 822d782 into main Mar 20, 2023
@Nirusu Nirusu deleted the feat/spinner-connecting branch March 20, 2023 12:33
@derpsteb derpsteb added the "needs backport" label (This PR needs to be backported to a previous release) and removed the "feature" label Mar 24, 2023
@katexochen katexochen changed the title cli: split spinner into "Connecting" and "Initializing cluster" state for init (and prevent double initialization cases when an error is mistakenly retried) cli: prevent double initialization in cases where an error was mistakenly retried Apr 4, 2023
Labels
bug fix (Fixing a bug), needs backport (This PR needs to be backported to a previous release)

5 participants