
Fix race in grpcsync.CallbackSerializer #6778

Closed

Conversation

s-matyukevich
Contributor

Here is the race condition that we encountered on one of our servers.

goroutine 4809 [chan receive, 937 minutes]:
google.golang.org/grpc.(*ccResolverWrapper).UpdateState(0xc0005ef200, {{0xc001a11140, 0x3, 0x3}, {0xc0245f3380, 0x3, 0x3}, 0x0, 0x0})
	/go/pkg/mod/google.golang.org/[email protected]/resolver_conn_wrapper.go:178 +0x1ae
github.com/DataDog/dd-source/libs/go/grpc/resolvers/healthchecked.(*ddResolver).update(0xc000b89a20, {0x38db400, 0xc0003a93f8})
	/go/pkg/mod/github.com/!data!dog/[email protected]/libs/go/grpc/resolvers/healthchecked/base.go:443 +0x78f
github.com/DataDog/dd-source/libs/go/grpc/resolvers/healthchecked.(*ddResolver).updater(0xc000b89a20, 0x0?)
	/go/pkg/mod/github.com/!data!dog/[email protected]/libs/go/grpc/resolvers/healthchecked/base.go:402 +0x14a
created by github.com/DataDog/dd-source/libs/go/grpc/resolvers/healthchecked.(*builder).buildWithStopper in goroutine 1
	/go/pkg/mod/github.com/!data!dog/[email protected]/libs/go/grpc/resolvers/healthchecked/base.go:134 +0x2cb

goroutine 2156775 [chan receive, 937 minutes]:
google.golang.org/grpc.(*ccResolverWrapper).close(0xc0005ef200)
	/go/pkg/mod/google.golang.org/[email protected]/resolver_conn_wrapper.go:136 +0x13b
google.golang.org/grpc.(*ClientConn).enterIdleMode(0xc000b49400)
	/go/pkg/mod/google.golang.org/[email protected]/clientconn.go:429 +0xf1
google.golang.org/grpc.(*idler).EnterIdleMode(0x1?)
	/go/pkg/mod/google.golang.org/[email protected]/clientconn.go:324 +0x13
google.golang.org/grpc/internal/idle.(*manager).tryEnterIdleMode(0xc0000fbce0)
	/go/pkg/mod/google.golang.org/[email protected]/internal/idle/idle.go:206 +0x97
google.golang.org/grpc/internal/idle.(*manager).handleIdleTimeout(0xc0000fbce0)
	/go/pkg/mod/google.golang.org/[email protected]/internal/idle/idle.go:169 +0x54
created by time.goFunc
	/root/.gimme/versions/go1.21.3.linux.amd64/src/time/sleep.go:176 +0x2d

My understanding of this trace is the following:

  • Calls to ccResolverWrapper.close and ccResolverWrapper.UpdateState happened simultaneously.
  • The return in CallbackSerializer.run() was executed first: the cs.callbacks list was empty, so run() exited without taking the closedMu lock.
  • CallbackSerializer.Schedule() was then called by UpdateState(). It put a new callback into cs.callbacks while holding the closedMu lock, but run() did not check this lock before exiting.
  • As a result the new callback was never executed. I think the backlog handling was meant to cover this case, but that code was never reached.
  • ccResolverWrapper.UpdateState remained stuck on the channel receive at resolver_conn_wrapper.go:178 (see the trace above), which makes sense because the callback was never executed and nobody wrote to errCh.

The fix is to always check the backlog while holding the closedMu lock before returning from CallbackSerializer.run() (see the sketch below).
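
For context, here is a minimal, self-contained sketch of the pattern described above. It is not the grpc-go implementation: the serializer type, its plain channel queue, and the drain loop below are stand-ins for CallbackSerializer, its unbounded buffer, and the closedMu/backlog handling, but it shows where the racy early return sits relative to the drain path the fix forces run() to go through.

package grpcsyncsketch

import (
	"context"
	"sync"
)

// serializer is an illustration only; it mirrors the shape of
// grpcsync.CallbackSerializer as described in this thread, not its real code.
type serializer struct {
	closedMu  sync.Mutex
	closed    bool
	callbacks chan func(context.Context) // stand-in for the unbounded buffer
}

// Schedule enqueues a callback under closedMu, as described above.
func (s *serializer) Schedule(f func(context.Context)) bool {
	s.closedMu.Lock()
	defer s.closedMu.Unlock()
	if s.closed {
		return false
	}
	s.callbacks <- f // assume the channel is large enough to never block here
	return true
}

// run mirrors the pre-fix loop. The race described above: the return below
// exits without ever taking closedMu, so a callback that Schedule enqueued
// concurrently (while holding closedMu) is dropped and never executed.
func (s *serializer) run(ctx context.Context) {
	for ctx.Err() == nil {
		select {
		case <-ctx.Done():
			// Next iteration of the for loop will not happen, since
			// ctx.Err() is now non-nil.
		case cb, ok := <-s.callbacks:
			if !ok {
				return // racy early exit: skips the drain below
			}
			cb(ctx)
		}
	}

	// Drain path ("backlog"): set closed under closedMu, collect anything
	// Schedule managed to enqueue, then run it. The fix is to make sure
	// run() always passes through this block before returning.
	s.closedMu.Lock()
	s.closed = true
	var backlog []func(context.Context)
	for {
		select {
		case cb := <-s.callbacks:
			backlog = append(backlog, cb)
		default:
			s.closedMu.Unlock()
			for _, cb := range backlog {
				cb(ctx)
			}
			return
		}
	}
}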


codecov bot commented Nov 9, 2023

Codecov Report

Merging #6778 (c882e6d) into master (be1d1c1) will increase coverage by 0.17%.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6778      +/-   ##
==========================================
+ Coverage   83.34%   83.51%   +0.17%     
==========================================
  Files         285      285              
  Lines       30966    30965       -1     
==========================================
+ Hits        25809    25861      +52     
+ Misses       4076     4036      -40     
+ Partials     1081     1068      -13     
Files                                      Coverage Δ
internal/grpcsync/callback_serializer.go   100.00% <100.00%> (+6.25%) ⬆️

... and 22 files with indirect coverage changes

@s-matyukevich
Contributor Author

s-matyukevich commented Nov 9, 2023

Actually, now I am not sure how exactly this could happen:

  • The return in CallbackSerializer.run() was executed first: the cs.callbacks list was empty, so run() exited without taking the closedMu lock.

My guess is that close was called twice at the same time.

@dfawley dfawley (Member) left a comment

Thanks for the fix. Please see the inline comment for a simplification.

@@ -85,11 +85,10 @@ func (cs *CallbackSerializer) run(ctx context.Context) {
 			// Do nothing here. Next iteration of the for loop will not happen,
 			// since ctx.Err() would be non-nil.
 		case callback, ok := <-cs.callbacks.Get():
-			if !ok {
-				return
+			if ok {

I don't think we should need to read or check ok at all here. There's no way ok can ever be false, because the channel is only closed later. We'll unblock the select only when the context is canceled or a callback is created.

@dfawley
Member

dfawley commented Nov 9, 2023

Actually, now I am not sure how exactly this could happen:

Ah, right. Because ok can never even be false, this return should never be able to happen.

@dfawley
Member

dfawley commented Nov 9, 2023

My guess is that close was called twice at the same time.

There is no close. The way you stop the callback serializer is by canceling the context you passed to it. Multiple cancelations will do nothing.
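
A quick illustration of that last point, using only the standard library (no grpc types involved): a context.CancelFunc is documented to do nothing after the first call, so "stopping" the serializer twice is harmless.

package main

import (
	"context"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()               // stops anything watching ctx, e.g. a serializer's run loop
	cancel()               // subsequent calls are documented no-ops
	fmt.Println(ctx.Err()) // context.Canceled
}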

@dfawley
Member

dfawley commented Nov 9, 2023

One strange thing:

UpdateState is the only place where the serializer is used without holding ccr.mu. I don't understand why ccr.mu is needed in the other cases, though. ccr.serializer is read-only after initialization.

@dfawley
Member

dfawley commented Nov 9, 2023

Aha, I found it. ccr.close is called with cc.mu held:

cc.resolverWrapper.close()

And ccr.UpdateState's callback calls updateResolverState which needs cc.mu:

cc.mu.Lock()

So if UpdateState (and all the things that schedule callbacks?) doesn't check whether the ccr is closing before scheduling callbacks, then we can deadlock.
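
To make the shape of that wait cycle concrete, here is a tiny self-contained example. It is illustrative only, not grpc-go code: mu stands in for cc.mu, the worker goroutine for the callback serializer, and the queued function for the UpdateState callback that calls updateResolverState.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex
	work := make(chan func(), 1)
	drained := make(chan struct{})

	// "Serializer" goroutine: run queued callbacks, then report the queue
	// as drained.
	go func() {
		for f := range work {
			f()
		}
		close(drained)
	}()

	mu.Lock() // "cc.mu held" (the enterIdleMode / close path)

	// "UpdateState" schedules a callback whose body needs cc.mu
	// (updateResolverState).
	work <- func() {
		mu.Lock()
		mu.Unlock()
	}
	close(work)

	// "ccr.close" waits for the serializer to finish while still holding mu:
	// the pending callback can never finish because it needs mu.
	select {
	case <-drained:
		fmt.Println("drained (no deadlock)")
	case <-time.After(time.Second):
		fmt.Println("deadlock: holding mu while waiting for a callback that needs mu")
	}
	mu.Unlock() // let the toy program exit cleanly
}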

@s-matyukevich
Contributor Author

s-matyukevich commented Nov 10, 2023

So if UpdateState (and all the things that schedule callbacks?) doesn't check whether the ccr is closing before scheduling callbacks, then we can deadlock.

This makes sense; however, I'm not sure what the best way to fix that is without adding a lot of extra complexity. I guess it might be easier for you to close this PR and do the fix yourself?

@dfawley
Member

dfawley commented Nov 10, 2023

This makes sense; however, I'm not sure what the best way to fix that is without adding a lot of extra complexity. I guess it might be easier for you to close this PR and do the fix yourself?

Sounds good. I'm not sure what the right fix is at this point, either, so it may take a while, unfortunately.

@dfawley
Member

dfawley commented Nov 10, 2023

#6783 FYI

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 9, 2024