Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: spurious cancelation of async webhooks, better tracing #2969

Merged
merged 3 commits into from
Dec 20, 2022

Conversation

alnr
Copy link
Contributor

@alnr alnr commented Dec 19, 2022

Previously, async webhooks (response.ignore=true) would be canceled early once the incoming Kratos request was served and it's associated context released. We now dissociate the cancellation of async hooks from the normal request processing flow.

I've also improved the OpenTelemetry instrumentation.

@alnr alnr self-assigned this Dec 19, 2022
Copy link
Member

@aeneasr aeneasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a test that shows this change to be effective? You can probably add a test which executes a webhook where the target takes ~5 seconds to respond or something similar. Even better, if test without your patch fails

errChan = make(chan error, 1)
)
if async {
ctx, cancel = context.WithTimeout(context.Background(), 5*time.Minute)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this would fix the issue, the underlying HTTP client may be canceling the request, not the context? It has a timeout of I think 10 seconds.

Using a background context is also a possible security issue. Please make sure that we're not confusing the context and name this, for example, asyncContext, and only use it in the correct execution branch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is the context being passed to this function (execute) being canceled/released while the async webhook is still in flight.

How is a background context a security issue?

ctx, cancel = context.WithTimeout(context.Background(), 5*time.Minute)
spanOpts = append(spanOpts, trace.WithNewRoot())
}
ctx, span := tracer.Start(ctx, "Webhook", spanOpts...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here for example is the first instance where the incorrect context is used if async is true. The trace would be lost and without context!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code is correct. For async webhooks, we start a new root span which is not associated with the incoming span.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of doing something like this:

req := // ...
var cancel // ...
if (async) {
  makeRequestContext, cancel = context.WithTimeout(context.Background(), time.Minute())
  req = req.WithContext(ctx)
}

that way, we ensure that the context is never leaving the async execution path :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code is correct. For async webhooks, we start a new root span which is not associated with the incoming span.

But then the span would not appear as part of the request, and it would not be possible to correlate it to an incoming HTTP request? Is that really intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the span would be orphaned in that sense. But still easy enough to find.

If you made the async webhook a child span of the incoming Kratos request span, the child will outlive the parent span. That will trip up many tools and generally violate OpenTelemetry semantics.

OpenTracing used to have a special trace-trace relationship (FollowsFrom) which was perfect for cases like this. AFAIK, that did not make it into the OpenTelemetry specification, though, and tooling support was sketchy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, that is a very good explanation - thank you!

So I did not include this in the merge, would you be open to make a follow-up PR with the span fixed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The merged patch is fine as-is. The OpenTelemetry people themselves are not sure how they want to models this: open-telemetry/opentelemetry-specification#65

I'd suggest looking at some traces in Jaeger and Tempo, and deciding whether we're OK with how those tools interpret the data in practice. If there's a major hiccup, we can then change the instrumentation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, that sounds like a plan!

@alnr
Copy link
Contributor Author

alnr commented Dec 19, 2022

Can you please add a test that shows this change to be effective? You can probably add a test which executes a webhook where the target takes ~5 seconds to respond or something similar. Even better, if test without your patch fails

Will add a test.

@alnr
Copy link
Contributor Author

alnr commented Dec 19, 2022

Added failing test as first commit and my fix as second.

Previously, async webhooks (response.ignore=true) would be canceled
early once the incoming Kratos request was served and it's associated
context released. We now dissociate the cancellation of async hooks
from the normal request processing flow.
@codecov
Copy link

codecov bot commented Dec 19, 2022

Codecov Report

Merging #2969 (1da1cd1) into master (e11ba52) will increase coverage by 0.08%.
The diff coverage is 95.12%.

@@            Coverage Diff             @@
##           master    #2969      +/-   ##
==========================================
+ Coverage   76.10%   76.18%   +0.08%     
==========================================
  Files         313      309       -4     
  Lines       19310    19071     -239     
==========================================
- Hits        14695    14529     -166     
+ Misses       3487     3416      -71     
+ Partials     1128     1126       -2     
Impacted Files Coverage Δ
selfservice/hook/web_hook.go 81.56% <95.12%> (+1.66%) ⬆️
credentialmigrate/migrate.go 75.00% <0.00%> (-7.36%) ⬇️
identity/handler.go 87.95% <0.00%> (-2.33%) ⬇️
selfservice/strategy/webauthn/nodes.go 100.00% <0.00%> (ø)
x/keys.go
identity/credentials_webauthn.go
selfservice/strategy/oidc/provider_lark.go
identity/credentials_lookup.go
selfservice/strategy/oidc/provider_config.go 34.04% <0.00%> (+0.04%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

// at this point, a goroutine is in the middle of the call to our test handler and waiting for a response
incomingCancel() // simulate the incoming Kratos request having finished
timeout := time.After(200 * time.Millisecond)
for done := false; !done; {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handlerEntered should indicate that we already entered the HTTP webhook target. I think we don't need to also check the log entry

async = gjson.GetBytes(e.conf, "response.ignore").Bool()
parseResponse = gjson.GetBytes(e.conf, "can_interrupt").Bool()
tracer = trace.SpanFromContext(ctx).TracerProvider().Tracer("kratos-webhooks")
cancel context.CancelFunc = func() {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cancel function was not needed IMO. We only used it to cancel the time out, but the underlying HTTP client already has a timeout itself (it's 30 seconds)

// incoming request context is cancelled.
//
// The webhook will still cancel after 30 seconds as that is the configured timeout for the HTTP client.
req = req.WithContext(context.Background())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of overshadowing the context, we now use context.Background() only where it is needed - in the request context. An alternative would be to not include a context if the request is async:

 	if !ignoreResponse {
 	 	req = req.WithContext(ctx)
 	}

Comment on lines +332 to +335
logger := e.deps.Logger().WithField("otel", map[string]string{
"trace_id": traceID.String(),
"span_id": spanID.String(),
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always include trace info

timeout := time.After(200 * time.Millisecond)
for done := false; !done; {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this was needed, as we already wait for handlerEntered

var found bool
for !found {
for _, entry := range logHook.AllEntries() {
if entry.Message == "Webhook request succeeded" {
Copy link
Member

@aeneasr aeneasr Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this independent of ordering and just look whether we find the succeeded log entry :)

@aeneasr aeneasr merged commit 72de640 into master Dec 20, 2022
@aeneasr aeneasr deleted the fix-async-webhooks branch December 20, 2022 08:21
peturgeorgievv pushed a commit to senteca/kratos-fork that referenced this pull request Jun 30, 2023
Previously, async webhooks (response.ignore=true) would be canceled
early once the incoming Kratos request was served and it's associated
context released. We now dissociate the cancellation of async hooks
from the normal request processing flow.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants