Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: annotate every cancel cause with message #5278

Closed
wants to merge 1 commit into from

Conversation

sipsma
Copy link
Collaborator

@sipsma sipsma commented Aug 30, 2024

We've been hitting some random ephemeral failed to compute cache key: context canceled errors for a while now, which have been very hard to track down.

Buildkit does use ContextCause everywhere, but only wraps the error using github.com/pkg/errors.WithStack so unless I'm missing something obvious the error messages don't actually contain any extra useful information.

This changes everywhere to use errors.Wrap, which includes the stack trace same as before but also includes a message so error messages here will have actual context on the cause.

I just went ahead and updated the entire codebase rather than those most likely to be a culprit for this bug since it seemed worth squashing this everywhere, but can be more selective if there's objections for any reason.

Also happy to be told I'm missing something obvious and there's a better way of going about this.

Otherwise error messages still just show up as "context canceled"
without any extra context on where the cancelation came from.

Signed-off-by: Erik Sipsma <[email protected]>
Copy link
Member

@tonistiigi tonistiigi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't WithStack enough for understanding the code-path where the error appears?

If we are changing the text message of the error message then I think we should do it in smaller patches, showing how it updates the error shown to users in these cases. So that we see the before and after, and can confirm that the new error message is improvement.

@sipsma
Copy link
Collaborator Author

sipsma commented Sep 16, 2024

Why isn't WithStack enough for understanding the code-path where the error appears?

@tonistiigi Because WithStack doesn't actually modify the error messages returned to clients (nor should it since we don't want to return an entire stack traces back to clients in the error messages we send to them). When a user reports one of these context canceled errors is happening all we should need to know where the cancellation originated from is a single string in the message, not the entire stack.

If we are changing the text message of the error message then I think we should do it in smaller patches, showing how it updates the error shown to users in these cases. So that we see the before and after, and can confirm that the new error message is improvement.

Can you explain what you mean by "showing how it updates the error shown to users in these cases"? The error messages would be modified to include the text I annotated in the error message.

Part of the problem here is that we have no idea where the cancelation is coming from right now, so I can't really say what the error messages will look like until this change is made.

If there's concerns with this I can just put it in a fork instead for dagger instead, just trying to avoid this. I think you also mentioned that some upstream buildkit users were reporting context canceled errors at one point, so figured this would be helpful for that too.

@tonistiigi
Copy link
Member

Can you explain what you mean by "showing how it updates the error shown to users in these cases"? The error messages would be modified to include the text I annotated in the error message.

Yes, but it is hard for me to predict the full error message user sees without examples, and evaluate if that message would be clear to them. For some of them like the azblob pkg etc., I don't really care much, but the ones around solve for example seem much more common. Atm. they seem to be more like describing the code line. Eg. if user gets "solve status done after timeout" I'm not sure how helpful that is.

Part of the problem here is that we have no idea where the cancelation is coming from right now,

Why can't you show the stack trace with the line locations?

@jedevc
Copy link
Member

jedevc commented Sep 18, 2024

Yes, but it is hard for me to predict the full error message user sees without examples, and evaluate if that message would be clear to them.

Agreed, but surely any information at all is better than the current state of just "context cancelled" on the client? Currently to debug one of these, if a user gets context cancelled, you need to trawl through buildkit logs to work out what exactly has broken.

Aside from just making some of this easier to debug, it would also help improve triage (both for buildkit directly, and how we consume it in dagger), with issues where there are large numbers of users complaining about context cancelled. We currently have to ask users to upload logs so we can analyze them, which sometimes users don't do, or have privacy concerns, etc.

@tonistiigi
Copy link
Member

tonistiigi commented Sep 18, 2024

@jedevc For the user after they hit ctrl-c and build gracefully stops the meaningful error would be "your build was cancelled". While "context canceled" isn't the most amazing English, "solve status done after timeout" doesn't make it any better for the user in my opinion.

@colinhemmings offtopic, but we should probably replace the user message with a nice sentence in this case, at least when we can confirm that both "user invoked cancellation request" and server shut down with "context canceled".

@tonistiigi
Copy link
Member

Why can't you show the stack trace with the line locations?

Note that even if you don't want to show the full stack trace, you could parse out the frame that matches cancellation and just add "(filename:linenumber)" in the end of the error string.

@sipsma
Copy link
Collaborator Author

sipsma commented Sep 23, 2024

Why can't you show the stack trace with the line locations?

Note that even if you don't want to show the full stack trace, you could parse out the frame that matches cancellation and just add "(filename:linenumber)" in the end of the error string.

I had tried this a while back but failed, took a look again and realized it's something very dumb: if you get an error from github.com/pkg/errors and then wrap it using the standard library fmt.Errorf("%w") then you no longer can get the stack trace by printing the github.com/pkg/errors error with %+v (the only way I can see to extract the stack trace out?)

I had been %+v slightly too high in the stack where we had already been wrapping it with fmt.Errorf and thus lost the stack traces. Once I moved that logic to directly where we call buildkit implementations, I was able to extract the stack trace.

I sent out a dagger PR for this (dagger/dagger#8553). I'm not 100% sure if the stack trace parsing is really robust, but if there's more complication with all this we can just make a temporary fork of buildkit that has the change in this PR to ease debugging and run with that until we have figured out what's going wrong with those context cancelled errors.

If there's anything worth upstreaming let me know @tonistiigi but will close this for now.

@sipsma sipsma closed this Sep 23, 2024
@tonistiigi
Copy link
Member

I had tried this a while back but failed, took a look again and realized it's something very dumb: if you get an error from github.com/pkg/errors and then wrap it using the standard library fmt.Errorf("%w") then you no longer can get the stack trace by printing the github.com/pkg/errors error with %+v (the only way I can see to extract the stack trace out?)

If you use stack.Formatter() then I think this should fix it (if it doesn't feel free to send PR). https://pkg.go.dev/github.com/moby/[email protected]/util/stack#Formatter . stack.Traces() should also simplify this case I think.

If there's anything worth upstreaming let me know @tonistiigi but will close this for now.

Any cases you have for improving error messages shown to users for specific code paths, we are definitely interested in upstream.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants