Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add errors data to MethodMeta message #10

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

shreys7
Copy link
Member

@shreys7 shreys7 commented Nov 2, 2023

JIRA: https://postmanlabs.atlassian.net/browse/POA-155
Confluence: https://postmanlabs.atlassian.net/l/cp/EtJHwV3u

Why

Right now if we fail to capture a valid response, the request becomes a “partial witness” and is not included in the backend timeline. There are several possible causes:

  • No response is sent before the connection is torn down (by a client timeout resetting the connection or a server timeout resetting the connection.)
  • A response takes too long and the agent times out and uploads the partial witness.
  • The response cannot be fully parsed because it is truncated (which again may be an explicit decision we made) contains invalid JSON, or has other errors that our parser gives up on.

This means that the error rates we are giving to the user have a bias: they only show HTTP requests that successfully got a 4xx or 5xx response as errors, not those that failed because of another problem, or because the agent failed to interpret the response.

What

This PR adds support for describing these kinds of errors in the Method message.
Updated the MethodMeta message to have a new repeated errors field. Each error message contains two fields

  • type: Determines the type of error, e.g. AGENT_PARSING_ERROR, CONNECTION_RESET, etc.
  • message: A helper message to describe the error type.

The errors field is kept as repeated since there can be more than one reason for a failure to generate the complete witness.

… timeouts, as well as any agent related errors we faced while creating a witness
@shreys7 shreys7 requested a review from mgritter November 2, 2023 19:46
Comment on lines 320 to 328
// Represents the case where the client timed out waiting for a response from the server.
CLIENT_TIMEOUT = 2;

// Represents the case where the server timed out before it could send the response back.
SERVER_TIMEOUT = 3;

// Represents when one of the parties decides to abruptly terminate the connection.
// A reset is usually indicative of a problem, similar to a timeout, but the underlying causes can be different.
CONNECTION_RESET = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these three can be combined into two, or split into four, but they're a bit ambiguous right now.

We don't really know that the cause of a early close is, it might be a timeout or it might be a crash. So I think we could do CLIENT_CLOSED and SERVER_CLOSED and that covers all the cases, without the need for CONNECTION_RESET.

At the TCP protocol level we could distinguish between a normal FIN shutdown and an actual RST, but RST should be rare and I don't think it's necessarily worth calling them out separately. (RST tends to indicate that the TCP state machines are not in sync, for example if a machine rebooted and came back.) But the alternative is CLIENT_FIN, SERVER_FIN, CLIENT_RST, SERVER_RST to cover the full spectrum of what we could observe.

}

errorType type = 1;
string message = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to think a bit about whether we want to store this unstructured field.

Alternatives are:

  • Put it in the witness metadata (the JSON wrapper we use for upload) or client telemetry.
  • Use only structured data in witnesses.

The danger is that if we actually start relying on it then it becomes a fixed format we have to respect, and if we don't rely on it then what is the point of paying for it?

Do you have any thoughts here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what you mean when you say witness metadata. Do you mean somewhere in the witness with info struct?

type witnessWithInfo struct {
	// The name of the interface on which this witness was captured.
	netInterface string

	srcIP           net.IP // The HTTP client's IP address.
	srcPort         uint16 // The HTTP client's port number.
	dstIP           net.IP // The HTTP server's IP address.
	dstPort         uint16 // The HTTP server's port number.
	observationTime time.Time
	id              akid.WitnessID
	requestEnd      time.Time
	responseStart   time.Time

	witness *pb.Witness
}

I was also skeptical about adding this field. I kept it to pair it up with the OTHER errorType case where we don't know what exactly went wrong, but we do have an error message that can be used to display somewhere (to us or to the user). I am open for the option of omitting it completely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the structure I was thinking of. We do not have clear design guidelines about what is "info" and what belongs in the IR itself. I think I erred on my very first project by putting processing latency in the IR, when it would have been fine in witnessWithInfo instead.

I do think the error should be part of the witness, so that the witness clearly records its state, but I'm not 100% sure I'm not making the same error again. :)

We can leave the message field as-is for now, but leave a note suggesting additional fields if any structured data needs to be transmitted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants