Add errors data to MethodMeta message #10

shreys7 · 2023-11-02T19:46:50Z

JIRA: https://postmanlabs.atlassian.net/browse/POA-155
Confluence: https://postmanlabs.atlassian.net/l/cp/EtJHwV3u

Why

Right now if we fail to capture a valid response, the request becomes a “partial witness” and is not included in the backend timeline. There are several possible causes:

No response is sent before the connection is torn down (by a client timeout resetting the connection or a server timeout resetting the connection.)
A response takes too long and the agent times out and uploads the partial witness.
The response cannot be fully parsed because it is truncated (which again may be an explicit decision we made) contains invalid JSON, or has other errors that our parser gives up on.

This means that the error rates we are giving to the user have a bias: they only show HTTP requests that successfully got a 4xx or 5xx response as errors, not those that failed because of another problem, or because the agent failed to interpret the response.

What

This PR adds support for describing these kinds of errors in the Method message.
Updated the MethodMeta message to have a new repeated errors field. Each error message contains two fields

type: Determines the type of error, e.g. AGENT_PARSING_ERROR, CONNECTION_RESET, etc.
message: A helper message to describe the error type.

The errors field is kept as repeated since there can be more than one reason for a failure to generate the complete witness.

… timeouts, as well as any agent related errors we faced while creating a witness

mgritter · 2023-11-02T19:59:17Z

proto/api_spec/method.proto

+    // Represents the case where the client timed out waiting for a response from the server.
+    CLIENT_TIMEOUT = 2;
+
+    // Represents the case where the server timed out before it could send the response back.
+    SERVER_TIMEOUT = 3;
+
+    // Represents when one of the parties decides to abruptly terminate the connection.
+    // A reset is usually indicative of a problem, similar to a timeout, but the underlying causes can be different.
+    CONNECTION_RESET = 4;


I think these three can be combined into two, or split into four, but they're a bit ambiguous right now.

We don't really know that the cause of a early close is, it might be a timeout or it might be a crash. So I think we could do CLIENT_CLOSED and SERVER_CLOSED and that covers all the cases, without the need for CONNECTION_RESET.

At the TCP protocol level we could distinguish between a normal FIN shutdown and an actual RST, but RST should be rare and I don't think it's necessarily worth calling them out separately. (RST tends to indicate that the TCP state machines are not in sync, for example if a machine rebooted and came back.) But the alternative is CLIENT_FIN, SERVER_FIN, CLIENT_RST, SERVER_RST to cover the full spectrum of what we could observe.

mgritter · 2023-11-02T20:01:54Z

proto/api_spec/method.proto

+  }
+
+  errorType type = 1;
+  string message = 2;


I need to think a bit about whether we want to store this unstructured field.

Alternatives are:

Put it in the witness metadata (the JSON wrapper we use for upload) or client telemetry.

Use only structured data in witnesses.

The danger is that if we actually start relying on it then it becomes a fixed format we have to respect, and if we don't rely on it then what is the point of paying for it?

Do you have any thoughts here?

I am not sure what you mean when you say witness metadata. Do you mean somewhere in the witness with info struct?

type witnessWithInfo struct { // The name of the interface on which this witness was captured. netInterface string srcIP net.IP // The HTTP client's IP address. srcPort uint16 // The HTTP client's port number. dstIP net.IP // The HTTP server's IP address. dstPort uint16 // The HTTP server's port number. observationTime time.Time id akid.WitnessID requestEnd time.Time responseStart time.Time witness *pb.Witness }

I was also skeptical about adding this field. I kept it to pair it up with the OTHER errorType case where we don't know what exactly went wrong, but we do have an error message that can be used to display somewhere (to us or to the user). I am open for the option of omitting it completely.

Yes, that's the structure I was thinking of. We do not have clear design guidelines about what is "info" and what belongs in the IR itself. I think I erred on my very first project by putting processing latency in the IR, when it would have been fine in witnessWithInfo instead.

I do think the error should be part of the witness, so that the witness clearly records its state, but I'm not 100% sure I'm not making the same error again. :)

We can leave the message field as-is for now, but leave a note suggesting additional fields if any structured data needs to be transmitted.

…rver closed types

Add errors field to MethodMeta that indicates client timeouts, server…

8b96486

… timeouts, as well as any agent related errors we faced while creating a witness

shreys7 requested a review from mgritter November 2, 2023 19:46

mgritter reviewed Nov 2, 2023

View reviewed changes

shreys7 mentioned this pull request Nov 2, 2023

[POA-155] Changes for capturing client and server timeouts postmanlabs/observability-shared-libs#201

Open

Remove the connection reset error type and club it with client and se…

e222150

…rver closed types

mgritter approved these changes Nov 3, 2023

View reviewed changes

shreys7 mentioned this pull request Nov 10, 2023

[POA-155] Capture client/server timeouts as well as Agent parsing errors postmanlabs/observability-cli#245

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add errors data to MethodMeta message #10

Add errors data to MethodMeta message #10

shreys7 commented Nov 2, 2023

mgritter Nov 2, 2023

mgritter Nov 2, 2023

shreys7 Nov 3, 2023

mgritter Nov 3, 2023

Add errors data to MethodMeta message #10

Are you sure you want to change the base?

Add errors data to MethodMeta message #10

Conversation

shreys7 commented Nov 2, 2023

Why

What

mgritter Nov 2, 2023

Choose a reason for hiding this comment

mgritter Nov 2, 2023

Choose a reason for hiding this comment

shreys7 Nov 3, 2023

Choose a reason for hiding this comment

mgritter Nov 3, 2023

Choose a reason for hiding this comment