Add outcome to transactions and spans #299

Merged: 9 commits, Aug 24, 2020
7 changes: 7 additions & 0 deletions specs/agents/error-tracking.md
@@ -8,3 +8,10 @@ The agent support reporting exceptions/errors. Errors may come in one of two for
Agents should include exception handling in the instrumentation they provide, such that exceptions are reported to the APM Server automatically, without intervention. In addition, hooks into logging libraries may be provided such that logged errors are also sent to the APM Server.

Errors may or may not occur within the context of a transaction or span. If they do, then they will be associated with them by recording the trace ID and transaction or span ID. This enables the APM UI to annotate traces with errors.

### Impact on the `outcome`

Tracking an error that's related to a transaction does not impact its `outcome`.
A transaction might have multiple errors associated with it but still return with a 2xx status code.
Hence, the status code is a more reliable signal for the outcome of the transaction.
This, in turn, means that the `outcome` is always specific to the protocol.
11 changes: 7 additions & 4 deletions specs/agents/tracing-instrumentation-http.md
@@ -4,8 +4,10 @@ Agents should instrument HTTP request routers/handlers, starting a new transacti

- The transaction `type` should be `request`.
- The transaction `result` should be `HTTP Nxx`, where N is the first digit of the status code (e.g. `HTTP 4xx` for a 404)
- The transaction `outcome` should be `"success"` for HTTP status codes < 500 and `"failure"` for status codes >= 500. \
Status codes in the 4xx range (client errors) are not considered a `failure` as the failure has not been caused by the application itself but by the caller.

This could be the default behavior, but we should allow users to capture 4xx responses as errors, since some users may want to treat 401/403 as errors, for example.

Member Author

Is it enough to just offer the API for now? I'd wait to add another config option until we actually get requests for that.

Contributor

I added APIs for setting the outcome in the Python implementation: elastic/apm-agent-python@ce13f92b63

So a user could do this somewhere in their code if they determine that the transaction should be considered failed:

    import elasticapm
    elasticapm.set_transaction_failure()

As there's no browser API to get the status code of a page load, the RUM agent always reports `"unknown"` for those transactions.
- The transaction `name` should be aggregatable, such as the route or handler name. Examples:

- `GET /users/{id}`
- `UsersController#index`
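
The `result` and `outcome` rules for HTTP transactions listed above can be sketched as a small mapping function. This is a minimal illustration, not code from any agent; the function name `http_transaction_result_and_outcome` is hypothetical.

```python
from typing import Tuple


def http_transaction_result_and_outcome(status_code: int) -> Tuple[str, str]:
    """Map an HTTP response status code to (result, outcome) for a transaction.

    Hypothetical helper: the name and signature are assumptions made for
    this sketch and are not part of the spec.
    """
    # `result` is "HTTP Nxx", where N is the first digit of the status code.
    result = f"HTTP {status_code // 100}xx"
    # Only server errors (>= 500) count as a failure for transactions;
    # client errors (4xx) are caused by the caller, not the application.
    outcome = "failure" if status_code >= 500 else "success"
    return result, outcome
```

Under these rules a 404 response yields the result `HTTP 4xx` with a `"success"` outcome, while a 503 yields `HTTP 5xx` with `"failure"`.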

@@ -40,7 +42,8 @@ We capture spans for outbound HTTP requests. These should have a type of `extern

For outbound HTTP request spans we capture the following http-specific span context:

- `http.url` (the target URL)
- `http.status_code` (the response status code)
- `http.url` (the target URL) \
The captured URL should have the userinfo (username and password), if any, redacted.
- `http.status_code` (the response status code) \
The span's `outcome` should be set to `"success"` if the status code is lower than 400 and to `"failure"` otherwise.
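
The two rules for outbound HTTP spans might be sketched as follows. This is only an illustration under assumptions: the helper names `redact_userinfo` and `http_span_outcome` are hypothetical, and the `REDACTED` placeholder is chosen for this sketch because the spec text does not say what the redacted value should look like.

```python
from urllib.parse import urlsplit, urlunsplit

# Placeholder value is an assumption of this sketch, not defined by the spec.
REDACTED = "REDACTED"


def redact_userinfo(url: str) -> str:
    """Return `url` with any userinfo (username and password) replaced."""
    parts = urlsplit(url)
    # Userinfo, if present, is everything before the last "@" in the netloc.
    _userinfo, separator, host_and_port = parts.netloc.rpartition("@")
    if not separator:
        return url  # no userinfo, nothing to redact
    redacted_netloc = f"{REDACTED}:{REDACTED}@{host_and_port}"
    return urlunsplit(parts._replace(netloc=redacted_netloc))


def http_span_outcome(status_code: int) -> str:
    """Span outcome for an outbound request: any status >= 400 is a failure."""
    return "success" if status_code < 400 else "failure"
```

Note that a 404 is a `"failure"` for the span here, whereas the same status code on an incoming request would leave the transaction outcome at `"success"`.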

Same comments as above to allow flexibility to users to customize the outcomes based on what they see as errors.

Also, it seems like for spans we are considering an erroneous outcome if < 400 and for transactions it is >500.

Member Author

It's >= 400 for spans (includes client errors) and >= 500 for transactions (does not include client errors)


The captured URL should have the userinfo (username and password), if any, redacted.
21 changes: 21 additions & 0 deletions specs/agents/tracing-spans.md
@@ -2,6 +2,27 @@

The agent should also have a sense of the most common libraries for these and instrument them without any further setup from the app developers.

#### Span outcome

The `outcome` property denotes whether the span represents a success or a failure.
It supports the same values as `transaction.outcome`.
The only semantic difference is that client errors set the `outcome` to `"failure"`.
Agents should try to determine the outcome for spans created by auto instrumentation,
which is especially important for exit spans (spans representing requests to other services).

While the transaction outcome lets you reason about the error rate from the service's point of view,
other services might have a different perspective on that.
For example, if there's a network error so that service A can't call service B,
the error rate of service B is 100% from service A's perspective.
However, as service B doesn't receive any requests, the error rate is 0% from service B's perspective.
The `span.outcome` also allows reasoning about error rates of external services.

#### Outcome API

Agents should expose an API to manually override the outcome.
This value must always take precedence over the automatically determined value.
The documentation should clarify that spans with `unknown` outcomes are ignored in the error rate calculation.
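
One way to honor the precedence rule is to store the manually set value separately from the automatically determined one, so the order of the two calls does not matter. The sketch below assumes a hypothetical `OutcomeHolder` class with hypothetical method names; it does not mirror the API of any particular agent.

```python
from typing import Optional

VALID_OUTCOMES = frozenset({"success", "failure", "unknown"})


class OutcomeHolder:
    """Hypothetical stand-in for a span or transaction object."""

    def __init__(self) -> None:
        self._auto_outcome: Optional[str] = None
        self._manual_outcome: Optional[str] = None

    def set_auto_outcome(self, outcome: str) -> None:
        """Called by auto-instrumentation (hypothetical name)."""
        self._auto_outcome = self._validated(outcome)

    def set_outcome(self, outcome: str) -> None:
        """Called by user code to override the outcome (hypothetical name)."""
        self._manual_outcome = self._validated(outcome)

    @property
    def outcome(self) -> str:
        # The manual value always wins, regardless of call order.
        if self._manual_outcome is not None:
            return self._manual_outcome
        if self._auto_outcome is not None:
            return self._auto_outcome
        return "unknown"

    @staticmethod
    def _validated(outcome: str) -> str:
        if outcome not in VALID_OUTCOMES:
            raise ValueError(f"invalid outcome: {outcome!r}")
        return outcome
```

With this layout, an instrumentation that reports `"success"` after the user has already set `"failure"` does not overwrite the user's choice.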

#### Span stack traces

Spans may have an associated stack trace, in order to locate the associated source code that caused the span to occur. If there are many spans being collected this can cause a significant amount of overhead in the application, due to the capture, rendering, and transmission of potentially large stack traces. It is possible to limit the recording of span stack traces to only spans that are slower than a specified duration, using the config variable `ELASTIC_APM_SPAN_FRAMES_MIN_DURATION`.
42 changes: 41 additions & 1 deletion specs/agents/tracing-transactions.md
@@ -4,4 +4,44 @@ Transactions are a special kind of span.
They represent the entry into a service.
They are sometimes also referred to as local roots or entry spans.

Transactions are created either by the built-in auto-instrumentation of an agent or the [tracer API](tracing-api.md).

#### Transaction outcome

The `outcome` property denotes whether the transaction represents a success or a failure from the perspective of the entity that produced the event.
The APM Server converts this to the [`event.outcome`](https://www.elastic.co/guide/en/ecs/current/ecs-allowed-values-event-outcome.html) field.
This property is optional to preserve backwards compatibility.
If an agent doesn't report the `outcome` (or reports `null`), the APM Server sets the outcome to `"unknown"`.

- `"failure"`: Indicates that this transaction describes a failed result. \
Note that client errors (such as HTTP 4xx) don't fall into this category as they are not an error from the perspective of the server.
- `"success"`: Indicates that this transaction describes a successful result.
- `"unknown"`: Indicates that there's no information about the outcome.
This is the default value that applies when an outcome has not been set explicitly.
This may be the case when a user tracks a custom transaction without explicitly setting an outcome.
For existing auto-instrumentations, agents should set the outcome either to `"failure"` or `"success"`.
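
The defaulting behavior described above (a missing or `null` outcome is treated as `"unknown"`) can be illustrated with a short sketch. It is not how the APM Server is implemented; the function name `effective_outcome` and the dict-shaped event are assumptions made for the example.

```python
from typing import Any, Mapping

VALID_OUTCOMES = ("success", "failure", "unknown")


def effective_outcome(event: Mapping[str, Any]) -> str:
    """Return the outcome to store for an incoming event (hypothetical helper)."""
    outcome = event.get("outcome")
    # Older agents omit the field entirely; others may send an explicit null.
    if outcome is None:
        return "unknown"
    if outcome not in VALID_OUTCOMES:
        raise ValueError(f"invalid outcome: {outcome!r}")
    return outcome
```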

What counts as a failed or successful request depends on the protocol and does not depend on whether there are error documents associated with a transaction.

##### Error rate

The error rate of a transaction group is based on the `outcome` of its transactions.

    error_rate = failure / (failure + success)

Note that when calculating the error rate,
transactions with an `unknown` or non-existent outcome are not considered.

The calculation just looks at the subset of transactions where the result is known and extrapolates the error rate for the total population.
This prevents `unknown` or non-existent outcomes from reducing the error rate,
which would happen when looking at a mix of old and new agents,
or when looking at RUM data (as page load transactions have an `unknown` outcome).
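
As a worked sketch of the calculation, the function below counts only `"failure"` and `"success"` outcomes and ignores everything else. Returning `None` when no outcome is known is an assumption of this example; the spec text does not say how that case should be reported.

```python
from typing import Iterable, Optional


def error_rate(outcomes: Iterable[Optional[str]]) -> Optional[float]:
    """Compute failure / (failure + success), ignoring unknown or missing outcomes."""
    failure = 0
    success = 0
    for outcome in outcomes:
        if outcome == "failure":
            failure += 1
        elif outcome == "success":
            success += 1
        # "unknown" and None (non-existent) are skipped on purpose.
    known = failure + success
    if known == 0:
        return None  # assumption: no known outcomes means no error rate
    return failure / known
```

For example, one failure and three successes mixed with any number of `unknown` transactions gives an error rate of 0.25, rather than a lower value diluted by the unknowns.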

Also note that this only reflects the error rate as perceived from the application itself.
The error rate perceived by its clients is greater than or equal to that.

##### Outcome API

Agents should expose an API to manually override the outcome.
This value must always take precedence over the automatically determined value.
The documentation should clarify that transactions with `unknown` outcomes are ignored in the error rate calculation.