Add OpenTelemetry sampling conventions #793

Closed
wants to merge 32 commits into from
Changes from 19 commits
e767013
Add OpenTelemetry sampling conventions
jmacd Mar 5, 2024
0126c1d
chlog
jmacd Mar 5, 2024
8646a41
lint
jmacd Mar 5, 2024
157e07b
wip
jmacd Mar 6, 2024
f3c5da1
move into registry
jmacd Mar 6, 2024
1f1ca45
address intended user for each attribute
jmacd Mar 6, 2024
b5a65c4
address term implementations
jmacd Mar 6, 2024
a4c2068
give user perspective
jmacd Mar 6, 2024
b1574bd
clarify attributes can be used for spans
jmacd Mar 6, 2024
42e47f9
finish sentence
jmacd Mar 6, 2024
9badfa4
remove some bits
jmacd Mar 6, 2024
7e56498
apply suggestion
jmacd Mar 6, 2024
e466de8
yamllint
jmacd Mar 7, 2024
c50081e
Update .chloggen/793.yaml
jmacd Mar 7, 2024
73b0571
merge
jmacd Mar 7, 2024
4e8870d
add a tail-sampler example
jmacd Mar 7, 2024
4a1b1df
toc lint
jmacd Mar 7, 2024
21d9b99
Merge branch 'main' into jmacd/sampling_convs
jmacd Mar 7, 2024
e317e02
toc lint
jmacd Mar 7, 2024
e451ed2
Merge branch 'main' of github.com:open-telemetry/semantic-conventions…
jmacd Mar 11, 2024
e114c50
OTEP 235 ref
jmacd Mar 11, 2024
a28f48e
expand on sampling.priority
jmacd Mar 11, 2024
a55667f
expand on sampling.randomness
jmacd Mar 11, 2024
79275a3
expand on logs interpreting tracestate from span context
jmacd Mar 11, 2024
c3650d2
be more generic: sampler not tracer
jmacd Mar 11, 2024
9a7c46b
Apply suggestions from code review
jmacd Mar 11, 2024
c1d6c78
Merge branch 'main' into jmacd/sampling_convs
joaopgrassi Mar 13, 2024
9304f29
remove sampling priority
jmacd Mar 25, 2024
1cb4153
Merge branch 'jmacd/sampling_convs' of github.com:jmacd/semantic-conv…
jmacd Mar 25, 2024
8db652a
all the way removed
jmacd Mar 27, 2024
9ccd8d7
Update docs/sampling/README.md
jmacd May 31, 2024
157fdca
Update docs/sampling/README.md
jmacd May 31, 2024
25 changes: 25 additions & 0 deletions .chloggen/793.yaml
@@ -0,0 +1,25 @@
# Use this changelog template to create an entry for release notes.
#
# If your change doesn't affect end users you should instead start
# your pull request title with [chore] or use the "Skip Changelog" label.

# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
change_type: new_component

# The name of the area of concern in the attributes-registry, (e.g. http, cloud, db)
component: sampling

# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
note: Introduce attributes describing priority sampling, probability sampling.

# Mandatory: One or more tracking issues related to the change. You can use the PR number here if no issue exists.
# The values here must be integers.
issues: [793]

# (Optional) One or more lines of additional information to render under the primary note.
# These lines will be padded with 2 spaces and then inserted directly into the document.
# Use pipe (|) for multiline entries.
subtext: |
  Adds `sampling.priority` from OpenTracing.
  Adds `sampling.randomness` and `sampling.threshold` from [OTEP 235][OTEP235].
  [OTEP235]: https://github.com/open-telemetry/oteps/blob/main/text/trace/0235-sampling-threshold-in-trace-state.md
1 change: 1 addition & 0 deletions docs/README.md
@@ -32,6 +32,7 @@ Semantic Conventions are defined for the following areas:
* [Messaging](messaging/README.md): Semantic Conventions for messaging operations and systems.
* [Object Stores](object-stores/README.md): Semantic Conventions for object stores operations.
* [RPC](rpc/README.md): Semantic Conventions for RPC client and server operations.
* [Sampling](sampling/README.md): Sampling Semantic Conventions.
* [System](system/README.md): System Semantic Conventions.

Semantic Conventions by signals:
1 change: 1 addition & 0 deletions docs/attributes-registry/README.md
@@ -48,6 +48,7 @@ Currently, the following namespaces exist:
* [OS](os.md)
* [Process](process.md)
* [RPC](rpc.md)
* [Sampling](sampling.md)
* [Server](server.md)
* [Source](source.md)
* [Thread](thread.md)
27 changes: 27 additions & 0 deletions docs/attributes-registry/sampling.md
@@ -0,0 +1,27 @@
<!--- Hugo front matter used to generate the website version of this page:
linkTitle: Sampling
--->

# Sampling

## Sampling attributes

The following attributes are recognized for telemetry in general.

<!-- semconv registry.sampling(omit_requirement_level) -->
| Attribute | Type | Description | Examples |
|---|---|---|---|
| `sampling.priority` | int | Allows users and instrumentations to prioritize collection of reported telemetry items. [1] | `10`; `1`; `0` |
| `sampling.randomness` | string | The source of randomness for making probability sampling decisions, when it is not otherwise recorded. [2] | `ce929d0e0e4736` |
| `sampling.threshold` | string | Sampling probability as specified by OpenTelemetry. [3] | `c`; `ff8` |

**[1]:** If greater than 0, a hint to the Tracer to do its best to capture the trace. If 0, a hint to the Tracer not to capture the trace. If absent, the Tracer should use its default sampling mechanism.

**[2]:** This attribute is an optional way to express trace randomness, especially for cases where the TraceID is missing or known to be not random. Sampler components set and consume this value. The value is a hex-coded string containing 14 hex digits (56 bits) of randomness. Setting this attribute indicates the source of randomness that was used (and may be used again) for probability sampling. This field is taken to have the same meaning as the OpenTelemetry tracestate "R-value" for probability sampling, which is an alternative to deriving trace randomness from the TraceID specified in OTEP 235.

**[3]:** This attribute is set to convey sampling probability. Sampler components set and consume this value, which is taken to have the same meaning as the OpenTelemetry tracestate "T-value" for probability sampling. This attribute contains a hexadecimal-coded value containing 1 to 14 hex digits of precision, defining the threshold used to reject, depending on the random variable. This value can be converted into sampling probability as specified in OTEP 235.

The following attributes can be important for making sampling decisions and SHOULD be provided **at span creation time** (if provided at all):

* `sampling.priority`
<!-- endsemconv -->
282 changes: 282 additions & 0 deletions docs/sampling/README.md
@@ -0,0 +1,282 @@
<!--- Hugo front matter used to generate the website version of this page:
linkTitle: Sampling
--->

# Semantic Conventions for Sampling

**Status**: [Experimental][DocumentStatus]

<!-- toc -->

- [Probability sampling](#probability-sampling)
- [Overriding sampling decisions](#overriding-sampling-decisions)
- [Overriding sampling randomness](#overriding-sampling-randomness)
- [Sampling threshold](#sampling-threshold)
- [Sampling randomness](#sampling-randomness)
- [No definition for Scope and Resource attributes](#no-definition-for-scope-and-resource-attributes)
- [Span sampling attributes](#span-sampling-attributes)
- [Logs sampling attributes](#logs-sampling-attributes)
- [Examples](#examples)
- [Head sampling](#head-sampling)
- [Tail sampling](#tail-sampling)

<!-- tocstop -->

These attributes reflect the effect of sampling in a telemetry
collection pipeline. They describe how items of telemetry were
collected, making it possible to define Span-to-Metrics and
Logs-to-Metrics pipelines that accurately estimate, in a probabilistic
sense, the number of Spans and Log Records that existed before sampling.

These attributes MAY be modified by components in a collection
pipeline to convey successive sampling that has been carried out for a
particular item of telemetry, using the conventions for consistent
sampling described here. In that sense, telemetry consumers
should see these attributes as telemetry metadata.

## Probability sampling

The OpenTelemetry sampling decision is defined in terms of a Threshold
value and a Randomness value, each containing 56 bits of information.

A constant known as _maximum adjusted count_ (`MaxAdjustedCount`),

It could be just me, but I think that max-something suggests inclusiveness, so this can be confusing. How about AdjustedCountLimit?

I am fine with MaxAdjustedCount. It is inclusive with respect to the adjusted count. However, I understand that it can be a little confusing as it is also used as an exclusive upper limit for the threshold and the random value.

with value `0x100000000000000` (which can also be expressed as
`0x1p+56`, `math.Pow(2, 56)`, or `math.Ldexp(1, 56)`), defines the
exclusive upper limit of these values.

Logically, both Threshold (`T`) and Randomness (`R`) are represented
as unsigned integers in the range `0` through `0xffffffffffffff` or
`MaxAdjustedCount - 1`. Items of telemetry are selected (i.e.,
"sampled") when their Threshold value is less than or equal to their
Randomness value, or `T <= R`.

Sampling probability is defined by the following expression:

```
Probability = (MaxAdjustedCount - Threshold) / MaxAdjustedCount
```
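
As a minimal sketch of the decision rule and the probability expression above, the two can be written in Go as follows. The function names `ShouldSample` and `Probability` are illustrative and are not part of any OpenTelemetry API.

```go
package main

import "fmt"

// MaxAdjustedCount is 2^56, the exclusive upper limit of Threshold and Randomness.
const MaxAdjustedCount uint64 = 1 << 56

// ShouldSample reports whether an item is selected, i.e. whether T <= R.
// (Hypothetical helper name, used only for illustration.)
func ShouldSample(threshold, randomness uint64) bool {
	return threshold <= randomness
}

// Probability converts a 56-bit Threshold into a sampling probability.
// The threshold is assumed to be less than MaxAdjustedCount.
func Probability(threshold uint64) float64 {
	return float64(MaxAdjustedCount-threshold) / float64(MaxAdjustedCount)
}

func main() {
	const t = 0xc0000000000000
	fmt.Println(Probability(t))                    // 0.25
	fmt.Println(ShouldSample(t, 0xce929d0e0e4736)) // true
}
```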

In a Span-to-Metrics or Logs-to-Metrics pipeline, each item of
telemetry is representative of an _adjusted count_ number of items in
the original population. Adjusted count is the inverse of sampling
probability, and `MaxAdjustedCount` (defined above) is the inverse of
the smallest supported sampling probability (which can also be
represented as `0x1p-56`, `math.Pow(2, -56)`, or `math.Ldexp(1,
-56)`).
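
To illustrate how adjusted counts could feed a Span-to-Metrics or Logs-to-Metrics pipeline, the sketch below sums the adjusted count of each sampled item to estimate the size of the original population. The helper names `AdjustedCount` and `EstimateTotal` are assumptions made for this example, and every threshold is assumed to be less than `MaxAdjustedCount`.

```go
package main

import "fmt"

// MaxAdjustedCount is 2^56, the inverse of the smallest supported probability.
const MaxAdjustedCount uint64 = 1 << 56

// AdjustedCount returns the inverse of the sampling probability implied by
// a 56-bit Threshold. (Hypothetical helper name.)
func AdjustedCount(threshold uint64) float64 {
	return float64(MaxAdjustedCount) / float64(MaxAdjustedCount-threshold)
}

// EstimateTotal sums the adjusted counts of the sampled items, giving an
// estimate of how many items existed before sampling.
func EstimateTotal(thresholds []uint64) float64 {
	var total float64
	for _, t := range thresholds {
		total += AdjustedCount(t)
	}
	return total
}

func main() {
	// Two items sampled at 25%, one at 50%, one at 100%.
	sampled := []uint64{0xc0000000000000, 0xc0000000000000, 0x80000000000000, 0}
	fmt.Println(EstimateTotal(sampled)) // 4 + 4 + 2 + 1 = 11
}
```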

For the tracing signal, Threshold and Randomness propagate via W3C
Trace Context `tracestate`. When they appear in the `tracestate`, the
Threshold and Randomness properties are called "T-value" and
"R-value"; they are represented in the OpenTelemetry section of the
`tracestate` (having vendor tag `ot`), using properties named `th` and
`rv`, respectively.
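
The following Go sketch shows one way the `th` and `rv` properties might be read from a `tracestate` header value. It assumes the `ot` entry holds `;`-separated `key:value` pairs, as in `ot=th:c;rv:ce929d0e0e4736`; the function name `OTelTraceState` is hypothetical, and the code does no validation beyond splitting.

```go
package main

import (
	"fmt"
	"strings"
)

// OTelTraceState extracts the key:value properties from the "ot" entry of a
// W3C tracestate header value. (Hypothetical helper; assumes the "ot" value
// is a ';'-separated list of "key:value" pairs.)
func OTelTraceState(tracestate string) map[string]string {
	props := map[string]string{}
	for _, member := range strings.Split(tracestate, ",") {
		key, value, ok := strings.Cut(strings.TrimSpace(member), "=")
		if !ok || key != "ot" {
			continue // not the OpenTelemetry entry
		}
		for _, prop := range strings.Split(value, ";") {
			if k, v, ok := strings.Cut(prop, ":"); ok {
				props[k] = v
			}
		}
	}
	return props
}

func main() {
	props := OTelTraceState("vendor1=abc,ot=th:c;rv:ce929d0e0e4736")
	fmt.Println(props["th"], props["rv"]) // c ce929d0e0e4736
}
```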

For the logs signal, which generally does not record the W3C Trace
Context `tracestate`, sampling attributes are meant to be expressed
using log record attributes with the same definition as T-value and
R-value.

For more information on how to perform and interpret probability
sampling based on these properties, [consult OTEP 235][OTEP235].

### Overriding sampling decisions

Instrumentation authors and end-users that wish to prioritize an item
of telemetry for collection in spite of sampling can add a
`sampling.priority` attribute. This attribute is a suggestion, a way
of recognizing the importance of a certain event and requesting
additional consideration from the collection pipeline. For example, a
user could write this code snippet:

```
if err := doSomething(); err != nil {
    if err == VERY_SERIOUS {
        span.SetAttribute("sampling.priority", 10000)
@kalyanaj kalyanaj Mar 7, 2024

Is the idea that even if an original sampling decision was to drop, you can override it by setting this attribute? If so, IMHO, the name "priority" doesn't directly convey the override aspect. Hence, depending on what use cases we want to allow (change decision from drop -> keep, or change decision from keep -> drop, or both), I wonder if this should be a simple bool attribute such as "sampling.keep" or "sampling.include" or "sampling.override".

In the future, if we expect pipelines to prioritize between multiple levels (e.g., if I could keep only X%, prefer the ones with the higher priority), then we could still add a new attribute to convey that relative priority.

To put it a different way, it looks like there are two abstractions we want to introduce:

  1. a notion of overriding the decision
  2. a notion of relative priorities.

Right now, it looks like we are using 2) to solve both of the above, however it is hiding 1). Hence, my question is could we do only 1) above and defer 2) (given that anyway the current spec advises against using those relative priorities & it can be added in the future if really needed).

Thoughts?

Contributor

@lmolkova lmolkova Mar 8, 2024

There is no existing mechanism to keep a span that was already dropped, so I assume it can only increase or decrease the chances for span that was already sampled in to be exported, correct?

Contributor Author

@kalyanaj I tried to address your concern in #793 (comment).

@lmolkova Yes. Another way of saying this (which OTEP 235 explains) is that Thresholds cannot decrease as a result of sampling, but they can increase. OTEP 235 writes

In other words, th MUST NOT be decreased (as it is not possible to retroactively adjust an earlier stage's sampling probability), and it MUST be increased if a lower sampling probability was used.

The trace SDK specification confines what we can do for SDKs, in the sense that there is no Sampler decision corresponding to not sampling a span but still recording it, and possibly changing the decision later to export the unsampled span anyway. See, e.g.,

open-telemetry/opentelemetry-specification#2918
open-telemetry/opentelemetry-specification#2986

    }
    return err
}
```

Samplers and sampling processors SHOULD pass items of telemetry to the
exporter, independent of their default sampling mechanism, when the
`sampling.priority` attribute is present and non-zero. Although the
priority is an integer, mathematical weight is not prescribed and no
other specific behaviors are required.

### Overriding sampling randomness

Sampling system designers are able to override sampling randomness on
an item-by-item basis, which may be done for several reasons, including
situations where there is no TraceID defined.

When a tracing system purposely uses TraceIDs that do not follow the
W3C Trace Context Level 2 specification for TraceID randomness, and
they wish to use OpenTelemetry sampling components, they can insert
explicit randomness to prevent erroneously taking randomness from the
TraceID.

Another use for overriding sampling randomness is to configure a
different unit of sampling consistency. For example, multiple traces
can be given the same randomness value to ensure that either all or
none of them are sampled consistently.

In the tracing signal, sampling randomness can be overridden by
setting an R-value in the tracestate. In the logging signal, sampling
randomness can be overridden by setting the `sampling.randomness`
attribute.

### Sampling threshold

When determining the Threshold value from an item of telemetry,
sampler implementations SHOULD:

- use the OpenTelemetry T-value field (`th`) in `tracestate` (spans only)
- use the `sampling.threshold` attribute value, if present in the record attributes (logs only)

In both cases, the Threshold value is represented by one to 14
hexadecimal digits, allowing the use of variable-precision sampling
probability. When fewer than 14 digits are input, the string is
padded with trailing zeros to make the correct number of bits (56).

The zero Threshold value (encoded by a single `0`) corresponds with
100% sampling.
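
As a sketch of the padding rule described above, a T-value string could be decoded into its 56-bit Threshold like this. The name `ParseThreshold` is illustrative only.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ParseThreshold decodes a T-value of 1 to 14 hex digits into a 56-bit
// Threshold, padding with trailing zeros. (Hypothetical helper name.)
func ParseThreshold(s string) (uint64, error) {
	if len(s) < 1 || len(s) > 14 {
		return 0, fmt.Errorf("threshold %q: want 1 to 14 hex digits", s)
	}
	padded := s + strings.Repeat("0", 14-len(s))
	return strconv.ParseUint(padded, 16, 64)
}

func main() {
	t, err := ParseThreshold("c")
	fmt.Printf("%#x %v\n", t, err) // 0xc0000000000000 <nil>
}
```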

When Threshold is not provided, no information about probability
sampling is available.

### Sampling randomness

When determining the Randomness value from an item of telemetry,
sampler implementations SHOULD:
Member

Suggested change
sampler implementations SHOULD:
sampler implementations SHOULD evaluate the following in order:


- use the `tracestate` OpenTelemetry R-value field (`rv`) if it is present (spans only), or
- use the `sampling.randomness` attribute value if it is present (logs only), or
- use the least significant 56 bits of the W3C Trace Context TraceID, as described in the W3C Trace Context Level 2 specification.

In the first two cases, where Randomness is explicitly encoded, the
value is represented by exactly 14 hexadecimal digits.
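
The lookup order above can be sketched in Go as follows. This is an illustration under stated assumptions: the caller passes the explicit value (the `rv` field for spans or the `sampling.randomness` attribute for logs) as a string, with the empty string meaning "absent", and the names `Randomness`, `ParseRandomness`, and `RandomnessFromTraceID` are hypothetical.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"strconv"
)

// RandomnessFromTraceID returns the least significant 56 bits of a 16-byte TraceID.
func RandomnessFromTraceID(traceID [16]byte) uint64 {
	return binary.BigEndian.Uint64(traceID[8:]) & (1<<56 - 1)
}

// ParseRandomness decodes an explicit R-value of exactly 14 hex digits.
func ParseRandomness(s string) (uint64, error) {
	if len(s) != 14 {
		return 0, fmt.Errorf("randomness %q: want exactly 14 hex digits", s)
	}
	return strconv.ParseUint(s, 16, 64)
}

// Randomness prefers the explicit value when present and otherwise falls
// back to the TraceID.
func Randomness(explicit string, traceID [16]byte) (uint64, error) {
	if explicit != "" {
		return ParseRandomness(explicit)
	}
	return RandomnessFromTraceID(traceID), nil
}

func main() {
	var id [16]byte
	if _, err := hex.Decode(id[:], []byte("4bf92f3577b34da6a3ce929d0e0e4736")); err != nil {
		panic(err)
	}
	r, err := Randomness("", id)
	fmt.Printf("%#x %v\n", r, err) // 0xce929d0e0e4736 <nil>
}
```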

Sampler implementations SHOULD NOT require trace flags to have the Trace
Context Level 2 Random flag set, in case the Trace ID is used as the
source of randomness. Because the Random flag is not widely available
at this time, and because the W3C Trace Context Level 2 specification
was designed for widespread compliance with existing systems, it is
recommended to assume there are 56 bits of randomness.

In case a system knowingly uses TraceIDs that do not conform to the
W3C Trace Context Level 2 specification and they wish to perform
sampling with OpenTelemetry components, they SHOULD synthesize a
random R-value and store it in the `tracestate` (Spans) or the
`sampling.randomness` (Log Records) attribute value.
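
One possible way to synthesize such an R-value is sketched below, drawing 56 bits from the operating system's random source. The function name `NewRandomness` is hypothetical, and using `crypto/rand` is an assumption of this example, not a requirement stated here.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// NewRandomness returns 14 lowercase hex digits (56 bits) of random data,
// suitable for use as an explicit R-value. (Hypothetical helper name.)
func NewRandomness() (string, error) {
	var b [7]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	return hex.EncodeToString(b[:]), nil
}

func main() {
	rv, err := NewRandomness()
	if err != nil {
		panic(err)
	}
	fmt.Println("tracestate: ot=rv:" + rv)   // for Spans
	fmt.Println("sampling.randomness: " + rv) // for Log Records
}
```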

### No definition for Scope and Resource attributes

We recognize that in some configurations, sampling probability and
even sampling randomness may be set to a constant value.

The `sampling.threshold` and `sampling.randomness` attributes are not
defined for use as Scope or Resource attributes in the present
specification, because it would lead to ambiguity when
`sampling.priority` is also used.

## Span sampling attributes

The following attributes are recognized for Spans. Note that the
equivalents for `sampling.threshold` and `sampling.randomness` are
stored in the `tracestate` for Spans.

<!-- semconv traces.sampling(full) -->
| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| [`sampling.priority`](../attributes-registry/sampling.md) | int | Allows users and instrumentations to prioritize collection of reported telemetry items. [1] | `10`; `1`; `0` | Opt-In |

**[1]:** If greater than 0, a hint to the Tracer to do its best to capture the trace. If 0, a hint to the Tracer not to capture the trace. If absent, the Tracer should use its default sampling mechanism.
<!-- endsemconv -->

## Logs sampling attributes

The following attributes are recognized for Logs.

<!-- semconv logs.sampling(full) -->
| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| [`sampling.priority`](../attributes-registry/sampling.md) | int | Allows users and instrumentations to prioritize collection of reported telemetry items. [1] | `10`; `1`; `0` | Opt-In |
| [`sampling.randomness`](../attributes-registry/sampling.md) | string | The source of randomness for making probability sampling decisions, when it is not otherwise recorded. [2] | `ce929d0e0e4736` | Conditionally Required: [3] |
| [`sampling.threshold`](../attributes-registry/sampling.md) | string | Sampling probability as specified by OpenTelemetry. [4] | `c`; `ff8` | Conditionally Required: [5] |

**[1]:** If greater than 0, a hint to the Tracer to do its best to capture the trace. If 0, a hint to the Tracer not to capture the trace. If absent, the Tracer should use its default sampling mechanism.

**[2]:** This attribute is an optional way to express trace randomness, especially for cases where the TraceID is missing or known to be not random. Sampler components set and consume this value. The value is a hex-coded string containing 14 hex digits (56 bits) of randomness. Setting this attribute indicates the source of randomness that was used (and may be used again) for probability sampling. This field is taken to have the same meaning as the OpenTelemetry tracestate "R-value" for probability sampling, which is an alternative to deriving trace randomness from the TraceID specified in OTEP 235.

**[3]:** When a `sampling.threshold` is provided, the corresponding 56-bit randomness value is also recorded.

**[4]:** This attribute is set to convey sampling probability. Sampler components set and consume this value, which is taken to have the same meaning as the OpenTelemetry tracestate "T-value" for probability sampling. This attribute contains a hexadecimal-coded value containing 1 to 14 hex digits of precision, defining the threshold used to reject, depending on the random variable. This value can be converted into sampling probability as specified in OTEP 235.

**[5]:** When a 56-bit consistent probability sampler is used.
<!-- endsemconv -->

## Examples

### Head sampling

For example, a span that was selected by a 25% probability sampler
using randomness from the TraceID, has selected field values like:

```
trace_id: 4bf92f3577b34da6a3ce929d0e0e4736
tracestate: ot=th:c
```

We can verify that the sampling decision was made correctly as follows.

The trailing 14 hex-digits of randomness are extracted from the
TraceID, forming the Randomness value `0xce929d0e0e4736`. The T-value
`c` is extended with 13 zeros, forming the Threshold value
`0xc0000000000000`. Since `T <= R` is true, the span was correctly sampled.

For a log record, which does not include the `tracestate` field, the
same can be expressed as:

```
trace_id: 4bf92f3577b34da6a3ce929d0e0e4736
attributes:
  sampling.threshold: c
```

A log record that does not define the trace_id and was sampled by a
probability sampler requires explicit randomness. For example:

```
attributes:
  sampling.threshold: c
  sampling.randomness: ce929d0e0e4736
```

### Tail sampling

A span that is received with no sampling information (i.e., no
`tracestate` field) is selected by a tail sampler at 10% probability.
A `tracestate` entry is created:

```
trace_id: 4bf92f3577b34da6a3fe929d0e0e4736
tracestate: ot=th:e66
```

A log record containing a TraceID is received with no sampling
attributes and is selected by a tail sampler at 10% probability. A
sampling threshold is inserted:

```
trace_id: 4bf92f3577b34da6a3fe929d0e0e4736
attributes:
  sampling.threshold: e66
```

In both cases, the Threshold value `e66` corresponds with rejecting a
fraction equal to `0xe66 / 0x1000`, or 0.89990234375, which leaves a
sampling probability of 0.10009765625. Had 5 digits of precision been
used (`e6666`), the sampling probability would be approximately
0.10000038147.
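
The encoding in the opposite direction, from a probability to a T-value with limited precision, might look like the Go sketch below. The name `ThresholdFromProbability` is hypothetical, and truncating (not rounding) the extra digits is a simplifying assumption of this example; it reproduces `e66` and `e6666` for a 10% probability.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

const maxAdjustedCount = 1 << 56

// ThresholdFromProbability encodes probability p as a T-value with at most
// `precision` hex digits (1 to 14). Extra digits are truncated, which is a
// simplification chosen for this sketch.
func ThresholdFromProbability(p float64, precision int) (string, error) {
	if !(p > 0 && p <= 1) || precision < 1 || precision > 14 {
		return "", fmt.Errorf("invalid probability %v or precision %d", p, precision)
	}
	t := uint64(math.Ldexp(1-p, 56)) // (1 - p) * 2^56
	if t >= maxAdjustedCount {
		return "", fmt.Errorf("probability %v is too small to encode", p)
	}
	// Keep the leading `precision` hex digits, then drop trailing zeros.
	digits := fmt.Sprintf("%0*x", precision, t>>(4*(14-precision)))
	digits = strings.TrimRight(digits, "0")
	if digits == "" {
		return "0", nil // zero Threshold means 100% sampling
	}
	return digits, nil
}

func main() {
	for _, precision := range []int{3, 5} {
		th, err := ThresholdFromProbability(0.10, precision)
		fmt.Println(th, err) // e66 <nil>, then e6666 <nil>
	}
}
```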

[DocumentStatus]: https://github.com/open-telemetry/opentelemetry-specification/tree/v1.26.0/specification/document-status.md
[OTEP235]: https://github.com/open-telemetry/oteps/blob/main/text/trace/0235-sampling-threshold-in-trace-state.md
13 changes: 13 additions & 0 deletions model/logs/sampling.yaml
@@ -0,0 +1,13 @@
groups:
  - id: logs.sampling
    type: attribute_group
    brief: 'Semantic convention describing trace attributes related to sampling.'
    attributes:
      - ref: sampling.priority
        requirement_level: opt_in
      - ref: sampling.randomness
        requirement_level:
          conditionally_required: When a `sampling.threshold` is provided, the corresponding 56-bit randomness value is also recorded.
      - ref: sampling.threshold
        requirement_level:
          conditionally_required: When a 56-bit consistent probability sampler is used.