# Refactor the probabilistic sampler processor; add FailClosed configuration, prepare for OTEP 235 support (#31946)
New changelog entry:

```yaml
# Use this changelog template to create an entry for release notes.

# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
change_type: enhancement

# The name of the component, or a single word describing the area of concern, (e.g. filelogreceiver)
component: probabilisticsamplerprocessor

# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
note: Adds the `FailClosed` flag to solidify current behavior when randomness source is missing.

# Mandatory: One or more tracking issues related to the change. You can use the PR number here if no issue exists.
issues: [31918]

# (Optional) One or more lines of additional information to render under the primary note.
# These lines will be padded with 2 spaces and then inserted directly into the document.
# Use pipe (|) for multiline entries.
subtext:

# If your change doesn't affect end users or the exported elements of any package,
# you should instead start your pull request title with [chore] or use the "Skip Changelog" label.
# Optional: The change log or logs in which this entry should be included.
# e.g. '[user]' or '[user, api]'
# Include 'user' if the change is relevant to end users.
# Include 'api' if there is a change to a library API.
# Default: '[user]'
change_logs: [user]
```
[contrib]: https://github.com/open-telemetry/opentelemetry-collector-releases/tree/main/distributions/otelcol-contrib
<!-- end autogenerated section -->
The probabilistic sampler processor supports several modes of sampling
for spans and log records. Sampling is performed on a per-request
basis, considering individual items statelessly. For whole-trace
sampling, see the
[tailsamplingprocessor](../tailsamplingprocessor/README.md).

For trace spans, this sampler supports probabilistic sampling based on
a configured sampling percentage applied to the TraceID. In addition,
the sampler recognizes a `sampling.priority` annotation, which can
force the sampler to apply 0% or 100% sampling.

For log records, this sampler can be configured to use the embedded
TraceID and follow the same logic as applied to spans. When the
TraceID is not defined, the sampler can be configured to apply hashing
to a selected log record attribute. This sampler also supports
sampling priority.
## Consistency guarantee

A consistent probability sampler is a Sampler that supports
independent sampling decisions for each span or log record in a group
(e.g. by TraceID), while maximizing the potential for completeness as
follows.

Consistent probability sampling requires that for any span in a given
trace, if a Sampler with lesser sampling probability selects the span
for sampling, then the span would also be selected by a Sampler
configured with greater sampling probability.
## Completeness property

A trace is complete when all of its members are sampled. A
"sub-trace" is complete when all of its descendants are sampled.

Ordinarily, Trace and Logging SDKs configure parent-based samplers,
which decide to sample based on the Context, because doing so leads to
completeness.

When non-root spans or logs make independent sampling decisions
instead of using the parent-based approach (e.g., using the
`TraceIDRatioBased` sampler for a non-root span), incompleteness may
result, and when spans and log records are independently sampled in a
processor, as by this component, the same potential for
incompleteness arises. The consistency guarantee helps minimize this
issue.

Consistent probability samplers can be safely used with a mixture of
probabilities and preserve sub-trace completeness, provided that child
spans and log records are sampled with probability greater than or
equal to that of the parent context.
Using 1%, 10% and 50% probabilities for example, in a consistent
probability scheme the 50% sampler must sample when the 10% sampler
does, and the 10% sampler must sample when the 1% sampler does. A
three-tier system could be configured with 1% sampling in the first
tier, 10% sampling in the second tier, and 50% sampling in the bottom
tier. In this configuration, 1% of traces will be complete, 10% of
traces will be sub-trace complete at the second tier, and 50% of
traces will be sub-trace complete at the third tier, thanks to the
consistency property.

These guidelines should be considered when deploying multiple
collectors with different sampling probabilities in a system. For
example, a collector serving frontend servers can be configured with a
smaller sampling probability than a collector serving backend servers,
without breaking sub-trace completeness.
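The frontend/backend guideline above can be sketched as two collector
configurations. The tier split and the percentage values here are
illustrative assumptions, not requirements of this component:

```yaml
# Hypothetical frontend-tier collector config: smaller probability.
processors:
  probabilistic_sampler:
    sampling_percentage: 10
---
# Hypothetical backend-tier collector config: larger probability.
# With consistent sampling, items kept by the 10% tier would also
# be kept by this 50% tier, preserving sub-trace completeness.
processors:
  probabilistic_sampler:
    sampling_percentage: 50
```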
## Sampling randomness

To achieve consistency, sampling randomness is taken from a
deterministic aspect of the input data. For traces pipelines, the
source of randomness is always the TraceID. For logs pipelines, the
source of randomness can be the TraceID or another log record
attribute, if configured.

For log records, the `attribute_source` and `from_attribute` fields
determine the source of randomness used for log records. When
`attribute_source` is set to `traceID`, the TraceID will be used.
When `attribute_source` is set to `record`, or when the TraceID field
is absent, the value of `from_attribute` is taken as the source of
randomness (if configured).
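For example, a logs pipeline can draw its randomness from a record
attribute. In this sketch the attribute name `logID` and the
percentage are illustrative:

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 25
    attribute_source: record # use a log record attribute, not the TraceID
    from_attribute: logID    # illustrative attribute name
```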
## Sampling priority

The sampling priority mechanism is an override, which takes precedence
over the probabilistic decision in all modes.

🛑 Compatibility note: Logs and Traces have different behavior.

In traces pipelines, when the priority attribute has value 0, the
configured probability will be modified to 0% and the item will not
pass the sampler. When the priority attribute is non-zero, the
configured probability will be set to 100%. The sampling priority
attribute is not configurable, and is called `sampling.priority`.

In logs pipelines, when the priority attribute has value 0, the
configured probability will be modified to 0%, and the item will not
pass the sampler. Otherwise, the logs sampling priority attribute is
interpreted as a percentage, with values >= 100 equal to 100%
sampling. The logs sampling priority attribute is configured via
`sampling_priority`.

> **Review discussion:** Is there a reason for this mismatch?
>
> I was very surprised by this. I have preserved the inconsistency and
> I am not sure what we should do about it. We could unify the
> solutions by varying behavior according to the type of the
> attribute: if numeric, it's the priority, and if a string, it's the
> name of the numeric attribute containing the priority. Personally,
> I'd prefer to choose one for the long term. It would be just as
> valid to do it the other way, if we preferred less configuration.
>
> This might be the perfect opportunity for this change, but not
> necessarily part of this PR.
## Sampling algorithm

### Hash seed

The hash seed method uses the FNV hash function applied to either a
Trace ID (spans, log records) or to the value of a specified attribute
(only logs). The hashed value, presumed to be random, is compared
against a threshold value that corresponds with the sampling
percentage.

This mode requires configuring the `hash_seed` field. This mode is
enabled when the `hash_seed` field is not zero, or when log records
are sampled with `attribute_source` set to `record`.

In order for hashing to be consistent, all collectors for a given tier
(e.g. behind the same load balancer) must have the same
`hash_seed`. It is also possible to leverage a different `hash_seed`
at different collector tiers to support additional sampling
requirements.

This mode uses 14 bits of sampling precision.
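The two hash-seed deployment patterns above can be sketched as
follows; the seed and percentage values are illustrative assumptions:

```yaml
# Collectors within one tier share the same seed, so their
# hashing (and hence sampling decisions) are consistent:
processors:
  probabilistic_sampler:
    sampling_percentage: 10
    hash_seed: 22
---
# A different tier may use a different seed to apply an
# independent sampling decision to the surviving items:
processors:
  probabilistic_sampler:
    sampling_percentage: 50
    hash_seed: 23
```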
### Error handling

This processor considers it an error when the arriving data has no
randomness. This includes conditions where the TraceID field is
invalid (16 zero bytes) and where the log record attribute source has
zero bytes of information.

By default, when there are errors determining sampling-related
information from an item of telemetry, the data will be refused. This
behavior can be changed by setting the `fail_closed` property to
false, in which case erroneous data will pass through the processor.
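A minimal sketch of a fail-open configuration (the percentage is
illustrative), letting items without randomness pass through instead
of being refused:

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 25
    fail_closed: false # pass items through when randomness cannot be determined
```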
## Configuration

The following configuration options can be modified:

- `sampling_percentage` (32-bit floating point, required): Percentage at which items are sampled; >= 100 samples all items, 0 rejects all items.
- `hash_seed` (32-bit unsigned integer, optional, default = 0): An integer used to compute the hash algorithm. Note that all collectors for a given tier (e.g. behind the same load balancer) should have the same hash_seed.
- `fail_closed` (boolean, optional, default = true): Whether to reject items with sampling-related errors.

> **Review discussion:** Is this a change in behavior from the current
> implementation?
>
> The current implementation has an odd inconsistency, which this is
> meant to resolve in a user-configurable way. In cases where there is
> no randomness, the sampling decision is fixed and therefore depends
> on the configured probability. For some probabilities the failure
> would be open, and for other probabilities the failure would be
> closed, depending on the hash seed. A test demonstrates that empty
> TraceIDs (16 bytes of 0s) are sampled at 56.8% and above, and logs
> with missing attribute values (0 bytes) are sampled at 82.9% and
> above. Since most users are expected to configure probabilities at
> 50% or below, most users would never see these items of telemetry
> and never knew there was a problem. This justifies the decision to
> use `FailClosed=true` by default: users would have had to be
> sampling above 56% to see these records before, with the default
> seed.

Examples:

```yaml
processors:
  probabilistic_sampler:
    hash_seed: 22
    sampling_percentage: 15.3
```
### Logs-specific configuration

The probabilistic sampler supports sampling logs according to their
trace ID, or by a specific log record attribute.

- `attribute_source` (string, optional, default = "traceID"): Defines where to look for the attribute named by `from_attribute`. The allowed values are `traceID` or `record`.
- `from_attribute` (string, optional, default = ""): The name of a log record attribute used for sampling purposes, such as a unique log record ID. The value of the attribute is only used if the trace ID is absent or if `attribute_source` is set to `record`.
- `sampling_priority` (string, optional, default = ""): The name of a log record attribute used to set a different sampling priority from the `sampling_percentage` setting. 0 means to never sample the log record, and >= 100 means to always sample the log record.
Examples:

Sample 15% of log records according to trace ID using the OpenTelemetry
specification:

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 15
```

Sample log records by the value of a record attribute:

```yaml
processors:
  probabilistic_sampler:
    # ...
    attribute_source: record
    from_attribute: logID # value is required if the source is not traceID
```
Give sampling priority to log records according to the attribute named
`priority`:

```yaml
processors:
  probabilistic_sampler:
    # ...
    sampling_priority: priority
```
## Detailed examples

Refer to [config.yaml](./testdata/config.yaml) for detailed examples
on using the processor.
> **Review discussion:** This would also deserve some explanation:
> what's a sampler here? It wasn't defined yet. Is this about chaining
> collectors together, where the first samples 100%, and the second
> 90%, and the assumption is that all those 80% would be within the
> 90%?
>
> See revised text: 594852f.
>
> Alright. I think we might want to make it easier to digest in the
> future, but let's wait for users to provide feedback.