Adaptive Sampling #365

yurishkuro · 2017-09-01T19:09:46Z

Problem

The most common way of using Jaeger client libraries is with probabilistic sampling which makes a determination if a new trace should be sampled or not. Sampling is necessary to control the amount of tracing data reaching the storage backend. There are two issues with the current approach:

The individual microservices have little insight into what the appropriate sampling rate should be. For example, 0.001 probability (one trace per second per service instance) might seem reasonable, but if the fanout in some downstream services is very high it might flood the tracing backend.
Sampling rates are defined on a per-service basis. If a service has two endpoints with vastly different throughputs, then its sampling rate will be driven on the high QPS endpoint, which may leave the low QPS endpoint never sampled. For example, if the QPS of the endpoints is different by a factor of 100, and the probability is set to 0.001, then the low QPS traffic will have only 1 in 100,000 chance to be sampled.

Proposed Solution

The adaptive sampling is a solution that addresses these issues by:

Assigning sampling probabilities on a service + endpoint basis rather than just the service
Using a lower bound rate limiter to ensure that all endpoints are sampled with a certain minimal rate
Observing the impact of sampling rates on the overall number of traces sampled from a service and dynamically adjusting the per-endpoint sampling rates to meet certain target rates.

Status

Pending open-source of the backend functionality. Client work is done.

sampling data store (Cassandra only) Start moving components of adaptive sampling to OSS #973
adaptive sampling calculator Move adaptive sampling processor #1179
wire up the components from collector/main

robdefeo · 2018-04-19T20:42:01Z

Any idea when the backend functionality will be opensourced?

billowqiu · 2018-05-15T03:33:41Z

"Adaptive Sampling" is now ok in backend?

yurishkuro · 2018-05-15T03:36:54Z

it's coming soon, @black-adder just finished rolling it out internally to all services, so it just needs a bit clean-up (from any internal deps) to move to open source.

billowqiu · 2018-05-15T03:41:19Z

thks @yurishkuro , i am investigating the jaeger and zipkin。

trtg · 2018-06-19T01:37:51Z

@yurishkuro any progress on this being released?

yurishkuro · 2018-06-19T04:28:22Z

question to @black-adder , "he is a-cooking something up"

sergeyklay · 2018-08-07T14:47:30Z

@black-adder any news?

black-adder · 2018-08-07T14:53:15Z

sorry all, I just started to move the pieces over, hopefully we'll have the whole thing in OSS this week.

agxp · 2018-10-11T16:46:36Z

What's the status on this? I would like to configure Jaeger to sample all traces on low load, and on high load sample at a certain probability. It doesn't seem possible currently. Thanks.
@black-adder

pravarag · 2018-12-19T11:59:43Z

Hi, is adaptive sampling still under way? I'm really eager to try it out for my sample app.

yurishkuro · 2018-12-20T21:43:28Z

The PRs are in progress/review. Unfortunately, a higher priority project has delayed this.

Glowdable · 2019-01-17T03:58:17Z

It's very important

capescuba · 2019-02-06T23:14:47Z

Another check in on progress for this. We are considering an implementation of this and this would be a great feature add.

csurfleet · 2019-03-29T15:37:27Z

Any further news on this?

m1o1 · 2019-04-01T16:30:51Z

This feature would help us as well - it seems that under high load, some spans are being dropped / never received by ES (since we are trying to sample all traces currently). We are hoping to sample 100% of "unique" traces (similar to what differentiates traces in "Compare" in the UI) in the last X amount of time. Heard about this idea from an OpenCensus talk, which sounded like they're working on a similar feature in their agent service.

wuyupengwoaini · 2019-04-18T07:58:49Z

Any further news on this? I am looking forward to this feature

yurishkuro · 2019-04-21T23:27:30Z

the main code has been merged, pending wiring into collector's main

trtg · 2019-04-22T02:08:01Z

@capescuba @wuyupengwoaini @adinunzio84 @csurfleet not sure if this helps in your current context, but the way we've been implementing an approximation of this feature for our usecase (moderately high throughput- 10s of thousands of requests per second) is we set keys in redis that control sampling through the use of the sampling priority debug header. In other words we set the default probabilistic sampling rate to zero and the code checks a set of redis keys to know whether it should sample. So for example in the context of my particular scenario we get requests from many different applications and devices, and in redis we set keys denoting which apps or devices we want to trace and what percentage of requests for those apps we want to trace. So if we have a specific issue to debug we set redis keys to trace 100% of requests from the problematic app and some low percentage of requests from apps that we are passively monitoring
. For example in Python we dynamically force tracing based on redis keys by setting this tag:

span.set_tag(ext_tags.SAMPLING_PRIORITY, 1)

wuyupengwoaini · 2019-04-24T09:49:15Z

@trtg Thanks for your advice. In fact, I think the sampling rate strategy can be divided into three steps: 1. The sampling rate is configurable. That is to say there is a system to configure the sampling rate of each interface in each service. 2. The configuration can be dynamically activated (can be implemented by means of a configuration center or the like) 3. The system automatically configures the sampling rate dynamically according to the pressure of the jaeger backend.

Dynamically configuring the sample rate for your use of redis is actually the second step I mentioned above. This can modify the sampling rate in real time, but it is enough.

csurfleet · 2019-05-08T10:45:13Z

Hey all, I've created a nuget package that allows per-request sampling on anything in the incoming HttpRequest, feel free to have a play. Any suggestions etc either let me know or send me a PR ;) I'm using this stuff in production for one app so far, and I'll try to add further stuff later on:
https://github.com/csurfleet/JaegerSamplingFilters
https://www.nuget.org/packages/JaegerSamplingFilters/

DannyNoam · 2019-05-16T09:33:19Z

Hi @yurishkuro! Just reading about the different sampling strategies in Jaeger, and it's slightly unclear to me whether Adaptive Sampling would reference a central configuration (which would be somewhat less verbose than in the 'sampling strategies' file), or whether we'd specify the sampling rate in the service (and that the adaptive sampler would ensure lower QPS endpoints have their fair share of traces). Cheers!

yurishkuro · 2019-06-02T21:32:31Z

@DannyNoam This ticket refers to dynamic adaptive sampling where strategies are automatically calculated based on observed traffic volumes. In jaeger-collector the only configuration would be "target rate of sampled traces / second" (in the current code it's a single global setting, but we're looking to extend it to be configurable per service/endpoint).

rscott231 · 2019-06-14T15:13:51Z

the main code has been merged, pending wiring into collector's main

@yurishkuro any timeline on when this will be wired into collector's main?

lookfwd · 2019-06-26T14:58:36Z

Sampling rates are defined on a per-service basis. If a service has two endpoints with vastly different throughputs

I was thinking that with a small generalization, one could extend the adapting sampling idea to something between head-based and tail-based sampling that also detects outliers and has other benefits of tail-based sampling while requiring fewer computational resources.

As part of the adapting sampling, sampling information is communicated from the collector to the agent.

The format of the communicated data has, I guess, format similar to the Collector Sampling Configuration. In the documentation there's this example:

{
  "service_strategies": [
    {
      "service": "foo",
      "type": "probabilistic",
      "param": 0.8,
      "operation_strategies": [
        {
          "operation": "op1",
          "type": "probabilistic",
          "param": 0.2
        },
        {
          "operation": "op2",
          "type": "probabilistic",
          "param": 0.4
        }
      ]
    },
    {
      "service": "bar",
      "type": "ratelimiting",
      "param": 5
    }
  ],
  "default_strategy": {
    "type": "probabilistic",
    "param": 0.5
  }
}

We can change it slightly to this:

{
  "sampling-rules": [
    {"name": "foo_op_1",
     "condition": {"and" : [
       {"==": [{"var" : "span.process.serviceName" }, "foo"]},
       {"==": [{"var" : "span.operationName"}, "op1"]}]
     },
     "strategy": {
       "type": "probabilistic",
       "param": 0.2
     }
    },
    {"name": "foo_op_2",
     "condition": {"and" : [
       {"==": [{"var" : "span.process.serviceName" }, "foo"]},
       {"==": [{"var" : "span.operationName"}, "op2"]}]
     },
     "strategy": {
       "type": "probabilistic",
       "param": 0.4
     }
    },
    {"name": "bar",
     "condition": {"==": [{"var" : "span.process.serviceName" }, "bar"]},
     "strategy": {
       "type": "ratelimiting",
       "param": 5
     }
    },
    {"name": "default",
     "strategy": {
       "type": "probabilistic",
       "param": 0.5
     }
    } 
  ],
  "tables": ...
}

What do we see here?

There's a name to every rule that helps with debugging.
There's a condition to each rule. I use JsonLogic.
If multiple rules match, the first one wins i.e. order within sampling-rules matters.
There's a "tables" field that we will get to in a bit

This small extension enables the client and the agent to do powerful cherry-picking of what to sample.

The rules defined above are evaluated at any point in time a Span changes. More specifically, Spans in Jaeger, can be represented as JSONs that adhere to this model. Examples can be found in example_trace.json:

    {
      "traceId": "AAAAAAAAAAAAAAAAAAAAEQ==",
      "spanId": "AAAAAAAAAAM=",
      "operationName": "example-operation-1",
      "references": [],
      "startTime": "2017-01-26T16:46:31.639875Z",
      "duration": "100000ns",
      "tags": [],
      "process": {
        "serviceName": "example-service-1",
        "tags": []
      },
      "logs": [
        {
          "timestamp": "2017-01-26T16:46:31.639875Z",
          "fields": []
        },
        {
          "timestamp": "2017-01-26T16:46:31.639875Z",
          "fields": []
        }
      ]
    }

Pieces of information are collected at different points in time. e.g. the duration isn't known till the end of the Span, while operationName is known at the beginning. At any point there's a change the rules are re-evaluated and they can yield a "sample" decision. Examples of when a Span model might be updated are:

start
finish
When new tags and logs are added, including when error conditions are detected
carrier injection (spawning children, at which point, depending on the RPC mechanism, a Span might be able to get additional tags like target/child serviceName, operationName, host etc. - since they're likely known e.g. when I do an http request, I know the target endpoint).
RPC timeout or
RPC call return , at which point a Span might get to know if the child decided to sample itself and/or if there were any errors downstream.

The above might seem as high CPU overhead, but I can imagine many optimizations that return quickly if there are no rules that match a Span change. JIT compilation of the rules might also be a way to accelerate further. The high performance of Chrome, DOM and V8 indicate that we might be able to have fast implementations.

Here are some basic examples. To evaluate them, I wrap the example Span presented above to a {"span": ... } and use the JsonLogic prompt one can find here.

The following condition evaluates to true for the example Span:

{"and" : [
  {"==": [{"var" : "span.process.serviceName" }, "example-service-1"]},
  {"==": [{"var" : "span.operationName"}, "example-operation-1"]}]
}

There are two default "force sample" rules in Jaeger. Can we implement them using this framework? Yes:

{"some" : [ {"var" : "span.tags" },
            {"==": [{"var":"key"}, "jaeger-debug-id"]} ]}

and

{"some" : [ {"var" : "span.tags" },
            {"and": [{"==": [{"var":"key"}, "SAMPLING_PRIORITY"]},
                     {">": [{"var":"value"}, 0]}]} ]}

In those two cases the strategy of the rule could return const: 1 which will force sampling.

There are two safety conditions we would like to guarantee:

No matter how bad the configuration, the sampler shouldn’t slow down the collection process
No matter how bad the configuration, the sampler shouldn’t overwhelm the infrastructure

As a result, proper engineering should be put in place. The language of JsonLogic isn't turning complete but still, there could be problems if the Spans have tags with long string values, long arrays or the configuration has rules that are overly long. Rules can be auto-generated, e.g. by some real-time trend analysis system, so it's wise to put some controls in the agent/client that reject configurations that might be slow. On the second point, above, if the rules end-up sampling everything while our infrastructure can handle just 1:100 sampling, we wouldn't like to overwhelm infra. As a result, we might need some form of cascading rules or global mechanism that limits total throughput.

Can we use this form of adaptive sampling to implement rules that sample outliers in terms of duration? Yes. Here's such a rule:

{">": [
  {"var": "span.duration"},
  {"var": [{"cat": ["tables.d_percentiles",
                 ".", {"var" : "span.process.serviceName" },
                 ".", {"var" : "span.operationName" }
                 ]}, 10000000]}]}

We can test it with a (simplified) example span:

{
  "span":     {
      "operationName": "example-operation-1",
      "duration": 100000,
      "process": {
        "serviceName": "example-service-1"
      }
    },
    "tables": {
        "d_percentiles": {
            "example-service-1": {
                "example-operation-1": 50000
            }
        }
    }
}

Note that we now need a tables.d_percentiles dict that can be declared in the configuration. This dict can be dynamically calculated by an "adaptive sampling calculator". It defines a per end-point expected maximum duration, beyond which sampling will be forced (e.g. by a const: 1 strategy). We might want to tag this traffic not just as sampled but also as force sampled so that we ignore it while we calculate percentiles etc. Uniform Sampling can still be used for percentiles and other such statistics, while this extra type of traffic might be used to ensure we don't miss important outliers.

Since we now define and use a Span model, it would be good to also formalize a few more attributes it could have and be potentially useful. For example, we can have "span.parent.operationName", " span.parent.parent.process.seviceName" of a Span (parent might need to be an array - but let's skip this for now). Those don't need to be supported for any middleware or configuration, but if a middleware supports passing downstream some tags or Span attributes (e.g. through baggage) in-band, it's nice to know where to find them. With this extension, one can write rules that sample at a given operation and parent operation. As mentioned before, it might be possible to do this also, when you make the request, on the caller, instead of on the child-span, but parent provides another way to do this.

As a natural extension on the above, we can define an attribute span.child. child might also be an array and have arbitrary attributes but let's skip this for now. An important attribute might be span.child.sampled. This will be true if a child decided to sample itself and false otherwise. This makes sense, because under the extensions we described above, a Span can decide to sample itself at several different points in time (instead of just the beginning) if it detects interesting conditions. Many types of RPC middleware are able to return information upstream e.g. HTTP response headers. This means that a Span might be updated with an up-to-date value for span.child.sampled when a response from a sampled child span is detected. This could be triggered by an error condition on the child, in which case if the current span also decides to sample and propagate upstream, we can have the "call stack" and also sample any subsequent Spans. Those traces might have unusual shape because of missing information and highly likely shouldn't be used for aggregate analytics but they can still be used for alerting and debugging.

yurishkuro · 2019-08-09T03:48:32Z

@lookfwd great proposal! We're actively discussing it right now for an internal project. One major issue we bumped into with it is this: when possible strategies are represented as a list and need to be matched one by one for a given span, it works pretty well if the match process runs only once. But in your proposal there's no specific demarcation event that tells the tracer "do it now", instead matching can run multiple times as tags are being added to the span. The problem is that a list of strategies would typically include a default fallback strategy when nothing custom matches, and the default strategy will always match, even on the first try, so there won't be time to set span tags and potentially match any other custom strategies.

We could introduce the demarcation event artificially, e.g. as some static method in Jaeger or a special span tag that can be used to signal "apply sampling rules now". But it's kind of ugly and requires additional instrumentation in the code.

Thoughts?

lookfwd · 2019-09-06T13:14:28Z

@yurishkuro - sorry, I missed it. The points when one needs to know if sampling is true or false, is when the context is about to be injected to a carrier or when one is about to finish the Span. If those are used as "apply sampling rules now" trigger(s) i.e. lazily evaluating should_sample() when one needs it, should give good full or partial traces as far as I can see.

More specifically I would expect the user/framework to set all the tags before it injects or finish()'s.

default fallback strategy when nothing custom matches, and the default strategy will always match, even on the first try

If the default strategy is a lower bound rate limiter, it should be sufficient to inform about the existence of "weird" spans that e.g. set tags after injecting, without flooding the system.

yurishkuro · 2019-09-07T14:47:56Z

@lookfwd you might be interested in these two PRs (jaegertracing/jaeger-client-node#377, jaegertracing/jaeger-client-node#380), which introduce a shared sampling state for spans in-process and allow delaying sampling decision.

stevejobsmyguru · 2020-02-02T15:24:24Z

@lookfwd great proposal! We're actively discussing it right now for an internal project. One major issue we bumped into with it is this: when possible strategies are represented as a list and need to be matched one by one for a given span, it works pretty well if the match process runs only once. But in your proposal there's no specific demarcation event that tells the tracer "do it now", instead matching can run multiple times as tags are being added to the span. The problem is that a list of strategies would typically include a default fallback strategy when nothing custom matches, and the default strategy will always match, even on the first try, so there won't be time to set span tags and potentially match any other custom strategies.

We could introduce the demarcation event artificially, e.g. as some static method in Jaeger or a special span tag that can be used to signal "apply sampling rules now". But it's kind of ugly and requires additional instrumentation in the code.

Thoughts?

I have some thoughts in high level in the following lines:

At the Collector level, For each Host level, we can run some kind of baseline calculation on key KPIs like # of Error, 90th Response Time or Throughput for every < X> Observation window. This observation interval can be like for every 5 min or every 10 mins or < auto-calculated>. This Observation window can be auto-calculated from Throughput ( or Operation /Sec at every Service Level). If the base line is breached from previous observation window, then we can trigger adaptive sampling from Collector to Agent thru callback function to Agent for each service level. I mean Agent will send to collector, If all is safe within the baseline, Agent will purge within its perimeter. The point here is, there may be broken parent or broken childs (or Spans) for completing end-2-end transaction. Of course , it will be complex design though .

@yurishkuro: I just tried my imagination on high level design. Pls. go thru and take your own call.

agaudreault · 2020-06-03T19:15:33Z

I am not sure where to post this (I also asked in Gitter), but I will ask here since it seems like a problem with adaptive sampling and the standard implementation with the client libraries.

In our collector, we define the remote sampling configuration strategies and for some default endpoints (/health, /metrics, etc.), we set the probabilistic sampling rate of 0.0 (Pretty much like the example in the jaeger documentation). From what I found when debugging, the client librairies will create a GuaranteedThroughputSampler whenever a list of operations is received. The default lowerbound value will always be of 0, but when the RateLimiterSampler is created, if the value is smaller than 1, the default value will be 1. This cause the /health, /metrics, etc. endpoints to be sampled at a rate of 1 TPS.

Is there currently a way to disable the sampling for specific endpoints ? Or perhaps it is a bug that that the RateLimiterSampler cannot be instantiated with a value of 0 ?

We are using Jaeger collector 1.8 and the latest release of jaeger-client-java and jaeger-client-go.

yurishkuro · 2020-06-03T21:10:36Z

@agaudreault-jive it's probably a limitation of the data model of GuaranteedThroughputSampler - it only supports probability value per-endpoint, while the lowerbound rate limiter applies across all endpoints.

// OperationSamplingStrategy defines a sampling strategy that randomly samples a fixed percentage of operation traces.
struct OperationSamplingStrategy {
    1: required string operation
    2: required ProbabilisticSamplingStrategy probabilisticSampling
}

// PerOperationSamplingStrategies defines a sampling strategy per each operation name in the service
// with a guaranteed lower bound per second. Once the lower bound is met, operations are randomly sampled
// at a fixed percentage.
struct PerOperationSamplingStrategies {
    1: required double defaultSamplingProbability
    2: required double defaultLowerBoundTracesPerSecond
    3: required list<OperationSamplingStrategy> perOperationStrategies
    4: optional double defaultUpperBoundTracesPerSecond
}

It's possible to extend the model, but will require pretty substantial changes.

BTW, 1 TPS seems very high for lowerbound, we're using several orders of magnitude smaller value.

Mario-Hofstaetter · 2020-11-17T22:38:29Z

Any update on this feature? There are many merged pull-requests. Will this be finished?
Or will this be obsolete / superseeded by work done in the new opentelemetry collector?

Ashmita152 · 2021-02-07T01:48:58Z

Hi @yurishkuro

I don't know if my skillset is good enough to solve this ticket but I would like to take a stab at it. I would like to ask you if you can summarise the work which is remaining here. My understanding from looking at the two PRs is that we just need to invoke the adaptive sampling processor from collector/main.go.

Thank you.

yurishkuro · 2021-02-07T02:22:38Z

@Ashmita152 you're correct, I think pretty much all of the code is already in the repo, it just needs hooking up in the collector and exposing configuration parameters via flags. It would be fantastic if we can get this in, this last piece was outstanding for over 2yrs.

Ashmita152 · 2021-02-07T02:26:00Z

Sure Yuri, I will give it a try. Thank you.

Yolo-Hao · 2021-05-19T03:38:28Z

any news? work now? or any plan?

yurishkuro · 2021-05-19T04:11:04Z

@joe-elliott picked this up in #2966

mrsuperguo · 2021-12-27T03:03:11Z

I've made some try in my project. I combined rate sampling and rate limiter together. At the mean time, I sampled pod cpu&mem status to auto adjust sampling rate. It goes well in my production env.

yurishkuro · 2022-01-17T16:19:37Z

FYI https://medium.com/jaegertracing/adaptive-sampling-in-jaeger-50f336f4334

yurishkuro added the roadmap label Sep 1, 2017

yurishkuro assigned black-adder Sep 1, 2017

yurishkuro mentioned this issue Sep 1, 2017

adaptive sampling #136

Closed

yurishkuro mentioned this issue Nov 2, 2017

Question: Can I configure a sampling rate? opentracing-contrib/nginx-opentracing#8

Closed

black-adder mentioned this issue Aug 7, 2018

Start moving components of adaptive sampling to OSS #973

Merged

black-adder mentioned this issue Nov 13, 2018

Move adaptive sampling processor #1179

Merged

akshayjshah mentioned this issue Jun 19, 2019

tracing: allow specifying parent traceid/spanid when starting a trace cockroachdb/cockroach#19403

Open

yurishkuro mentioned this issue Jul 26, 2019

support jaeger.tags when calling 'GetSamplingStrategy', e.g. return R… #1687

Closed

This was referenced Aug 14, 2019

Move adaptive sampling code out of internal pkg #1706

Merged

Make some utils public to ease transition #1730

Merged

Make operation_strategies part also be part of default_strategy #1749

Merged

yurishkuro mentioned this issue Oct 4, 2019

Added parameterizedSampling.pdf write-up for Discussion on Parameterized or targeted Sampling and Upstream Filtering open-telemetry/opentelemetry-specification#224

Closed

jamieklassen mentioned this issue Jun 2, 2020

describe distributed tracing operations concourse/docs#335

Merged

joe-elliott mentioned this issue Apr 28, 2021

YAAAAS - Yet Another Attempt At Adaptive Sampling #2966

Merged

stnor mentioned this issue May 14, 2021

More "intelligent" sampler option(s) aws-observability/aws-otel-java-instrumentation#56

Closed

yurishkuro closed this as completed in #2966 Sep 8, 2021

Adaptive Sampling #365

Adaptive Sampling #365

Comments

yurishkuro commented Sep 1, 2017 • edited Loading

Problem

Proposed Solution

Status

robdefeo commented Apr 19, 2018

billowqiu commented May 15, 2018

yurishkuro commented May 15, 2018

billowqiu commented May 15, 2018

trtg commented Jun 19, 2018

yurishkuro commented Jun 19, 2018

sergeyklay commented Aug 7, 2018

black-adder commented Aug 7, 2018

agxp commented Oct 11, 2018 • edited Loading

pravarag commented Dec 19, 2018

yurishkuro commented Dec 20, 2018

Glowdable commented Jan 17, 2019

capescuba commented Feb 6, 2019

csurfleet commented Mar 29, 2019

m1o1 commented Apr 1, 2019

wuyupengwoaini commented Apr 18, 2019

yurishkuro commented Apr 21, 2019

trtg commented Apr 22, 2019 • edited Loading

wuyupengwoaini commented Apr 24, 2019

csurfleet commented May 8, 2019

DannyNoam commented May 16, 2019

yurishkuro commented Jun 2, 2019

rscott231 commented Jun 14, 2019

lookfwd commented Jun 26, 2019 • edited Loading

yurishkuro commented Aug 9, 2019

lookfwd commented Sep 6, 2019

yurishkuro commented Sep 7, 2019

stevejobsmyguru commented Feb 2, 2020

agaudreault commented Jun 3, 2020 • edited Loading

yurishkuro commented Jun 3, 2020

Mario-Hofstaetter commented Nov 17, 2020

Ashmita152 commented Feb 7, 2021

yurishkuro commented Feb 7, 2021

Ashmita152 commented Feb 7, 2021

Yolo-Hao commented May 19, 2021

yurishkuro commented May 19, 2021

mrsuperguo commented Dec 27, 2021

yurishkuro commented Jan 17, 2022

yurishkuro commented Sep 1, 2017 •

edited

Loading

agxp commented Oct 11, 2018 •

edited

Loading

trtg commented Apr 22, 2019 •

edited

Loading

lookfwd commented Jun 26, 2019 •

edited

Loading

agaudreault commented Jun 3, 2020 •

edited

Loading