
Making sampling percentage dynamically adjust (adaptive sampling) #80

Closed
vitalyf007 opened this issue Oct 21, 2015 · 3 comments

Comments

@vitalyf007
Contributor

Problem description

Given the automated way of collecting Request and Dependency events in App Insights, customers have little to no control over the amount of information their application produces. By default we collect all requests and dependency calls, and there is no way to reduce event rates if the application in question is fairly large.

With the static sampling feature shipped in the [2.0 beta] SDK in sprint 90 and the filtering feature shipped in early sprint 91, we do provide customers a way to control volumes, but questions remain.

If a customer attempts to use sampling, the very first problem they will run into is which value to set as the sampling percentage. This can be determined by looking at the total volume of events over a fairly large period of time, but that presents a different problem. Say the application is bursty: it produces a lot of volume in a short period of time followed by a long “quiet” period. With static sampling, the burst period would be sampled down and would produce statistically correct data, but the “quiet” period may generate so few telemetry events that static sampling will capture only a few events per hour/day and make the data incorrect (skewed).

Solution

The proposed solution is to develop an “adaptive sampling” mechanism that would vary the sampling percentage based on the observed rate of telemetry events generated by the application. As the rate increases (burst situation), the sampling percentage will decrease, capturing fewer events. When the rate of telemetry events drops (“quiet period”), the sampling percentage increases back, capturing more events to preserve statistical data correctness.

Thus, the sampling percentage may “float” based on the rate of telemetry events produced by the application. Since we’re addressing this on the SDK side, the rate of events is actually the rate of events on that box/device, not across the entire application.
This solution will work for the .NET server-side SDK. We can potentially port it to other server-side SDKs later.

The adaptive sampling module can be set up in code or in the configuration file. The default configuration file will include the adaptive sampling module by default.
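As a rough illustration, the default configuration entry might look something like the fragment below. The type and assembly names here are assumptions for illustration only; the exact names in the shipped SDK may differ:

```xml
<TelemetryProcessors>
  <!-- Adaptive sampling on by default; only the target rate is set explicitly -->
  <Add Type="Microsoft.ApplicationInsights.WindowsServer.TelemetryChannel.AdaptiveSamplingTelemetryProcessor, Microsoft.AI.ServerTelemetryChannel">
    <MaxTelemetryItemsPerSecond>5</MaxTelemetryItemsPerSecond>
  </Add>
</TelemetryProcessors>
```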

Design details

The solution uses the existing concept of a “telemetry processor” found in the 2.0 beta SDK of Application Insights.

We will employ the existing sampling module to do the actual sampling and create another telemetry processor to do the math that determines what sampling percentage is to be applied in a given situation.

To do that, we’ll calculate an exponential moving average (see: https://en.wikipedia.org/wiki/Moving_average) of the telemetry items sent to the AI data collector (rate of events after sampling). This number will be available the whole time the application runs. If the application has just started, the calculation of the event rate will be reset. As the application continues to run, the moving average will more precisely reflect the rate of telemetry events.
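The moving average update can be sketched as follows; this is a minimal illustration (the function and variable names are hypothetical, not the SDK’s), using a ratio of 0.25 as described under MovingAverageRatio below:

```python
def update_ema(previous_ema, new_rate, ratio=0.25):
    """Exponentially weighted moving average of the after-sampling item rate.

    ratio is the weight on the newest interval; the remainder stays on
    history, so old bursts decay geometrically over time.
    """
    if previous_ema is None:  # fresh start: no history yet, reset to observed
        return new_rate
    return ratio * new_rate + (1.0 - ratio) * previous_ema

ema = None
for rate in [10.0, 10.0, 2.0]:  # items/sec measured in successive intervals
    ema = update_ema(ema, rate)
print(ema)  # 8.0 -- the drop to 2.0/sec only partially registers so far
```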

The process will keep its state in memory without any serialization initially, so restarting the application will reset the state to its initial value.

Having the average rate of events produced and the effective sampling rate (the current sampling rate set on the sampling telemetry processor), we can determine the ‘ideal’ sampling rate given the target event rate set as a configuration value. If the ‘ideal’ sampling percentage is different from the currently effective one, we’ll change the corresponding parameter of the sampling telemetry processor to the new value.

The sampling percentage will not be changed constantly. Timeouts (different ones) will be applied before the sampling percentage can be [further] decreased or increased. Changing the sampling rate very frequently may result in bad behavior where request/RDD/event items belonging together may not all be sampled in or out if the sampling percentage changes in between; therefore a certain timeout is needed.
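The asymmetric timeouts could be gated with logic along these lines; a sketch with hypothetical names, using the 2-minute decrease and 15-minute increase defaults listed below:

```python
def change_allowed(seconds_since_last_change, current_pct, suggested_pct,
                   decrease_timeout=120.0, increase_timeout=900.0):
    """Decide whether the sampling percentage may be moved right now.

    Decreasing the percentage (capturing less) is allowed sooner than
    increasing it, so bursts are damped quickly but recovery is cautious.
    """
    if suggested_pct < current_pct:        # want to capture less data
        return seconds_since_last_change >= decrease_timeout
    if suggested_pct > current_pct:        # want to capture more data
        return seconds_since_last_change >= increase_timeout
    return False                           # no change needed

print(change_allowed(130.0, 50.0, 25.0))  # True: past the 2-min decrease timeout
print(change_allowed(130.0, 25.0, 50.0))  # False: increase timeout is 15 min
```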

Process parameters

The list below contains the entire set of parameters used in the process. All parameters may be set from code or via the configuration file. The default configuration file will have only the MaxTelemetryItemsPerSecond parameter set explicitly to its default value.

In addition to all parameters, a callback can be set in code that is invoked every time the sampling percentage algorithm runs (in case the customer wants to track/trace sampling percentage change events).

  • InitialSamplingPercentage (default: 100%) - Sampling percentage to apply when the application code starts and no state of the estimation process is available.
  • MaxTelemetryItemsPerSecond (default: 5) - Target maximum number of telemetry items generated by a single box/device per second. This parameter is the main driving factor of sampling percentage changes. Generally speaking, if this parameter is set to 5 and we observe 10 telemetry events generated per second, we’ll set sampling percentage to 50%.
  • MinSamplingPercentage (default: 0.1%) - As sampling percentage varies, what is the minimum value we’re allowed to set.
  • MaxSamplingPercentage (default: 100%) - As sampling percentage varies, what is the maximum value we’re allowed to set.
  • EvaluationIntervalSeconds (default: 15 sec) - How frequently we run the sampling percentage evaluation algorithm (along with the moving average calculation).
  • MovingAverageRatio (default: 0.25) - When calculating the moving average of telemetry events submitted per second, how much “emphasis” to put on the most recent values vs. historical values.
    With the default value we put 25% “emphasis” on the most recent value and 75% on historical values.
  • SamplingPercentageDecreaseTimeoutSeconds (default: 2min) - When sampling percentage value changes, how soon after are we allowed to lower sampling percentage again to capture less data.
  • SamplingPercentageIncreaseTimeoutSeconds (default: 15min) - When sampling percentage value changes, how soon after are we allowed to increase sampling percentage again to capture more data.
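The driving calculation behind MaxTelemetryItemsPerSecond, with the Min/MaxSamplingPercentage clamping applied, can be sketched as follows (a simplified illustration with hypothetical names, not the SDK’s code):

```python
def suggest_percentage(current_pct, observed_after_rate, target_rate,
                       min_pct=0.1, max_pct=100.0):
    # Back out the pre-sampling rate on this box from the after-sampling rate
    pre_sampling_rate = observed_after_rate / (current_pct / 100.0)
    # Percentage that would bring the after-sampling rate down to the target
    ideal = 100.0 * target_rate / pre_sampling_rate
    # Clamp to the configured bounds
    return min(max_pct, max(min_pct, ideal))

# 10 items/sec observed at 100% sampling with a target of 5/sec -> 50%
print(suggest_percentage(100.0, 10.0, 5.0))  # 50.0
# A quiet period never pushes the value above MaxSamplingPercentage
print(suggest_percentage(100.0, 2.0, 5.0))   # 100.0
```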

Sdk Api design

Similar to static sampling, we will provide a simple building block to enable customers to quickly set up adaptive sampling in code. The TelemetryChannelBuilder class, which allows building a list of telemetry processors, will receive a new extension method UseAdaptiveSampling() with the following overloads:

  1. No parameters. Enables adaptive sampling with all default values for algorithm parameters. This is to be used by new customers primarily to “kick the tires”;
  2. maxTelemetryItemsPerSecond parameter. Enables adaptive sampling with all default algorithm parameters but a custom target telemetry item rate. We expect this one to be used by more advanced customers who either have fewer boxes/servers in the application and are willing to capture more telemetry per box, or the other way around.
  3. settings, callback parameters. An overload for full customization of the algorithm, allowing all parameters to be set (here “settings” is a set of settable properties corresponding to all parameters outlined above for the estimation algorithm). The callback parameter allows the customer to set up code that is invoked when the sampling percentage evaluation algorithm runs. The following parameters will be provided to the callback:
    • After-sampling rate of telemetry observed by the algorithm;
    • Current sampling percentage algorithm assumes is applied by the sampling telemetry processor;
    • New sampling percentage to set for sampling telemetry processor in order to make the rate of events “ideal”;
    • Whether or not the sampling percentage will be changed after this evaluation (even though the current and new sampling percentages may differ, the new one may not be applied immediately due to ‘timeout’ or ‘penalty box’ situations, i.e., when the sampling percentage was changed recently);
    • Algorithm current settings.

A separate telemetry processor, AdaptiveSamplingTelemetryProcessor, will also be provided. This one is used by the extensions and is itself a combination of two telemetry processors – the existing sampling processor and the new sampling percentage estimator processor.

AdaptiveSamplingTelemetryProcessor also contains code to react to the sampling percentage change recommendation produced by the [internal] estimator processor by setting it as the sampling percentage property of the sampling processor.
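The “glue” between the two processors amounts to forwarding the estimator’s recommendation into the sampler’s percentage property. A minimal sketch of that pattern, with hypothetical stand-in classes (not the SDK’s types):

```python
class SamplingProcessor:
    """Stand-in for the existing (static) sampling telemetry processor."""
    def __init__(self):
        self.sampling_percentage = 100.0

class AdaptiveSamplingProcessor:
    """Stand-in for the container: applies estimator recommendations."""
    def __init__(self, sampler):
        self.sampler = sampler

    def on_recommendation(self, new_pct):
        # React to the estimator's recommendation by updating the sampler
        self.sampler.sampling_percentage = new_pct

sampler = SamplingProcessor()
adaptive = AdaptiveSamplingProcessor(sampler)
adaptive.on_recommendation(25.0)
print(sampler.sampling_percentage)  # 25.0
```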

Sampling percentage evaluation algorithm

A new [internal] class, SamplingPercentageEstimatorTelemetryProcessor, implements the sampling percentage evaluation algorithm.
It sets up a timer to evaluate the sampling percentage and follows this set of steps every time the timer fires:

  • Close the next interval of the ‘moving average’ counter and get the average observed after-sampling telemetry event rate;
  • Calculate the suggested sampling percentage so that the rate would be “just under” the ‘ideal’ target rate provided; adjust the suggested value if it is below the min or above the max;
  • Reset the timer if the evaluation frequency parameter was changed;
  • See if the sampling percentage needs to be changed;
  • Call the evaluation callback if provided, suppressing all exceptions (we’re on a timer thread here, and if the callback throws, the process would die);
  • If the sampling percentage can be changed (the suggested value differs from the current one and we’re not in any kind of ‘penalty box’), assume the sampling percentage will be changed by the sampling telemetry processor (this is enforced by the containing public telemetry processor), record the new value and the date of the change, and reset the moving average counter, since previous values were taken with a different sampling percentage.
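Putting the steps above together, one evaluation tick might look like the following sketch. This is a simplified model with hypothetical names under the assumptions already described (EMA rate tracking, clamped suggestion, average reset on change), not the SDK’s implementation:

```python
class SamplingPercentageEstimator:
    """Toy model of one sampling-percentage evaluation tick."""

    def __init__(self, target_rate=5.0, ratio=0.25, min_pct=0.1, max_pct=100.0):
        self.target_rate = target_rate
        self.ratio = ratio
        self.min_pct = min_pct
        self.max_pct = max_pct
        self.pct = 100.0   # currently effective sampling percentage
        self.ema = None    # moving average of the after-sampling item rate

    def evaluate(self, observed_after_rate, change_allowed=True):
        # 1. Close the interval: fold its rate into the moving average.
        if self.ema is None:
            self.ema = observed_after_rate
        else:
            self.ema = (self.ratio * observed_after_rate
                        + (1.0 - self.ratio) * self.ema)
        # 2. Suggest a percentage that brings the rate to the target, clamped.
        pre_sampling_rate = self.ema / (self.pct / 100.0)
        suggested = 100.0 * self.target_rate / pre_sampling_rate
        suggested = min(self.max_pct, max(self.min_pct, suggested))
        # 3. Apply it unless a timeout ("penalty box") forbids the change;
        #    reset the average, since old samples used a different percentage.
        if change_allowed and suggested != self.pct:
            self.pct = suggested
            self.ema = None
        return self.pct

est = SamplingPercentageEstimator()
print(est.evaluate(20.0))  # burst at 100% sampling: percentage drops to 25.0
print(est.evaluate(5.0))   # now at the target rate: percentage stays at 25.0
```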
@SergeyKanzhelev
Contributor

Is there a recommendation we can provide on how to understand the effectiveness of the default values for these settings and what values to set them to: MovingAverageRatio, SamplingPercentageDecreaseTimeoutSeconds and SamplingPercentageIncreaseTimeoutSeconds?

My opinion is that adaptive sampling should be THE sampling, so we don't need to complicate the API by introducing two distinct extension methods - UseAdaptiveSampling and UseSampling.

Sampling today takes into consideration user id, operation id, etc. I assume this sampling processor will do the same. Wouldn't it be nice to detach the sliding-window logic from the actual sampling logic? Maybe one processor does the "counting" and another filters based on the adjusted value?

What scenario do you see for the change callback? Maybe we can expose it via tracing, so one can subscribe to an EventSource and listen for change events instead? What logic do you expect a customer to implement here besides tracing?

@vitalyf007
Contributor Author

  1. Re: setting values & effectiveness of "penalty box" and "moving avg" parameters. These are for fine-tuning the process. Defaults should work for most cases. One example where the penalty box parameters may need adjusting is spikes in telemetry production that recur with a certain period... The idea is to apply the defaults and, if that does not work, analyze why the volume of telemetry is still high and/or throttling is in effect. Analysis may show a need to fine-tune these parameters. I do not think we can provide a "rule of thumb" way of adjusting these. Start with defaults, then apply analysis on a case-by-case basis...
  2. Re: "THE sampling". I strongly want to have static sampling available and public. It encapsulates complex sampling logic, and in case someone wants to build their own sampling percentage algorithm, they can, using the same approach as we do for adaptive sampling. Adaptive sampling will be the default in config - that is enough, I think.
  3. Re: decoupling sampling itself and "counting". This is, in fact, how adaptive sampling is done. It uses the static sampling telemetry processor as is, plus one more processor - the sampling percentage estimator ("counting" processor in your terms) - and a little bit of glue between them.
  4. Re: Callback. Two clear scenarios are tracing and turning it into a metric/counter. The callback provides a nice structured event that is very simple to subscribe to (unlike ETW).

@abaranch
Contributor

The Adaptive Sampling feature was implemented and included in the 2.0.0-beta3 SDK version.

TimothyMothra pushed a commit that referenced this issue Oct 25, 2019: mark request as failed for runaway exceptions