Making sampling percentage dynamically adjust (adaptive sampling) #80
Is there a recommendation we can provide on how to understand the effectiveness of the default values for these settings and what values to set them to? My opinion is that adaptive sampling should be THE sampling, so we don't need to complicate the API by introducing two distinct extension methods. Sampling today takes into consideration user id, operation id, etc.; I assume this sampling processor will do the same. Would it be nice to detach the sliding-window logic from the actual sampling logic? Maybe one processor does the "counting" and another filters based on the adjusted value? What scenario do you see for the change callback? Maybe we can expose it via tracing, so one can subscribe to an EventSource and listen for change events instead? What logic do you expect the customer to implement beside tracing here?
The Adaptive Sampling feature was implemented and included in the 2.0.0-beta3 SDK version.
Problem description
Given the automated way of collecting Request and Dependency events in App Insights, customers have little to no control over the amount of information their application produces. By default we collect all requests and dependency calls, and there is no way to reduce event rates if the application in question is fairly large.
With the static sampling feature shipped in the [2.0 beta] SDK in sprint 90 and the filtering feature shipped in early sprint 91, we do provide customers a way to control volumes, but questions remain.
If a customer attempts to use sampling, the very first problem they run into is which value to set as the sampling percentage. This can be determined by looking at the total volume of events over a fairly large period of time, but that presents a different problem. Say the application is bursty and produces a lot of volume in a short period of time followed by a long "quiet" period. With static sampling, the burst period would be sampled down and would produce statistically correct data, but the "quiet" period may generate so few telemetry events that static sampling captures only a few events per day/hour and makes the data incorrect (skewed).
Solution
The proposed solution is to develop an "adaptive sampling" mechanism that varies the sampling percentage based on the observed rate of telemetry events generated by the application. As the rate increases (a burst situation), the sampling percentage decreases, capturing fewer events. When the rate of telemetry events drops (a "quiet" period), the sampling percentage increases back, capturing more events to preserve statistical data correctness.
Thus, the sampling percentage may "float" based on the rate of telemetry events produced by the application. Since we're addressing this on the SDK side, the rate of events is actually the rate of events on that box/device, not across the entire application.
This solution will work for the .NET server-side SDK. We can potentially port it to other server-side SDK ports later.
The adaptive sampling module can be set up in code or in the configuration file. The default configuration file will include the adaptive sampling module by default.
Design details
The solution uses the existing concept of a "telemetry processor" found in the 2.0 beta SDK of AI.
We will employ the existing sampling module to do the actual sampling and create another telemetry processor to do the math that figures out what sampling percentage should be applied in a given situation.
To do that, we'll calculate an exponential moving average (see: https://en.wikipedia.org/wiki/Moving_average) of the telemetry items sent to the AI data collector (the rate of events after sampling). This number will be available the whole time the application runs. If the application has just started, the calculation of the event rate is reset. As the application continues to run, the moving average reflects the rate of telemetry events more and more precisely.
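The moving-average calculation can be sketched as follows. This is a minimal Python illustration, not the SDK's actual implementation; the class name, the per-interval bookkeeping, and the 0.25 smoothing factor are all assumptions made for the example.

```python
class EventRateEstimator:
    """Illustrative sketch: exponential moving average (EMA) of the
    post-sampling event rate. `alpha` weights the most recent interval."""

    def __init__(self, alpha=0.25):
        self.alpha = alpha
        self.ema_rate = None   # events/second; None until the first interval
        self.count = 0         # items observed in the current interval

    def record_event(self):
        self.count += 1

    def end_interval(self, interval_seconds):
        """Fold the finished interval into the moving average."""
        rate = self.count / interval_seconds
        self.count = 0
        if self.ema_rate is None:
            self.ema_rate = rate  # application just started: seed the average
        else:
            self.ema_rate = self.alpha * rate + (1 - self.alpha) * self.ema_rate
        return self.ema_rate
```

With this weighting, a sudden burst moves the average gradually rather than instantly, which is what lets the sampling percentage adjust smoothly.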
The process will keep its state in memory without any serialization initially, so a restart of the application resets the state to the initial one.
Having the average rate of events produced and the effective sampling rate (the current sampling rate set on the sampling telemetry processor), we can determine the 'ideal' sampling rate given the target event rate set as a configuration value. If the 'ideal' sampling percentage differs from the currently effective one, we'll change the corresponding parameter of the sampling telemetry processor to the new value.
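The calculation described above can be sketched in Python for illustration; the function and parameter names are assumptions, not SDK identifiers:

```python
def ideal_sampling_percentage(observed_rate, current_percentage, target_rate):
    """Sketch of the 'ideal' percentage calculation described above.

    `observed_rate` is the post-sampling event rate, so the underlying
    pre-sampling rate is observed_rate / (current_percentage / 100).
    """
    if observed_rate <= 0 or current_percentage <= 0:
        return 100.0  # nothing observed yet: capture everything
    pre_sampling_rate = observed_rate * 100.0 / current_percentage
    return min(100.0, 100.0 * target_rate / pre_sampling_rate)
```

For example, observing 10 items/second at 100% with a 5 items/second target yields 50%, while a rate already below the target yields 100% (no sampling).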
The sampling percentage will not be changed constantly. Timeouts (different ones) will be applied before the sampling percentage is [further] decreased or increased. Changing the sampling rate very frequently may result in bad behavior where request/RDD events may not be sampled in or out together if the sampling percentage changes in between; therefore a certain timeout is needed.
Process parameters
The table below contains the entire set of parameters used in the process. All the parameters may be set from code or via the configuration file. The default configuration file will have only the MaxTelemetryItemsPerSecond parameter set explicitly to its default value.
In addition to all the parameters, a callback can be set in code that is invoked every time the sampling percentage algorithm runs (in case the customer wants to track/trace sampling percentage change events).
With the default value we put 25% "emphasis" on the most recent value and 75% on historical values.
SDK API design
Similar to static sampling, we will provide a simple building block to enable customers to quickly set up adaptive sampling in code. The TelemetryChannelBuilder class, which allows building a list of telemetry processors, will receive a new extension UseAdaptiveSampling() with the following overloads:
A separate telemetry processor, AdaptiveSamplingTelemetryProcessor, will also be provided. It is used by the extensions and is in itself a combination of two telemetry processors: the existing sampling processor and the new sampling percentage estimator processor.
AdaptiveSamplingTelemetryProcessor also contains code that reacts to a sampling percentage change recommendation produced by the [internal] estimator processor by setting it as the sampling percentage property of the sampling processor.
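The composition might look roughly like this. The real SDK classes are in C#; this Python sketch uses illustrative names and a callable-based chain purely to show the shape of the combination:

```python
import random

class SamplingProcessor:
    """Fixed-rate sampler: forwards roughly `percentage`% of items to `send`."""
    def __init__(self, send, percentage=100.0):
        self.send = send
        self.percentage = percentage

    def process(self, item):
        # random.random() is in [0, 1), so percentage=100 passes everything
        # and percentage=0 passes nothing.
        if random.random() * 100.0 < self.percentage:
            self.send(item)


class AdaptiveSamplingProcessor:
    """Sketch of the combination described above: a sampler plus a
    post-sampling counter that an estimator would read; recommendations
    from the estimator update the sampler's percentage."""
    def __init__(self, send):
        self._send = send
        self.sent_count = 0  # rate of events *after* sampling feeds the EMA
        self.sampler = SamplingProcessor(self._count_and_send)

    def _count_and_send(self, item):
        self.sent_count += 1
        self._send(item)

    def process(self, item):
        self.sampler.process(item)

    def on_percentage_recommendation(self, new_percentage):
        # React to the estimator's recommendation by adjusting the sampler.
        self.sampler.percentage = new_percentage
```

Keeping the estimator and the sampler as separate stages (as one commenter above suggests) means the "counting" logic can be tested and evolved independently of the filtering logic.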
Sampling percentage evaluation algorithm
The new [internal] class SamplingPercentageEstimatorTelemetryProcessor implements the sampling percentage evaluation algorithm.
It sets up a timer to evaluate the sampling percentage and follows this set of steps every time the timer fires:
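The exact steps are not listed here, but from the preceding sections a single timer tick plausibly looks like the following. This is a hedged reconstruction in Python; all names, and the details of the timeout handling, are assumptions:

```python
def evaluate_once(ema_rate, interval_rate, current_pct, target_rate,
                  seconds_since_change, decrease_timeout, increase_timeout,
                  alpha=0.25):
    """One timer tick: update the moving average, compute the 'ideal'
    percentage, and apply it only if the direction-specific timeout passed."""
    # Step 1: fold the interval's observed (post-sampling) rate into the EMA.
    ema_rate = alpha * interval_rate + (1 - alpha) * ema_rate
    # Step 2: back out the pre-sampling rate and the 'ideal' percentage.
    if ema_rate <= 0 or current_pct <= 0:
        ideal = 100.0
    else:
        pre_rate = ema_rate * 100.0 / current_pct
        ideal = min(100.0, 100.0 * target_rate / pre_rate)
    # Step 3: change only if the decrease/increase timeout has elapsed;
    # decreases and increases use different timeouts per the design.
    new_pct = current_pct
    if ideal < current_pct and seconds_since_change >= decrease_timeout:
        new_pct = ideal
    elif ideal > current_pct and seconds_since_change >= increase_timeout:
        new_pct = ideal
    # Step 4 (not shown): invoke the user callback with the outcome.
    return ema_rate, new_pct
```

The separate decrease/increase timeouts implement the constraint above: related request/RDD events are less likely to straddle a percentage change if changes are rate-limited.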