Provide information about the quality of a resampled metric #1021

llucax · 2024-08-05T13:57:05Z

What's needed?

We need a way to inform users about the quality of a resampled metric.

For example, if a sample was calculated using only one very old value, the data quality should be low, whereas if it was calculated from many samples, including up-to-date ones, the quality should be high.

This way actors could make more informed decisions on how to use that data.

Proposed solution

Expose resampler SourceProperties via the resampling actor
Add more relevant statistics to SourceProperties
Make FormulaEngines aggregate statistics from the components they use and expose their own statistics
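
To make the idea concrete, here is a minimal sketch of what such statistics could look like. Everything in it is an assumption for illustration: `SampleQuality`, `quality_score` and the scoring rule are hypothetical and are not part of the SDK or of the existing SourceProperties.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SampleQuality:
    """Hypothetical statistics attached to one resampled value."""

    samples_used: int
    """How many input samples went into the resampled value."""

    newest_sample_age: Optional[timedelta]
    """Age of the newest input sample, or None if there was no input."""


def quality_score(
    stats: SampleQuality, expected_samples: int, max_age: timedelta
) -> float:
    """Return a score in [0, 1]; the formula is an arbitrary illustration."""
    if expected_samples <= 0:
        raise ValueError("expected_samples must be positive")
    if stats.samples_used == 0 or stats.newest_sample_age is None:
        return 0.0
    # Fraction of the samples we expected to see in the window.
    coverage = min(stats.samples_used / expected_samples, 1.0)
    # 1.0 for a brand new sample, 0.0 once the newest sample reaches max_age.
    freshness = max(1.0 - stats.newest_sample_age / max_age, 0.0)
    return coverage * freshness


max_age = timedelta(seconds=3)
fresh = SampleQuality(samples_used=3, newest_sample_age=timedelta(0))
stale = SampleQuality(samples_used=1, newest_sample_age=timedelta(seconds=2.5))
print(quality_score(fresh, expected_samples=3, max_age=max_age))
print(quality_score(stale, expected_samples=3, max_age=max_age))
```

With this rule, one old sample scores much lower than a full window of recent samples, which is the distinction described under "What's needed?".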

Use cases

No response

Alternatives and workarounds

No response

Additional context

No response


cwasicki · 2024-08-06T15:38:29Z

In my opinion this is interesting for formulas, e.g. to know how many None values were ignored in the calculation.
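
As a rough illustration of that idea, the sketch below sums a set of component values and also reports how many were None. `sum_ignoring_none` is a hypothetical helper, not how FormulaEngines work today.

```python
from typing import Optional, Sequence, Tuple


def sum_ignoring_none(
    values: Sequence[Optional[float]],
) -> Tuple[Optional[float], int]:
    """Sum the values that are present and count the ones that were None."""
    present = [v for v in values if v is not None]
    ignored = len(values) - len(present)
    # If nothing was present there is no meaningful total to report.
    total = sum(present) if present else None
    return total, ignored


total, ignored = sum_ignoring_none([1.0, None, 2.5, None])
print(f"total={total}, ignored={ignored} of 4 inputs")
```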

llucax · 2024-08-08T11:23:39Z

@frequenz-floss/python-sdk-team unless someone steps in and shows a use case for this, I think I will close.

shsms · 2024-08-08T11:33:10Z

We have often seen lower data rates from components without warning because of site-specific issues. I have seen this happen many times, including last week.

Apps need to be able to identify degraded data quality so that they know to be more conservative in their goals. Without it, they will assume that the latest values have a higher accuracy and will overshoot.

llucax · 2024-08-08T12:09:03Z

But if we assume a small sampling period, which is what we want to aim for (1s), then you know that the data rate is low or the quality of the data is bad because the resampler will start producing None, right? I agree we need to know when data is degraded; what I'm not sure about is whether the resampler is the best place to do so. I think the resampler should only cover for very short outages, stuff that should be transparent to app developers. Once data is bad enough that you care, the resampler should be fixing it in the first place, right?

llucax · 2024-08-08T12:09:39Z

So one suggestion was to use the LatestValueCache, extending it to expire the last value and store the timestamp of the last value.
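
A minimal sketch of that suggestion, assuming a standalone class rather than the real LatestValueCache (whose API is not shown in this thread). `ExpiringLatestValue`, `max_age` and `last_timestamp` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


class ExpiringLatestValue(Generic[T]):
    """Keep the latest value and its timestamp; expire it after max_age."""

    def __init__(self, max_age: timedelta) -> None:
        self._max_age = max_age
        self._value: Optional[T] = None
        self._timestamp: Optional[datetime] = None

    def update(self, value: T, timestamp: datetime) -> None:
        """Store a new value together with the time it was produced."""
        self._value = value
        self._timestamp = timestamp

    @property
    def last_timestamp(self) -> Optional[datetime]:
        """Timestamp of the last stored value, even if it has expired."""
        return self._timestamp

    def get(self, now: datetime) -> Optional[T]:
        """Return the latest value, or None if it is missing or too old."""
        if self._timestamp is None or now - self._timestamp > self._max_age:
            return None
        return self._value


t0 = datetime(2024, 8, 8, 12, 0, 0, tzinfo=timezone.utc)
cache: ExpiringLatestValue[float] = ExpiringLatestValue(max_age=timedelta(seconds=3))
cache.update(42.0, t0)
print(cache.get(t0 + timedelta(seconds=2)))  # still within max_age
print(cache.get(t0 + timedelta(seconds=4)))  # expired
```

The caller passes `now` explicitly here only to keep the sketch deterministic; an app could derive the age of the data from `last_timestamp` and decide how conservative to be.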

shsms · 2024-08-08T12:17:54Z

> then you know that the data rate is low or the quality of the data is bad because the resampler will start producing None, right?

I think the resampler shouldn't produce None and expect manual intervention like increasing data age in number of sampling periods to 5. Like Christoph said, that is too disruptive for big locations. The resampler should adjust to max data age, if it determines that data rate is lower than the max data age, such that the buffer will have the latest value. But that's a separate issue I guess.

shsms · 2024-08-08T12:21:36Z

> what I'm not sure about is whether the resampler is the best place to do so.

I think it is, because like you said, it tracks source info already and just has to send out one value at startup, and later, whenever the source info is recalculated.

llucax · 2024-08-09T12:01:41Z

> I think the resampler shouldn't produce None and expect manual intervention like increasing data age in number of sampling periods to 5.

Let's see if we are talking about the same thing.

When? If data is not coming, then yes, it should produce None, there is no data. Right? This might happen temporarily or always. If a site is always producing slow data rates, then there is something seriously wrong with that location and, IMHO, in that case, yes, we should fix the location or change the period manually. At least from what I understood from @thomas-nicolai-frequenz, the resampling period can't be changed so lightly or the machine learning part can break.

If it happens sporadically, we should be able to recover when the data comes with the normal rate.

> Like Christoph said, that is too disruptive for big locations. The resampler should adjust to max data age, if it determines that data rate is lower than the max data age, such that the buffer will have the latest value. But that's a separate issue I guess.

What do you mean by "adjust to the max data age"? Do you mean it should adjust the max_data_age_in_periods so that we get at least one sample for the low rate input? If so, I don't think we should do that, because this would effectively change the resampling function dynamically depending on the input data rate.

> > what I'm not sure about is whether the resampler is the best place to do so.
>
> I think it is, because like you said, it tracks source info already and just has to send out one value at startup, and later, whenever the source info is recalculated.

Yeah, but it is done for different reasons. Again, the global resampler is just a way to homogenize the input data assuming the data that comes... comes, and comes at a reasonable rate. If we have no data, the resampler should return None. If you still need to work with an old value, you should save the latest value and the age of this latest value yourself.

So this issue is only about knowing if the data for the last 3 seconds (according to the current defaults we use, a resampling period of 1s and max_data_age_in_periods of 3) is good or bad, and my question still is, do we even need this kind of granularity?

llucax · 2024-08-09T12:07:27Z

OK, looking at the code, I have some interesting findings that I forgot about:

The resampler supports upsampling, and the max_data_age_in_periods considers the input sampling period in this case, not the (output) resampling period:

    max_data_age_in_periods: float = 3.0
    """The maximum age a sample can have to be considered *relevant* for resampling.

    Expressed in number of periods, where period is the `resampling_period`
    if we are downsampling (resampling period bigger than the input period) or
    the *input sampling period* if we are upsampling (input period bigger than
    the resampling period).

    It must be bigger than 1.0.

    Example:
        If `resampling_period` is 3 seconds, the input sampling period is
        1 and `max_data_age_in_periods` is 2, then data older than 3*2
        = 6 seconds will be discarded when creating a new sample and never
        passed to the resampling function.

        If `resampling_period` is 3 seconds, the input sampling period is
        5 and `max_data_age_in_periods` is 2, then data older than 5*2
        = 10 seconds will be discarded when creating a new sample and never
        passed to the resampling function.
    """

If the resampler is downsampling, then the amount of time considered when passing samples to the resampling function is constant, but if it is upsampling, it is already dynamic (as it depends on the input sampling period) 😱
The input sampling period is calculated each time a sample comes, but it is an average over the whole lifetime of the input, so if an input rate changes over time, the value used as input sampling period will barely change. This might be good or bad depending on how we see it.

So if some location is sending samples every 5 seconds (consistently and from the start), the resampler should be able to cope with it without issues, data for the last 15 seconds should be used to calculate the current sample. If this didn't happen, maybe we have a bug in the resampler.
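
The arithmetic from the docstring can be sketched as follows. This only mirrors the documented rule (use the larger of the two periods); `relevant_window` is a hypothetical helper and not the resampler's actual code.

```python
from datetime import timedelta


def relevant_window(
    resampling_period: timedelta,
    input_period: timedelta,
    max_data_age_in_periods: float = 3.0,
) -> timedelta:
    """Return how far back samples are considered relevant for resampling."""
    # Downsampling: the resampling period is the bigger one.
    # Upsampling: the input sampling period is the bigger one.
    period = max(resampling_period, input_period)
    return period * max_data_age_in_periods


s = lambda n: timedelta(seconds=n)
print(relevant_window(s(3), s(1), 2.0))  # first docstring example: 6 seconds
print(relevant_window(s(3), s(5), 2.0))  # second docstring example: 10 seconds
print(relevant_window(s(1), s(5)))       # 5 s input with defaults: 15 seconds
```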

cwasicki · 2024-08-09T15:06:30Z

> it is already dynamic (as it depends on the input sampling period)

Are you sure that this is done if the input data is not on a fixed sampling period? IIUC it can also be None, which I assumed would be used if we use the raw data as input.

llucax · 2024-08-13T07:20:27Z

I didn't get what you mean by "the input data is not on a fixed sampling period".

cwasicki · 2024-08-13T11:11:56Z

If we resample irregular sample periods, e.g. if it's done on the raw data from the components, I am not sure we can rely on that.

llucax · 2024-08-14T09:18:20Z

So, if we are downsampling, the data considered for the current window always spans a fixed amount of time (max_data_age_in_periods * resampling_period). If we are upsampling, though, input samples up to the following age are considered for the current window: max_data_age_in_periods * input_sampling_period, where input_sampling_period is dynamic (it is updated for each received sample as total_time_receiving / total_samples_received). So if the input source rate is stable, it should be more or less constant, but if we often have gaps, then input_sampling_period will increase, as it is an average.

But also for the downsampling case, if a source is flaky at the beginning, we might consider we are actually upsampling the source, because the data rate is too low. Once it recovers, it should be switched to downsampling.

I'm not saying this is what we want, I'm just saying this is what the resampler is doing right now.
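
A small simulation of that lifetime average, under stated assumptions: timestamps are seconds since start, and total_time_receiving is taken as the time between the first and the last sample. `input_sampling_period` below is a hypothetical stand-in, not the resampler's code.

```python
from typing import List


def input_sampling_period(arrival_times: List[float]) -> float:
    """Lifetime average: total_time_receiving / total_samples_received."""
    total_time_receiving = arrival_times[-1] - arrival_times[0]
    total_samples_received = len(arrival_times)
    return total_time_receiving / total_samples_received


# Ten minutes of steady 1 s data.
steady = [float(t) for t in range(600)]
before_gap = input_sampling_period(steady)

# A single 60 s outage, then one more sample.
after_gap = input_sampling_period(steady + [steady[-1] + 60.0])

# The source then keeps sending every 5 s for another minute.
slow = steady + [steady[-1] + 60.0 + 5.0 * i for i in range(13)]
after_slowdown = input_sampling_period(slow)

print(before_gap, after_gap, after_slowdown)
```

Under these assumptions the estimate moves only slightly after the outage and stays far below 5 s even after a minute of 5 s data, which illustrates why a rate change late in the lifetime of an input barely affects the window used for upsampling.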

llucax added type:enhancement New feature or enhancement visible to users part:data-pipeline Affects the data pipeline labels Aug 5, 2024

llucax added this to the v1.0.0-rc800 milestone Aug 5, 2024

llucax mentioned this issue Aug 6, 2024

Accounting for unknown (because of missing data) but probably existing power #1024

Open

shsms modified the milestones: v1.0.0-rc800, v1.0.0-rc900 Aug 22, 2024

llucax modified the milestones: v1.0.0-rc900, v1.0.0-rc1000 Sep 2, 2024

llucax modified the milestones: v1.0.0-rc1000, v1.0.0-rc1100 Oct 21, 2024

llucax modified the milestones: v1.0.0-rc1100, v1.0.0-rc1200 Nov 11, 2024
