[BUG] whole-column variance calculation uses numerically unstable algorithm #16444

wence- · 2024-07-31T11:15:33Z

Describe the bug

(from a review of #16367)

The calculation of whole-column variance in libcudf uses the "textbook" approach. For some set of measurements $X := \{x_i\}_{i=1}^N$ (assuming, wlog, ddof=1):

$$ \text{var}{X} = \frac{\sum_i x_i^2}{N - 1} - \left(\frac{\sum_i x_i}{N}\right)^2 \frac{N}{N-1}. $$

This form is known to be numerically inaccurate: when the variance is small compared to the values, both terms are large and of similar magnitude, so their difference suffers from catastrophic cancellation.
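To make the cancellation concrete, here is a minimal pure-Python sketch (not libcudf code) that evaluates the textbook formula in float64 and compares it against exact rational arithmetic, using the two values from the reproducer in this issue. The helper names `textbook_var` and `exact_var` are made up for illustration.

```python
from fractions import Fraction


def textbook_var(xs, ddof=1):
    """One-pass 'textbook' variance: scaled sum of squares minus scaled squared mean."""
    n = len(xs)
    s = sum(xs)
    ss = sum(x * x for x in xs)
    mean = s / n
    # Two large, nearly equal terms are subtracted here.
    return ss / (n - ddof) - (mean * mean) * n / (n - ddof)


def exact_var(xs, ddof=1):
    """Reference variance in exact rational arithmetic (no rounding at all)."""
    fs = [Fraction(x) for x in xs]
    mean = sum(fs) / len(fs)
    return sum((f - mean) ** 2 for f in fs) / (len(fs) - ddof)


# Both integers are exactly representable as float64.
xs = [float(1577836800000000000), float(1609372800000000000)]
exact = exact_var(xs)
approx = textbook_var(xs)
rel_err = abs(Fraction(approx) - exact) / exact

print(f"textbook: {approx:.17e}")
print(f"exact:    {float(exact):.17e}")
print(f"relative error: {float(rel_err):.3e}")
```

Each of the two terms is around 5e36, where neighbouring float64 values are about 6e20 apart, while the true variance is only about 5e32. The subtraction therefore cannot resolve the result to more than roughly twelve or thirteen significant digits, whatever the order of the preceding operations.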

The usual fix is to use a two-pass or a stable online approach. The former, having first computed the mean $\bar{x}$, is:

$$ \text{var}{X} = \frac{\sum_i (x_i - \bar{x})^2}{N - 1} $$

This is the approach used in groupby variance calculation. An online version (Welford's algorithm) is used in rolling variance calculation.
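Both alternatives can be sketched in a few lines of plain Python. This is only an illustration of the two techniques named above, not the code used by groupby or rolling in libcudf; the function names `two_pass_var` and `welford_var` are hypothetical.

```python
def two_pass_var(xs, ddof=1):
    """Two-pass variance: compute the mean first, then sum squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    total = 0.0
    for x in xs:
        d = x - mean  # deviations are small even when the values are large
        total += d * d
    return total / (n - ddof)


def welford_var(xs, ddof=1):
    """Single-pass (online) variance using Welford's update."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in xs:
        count += 1
        delta = x - mean          # deviation from the old mean
        mean += delta / count     # update the running mean
        m2 += delta * (x - mean)  # uses both the old and the new mean
    return m2 / (count - ddof)


xs = [float(1577836800000000000), float(1609372800000000000)]
print(f"two-pass: {two_pass_var(xs):.17e}")
print(f"welford:  {welford_var(xs):.17e}")
```

Neither version ever forms the sum of squares of the raw values, so the large common offset drops out before anything is squared.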

We can see the differences in accuracy:

import cudf

y = cudf.Series([1577836800000000000, 1609372800000000000], dtype="int64").astype("float64")
print("%.32e" % y.var(ddof=1))
print("%.32e" % y.rolling(2, min_periods=0).var().values[1].item())
print("%.32e" % y.groupby(cudf.Series([0, 0])).var().values[0].item())
print("%.32e" % y.values.get().astype("float128").var(ddof=1))
# 4.97259647999910330416731839791104e+32
#               ^^^^ about 1000ulp
# 4.97259647999999970063715022143488e+32
# 4.97259647999999970063715022143488e+32
# 4.97259647999999970063715022143488e+32

We should implement a numerically stable approach for the whole column calculation of the variance.

bdice · 2024-07-31T12:04:03Z

The offending code is in reduction_operators.cuh:

cudf/cpp/include/cudf/reduction/detail/reduction_operators.cuh

Line 263 in 8def2ec

ResultType var = asum / div - ((mean * mean) * count) / div;

cudf/cpp/include/cudf/reduction/detail/reduction_operators.cuh

Line 44 in 8def2ec

    
           return this_t((this->value + rhs.value), (this->value_squared + rhs.value_squared));
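The combine step quoted above merges partial results by adding raw sums and raw sums of squares, which is what feeds the cancellation in the final subtraction. A stable alternative is to carry `(count, mean, M2)` per partial result, where `M2` is the sum of squared deviations from that partial mean, and merge those. This is the pairwise update of Chan, Golub, and LeVeque (1983) that the fix's commit message names. The following is a Python sketch of that idea under my own naming (`Moments`, `merge`, `pairwise_var` are hypothetical), not the CUDA code from the actual fix.

```python
from typing import NamedTuple, Sequence


class Moments(NamedTuple):
    count: int
    mean: float
    m2: float  # sum of squared deviations from `mean`


def merge(a: Moments, b: Moments) -> Moments:
    """Combine two partial results without forming raw sums of squares."""
    if a.count == 0:
        return b
    if b.count == 0:
        return a
    n = a.count + b.count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / n
    m2 = a.m2 + b.m2 + delta * delta * a.count * b.count / n
    return Moments(n, mean, m2)


def reduce_pairwise(xs: Sequence[float]) -> Moments:
    """Tree-shaped reduction, mimicking how a parallel reduction groups work."""
    if len(xs) == 0:
        return Moments(0, 0.0, 0.0)
    if len(xs) == 1:
        return Moments(1, float(xs[0]), 0.0)
    mid = len(xs) // 2
    return merge(reduce_pairwise(xs[:mid]), reduce_pairwise(xs[mid:]))


def pairwise_var(xs: Sequence[float], ddof: int = 1) -> float:
    acc = reduce_pairwise(xs)
    return acc.m2 / (acc.count - ddof)


xs = [float(1577836800000000000), float(1609372800000000000)]
print(f"pairwise: {pairwise_var(xs):.17e}")
```

Because `merge` only needs the two partial triples, it can serve as the binary operator of a parallel reduction in the same way the sum-based operator does, at the cost of a few extra floating-point operations per merge.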

wence- · 2024-07-31T17:43:17Z

Opened a fix, but it is in draft because there are no tests yet, and some xfails in the cudf-python tests probably need to be removed.

wence- added the "bug" label (Something isn't working) on Jul 31, 2024

wence- mentioned this issue Jul 31, 2024

Align DatetimeIndex APIs with pandas 2.x #16367

Merged

wence- self-assigned this Jul 31, 2024

wence- added a commit to wence-/cudf that referenced this issue Jul 31, 2024

Compute whole column variance using numerically stable approach

17185c5

We use the pairwise approach of Chan, Golub, and LeVeque (1983). - Closes rapidsai#16444

wence- added the "libcudf" label (Affects libcudf (C++/CUDA) code.) on Jul 31, 2024

wence- mentioned this issue Jul 31, 2024

Compute whole column variance using numerically stable approach #16448

Merged

rapids-bot bot closed this as completed in #16448 Oct 8, 2024

rapids-bot bot closed this as completed in bcf9425 Oct 8, 2024
