Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] whole-column variance calculation uses numerically unstable algorithm #16444

Closed
wence- opened this issue Jul 31, 2024 · 2 comments · Fixed by #16448
Closed

[BUG] whole-column variance calculation uses numerically unstable algorithm #16444

wence- opened this issue Jul 31, 2024 · 2 comments · Fixed by #16448
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.

Comments

@wence-
Copy link
Contributor

wence- commented Jul 31, 2024

Describe the bug

(from a review of #16367)

The calculation of whole-column variance in libcudf uses the "textbook" approach. For some set of measurements $X := \{x_i\}_{i=1}^N$ (assuming, wlog, ddof=1):

$$ \text{var}{X} = \frac{\sum_i x_i^2}{N - 1} - \left(\frac{\sum_i x_i}{N}\right)^2 \frac{N}{N-1}. $$

This is known to be numerically inaccurate, especially when the variance is small compared to the values and both terms are large and of similar magnitude.

The usual fix to this is to compute via a two-pass or stable online approach. The former is then (having computed the mean $\bar{x}$):

$$ \text{var}{X} = \frac{\sum_i (x_i - \bar{x})^2}{N - 1} $$

This is the approach used in groupby variance calculation. An online version (Welford's algorithm) is used in rolling variance calculation.

We can see the differences in accuracy:

import cudf

y = cudf.Series([1577836800000000000, 1609372800000000000], dtype="int64").astype("float64")
print("%.32e" % y.var(ddof=1))
print("%.32e" % y.rolling(2, min_periods=0).var().values[1].item())
print("%.32e" % y.groupby(cudf.Series([0, 0])).var().values[0].item())
print("%.32e" % y.values.get().astype("float128").var(ddof=1))
# 4.97259647999910330416731839791104e+32
#               ^^^^ about 1000ulp
# 4.97259647999999970063715022143488e+32
# 4.97259647999999970063715022143488e+32
# 4.97259647999999970063715022143488e+32

We should implement a numerically stable approach for the whole column calculation of the variance.

@wence- wence- added the bug Something isn't working label Jul 31, 2024
@bdice
Copy link
Contributor

bdice commented Jul 31, 2024

The offending code is in reduction_operators.cuh:

ResultType var = asum / div - ((mean * mean) * count) / div;

return this_t((this->value + rhs.value), (this->value_squared + rhs.value_squared));

@wence- wence- self-assigned this Jul 31, 2024
wence- added a commit to wence-/cudf that referenced this issue Jul 31, 2024
We use the pairwise approach of Chan, Golub, and LeVeque (1983).

- Closes rapidsai#16444
@wence- wence- added the libcudf Affects libcudf (C++/CUDA) code. label Jul 31, 2024
@wence-
Copy link
Contributor Author

wence- commented Jul 31, 2024

Opened a fix, but in draft because no tests yet, and probably some xfails in cudf-python tests need to be removed.

@rapids-bot rapids-bot bot closed this as completed in bcf9425 Oct 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

2 participants