Describe the bug
(from a review of #16367)
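The calculation of whole-column variance in libcudf uses the "textbook" approach. For some set of measurements $X := \{x_i\}_{i=1}^N$ (assuming, wlog, `ddof=1`):

$$
\mathrm{Var}(X) = \frac{1}{N-1}\left(\sum_{i=1}^N x_i^2 - \frac{1}{N}\Big(\sum_{i=1}^N x_i\Big)^2\right)
$$

This is known to be numerically inaccurate, especially when the variance is small compared to the values: both terms are then large and of similar magnitude, so their difference loses most of its significant digits.

The usual fix is to compute the variance via a two-pass or stable online approach. The former is then (having computed the mean $\bar{x}$):

$$
\mathrm{Var}(X) = \frac{1}{N-1}\sum_{i=1}^N (x_i - \bar{x})^2
$$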
This is the approach used in groupby variance calculation. An online version (Welford's algorithm) is used in rolling variance calculation.
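For readers unfamiliar with the online variant, here is a minimal pure-Python sketch of Welford's algorithm. It illustrates the technique only and is not the libcudf rolling implementation; the function name `variance_welford` and its `ddof` parameter are chosen here for illustration.

```python
def variance_welford(xs, ddof=1):
    """Single-pass variance via Welford's update (illustrative sketch)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in xs:
        count += 1
        delta = x - mean          # deviation from the old mean
        mean += delta / count     # update the running mean
        m2 += delta * (x - mean)  # uses both the old and the new mean
    return m2 / (count - ddof)


small = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
shifted = [1e9 + d for d in (4.0, 7.0, 13.0, 16.0)]
print(variance_welford(small))
print(variance_welford(shifted))
```

Because each update works with deviations from the running mean rather than with raw squares, the accumulated quantity stays on the scale of the variance itself, even when the values are large.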
We can see the differences in accuracy:
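The issue's original comparison is not reproduced here. As a stand-in, the following pure-Python sketch (not cudf code; the function names and the sample data are assumptions made for illustration) contrasts the textbook formula with the two-pass formula on values that are large relative to their spread. The exact sample variance of this data is 30.

```python
def variance_textbook(xs, ddof=1):
    """Textbook formula: (sum(x^2) - sum(x)^2 / N) / (N - ddof)."""
    n = len(xs)
    s = 0.0
    ss = 0.0
    for x in xs:
        s += x
        ss += x * x  # squares are ~1e18 here, beyond exact double precision
    return (ss - s * s / n) / (n - ddof)


def variance_two_pass(xs, ddof=1):
    """Two-pass formula: compute the mean first, then sum squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - ddof)


# Deviations from the mean (1e9 + 10) are -6, -3, 3, 6, so the variance is 90 / 3 = 30.
data = [1e9 + d for d in (4.0, 7.0, 13.0, 16.0)]

print("textbook:", variance_textbook(data))
print("two-pass:", variance_two_pass(data))
```

In double precision the two-pass result is exact for this input, while the textbook result is dominated by rounding error in the two large sums and can even come out negative on similar inputs.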
We should implement a numerically stable approach for the whole column calculation of the variance.