
KL returns negative values? #26

Closed

joshisanonymous opened this issue Jul 8, 2021 · 2 comments

@joshisanonymous
First, thank you so much for this package. Second, I'm getting negative results for some reason when using the KL function in cases where the two probability distributions I'm comparing are expected to be nearly identical. My distributions are very large (matrices with over 4 million columns) and contain many zeros, which seems to be the problem. I'm not very familiar with the math behind KL divergence, so I'm not sure whether this is expected. I've found that I can get negative results once my matrix reaches 271 or more columns by changing the first value of the second row to 0, like so:

library(philentropy)

# 2 x 271 matrix: both rows contain the counts 1..271
mat <- matrix(
  rep(1:271, each = 2),
  nrow = 2
)
# zero out the first count in the second row
mat[2, 1] <- 0
KL(mat, est.prob = "empirical")

Result:
kullback-leibler 
   -7.181671e-08
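
For comparison, my understanding is that without zeroing out that entry the two rows are identical distributions, so the same call should return 0 (mat0 below is just a name I'm using for the unmodified matrix):

# Both rows identical, so the KL divergence is expected to be 0
mat0 <- matrix(rep(1:271, each = 2), nrow = 2)
KL(mat0, est.prob = "empirical")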

Additionally, I thought of smoothing the counts by adding something like 0.01 to every count, and this changes the result of KL to a positive value:

# shift every count by 0.01 so that no zeros remain
mat <- mat + 0.01
KL(mat, est.prob = "empirical")

Result:
kullback-leibler 
    0.0001433061

Is this all expected or is there an issue somewhere?

HajkD added a commit that referenced this issue Jul 20, 2021: …set their own epsilon value when dealing with division by zero and log(0) cases #26
@HajkD
Member

HajkD commented Jul 20, 2021

Hi Josh,

Many thanks for your fantastic question and I am glad to hear that you find philentropy useful!

You are absolutely correct: the KL distance (divergence) is analytically always defined to be >= 0. However, when implementing these formulas, there are some technical issues that I had to take into account:

  • division by zero
  • log(0)

Analytically, these cases would simply be set to 0. Computationally, however, these undefined zero cases are replaced by a very small value (as recommended in Cha et al., 2007); in the case of philentropy I chose epsilon = 0.00001 (see e.g. line 1572 in distances.h).
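
To see where the slightly negative value comes from, here is a minimal hand-rolled sketch of this epsilon substitution. It is only an illustration of the mechanism, not the actual code in distances.h, and it assumes base-2 logarithms, which appears consistent with the output you reported:

# KL(P || Q) = sum(P * log2(P / Q)), with zero probabilities replaced by epsilon
kl_epsilon <- function(P, Q, epsilon = 0.00001) {
  P[P == 0] <- epsilon
  Q[Q == 0] <- epsilon
  sum(P * log2(P / Q))
}

mat <- matrix(rep(1:271, each = 2), nrow = 2)
mat[2, 1] <- 0
P <- mat[1, ] / sum(mat[1, ])  # empirical probabilities of row 1
Q <- mat[2, ] / sum(mat[2, ])  # empirical probabilities of row 2 (first entry is 0)

# The single positive term P[1] * log2(P[1] / epsilon) does not quite offset the
# many tiny negative terms caused by Q's slightly different normalisation, so the
# sum ends up marginally below zero.
kl_epsilon(P, Q)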

Your initial instinct to add 0.01 to your input data already works quite well, because it linearly shifts the input by 0.01 and thus avoids the division-by-zero and log(0) issues.

However, the more elegant solution (from my humble perspective) is not to transform the entire input data, but to choose an appropriate epsilon value that takes into account the similarity of the compared density functions and the number of input values. When comparing huge vectors that are fairly similar (especially at the 0 borders), even smaller epsilon values will be required to obtain non-negative KL values near 0.

I have now introduced an epsilon argument to philentropy::distance() and philentropy::KL(), with default epsilon = 0.00001.

You can see the effect if you now run your example code with a smaller epsilon:

mat <- matrix(
  rep(1:271, each = 2),
  nrow = 2
)
mat[2, 1] <- 0
KL(mat, est.prob = "empirical", epsilon = 0.0000001)

Result:
kullback-leibler 
   0.0001801934
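
For completeness, the same epsilon argument is passed through by the generic philentropy::distance() interface, so the equivalent call should look roughly like this (a sketch, assuming the "kullback-leibler" method name):

# Same computation via the generic distance() interface
distance(mat, method = "kullback-leibler", est.prob = "empirical", epsilon = 0.0000001)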

I hope this helps you make better analysis choices. Please let me know if this works for you.

Many thanks,
Hajk

HajkD mentioned this issue Jul 20, 2021
@HajkD
Member

HajkD commented Jul 30, 2021

I assume the issue has been solved?

HajkD closed this as completed Jul 30, 2021