
KL returns negative values? #26

Closed

joshisanonymous opened this issue Jul 8, 2021 · 2 comments

@joshisanonymous
First, thank you so much for this package. Second, I'm getting negative results for some reason when using the KL function in cases where the two probability distributions I'm comparing are expected to be nearly identical. My distributions are very large (matrices with over 4 million columns) and contain many zeros, which seems to be the problem. I'm not very familiar with the math behind KL divergence, so I'm not sure whether this is expected. I've found that I can get negative results once my matrix reaches 271 or more columns by changing the first value of the second row to 0, like so:

library(philentropy)

# 2 x 271 matrix: both rows contain the counts 1..271
mat <- matrix(
  rep(1:271, each = 2),
  nrow = 2
)
# zero out the first count in the second row
mat[2, 1] <- 0
KL(mat, est.prob = "empirical")

Result:
kullback-leibler 
   -7.181671e-08
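
For comparison, my understanding is that without zeroing out that entry the two rows are identical distributions, so the same call should return 0 (mat0 below is just a name I'm using for the unmodified matrix):

# Both rows identical, so the KL divergence is expected to be 0
mat0 <- matrix(rep(1:271, each = 2), nrow = 2)
KL(mat0, est.prob = "empirical")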

Additionally, I thought of smoothing the counts by adding something like 0.01 to every count, and this changes the result of KL to a positive value:

# shift every count by 0.01 so that no zeros remain
mat <- mat + 0.01
KL(mat, est.prob = "empirical")

Result:
kullback-leibler 
    0.0001433061

Is this all expected or is there an issue somewhere?

HajkD added a commit that referenced this issue Jul 20, 2021: …set their own epsilon value when dealing with division by zero and log(0) cases #26
@HajkD
Member

HajkD commented Jul 20, 2021

Hi Josh,

Many thanks for your fantastic question and I am glad to hear that you find philentropy useful!

You are absolutely correct: the KL distance (divergence) is analytically always defined to be >= 0. However, when implementing these formulas, there are some technical issues that I had to take into account:

  • division by zero
  • log(0)

Analytically, these cases would simply be set to 0. Computationally, however, these undefined zero cases are replaced by a very small value (as recommended in Cha et al., 2007); in the case of philentropy I chose epsilon = 0.00001 (see e.g. line 1572 in distances.h).
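
To see where the slightly negative value comes from, here is a minimal hand-rolled sketch of this epsilon substitution. It is only an illustration of the mechanism, not the actual code in distances.h, and it assumes base-2 logarithms, which appears consistent with the output you reported:

# KL(P || Q) = sum(P * log2(P / Q)), with zero probabilities replaced by epsilon
kl_epsilon <- function(P, Q, epsilon = 0.00001) {
  P[P == 0] <- epsilon
  Q[Q == 0] <- epsilon
  sum(P * log2(P / Q))
}

mat <- matrix(rep(1:271, each = 2), nrow = 2)
mat[2, 1] <- 0
P <- mat[1, ] / sum(mat[1, ])  # empirical probabilities of row 1
Q <- mat[2, ] / sum(mat[2, ])  # empirical probabilities of row 2 (first entry is 0)

# The single positive term P[1] * log2(P[1] / epsilon) does not quite offset the
# many tiny negative terms caused by Q's slightly different normalisation, so the
# sum ends up marginally below zero.
kl_epsilon(P, Q)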

Your initial instinct to add 0.01 to your input data already works quite well, because it linearly shifts the input by 0.01 and thus avoids the division-by-zero and log(0) issues.

However, the more elegant solution (from my humble perspective) is not to transform the entire input data, but to choose an appropriate epsilon value that takes into account the similarity of the compared density functions and the number of input values. When comparing huge vectors that are fairly similar (especially at the 0 borders), even smaller epsilon values will be required to obtain non-negative KL values near 0.

I have now introduced an epsilon argument to philentropy::distance() and philentropy::KL(), with default epsilon = 0.00001.

You can see the effect if you now run your example code with a smaller epsilon:

mat <- matrix(
  rep(1:271, each = 2),
  nrow = 2
)
mat[2, 1] <- 0
KL(mat, est.prob = "empirical", epsilon = 0.0000001)

Result:
kullback-leibler 
   0.0001801934
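
For completeness, the same epsilon argument is passed through by the generic philentropy::distance() interface, so the equivalent call should look roughly like this (a sketch, assuming the "kullback-leibler" method name):

# Same computation via the generic distance() interface
distance(mat, method = "kullback-leibler", est.prob = "empirical", epsilon = 0.0000001)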

I hope this helps you make better analysis choices. Please let me know if this works for you.

Many thanks,
Hajk

HajkD mentioned this issue Jul 20, 2021
@HajkD
Member

HajkD commented Jul 30, 2021

I assume the issue has been solved?

HajkD closed this as completed Jul 30, 2021