Phrase extraction: score is NaN when the left entropies or right entropies are all 0 #1366

allen615 · 2019-12-31T03:28:59Z

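Notes

Please confirm the following:

I have carefully read the following documents and did not find an answer:
I have searched for my problem with Google and the issue search function and did not find an answer either.
I understand that the open-source community is a free community formed out of interest, and that it bears no responsibility or obligation. I will be polite and thank everyone who helps me.
I am entering x in these brackets to tick the box, confirming the items above.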

Version

The current latest version is: 1.7.5
The version I am using is: 1.7.5

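My problem

When extracting key phrases from an article, I found that every phrase's score was NaN. Tracing it further showed the cause: the left entropy or the right entropy of every word pair is 0, so the corresponding total used as a denominator is 0.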
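
To illustrate why a zero total turns the whole score into NaN, here is a minimal standalone sketch. The class `NanDemo` and its `normalize` helper are hypothetical names used only for this illustration and are not part of HanLP. In Java, `0.0 / 0.0` on doubles evaluates to NaN instead of throwing, and NaN propagates through any later addition.

```java
public class NanDemo
{
    /** Divides one term by its total, the same way the reported line does. */
    static double normalize(double value, double total)
    {
        return value / total;
    }

    public static void main(String[] args)
    {
        // The first term is a valid number, the second is 0.0 / 0.0 = NaN.
        double score = normalize(0.5, 2.0) + normalize(0.0, 0.0);
        // NaN + anything is NaN, so the sum is NaN as well.
        System.out.println(Double.isNaN(score)); // prints true
    }
}
```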

Triggering code

Call path: MutualInformationEntropyPhraseExtractor.extractPhrase(text, size) -> occurrence.compute()

package com.hankcs.hanlp.corpus.occurrence;
public class Occurrence
{
...
    /**
     * All input data has been added; run the computation
     */
    public void compute()
    {
        entrySetPair = triePair.entrySet();
        double total_mi = 0;
        double total_le = 0;
        double total_re = 0;
        for (Map.Entry<String, PairFrequency> entry : entrySetPair)
        {
            PairFrequency value = entry.getValue();
            value.mi = computeMutualInformation(value);
            value.le = computeLeftEntropy(value);
            value.re = computeRightEntropy(value);
            total_mi += value.mi;
            total_le += value.le;
            total_re += value.re;
        }

        for (Map.Entry<String, PairFrequency> entry : entrySetPair)
        {
            PairFrequency value = entry.getValue();
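            // The problem is in the statement below: when total_le or total_re is 0, score is NaN
            // I am not very familiar with left/right information entropy, so I am not sure whether the following fix is acceptable:
            // add a sufficiently small number to the denominator, e.g.: value.score = value.mi / total_mi + value.le / (total_le+0.0001)+ value.re / (total_re+0.0001);
            value.score = value.mi / total_mi + value.le / total_le+ value.re / total_re;   // normalization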
            value.score *= entrySetPair.size();
        }
    }
}
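
As a rough sketch of one way to guard the normalization, a term can be treated as 0 when its total is 0, instead of adding a small constant to the denominator. This is not necessarily what the referenced commit does. The class `SafeNormalization` and the helpers `safeDivide` and `score` are hypothetical names for this illustration and are not HanLP APIs. The sketch assumes entropy values are non-negative, so a zero entropy total means every individual entropy is also 0.

```java
public class SafeNormalization
{
    /** Returns value / total, or 0 when total is 0 (hypothetical helper, not part of HanLP). */
    static double safeDivide(double value, double total)
    {
        return total == 0 ? 0 : value / total;
    }

    /** Mirrors the normalization step in compute(), with each division guarded. */
    static double score(double mi, double le, double re,
                        double totalMi, double totalLe, double totalRe, int pairCount)
    {
        double s = safeDivide(mi, totalMi) + safeDivide(le, totalLe) + safeDivide(re, totalRe);
        return s * pairCount;
    }

    public static void main(String[] args)
    {
        // Both entropy totals are 0, which made the original expression NaN.
        double s = score(0.5, 0.0, 0.0, 1.0, 0.0, 0.0, 2);
        System.out.println(s); // prints 1.0
    }
}
```

Compared with adding 0.0001 to the denominators, the explicit zero check leaves scores unchanged whenever the totals are nonzero.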


hankcs · 2019-12-31T03:37:07Z

Thanks for the report. This has been fixed; please refer to the commit above.
If there are still problems, feel free to reopen the issue.

hankcs added the bug label Dec 31, 2019

hankcs closed this as completed in 7dd79cc Dec 31, 2019

hankcs added a commit that referenced this issue Jan 10, 2020

Fix division-by-zero error in information entropy computation, fix #1366

ad99040
