Ch03 ngram_segment.py: print(CoreBiGramTableDictionary.getBiFrequency("商品", "和")) returns the wrong frequency #1320

panda2019-ai · 2019-11-08T13:37:09Z

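Notes

Please confirm the following:

I have carefully read the following documents and found no answer:
I have searched for my problem with Google and the issue search function and found no answer.
I understand that the open-source community is a free community gathered out of interest and hobby, and that it bears no responsibility or obligation. I will speak politely and thank everyone who helps me.
I typed an x inside these brackets to tick the box, confirming the items above.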

Version

The current latest version is: hanlp-1.7.5
The version I am using is: hanlp-1.7.5

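My problem

I cloned pyhanlp directly with git clone and ran ngram_segment.py from Ch03. The returned 1-gram frequency is 2 and the 2-gram frequency is 0. The Java version's output is correct: it returns a term frequency of 2 for 【商品】 and a frequency of 1 for 【商品@和】.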

Reproducing the problem

The code was not modified.

Steps

Ran ngram_segment.py directly in PyCharm.

Triggering code

   print(CoreDictionary.getTermFrequency("商品"))
   print(CoreBiGramTableDictionary.getBiFrequency("商品", "和"))

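Expected output

Expected a 1-gram frequency of 2 and a 2-gram frequency of 1.

Actual output

Got a 1-gram frequency of 2 and a 2-gram frequency of 0.

Other information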


panda2019-ai · 2019-11-08T13:57:26Z

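I guessed this might be an encoding problem and checked. When training runs on Windows, the output file "my_cws_model.ngram.txt" is encoded in GB2312. After converting the file to UTF-8 and deleting my_cws_model.ngram.txt.table.bin, running the code again gave the correct result.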
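The workaround above (re-encode the text file, then delete the cached binary) can be sketched in Python. This is only an illustration, not part of HanLP or pyhanlp: the helper name `reencode_to_utf8` and its parameters are hypothetical, and it assumes the file was written in a GBK-compatible encoding (GBK is a superset of GB2312) and that the cache sits next to the text file with a `.table.bin` suffix, as in the file names mentioned in this comment.

```python
from pathlib import Path


def reencode_to_utf8(txt_path, source_encoding="gbk", cache_suffix=".table.bin"):
    """Rewrite txt_path as UTF-8 and delete its stale binary cache.

    Hypothetical helper: source_encoding is an assumption about how the
    file was written (GBK also decodes GB2312 text).
    Returns True if a cache file was found and removed, else False.
    """
    path = Path(txt_path)
    # Decode using the encoding the file was actually written in.
    text = path.read_bytes().decode(source_encoding)
    # Write the same text back as UTF-8 bytes (no newline translation).
    path.write_bytes(text.encode("utf-8"))
    # Delete the cache so it is rebuilt from the corrected text file.
    cache = Path(str(path) + cache_suffix)
    if cache.exists():
        cache.unlink()
        return True
    return False


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        ngram = Path(tmp) / "my_cws_model.ngram.txt"
        # Illustrative line only; the real file is produced by training.
        ngram.write_bytes("商品@和 1\n".encode("gbk"))
        Path(str(ngram) + ".table.bin").write_bytes(b"stale cache")
        print(reencode_to_utf8(ngram))
        print(ngram.read_bytes().decode("utf-8"), end="")
```

Deleting the cache matters as much as the re-encoding: if the `.table.bin` file built from the wrongly encoded text is left in place, it may be loaded instead of the corrected text file.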

panda2019-ai · 2019-11-08T14:07:19Z

I see that DictionaryMaker.java passes the UTF-8 argument when saving the model:
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(IOUtil.newOutputStream(path), "UTF-8"));
But NGramDictionaryMaker.java never passes the UTF-8 argument when saving the model:
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(IOUtil.newOutputStream(path)));
I hope the author can add it.
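The two Java lines differ only in the charset argument. Without it, OutputStreamWriter falls back to the platform default charset, which would explain the GB2312 file seen on Windows. The Python sketch below illustrates the likely effect under two assumptions that are not confirmed by this thread: the writer used GBK, and the file is later read back as UTF-8. The function `lookup_after_roundtrip` and the "first@second count" line format are illustrative, not HanLP's actual loader.

```python
def lookup_after_roundtrip(key, write_encoding, read_encoding="utf-8"):
    """Write one bigram line in write_encoding, read it back in
    read_encoding, and return the frequency found for key (0 if missing).

    Illustrative only: this mimics a text dictionary round trip and is
    not the HanLP loading code.
    """
    line = f"{key} 1\n"  # hypothetical "first@second count" line
    raw = line.encode(write_encoding)
    # A reader using the wrong charset silently produces garbled keys.
    decoded = raw.decode(read_encoding, errors="replace")
    table = {}
    for row in decoded.splitlines():
        parts = row.rsplit(" ", 1)
        if len(parts) == 2 and parts[1].isdigit():
            table[parts[0]] = int(parts[1])
    return table.get(key, 0)


if __name__ == "__main__":
    print(lookup_after_roundtrip("商品@和", "gbk"))    # mismatched charsets
    print(lookup_after_roundtrip("商品@和", "utf-8"))  # matching charsets
```

Under these assumptions the mismatched round trip returns 0 and the matching one returns 1, which is the same pattern as the 2-gram frequencies reported in this issue. It is also consistent with the 1-gram count being correct, since DictionaryMaker.java already writes UTF-8.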

hankcs · 2019-11-08T15:55:15Z

Thanks for the feedback. If you run into similar problems in the future, feel free to keep reporting them.

panda2019-ai closed this as completed Nov 8, 2019

hankcs added a commit that referenced this issue Nov 8, 2019

NGramDictionaryMaker and similar classes now default to UTF-8 encoding, fix #1320

511b978

hankcs added the improvement label Nov 8, 2019

hankcs mentioned this issue Dec 24, 2019

While reproducing the companion code for 《自然语言处理入门》, I found that the custom dictionary does not fully take effect. #1363

Closed

1 task

hankcs added a commit that referenced this issue Jan 10, 2020

NGramDictionaryMaker and similar classes now default to UTF-8 encoding, fix #1320

6b31f02
