Built-in stopwords still take effect after deleting the contents of stopwords.txt and restarting #1253
Comments
NotionalTokenizer filters them.
Thanks, but I am using doc2vec for text similarity computation, and it has the NotionalTokenizer segmenter built in. My question is: when using doc2vec from Python, how can I replace the NotionalTokenizer segmenter with a different one?
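(A minimal sketch, not from the thread itself, of driving HanLP tokenizers from Python through pyhanlp's JClass bridge. It assumes pyhanlp 1.7.x and only shows that another tokenizer such as StandardTokenizer can be called in place of NotionalTokenizer when stopword filtering is not wanted; it is not the doc2vec API itself.)

```python
# Sketch only: compare NotionalTokenizer (filters stopwords/function words)
# with StandardTokenizer (keeps every token) via pyhanlp's JClass bridge.
from pyhanlp import JClass

NotionalTokenizer = JClass("com.hankcs.hanlp.tokenizer.NotionalTokenizer")
StandardTokenizer = JClass("com.hankcs.hanlp.tokenizer.StandardTokenizer")

text = "我们的希望是让机器理解自然语言"

# NotionalTokenizer drops stopwords during segmentation.
print([term.word for term in NotionalTokenizer.segment(text)])
# StandardTokenizer returns every token, stopwords included.
print([term.word for term in StandardTokenizer.segment(text)])
```

NotionalTokenizer is the component that removes stopwords, so segmenting with a different tokenizer before the doc2vec step changes which tokens reach the model.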
Thanks for the feedback. This has been fixed; please refer to the commit above. Added …
Thank you very much. With the patch you provided, I have the following two questions: …
Many thanks to the author for the patient answers; the problem has been solved.
Notes
Please confirm the following:
Version
The current latest version is: 1.7.4
The version I am using is: 1.7.4
My issue
In the data/dictionary directory, I deleted the stopwords.txt.bin file and emptied the stopwords.txt file. After restarting and rerunning the program, the stopwords from the original stopword list still take effect, i.e. the deletion has no effect.
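(For context, a minimal sketch, an assumption rather than part of the original report: it inspects the in-memory stopword list that HanLP builds from data/dictionary/stopwords.txt and caches as stopwords.txt.bin, assuming pyhanlp 1.7.x and that CoreStopWordDictionary exposes the contains/remove helpers present in recent 1.x releases.)

```python
# Sketch only: check and edit the in-memory stopword dictionary via pyhanlp;
# CoreStopWordDictionary is the class HanLP fills from stopwords.txt.
from pyhanlp import JClass

CoreStopWordDictionary = JClass(
    "com.hankcs.hanlp.dictionary.stopword.CoreStopWordDictionary")

word = "的"
print(CoreStopWordDictionary.contains(word))  # is it currently a stopword?
CoreStopWordDictionary.remove(word)           # drop it for this process only
print(CoreStopWordDictionary.contains(word))  # expected: False
```

If contains() still reports a word as a stopword after stopwords.txt was emptied and the .bin cache deleted, the filtering is coming from somewhere other than the text file, which matches the NotionalTokenizer behaviour mentioned in the comments above.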
Reproducing the issue
Steps
Triggering code
Expected output
Actual output
Other information