UnicodeDecodeError: 'gbk' codec can't decode byte 0x91 in position 2: illegal multibyte sequence #5

s1162276945 · 2019-12-23T07:07:00Z

File "D:\pycode\Graph4CNER\utils\functions.py", line 16, in read_instance
in_lines = open(input_file, 'r').readlines()
UnicodeDecodeError: 'gbk' codec can't decode byte 0x91 in position 2: illegal multibyte sequence

After replacing 'r' with 'rb', I get AttributeError: 'int' object has no attribute 'isdigit'
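For context on the AttributeError reported above: in Python 3 a file opened in 'rb' mode yields bytes, and iterating over bytes yields int values, which have no str methods such as isdigit(). The following is a minimal sketch, assuming the failing code calls isdigit() on individual characters; the helper name `count_digits` is hypothetical and not taken from the repository.

```python
def count_digits(text):
    """Count the characters for which str.isdigit() is true (hypothetical helper)."""
    return sum(1 for ch in text if ch.isdigit())

# Roughly what a line looks like when the file is read in 'rb' mode.
raw = "第3节 O".encode("utf-8")

try:
    count_digits(raw)  # iterating bytes yields ints, so .isdigit() is missing
    error = None
except AttributeError as exc:
    error = str(exc)

# Decoding the bytes first gives a str, and the same helper works.
decoded_count = count_digits(raw.decode("utf-8"))
```

So switching to 'rb' only moves the problem: the bytes still have to be decoded with the correct encoding before any string processing.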

DianboWork · 2019-12-23T09:24:21Z

I have never run into this problem. You could check the encoding of your dataset.

DianboWork · 2019-12-23T09:25:18Z

I have listed the data format in the README.

s1162276945 · 2019-12-23T11:11:45Z

OK, I'll take a look.

s1162276945 · 2019-12-23T11:21:21Z

My data comes from the gold-horse project, and I haven't changed the encoding of the files.

DianboWork · 2019-12-23T11:24:18Z

The weiboNER_2nd_conll data in the link I provided needs to be preprocessed into the format described in the README.

s1162276945 · 2019-12-23T11:38:14Z

一 O
节 O
课 O
的 O
时 O
间 O
真 O
心 O
感 O
动 O
了 O
李 B-PER.NAM
开 I-PER.NAM
复 I-PER.NAM

s1162276945 · 2019-12-23T11:38:57Z

The data I got is already in the format from your README. The encoding probably got corrupted when the archive was extracted.

DianboWork · 2019-12-23T11:51:47Z

You can leave your email address, and I will send you the Weibo data that can be made public.

s1162276945 · 2019-12-23T11:53:02Z

OK, thank you. My email address is [email protected]

s1162276945 · 2019-12-23T12:29:29Z

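> You can leave your email address, and I will send you the Weibo data that can be made public.

It was indeed a dataset encoding problem.
Link: https://pan.baidu.com/s/11VlW0GY4AQsndB18bPKjrg
Extraction code: xxem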

zhangdddong · 2020-01-04T07:04:34Z

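My guess is that you are running this on Windows, while the author's code was probably run on Linux. Datasets are usually encoded in UTF-8, but the default encoding on a Chinese-locale Windows system is GBK, whereas on Linux it is usually UTF-8. On Windows you need to specify the encoding explicitly when reading the file, for example: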
in_lines = open(input_file, 'r', encoding='UTF-8').readlines()
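The one-line fix above can also be written with a context manager so the file is closed promptly. This is only a sketch of the same idea: `read_lines` and the temporary sample file are hypothetical and exist purely to make the example runnable; the sample content reuses the "character label" lines shown earlier in this thread.

```python
import os
import tempfile

def read_lines(path, encoding="utf-8"):
    """Read all lines using an explicit encoding instead of the platform default."""
    with open(path, "r", encoding=encoding) as f:
        return f.readlines()

# Write a tiny UTF-8 sample in the "character label" format, then read it back.
sample = "李 B-PER.NAM\n开 I-PER.NAM\n复 I-PER.NAM\n"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

lines = read_lines(sample_path)
os.remove(sample_path)
```

Because the encoding is passed explicitly, the result does not depend on whether the script runs on Windows or Linux.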

DianboWork · 2020-01-06T07:37:06Z

@zhangdddong That's the correct answer.

kt-rax · 2021-10-04T12:00:53Z

A catch-all workaround:
#lines = open(filename).read().split('\n')
# fix UnicodeDecodeError: 'gbk' codec can't decode byte 0xac in position 417448:
lines = open(filename,encoding='gb18030',errors='ignore').read().split('\n')
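One caveat with the workaround above: errors='ignore' silently drops bytes that cannot be decoded, so a UTF-8 file read as gb18030 raises no error but comes back as different characters. A more conservative sketch, assuming the file is either UTF-8 or a GBK-family encoding, tries each candidate strictly and reports which one worked. The name `decode_with_fallback` is hypothetical.

```python
def decode_with_fallback(data, encodings=("utf-8", "gb18030")):
    """Try each encoding strictly; return (text, encoding) for the first success."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")

utf8_bytes = "李 B-PER.NAM".encode("utf-8")
text, used = decode_with_fallback(utf8_bytes)

# Decoding UTF-8 bytes as gb18030 with errors='ignore' raises nothing,
# but the result is no longer the original text.
lossy = utf8_bytes.decode("gb18030", errors="ignore")
```

To apply this to a file, read it in 'rb' mode and pass the bytes to the function. Trying UTF-8 first is a deliberate choice here, since text in a GBK-family encoding rarely happens to be valid UTF-8.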
