UnicodeDecodeError: 'gbk' codec can't decode byte 0x91 in position 2: illegal multibyte sequence #5

Open
s1162276945 opened this issue Dec 23, 2019 · 13 comments

Comments

@s1162276945

File "D:\pycode\Graph4CNER\utils\functions.py", line 16, in read_instance
in_lines = open(input_file, 'r').readlines()
UnicodeDecodeError: 'gbk' codec can't decode byte 0x91 in position 2: illegal multibyte sequence

Replacing 'r' with 'rb' produces a different error: AttributeError: 'int' object has no attribute 'isdigit'
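A minimal sketch of why switching to 'rb' leads to this AttributeError. It assumes that `read_instance` calls `.isdigit()` on individual characters of each line; that part of the traceback is not shown in this thread, so treat it as a guess. In binary mode each line is a `bytes` object, and indexing or iterating over `bytes` yields `int` values rather than one-character strings.

```python
# A line as it would come back from a file opened with 'rb' (sample data, not from the dataset).
line = "2019 O\n".encode("utf-8")

# Indexing bytes gives an int, which has no str methods.
first = line[0]
print(type(first))

try:
    first.isdigit()
except AttributeError as exc:
    print(exc)

# Decoding the bytes first restores normal str behaviour.
text = line.decode("utf-8")
print(text[0].isdigit())
```

So 'rb' only moves the problem: the bytes still have to be decoded with the right codec before any character-level processing.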

@DianboWork
Owner

I have never run into this problem. You could check the encoding of the dataset.
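One way to check the encoding is to try strict decoding with a few candidate codecs. The sketch below is only a heuristic: `guess_encoding` is a hypothetical helper, not part of Graph4CNER, and the candidate list is an assumption. UTF-8 is tried first because short GBK text can occasionally also be valid under other codecs.

```python
from typing import Optional, Sequence

def guess_encoding(raw: bytes,
                   candidates: Sequence[str] = ("utf-8", "gb18030")) -> Optional[str]:
    """Return the first candidate codec that decodes `raw` without errors."""
    for name in candidates:
        try:
            raw.decode(name)  # strict mode: raises on any invalid byte sequence
        except UnicodeDecodeError:
            continue
        return name
    return None

# In-memory samples; for a real file, read the bytes with open(path, "rb").read().
utf8_sample = "中文 O\n".encode("utf-8")
gbk_sample = "中文 O\n".encode("gbk")
print(guess_encoding(utf8_sample))
print(guess_encoding(gbk_sample))
```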

@DianboWork
Owner

I have listed the data format in the README.

@s1162276945
Author

OK, I'll take a look.

@s1162276945
Author

My data comes from the gold-horse project, and I have not changed the file encoding.

@DianboWork
Owner

The weiboNER_2nd_conll data at the link I provided needs to be preprocessed into the format described in the README.

@s1162276945
Author

一 O
节 O
课 O
的 O
时 O
间 O
真 O
心 O
感 O
动 O
了 O
李 B-PER.NAM
开 I-PER.NAM
复 I-PER.NAM

@s1162276945
Author

The data I have is already in the format from your README. The encoding probably got corrupted when the archive was extracted.

@DianboWork
Owner

You can leave your email address, and I will send you the Weibo data that can be made public.

@s1162276945
Author

OK, thank you. My email address is [email protected]

@s1162276945
Author

s1162276945 commented Dec 23, 2019

> You can leave your email address, and I will send you the Weibo data that can be made public.

It was indeed a problem with the dataset's encoding.
Link: https://pan.baidu.com/s/11VlW0GY4AQsndB18bPKjrg
Extraction code: xxem

@zhangdddong

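My guess is that you are running this on Windows, while the author's code was presumably run on Linux. Datasets are usually encoded in UTF-8, but on a Chinese-locale Windows system the default text encoding is GBK, whereas on Linux it is UTF-8. On Windows you therefore need to specify the encoding explicitly when reading the file, for example: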
in_lines = open(input_file, 'r', encoding='UTF-8').readlines()
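The same idea as a small self-contained sketch that also closes the file with a `with` block. `read_lines` and the temporary demo file are illustrative names, not code from the repository; the sample lines follow the character/tag format posted earlier in this thread.

```python
import os
import tempfile

def read_lines(path, encoding="utf-8"):
    """Read all lines of a text file with an explicit encoding,
    so the result does not depend on the platform default."""
    with open(path, "r", encoding=encoding) as f:
        return f.readlines()

# Write a small UTF-8 demo file in the character/tag format.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write("李 B-PER.NAM\n开 I-PER.NAM\n")
    demo_path = tmp.name

lines = read_lines(demo_path)
os.remove(demo_path)
print(lines)
```

Passing `encoding` explicitly makes the call behave the same on Windows and Linux, which is the point of the suggestion above.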

@DianboWork
Owner

@zhangdddong That is the correct answer.

@kt-rax

kt-rax commented Oct 4, 2021

A catch-all workaround:
#lines = open(filename).read().split('\n')
# fix UnicodeDecodeError: 'gbk' codec can't decode byte 0xac in position 417448:
lines = open(filename,encoding='gb18030',errors='ignore').read().split('\n')
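A short sketch of what `errors='ignore'` does when the file is actually UTF-8, as the dataset in this issue turned out to be. The sample string is only an illustration. The call no longer raises, but undecodable bytes are dropped and the remaining bytes are interpreted as GB18030, so the original characters are not recovered.

```python
original = "李开复"
raw = original.encode("utf-8")  # bytes as stored in a UTF-8 encoded file

# No exception is raised, but the decoded text differs from the original.
lenient = raw.decode("gb18030", errors="ignore")
print(lenient == original)

# Strict decoding with the matching codec round-trips exactly.
strict = raw.decode("utf-8")
print(strict == original)
```

For tagged training data this kind of silent corruption may be harder to notice than the original exception, so the workaround seems best suited to files that really are GBK/GB18030 with a few stray bytes.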
