@author jackzhenguo
@desc
@tag
@version 
@date 2020/03/08

一行代码找到编码

兴高采烈地，从网页上抓取一段 content

但是，一 print 就不那么兴高采烈了，结果看到一串这个：

b'\xc8\xcb\xc9\xfa\xbf\xe0\xb6\xcc\xa3\xac\xce\xd2\xd3\xc3Python'

这是啥？又 x 又 c 的！

再一看，哦，原来是十六进制字节串 (bytes)，\x 表示十六进制

接下来，你一定想转化为人类能看懂的语言，想到 decode：

In [3]: b'\xc8\xcb\xc9\xfa\xbf\xe0\xb6\xcc\xa3\xac\xce\xd2\xd3\xc3Python'.decode()
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-3-7d0ea6148880> in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc8 in position 0: invalid continuation byte

马上，一盆冷水泼头上，抛异常了。。。。。

根据提示，UnicodeDecodeError，这是 unicode 解码错误。

原来，decode 默认的编码方法：utf-8

所以排除 b'\xc8\xcb\xc9\xfa\xbf\xe0\xb6\xcc\xa3\xac\xce\xd2\xd3\xc3Python' 使用 utf-8 的编码方式

可是，这不是四选一选择题啊，逐个排除不正确的！

编码方式几十种，不可能逐个排除吧。

那就猜吧！！！！！！！！！！！！！

人生苦短，我用Python

Python，怎忍心让你受累呢~

尽量三行代码解决问题

第一步，安装 chardet 它是 char detect 的缩写。

第二步，pip install chardet

第三步，出结果

In [6]: chardet.detect(b'\xc8\xcb\xc9\xfa\xbf\xe0\xb6\xcc\xa3\xac\xce\xd2\xd3\xc3Python')
Out[6]: {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}

编码方法：gb2312

解密字节串：

In [7]: b'\xc8\xcb\xc9\xfa\xbf\xe0\xb6\xcc\xa3\xac\xce\xd2\xd3\xc3Python'.decode('gb2312')
Out[7]: '人生苦短，我用Python'

目前，chardet 包支持的检测编码几十种。

[上一个例子](169.md) [下一个例子](171.md)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

170.md

170.md

一行代码找到编码

Files

170.md

Latest commit

History

170.md

File metadata and controls

一行代码找到编码