"Machine Learning in Action" notes: UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 199: illegal multibyte sequence

>>> bayes.spamTest()
Traceback (most recent call last):
  File "", line 1, in
    bayes.spamTest()
  File "E:\AI\FirstPythonProj\bayes.py", line 96, in spamTest
    wordList = textParse(open('MachineLearningSourceCode/Ch04/email/ham/%d.txt' % i).read())

UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 199: illegal multibyte sequence

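I was running this in the Python 3.6.4 Shell. Opening the text file showed no obviously illegal characters. Changing the statement to pass encoding='utf-8' only produced a different error: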

wordList = textParse(open('MachineLearningSourceCode/Ch04/email/ham/%d.txt' % i,encoding ='utf-8').read())

>>> bayes.spamTest()
Traceback (most recent call last):
  File "", line 1, in
    bayes.spamTest()
  File "E:\AI\FirstPythonProj\bayes.py", line 96, in spamTest
    wordList = textParse(open('MachineLearningSourceCode/Ch04/email/ham/%d.txt' % i,encoding ='utf-8').read())
  File "D:\Python\Python36-32\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 884: invalid start byte
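The two bytes named in the errors, 0xae and 0x92, are consistent with Windows-1252 text, where they map to ® and a right single quotation mark. That the email files were saved in that encoding is an assumption here, not something the tracebacks prove. A minimal sketch comparing the three codecs on hypothetical byte strings (the helper name try_decode and the sample bytes are made up for illustration):

```python
def try_decode(raw, codec):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None


# Hypothetical samples that embed the two bytes from the error messages.
sample_ae = b"SciFinance\xae is"
sample_92 = b"don\x92t"

for raw in (sample_ae, sample_92):
    for codec in ("gbk", "utf-8", "cp1252"):
        # None means the codec raised UnicodeDecodeError on these bytes.
        print(codec, repr(try_decode(raw, codec)))
```

A codec that does not raise is not necessarily the right one: a multibyte codec such as gbk can pair a stray byte with the following character and return the wrong text without any error.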

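The files are not valid UTF-8 either, so I changed the encoding back to gbk and added print(i) inside the for loop to see which txt file causes the problem: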

>>> bayes.spamTest()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
Traceback (most recent call last):
  File "", line 1, in
    bayes.spamTest()
  File "E:\AI\FirstPythonProj\bayes.py", line 96, in spamTest
    wordList = textParse(open('MachineLearningSourceCode/Ch04/email/ham/%d.txt' % i,encoding ='gbk').read())
UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 199: illegal multibyte sequence
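Instead of printing the loop counter, the failing files can also be located by reading each one as raw bytes and attempting the decode separately. The sketch below is one way to do that. The function name find_undecodable is hypothetical, and the file range in the commented usage is an assumption about how many emails the folder holds.

```python
from pathlib import Path


def find_undecodable(paths, encoding="gbk"):
    """Return (path, byte offset, offending byte) for each file that fails to decode."""
    problems = []
    for path in paths:
        raw = Path(path).read_bytes()  # raw bytes, no decoding yet
        try:
            raw.decode(encoding)
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first byte the codec rejected
            problems.append((str(path), exc.start, raw[exc.start]))
    return problems


# Usage (assumes the ham folder contains 1.txt through 25.txt):
# paths = ['MachineLearningSourceCode/Ch04/email/ham/%d.txt' % i for i in range(1, 26)]
# for path, offset, byte in find_undecodable(paths):
#     print(path, offset, hex(byte))
```

This reports every problem file in one pass, along with the same position and byte value that appear in the traceback.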

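The problem turned out to be in the file 23.txt. Changing "SciFinance?is " to "SciFinance is " (that is, deleting the character displayed as "?") fixes it. The original text reads: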

SciFinance?is a derivatives pricing and risk model development tool that automatically generates C/C++ and GPU-enabled source code from concise, high-level model specifications. No parallel computing or CUDA programming expertise is required.
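If editing the data file by hand is not desirable, an alternative is to tell open() to skip bytes it cannot decode. This is a sketch of a different approach from the manual fix above. The helper name read_text_lenient is hypothetical, and dropping bytes is only reasonable when, as here, the text is otherwise plain ASCII and the lost character does not matter for word counting.

```python
def read_text_lenient(path, encoding="utf-8"):
    """Read a text file, silently dropping any bytes the codec cannot decode."""
    # errors="ignore" is a standard error handler accepted by open()
    with open(path, encoding=encoding, errors="ignore") as f:
        return f.read()


# Usage inside spamTest (textParse and i come from the original bayes.py):
# wordList = textParse(read_text_lenient('MachineLearningSourceCode/Ch04/email/ham/%d.txt' % i))
```

utf-8 is used rather than gbk because, for mostly-ASCII text, an invalid byte is dropped on its own, whereas a double-byte codec may consume the following character together with it.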
