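I ran into these problems in Chapter 3 (decision trees) of Machine Learning in Action. Most of them are Python version issues. The line below is the Python 2 form: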
firstStr = myTree.keys()[0]
Python 3: keys() returns a view that cannot be indexed, so convert it to a list first:
firstStr = list(myTree.keys())[0]
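As a small illustration of why the conversion is needed, here is a sketch with a made-up nested dict (`my_tree` and its keys are placeholders, not the book's data). It also shows `next(iter(...))`, an alternative that avoids building the whole list.

```python
# Hypothetical tree in the nested-dict shape used for decision trees.
my_tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

# In Python 3, keys() returns a view object, which cannot be indexed.
try:
    my_tree.keys()[0]
    indexable = True
except TypeError:
    indexable = False

# Option 1: materialize the keys as a list, then index.
first_via_list = list(my_tree.keys())[0]

# Option 2: take the first key straight from an iterator.
first_via_iter = next(iter(my_tree))

print(indexable, first_via_list, first_via_iter)
```

Both options return the same key because dicts keep insertion order in current Python 3 versions.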
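Saving the tree with pickle raised an error.
Failing code:
try:
    with open(fileName, 'w') as fw:
        pickle.dump(inputTree, fw)
except IOError as e:
    print("File Error : " + str(e))
Cause: pickle writes bytes, so the file has to be opened in binary mode. A file opened in text mode ('w') rejects the write in Python 3.
Fix:
try:
    with open(fileName, 'wb') as fw:
        pickle.dump(inputTree, fw)
except IOError as e:
    print("File Error : " + str(e))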
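The round trip can be checked with a short sketch like the one below. The helper names `store_tree` and `grab_tree`, the sample tree, and the temporary file path are all assumptions made for illustration. Note that the text-mode failure is a TypeError, so an `except IOError` clause would not catch it.

```python
import os
import pickle
import tempfile

# Placeholder tree; any picklable object behaves the same way.
input_tree = {'no surfacing': {0: 'no', 1: 'yes'}}

def store_tree(tree, file_name):
    # pickle.dump writes bytes, so the file is opened in binary mode.
    with open(file_name, 'wb') as fw:
        pickle.dump(tree, fw)

def grab_tree(file_name):
    # Reading back needs binary mode as well.
    with open(file_name, 'rb') as fr:
        return pickle.load(fr)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'tree.pkl')

    # Text mode fails with TypeError, which "except IOError" does not catch.
    try:
        with open(path, 'w') as fw:
            pickle.dump(input_tree, fw)
        text_mode_error = None
    except TypeError as e:
        text_mode_error = e

    store_tree(input_tree, path)
    restored = grab_tree(path)

print(type(text_mode_error).__name__, restored == input_tree)
```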
In spamTest, reading the email files failed at the line marked below:
def spamTest():
    docList = []
    classList = []
    fullList = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullList.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())  # this line fails
        docList.append(wordList)
        fullList.extend(wordList)
        classList.append(0)
    vocabList = bayes.createVocabList(docList)
    trainingSet = list(range(50))
    testSet = []
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(bayes.setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = bayes.trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:
        wordVector = bayes.setOfWords2Vec(vocabList, docList[docIndex])
        if bayes.classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is:', float(errorCount) / len(testSet))
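1. Try gb18030, which covers more characters than gbk: still fails
wordList = textParse(open('email/ham/%d.txt' % i, encoding='gb18030').read())
2. Ignore decoding errors: works
wordList = textParse(open('email/ham/%d.txt' % i, encoding='gb18030', errors='ignore').read())
3. Open the file and find out which character is illegal: I gave up on this one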
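The three options above can be compared on a tiny made-up file. This is only a sketch: the byte string, the temporary file name, and the choice of 0xFF as the offending byte are assumptions, not the content of the real email files. The caught exception reports the offset of the bad byte, which is one way to approach option 3 without reading the file by eye.

```python
import os
import tempfile

# Made-up file content: ASCII text with one byte (0xFF) that gb18030 cannot decode.
raw = b'hello \xff world'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.txt')
    with open(path, 'wb') as f:
        f.write(raw)

    # Strict decoding (the default) raises and reports where decoding failed.
    bad_pos = None
    try:
        with open(path, encoding='gb18030') as f:
            f.read()
    except UnicodeDecodeError as e:
        bad_pos = e.start

    # errors='ignore' silently drops what cannot be decoded.
    with open(path, encoding='gb18030', errors='ignore') as f:
        ignored = f.read()

    # errors='replace' keeps a visible U+FFFD marker instead.
    with open(path, encoding='gb18030', errors='replace') as f:
        replaced = f.read()

print(bad_pos, repr(ignored), repr(replaced))
```

Since errors='ignore' discards data without any warning, errors='replace' may be the safer choice when it matters whether something was lost.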
# spamTest():
def spamTest():
    docList = []
    classList = []
    fullList = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i, encoding='gb18030', errors='ignore').read())
        docList.append(wordList)
        fullList.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i, encoding='gb18030', errors='ignore').read())
        docList.append(wordList)
        fullList.extend(wordList)
        classList.append(0)
    vocabList = bayes.createVocabList(docList)
    trainingSet = range(50)  # this line needs to change
    testSet = []
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]  # this line raises the error
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(bayes.setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = bayes.trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:
        wordVector = bayes.setOfWords2Vec(vocabList, docList[docIndex])
        if bayes.classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is:', float(errorCount) / len(testSet))
In Python 3.x this raises: 'range' object doesn't support item deletion
Cause: in Python 3.x, range returns a range object, not a list
Fix:
Change trainingSet = range(50) to trainingSet = list(range(50))
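The fix can be tried in isolation with the sketch below. The variable names are illustrative, and `random.randrange` stands in for the `int(random.uniform(...))` call used in spamTest. The last part shows `random.sample` as an alternative way to get the same kind of split; that is a swapped-in technique, not what the book's code does.

```python
import random

# In Python 3, range() is a lazy immutable sequence: items cannot be deleted.
try:
    r = range(5)
    del r[0]
    deletion_worked = True
except TypeError:
    deletion_worked = False

# Hold-out split as in spamTest: draw 10 test indices from 50 documents.
training_set = list(range(50))
test_set = []
for _ in range(10):
    rand_index = random.randrange(len(training_set))
    test_set.append(training_set[rand_index])
    del training_set[rand_index]

# Alternative with the same effect, using random.sample.
indices = list(range(50))
test_alt = random.sample(indices, 10)
train_alt = [i for i in indices if i not in test_alt]

print(deletion_worked, len(training_set), len(test_set), len(train_alt))
```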
Failing code: stochastic gradient ascent
# Stochastic gradient ascent
def stocGradAscent0(dataMatrix, classLabels):
    m, n = shape(dataMatrix)
    alpha = 0.01
    weights = ones(n)
    for i in range(m):
        h = sigmoid(sum(dataMatrix[i] * weights))
        error = classLabels[i] - h
        weights = weights + alpha * error * dataMatrix[i]
    return weights
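Cause: error is a float64, weights is a numpy array, and dataMatrix[i] is a list.
In Python, multiplying a list L by an integer n repeats the list, so its length becomes n*len(L). Multiplying a list by a float is simply an error. Neither is what we want here: we want every element of the current vector multiplied by error.
The real problem is mixing Python lists with numpy arrays. Converting dataMatrix to an array fixes it (the conversion can also be done before the argument is passed in; one more gripe about Python's type system).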
# Stochastic gradient ascent
def stocGradAscent0(dataMatrix, classLabels):
    # Convert explicitly so that arrays and lists are not mixed
    dataMatrix = array(dataMatrix)
    m, n = shape(dataMatrix)
    alpha = 0.01
    weights = ones(n)
    for i in range(m):
        h = sigmoid(sum(dataMatrix[i] * weights))
        error = classLabels[i] - h
        weights = weights + alpha * error * dataMatrix[i]
    return weights
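To make the difference between list multiplication and element-wise scaling concrete, here is a minimal sketch. It deliberately avoids numpy so it runs with only the standard library. `row` and `error` are made-up stand-ins for `dataMatrix[i]` and the error term, and the list comprehension imitates what a numpy array does when it is multiplied by a scalar.

```python
row = [1.0, 2.0, 3.0]   # a plain Python list, like one row of dataMatrix
error = 0.5             # stand-in for the float error term

# int * list repeats the list; it does not scale the elements.
repeated = 2 * row      # length is 2 * len(row)

# float * list is not defined at all.
try:
    error * row
    float_times_list_ok = True
except TypeError:
    float_times_list_ok = False

# What the update rule needs: every element multiplied by error.
scaled = [error * x for x in row]

print(repeated, float_times_list_ok, scaled)
```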
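copy (a shallow copy) does not fully duplicate the sub-objects of a compound object. What counts as a sub-object? A sequence nested inside another sequence, or a sequence stored in a dict, for example. A shallow copy creates a new outer object but only stores references to the same sub-objects, so when a sub-object is modified through one copy, the other copy sees the change as well.
deepcopy copies every level of a compound object into a separate, independent object.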
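A quick way to see the difference is the sketch below, which uses a made-up dict containing a nested list (the keys and values are placeholders).

```python
import copy

original = {'weights': [1, 2, 3], 'name': 'tree'}

shallow = copy.copy(original)      # new dict, but the inner list is shared
deep = copy.deepcopy(original)     # new dict and a new inner list

# Mutate the nested list through the original.
original['weights'].append(4)

print(shallow['weights'])  # shares the list, so it sees the change
print(deep['weights'])     # independent copy, unchanged

# Rebinding a top-level key only affects the dict it is done on.
original['name'] = 'forest'
print(shallow['name'])
```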
Python 3 division:
/ true division, keeps the fractional part: 3/2 = 1.5; 2/2 = 1.0
// floor division, like floor(): 3//2 = 1; 2//2 = 1
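The two operators can be compared directly. The sketch below also includes a negative operand, where flooring and truncating give different answers; the variable names are only for illustration.

```python
import math

true_div = 3 / 2          # true division always returns a float
whole_div = 2 / 2         # still a float, even when the result is whole
floor_div = 3 // 2        # floor division of two ints returns an int

# With a negative operand, // rounds toward negative infinity.
neg_floor = -3 // 2
neg_math_floor = math.floor(-3 / 2)

# int() truncates toward zero instead, so it disagrees for negatives.
neg_trunc = int(-3 / 2)

print(true_div, whole_div, floor_div, neg_floor, neg_math_floor, neg_trunc)
```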