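This Machine Learning in Action blog series mainly implements and works through the code from the book, so it doubles as my reading notes. Hands-on practice takes more than reading: once you start typing, you run into all sorts of odd problems. The posts are fairly rough and are best read alongside the book. I am learning as I go and my level is limited, so please point out any mistakes in the comments. The book's code and data are widely available online; please download them yourself.
A text string can be split into words with the string's split() method.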
>>> mySent = 'This book is the best book on python or M.L. I have ever laid eyes upon'
>>> mySent.split()
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M.L.', 'I', 'have', 'ever', 'laid', 'eyes', 'upon']
>>>
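Punctuation ends up as part of the tokens (for example 'M.L.'). A regular expression can split the text instead, using any run of characters other than letters, digits, and underscores as the delimiter.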
>>> import re
>>> regEX = re.compile(r'\W+')
>>> listOfTokens = regEX.split(mySent)
>>> listOfTokens
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon']
>>>
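Remove empty strings (the split above happens to produce none, since the sentence neither begins nor ends with a delimiter).
Convert the tokens to lowercase.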
>>> [tok for tok in listOfTokens if len(tok)>0]
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon']
>>> [tok.lower() for tok in listOfTokens if len(tok)>0]
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i', 'have', 'ever', 'laid', 'eyes', 'upon']
>>>
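To see why the `len(tok) > 0` filter matters, here is a small sketch using a hypothetical sentence that ends with punctuation (the sentence and variable names are illustrative, not from the book). Splitting on `\W+` leaves an empty string wherever the text starts or ends with a delimiter:

```python
import re

# Hypothetical input that ends with a delimiter (the final period).
sentence = 'Hello, world. This is M.L.'

raw_tokens = re.split(r'\W+', sentence)
# The trailing '.' produces an empty string at the end of the list.
print(raw_tokens)

# Drop empty strings and normalize case, as in the list comprehensions above.
tokens = [tok.lower() for tok in raw_tokens if len(tok) > 0]
print(tokens)
```

The first list ends with `''`, and the second contains only the lowercase words.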
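The email folder holds 25 spam and 25 ham messages. Ten of the 50 are picked at random as the test set and the remaining 40 form the training set. This approach is called hold-out cross validation.
Because the selection is random, the output differs from run to run. Repeating the experiment several times and averaging the error rate gives a more reliable estimate.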
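The split itself can be sketched independently of the email data. This is only an illustration under the assumption that documents are addressed by index; `holdout_split` and its parameters are hypothetical names, not from the book:

```python
import random

def holdout_split(n_samples, n_test, rng=random):
    """Randomly split indices 0..n_samples-1 into (train, test) index lists."""
    indices = list(range(n_samples))
    test = rng.sample(indices, n_test)   # draw n_test distinct indices
    held_out = set(test)
    train = [i for i in indices if i not in held_out]
    return train, test

# 50 documents, 10 held out for testing; a seeded generator makes the run repeatable.
train_idx, test_idx = holdout_split(50, 10, random.Random(0))
print(len(train_idx), len(test_idx))
```

Passing a seeded `random.Random` is a design choice for reproducibility; leaving it out gives a different split on every call, which is the behavior described above.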
import random
import re
from numpy import array

# createVocabList, bagOfWords2VecMN, trainNB0 and classifyNB come from elsewhere in the bayes module

def textParse(bigString):  # parse one large string into a list of tokens
    listOfTokens = re.split(r'\W+', bigString)
    # drop tokens of two characters or fewer and lowercase the rest
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)    # nested form: [[...], [...], ...]
        fullText.extend(wordList)   # flat form: [...]
        classList.append(1)         # class label: spam
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)         # class label: ham
    vocabList = createVocabList(docList)  # build the vocabulary
    trainingSet = list(range(50)); testSet = []  # indices of all 50 documents
    for i in range(10):  # move 10 random indices into the test set
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))  # word vector
        trainClasses.append(classList[docIndex])  # matching class label
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))  # train: yields 3 probabilities
    errorCount = 0
    for docIndex in testSet:  # evaluate on the test set
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])  # word vector
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1  # count the misclassification
            print("classification error", docList[docIndex])
    print('the error rate is: ', float(errorCount) / len(testSet))
    #return vocabList,fullText
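`spamTest` calls `createVocabList` and `bagOfWords2VecMN`, which are not reproduced in this post. As a rough sketch of what such helpers might look like (an assumption based on their names and how they are used above, not necessarily the book's exact code), the vocabulary is the set of unique words and each document becomes a vector of word counts:

```python
def createVocabList(dataSet):
    # Assumed behavior: collect every unique word across all documents.
    vocabSet = set()
    for document in dataSet:
        vocabSet |= set(document)
    return sorted(vocabSet)  # sorted only to make the order reproducible

def bagOfWords2VecMN(vocabList, inputSet):
    # Assumed behavior: count how often each vocabulary word occurs.
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

# Two tiny made-up documents for illustration.
docs = [['buy', 'now', 'buy'], ['see', 'you', 'now']]
vocab = createVocabList(docs)
print(vocab)
print(bagOfWords2VecMN(vocab, docs[0]))
```

With these toy documents, 'buy' appears twice in the first one, so its slot in the vector holds 2 rather than 1; that is the difference between a bag-of-words vector and a plain presence/absence vector.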
>>> bayes.spamTest()
classification error ['yeah', 'ready', 'may', 'not', 'here', 'because', 'jar', 'jar', 'has', 'plane', 'tickets', 'germany', 'for']
the error rate is: 0.1
>>> bayes.spamTest()
the error rate is: 0.0
>>> bayes.spamTest()
classification error ['experience', 'with', 'biggerpenis', 'today', 'grow', 'inches', 'more', 'the', 'safest', 'most', 'effective', 'methods', 'of_penisen1argement', 'save', 'your', 'time', 'and', 'money', 'bettererections', 'with', 'effective', 'ma1eenhancement', 'products', 'ma1eenhancement', 'supplement', 'trusted', 'millions', 'buy', 'today']
classification error ['yeah', 'ready', 'may', 'not', 'here', 'because', 'jar', 'jar', 'has', 'plane', 'tickets', 'germany', 'for']
the error rate is: 0.2
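Because the test set is drawn at random, a single run is a noisy estimate. A minimal sketch of averaging over repeated runs follows. It assumes a trial function that returns the error rate (the `spamTest` above only prints it, so it would need a `return` added), and `average_error_rate` is a hypothetical helper name. The three rates from the session above serve as stand-in data:

```python
def average_error_rate(run_trial, n_trials):
    """Call run_trial() n_trials times and return the mean error rate."""
    total = 0.0
    for _ in range(n_trials):
        total += run_trial()
    return total / n_trials

# Stand-in trial that replays the three error rates printed above.
observed = iter([0.1, 0.0, 0.2])
mean_rate = average_error_rate(lambda: next(observed), 3)
print(round(mean_rate, 3))
```

With a real trial function in place of the stand-in, a larger `n_trials` would smooth out the run-to-run variation seen in the three calls above.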