Reference book: Machine Learning in Action (《机器学习实战》)
Experiment: predict whether a dating candidate is attractive to the user.
Input data: each candidate has three features: frequent-flyer miles earned per year, percentage of time spent playing video games, and liters of ice cream consumed per week. (Personal note: I think these three features roughly reflect a person's wealth, leisure habits, and diet.)
Sample set: data for 1000 dating candidates, each labeled with one of three classes: dislike, moderately attractive, very attractive.
Experiment steps:
1. Use 90% of the samples as the training set and the remaining 10% as the test set, and measure the error rate of classify.py
2. The user enters a candidate's features, and the classifier outputs a predicted label as a suggestion
Code files:
file2Matrix.py: the sample set is stored in a txt file; this function loads it into memory as a NumPy array
plotDataSet.py: plots the sample set (only two features per sample can be shown)
autoNorm.py: normalizes the features, since their value ranges differ widely
datingClassTest.py: measures the error rate
classify.py: the classification function
classifyPerson.py: given one person's features, outputs the predicted label
knn.py: the main script
Source files:
file2Matrix.py: the sample set is stored in a txt file; this function loads it into memory as a NumPy array
__author__ = 'root'
import numpy as np

def file2Matrix(filename):
    # open file and read all lines; here lines is a list of strings
    fileHandle = open(filename, mode='r')
    lines = fileHandle.readlines()
    fileHandle.close()
    # pre-allocate the feature matrix: one row per sample, three features
    datingDataSet = np.zeros((len(lines), 3))
    labels = []
    # traverse all lines and save them to the matrix
    for i, line in enumerate(lines):
        listFromLine = line.strip().split('\t')
        datingDataSet[i, :] = listFromLine[0:3]
        labels.append(int(listFromLine[-1]))
    # return the data set and the labels
    return datingDataSet, labels
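As a quick sanity check of the parsing above: each record of datingTestSet2.txt is three tab-separated feature values followed by an integer class label (1-3). A minimal sketch, with the concrete values made up for illustration:

```python
import numpy as np

# a made-up record in the datingTestSet2.txt format:
# three tab-separated features, then an integer class label
line = '40920\t8.326976\t0.953952\t3\n'

# same parsing steps as file2Matrix
listFromLine = line.strip().split('\t')
features = np.array(listFromLine[0:3], dtype=float)
label = int(listFromLine[-1])
```

NumPy performs the string-to-float conversion when the sliced string list is assigned into the float array, which is why file2Matrix needs no explicit cast for the features.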
plotDataSet.py: plots the sample set (only two features per sample can be shown)
__author__ = 'root'
import numpy as np
import matplotlib.pyplot as plt

def plotDataSet(datingDataSet, labels):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # marker size and color both scale with the class label,
    # so the three classes are visually separated
    ax.scatter(datingDataSet[:, 0], datingDataSet[:, 1],
               15 * np.array(labels), 15 * np.array(labels))
    plt.show()
autoNorm.py: normalizes the features, since their value ranges differ widely
__author__ = 'root'
import numpy as np

def autoNorm(datingDataSet):
    # get the column-wise minimum and maximum of each feature
    dataSetMin = datingDataSet.min(axis=0)
    dataSetMax = datingDataSet.max(axis=0)
    # tile min/max to the shape of the data set, then
    # normalize: (value - min) / (max - min)
    rows = datingDataSet.shape[0]
    dataSetMinTiled = np.tile(dataSetMin, (rows, 1))
    dataSetMaxTiled = np.tile(dataSetMax, (rows, 1))
    datingDataSet = (datingDataSet - dataSetMinTiled) / (dataSetMaxTiled - dataSetMinTiled)
    # min and max are returned so that later inputs can be
    # normalized with the same scale
    return datingDataSet, dataSetMin, dataSetMax
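The min-max formula above can be checked on a tiny array with made-up numbers. Note that NumPy broadcasting makes the np.tile calls optional: subtracting a length-2 row vector from an (n, 2) matrix broadcasts automatically. A minimal sketch:

```python
import numpy as np

# toy data: two features with very different ranges
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

# column-wise min and max
dMin = data.min(axis=0)
dMax = data.max(axis=0)

# (value - min) / (max - min); broadcasting replaces np.tile
normed = (data - dMin) / (dMax - dMin)
# each column is now scaled to [0, 1]
```

Without this step, the flyer-miles feature (tens of thousands) would dominate the Euclidean distance over the other two small-valued features.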
datingClassTest.py: measures the error rate
__author__ = 'root'
import classify

def datingClassTest(datingDataSet, labels):
    # settings: ratio of test data, k
    ratio = 0.1
    k = 4
    # number of test samples
    lenOfDataSet = datingDataSet.shape[0]
    numOfTest = int(ratio * lenOfDataSet)
    print(numOfTest)
    # error counter
    numOfError = 0
    # the first numOfTest samples serve as the test set,
    # the remaining samples as the training set
    for i in range(numOfTest):
        inX = datingDataSet[i, :]
        label = labels[i]
        ans = classify.classify(inX, datingDataSet[numOfTest:lenOfDataSet, :],
                                labels[numOfTest:lenOfDataSet], k)
        if ans != label:
            numOfError += 1.0
            print('predict error')
    return numOfError / numOfTest
classify.py: the classification function
__author__ = 'root'
import numpy as np
import operator

def classify(inX, dataSet, labels, k):
    # Euclidean distance between inX and every sample in dataSet
    dataSetSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distance = sqDistances ** 0.5
    # indices that sort the distances from min to max
    sortedDistIndicies = distance.argsort()
    # count the class of each of the k nearest neighbours
    classCount = {}
    for i in range(k):
        className = labels[sortedDistIndicies[i]]
        # get(className, 0): return 0 if className is not in the dict yet
        classCount[className] = classCount.get(className, 0) + 1
    # sort the (class, count) pairs by count;
    # reverse=True means descending, reverse=False means ascending
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    # return the most common class among the k neighbours
    return sortedClassCount[0][0]
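The distance-and-vote logic can be exercised on a toy data set (all values invented here): with k = 3, the point [0.9, 1.0] has two 'B' neighbours and one 'A' neighbour, so the majority vote returns 'B'. A self-contained sketch of the same steps:

```python
import numpy as np
import operator

def classify(inX, dataSet, labels, k):
    # Euclidean distance from inX to every row of dataSet
    diff = np.tile(inX, (dataSet.shape[0], 1)) - dataSet
    distance = ((diff ** 2).sum(axis=1)) ** 0.5
    # indices of the samples sorted by distance, nearest first
    order = distance.argsort()
    # majority vote among the k nearest neighbours
    classCount = {}
    for i in range(k):
        name = labels[order[i]]
        classCount[name] = classCount.get(name, 0) + 1
    return sorted(classCount.items(),
                  key=operator.itemgetter(1), reverse=True)[0][0]

# toy data: class 'A' near the origin, class 'B' near (1, 1)
dataSet = np.array([[0.0, 0.0], [0.0, 0.1], [1.0, 1.0], [1.0, 1.1]])
labels = ['A', 'A', 'B', 'B']
result = classify([0.9, 1.0], dataSet, labels, 3)  # 'B'
```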
classifyPerson.py: given one person's features, outputs the predicted label
__author__ = 'root'
import numpy as np
import classify

def classifyPerson(datingDataSet, dataSetMin, dataSetMax, labels):
    resultList = ['not at all', 'a little like', 'like very much']
    k = 3
    # read the candidate's features from the user
    flyMiles = float(input('please input fly miles per year: '))
    percOfVideoGames = float(input('please input percentage of time you spend playing video games: '))
    iceCream = float(input('please input how much ice cream you eat every week: '))
    # normalize with the min/max of the training set
    inX = np.array([flyMiles, percOfVideoGames, iceCream])
    inX = (inX - dataSetMin) / (dataSetMax - dataSetMin)
    # predict
    ans = classify.classify(inX, datingDataSet, labels, k)
    # print result (labels are 1-3, so shift to 0-based index)
    print('you may feel this person:', resultList[ans - 1])
knn.py: the main script
__author__ = 'root'
import file2Matrix
import plotDataSet
import autoNorm
import datingClassTest
import classifyPerson

# load data into memory
datingDataSetOri, labels = file2Matrix.file2Matrix('datingTestSet2.txt')
print('datingDataSetOri:\n', datingDataSetOri)
print('labels:\n', labels)
# plot data
plotDataSet.plotDataSet(datingDataSetOri, labels)
# normalize each feature to [0, 1]
datingDataSet, dataSetMin, dataSetMax = autoNorm.autoNorm(datingDataSetOri)
print('datingDataSet:\n', datingDataSet)
# test error rate
errorRate = datingClassTest.datingClassTest(datingDataSet, labels)
print('errorRate:', errorRate)
# predict for a person entered by the user
classifyPerson.classifyPerson(datingDataSet, dataSetMin, dataSetMax, labels)
Summary:
Advantages of kNN: the algorithm is simple and easy to implement.
Drawbacks of kNN: 1. prediction time grows linearly with the number of samples, and the computation also grows linearly with the number of features; 2. there is no training phase, so the method cannot learn a feature representation of the samples.