2.2使用K邻近算法改进约会网站配对效果
数据文件在http://www.ituring.com.cn/book/1021随书下载处下载,注意将文件格式改为.rar
一 将文本记录转换为NumPy的解析程序
将下列代码增加到kNN.py中
def file2matrix(filename):
fr = open(filename)
numberOfLines = len(fr.readlines()) #get the number of lines in the file
returnMat =np.zeros((numberOfLines,3)) #prepare matrix to return
classLabelVector = [] #prepare labels return
fr = open(filename)
index = 0
for line in fr.readlines():
line = line.strip()
listFromLine = line.split('\t')
returnMat[index,:] = listFromLine[0:3]
classLabelVector.append(int(listFromLine[-1]))
index += 1
return returnMat,classLabelVector
1.readlines函数
参考文章http://www.cnblogs.com/qi09/archive/2012/02/10/2344964.html
2.zeros函数参考文章http://blog.csdn.net/u012005313/article/details/49182211
接着,在Python命令提示符下输入下面命令:
二 分析数据:使用Matplotlib创建散点图
在Python命令行环境中输入下列命令:
import matplotlib import matplotlib.pyplot as plt
fig = plt.figure() ax = fig.add_subplot(111) ax.scatter(datingDataMat[:,1],datingDataMat[:,2])
plt.show()
输出效果如图所示
没有样本类别标签的约会数据散点图,难以辨识出图中点究竟属于哪个样本
这时就要使用Matplotlib库提供的scatter函数个性化标记散点图上的点
重新输入上面的代码,在调用scatter函数时使用下列参数:
ax.scatter(datingDataMat[:,1], datingDataMat[:,2], 15.0 * array(datingLabels), 15.0 * array(datingLabels))
这是我在执行过程中出现的错误,这些地方需要稍微注意,
最后,我重新导入NumPy库,得到了新的约会数据散点图
2.2.3 准备数据:归一化数值
将下列代码增加到kNN.py中
def autoNorm(dataSet):
minVals = dataSet.min(0)
maxVals = dataSet.max(0)
ranges = maxVals - minVals
normDataSet = np.zeros(np.shape(dataSet))
m = dataSet.shape[0]
normDataSet = dataSet - np.tile(minVals, (m,1))
normDataSet = normDataSet/np.tile(ranges, (m,1)) #element wise divide
return normDataSet, ranges, minVals
三 分类器对约会网站的测试代码
def datingClassTest():
hoRatio = 0.10 #hold out 10%
datingDataMat,datingLabels = file2matrix('F:\MachineLearninginaction\Ch02\datingTestSet2.txt') #load data setfrom file
normMat, ranges, minVals = autoNorm(datingDataMat)
m = normMat.shape[0]
numTestVecs = int(m*hoRatio)
errorCount = 0.0
for i in range(numTestVecs):
classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],datingLabels[numTestVecs:m],3)
print "the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i])
if (classifierResult != datingLabels[i]): errorCount += 1.0
print "the total error rate is: %f" % (errorCount/float(numTestVecs))
print errorCount
四 约会网站预测函数
def classifyPerson():
#定义喜欢程度
resultList = ['not at all','in small doses','in large doses']
#输入玩游戏时间,飞行公里,和冰激凌消耗量
percentTats = float(raw_input("percentage of time spent playing video games?"))
ffMiles = float(raw_input("frequent flier miles earned per year?"))
iceCream = float(raw_input("liters of ice cream consumed per year?"))
#建立knn原始数据
datingDataMat,datingLabels = file2matrix('F:\MachineLearninginaction\Ch02\datingTestSet2.txt')
#特征归一化
normMat,ranges,minVals = autoNorm(datingDataMat)
#将输入量建成三个特征
inArr = np.array([ffMiles,percentTats,iceCream])
classifierResult = classify0((inArr-minVals)/ranges,normMat,datingLabels,3)
print "You will probably like this person: ", resultList[classifierResult - 1]
最后在Python命令行环境中输入如下命令