=====================================================================
github 源码同步:https://github.com/Thinkgamer/Machine-Learning-With-Python
针对数据预处理的一些问题和解决办法
自动填入缺失值的三种策略
本实例采用的处理缺失值得办法是,过滤掉label为空的数据,同时将数据中除了类别标签之外的所有空值填充0,且所用数据为处理后的数据(点击下载训练数据集 测试数据集),下面看代码
#coding:utf-8
'''
Created on 2016/4/25
@author: Gamer Think
'''
import LogisticRegession as lr
from numpy import *
#二分类问题进行分类
def classifyVector(inX,weights):
prob = lr.sigmoid(sum(inX * weights))
if prob>0.5:
return 1.0
else:
return 0.0
#训练和测试
def colicTest():
frTrain = open('horseColicTraining.txt'); frTest = open('horseColicTest.txt')
trainingSet = []; trainingLabels = []
#训练回归模型
for line in frTrain.readlines():
currLine = line.strip().split('\t')
lineArr =[]
for i in range(21):
lineArr.append(float(currLine[i]))
trainingSet.append(lineArr)
trainingLabels.append(float(currLine[21]))
trainWeights = lr.stocGradAscent1(array(trainingSet), trainingLabels, 1000)
errorCount = 0; numTestVec = 0.0
#测试回归模型
for line in frTest.readlines():
numTestVec += 1.0
currLine = line.strip().split('\t')
lineArr =[]
for i in range(21):
lineArr.append(float(currLine[i]))
if int(classifyVector(array(lineArr), trainWeights))!= int(currLine[21]):
errorCount += 1
errorRate = (float(errorCount)/numTestVec)
print "the error rate of this test is: %f" % errorRate
return errorRate
def multiTest():
numTests = 10
errorSum = 0.0
for k in range(numTests):
errorSum += colicTest()
print "after %d iterations the average error rate is: %f" % (numTests,errorSum/float(numTests))
if __name__=="__main__":
multiTest()
colicTest():打开测试集和训练集,并对数据进行格式化处理,并调用stocGradAscent1()函数进行回归系数的计算,设置迭代次数为500,这里可以自行设定,在系数计算完之后,导入测试集并计算分类错误率。整体看来,colictest()函数具有完全独立的功能,多次运行的结果可能稍有不同,这是因为其中包含有随机的成分在里边,如果计算的回归系数是完全收敛,那么结果才是确定的
multiTest():调用 colicTest() 10次并求结果的平均值
运行结果如下图: