KNN Classification on the Iris Dataset

Dataset Description

Data Set Information:

This is perhaps the best known database to be found in the pattern recognition literature. Fisher’s paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

Predicted attribute: class of iris plant.

This is an exceedingly simple domain.

This data differs from the data presented in Fisher's article (identified by Steve Chadwick, spchadwick '@' espeedaz.net). The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa" where the error is in the fourth feature. The 38th sample: 4.9,3.6,1.4,0.1,"Iris-setosa" where the errors are in the second and third features.

Attribute Information:

  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm
  5. class:
    – Iris Setosa
    – Iris Versicolour
    – Iris Virginica
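
Each record in the data file is one comma-separated line holding the four measurements followed by the class label. A minimal parsing sketch (the sample line follows the file's format):

```python
# One raw line in the iris data file: four measurements, then the class.
line = "5.1,3.5,1.4,0.2,Iris-setosa"

fields = line.split(",")
features = [float(v) for v in fields[:4]]  # sepal/petal lengths and widths
label = fields[4]

print(features, label)  # [5.1, 3.5, 1.4, 0.2] Iris-setosa
```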

Dataset Source

http://archive.ics.uci.edu/ml/
The UCI repository hosts many machine-learning datasets that are good for practice.

KNN Classification Notes

KNN algorithm link
1. Randomly split the data into a training set and a test set (tried with 7:3 and 6:4 ratios).
2. Decide whether to standardize the data, and which standardization method to use.
3. Choose a suitable k.
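
Step 1, the random split, can be sketched as a standalone helper (a minimal illustration; the function name and the dummy data are mine, not from the program below):

```python
import random

def split_train_test(data, labels, test_ratio=0.4, seed=None):
    """Randomly hold out test_ratio of the samples as a test set."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_test = int(len(data) * test_ratio)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [data[i] for i in train_idx]
    test = [data[i] for i in test_idx]
    train_y = [labels[i] for i in train_idx]
    test_y = [labels[i] for i in test_idx]
    return train, train_y, test, test_y

# 150 dummy samples, 6:4 split -> 90 train / 60 test
X = [[float(i)] for i in range(150)]
y = [i % 3 for i in range(150)]
tr, tr_y, te, te_y = split_train_test(X, y, test_ratio=0.4, seed=0)
print(len(tr), len(te))  # 90 60
```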

Test Results

Only the 6:4 split is reported below; with more training data the prediction results are generally better anyway.
1. No standardization, k=3

>>> reload(iris)
<module 'iris' from 'iris.py'>
>>> iris.irisClassTest()
2.0 0.966666666667
>>> iris.irisClassTest()
1.0 0.983333333333
>>> iris.irisClassTest()
2.0 0.966666666667
>>> iris.irisClassTest()
2.0 0.966666666667
>>> iris.irisClassTest()
1.0 0.983333333333

Five runs, 8 errors in total.
2. No standardization, k=5

>>> iris.irisClassTest()
0.0 1.0
>>> iris.irisClassTest()
2.0 0.966666666667
>>> iris.irisClassTest()
2.0 0.966666666667
>>> iris.irisClassTest()
3.0 0.95
>>> iris.irisClassTest()
5.0 0.916666666667

Five runs, 12 errors in total.

Below, k=3 is fixed in order to see how the choice of normalization affects the predictions.
3. k=3, max-min normalization

>>> iris.irisClassTest()
2.0 0.966666666667
>>> iris.irisClassTest()
4.0 0.933333333333
>>> iris.irisClassTest()
2.0 0.966666666667
>>> iris.irisClassTest()
4.0 0.933333333333
>>> iris.irisClassTest()
2.0 0.966666666667

Five runs, 14 errors in total.
4. k=3, z-score (mean/std) normalization

>>> iris.irisClassTest()
4.0 0.933333333333
>>> iris.irisClassTest()
3.0 0.95
>>> iris.irisClassTest()
7.0 0.883333333333
>>> iris.irisClassTest()
2.0 0.966666666667
>>> iris.irisClassTest()
5.0 0.916666666667

Five runs, 21 errors in total, clearly the worst of the four settings.

Summary

The value of k, whether the data is normalized, and the choice of normalization method all have a large impact on the results.
Because the three classes have equal sample counts, even a fairly large k does not push every prediction into whichever class happens to dominate the training set.
Because the data contains outliers, normalization can let a few extreme values disproportionately affect some samples.
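
The last point can be illustrated with a tiny numeric sketch (synthetic values, not the iris measurements): a single outlier defines the range used by max-min normalization and inflates the standard deviation used by z-score normalization.

```python
import numpy as np

vals = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one extreme outlier

# max-min: the outlier defines the range, so the normal points all land near 0
mm = (vals - vals.min()) / (vals.max() - vals.min())

# z-score: the outlier inflates the std, shrinking every value's magnitude
zs = (vals - vals.mean()) / vals.std()

print(mm)  # the first four values are all below 0.04
print(zs)
```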

Complete Python Program

# -*- coding: utf-8 -*-

import operator
import re
from urllib.request import urlopen

from numpy import *
from numpy import random


def get_data():
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/bezdekIris.data'
    return urlopen(url).read().decode('utf-8')

def init_data():
    # Preprocess the raw text: map class names to numbers,
    # then split features from labels.
    url_data = get_data()
    url_data = re.sub('Iris-setosa', '1', url_data)
    url_data = re.sub('Iris-versicolor', '2', url_data)
    url_data = re.sub('Iris-virginica', '3', url_data)
    lines = re.split('\n', url_data)
    m = len(lines) - 2          # the file ends with blank lines
    dataSet = zeros((m, 5))
    for row in range(m):
        line = re.split(',', lines[row])
        for i in range(len(line)):
            dataSet[row, i] = line[i]
    dataLabel = dataSet[:, -1]
    dataSet = dataSet[:, (0, 1, 2, 3)]
    return dataSet, dataLabel

def classify0(intX, dataSet, labels, k):
    # Tile intX into a matrix the same shape as dataSet, so one subtraction
    # yields the offset from intX to every training sample at once.
    dataSetSize = dataSet.shape[0]
    diffMat = tile(intX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # argsort() returns the indices that would sort distances in ascending order
    sortedDistIndicies = distances.argsort()
    classCount = {}
    # Vote among the k nearest neighbors; the majority label wins.
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def norm_max_min(dataSet):
    # max-min normalization: scale each feature to [0, 1]
    maxVals = dataSet.max(axis=0)
    minVals = dataSet.min(axis=0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return minVals, ranges, normDataSet

def norm_zScore(dataSet):
    dataSet = array(dataSet)
    # z-score normalization: per-feature mean and standard deviation
    meanVals = dataSet.mean(0)
    stdVals = dataSet.std(0)
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(meanVals, (m, 1))
    normDataSet = normDataSet / tile(stdVals, (m, 1))
    return meanVals, stdVals, normDataSet

def irisClassTest():
    dataSet, dataLabel = init_data()
    trainSet = []
    trainLabel = []
    testSet = []
    testLabel = []
    predictClass = []
    # k is chosen among the odd numbers below 20
    k = 5
    # running error count used to compute the accuracy
    error = 0.0
    # randomly hold out 40% of the samples as the test set
    testLen = int(len(dataLabel) * 0.4)
    # trainIndex starts with all 150 indices; test indices are removed from it
    trainIndex = list(range(len(dataLabel)))
    testIndex = []

    for index in range(testLen):
        randIndex = int(random.uniform(0, len(trainIndex)))
        testIndex.append(trainIndex[randIndex])
        del trainIndex[randIndex]
    # training set
    for index in trainIndex:
        trainSet.append(dataSet[index])
        trainLabel.append(dataLabel[index])
    # test set
    for index in testIndex:
        testSet.append(dataSet[index])
        testLabel.append(dataLabel[index])
    # Normalize using statistics computed from the training set only,
    # then classify each test sample and count the errors.
    # minVals, ranges, normTrainSet = norm_max_min(array(trainSet))
    meanVals, stdVals, normTrainSet = norm_zScore(array(trainSet))

    for i in range(testLen):
        intX = array(testSet[i])
        # normX = (intX - minVals) / ranges
        normX = (intX - meanVals) / stdVals
        classifierResult = classify0(normX, array(normTrainSet), trainLabel, k)
        predictClass.append(classifierResult)
        if classifierResult != testLabel[i]:
            error = error + 1.0
    acc = 1.0 - error / float(testLen)
    print(error, acc)
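
To sanity-check the classifier without downloading anything, classify0 can be run on a tiny made-up 2-D dataset (the points and labels below are illustrative, not iris data; the function is repeated so the snippet is self-contained):

```python
from numpy import array, tile
import operator

def classify0(intX, dataSet, labels, k):
    # Same logic as in the program above: Euclidean distance plus
    # a majority vote over the k nearest neighbors.
    dataSetSize = dataSet.shape[0]
    diffMat = tile(intX, (dataSetSize, 1)) - dataSet
    distances = ((diffMat ** 2).sum(axis=1)) ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    return max(classCount.items(), key=operator.itemgetter(1))[0]

group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify0(array([0.1, 0.1]), group, labels, 3))  # B
print(classify0(array([1.0, 1.2]), group, labels, 3))  # A
```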

