KNN Algorithm and Decision Trees

1. Overview of the k-Nearest Neighbor Algorithm
k-nearest neighbor (KNN) is a basic classification and regression method, proposed by Cover and Hart in 1968. Its input is an instance's feature vector, corresponding to a point in feature space; its output is the instance's class, which may be one of several classes. The method assumes a training data set in which the class of every instance is already known. To classify a new instance, it looks at the classes of that instance's k nearest training instances and predicts by majority vote (or a similar decision rule); KNN therefore has no explicit training step. Put simply: given a training set, for a new input instance we find the k training instances closest to it, and whichever class holds the majority among those k instances is assigned to the input instance. This is where the "k" in k-nearest neighbor comes from; k is usually an integer no larger than 20.
  The three basic elements of KNN: the choice of k, the distance metric, and the classification decision rule.
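
To see the three elements together before implementing KNN by hand, here is a minimal sketch using scikit-learn's KNeighborsClassifier (it assumes scikit-learn is installed, and the query point [0.2, 0.3] is made up for illustration): k is set with n_neighbors, the distance metric defaults to Euclidean, and the decision rule is majority voting.

from sklearn.neighbors import KNeighborsClassifier  # assumes scikit-learn is available
import numpy as np

X_train = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])  # labelled samples (same data as Example 1)
y_train = ['A', 'A', 'B', 'B']

clf = KNeighborsClassifier(n_neighbors=3)   # k = 3; Euclidean distance by default
clf.fit(X_train, y_train)                   # "training" only stores the data
print(clf.predict([[0.2, 0.3]]))            # majority vote among the 3 nearest neighbours -> ['B']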

2. Algorithm Principle
Let X_test be the sample to be labelled and X_train the labelled data set. The algorithm in pseudocode (a runnable sketch follows the list):
1. Iterate over all samples in X_train, compute the distance between each sample and X_test, and store the distances in an array Distance.
2. Sort the Distance array and take the k closest points, denoted X_knn.
3. Count the number of samples of each class in X_knn, i.e. how many samples in X_knn belong to class0, how many to class1, and so on.
4. The class of the sample to be labelled is the class with the most samples in X_knn.
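
The four steps above map almost line for line onto NumPy. The sketch below reuses the data from Example 1; the function name knn_classify and its argument names are my own, not part of the original code.

import numpy as np
from collections import Counter

def knn_classify(X_test, X_train, y_train, k):
    # Step 1: distance from X_test to every sample in X_train
    Distance = np.sqrt(((X_train - X_test) ** 2).sum(axis=1))
    # Step 2: indices of the k closest points (X_knn)
    knn_idx = np.argsort(Distance)[:k]
    # Step 3: count how many samples of each class appear among the k neighbours
    counts = Counter(y_train[i] for i in knn_idx)
    # Step 4: the majority class is the prediction
    return counts.most_common(1)[0][0]

X_train = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
y_train = ['A', 'A', 'B', 'B']
print(knn_classify(np.array([0, 0]), X_train, y_train, 3))  # -> 'B'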

3. Example 1

import operator
import numpy as np

def createDataSet():
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])  # create the sample array
    labels = ['A', 'A', 'B', 'B']
    return group, labels

def classify0(inX, dataSet, labels, k):
    # inX is the sample to classify, dataSet is the data set, labels are the class labels
    # of the samples; note that here k is used as a distance threshold (radius), not a neighbour count
    dataSetSize = dataSet.shape[0]  # shape[0] is the number of rows of dataSet
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    # tile repeats inX into a dataSetSize-row array; subtracting dataSet gives the coordinate differences
    sqDiffMat = diffMat**2  # squared coordinate differences
    sqDistances = sqDiffMat.sum(axis=1)  # squared distances
    distances = sqDistances**0.5  # Euclidean distances
    sortedDistIndicies = np.argsort(distances)  # indices that sort the distances in ascending order
    x = 0
    for i in sortedDistIndicies:
        if distances[i] <= k:  # count the points that lie within distance k
            x += 1
    classCount = {}  # dictionary: class -> number of votes
    for i in range(x):  # range() yields the integers 0 .. x-1
        voteIlabel = labels[sortedDistIndicies[i]]  # class of each point within distance k
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1  # e.g. {'B': 2, 'A': 1}: votes per class
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)  # sort by count, descending
    return sortedClassCount[0][0]

def main():
    group, labels = createDataSet()
    a = eval(input('Enter a 2-D sample to classify: '))  # eval() evaluates the string expression, e.g. [2,2]
    k = eval(input('Enter K (the distance threshold): '))
    result = classify0(a, group, labels, k)
    print(result)

if __name__ == '__main__':
    main()

Result:

Enter a 2-D sample to classify: [2,2]
Enter K (the distance threshold): 4
B
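
Note that classify0 treats k as a radius: every training point within distance k of inX gets one vote. It can also be called directly, without the input() prompts; the query point [0, 0.2] and the radius 0.5 below are made-up values for illustration.

group, labels = createDataSet()
print(classify0([0, 0.2], group, labels, 0.5))  # only [0,0] and [0,0.1] lie within 0.5 -> 'B'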

Decision Tree
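
The code below builds a decision tree in the style of ID3: at each node it measures the impurity of the class labels with the Shannon entropy H(D) = -sum_k p_k * log2(p_k), and splits on the feature that yields the largest information gain (the largest drop in entropy). As a quick check of the entropy at the root, assuming the six-sample data set defined below (2 'yes', 3 'no', 1 'maybe'):

from math import log
probs = [2/6, 3/6, 1/6]                    # class proportions at the root node
print(-sum(p * log(p, 2) for p in probs))  # about 1.459 bits, matching calcShannonEnt(dataSet)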

# coding=utf-8
import operator
from math import log
import time
def createDataSet():
  dataSet = [[1, 1, 'yes'],
             [1, 1, 'yes'],
             [1, 0, 'no'],
             [0, 1, 'no'],
             [0, 1, 'no'],
             [0, 0, 'maybe']]
  labels = ['no surfacing', 'flippers']
  return dataSet, labels
# compute the Shannon entropy of the data set
def calcShannonEnt(dataSet):
  numEntries = len(dataSet)
  labelCounts = {}
  for feaVec in dataSet:
    currentLabel = feaVec[-1]
    if currentLabel not in labelCounts:
      labelCounts[currentLabel] = 0
    labelCounts[currentLabel] += 1
  shannonEnt = 0.0
  for key in labelCounts:
    prob = float(labelCounts[key]) / numEntries
    shannonEnt -= prob * log(prob, 2)
  return shannonEnt
# split out the samples whose feature at index `axis` equals `value`, removing that feature column
def splitDataSet(dataSet, axis, value):
  retDataSet = []
  for featVec in dataSet:
    if featVec[axis] == value:
      reducedFeatVec = featVec[:axis]
      reducedFeatVec.extend(featVec[axis + 1:])
      retDataSet.append(reducedFeatVec)
  return retDataSet
def chooseBestFeatureToSplit(dataSet):
  numFeatures = len(dataSet[0]) - 1 # the last column of each sample is the class label
  baseEntropy = calcShannonEnt(dataSet)
  bestInfoGain = 0.0
  bestFeature = -1
  for i in range(numFeatures):
    featList = [example[i] for example in dataSet]
    uniqueVals = set(featList)
    newEntropy = 0.0
    for value in uniqueVals:
      subDataSet = splitDataSet(dataSet, i, value)
      prob = len(subDataSet) / float(len(dataSet))
      newEntropy += prob * calcShannonEnt(subDataSet)
    infoGain = baseEntropy - newEntropy
    if infoGain > bestInfoGain:
      bestInfoGain = infoGain
      bestFeature = i
  return bestFeature
# The tree is built by recursively consuming features, so the features may run out before
# every branch is pure; in that case the node's class is decided by majority vote
def majorityCnt(classList):
  classCount = {}
  for vote in classList:
    if vote not in classCount.keys():
      classCount[vote] = 0
    classCount[vote] += 1
  return max(classCount.items(), key=operator.itemgetter(1))[0]  # class with the most votes
def createTree(dataSet, labels):
  classList = [example[-1] for example in dataSet]
  if classList.count(classList[0]) == len(classList): # stop splitting if all classes are identical
    return classList[0]
  if len(dataSet[0]) == 1: # all features have been used up
    return majorityCnt(classList)
  bestFeat = chooseBestFeatureToSplit(dataSet)
  bestFeatLabel = labels[bestFeat]
  myTree = {bestFeatLabel: {}}
  del (labels[bestFeat])
  featValues = [example[bestFeat] for example in dataSet]
  uniqueVals = set(featValues)
  for value in uniqueVals:
    subLabels = labels[:] # copy so that the original label list is not modified
    myTree[bestFeatLabel][value] = createTree(
        splitDataSet(dataSet, bestFeat, value), subLabels)
  return myTree
def main():
  data, label = createDataSet()
  t1 = time.perf_counter()  # time.clock() was removed in Python 3.8
  myTree = createTree(data, label)
  t2 = time.perf_counter()
  print(myTree)
  print('execution time:', t2 - t1)
if __name__ == '__main__':
  main()

Result

{'no surfacing': {0: {'flippers': {0: 'maybe', 1: 'no'}}, 1: {'flippers': {0: 'no', 1: 'yes'}}}}
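
The code above only builds the tree; it does not classify new samples. Below is a minimal sketch of a companion classify function, assuming a tree of the nested-dict form produced by createTree (the function name classify and the test vector [1, 0] are illustrative, not part of the original code).

def classify(inputTree, featLabels, testVec):
    # walk down the nested dict until a leaf (a plain class label) is reached
    featLabel = list(inputTree.keys())[0]      # feature tested at this node
    secondDict = inputTree[featLabel]
    featIndex = featLabels.index(featLabel)    # position of that feature in the sample vector
    subTree = secondDict[testVec[featIndex]]   # branch matching the sample's value for that feature
    if isinstance(subTree, dict):              # internal node: keep descending
        return classify(subTree, featLabels, testVec)
    return subTree                             # leaf: the predicted class

data, label = createDataSet()
tree = createTree(data, label[:])              # pass a copy: createTree deletes labels as it uses them
print(classify(tree, ['no surfacing', 'flippers'], [1, 0]))  # -> 'no'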
