Reading Notes on A Programmer's Guide to Data Mining, Part 4

Preface

The source code and data for this book are available at http://guidetodatamining.com/.
The theory in this book is fairly simple, the text contains few errors, and there is plenty of hands-on practice; if you write out every piece of code yourself, you will learn a lot. In short: a good book for getting started.
Feel free to repost; please credit the source, and corrections are welcome if you spot a problem.
Collection: https://www.zybuluo.com/hainingwyx/note/559139

Algorithm Evaluation and kNN

10-fold cross validation: split the data set into 10 parts, use 9 of them for training and the remaining 1 for testing, and repeat the process 10 times so that each part serves as the test set exactly once.

Leave-one-out: n-fold cross validation, where n is the number of samples. Unlike 10-fold cross validation, whose result depends on how the data happens to be partitioned and is therefore somewhat random, leave-one-out gives a deterministic result. Its drawback is the large computational cost; it also rules out stratified sampling, which matters because stratification keeps the class proportions uniform across folds.

Confusion matrix: each row corresponds to the actual class of the test samples, and each column to the class predicted by the classifier. It reveals the classifier's performance, in particular which classes it confuses with one another.
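As an illustration, a confusion matrix can be built in a few lines of Python (the labels and predictions below are made-up toy data, not from the book):

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are the actual classes, columns the predicted classes."""
    matrix = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

actual    = ['cat', 'cat', 'dog', 'dog', 'dog']
predicted = ['cat', 'dog', 'dog', 'dog', 'cat']
m = confusion_matrix(actual, predicted, ['cat', 'dog'])
# diagonal entries (m['cat']['cat'], m['dog']['dog']) are correct predictions;
# off-diagonal entries show which classes get confused
```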

# divide data into 10 buckets
import random

def buckets(filename, bucketName, separator, classColumn):
    """the original data is in the file named filename
    bucketName is the prefix for all the bucket names
    separator is the character that divides the columns
    (for ex., a tab or comma and classColumn is the column
    that indicates the class"""

    # put the data in 10 buckets
    numberOfBuckets = 10
    data = {}
    # first read in the data and divide by category
    with open(filename) as f:
        lines = f.readlines()
    for line in lines:
        if separator != '\t':
            line = line.replace(separator, '\t')
        # first get the category
        category = line.split()[classColumn]
        data.setdefault(category, [])   #set the value for dic data
        data[category].append(line)     #all the information 
    # initialize the buckets [[], [], ...]
    buckets = []
    for i in range(numberOfBuckets):
        buckets.append([])       
    # now for each category put the data into the buckets
    for k in data.keys():
        #randomize order of instances for each class
        #data[k] is a list of line
        random.shuffle(data[k])
        bNum = 0
        # divide into buckets
        for item in data[k]:
            buckets[bNum].append(item)
            bNum = (bNum + 1) % numberOfBuckets

    # write each bucket to its own file (assumes the tmp/ directory exists)
    for bNum in range(numberOfBuckets):
        with open("%s-%02i" % ('tmp/' + bucketName, bNum + 1), 'w') as f:
            for item in buckets[bNum]:
                f.write(item)

# example of how to use this code          
buckets("data/mpgData.txt", 'mpgData',',',0)

Classifier evaluation: the Kappa statistic measures a classifier's performance relative to a random classifier.
$$
\kappa =\frac{P(c)-P(r)}{1-P(r)}
$$
$P(c)$ is the accuracy of the actual classifier, and $P(r)$ is the accuracy of a random classifier.

| Kappa range | Performance |
| --- | --- |
| < 0 | worse than random |
| 0.01–0.20 | slight agreement |
| 0.21–0.40 | fair agreement |
| 0.41–0.60 | moderate agreement |
| 0.61–0.80 | substantial agreement |
| 0.81–1.00 | almost perfect agreement |
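As a sketch, the Kappa statistic can be computed directly from a confusion matrix given as a list of rows (actual × predicted), following the formula above:

```python
def kappa(matrix):
    """Kappa = (P(c) - P(r)) / (1 - P(r)) for a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    # P(c): observed accuracy, the diagonal over the grand total
    p_c = sum(matrix[i][i] for i in range(len(matrix))) / total
    # P(r): expected accuracy of a random classifier that predicts each
    # class with the same frequency as the real classifier did
    p_r = 0.0
    for i in range(len(matrix)):
        actual_i = sum(matrix[i])                    # row total
        predicted_i = sum(row[i] for row in matrix)  # column total
        p_r += (actual_i / total) * (predicted_i / total)
    return (p_c - p_r) / (1 - p_r)
```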

kNN: with a single nearest neighbor, one atypical (outlier) sample can cause the points near it to be misclassified. The remedy is to consult k neighbors instead of one. The closer a neighbor is, the larger its influence should be; the influence factor can be taken as the inverse of the distance.

# a method of the classifier class; requires `import heapq` and `import random`
def knn(self, itemVector):
    """returns the predicted class of itemVector using k
    Nearest Neighbors"""
    # changed from min to heapq.nsmallest to get the
    # k closest neighbors
    neighbors = heapq.nsmallest(self.k,
                                [(self.manhattan(itemVector, item[1]), item)
                                 for item in self.data])
    # each neighbor gets a vote
    results = {}
    for neighbor in neighbors:
        theClass = neighbor[1][0]
        results.setdefault(theClass, 0)
        results[theClass] += 1
    # sort the (votes, class) pairs so the most-voted classes come first
    resultList = sorted([(i[1], i[0]) for i in results.items()], reverse=True)
    # get all the classes that have the maximum votes
    maxVotes = resultList[0][0]
    possibleAnswers = [i[1] for i in resultList if i[0] == maxVotes]
    # randomly select one of the classes that received the max votes
    answer = random.choice(possibleAnswers)
    return answer
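The method above uses plain majority voting; the inverse-distance weighting mentioned earlier can be sketched as a standalone function (the names and data layout here are illustrative, not from the book):

```python
import heapq

def weighted_knn(k, item_vector, data, distance):
    """Distance-weighted kNN voting (sketch): each of the k nearest
    neighbors votes with weight 1 / distance; a small epsilon guards
    against division by zero when a neighbor matches exactly.
    `data` is a list of (class, vector) pairs; `distance` is any metric."""
    neighbors = heapq.nsmallest(
        k, [(distance(item_vector, vec), cls) for cls, vec in data])
    votes = {}
    for dist, cls in neighbors:
        votes[cls] = votes.get(cls, 0.0) + 1.0 / (dist + 1e-9)
    # the class with the largest total weight wins
    return max(votes, key=votes.get)
```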

In engineering practice, the more data you have, the better almost any algorithm performs, so collecting more data often pays off more than tuning the algorithm. For a research paper, on the other hand, you still need to come up with an algorithm that delivers even a modest performance improvement.
