Unsupervised learner--k-Nearest Neighbor

K最近邻(k-Nearest Neighbor，KNN)分类算法

引入背景

最粗暴的分类器，记录所有的训练数据，当测试对象的属性和某个训练对象的属性完全匹配时，变对其进行分类。
测试对象找不到与之完全匹配的训练对象。
测试对象同时与多个训练对象匹配，导致一个训练对象被分割。

原理

KNN是机器学习中最基本的分类器
通过测试不同特征值之间的距离进行分类
如果一个样本在特征空间中的k个最相似（即特征空间中最邻近）的样本中的大多数属于某一个类别，则该样本也属于这个类别。
KNN算法依赖于K的选择，如果k值比较小，相当于只是用了样本中较少的数据进行了“训练”，容易产生==过拟合==，误差过大；如果k值比较大，相当于和估计数据不相近的样本也参与了预测，造成模型偏差过大。一个经验为：k一般低于训练样本数的平方根
距离的选择主要采用欧式距离和曼哈顿距离（范数）漫谈：机器学习中距离和相似性度量方法

步骤

计算测试数据与各个训练数据（已知类别数据）之间的距离；
按照距离递增排序
选择距离值最小的K个点；
确定前K个点所在类别的出现频率；
返回前K个点中出现频率最高的类别作为测试数据的预测分类。

优点与缺点

优点：简单，适合多类分类问题，相对精度高，对异常值不敏感，无数据输入假定。
缺点：计算量大，每一次预测都需要计算所有样本的距离；如样本不平衡，预测结果不准确，空间复杂度高；

Python实现

# ===欧氏距离分类（构建KNN分类器）====
def classify0(x, dataset, labels, k):
    dataset_size = dataset.shape[0]  # shape[0] stands for the num of row
    # step 1: calculate Euclidean distance
    # tile(A, reps): Construct an array by repeating A reps times
    # the following copy numSamples rows for dataSet
    diff_mat = np.tile(x, (dataset_size, 1)) - dataset
    sq_diff_mat = diff_mat ** 2
    sq_distances = sq_diff_mat.sum(axis=1)
    distances = sq_distances ** 0.5
    # step 2: sort the distance
    # argsort() returns the indices that would sort an array in a ascending order
    sorted_dist_indicies = distances.argsort()

    class_count = {}
    for i in range(k):
        # step 3: choose the min k distance
        voted_label = labels[sorted_dist_indicies[i]]
        # step 4: count the times labels occur
        # when the key voteLabel is not in dictionary classCount, get()
        # will return 0
        class_count[voted_label] = class_count.get(voted_label, 0) + 1
        # step 5: the max voted class will return
        # Python 2.x 里面用`class_count.iteritems()
    sorted_class_count = sorted(class_count.iteritems(),
                                key=operator.itemgetter(1),
                                reverse=True)
    return sorted_class_count[0][0]

Unsupervised learner--k-Nearest Neighbor

K最近邻(k-Nearest Neighbor，KNN)分类算法

引入背景

原理

步骤

优点与缺点

Python实现

你可能感兴趣的:(Unsupervised learner--k-Nearest Neighbor)