2020-01-27

我处理它的方式，就是首先创建一个 Python 列表，它包含另一个列表，里面包含数据集中每个点的距离和分类。一旦填充完毕，我们就可以根据距离来排序列表，截取列表的前 K 个值，找到出现次数最多的，就找到了答案。

def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
        
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = sqrt( (features[0]-predict[0])**2 + (features[1]-predict[1])**2 )
            distances.append([euclidean_distance,group])

有一种方式来计算欧氏距离，最简洁的方式就是遵循定义。也就是说，使用 NumPy 会更快一点。由于 KNN 是一种机器学习的爆破方法，我们需要我们能得到的所有帮助。因此，我们可以将函数修改一点。一个选项是：

euclidean_distance = np.sqrt(np.sum((np.array(features)-np.array(predict))**2))
print(euclidean_distance)

这还是很清楚，我们刚刚使用了 NumPy 版本。NumPy 使用 C 优化，是个非常高效的库，很多时候允许我们计算更快的算术。也就是说，NumPy 实际上拥有大量的线性代数函数。例如，这是范数：

euclidean_distance = np.linalg.norm(np.array(features)-np.array(predict))
print(euclidean_distance)

欧式距离度量两个端点之间的线段长度。欧几里得范数度量向量的模。向量的模就是它的长度，这个是等价的。名称仅仅告诉你你所在的控件。

我打算使用后面那一个，但是我会遵循我的约定，使其易于拆解成代码。如果你不了解 NumPy 的内建功能，你需要去了解如何使用。

现在，for循环之外，我们得到了距离列表，它包含距离和分类的列表。我们打算对列表排序，之后截取前 K 个元素，选取下标 1，它就是分类。

votes = [i[1] for i in sorted(distances)[:k]]

上面，我们遍历了排序后的距离列表的每个列表。排序方法会（首先）基于列表中每个列表的第一个元素。第一个元素是距离，所以执行orted(distances)之后我们就按照从小到大的距离排序了列表。之后我们截取了列表的[:k]，因为我们仅仅对前 K 个感兴趣。最后，在for循环的外层，我们选取了i[1]，其中i就是列表中的列表，它包含[diatance, class]（距离和分类的列表）。按照距离排序之后，我们无需再关心距离，只需要关心分类。

所以现在有三个候选分类了。我们需要寻找出现次数最多的分类。我们会使用 Python 标准库模块collections.Counter。

vote_result = Counter(votes).most_common(1)[0][0]

Collections会寻找最常出现的元素。这里，我们想要一个最常出现的元素，但是你可以寻找前 3 个或者前x个。如果没有[0][0]这部分，你会得到[('r', 3)]（元素和计数的元组的列表）。所以[0][0]会给我们元组的第一个元素。你看到的 3 是'r'的计数。

最后，返回预测结果，就完成了。完整的代码是：

def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
        
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features)-np.array(predict))
            distances.append([euclidean_distance,group])

    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result

现在，如果我们打算基于我们之前所选的点，来做预测：

result = k_nearest_neighbors(dataset, new_features)
print(result)

非常肯定，我得到了r，这就是预期的值。让我们绘制它吧。

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
import warnings
from math import sqrt
from collections import Counter
style.use('fivethirtyeight')

def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
        
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features)-np.array(predict))
            distances.append([euclidean_distance,group])

    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result

dataset = {'k':[[1,2],[2,3],[3,1]], 'r':[[6,5],[7,7],[8,6]]}
new_features = [5,7]
[[plt.scatter(ii[0],ii[1],s=100,color=i) for ii in dataset[i]] for i in dataset]
# same as:
##for i in dataset:
##    for ii in dataset[i]:
##        plt.scatter(ii[0],ii[1],s=100,color=i)
        
plt.scatter(new_features[0], new_features[1], s=100)

result = k_nearest_neighbors(dataset, new_features)
plt.scatter(new_features[0], new_features[1], s=100, color = result)  
plt.show()

你可以看到，我们添加了新的点5, 7，它分类为红色的点，符合预期。

这只是小规模的处理，但是如果我们处理乳腺肿瘤数据集呢？我们如何和 Sklearn 的 KNN 算法比较？下一个教程中，我们会将算法用于这个数据集。

2020-01-27

你可能感兴趣的:(2020-01-27)