MNIST Handwritten Digit Recognition Experiment (KNN)

Dataset: MNIST
The data preprocessing follows https://blog.csdn.net/simple_the_best/article/details/75267863
The useful information extracted is the 28 × 28 pixel matrix of each image and its label.
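For reference, the MNIST files are stored in the IDX binary format: a big-endian header followed by raw unsigned bytes. The following is a minimal sketch of a parser for uncompressed files, not necessarily what the referenced post does; the function names `parse_idx_images` and `parse_idx_labels` are hypothetical.

```python
import struct
import numpy as np

def parse_idx_images(buf):
    """Parse an uncompressed IDX3 buffer into an (n, rows, cols) uint8 array."""
    # Header: magic number, image count, rows, cols (four big-endian uint32).
    magic, n, rows, cols = struct.unpack(">IIII", buf[:16])
    if magic != 2051:
        raise ValueError("not an IDX3 image buffer")
    pixels = np.frombuffer(buf, dtype=np.uint8, count=n * rows * cols, offset=16)
    return pixels.reshape(n, rows, cols)

def parse_idx_labels(buf):
    """Parse an uncompressed IDX1 buffer into a length-n uint8 array."""
    # Header: magic number and label count (two big-endian uint32).
    magic, n = struct.unpack(">II", buf[:8])
    if magic != 2049:
        raise ValueError("not an IDX1 label buffer")
    return np.frombuffer(buf, dtype=np.uint8, count=n, offset=8)
```

Each function takes the full contents of a file read in binary mode. Note that the pixels come back as `uint8`, which matters for any later arithmetic on them.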
KNN implementation:

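def KNN(train_dataset, train_labels, input_vec, distance, k=1):
    """Classify input_vec by majority vote among its k nearest training samples."""
    # Pair each training sample's distance to the input with its label.
    dis_labels = []
    n = len(train_dataset)
    for i in range(n):
        vec = train_dataset[i]
        label = train_labels[i]
        dis_labels.append([distance(vec, input_vec), label])
    # Keep only the k closest samples.
    dis_labels.sort(key=lambda p: p[0])
    dis_labels = dis_labels[:k]
    # Count the votes for each label.
    class_count = {}
    for p in dis_labels:
        class_count[p[1]] = class_count.get(p[1], 0) + 1
    # Default to the nearest neighbour's label (index 1; index 0 is the distance).
    res = dis_labels[0][1]
    ma = 0
    # Strict '<' means a tie goes to the label encountered first, i.e. the nearer one.
    for p in class_count.items():
        if ma < p[1]:
            res, ma = p
    return res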
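The `distance` argument is not shown in the source. As an illustration only, here is a minimal sketch of a Euclidean distance that could be passed in; the name `euclidean` and the cast to floating point are assumptions, not the author's code.

```python
import numpy as np

def euclidean(a, b):
    """Euclidean (L2) distance between two images or vectors of equal size."""
    # Cast to float first: MNIST pixels are commonly stored as uint8, and
    # unsigned subtraction would wrap around instead of going negative.
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(np.sqrt(np.sum((a - b) ** 2)))
```

It would be used as `KNN(train_dataset, train_labels, input_vec, euclidean, k=5)`.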

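The dataset contains 60000 training samples and 10000 test samples. With brute-force KNN and no optimization, using all of them means that computing the pairwise distances alone takes on the order of 6×10^4 × 10^4 × 28 × 28 pixel-level operations. That is too slow for testing on a personal computer, so I sampled the data before testing: 1000 training samples and 100 test samples for each digit. The results were poor, however: across the different values of k, accuracy was only around 20%.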
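The per-digit sampling can be sketched as follows. The source does not say how the samples were picked, so random selection with a fixed seed is an assumption, and `sample_per_class` is a hypothetical helper name. The last lines compare the rough cost of the full and the sampled experiment.

```python
import random
from collections import defaultdict

def sample_per_class(labels, per_class, seed=0):
    """Return sorted indices containing `per_class` random samples of each label."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[int(label)].append(idx)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    chosen = []
    for label in sorted(by_label):
        # random.sample raises ValueError if a class has fewer than per_class items.
        chosen.extend(rng.sample(by_label[label], per_class))
    return sorted(chosen)

# Rough cost comparison (pixel-level operations for brute-force KNN).
full_cost = 60000 * 10000 * 28 * 28
sampled_cost = (10 * 1000) * (10 * 100) * 28 * 28
print(full_cost // sampled_cost)  # 60
```

Under these numbers the sampled experiment does about 1/60 of the work of the full one.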

testing k = 2
k = 2 done! accuracy = 0.255
testing k = 3
k = 3 done! accuracy = 0.23
testing k = 4
k = 4 done! accuracy = 0.227
testing k = 5
k = 5 done! accuracy = 0.224
testing k = 6
k = 6 done! accuracy = 0.223
testing k = 7
k = 7 done! accuracy = 0.212
testing k = 8
k = 8 done! accuracy = 0.211
testing k = 9
k = 9 done! accuracy = 0.209
testing k = 10
k = 10 done! accuracy = 0.196

Although the MNIST page does mention:

With some classification methods (particularly template-based methods, such as SVM and K-nearest neighbors), the error rate improves when the digits are centered by bounding box rather than center of mass.

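Even so, the accuracy was far lower than expected. After comparing with other people's approaches online, I found that, apart from the KNN implementation itself, the only difference was whether the images were binarized. With binarization, accuracy is much higher.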
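Binarization here means mapping every pixel to 0 or 1. A minimal sketch, assuming NumPy arrays of grayscale pixels; the threshold used in the experiment is not stated in the source, so the default of 0 (any non-black pixel becomes 1) is an assumption, as is the name `binarize`.

```python
import numpy as np

def binarize(images, threshold=0):
    """Map pixels strictly above `threshold` to 1 and all others to 0."""
    return (np.asarray(images) > threshold).astype(np.uint8)
```

For example, `binarize(train_images)` would be applied to both the training and the test images before computing distances.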

testing k = 2
k = 2 done! accuracy = 0.934
testing k = 3
k = 3 done! accuracy = 0.935
testing k = 4
k = 4 done! accuracy = 0.938
testing k = 5
k = 5 done! accuracy = 0.933
testing k = 6
k = 6 done! accuracy = 0.933
testing k = 7
k = 7 done! accuracy = 0.927
testing k = 8
k = 8 done! accuracy = 0.926
testing k = 9
k = 9 done! accuracy = 0.926
testing k = 10
k = 10 done! accuracy = 0.925

Why the accuracy improves, and by such a large margin, is not clear to me.
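One possible explanation, which the source cannot confirm because neither the `distance` function nor the pixel dtype is shown: if the images are kept as `uint8` NumPy arrays, subtraction and squaring wrap around modulo 256, so distances on raw pixels become nearly meaningless, while on 0/1 images the wrapped result happens to equal the true squared difference. A small demonstration of that effect under this assumption:

```python
import numpy as np

# Raw grayscale pixels stored as uint8 (assumed dtype).
a = np.array([0, 200, 10], dtype=np.uint8)
b = np.array([200, 0, 12], dtype=np.uint8)

# uint8 arithmetic wraps modulo 256 in both the subtraction and the square.
wrapped = np.sum((a - b) ** 2)
# Widening to a signed type first gives the true squared distance.
correct = np.sum((a.astype(np.int64) - b.astype(np.int64)) ** 2)
print(int(wrapped), int(correct))  # 132 80004

# Binarized pixels: (0 - 1) wraps to 255, and 255 * 255 wraps back to 1.
x = np.array([0, 1, 1], dtype=np.uint8)
y = np.array([1, 0, 1], dtype=np.uint8)
wrapped_bin = np.sum((x - y) ** 2)
correct_bin = np.sum((x.astype(np.int64) - y.astype(np.int64)) ** 2)
print(int(wrapped_bin), int(correct_bin))  # 2 2
```

If this is the cause, converting the raw images to a signed or floating-point type before computing distances should remove most of the gap without binarization.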

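In both cases accuracy shows no strong dependence on k: the differences are small, and accuracy tends to drop slightly as k increases. So for a quick test, almost any k in this range will probably do.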
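When comparing several values of k, the neighbours of each test sample only need to be sorted once. Below is a sketch of such a sweep using brute-force Euclidean distance in NumPy; `accuracy_for_ks` is a hypothetical helper, not the script that produced the logs above.

```python
from collections import Counter
import numpy as np

def accuracy_for_ks(train_x, train_y, test_x, test_y, ks):
    """Return {k: accuracy} for brute-force KNN with Euclidean distance."""
    # Flatten images to vectors and use floats to avoid integer wraparound.
    train_x = np.asarray(train_x, dtype=np.float64).reshape(len(train_x), -1)
    test_x = np.asarray(test_x, dtype=np.float64).reshape(len(test_x), -1)
    correct = {k: 0 for k in ks}
    for x, y in zip(test_x, test_y):
        # Squared distance gives the same neighbour ordering as the distance itself.
        d2 = ((train_x - x) ** 2).sum(axis=1)
        order = np.argsort(d2, kind="stable")[:max(ks)]
        neighbour_labels = [train_y[i] for i in order]
        for k in ks:
            # Equal counts keep first-seen order, so a tie goes to the nearer label.
            pred = Counter(neighbour_labels[:k]).most_common(1)[0][0]
            correct[k] += int(pred == y)
    return {k: correct[k] / len(test_y) for k in ks}
```

For example, `accuracy_for_ks(train_x, train_y, test_x, test_y, range(2, 11))` would cover the same range of k as the logs.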


Accuracy on the full dataset with k = 5:
No preprocessing:
k = 5 done! accuracy = 0.284
Binarized:
k = 5 done! accuracy = 0.963
