【python】【kNN】【OCR】Character Recognition in Python

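1. Problem

OCR (optical character recognition) is one of the important applications of machine learning. A typical pipeline runs through binarization, denoising, skew correction, feature extraction, character segmentation, character recognition, and post-processing. Character segmentation is usually the hardest step, and character recognition is the most critical one. Common recognition methods include kNN, SVM, and CNN, with SVM often considered a practical choice. This post implements the comparatively simple kNN (k-nearest neighbors) algorithm and applies it to digit recognition on the classic MNIST dataset, which has 60,000 training samples and 10,000 test samples.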

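2. Principle

The core idea of kNN: if most of the k samples nearest to a given sample in feature space belong to one class, the sample is assigned to that class and is assumed to share the characteristics of that class. Generally speaking, kNN is considered a better fit than many other methods when the class regions of the samples overlap or intersect heavily.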
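The voting rule described above can be sketched on a toy problem. This is a minimal illustration, not the code used later in the post: the helper name `knn_predict` and the 2-D sample points are made up for the example.

```python
from collections import Counter
import math

def knn_predict(query, points, labels, k):
    """Return the majority label among the k points closest to query."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    nearest_labels = [labels[i] for i in order[:k]]
    # most_common(1) returns [(label, count)] for the most frequent label.
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy 2-D data (invented for illustration): two loose clusters.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
labels = ["a", "a", "a", "b", "b", "b"]

pred_near_a = knn_predict((0.1, 0.1), points, labels, k=3)
pred_near_b = knn_predict((1.0, 0.9), points, labels, k=3)
print(pred_near_a, pred_near_b)
```

Each query point is labeled by its three closest neighbors, so a point inside a cluster takes that cluster's label.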

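3. Solution

① Load the data and look at a sample image to settle on a basic approach
② Define the kNN function
③ Call the kNN function
④ Compare how the training-set size and the choice of distance metric affect accuracy and running time

First come the routine steps: loading the dataset, splitting it, and viewing a sample image. Readers need not spend much time on this code and can skip ahead to the next code segment.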

%matplotlib inline

import numpy as np
from PIL import Image
from matplotlib import pyplot as plt
import copy
import scipy.ndimage 
import sys, os

DATASET_PATH = r'D:\path\to\dataset'  # placeholder: the folder that contains mnist.npz
DATASET_FILE = os.path.join(DATASET_PATH, 'mnist.npz')

f = np.load(DATASET_FILE)
x_train, y_train = f['x_train'], f['y_train']
x_test, y_test = f['x_test'], f['y_test']

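# uint8 is an unsigned 8-bit integer type with range 0-255
def img_show(img):
    plt.imshow(Image.fromarray(np.uint8(img)))
    plt.axis('on') # use 'off' to hide the axes
    plt.title('image') # figure title
    plt.show()
    
img = x_train[0]  # take the first training image
print(img.shape)  # (28, 28) if mnist.npz stores 2-D images, (784,) if it stores flat vectors
img = img.reshape(28, 28)  # make sure the image has its original 28x28 shape
img_show(img)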

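In practice it is enough to define a single distance metric, most commonly the Euclidean distance. Four are defined here (Euclidean, Manhattan, Chebyshev, and Minkowski) so that they can be compared. In the results recorded later in this post, Euclidean distance performed best. Note that the Minkowski function is written for p = 2, in which case it is mathematically the same as the Euclidean distance.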

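def _diff(x, y):
    # MNIST pixels are uint8, and uint8 subtraction wraps around modulo 256,
    # so cast to float64 before taking the difference.
    return np.asarray(x, dtype=np.float64) - np.asarray(y, dtype=np.float64)
def euclidean_dist(x,y):
    return np.linalg.norm(_diff(x,y))
def manhattan_dist(x,y):
    return np.sum(np.abs(_diff(x,y)))
def chebyshev_dist(x,y):
    return np.max(np.abs(_diff(x,y)))
def minkowski_dist(x,y,p=2):
    # p defaults to 2, matching the original formula; p=2 equals the Euclidean distance.
    return np.sum(np.abs(_diff(x,y))**p)**(1.0/p)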
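Two points about these metrics are easy to check numerically. First, pixel arrays loaded as uint8 wrap around on subtraction, so differences need a wider type. Second, the Manhattan, Euclidean, and Chebyshev distances are the Minkowski distance at p = 1, p = 2, and the limit p → ∞. The following is a small sketch under the assumption that the images are uint8 arrays; the helper `minkowski` and the three-element vectors are invented for the example.

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p, computed in float64 (illustrative helper)."""
    d = np.abs(np.asarray(x, dtype=np.float64) - np.asarray(y, dtype=np.float64))
    return float(np.sum(d ** p) ** (1.0 / p))

a = np.array([0, 10, 200], dtype=np.uint8)
b = np.array([5, 10, 100], dtype=np.uint8)

wrapped = a - b                                        # uint8 arithmetic: 0 - 5 wraps to 251
true_diff = a.astype(np.int64) - b.astype(np.int64)    # signed difference: -5, 0, 100

manhattan = minkowski(a, b, 1)                 # p = 1
euclidean = minkowski(a, b, 2)                 # p = 2
chebyshev = float(np.max(np.abs(true_diff)))   # limit p -> infinity
print(wrapped, true_diff, manhattan, euclidean, chebyshev)
```

The first element shows the problem: the true difference is -5, but the uint8 result is 251, which would dominate any distance computed from it.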

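Next comes the definition of the kNN function. It takes four parameters:
x is the image to be labeled, with its pixel matrix flattened into a vector;
M is the sample matrix, i.e. the training data, with one flattened image per row;
k is the number of neighbors of the target point to consider;
dtype selects the distance metric, with options 0, 1, 2, 3 corresponding in order to the four distance functions above.
The function returns the indices in M of the k samples closest to x.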

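Note that both x and M are flattened before they are passed in. Both must also be NumPy arrays for the NumPy-based distance functions to work, so whatever type comes in (in practice M is a list), the function first converts it to an array.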

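Note that M[:10000] below means only the first 10,000 samples of the training set that was passed in are searched. This speeds up the computation at some cost in accuracy. The slice can be changed as needed.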

def KNN(x, M, k, dtype):
    # Cast to float64 so that uint8 pixel differences cannot wrap around.
    x = np.array(x, dtype=np.float64)
    M = np.array(M)
    # dtype 0..3 selects Euclidean, Manhattan, Chebyshev, Minkowski.
    dist_funcs = [euclidean_dist, manhattan_dist, chebyshev_dist, minkowski_dist]
    dist_func = dist_funcs[dtype]
    dist = []
    for a in M[:10000]:  # adjust the training-subset size here
        dist.append(dist_func(x, a.astype(np.float64)))
    # A stable argsort yields k distinct indices even when distances tie;
    # looking up sorted values with list.index would repeat the first tied index.
    idx = np.argsort(dist, kind='stable')[:k]
    return idx.tolist()
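One detail worth checking when selecting the k smallest distances is how ties are handled. Mapping sorted distance values back to positions with `list.index` returns the first matching position every time, so two samples at the same distance collapse into one index. A small sketch with made-up distances shows the difference from `np.argsort`:

```python
import numpy as np

dist = [3.0, 1.0, 1.0, 2.0]   # made-up distances; samples 1 and 2 are tied
k = 3

# Value lookup: each tied value maps back to the same (first) position.
by_index = [dist.index(d) for d in sorted(dist)[:k]]

# Stable argsort: k distinct positions, ties kept in their original order.
by_argsort = np.argsort(dist, kind="stable")[:k].tolist()

print(by_index)    # sample 1 appears twice, sample 2 is lost
print(by_argsort)  # samples 1, 2, 3
```

With image data, exact ties are most likely between duplicate or near-blank images, but when they occur the value-lookup approach lets one neighbor vote twice.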

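Next, define the lookup function. It takes the vector of indices of the target point's k neighbors and returns the digit those k neighbors vote for.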

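from collections import Counter

def find(x_result):
    # x_result: indices of the k nearest training samples
    y_result = y_train[x_result]
    # Majority vote over the neighbors' labels.
    res0 = Counter(y_result).most_common(1)
    res = int(res0[0][0])  # plain int, so the collected results print cleanly
    print("The digit is "+str(res))
    return res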

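Next, define the validation function. A prediction counts as correct if it matches the true label; accuracy is the number of correct predictions divided by the total number checked.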

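def vali(result, r_result):
    # result: predicted labels; r_result: true labels
    r_con = 0  # number of correct predictions
    c_con = 0  # number of predictions checked
    for i in range(len(result)):
        if(result[i] == r_result[i]):
            r_con += 1
        c_con += 1
    print(str(r_con)+" results correct in total, accuracy "+str(r_con/c_con))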
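The same accuracy figure can also be computed without an explicit loop. The sketch below is an alternative formulation, not the author's function: the name `accuracy` and the sample labels are invented, and the function returns the value so it can be stored or compared.

```python
import numpy as np

def accuracy(predicted, actual):
    """Fraction of positions where predicted equals actual."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    if predicted.shape != actual.shape:
        raise ValueError("predicted and actual must have the same shape")
    # Comparing elementwise gives booleans; their mean is the share of matches.
    return float(np.mean(predicted == actual))

acc = accuracy([7, 2, 1, 0, 0], [7, 2, 1, 0, 4])
print(acc)
```

The shape check guards against silently comparing lists of different lengths, which the index-based loop would either truncate or fail on.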

This step flattens every image matrix in x_train into a vector.

new_x_train = []
for i in x_train:
    new_x_train.append(np.ravel(i))

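This step calls the kNN function on the first 500 test images, with k = 11, and prints the results.
Note that the value passed for dtype here is 3, which selects the Minkowski distance. This argument can be changed as needed.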

y_result = []
for i in x_test[:500]:
    x_result = KNN(np.ravel(i),new_x_train,11,3)
    y_result.append(find(x_result))
print(y_result)

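Two adjustable parameters were mentioned above. Varying them one at a time gives the predictions (y_result) obtained with the first 1,000, the first 10,000, and all 60,000 training samples under Euclidean distance, as well as the predictions for each of the four distance metrics with 1,000 training samples.
The printed outputs of those runs were copied by hand into the lists below so their accuracy can be compared.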
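Copying outputs by hand can be avoided by looping over the settings and recording accuracy and elapsed time for each. The sketch below illustrates the idea on small synthetic data rather than MNIST. The helpers `knn_label` and `run_config`, the two-cluster data, and the choice k = 3 are all assumptions made for the example.

```python
import time
from collections import Counter
import numpy as np

def knn_label(x, train_x, train_y, k, dist_func):
    """Majority label among the k training rows closest to x."""
    dists = [dist_func(x, a) for a in train_x]
    nearest = np.argsort(dists, kind="stable")[:k]
    return Counter(train_y[nearest].tolist()).most_common(1)[0][0]

def run_config(train_x, train_y, test_x, test_y, k, dist_func):
    """Classify every test row; return (accuracy, seconds elapsed)."""
    start = time.perf_counter()
    preds = [knn_label(x, train_x, train_y, k, dist_func) for x in test_x]
    elapsed = time.perf_counter() - start
    return float(np.mean(np.array(preds) == test_y)), elapsed

# Synthetic stand-in data: two well-separated clusters in 4 dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 4, [10.0] * 4])
train_y = np.repeat([0, 1], 20)
train_x = centers[train_y] + rng.normal(size=(40, 4))
test_y = np.repeat([0, 1], 5)
test_x = centers[test_y] + rng.normal(size=(10, 4))

dist_funcs = {
    "euclidean": lambda x, y: np.linalg.norm(x - y),
    "manhattan": lambda x, y: np.sum(np.abs(x - y)),
    "chebyshev": lambda x, y: np.max(np.abs(x - y)),
}
results = {name: run_config(train_x, train_y, test_x, test_y, 3, f)
           for name, f in dist_funcs.items()}
for name, (acc, secs) in results.items():
    print(f"{name}: accuracy={acc:.2f}, time={secs:.4f}s")
```

The same loop structure could iterate over training-set sizes as well, which keeps each accuracy figure attached to the setting that produced it.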

y_r1000  = [7, 2, 1, 0, 0, 1, 4, 9, 8, 9, 0, 6, 9, 0, 1, 1, 9, 7, 3, 4, 9, 6, 6, 5, 1, 0, 7, 9, 0, 1, 3, 1, 3, 6, 7, 2, 7, 1, 1, 1, 1, 7, 9, 1, 1, 5, 1, 1, 9, 4, 6, 3, 5, 1, 0, 0, 4, 1, 9, 1, 7, 8, 9, 1, 7, 1, 7, 4, 3, 0, 7, 0, 2, 7, 1, 7, 1, 7, 1, 7, 9, 6, 2, 7, 8, 4, 7, 5, 6, 1, 3, 6, 1, 3, 1, 9, 1, 7, 6, 9, 1, 0, 5, 4, 9, 9, 2, 1, 9, 4, 8, 1, 1, 9, 1, 1, 9, 4, 7, 7, 5, 6, 7, 6, 7, 9, 0, 5, 8, 5, 6, 6, 5, 7, 8, 1, 0, 1, 6, 7, 6, 7, 3, 1, 9, 1, 8, 2, 0, 1, 9, 9, 9, 5, 1, 5, 6, 0, 3, 1, 4, 6, 5, 4, 6, 5, 4, 1, 1, 9, 9, 7, 3, 1, 1, 1, 1, 8, 1, 8, 1, 8, 1, 0, 1, 6, 2, 5, 0, 1, 1, 1, 0, 7, 0, 1, 1, 6, 9, 2, 3, 6, 1, 1, 1, 1, 9, 3, 2, 9, 7, 1, 9, 1, 9, 0, 3, 8, 7, 5, 9, 0, 2, 7, 1, 3, 8, 1, 1, 1, 5, 1, 8, 7, 7, 9, 2, 1, 4, 1, 5, 3, 8, 7, 1, 1, 0, 6, 4, 1, 9, 1, 9, 1, 7, 1, 1, 1, 2, 0, 8, 1, 7, 7, 1, 1, 0, 1, 3, 0, 3, 0, 1, 9, 9, 9, 1, 8, 2, 1, 2, 9, 1, 1, 9, 2, 6, 4, 1, 1, 8, 2, 9, 1, 0, 9, 0, 0, 2, 8, 1, 7, 1, 1, 9, 0, 1, 1, 4, 1, 3, 0, 0, 5, 1, 9, 1, 5, 0, 6, 1, 1, 9, 1, 6, 9, 6, 0, 7, 1, 1, 1, 1, 3, 3, 1, 9, 7, 0, 6, 5, 1, 1, 3, 8, 1, 0, 5, 1, 1, 1, 5, 0, 6, 1, 8, 5, 1, 1, 9, 4, 6, 7, 1, 5, 0, 6, 5, 6, 1, 7, 2, 0, 8, 8, 5, 9, 1, 1, 4, 0, 7, 3, 7, 6, 1, 6, 1, 1, 7, 2, 8, 6, 1, 7, 5, 1, 5, 4, 4, 2, 1, 1, 1, 1, 4, 5, 0, 5, 1, 7, 7, 8, 7, 9, 9, 1, 9, 2, 1, 1, 2, 9, 2, 0, 9, 9, 1, 4, 1, 1, 1, 6, 4, 9, 8, 9, 3, 7, 6, 0, 0, 3, 1, 8, 0, 6, 1, 9, 5, 3, 3, 1, 3, 9, 1, 1, 6, 9, 0, 9, 6, 6, 6, 7, 8, 8, 2, 8, 8, 8, 7, 6, 1, 8, 4, 1, 2, 1, 3, 1, 9, 7, 1, 9, 0, 8, 9, 9, 1, 6, 5, 2, 3, 7, 6, 9, 1, 0, 1]
y_r10000 = [7, 2, 1, 0, 4, 1, 9, 9, 4, 9, 0, 6, 9, 0, 1, 5, 9, 7, 3, 4, 9, 6, 6, 5, 4, 0, 7, 4, 0, 1, 3, 1, 3, 6, 7, 2, 7, 1, 1, 1, 1, 7, 4, 2, 1, 5, 1, 1, 9, 4, 6, 3, 5, 1, 6, 0, 4, 1, 9, 1, 7, 8, 1, 1, 7, 1, 6, 4, 3, 0, 7, 0, 2, 9, 1, 7, 1, 7, 9, 7, 9, 6, 2, 7, 8, 4, 7, 3, 6, 1, 3, 6, 1, 3, 1, 4, 1, 7, 6, 9, 6, 0, 5, 4, 9, 9, 2, 1, 9, 9, 8, 1, 1, 9, 7, 1, 1, 4, 9, 7, 8, 6, 1, 6, 7, 9, 0, 5, 8, 5, 6, 6, 8, 7, 8, 1, 0, 1, 6, 9, 6, 7, 3, 1, 7, 1, 8, 2, 0, 1, 9, 8, 5, 8, 1, 5, 6, 0, 3, 1, 4, 6, 5, 4, 6, 5, 4, 1, 1, 4, 9, 7, 3, 1, 2, 1, 1, 8, 1, 8, 1, 8, 1, 0, 1, 9, 2, 3, 0, 1, 1, 1, 0, 9, 0, 1, 1, 6, 9, 2, 3, 6, 1, 1, 1, 3, 9, 8, 2, 9, 7, 5, 9, 1, 9, 0, 3, 6, 5, 5, 7, 2, 2, 7, 1, 3, 8, 1, 1, 1, 3, 1, 8, 7, 1, 9, 2, 1, 4, 1, 5, 8, 8, 7, 1, 6, 0, 6, 4, 1, 9, 1, 9, 5, 7, 1, 1, 1, 2, 6, 8, 1, 7, 7, 1, 1, 8, 1, 3, 0, 3, 0, 1, 9, 9, 9, 1, 8, 2, 1, 2, 9, 1, 1, 9, 2, 6, 9, 1, 5, 9, 2, 9, 2, 0, 9, 0, 0, 2, 8, 1, 7, 1, 1, 9, 0, 2, 9, 4, 1, 3, 0, 0, 3, 1, 9, 1, 5, 3, 5, 1, 7, 9, 1, 6, 9, 6, 0, 7, 1, 1, 2, 1, 5, 3, 1, 9, 7, 8, 6, 6, 1, 1, 3, 8, 1, 0, 5, 1, 3, 1, 8, 0, 6, 1, 8, 5, 1, 9, 9, 4, 6, 7, 2, 8, 0, 6, 5, 6, 1, 7, 2, 0, 8, 8, 5, 4, 1, 1, 4, 0, 7, 3, 7, 6, 1, 6, 1, 1, 9, 2, 8, 6, 1, 7, 5, 2, 5, 4, 4, 2, 1, 3, 9, 2, 4, 5, 0, 3, 1, 7, 7, 8, 7, 9, 7, 1, 9, 2, 1, 9, 2, 9, 2, 0, 4, 9, 1, 8, 8, 1, 1, 6, 5, 9, 1, 9, 3, 7, 6, 0, 0, 3, 1, 8, 0, 6, 9, 8, 3, 3, 8, 1, 3, 9, 1, 1, 6, 8, 0, 9, 6, 6, 6, 7, 8, 8, 2, 8, 5, 8, 9, 6, 1, 8, 4, 1, 2, 6, 9, 1, 9, 7, 1, 9, 0, 8, 9, 9, 1, 0, 5, 2, 3, 7, 6, 9, 1, 8, 1]
y_r60000 = [7, 2, 1, 0, 4, 1, 9, 9, 0, 9, 0, 6, 9, 0, 1, 8, 9, 7, 3, 4, 9, 6, 6, 5, 4, 0, 7, 4, 0, 1, 3, 1, 3, 0, 7, 2, 7, 1, 2, 1, 1, 7, 4, 2, 1, 5, 1, 1, 9, 4, 6, 3, 5, 0, 6, 0, 4, 1, 9, 1, 7, 8, 4, 3, 7, 1, 6, 4, 3, 0, 7, 0, 2, 9, 1, 7, 1, 7, 9, 7, 9, 6, 2, 7, 8, 4, 7, 8, 6, 1, 3, 6, 1, 3, 1, 4, 1, 7, 6, 9, 6, 0, 5, 4, 9, 9, 2, 1, 9, 9, 8, 1, 1, 9, 1, 9, 9, 4, 9, 8, 8, 6, 7, 6, 7, 4, 0, 5, 8, 5, 6, 6, 3, 7, 8, 1, 0, 1, 6, 9, 6, 7, 3, 1, 7, 1, 8, 2, 0, 1, 9, 8, 5, 3, 1, 5, 6, 0, 3, 1, 8, 6, 5, 4, 6, 5, 4, 5, 1, 4, 9, 7, 2, 1, 2, 1, 1, 8, 1, 8, 1, 8, 1, 0, 1, 9, 2, 3, 0, 1, 1, 1, 0, 9, 0, 1, 1, 6, 4, 2, 3, 6, 1, 1, 1, 1, 9, 5, 2, 9, 4, 5, 9, 1, 9, 0, 3, 6, 5, 5, 7, 2, 2, 7, 1, 2, 8, 1, 1, 7, 3, 1, 8, 8, 7, 9, 2, 2, 4, 1, 5, 8, 8, 7, 1, 2, 0, 2, 4, 1, 9, 1, 9, 5, 7, 1, 2, 1, 2, 6, 8, 5, 7, 7, 1, 1, 8, 1, 8, 0, 3, 0, 1, 9, 9, 9, 1, 8, 2, 1, 2, 9, 1, 5, 9, 2, 6, 4, 1, 8, 9, 2, 9, 1, 0, 4, 0, 0, 2, 8, 1, 7, 1, 7, 9, 0, 2, 1, 8, 1, 3, 0, 0, 3, 1, 9, 1, 5, 2, 8, 1, 7, 9, 3, 0, 9, 2, 0, 7, 1, 1, 2, 1, 8, 3, 1, 9, 7, 8, 6, 6, 1, 1, 3, 8, 1, 0, 5, 1, 3, 1, 5, 0, 6, 1, 8, 5, 1, 8, 4, 4, 6, 8, 2, 5, 0, 6, 5, 6, 3, 7, 2, 0, 8, 8, 5, 4, 1, 1, 4, 0, 7, 3, 7, 6, 1, 6, 1, 1, 9, 2, 8, 6, 1, 9, 5, 2, 5, 4, 4, 2, 1, 3, 8, 7, 4, 5, 0, 3, 1, 7, 7, 8, 7, 9, 7, 1, 9, 2, 1, 1, 2, 9, 2, 0, 4, 9, 1, 4, 8, 1, 8, 1, 5, 9, 8, 8, 3, 7, 6, 0, 0, 3, 1, 8, 0, 6, 9, 8, 3, 3, 3, 2, 3, 9, 1, 1, 6, 8, 0, 9, 6, 6, 6, 7, 8, 8, 2, 7, 8, 8, 9, 6, 1, 8, 4, 1, 2, 1, 8, 1, 9, 7, 1, 4, 0, 8, 9, 9, 1, 0, 5, 2, 3, 7, 6, 9, 4, 0, 1]

y_r1000_eu = [7, 2, 1, 0, 0, 1, 4, 9, 8, 9, 0, 6, 9, 0, 1, 1, 9, 7, 3, 4, 9, 6, 6, 5, 1, 0, 7, 9, 0, 1, 3, 1, 3, 6, 7, 2, 7, 1, 1, 1, 1, 7, 9, 1, 1, 5, 1, 1, 9, 4, 6, 3, 5, 1, 0, 0, 4, 1, 9, 1, 7, 8, 9, 1, 7, 1, 7, 4, 3, 0, 7, 0, 2, 7, 1, 7, 1, 7, 1, 7, 9, 6, 2, 7, 8, 4, 7, 5, 6, 1, 3, 6, 1, 3, 1, 9, 1, 7, 6, 9, 1, 0, 5, 4, 9, 9, 2, 1, 9, 4, 8, 1, 1, 9, 1, 1, 9, 4, 7, 7, 5, 6, 7, 6, 7, 9, 0, 5, 8, 5, 6, 6, 5, 7, 8, 1, 0, 1, 6, 7, 6, 7, 3, 1, 9, 1, 8, 2, 0, 1, 9, 9, 9, 5, 1, 5, 6, 0, 3, 1, 4, 6, 5, 4, 6, 5, 4, 1, 1, 9, 9, 7, 3, 1, 1, 1, 1, 8, 1, 8, 1, 8, 1, 0, 1, 6, 2, 5, 0, 1, 1, 1, 0, 7, 0, 1, 1, 6, 9, 2, 3, 6, 1, 1, 1, 1, 9, 3, 2, 9, 7, 1, 9, 1, 9, 0, 3, 8, 7, 5, 9, 0, 2, 7, 1, 3, 8, 1, 1, 1, 5, 1, 8, 7, 7, 9, 2, 1, 4, 1, 5, 3, 8, 7, 1, 1, 0, 6, 4, 1, 9, 1, 9, 1, 7, 1, 1, 1, 2, 0, 8, 1, 7, 7, 1, 1, 0, 1, 3, 0, 3, 0, 1, 9, 9, 9, 1, 8, 2, 1, 2, 9, 1, 1, 9, 2, 6, 4, 1, 1, 8, 2, 9, 1, 0, 9, 0, 0, 2, 8, 1, 7, 1, 1, 9, 0, 1, 1, 4, 1, 3, 0, 0, 5, 1, 9, 1, 5, 0, 6, 1, 1, 9, 1, 6, 9, 6, 0, 7, 1, 1, 1, 1, 3, 3, 1, 9, 7, 0, 6, 5, 1, 1, 3, 8, 1, 0, 5, 1, 1, 1, 5, 0, 6, 1, 8, 5, 1, 1, 9, 4, 6, 7, 1, 5, 0, 6, 5, 6, 1, 7, 2, 0, 8, 8, 5, 9, 1, 1, 4, 0, 7, 3, 7, 6, 1, 6, 1, 1, 7, 2, 8, 6, 1, 7, 5, 1, 5, 4, 4, 2, 1, 1, 1, 1, 4, 5, 0, 5, 1, 7, 7, 8, 7, 9, 9, 1, 9, 2, 1, 1, 2, 9, 2, 0, 9, 9, 1, 4, 1, 1, 1, 6, 4, 9, 8, 9, 3, 7, 6, 0, 0, 3, 1, 8, 0, 6, 1, 9, 5, 3, 3, 1, 3, 9, 1, 1, 6, 9, 0, 9, 6, 6, 6, 7, 8, 8, 2, 8, 8, 8, 7, 6, 1, 8, 4, 1, 2, 1, 3, 1, 9, 7, 1, 9, 0, 8, 9, 9, 1, 6, 5, 2, 3, 7, 6, 9, 1, 0, 1]
y_r1000_ma = [7, 2, 1, 0, 9, 1, 4, 9, 9, 7, 0, 6, 9, 0, 1, 1, 9, 7, 3, 4, 9, 6, 6, 1, 1, 0, 7, 9, 0, 1, 3, 1, 3, 6, 7, 2, 7, 1, 1, 1, 1, 7, 9, 1, 1, 6, 1, 1, 9, 4, 6, 3, 5, 1, 0, 0, 4, 1, 9, 1, 7, 8, 1, 1, 7, 1, 1, 4, 3, 0, 7, 0, 3, 7, 1, 7, 1, 7, 1, 7, 9, 6, 2, 7, 1, 4, 7, 3, 6, 1, 3, 6, 1, 3, 1, 9, 1, 1, 6, 9, 1, 0, 5, 4, 9, 9, 2, 1, 9, 9, 1, 1, 1, 9, 1, 1, 1, 4, 7, 7, 5, 6, 1, 6, 7, 1, 0, 5, 8, 1, 6, 6, 5, 7, 8, 1, 0, 1, 6, 7, 1, 7, 3, 1, 7, 1, 9, 2, 0, 1, 9, 9, 1, 5, 1, 5, 6, 0, 3, 1, 4, 6, 5, 4, 6, 5, 4, 1, 1, 9, 9, 7, 3, 1, 1, 1, 1, 6, 1, 8, 1, 1, 1, 0, 1, 9, 2, 5, 0, 1, 1, 1, 0, 1, 0, 1, 1, 6, 9, 2, 0, 6, 1, 1, 1, 1, 9, 3, 1, 9, 7, 1, 9, 1, 9, 0, 3, 1, 7, 5, 9, 0, 2, 7, 1, 3, 8, 1, 1, 1, 5, 1, 6, 7, 1, 9, 2, 1, 4, 1, 5, 3, 8, 7, 1, 1, 0, 6, 4, 1, 9, 1, 9, 1, 7, 1, 1, 1, 2, 0, 8, 1, 7, 7, 1, 1, 0, 1, 3, 0, 3, 0, 1, 9, 9, 9, 1, 8, 2, 1, 2, 9, 1, 1, 9, 2, 6, 4, 1, 1, 9, 2, 9, 1, 0, 9, 0, 0, 2, 8, 1, 7, 1, 1, 7, 0, 1, 1, 4, 1, 1, 0, 0, 1, 1, 9, 1, 1, 0, 6, 1, 1, 9, 1, 6, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 9, 7, 5, 6, 1, 1, 1, 3, 8, 1, 0, 5, 1, 1, 1, 5, 0, 6, 1, 1, 5, 1, 1, 9, 4, 6, 7, 1, 5, 0, 1, 1, 6, 1, 7, 1, 1, 8, 1, 5, 9, 1, 1, 4, 0, 1, 3, 7, 6, 1, 6, 1, 1, 7, 2, 8, 6, 1, 7, 5, 1, 1, 4, 4, 2, 1, 1, 1, 1, 4, 5, 0, 5, 1, 7, 1, 8, 7, 9, 9, 1, 9, 1, 1, 1, 2, 9, 2, 0, 4, 9, 1, 1, 1, 1, 1, 1, 4, 9, 1, 1, 3, 7, 6, 0, 0, 3, 1, 1, 0, 6, 1, 9, 5, 3, 3, 1, 1, 9, 1, 1, 6, 9, 0, 9, 6, 6, 6, 7, 8, 8, 2, 7, 8, 1, 9, 6, 1, 8, 4, 1, 2, 1, 3, 1, 9, 7, 1, 4, 0, 7, 9, 9, 1, 6, 6, 2, 3, 7, 6, 9, 1, 0, 1]
y_r1000_ch = [4, 2, 1, 2, 9, 0, 7, 1, 4, 5, 0, 6, 6, 0, 5, 0, 0, 3, 4, 1, 5, 6, 5, 4, 4, 4, 2, 5, 3, 1, 0, 9, 2, 1, 1, 6, 5, 3, 1, 3, 1, 5, 3, 3, 9, 0, 9, 5, 5, 7, 6, 0, 0, 0, 4, 1, 2, 1, 4, 8, 5, 2, 3, 7, 4, 5, 7, 4, 5, 0, 2, 0, 0, 5, 0, 5, 3, 4, 0, 2, 9, 8, 9, 5, 4, 2, 0, 0, 5, 1, 9, 6, 1, 3, 3, 5, 9, 3, 9, 6, 1, 0, 9, 0, 1, 5, 4, 2, 5, 5, 4, 0, 2, 0, 9, 4, 4, 1, 4, 5, 0, 2, 0, 4, 3, 5, 9, 4, 6, 0, 5, 5, 5, 3, 0, 1, 6, 9, 5, 4, 6, 3, 3, 1, 3, 6, 6, 6, 0, 8, 5, 1, 4, 5, 1, 4, 5, 8, 0, 9, 5, 2, 0, 0, 5, 6, 5, 3, 1, 9, 4, 6, 3, 0, 0, 0, 2, 0, 2, 5, 0, 8, 6, 5, 0, 5, 0, 1, 6, 5, 4, 1, 2, 2, 0, 3, 1, 5, 5, 6, 5, 5, 9, 3, 1, 0, 0, 2, 2, 9, 4, 1, 5, 3, 0, 0, 9, 0, 3, 4, 7, 0, 0, 9, 8, 0, 4, 8, 1, 2, 0, 9, 5, 1, 0, 2, 5, 6, 1, 0, 5, 1, 8, 1, 2, 0, 6, 1, 9, 9, 0, 3, 6, 4, 1, 0, 0, 4, 5, 2, 2, 0, 7, 7, 1, 8, 5, 2, 4, 0, 0, 5, 1, 6, 4, 2, 0, 0, 5, 0, 0, 5, 2, 0, 5, 0, 6, 5, 5, 0, 6, 6, 4, 2, 1, 5, 0, 0, 0, 0, 1, 3, 8, 2, 4, 0, 2, 0, 1, 3, 5, 5, 6, 2, 5, 8, 1, 0, 3, 1, 0, 2, 5, 9, 5, 5, 4, 5, 6, 4, 0, 3, 1, 0, 6, 0, 1, 2, 2, 6, 6, 0, 2, 0, 1, 2, 0, 8, 3, 0, 0, 0, 1, 2, 8, 6, 7, 1, 5, 1, 1, 5, 4, 0, 0, 1, 5, 5, 6, 2, 5, 5, 0, 4, 5, 0, 0, 8, 3, 4, 4, 2, 5, 7, 5, 1, 6, 2, 2, 3, 3, 3, 6, 1, 9, 3, 1, 4, 0, 7, 6, 3, 0, 5, 9, 5, 8, 5, 1, 9, 1, 5, 5, 9, 5, 1, 3, 2, 6, 6, 8, 5, 3, 3, 2, 5, 0, 1, 9, 0, 1, 2, 3, 2, 1, 0, 0, 3, 7, 3, 3, 5, 3, 3, 9, 0, 5, 1, 3, 0, 5, 5, 6, 1, 2, 1, 5, 5, 1, 5, 0, 5, 5, 0, 1, 5, 2, 0, 5, 7, 3, 0, 9, 1, 5, 3, 1, 5, 1, 1, 3, 3, 4, 8, 0, 1, 0, 7, 0, 1, 0, 3, 5, 6, 3, 5, 2, 7, 1, 4]
y_r1000_mi = [1, 1, 1, 1, 6, 1, 5, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 5, 8, 6, 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 6, 7, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 1, 1, 1, 1, 8, 5, 8, 8, 1, 1, 5, 1, 1, 1, 1, 8, 1, 8, 1, 1, 6, 1, 1, 1, 1, 0, 1, 1, 1, 7, 1, 1, 1, 1, 1, 1, 8, 0, 8, 1, 1, 1, 1, 1, 8, 1, 1, 1, 1, 8, 1, 1, 1, 8, 5, 1, 1, 8, 1, 1, 0, 1, 1, 1, 8, 1, 1, 5, 1, 1, 8, 6, 1, 7, 1, 1, 1, 1, 1, 1, 6, 1, 1, 8, 6, 1, 1, 1, 8, 1, 1, 1, 6, 1, 1, 1, 6, 1, 1, 1, 8, 1, 0, 1, 1, 7, 1, 5, 1, 1, 1, 1, 1, 1, 1, 6, 1, 5, 1, 1, 0, 8, 1, 1, 1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 6, 5, 1, 1, 1, 2, 5, 6, 1, 1, 8, 5, 1, 0, 1, 1, 1, 1, 1, 1, 1, 6, 1, 6, 1, 1, 1, 9, 6, 1, 1, 0, 1, 1, 5, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 5, 5, 1, 1, 1, 1, 5, 1, 1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 8, 1, 1, 6, 1, 1, 8, 5, 8, 1, 8, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 8, 1, 5, 1, 1, 6, 1, 8, 1, 6, 1, 6, 1, 1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 1, 1, 5, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 6, 1, 1, 1, 8, 1, 1, 1, 1, 5, 1, 5, 8, 1, 1, 1, 5, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 6, 8, 1, 1, 1, 1, 1, 1, 5, 1, 1, 4, 0, 5, 5, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 8, 8, 1, 1, 4, 1, 1, 1, 1, 1, 1, 5, 1, 8, 5, 1, 6, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 1, 1, 5, 1, 1, 1, 8, 1, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 8, 6, 1, 8, 8, 1, 1, 5, 1, 6, 8, 1, 1, 5, 1, 5, 1, 1, 1, 1, 1, 1, 5, 0, 1, 1, 1, 1, 5, 5, 1, 1, 1, 1, 1, 1, 1, 1]

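Call the validation function to compare the accuracy of the different settings. The call below checks the Minkowski run; pass any of the other lists as the first argument to check that run instead.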

vali(y_r1000_mi,y_test[:500])

4. Reflection

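[Figure: 1,000 training samples, Euclidean distance]
[Figure: 10,000 training samples, Euclidean distance]
[Figure: 60,000 training samples, Euclidean distance]
[Figure: 1,000 training samples, Euclidean distance]
[Figure: 1,000 training samples, Manhattan distance]
[Figure: 1,000 training samples, Chebyshev distance]
[Figure: 1,000 training samples, Minkowski distance]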
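In these runs, Euclidean distance gave the best results of the four metrics for kNN digit recognition. One caveat: the Minkowski function used here has p = 2 and should in principle match the Euclidean one, so the gap between those two runs most likely reflects uint8 overflow in the run that produced the lists above rather than a real difference between the metrics. The author also estimated that, on the author's machine, classifying all 10,000 test samples against the full 60,000 training samples would take about 1.4 hours. This was not verified for lack of time, but such a running time would still be acceptable.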
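Most of that running time comes from computing one distance at a time in a Python loop. A common way to speed up the Euclidean case is to compute all squared distances at once from the identity ‖a − b‖² = ‖a‖² + ‖b‖² − 2a·b. The sketch below is one possible approach, not something measured in this post; the function names and the tiny arrays are invented for the example.

```python
import numpy as np

def pairwise_sq_euclidean(A, B):
    """Squared Euclidean distance between every row of A and every row of B."""
    A = np.asarray(A, dtype=np.float64)   # cast first so uint8 pixels cannot overflow
    B = np.asarray(B, dtype=np.float64)
    a2 = np.sum(A * A, axis=1)[:, None]   # column of squared norms of A's rows
    b2 = np.sum(B * B, axis=1)[None, :]   # row of squared norms of B's rows
    d2 = a2 + b2 - 2.0 * (A @ B.T)
    return np.maximum(d2, 0.0)            # clip tiny negatives caused by rounding

def knn_indices(test_x, train_x, k):
    """Indices of the k nearest training rows for each test row."""
    d2 = pairwise_sq_euclidean(test_x, train_x)
    return np.argsort(d2, axis=1, kind="stable")[:, :k]

train = np.array([[0, 0], [3, 4], [6, 8]], dtype=np.uint8)
test = np.array([[0, 1], [6, 7]], dtype=np.uint8)

d2 = pairwise_sq_euclidean(test, train)
nearest = knn_indices(test, train, k=2)
print(d2)
print(nearest)
```

Squared distances preserve the neighbor ordering, so the square root can be skipped. For the full 10,000 by 60,000 case the distance matrix alone would take about 4.8 GB in float64, so the test set would need to be processed in smaller batches.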

This approach still has room for improvement. Comments and discussion are welcome.
