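I. Problem
OCR (optical character recognition) is one of the major applications of machine learning. A typical pipeline runs through binarization, denoising, skew correction, feature extraction, character segmentation, character recognition, and post-processing. Character segmentation is the hardest step, and character recognition is the most critical one. Common approaches to character recognition include kNN, SVM, and CNN, and of these SVM is fairly convenient to use. Here I implement the comparatively simple kNN (k-nearest neighbors) algorithm to recognize the digits in the classic MNIST dataset, which has 60,000 training samples and 10,000 test samples.
II. Principle
The core idea of kNN: if most of the k nearest neighbors of a sample in feature space belong to one class, the sample is assigned to that class as well and is assumed to share the properties of that class. Generally speaking, kNN suits sample sets whose class regions cross or overlap heavily better than other methods do.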
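The majority-vote idea can be illustrated on a toy problem before turning to images. This is a minimal sketch in pure Python; the function name knn_predict and the two-cluster data are made up for illustration and are not part of the MNIST code in this post.

```python
from collections import Counter
import math

def knn_predict(query, samples, labels, k):
    """Return the majority label among the k samples nearest to query."""
    # Pair each Euclidean distance with its sample index, then sort by distance.
    dists = sorted((math.dist(query, s), i) for i, s in enumerate(samples))
    nearest_labels = [labels[i] for _, i in dists[:k]]
    # The most common label among the k neighbors wins the vote.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters in the plane.
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict((0.5, 0.5), samples, labels, k=3))  # a
print(knn_predict((5.5, 5.5), samples, labels, k=3))  # b
```

The MNIST version works the same way, except that each sample is a 784-dimensional pixel vector and the labels are the digits 0 to 9.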
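III. Solution
① Load the data and look at a sample image to settle on the basic approach
② Define the kNN function
③ Call the kNN function
④ Compare how the training-set size and the distance metric affect accuracy and running time
The first block handles routine work: loading the dataset, splitting it into training and test sets, and displaying a sample image. Readers need not spend much time on it and can skip ahead to the next code block.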
%matplotlib inline
import os
import numpy as np
from PIL import Image
from matplotlib import pyplot as plt

DATASET_PATH = r'D:\文件路径'  # placeholder: replace with the directory that holds mnist.npz
DATASET_FILE = os.path.join(DATASET_PATH, 'mnist.npz')
f = np.load(DATASET_FILE)
x_train, y_train = f['x_train'], f['y_train']
x_test, y_test = f['x_test'], f['y_test']

# uint8 is an unsigned 8-bit integer with range 0~255
def img_show(img):
    plt.imshow(Image.fromarray(np.uint8(img)))
    plt.axis('on')  # pass 'off' to hide the axes
    plt.title('image')  # figure title
    plt.show()

img = x_train[0]  # first training image
print(img.shape)  # (28, 28) if each image is stored as a 2-D array, (784,) if stored flat
img = img.reshape(28, 28)  # make sure the image has its original 28x28 shape
img_show(img)
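In practice one distance metric is enough, and the most common choice is the Euclidean distance. Four are defined here (Euclidean, Manhattan, Chebyshev, and Minkowski) so that they can be compared. As the results later show, the Euclidean distance worked best. Note that the Minkowski function below fixes p = 2, which makes it mathematically identical to the Euclidean distance.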
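def _diff(x, y):
    # Cast to float before subtracting: MNIST pixels are uint8, and uint8 arithmetic wraps around modulo 256.
    return np.asarray(x, dtype=np.float64) - np.asarray(y, dtype=np.float64)

def euclidean_dist(x, y):
    return np.linalg.norm(_diff(x, y))

def manhattan_dist(x, y):
    return np.sum(np.abs(_diff(x, y)))

def chebyshev_dist(x, y):
    return np.max(np.abs(_diff(x, y)))

def minkowski_dist(x, y):
    # Minkowski distance with p = 2, which coincides with the Euclidean distance
    return np.sqrt(np.sum(np.square(_diff(x, y))))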
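One thing worth checking when these metrics are applied to MNIST is the pixel type. The images are stored as uint8, and NumPy subtracts uint8 arrays modulo 256 without raising an error, which distorts every distance computed from the raw difference. The sketch below uses two made-up three-pixel vectors (a and b are illustrative values, not MNIST data) to show the wrapped difference next to the exact one, and then the three distinct metrics on the exact difference.

```python
import numpy as np

# Hypothetical three-pixel "images" stored as uint8, like MNIST pixels.
a = np.array([10, 200, 0], dtype=np.uint8)
b = np.array([20, 100, 3], dtype=np.uint8)

# Subtracting uint8 arrays wraps around modulo 256.
wrapped = a - b
print(wrapped)  # [246 100 253]

# Casting to a wider signed type first gives the true differences.
exact = a.astype(np.int64) - b.astype(np.int64)
print(exact)  # [-10 100  -3]

manhattan = np.sum(np.abs(exact))              # 10 + 100 + 3 = 113
chebyshev = np.max(np.abs(exact))              # largest single difference = 100
euclidean = np.sqrt(np.sum(np.square(exact)))  # sqrt(100 + 10000 + 9)
print(manhattan, chebyshev, euclidean)
```

With the wrapped difference, the Manhattan distance would come out as 246 + 100 + 253 = 599 instead of 113, so nearest neighbors found that way can differ from the true ones.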
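Next comes the definition of the kNN function. It takes four parameters:
x is the feature vector of the image to be labeled, that is, its pixel matrix flattened into a vector;
M is the sample matrix, holding the flattened training images that the classifier searches;
k is the number of neighbors of the target point;
dtype selects the distance metric; its four options 0, 1, 2, 3 correspond to the four metrics above, in order.
The function returns the indices of the k samples in M nearest to x.
Note that both x and M have been flattened beforehand. Both must also be arrays before NumPy can be used to compute distances, so whatever type is passed in (in practice a list), it is first converted to an array.
Also note that M[:10000] below means only the first 10,000 training samples passed in are used. This speeds up the search at the cost of some accuracy, and the slice can be changed in practice.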
def KNN(x, M, k, dtype):
    # Cast to float so that uint8 pixel values cannot wrap around when subtracted
    x = np.asarray(x, dtype=np.float64)
    M = np.asarray(M[:10000], dtype=np.float64)  # the training-set size can be adjusted here
    # dtype: 0 = Euclidean, 1 = Manhattan, 2 = Chebyshev, 3 = Minkowski
    dist_funcs = [euclidean_dist, manhattan_dist, chebyshev_dist, minkowski_dist]
    dist_func = dist_funcs[dtype]
    dist = [dist_func(x, a) for a in M]
    # argsort returns k distinct indices even when several samples are equally distant
    # (looking distances up with list.index would repeat the first match)
    idx = np.argsort(dist, kind='stable')[:k]
    return idx.tolist()
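The per-sample Python loop is the main cost of this function. As a point of comparison, here is a minimal vectorized sketch that computes all Euclidean distances in one NumPy expression. The name knn_indices_vectorized and the toy matrix M_demo are assumptions made for illustration; the sketch covers only the Euclidean case and is not the function used for the results in this post.

```python
import numpy as np

def knn_indices_vectorized(x, M, k):
    """Indices of the k rows of M closest to x (Euclidean), nearest first."""
    # Cast to float so uint8 inputs cannot wrap around during subtraction.
    x = np.asarray(x, dtype=np.float64)
    M = np.asarray(M, dtype=np.float64)
    # Broadcasting subtracts x from every row of M at once.
    dist = np.sqrt(np.sum((M - x) ** 2, axis=1))
    # A stable sort keeps the lower index first when distances tie.
    return np.argsort(dist, kind="stable")[:k]

# Hypothetical toy data: four 2-D samples.
M_demo = np.array([[0, 0], [3, 4], [1, 0], [10, 10]])
print(knn_indices_vectorized([0, 0], M_demo, k=2))  # [0 2]
```

The distances from the origin are 0, 5, 1, and about 14.1, so the two nearest rows are 0 and 2.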
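Define the lookup function. It takes the indices of the target point's k neighbors and returns the digit those k neighbors vote for.
from collections import Counter

def find(x_result):
    y_result = y_train[x_result]  # labels of the k neighbors
    res0 = Counter(y_result).most_common(1)  # majority vote
    res = int(res0[0][0])  # convert the NumPy scalar to a plain Python int
    print("The digit is " + str(res))
    return res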
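Define the validation function. A prediction that matches the true label counts as correct, and the accuracy equals the number of correct predictions divided by the total number checked.
def vali(result, r_result):
    r_con = 0  # number of correct predictions
    c_con = 0  # number of predictions checked
    for i in range(len(result)):
        if result[i] == r_result[i]:
            r_con += 1
        c_con += 1
    print(str(r_con) + " results are correct, accuracy " + str(r_con / c_con))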
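The same accuracy can be computed without an explicit loop. A minimal sketch, assuming predictions and labels are equal-length sequences; the function name accuracy is hypothetical and, unlike vali, it returns the value instead of printing it.

```python
import numpy as np

def accuracy(pred, truth):
    """Fraction of positions where pred and truth agree."""
    pred = np.asarray(pred)
    truth = np.asarray(truth)
    if pred.shape != truth.shape:
        raise ValueError("pred and truth must have the same length")
    # Comparing elementwise gives booleans; their mean is the share of matches.
    return float(np.mean(pred == truth))

print(accuracy([7, 2, 1, 0], [7, 2, 1, 4]))  # 0.75
```

Returning the number makes it easy to collect the accuracy of several runs in a list or table for comparison.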
This step flattens each image matrix in x_train into a vector.
new_x_train = []
for i in x_train:
    new_x_train.append(np.ravel(i))
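If the images are held in a single NumPy array, the flattening can also be done with one reshape call. The sketch below uses a small made-up array (images, standing in for x_train) to check that both approaches produce the same rows.

```python
import numpy as np

# Hypothetical stand-in for x_train: two 3x3 "images".
images = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)

# Loop version: flatten each image separately.
flat_loop = np.array([np.ravel(img) for img in images])

# Reshape version: keep the first axis, merge all remaining axes into one.
flat_reshape = images.reshape(len(images), -1)

print(flat_reshape.shape)                        # (2, 9)
print(np.array_equal(flat_loop, flat_reshape))   # True
```

The reshape version yields a 2-D array rather than a list of vectors, which also suits vectorized distance computations.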
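This step calls the kNN function and prints the results.
Note that the dtype argument passed here is 3, which selects the Minkowski distance. This argument can be changed in practice.
y_result = []
for i in x_test[:500]:
    x_result = KNN(np.ravel(i), new_x_train, 11, 3)
    y_result.append(find(x_result))
print(y_result)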
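Two adjustable parameters were mentioned above. Varying each in turn gives the predictions (y_result) obtained with the first 1,000, 10,000, and 60,000 training samples under the Euclidean distance, and the predictions obtained with 1,000 training samples under each of the four distance metrics.
The results obtained above are copied by hand into the lists below so that their accuracy can be compared.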
y_r1000 = [7, 2, 1, 0, 0, 1, 4, 9, 8, 9, 0, 6, 9, 0, 1, 1, 9, 7, 3, 4, 9, 6, 6, 5, 1, 0, 7, 9, 0, 1, 3, 1, 3, 6, 7, 2, 7, 1, 1, 1, 1, 7, 9, 1, 1, 5, 1, 1, 9, 4, 6, 3, 5, 1, 0, 0, 4, 1, 9, 1, 7, 8, 9, 1, 7, 1, 7, 4, 3, 0, 7, 0, 2, 7, 1, 7, 1, 7, 1, 7, 9, 6, 2, 7, 8, 4, 7, 5, 6, 1, 3, 6, 1, 3, 1, 9, 1, 7, 6, 9, 1, 0, 5, 4, 9, 9, 2, 1, 9, 4, 8, 1, 1, 9, 1, 1, 9, 4, 7, 7, 5, 6, 7, 6, 7, 9, 0, 5, 8, 5, 6, 6, 5, 7, 8, 1, 0, 1, 6, 7, 6, 7, 3, 1, 9, 1, 8, 2, 0, 1, 9, 9, 9, 5, 1, 5, 6, 0, 3, 1, 4, 6, 5, 4, 6, 5, 4, 1, 1, 9, 9, 7, 3, 1, 1, 1, 1, 8, 1, 8, 1, 8, 1, 0, 1, 6, 2, 5, 0, 1, 1, 1, 0, 7, 0, 1, 1, 6, 9, 2, 3, 6, 1, 1, 1, 1, 9, 3, 2, 9, 7, 1, 9, 1, 9, 0, 3, 8, 7, 5, 9, 0, 2, 7, 1, 3, 8, 1, 1, 1, 5, 1, 8, 7, 7, 9, 2, 1, 4, 1, 5, 3, 8, 7, 1, 1, 0, 6, 4, 1, 9, 1, 9, 1, 7, 1, 1, 1, 2, 0, 8, 1, 7, 7, 1, 1, 0, 1, 3, 0, 3, 0, 1, 9, 9, 9, 1, 8, 2, 1, 2, 9, 1, 1, 9, 2, 6, 4, 1, 1, 8, 2, 9, 1, 0, 9, 0, 0, 2, 8, 1, 7, 1, 1, 9, 0, 1, 1, 4, 1, 3, 0, 0, 5, 1, 9, 1, 5, 0, 6, 1, 1, 9, 1, 6, 9, 6, 0, 7, 1, 1, 1, 1, 3, 3, 1, 9, 7, 0, 6, 5, 1, 1, 3, 8, 1, 0, 5, 1, 1, 1, 5, 0, 6, 1, 8, 5, 1, 1, 9, 4, 6, 7, 1, 5, 0, 6, 5, 6, 1, 7, 2, 0, 8, 8, 5, 9, 1, 1, 4, 0, 7, 3, 7, 6, 1, 6, 1, 1, 7, 2, 8, 6, 1, 7, 5, 1, 5, 4, 4, 2, 1, 1, 1, 1, 4, 5, 0, 5, 1, 7, 7, 8, 7, 9, 9, 1, 9, 2, 1, 1, 2, 9, 2, 0, 9, 9, 1, 4, 1, 1, 1, 6, 4, 9, 8, 9, 3, 7, 6, 0, 0, 3, 1, 8, 0, 6, 1, 9, 5, 3, 3, 1, 3, 9, 1, 1, 6, 9, 0, 9, 6, 6, 6, 7, 8, 8, 2, 8, 8, 8, 7, 6, 1, 8, 4, 1, 2, 1, 3, 1, 9, 7, 1, 9, 0, 8, 9, 9, 1, 6, 5, 2, 3, 7, 6, 9, 1, 0, 1]
y_r10000 = [7, 2, 1, 0, 4, 1, 9, 9, 4, 9, 0, 6, 9, 0, 1, 5, 9, 7, 3, 4, 9, 6, 6, 5, 4, 0, 7, 4, 0, 1, 3, 1, 3, 6, 7, 2, 7, 1, 1, 1, 1, 7, 4, 2, 1, 5, 1, 1, 9, 4, 6, 3, 5, 1, 6, 0, 4, 1, 9, 1, 7, 8, 1, 1, 7, 1, 6, 4, 3, 0, 7, 0, 2, 9, 1, 7, 1, 7, 9, 7, 9, 6, 2, 7, 8, 4, 7, 3, 6, 1, 3, 6, 1, 3, 1, 4, 1, 7, 6, 9, 6, 0, 5, 4, 9, 9, 2, 1, 9, 9, 8, 1, 1, 9, 7, 1, 1, 4, 9, 7, 8, 6, 1, 6, 7, 9, 0, 5, 8, 5, 6, 6, 8, 7, 8, 1, 0, 1, 6, 9, 6, 7, 3, 1, 7, 1, 8, 2, 0, 1, 9, 8, 5, 8, 1, 5, 6, 0, 3, 1, 4, 6, 5, 4, 6, 5, 4, 1, 1, 4, 9, 7, 3, 1, 2, 1, 1, 8, 1, 8, 1, 8, 1, 0, 1, 9, 2, 3, 0, 1, 1, 1, 0, 9, 0, 1, 1, 6, 9, 2, 3, 6, 1, 1, 1, 3, 9, 8, 2, 9, 7, 5, 9, 1, 9, 0, 3, 6, 5, 5, 7, 2, 2, 7, 1, 3, 8, 1, 1, 1, 3, 1, 8, 7, 1, 9, 2, 1, 4, 1, 5, 8, 8, 7, 1, 6, 0, 6, 4, 1, 9, 1, 9, 5, 7, 1, 1, 1, 2, 6, 8, 1, 7, 7, 1, 1, 8, 1, 3, 0, 3, 0, 1, 9, 9, 9, 1, 8, 2, 1, 2, 9, 1, 1, 9, 2, 6, 9, 1, 5, 9, 2, 9, 2, 0, 9, 0, 0, 2, 8, 1, 7, 1, 1, 9, 0, 2, 9, 4, 1, 3, 0, 0, 3, 1, 9, 1, 5, 3, 5, 1, 7, 9, 1, 6, 9, 6, 0, 7, 1, 1, 2, 1, 5, 3, 1, 9, 7, 8, 6, 6, 1, 1, 3, 8, 1, 0, 5, 1, 3, 1, 8, 0, 6, 1, 8, 5, 1, 9, 9, 4, 6, 7, 2, 8, 0, 6, 5, 6, 1, 7, 2, 0, 8, 8, 5, 4, 1, 1, 4, 0, 7, 3, 7, 6, 1, 6, 1, 1, 9, 2, 8, 6, 1, 7, 5, 2, 5, 4, 4, 2, 1, 3, 9, 2, 4, 5, 0, 3, 1, 7, 7, 8, 7, 9, 7, 1, 9, 2, 1, 9, 2, 9, 2, 0, 4, 9, 1, 8, 8, 1, 1, 6, 5, 9, 1, 9, 3, 7, 6, 0, 0, 3, 1, 8, 0, 6, 9, 8, 3, 3, 8, 1, 3, 9, 1, 1, 6, 8, 0, 9, 6, 6, 6, 7, 8, 8, 2, 8, 5, 8, 9, 6, 1, 8, 4, 1, 2, 6, 9, 1, 9, 7, 1, 9, 0, 8, 9, 9, 1, 0, 5, 2, 3, 7, 6, 9, 1, 8, 1]
y_r60000 = [7, 2, 1, 0, 4, 1, 9, 9, 0, 9, 0, 6, 9, 0, 1, 8, 9, 7, 3, 4, 9, 6, 6, 5, 4, 0, 7, 4, 0, 1, 3, 1, 3, 0, 7, 2, 7, 1, 2, 1, 1, 7, 4, 2, 1, 5, 1, 1, 9, 4, 6, 3, 5, 0, 6, 0, 4, 1, 9, 1, 7, 8, 4, 3, 7, 1, 6, 4, 3, 0, 7, 0, 2, 9, 1, 7, 1, 7, 9, 7, 9, 6, 2, 7, 8, 4, 7, 8, 6, 1, 3, 6, 1, 3, 1, 4, 1, 7, 6, 9, 6, 0, 5, 4, 9, 9, 2, 1, 9, 9, 8, 1, 1, 9, 1, 9, 9, 4, 9, 8, 8, 6, 7, 6, 7, 4, 0, 5, 8, 5, 6, 6, 3, 7, 8, 1, 0, 1, 6, 9, 6, 7, 3, 1, 7, 1, 8, 2, 0, 1, 9, 8, 5, 3, 1, 5, 6, 0, 3, 1, 8, 6, 5, 4, 6, 5, 4, 5, 1, 4, 9, 7, 2, 1, 2, 1, 1, 8, 1, 8, 1, 8, 1, 0, 1, 9, 2, 3, 0, 1, 1, 1, 0, 9, 0, 1, 1, 6, 4, 2, 3, 6, 1, 1, 1, 1, 9, 5, 2, 9, 4, 5, 9, 1, 9, 0, 3, 6, 5, 5, 7, 2, 2, 7, 1, 2, 8, 1, 1, 7, 3, 1, 8, 8, 7, 9, 2, 2, 4, 1, 5, 8, 8, 7, 1, 2, 0, 2, 4, 1, 9, 1, 9, 5, 7, 1, 2, 1, 2, 6, 8, 5, 7, 7, 1, 1, 8, 1, 8, 0, 3, 0, 1, 9, 9, 9, 1, 8, 2, 1, 2, 9, 1, 5, 9, 2, 6, 4, 1, 8, 9, 2, 9, 1, 0, 4, 0, 0, 2, 8, 1, 7, 1, 7, 9, 0, 2, 1, 8, 1, 3, 0, 0, 3, 1, 9, 1, 5, 2, 8, 1, 7, 9, 3, 0, 9, 2, 0, 7, 1, 1, 2, 1, 8, 3, 1, 9, 7, 8, 6, 6, 1, 1, 3, 8, 1, 0, 5, 1, 3, 1, 5, 0, 6, 1, 8, 5, 1, 8, 4, 4, 6, 8, 2, 5, 0, 6, 5, 6, 3, 7, 2, 0, 8, 8, 5, 4, 1, 1, 4, 0, 7, 3, 7, 6, 1, 6, 1, 1, 9, 2, 8, 6, 1, 9, 5, 2, 5, 4, 4, 2, 1, 3, 8, 7, 4, 5, 0, 3, 1, 7, 7, 8, 7, 9, 7, 1, 9, 2, 1, 1, 2, 9, 2, 0, 4, 9, 1, 4, 8, 1, 8, 1, 5, 9, 8, 8, 3, 7, 6, 0, 0, 3, 1, 8, 0, 6, 9, 8, 3, 3, 3, 2, 3, 9, 1, 1, 6, 8, 0, 9, 6, 6, 6, 7, 8, 8, 2, 7, 8, 8, 9, 6, 1, 8, 4, 1, 2, 1, 8, 1, 9, 7, 1, 4, 0, 8, 9, 9, 1, 0, 5, 2, 3, 7, 6, 9, 4, 0, 1]
y_r1000_eu = [7, 2, 1, 0, 0, 1, 4, 9, 8, 9, 0, 6, 9, 0, 1, 1, 9, 7, 3, 4, 9, 6, 6, 5, 1, 0, 7, 9, 0, 1, 3, 1, 3, 6, 7, 2, 7, 1, 1, 1, 1, 7, 9, 1, 1, 5, 1, 1, 9, 4, 6, 3, 5, 1, 0, 0, 4, 1, 9, 1, 7, 8, 9, 1, 7, 1, 7, 4, 3, 0, 7, 0, 2, 7, 1, 7, 1, 7, 1, 7, 9, 6, 2, 7, 8, 4, 7, 5, 6, 1, 3, 6, 1, 3, 1, 9, 1, 7, 6, 9, 1, 0, 5, 4, 9, 9, 2, 1, 9, 4, 8, 1, 1, 9, 1, 1, 9, 4, 7, 7, 5, 6, 7, 6, 7, 9, 0, 5, 8, 5, 6, 6, 5, 7, 8, 1, 0, 1, 6, 7, 6, 7, 3, 1, 9, 1, 8, 2, 0, 1, 9, 9, 9, 5, 1, 5, 6, 0, 3, 1, 4, 6, 5, 4, 6, 5, 4, 1, 1, 9, 9, 7, 3, 1, 1, 1, 1, 8, 1, 8, 1, 8, 1, 0, 1, 6, 2, 5, 0, 1, 1, 1, 0, 7, 0, 1, 1, 6, 9, 2, 3, 6, 1, 1, 1, 1, 9, 3, 2, 9, 7, 1, 9, 1, 9, 0, 3, 8, 7, 5, 9, 0, 2, 7, 1, 3, 8, 1, 1, 1, 5, 1, 8, 7, 7, 9, 2, 1, 4, 1, 5, 3, 8, 7, 1, 1, 0, 6, 4, 1, 9, 1, 9, 1, 7, 1, 1, 1, 2, 0, 8, 1, 7, 7, 1, 1, 0, 1, 3, 0, 3, 0, 1, 9, 9, 9, 1, 8, 2, 1, 2, 9, 1, 1, 9, 2, 6, 4, 1, 1, 8, 2, 9, 1, 0, 9, 0, 0, 2, 8, 1, 7, 1, 1, 9, 0, 1, 1, 4, 1, 3, 0, 0, 5, 1, 9, 1, 5, 0, 6, 1, 1, 9, 1, 6, 9, 6, 0, 7, 1, 1, 1, 1, 3, 3, 1, 9, 7, 0, 6, 5, 1, 1, 3, 8, 1, 0, 5, 1, 1, 1, 5, 0, 6, 1, 8, 5, 1, 1, 9, 4, 6, 7, 1, 5, 0, 6, 5, 6, 1, 7, 2, 0, 8, 8, 5, 9, 1, 1, 4, 0, 7, 3, 7, 6, 1, 6, 1, 1, 7, 2, 8, 6, 1, 7, 5, 1, 5, 4, 4, 2, 1, 1, 1, 1, 4, 5, 0, 5, 1, 7, 7, 8, 7, 9, 9, 1, 9, 2, 1, 1, 2, 9, 2, 0, 9, 9, 1, 4, 1, 1, 1, 6, 4, 9, 8, 9, 3, 7, 6, 0, 0, 3, 1, 8, 0, 6, 1, 9, 5, 3, 3, 1, 3, 9, 1, 1, 6, 9, 0, 9, 6, 6, 6, 7, 8, 8, 2, 8, 8, 8, 7, 6, 1, 8, 4, 1, 2, 1, 3, 1, 9, 7, 1, 9, 0, 8, 9, 9, 1, 6, 5, 2, 3, 7, 6, 9, 1, 0, 1]
y_r1000_ma = [7, 2, 1, 0, 9, 1, 4, 9, 9, 7, 0, 6, 9, 0, 1, 1, 9, 7, 3, 4, 9, 6, 6, 1, 1, 0, 7, 9, 0, 1, 3, 1, 3, 6, 7, 2, 7, 1, 1, 1, 1, 7, 9, 1, 1, 6, 1, 1, 9, 4, 6, 3, 5, 1, 0, 0, 4, 1, 9, 1, 7, 8, 1, 1, 7, 1, 1, 4, 3, 0, 7, 0, 3, 7, 1, 7, 1, 7, 1, 7, 9, 6, 2, 7, 1, 4, 7, 3, 6, 1, 3, 6, 1, 3, 1, 9, 1, 1, 6, 9, 1, 0, 5, 4, 9, 9, 2, 1, 9, 9, 1, 1, 1, 9, 1, 1, 1, 4, 7, 7, 5, 6, 1, 6, 7, 1, 0, 5, 8, 1, 6, 6, 5, 7, 8, 1, 0, 1, 6, 7, 1, 7, 3, 1, 7, 1, 9, 2, 0, 1, 9, 9, 1, 5, 1, 5, 6, 0, 3, 1, 4, 6, 5, 4, 6, 5, 4, 1, 1, 9, 9, 7, 3, 1, 1, 1, 1, 6, 1, 8, 1, 1, 1, 0, 1, 9, 2, 5, 0, 1, 1, 1, 0, 1, 0, 1, 1, 6, 9, 2, 0, 6, 1, 1, 1, 1, 9, 3, 1, 9, 7, 1, 9, 1, 9, 0, 3, 1, 7, 5, 9, 0, 2, 7, 1, 3, 8, 1, 1, 1, 5, 1, 6, 7, 1, 9, 2, 1, 4, 1, 5, 3, 8, 7, 1, 1, 0, 6, 4, 1, 9, 1, 9, 1, 7, 1, 1, 1, 2, 0, 8, 1, 7, 7, 1, 1, 0, 1, 3, 0, 3, 0, 1, 9, 9, 9, 1, 8, 2, 1, 2, 9, 1, 1, 9, 2, 6, 4, 1, 1, 9, 2, 9, 1, 0, 9, 0, 0, 2, 8, 1, 7, 1, 1, 7, 0, 1, 1, 4, 1, 1, 0, 0, 1, 1, 9, 1, 1, 0, 6, 1, 1, 9, 1, 6, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 9, 7, 5, 6, 1, 1, 1, 3, 8, 1, 0, 5, 1, 1, 1, 5, 0, 6, 1, 1, 5, 1, 1, 9, 4, 6, 7, 1, 5, 0, 1, 1, 6, 1, 7, 1, 1, 8, 1, 5, 9, 1, 1, 4, 0, 1, 3, 7, 6, 1, 6, 1, 1, 7, 2, 8, 6, 1, 7, 5, 1, 1, 4, 4, 2, 1, 1, 1, 1, 4, 5, 0, 5, 1, 7, 1, 8, 7, 9, 9, 1, 9, 1, 1, 1, 2, 9, 2, 0, 4, 9, 1, 1, 1, 1, 1, 1, 4, 9, 1, 1, 3, 7, 6, 0, 0, 3, 1, 1, 0, 6, 1, 9, 5, 3, 3, 1, 1, 9, 1, 1, 6, 9, 0, 9, 6, 6, 6, 7, 8, 8, 2, 7, 8, 1, 9, 6, 1, 8, 4, 1, 2, 1, 3, 1, 9, 7, 1, 4, 0, 7, 9, 9, 1, 6, 6, 2, 3, 7, 6, 9, 1, 0, 1]
y_r1000_ch = [4, 2, 1, 2, 9, 0, 7, 1, 4, 5, 0, 6, 6, 0, 5, 0, 0, 3, 4, 1, 5, 6, 5, 4, 4, 4, 2, 5, 3, 1, 0, 9, 2, 1, 1, 6, 5, 3, 1, 3, 1, 5, 3, 3, 9, 0, 9, 5, 5, 7, 6, 0, 0, 0, 4, 1, 2, 1, 4, 8, 5, 2, 3, 7, 4, 5, 7, 4, 5, 0, 2, 0, 0, 5, 0, 5, 3, 4, 0, 2, 9, 8, 9, 5, 4, 2, 0, 0, 5, 1, 9, 6, 1, 3, 3, 5, 9, 3, 9, 6, 1, 0, 9, 0, 1, 5, 4, 2, 5, 5, 4, 0, 2, 0, 9, 4, 4, 1, 4, 5, 0, 2, 0, 4, 3, 5, 9, 4, 6, 0, 5, 5, 5, 3, 0, 1, 6, 9, 5, 4, 6, 3, 3, 1, 3, 6, 6, 6, 0, 8, 5, 1, 4, 5, 1, 4, 5, 8, 0, 9, 5, 2, 0, 0, 5, 6, 5, 3, 1, 9, 4, 6, 3, 0, 0, 0, 2, 0, 2, 5, 0, 8, 6, 5, 0, 5, 0, 1, 6, 5, 4, 1, 2, 2, 0, 3, 1, 5, 5, 6, 5, 5, 9, 3, 1, 0, 0, 2, 2, 9, 4, 1, 5, 3, 0, 0, 9, 0, 3, 4, 7, 0, 0, 9, 8, 0, 4, 8, 1, 2, 0, 9, 5, 1, 0, 2, 5, 6, 1, 0, 5, 1, 8, 1, 2, 0, 6, 1, 9, 9, 0, 3, 6, 4, 1, 0, 0, 4, 5, 2, 2, 0, 7, 7, 1, 8, 5, 2, 4, 0, 0, 5, 1, 6, 4, 2, 0, 0, 5, 0, 0, 5, 2, 0, 5, 0, 6, 5, 5, 0, 6, 6, 4, 2, 1, 5, 0, 0, 0, 0, 1, 3, 8, 2, 4, 0, 2, 0, 1, 3, 5, 5, 6, 2, 5, 8, 1, 0, 3, 1, 0, 2, 5, 9, 5, 5, 4, 5, 6, 4, 0, 3, 1, 0, 6, 0, 1, 2, 2, 6, 6, 0, 2, 0, 1, 2, 0, 8, 3, 0, 0, 0, 1, 2, 8, 6, 7, 1, 5, 1, 1, 5, 4, 0, 0, 1, 5, 5, 6, 2, 5, 5, 0, 4, 5, 0, 0, 8, 3, 4, 4, 2, 5, 7, 5, 1, 6, 2, 2, 3, 3, 3, 6, 1, 9, 3, 1, 4, 0, 7, 6, 3, 0, 5, 9, 5, 8, 5, 1, 9, 1, 5, 5, 9, 5, 1, 3, 2, 6, 6, 8, 5, 3, 3, 2, 5, 0, 1, 9, 0, 1, 2, 3, 2, 1, 0, 0, 3, 7, 3, 3, 5, 3, 3, 9, 0, 5, 1, 3, 0, 5, 5, 6, 1, 2, 1, 5, 5, 1, 5, 0, 5, 5, 0, 1, 5, 2, 0, 5, 7, 3, 0, 9, 1, 5, 3, 1, 5, 1, 1, 3, 3, 4, 8, 0, 1, 0, 7, 0, 1, 0, 3, 5, 6, 3, 5, 2, 7, 1, 4]
y_r1000_mi = [1, 1, 1, 1, 6, 1, 5, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 5, 8, 6, 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 6, 7, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 1, 1, 1, 1, 8, 5, 8, 8, 1, 1, 5, 1, 1, 1, 1, 8, 1, 8, 1, 1, 6, 1, 1, 1, 1, 0, 1, 1, 1, 7, 1, 1, 1, 1, 1, 1, 8, 0, 8, 1, 1, 1, 1, 1, 8, 1, 1, 1, 1, 8, 1, 1, 1, 8, 5, 1, 1, 8, 1, 1, 0, 1, 1, 1, 8, 1, 1, 5, 1, 1, 8, 6, 1, 7, 1, 1, 1, 1, 1, 1, 6, 1, 1, 8, 6, 1, 1, 1, 8, 1, 1, 1, 6, 1, 1, 1, 6, 1, 1, 1, 8, 1, 0, 1, 1, 7, 1, 5, 1, 1, 1, 1, 1, 1, 1, 6, 1, 5, 1, 1, 0, 8, 1, 1, 1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 6, 5, 1, 1, 1, 2, 5, 6, 1, 1, 8, 5, 1, 0, 1, 1, 1, 1, 1, 1, 1, 6, 1, 6, 1, 1, 1, 9, 6, 1, 1, 0, 1, 1, 5, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 5, 5, 1, 1, 1, 1, 5, 1, 1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 8, 1, 1, 6, 1, 1, 8, 5, 8, 1, 8, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 8, 1, 5, 1, 1, 6, 1, 8, 1, 6, 1, 6, 1, 1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 1, 1, 5, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 6, 1, 1, 1, 8, 1, 1, 1, 1, 5, 1, 5, 8, 1, 1, 1, 5, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 6, 8, 1, 1, 1, 1, 1, 1, 5, 1, 1, 4, 0, 5, 5, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 8, 8, 1, 1, 4, 1, 1, 1, 1, 1, 1, 5, 1, 8, 5, 1, 6, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 1, 1, 5, 1, 1, 1, 8, 1, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 8, 6, 1, 8, 8, 1, 1, 5, 1, 6, 8, 1, 1, 5, 1, 5, 1, 1, 1, 1, 1, 1, 5, 0, 1, 1, 1, 1, 5, 5, 1, 1, 1, 1, 1, 1, 1, 1]
Call the validation function to compare the accuracy of the different runs.
vali(y_r1000_mi,y_test[:500])
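IV. Reflection
Runs compared by training-set size (Euclidean distance): 1,000, 10,000, and 60,000 training samples
Runs compared by distance metric (1,000 training samples): Euclidean, Manhattan, Chebyshev, and Minkowski distance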
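In these runs the Euclidean distance was the most suitable metric for kNN digit recognition. One caveat: minkowski_dist is written with p = 2, which is mathematically the Euclidean distance, and the lists above were recorded with distances computed directly on uint8 pixel arrays, where subtraction wraps around modulo 256. The gap between the Euclidean and Minkowski lists therefore most likely comes from that overflow rather than from the metric itself. I also estimated that, on my machine, classifying all 10,000 test samples against the full 60,000 training samples would take roughly 1.4 hours. For lack of time I did not verify this, though that running time would still be acceptable.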
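A running-time estimate of this kind can be obtained by timing a small batch and scaling linearly to the full test set. A minimal sketch follows; estimate_total_seconds and the stand-in workload are hypothetical, and the linear scaling is an assumption that ignores caching and memory effects.

```python
import time

def estimate_total_seconds(predict_one, sample_inputs, total_count):
    """Time predict_one on a small batch and extrapolate to total_count calls."""
    sample_inputs = list(sample_inputs)
    start = time.perf_counter()
    for item in sample_inputs:
        predict_one(item)
    elapsed = time.perf_counter() - start
    # Average time per call, scaled to the full number of calls.
    return elapsed / len(sample_inputs) * total_count

# Stand-in workload; in the post this would be one KNN call per test image.
def fake_predict(_):
    return sum(range(1000))

est = estimate_total_seconds(fake_predict, range(20), 10000)
print("estimated seconds for 10000 calls:", est)
```

Timing a few dozen test images this way gives a rough figure before committing to a run over all 10,000.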
The approach still has room for improvement; comments and discussion are welcome.