[cs231n] Assignment1_Knn Code Study Notes

Some of the material comes from the web and is used for personal study only.

Contents

1. Download the CIFAR10 dataset and load it

2. Define a K Nearest Neighbor Class

3. Train and Test

4. Cross Validation


1. Download the CIFAR10 dataset and load it

Setup code

import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt

# A small trick so that matplotlib figures appear inline in the notebook instead of in a new window
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0)  # default figure size, in inches
plt.rcParams['image.interpolation'] = 'nearest'  # interpolation method: nearest neighbor
plt.rcParams['image.cmap'] = 'gray'  # default colormap: grayscale

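%load_ext autoreload
%autoreload 2
# autoreload reloads modules before user code is executed, so edits to imported
# .py files take effect without restarting the kernel.
# %autoreload takes an optional argument; check the help of your IPython version for details.
# In general:
#   no argument: reload all modules right now (except those excluded by %aimport).
#   0: disable automatic reloading.
#   1: before each execution, reload only the modules imported with %aimport.
#   2: before each execution, reload all modules except those excluded by %aimport.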

Load the CIFAR10 data

cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir) # 读取数据集

# As a sanity check, print the sizes of the training and test data.
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)

Show some CIFAR10 images

classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes)  # 10 classes in total
num_each_class = 7  # show 7 samples per class

"""
enumerate() turns an iterable (such as a list, tuple or string) into a sequence of (index, value) pairs, and is typically used in a for loop.
In the loop below, y is the index of a class and cls is its name. y_train holds the label of every training sample, so y_train == y selects all samples of that class, and their positions form an index array.

    >>> seasons = ['Spring', 'Summer', 'Fall', 'Winter']
    >>> list(enumerate(seasons))
    [(0, 'Spring'), (1, 'Summer'), (2, 'Fall'), (3, 'Winter')]
"""

for y, cls in enumerate(classes):
    # np.flatnonzero() returns the indices of the non-zero entries of the flattened input.
    # Applied to the boolean array y_train == y, it gives the positions of all samples labeled y.
    idxs = np.flatnonzero(y_train == y)
    # Randomly draw num_each_class (7) of those indices; replace=False means no index is drawn twice.
    idxs = np.random.choice(idxs, num_each_class, replace=False)
    # i is the position among the chosen samples, idx the position of the image in the training set.
    for i, idx in enumerate(idxs):
        # plt.subplot(nrows, ncols, index): the first two arguments give the grid of subplots
        # (6 subplots in 3 rows and 2 columns is subplot(3, 2, X)); the third selects the X-th
        # subplot, counted from 1 in row-major order. The grid size is passed on every call
        # because it is not fixed.
        plt_idx = i * num_classes + (y + 1)  # row i, column y of the grid
        plt.subplot(num_each_class, num_classes, plt_idx)  # select the subplot to draw in
        plt.imshow(X_train[idx].astype('uint8'))  # draw the image in the selected subplot
        plt.axis('off')  # hide the axes
        if i == 0:
            plt.title(cls)  # title the first row with the class name
plt.show()
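
The indexing steps in the loop above can be tried in isolation. The following is a minimal sketch on a made-up label vector (the names `labels`, `idxs_class2` and `picked` are illustrative and not part of the assignment); it uses NumPy's `default_rng` generator instead of `np.random.choice` purely so the draw is reproducible.

```python
import numpy as np

# Toy label vector standing in for y_train (hypothetical data, not CIFAR-10).
labels = np.array([0, 2, 1, 2, 0, 2, 1, 2])

# Boolean mask -> positions of the True entries, i.e. every sample of class 2.
idxs_class2 = np.flatnonzero(labels == 2)
print(idxs_class2)  # [1 3 5 7]

# Draw 3 distinct positions from those; replace=False forbids duplicates.
rng = np.random.default_rng(0)
picked = rng.choice(idxs_class2, 3, replace=False)
print(sorted(picked.tolist()))

# Subplot index for sample row i=1 and class column y=3 in a grid with 10 columns.
num_classes = 10
plt_idx = 1 * num_classes + (3 + 1)
print(plt_idx)  # 14
```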

Subsample the data so that the code runs faster

# train numbers
num_train = 5000
mask = range(num_train)
X_train = X_train[mask]
y_train = y_train[mask]

# test numbers
num_test = 500
mask = range(num_test)
X_test = X_test[mask]
y_test = y_test[mask]
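# change 4D to 2D, like (5000, 32, 32, 3) -> (5000, 3072)
"""
np.reshape(X_train, (X_train.shape[0], -1)) keeps the first dimension and flattens all the
remaining dimensions, however many there are, into one. Writing -1 is a shortcut; here it is
equivalent to 32*32*3 = 3072.
After the reshape the training data has 5000 rows, each holding 3072 values (features).
-1 is used when one dimension is left for NumPy to work out: the other dimensions are fixed first,
and the missing size is (total number of elements) / (product of the specified dimensions).

X.reshape(X.shape[0], -1).T turns an array of shape (a, b, c, d) into one of shape (b*c*d, a).
Example 1:
>>> X.shape
(209, 64, 64, 3)
>>> X.shape[0]
209
shape[0] is the size of the first dimension, i.e. the number of samples, 209.
>>> X.reshape(X.shape[0], -1).shape
(209, 12288)
The first dimension of the new shape is X.shape[0], which is an ordinary reshape;
the second is -1. X has shape (209, 64, 64, 3) and we want 209 rows without working out
the number of columns ourselves, so NumPy computes 209 * 64 * 64 * 3 / 209 = 64 * 64 * 3 = 12288.
>>> X.reshape(X.shape[0], -1).T.shape
(12288, 209)

"""
# For the Euclidean distance computation, flatten each image into a row vector, e.g. (32, 32, 3) -> (3072,)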
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
print('X_train shape: ', X_train.shape)
print('X_test shape: ', X_test.shape)
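
To see what `-1` does without loading CIFAR-10, here is a small sketch on a made-up array (the shape 4x2x2x3 is an arbitrary choice for illustration):

```python
import numpy as np

# Hypothetical mini-batch: 4 "images" of size 2x2 with 3 channels.
X = np.arange(4 * 2 * 2 * 3).reshape(4, 2, 2, 3)

flat = X.reshape(X.shape[0], -1)   # -1 is inferred as 2*2*3 = 12
print(flat.shape)     # (4, 12)
print(flat.T.shape)   # (12, 4)

# Row 0 of flat is image 0 unrolled in C (row-major) order.
print(np.array_equal(flat[0], X[0].ravel()))  # True
```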

2. Define a K Nearest Neighbor Class

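Computing the Euclidean distance without loops

Source: https://blog.csdn.net/geekmanong/article/details/51524402

Given two sample matrices P (m rows) and C (n rows), use broadcasting: just make sure the squared-norm term of P has shape m*1 and the squared-norm term of C has shape 1*n.

Broadcasting: (m, 1) + (1, n) ---> (m, n)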
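
The shapes above follow from expanding the squared distance. A short derivation, where p_i denotes row i of P, c_j denotes row j of C, and s_P, s_C are the column vectors of row-wise squared norms (this notation is introduced here for illustration and is not from the original post):

```latex
\[
\lVert p_i - c_j \rVert^2
  = \lVert p_i \rVert^2 - 2\, p_i^{\top} c_j + \lVert c_j \rVert^2
\]
\[
D^{\circ 2}
  = \underbrace{s_P \mathbf{1}_n^{\top}}_{(m,1)\ \text{broadcast to}\ (m,n)}
  \;-\; 2\, P C^{\top}
  \;+\; \underbrace{\mathbf{1}_m s_C^{\top}}_{(1,n)\ \text{broadcast to}\ (m,n)},
\qquad D_{ij} = \sqrt{D^{\circ 2}_{ij}}
\]
```

Here D^{\circ 2} is the matrix of squared distances; the only term that costs a matrix product is P C^T.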


class KNearestNeighbor(object):
    """a KNN classifier with L2 distance"""

    def __init__(self):
        pass
    
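    def train(self, X, y):
        """
        Train the classifier. For KNN this simply memorizes the training data.
        Inputs:
        - X: numpy array of shape (num_train, D) holding num_train training samples, each of dimension D.
        - y: numpy array of shape (num_train,) holding the training labels, where y[i] is the label of X[i].
        """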
        self.X_train = X
        self.y_train = y
    
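    def predict(self, X, k = 1, num_loops = 0):
        """
        Predict labels for test data.
        Inputs:
        - X: numpy array of shape (num_test, D) holding num_test test samples, each of dimension D.
        - k: number of nearest neighbors that vote for the predicted label.
        - num_loops: which implementation (0, 1 or 2 for-loops) computes the L2 distances between training and test points.
        Returns:
        - y_pred: numpy array of shape (num_test,) with the predicted labels.
        """

        # Compute the L2 distances between test X and train X
        if num_loops == 0:
            # no for-loop, fully vectorized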
            dists = self.cal_dists_no_loop(X)
        elif num_loops == 1:
            # one for-loop, half-vectorized
            dists = self.cal_dists_one_loop(X)
        elif num_loops == 2:
            # two for-loop, no vectorized
            dists = self.cal_dists_two_loop(X)
        else:
            raise ValueError('Invalid value %d for num_loops' % num_loops)

        # predict the labels
        num_test = X.shape[0]
        y_pred = np.zeros(num_test)
        for i in range(num_test):
            # np.argsort returns the indices that sort the array in ascending order, so the first
            # k entries are the indices of the k training samples closest to test sample i.
            dists_k_min = np.argsort(dists[i])[0:k]
            # Integer array indexing: pick out the labels of those k nearest neighbors.
            close_y = self.y_train[dists_k_min]
            # np.bincount(close_y) counts how often each label occurs: a[0] is the number of 0s,
            # a[1] the number of 1s, and so on. np.argmax then returns the most frequent label
            # (the smallest one on a tie), e.g. [0, 3, 1, 3, 3, 1] -> 3 as y_pred[i].
            y_pred[i] = np.argmax(np.bincount(close_y))
        return y_pred
    
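    def cal_dists_no_loop(self, X):
        """
        Compute the distances without any for-loop.
        The trick is to expand the squared L2 distance: the square of a difference becomes
        the sum of the two squared terms minus twice the cross term (a dot product).
        Inputs:
        - X: numpy array of shape (num_test, D) holding num_test test samples, each of dimension D.
        Returns:
        - dists: numpy array of shape (num_test, num_train) with the distances between test X and train X.
        """
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        # (X - X_train)*(X - X_train) = -2X*X_train + X*X + X_train*X_train
        d1 = np.multiply(np.dot(X, self.X_train.T), -2)    # shape (num_test, num_train)
        d2 = np.sum(np.square(X), axis=1, keepdims=True)    # shape (num_test, 1)
        d3 = np.sum(np.square(self.X_train), axis=1)    # shape (num_train,), broadcasts as (1, num_train)
        # Rounding errors can leave tiny negative values, so clip at 0 before taking the square root.
        dists = np.sqrt(np.maximum(d1 + d2 + d3, 0))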
        
        return dists
    
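    def cal_dists_one_loop(self, X):
        """
        Compute the distances with one for-loop.
        The loop runs once per test sample, so the loop body computes the distance between
        one vector and a whole matrix.
        Inputs:
        - X: numpy array of shape (num_test, D) holding num_test test samples, each of dimension D.
        Returns:
        - dists: numpy array of shape (num_test, num_train) with the distances between test X and train X.

        The body operates on the whole training set at once: self.X_train has shape (5000, 3072)
        and X[i] has shape (3072,), so the subtraction broadcasts X[i] to the shape of self.X_train.
        The following sum (or norm) needs an explicit axis, here axis=1. As I understand it, axes
        are numbered from left to right whatever the number of dimensions, and the axis that is
        named is the one the operation reduces away.
        """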
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            dists[i] = np.sqrt(np.sum(np.square(self.X_train - X[i]), axis=1))
        
        return dists
    
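    def cal_dists_two_loop(self, X):
        """
        Compute the distances with two for-loops.
        Inputs:
        - X: numpy array of shape (num_test, D) holding num_test test samples, each of dimension D.
        Returns:
        - dists: numpy array of shape (num_test, num_train) with the distances between test X and train X.
        """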
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            for j in range(num_train):
                # L2 distance between two vectors
                dists[i, j] = np.sqrt(np.sum(np.square(X[i] - self.X_train[j])))
        return dists   
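
The voting step inside `predict` can be followed by hand on a single row of distances. A minimal sketch with made-up numbers (`dists_row` and `train_labels` are hypothetical stand-ins for `dists[i]` and `self.y_train`):

```python
import numpy as np

# Hypothetical distances from one test sample to six training samples.
dists_row = np.array([0.9, 0.2, 0.5, 0.1, 0.7, 0.3])
train_labels = np.array([0, 3, 1, 3, 3, 1])
k = 3

nearest = np.argsort(dists_row)[:k]   # indices of the k smallest distances
close_y = train_labels[nearest]       # labels of those neighbors
counts = np.bincount(close_y)         # counts[c] = number of votes for class c
pred = np.argmax(counts)              # on a tie, the smallest label wins

print(nearest)  # [3 1 5]
print(close_y)  # [3 3 1]
print(counts)   # [0 1 0 2]
print(pred)     # 3
```

Note that `np.bincount` needs non-negative integer labels, which holds for class indices such as 0 to 9.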

3. Train and Test

Create a KNN classifier instance

KNN = KNearestNeighbor()
KNN.train(X_train, y_train)

Compare the distance matrices computed by the no-loop, one-loop and two-loop versions

dists_no_loop = KNN.cal_dists_no_loop(X_test)
dists_one_loop = KNN.cal_dists_one_loop(X_test)
dists_two_loop = KNN.cal_dists_two_loop(X_test)
diff1 = np.linalg.norm(dists_no_loop - dists_one_loop) # Frobenius norm of the difference (the default for a matrix)
diff2 = np.linalg.norm(dists_no_loop - dists_two_loop)
print('The difference between no-loop and one-loop is: %f' % diff1)
print('The difference between no-loop and two-loop is: %f' % diff2)
if diff1 < 0.001 and diff2 < 0.001:
    print('Good, the distance matrices are the same!')
else:
    print('Oh, the distance matrices are different')
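
The same kind of agreement check can be run on small random matrices, independent of the class. The sketch below is an assumption-laden illustration rather than the assignment's code: `P` and `C` are random stand-ins for the test and training matrices, and the reference result is built with an explicit (m, n, d) array of differences.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.normal(size=(4, 5))   # stand-in for the test matrix, shape (m, d)
C = rng.normal(size=(6, 5))   # stand-in for the training matrix, shape (n, d)

# Vectorized: (m, 1) + (n,) broadcasts to (m, n), the shape of P @ C.T.
sq = np.sum(P**2, axis=1, keepdims=True) + np.sum(C**2, axis=1) - 2 * P @ C.T
fast = np.sqrt(np.maximum(sq, 0))   # clip tiny negatives caused by rounding

# Reference: explicit pairwise differences through an (m, n, d) intermediate.
slow = np.linalg.norm(P[:, None, :] - C[None, :, :], axis=2)

print(fast.shape, np.allclose(fast, slow))
```

The reference version is easy to read but needs m*n*d numbers of memory, which is why the expanded form is preferred for 500 x 5000 x 3072.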

Compare the speed of distance_computation by no-loop, one-loop and two-loop

def time_func(f, *args):
    """
    Call a function f with args and return the time (in seconds) that it took to execute.
    """
    import time
    
    t_st = time.time()
    f(*args)
    t_ed = time.time()
    
    return t_ed - t_st

# no-loop
no_loop_time = time_func(KNN.cal_dists_no_loop, X_test)
print('No loop time: %f seconds' % no_loop_time)
one_loop_time = time_func(KNN.cal_dists_one_loop, X_test)
print('One loop time: %f seconds' % one_loop_time)
two_loop_time = time_func(KNN.cal_dists_two_loop, X_test)
print('Two loop time: %f seconds' % two_loop_time)
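
A single `time.time()` measurement can be noisy. One possible variant, sketched here under the assumption that repeating the call is acceptable, uses `time.perf_counter()` and keeps the best of several runs; `time_func_best` is a hypothetical helper and not part of the assignment code.

```python
import time

def time_func_best(f, *args, repeats=3):
    """Return the shortest wall-clock time (in seconds) over `repeats` calls of f(*args)."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        f(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Usage on a cheap built-in; with the classifier it would be called like
# time_func_best(KNN.cal_dists_no_loop, X_test).
elapsed = time_func_best(sum, range(100000))
print('sum over range: %f seconds' % elapsed)
```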
    

Predict test dataset

# k = 1
y_pred = KNN.predict(X_test, k=1)
num_correct = np.sum(y_pred == y_test)
accuracy = np.mean(y_pred == y_test)  # mean of a boolean array = fraction of correct predictions
print('Correct %d/%d: The accuracy is %f' % (num_correct, X_test.shape[0], accuracy))

# k = 5
y_pred = KNN.predict(X_test, k=5)
num_correct = np.sum(y_pred == y_test)
accuracy = np.mean(y_pred == y_test)
print('Correct %d/%d: The accuracy is %f' % (num_correct, X_test.shape[0], accuracy))

4. Cross Validation

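We do not know which value of k is the best choice, so we now use cross-validation to pick this hyperparameter.

Use 5-fold cross-validation: one fold serves as the validation set and the rest as the training set.

"""
First split the training set into 5 groups with np.split (np.array_split also works and additionally allows uneven splits). Note that the result is a Python list of arrays, not a single array.
Keep the difference between the two in mind: list is a built-in Python type, whereas the array (ndarray) is a NumPy type.
Mixing them up makes the rest of the program very hard to follow. The next key point is how to assemble the training set for each round of the 5-fold cross-validation.
"""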
num_folds = 5    # split the training dataset to 5 parts
k_classes = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]    # candidate values of k; the best one is chosen below

# Split up the training data into folds
X_train_folds = []
y_train_folds = []
X_train_folds = np.split(X_train, num_folds)
y_train_folds = np.split(y_train, num_folds)

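# Dictionary mapping each k to its list of accuracies, one per fold
k_accuracy = {}

"""
For each k the list of accuracies starts out empty (accuracies = []). Two nested loops perform the cross-validation: the outer loop runs over the candidate values of k and the inner loop over the folds.
"""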
for k in k_classes:
    accuracies = []
    #knn = KNearestNeighbor()
    for i in range(num_folds):
        # Concatenate the 4 remaining folds into one training set; axis=0 is the default and can be omitted
        Xtr = np.concatenate(X_train_folds[:i] + X_train_folds[i+1:]) 
        ytr = np.concatenate(y_train_folds[:i] + y_train_folds[i+1:])
        Xcv = X_train_folds[i]
        ycv = y_train_folds[i]
        KNN.train(Xtr, ytr)
        ycv_pred = KNN.predict(Xcv, k=k, num_loops=0)
        accuracy = np.mean(ycv_pred == ycv)
        accuracies.append(accuracy)
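    k_accuracy[k] = accuracies
"""
np.concatenate() joins a sequence of arrays (a tuple or a list of ndarrays) along the given axis. In the code above no tuple is passed; the argument is the sum of two lists.
This is list addition, not array addition: for lists, Python's + operator simply joins the two lists. The result is a list of length 4 whose elements are 1000×3072 arrays. np.concatenate accepts any sequence of arrays, so the list can be passed directly; the 4 arrays are joined along axis 0, giving a 4000×3072 array.
"""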
# Print the accuracy
for k in k_classes:
    for i in range(num_folds):
        print('k = %d, fold = %d, accuracy: %f' % (k, i+1, k_accuracy[k][i]))
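
The fold assembly used in the cross-validation loop is easier to inspect on a tiny data set. A minimal sketch, assuming 10 made-up samples split into 5 folds (all names and values here are illustrative):

```python
import numpy as np

X = np.arange(20).reshape(10, 2)   # hypothetical data: 10 samples, 2 features
y = np.arange(10)
folds_X = np.split(X, 5)           # Python list of 5 arrays, each of shape (2, 2)
folds_y = np.split(y, 5)

i = 1                              # fold held out for validation
# list + list joins the lists; np.concatenate then stacks the arrays along axis 0.
Xtr = np.concatenate(folds_X[:i] + folds_X[i + 1:])
ytr = np.concatenate(folds_y[:i] + folds_y[i + 1:])
Xcv, ycv = folds_X[i], folds_y[i]

print(type(folds_X).__name__, Xtr.shape, Xcv.shape)  # list (8, 2) (2, 2)
print(ytr, ycv)  # [0 1 4 5 6 7 8 9] [2 3]
```

`np.split` raises an error when the array cannot be divided evenly, so `np.array_split` is the safer choice if the number of samples is not a multiple of the number of folds.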

Plot the cross-validation results

for k in k_classes:
    plt.scatter([k] * num_folds, k_accuracy[k])
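# Plot the trend line with error bars that correspond to the standard deviation
# (iterate over k_classes so the order matches the x values passed to errorbar)
accuracies_mean = [np.mean(k_accuracy[k]) for k in k_classes]
accuracies_std = [np.std(k_accuracy[k]) for k in k_classes]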
plt.errorbar(k_classes, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()

Choose the best k 

best_k = k_classes[np.argmax(accuracies_mean)]
# Use the best k, and test it on the test data
KNN = KNearestNeighbor()
KNN.train(X_train, y_train)
y_pred = KNN.predict(X_test, k=best_k, num_loops=0)
num_correct = np.sum(y_pred == y_test)
accuracy = np.mean(y_pred == y_test)
print('Correct %d/%d: The accuracy is %f' % (num_correct, X_test.shape[0], accuracy))
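
The selection of `best_k` amounts to taking the k with the highest mean accuracy across folds. A small sketch of that step in isolation; the dictionary holds made-up placeholder numbers, not results from this notebook, and `k_accuracy_demo` is a hypothetical name.

```python
import numpy as np

# Placeholder accuracies per k (two folds each), invented purely for illustration.
k_accuracy_demo = {1: [0.26, 0.25], 5: [0.28, 0.29], 10: [0.27, 0.28]}

ks = list(k_accuracy_demo)
means = [np.mean(k_accuracy_demo[k]) for k in ks]
best_k_demo = ks[int(np.argmax(means))]   # k whose mean accuracy is largest
print(best_k_demo)  # 5
```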

 
