Assignment 1: k-Nearest Neighbor (kNN) exercise (1)

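I. Assignment content
  • (It seems that many deep-learning courses open with kNN. Perhaps it has a knack for sparking interest, or perhaps it is just simple?) The kNN classifier is a very simple, brute-force classifier. The algorithm is easy to understand: compute distances and compare them. It consists of two main steps:
1) Training
  • "Training" deserves its quotation marks, because nothing is actually learned: the classifier only reads the training data and stores it for later use.
2) Testing
  • For each test sample (a test image in this course), kNN walks through the whole training set and computes the distance between the test image and every training image (this is the brute-force part). It then finds the k closest training images, for example by sorting, and assigns the test image the label that holds the majority among those k images.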
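
  • The two steps above can be sketched in a few lines of NumPy. This is only a minimal illustration on toy 2-D points, not the assignment's code; the function name knn_predict and the toy arrays are made up for this example, and the L2 distance is assumed.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """Predict the label of one test point x by majority vote among its k nearest training points."""
    # "Training" is nothing more than having X_train / y_train available.
    # Testing: distance from x to every training point (the brute-force part).
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels (ties go to the smaller label).
    return int(np.argmax(np.bincount(y_train[nearest])))

# Toy data (hypothetical): two clusters in 2-D with labels 0 and 1.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([0.5, 0.5]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([5.5, 5.5]), k=3)
print(pred_a, pred_b)
```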

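  • There are two main ways to measure the distance between images: the Manhattan distance (L1 distance) and the Euclidean distance (L2 distance). The figure below shows how they differ: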

    [Figure 1: L1 (Manhattan) distance vs. L2 (Euclidean) distance]

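  • The course visualizes a comparison of the two distance metrics, shown in the figure below. Which metric to use depends on the situation; one way to decide is to compare their results on the task at hand.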

    [Figure 2: visual comparison of the L1 and L2 distance metrics]
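
  • As a small worked example of the two metrics: the L1 distance sums the absolute differences of the components, and the L2 distance is the square root of the summed squared differences. The vectors a and b below are made-up values chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Two hypothetical 3-component "images".
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# L1 (Manhattan): |1-4| + |2-6| + |3-3| = 3 + 4 + 0
l1 = np.sum(np.abs(a - b))
# L2 (Euclidean): sqrt(3^2 + 4^2 + 0^2) = sqrt(25)
l2 = np.sqrt(np.sum((a - b) ** 2))
print(l1, l2)
```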


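II. Completing the assignment
  • Honestly, I think this course is very well organized. The assignment layout is remarkably user-friendly and keeps you from taking detours (a personal impression). Essentially all the work the student has to do lives in .py files, in the form of completing function bodies. The notebook already contains plenty of scaffolding code that calls those .py files to train the model, test on data, and so on, and it also includes most of the instructions. Working through the assignment is a pleasure.

  • Enough chatter; here is a record of how I completed it. First, run the setup code in the notebook, which mostly handles some basic configuration.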

# Run some setup code for this notebook.

import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt

# This is a bit of magic to make matplotlib figures appear inline in the notebook
# rather than in a new window.
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2
  • Next, load the data: the training set and the test set, each split into an X part (images) and a y part (labels).
# Load the raw CIFAR-10 data.
cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'

# Cleaning up variables to prevent loading data multiple times (which may cause memory issue)
try:
   del X_train, y_train
   del X_test, y_test
   print('Clear previously loaded data.')
except NameError:
   pass

X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

# As a sanity check, we print out the size of the training and test data.
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
  • The code above prints the shape of each array:
Training data shape:  (50000, 32, 32, 3)
Training labels shape:  (50000,)
Test data shape:  (10000, 32, 32, 3)
Test labels shape:  (10000,)
  • Not familiar with this dataset yet? The following code displays a few randomly chosen samples from each class:
# Visualize some examples from the dataset.
# We show a few examples of training images from each class.
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes)
samples_per_class = 7
for y, cls in enumerate(classes):
    idxs = np.flatnonzero(y_train == y)
    idxs = np.random.choice(idxs, samples_per_class, replace=False)
    for i, idx in enumerate(idxs):
        plt_idx = i * num_classes + y + 1
        plt.subplot(samples_per_class, num_classes, plt_idx)
        plt.imshow(X_train[idx].astype('uint8'))
        plt.axis('off')
        if i == 0:
            plt.title(cls)
plt.show()
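  • One function in the code above is worth mentioning: numpy.flatnonzero(). Given an array, it returns the indices of the non-zero elements in the flattened version of that array, as shown below: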
>>> x = np.arange(-2, 3)
>>> x
array([-2, -1,  0,  1,  2])
>>> np.flatnonzero(x)
array([0, 1, 3, 4])


import numpy as np
d = np.array([1,2,3,4,4,3,5,3,6])
tmp = np.flatnonzero(d == 3)
print(tmp)

[2 5 7]
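  • Looking at tmp above, you can see what the line idxs = np.flatnonzero(y_train == y) does: it finds the positions of all labels that belong to class y. Running the visualization code produces the following result: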

[Figure 3: sample training images from each of the 10 CIFAR-10 classes]

  • The full dataset is fairly large and kNN is brute-force, so only a subset is used for this exercise:
# Subsample the data for more efficient code execution in this exercise
num_training = 5000
mask = list(range(num_training))
X_train = X_train[mask]
y_train = y_train[mask]

num_test = 500
mask = list(range(num_test))
X_test = X_test[mask]
y_test = y_test[mask]

# Reshape the image data into rows
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
print(X_train.shape, X_test.shape)
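
  • The reshape step flattens every 32x32x3 image into a single row of 3072 values, so each sample becomes one vector that the distance functions can work on. A quick sketch on a made-up array (fake_images is a stand-in, not real CIFAR-10 data):

```python
import numpy as np

# Hypothetical stand-in for a small batch of CIFAR-10 images: 4 images of 32x32x3.
fake_images = np.arange(4 * 32 * 32 * 3).reshape(4, 32, 32, 3)

# Keep the first axis (one row per image) and let -1 infer the rest: 32 * 32 * 3 = 3072.
rows = np.reshape(fake_images, (fake_images.shape[0], -1))
print(rows.shape)
```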
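  • Next, create the kNN classifier (which really just stores the training data). The main task is to compute the distance matrix. With $N_{tr}$ training samples and $N_{te}$ test samples, the distance matrix has size $N_{te} \times N_{tr}$, where element [i, j] is the distance from the i-th test sample to the j-th training sample. The first step is to complete the compute_distances_two_loops method in k_nearest_neighbor.py, which uses a plain double loop to compute the distances between test and training samples.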
def compute_distances_two_loops(self, X):
        """
        Compute the distance between each test point in X and each training point
        in self.X_train using a nested loop over both the training data and the
        test data.

        Inputs:
        - X: A numpy array of shape (num_test, D) containing test data.

        Returns:
        - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
          is the Euclidean distance between the ith test point and the jth training
          point.
        """
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            for j in range(num_train):
                #####################################################################
                # TODO:                                                             #
                # Compute the l2 distance between the ith test point and the jth    #
                # training point, and store the result in dists[i, j]. You should   #
                # not use a loop over dimension, nor use np.linalg.norm().          #
                #####################################################################
                # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
                
                dists[i][j] = np.sqrt(np.sum(np.square(X[i,:] - self.X_train[j,:])))

                # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        return dists
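
  • The later snippets in this post use a classifier object and a dists matrix whose creation is not shown here. The flow is: create the classifier, "train" it (store the data), then compute the distance matrix for the test set. The sketch below reproduces that flow with a self-contained stand-in; the class name TinyKNN and the random data are hypothetical and are not the course's own class or dataset.

```python
import numpy as np

class TinyKNN:
    """Hypothetical minimal stand-in for the assignment's classifier class."""

    def train(self, X, y):
        # "Training" only memorizes the data.
        self.X_train = X
        self.y_train = y

    def compute_distances_two_loops(self, X):
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            for j in range(num_train):
                # L2 distance between test point i and training point j.
                dists[i, j] = np.sqrt(np.sum(np.square(X[i, :] - self.X_train[j, :])))
        return dists

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(6, 4))        # 6 training points with 4 features each
y_tr = np.array([0, 1, 2, 0, 1, 2])
X_te = rng.normal(size=(3, 4))        # 3 test points

classifier = TinyKNN()
classifier.train(X_tr, y_tr)
dists = classifier.compute_distances_two_loops(X_te)
print(dists.shape)  # one row per test point, one column per training point
```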
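  • The assignment also asks you to test this method and visualize the result, but I don't think that adds much, so it is not recorded here. Moving straight on to the next step: implement the predict_labels method, which uses the distances from compute_distances_two_loops to classify the test samples.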
def predict_labels(self, dists, k=1):
        """
        Given a matrix of distances between test points and training points,
        predict a label for each test point.

        Inputs:
        - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
          gives the distance between the ith test point and the jth training point.

        Returns:
        - y: A numpy array of shape (num_test,) containing predicted labels for the
          test data, where y[i] is the predicted label for the test point X[i].
        """
        num_test = dists.shape[0]
        y_pred = np.zeros(num_test)
        for i in range(num_test):
            # A list of length k storing the labels of the k nearest neighbors to
            # the ith test point.
            closest_y = []
            #########################################################################
            # TODO:                                                                 #
            # Use the distance matrix to find the k nearest neighbors of the ith    #
            # testing point, and use self.y_train to find the labels of these       #
            # neighbors. Store these labels in closest_y.                           #
            # Hint: Look up the function numpy.argsort.                             #
            #########################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

            closest_y = self.y_train[np.argsort(dists[i])[:k]]

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
            #########################################################################
            # TODO:                                                                 #
            # Now that you have found the labels of the k nearest neighbors, you    #
            # need to find the most common label in the list closest_y of labels.   #
            # Store this label in y_pred[i]. Break ties by choosing the smaller     #
            # label.                                                                #
            #########################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

            y_pred[i] = np.argmax(np.bincount(closest_y))

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        return y_pred
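  • numpy.bincount deserves a mention. Here we need to count how many times each class appears among the k nearest training samples, and numpy.bincount does exactly that. Two examples make this clear:
# The largest value in x is 7, so there are 8 bins, indexed 0 through 7
x = np.array([0, 1, 1, 3, 2, 1, 7])
# Value 0 appears once, value 1 appears 3 times, ..., value 5 appears 0 times, ...
np.bincount(x)
# Output: array([1, 3, 1, 1, 0, 0, 0, 1])

# The largest value in x is 7, so there are 8 bins, indexed 0 through 7
x = np.array([7, 6, 2, 1, 4])
# Value 0 appears 0 times, value 1 appears once, ..., value 5 appears 0 times, ...
np.bincount(x)
# Output: array([0, 1, 1, 0, 1, 0, 1, 1])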
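
  • The assignment asks for ties to be broken by choosing the smaller label. Combining np.bincount with np.argmax does this automatically, because np.argmax returns the first index at which the maximum occurs. A small check with made-up neighbor labels:

```python
import numpy as np

# Hypothetical labels of the 5 nearest neighbors: labels 2 and 5 both get two votes.
closest_y = np.array([5, 2, 5, 2, 9])

counts = np.bincount(closest_y)   # counts[label] = number of votes for that label
winner = int(np.argmax(counts))   # first index holding the maximum count
print(counts, winner)
```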
  • Run the test and compute the accuracy, first with k=1
# Now implement the function predict_labels and run the code below:
# We use k = 1 (which is Nearest Neighbor).
y_test_pred = classifier.predict_labels(dists, k=1)

# Compute and print the fraction of correctly predicted examples
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))

# Output
Got 137 / 500 correct => accuracy: 0.274000
  • With k=5
y_test_pred = classifier.predict_labels(dists, k=5)
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))

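# Output
Got 139 / 500 correct => accuracy: 0.278000
  • k=5 does slightly better than k=1 here (139 vs. 137 correct out of 500), but both values were picked arbitrarily. Cross-validation will be used later to choose the best hyperparameter k; I will cover that in the next post.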

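  • Next, make the distance computation more efficient with a single-loop version by completing the compute_distances_one_loop method. It relies on broadcasting to eliminate the inner loop.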

def compute_distances_one_loop(self, X):
        """
        Compute the distance between each test point in X and each training point
        in self.X_train using a single loop over the test data.

        Input / Output: Same as compute_distances_two_loops
        """
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            #######################################################################
            # TODO:                                                               #
            # Compute the l2 distance between the ith test point and all training #
            # points, and store the result in dists[i, :].                        #
            # Do not use np.linalg.norm().                                        #
            #######################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
            # Use broadcasting to compute the distances from the i-th test image
            # to all 5000 training images at once
            dists[i, :] = np.sqrt(np.sum(np.square(self.X_train - X[i, :]),axis=1))

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        return dists
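
  • To see what broadcasting does in the one-loop version: subtracting a single row of shape (D,) from a matrix of shape (N, D) subtracts that row from every row of the matrix. A tiny sketch with made-up numbers (X_train_demo and x are illustrative names):

```python
import numpy as np

X_train_demo = np.array([[0.0, 0.0],
                         [3.0, 4.0],
                         [6.0, 8.0]])   # shape (3, 2): three "training" points
x = np.array([0.0, 0.0])                # shape (2,): one "test" point

diff = X_train_demo - x                 # x is broadcast across all rows -> shape (3, 2)
row = np.sqrt(np.sum(np.square(diff), axis=1))  # one distance per training point
print(diff.shape, row)
```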
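  • Can it be optimized further? Yes! With one matrix multiplication and two broadcast sums, the whole computation can be done without any loops.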
def compute_distances_no_loops(self, X):
        """
        Compute the distance between each test point in X and each training point
        in self.X_train using no explicit loops.

        Input / Output: Same as compute_distances_two_loops
        """
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        #########################################################################
        # TODO:                                                                 #
        # Compute the l2 distance between all test points and all training      #
        # points without using any explicit loops, and store the result in      #
        # dists.                                                                #
        #                                                                       #
        # You should implement this function using only basic array operations; #
        # in particular you should not use functions from scipy,                #
        # nor use np.linalg.norm().                                             #
        #                                                                       #
        # HINT: Try to formulate the l2 distance using matrix multiplication    #
        #       and two broadcast sums.                                         #
        #########################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        # Expand the squared distance: (x - y)^2 = x^2 - 2xy + y^2
        temp_2xy = np.dot(X, self.X_train.T) * (-2)              # shape (num_test, num_train)
        temp_x2 = np.sum(np.square(X), axis=1, keepdims=True)    # shape (num_test, 1)
        temp_y2 = np.sum(np.square(self.X_train), axis=1)        # shape (num_train,)
        # Broadcasting adds the three terms into a (num_test, num_train) matrix
        dists = temp_x2 + temp_2xy + temp_y2
        # Take the square root to get the L2 distance
        dists = np.sqrt(dists)

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        return dists
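
  • The identity behind the loop-free version can be written out explicitly. For a test row $x_i$ and a training row $y_j$ (symbols introduced here only for illustration, with $d$ running over the pixel dimensions), the squared L2 distance expands as:

```latex
\begin{aligned}
\lVert x_i - y_j \rVert_2^2
  &= \sum_{d} (x_{id} - y_{jd})^2 \\
  &= \sum_{d} x_{id}^2 \;-\; 2 \sum_{d} x_{id}\, y_{jd} \;+\; \sum_{d} y_{jd}^2 \\
  &= \lVert x_i \rVert_2^2 \;-\; 2\, x_i^{\top} y_j \;+\; \lVert y_j \rVert_2^2
\end{aligned}
```

  • In the code, temp_x2 holds the first term for every test row, temp_2xy holds the middle term for every (test, train) pair, and temp_y2 holds the last term for every training row; broadcasting combines them into the full matrix before the square root is taken.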
  • All three distance functions produce the same result, so I won't go over that again. The main point is to compare their speed:
# Let's compare how fast the implementations are
def time_function(f, *args):
    """
    Call a function f with args and return the time (in seconds) that it took to execute.
    """
    import time
    tic = time.time()
    f(*args)
    toc = time.time()
    return toc - tic

two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)
print('Two loop version took %f seconds' % two_loop_time)

one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)
print('One loop version took %f seconds' % one_loop_time)

no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)
print('No loop version took %f seconds' % no_loop_time)
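  • The output is below. The gap is obvious: the fully vectorized version is roughly 140 times faster than the two-loop version, while the one-loop version is only marginally faster than two loops in this run.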
Two loop version took 47.337988 seconds
One loop version took 45.795947 seconds
No loop version took 0.334743 seconds
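
  • The claim that the implementations agree can be checked numerically, for example by taking the Frobenius norm of the difference between two distance matrices. The sketch below does this on small random data with standalone versions of the one-loop and no-loop computations; the function names, the random data, and the np.maximum clip (a guard against tiny negative values caused by rounding, which is not in the code above) are assumptions made for this example. Timings will vary by machine, so none are quoted.

```python
import time
import numpy as np

def dists_one_loop(X, X_train):
    dists = np.zeros((X.shape[0], X_train.shape[0]))
    for i in range(X.shape[0]):
        dists[i, :] = np.sqrt(np.sum(np.square(X_train - X[i, :]), axis=1))
    return dists

def dists_no_loops(X, X_train):
    cross = -2.0 * X.dot(X_train.T)                     # (num_test, num_train)
    x2 = np.sum(np.square(X), axis=1, keepdims=True)    # (num_test, 1)
    y2 = np.sum(np.square(X_train), axis=1)             # (num_train,)
    # Added safeguard (not in the original code): clip tiny negatives from rounding.
    return np.sqrt(np.maximum(x2 + cross + y2, 0.0))

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(200, 64))   # hypothetical training data
X_te = rng.normal(size=(50, 64))    # hypothetical test data

tic = time.perf_counter()
d1 = dists_one_loop(X_te, X_tr)
t_one = time.perf_counter() - tic

tic = time.perf_counter()
d0 = dists_no_loops(X_te, X_tr)
t_none = time.perf_counter() - tic

# A difference close to zero means the two matrices match up to rounding error.
difference = np.linalg.norm(d1 - d0, ord='fro')
print('Frobenius difference: %g' % difference)
print('one loop: %.6f s, no loops: %.6f s' % (t_one, t_none))
```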
