KNN is short for k-Nearest Neighbor classifier; it is an improvement on the basic Nearest Neighbor classifier:
Pros, cons, and typical use cases of the k-Nearest Neighbor classifier:
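To make the idea concrete (this is a toy sketch, not the assignment's vectorized implementation): a k-NN classifier simply stores the training set and, at prediction time, takes a majority vote among the k training examples closest to the query point, here under the L1 distance. The data below is hand-made for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify one point x by majority vote among its k nearest training points (L1 distance)."""
    dists = np.sum(np.abs(X_train - x), axis=1)         # L1 distance to every training example
    nearest = np.argsort(dists)[:k]                     # indices of the k closest examples
    return int(np.bincount(y_train[nearest]).argmax())  # most common label among them

# toy data: two well-separated clusters (hypothetical, for illustration only)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))  # → 0
```

With k = 1 this reduces to the plain Nearest Neighbor classifier; larger k smooths out noisy labels at the cost of blurrier decision boundaries.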
SVM is short for Support Vector Machine, a binary classification model; in its basic form it is a linear classifier.
To introduce the SVM, we start from the simplest building block, a linear mapping:
As a concrete high-dimensional example, consider CIFAR-10:
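The shapes involved for CIFAR-10 look like this (a toy sketch with small random, untrained weights; the variable names are mine, not the assignment's):

```python
import numpy as np

# CIFAR-10 images are 32x32x3 = 3072 raw pixel values; there are 10 classes
num_pixels, num_classes = 32 * 32 * 3, 10
W = np.random.randn(num_classes, num_pixels) * 0.0001  # small random weights (untrained)
b = np.zeros(num_classes)                              # one bias per class
x = np.random.randint(0, 256, size=num_pixels).astype(np.float64)  # one flattened image
scores = W.dot(x) + b  # f(x; W, b) = Wx + b: one score per class
print(scores.shape)    # → (10,)
```

Each row of W can be read as a template for one class; the score is how well the image matches that template.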
Image data preprocessing:
In the example above, all images use their raw pixel values (0 to 255). In machine learning it is standard practice to normalize the input features. In image processing, each pixel can be treated as a simple feature, and a common first step is to center the features: compute the mean image over the entire training set, then subtract it from every image, so that pixel values fall roughly in [-127, 127]. A common next step is to scale all values into the range [-1, 1].
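The two steps above (mean subtraction, then scaling) can be sketched as follows; the array here is random stand-in data, not the actual CIFAR-10 images:

```python
import numpy as np

# stand-in data: 500 flattened "images" with values in [0, 255] (not real CIFAR-10)
X_train = np.random.randint(0, 256, size=(500, 3072)).astype(np.float64)

mean_image = X_train.mean(axis=0)   # per-pixel mean over the whole training set
X_centered = X_train - mean_image   # pixel values now roughly in [-127, 127]
X_scaled = X_centered / 127.5       # further scaled to roughly [-1, 1]
print(abs(float(X_centered.mean())) < 1e-9)  # → True: centered data has (near-)zero mean
```

Note that the mean image is computed from the training set only; the same mean is then subtracted from validation and test images as well.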
Loss function:
The question is how to measure how far off the classifier currently is; the standard answer is a loss function:
The value of this function quantifies how much the current predictions deviate from the true labels.
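One common choice, used for the SVM classifier in this assignment, is the multiclass SVM (hinge) loss. A minimal per-example sketch, with hand-picked scores for illustration:

```python
import numpy as np

def svm_loss_single(scores, y, delta=1.0):
    """Multiclass SVM (hinge) loss for one example: sum, over the wrong classes,
    of how far each wrong score comes within delta of the correct class's score."""
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0  # the correct class contributes no loss
    return float(margins.sum())

scores = np.array([13.0, -7.0, 11.0])  # hypothetical class scores for one image
print(svm_loss_single(scores, y=0))    # → 0.0  (correct class beats both others by >= delta)
print(svm_loss_single(scores, y=2))    # → 3.0  (class 0's score violates the margin by 3)
```

The loss is zero exactly when the correct class outscores every other class by at least the margin delta.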
Regularization:
The loss function above has a problem. Suppose we have a dataset and a set of weights W that correctly classifies every example (i.e., all the margins are satisfied, so L_i = 0 for all i). The issue is that this W is not unique: any scaled-up copy λW with λ > 1 also satisfies all the margins and achieves zero loss.
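A tiny numeric illustration of this ambiguity (W and x are hand-picked toy values, not from any real dataset): once W achieves zero hinge loss, so does 2W, and only a regularization penalty such as the L2 norm distinguishes between them.

```python
import numpy as np

def hinge_loss(scores, y, delta=1.0):
    # per-example multiclass SVM loss, as defined above
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0
    return float(margins.sum())

W = np.array([[2.0, 0.0],
              [0.0, 2.0]])   # toy weights that classify x correctly
x = np.array([3.0, 1.0])     # true class: 0
print(hinge_loss(W.dot(x), 0))               # → 0.0  W satisfies every margin
print(hinge_loss((2 * W).dot(x), 0))         # → 0.0  ...but so does 2W
print(np.sum(W * W) < np.sum((2 * W) ** 2))  # → True  the L2 penalty prefers the smaller W
```

Adding the L2 penalty λ·Σ W² to the data loss breaks the tie and expresses a preference for small, diffuse weights.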
This lab again runs on Google Colab, so we first need to make the assignment package available in the cloud environment:
All of the code below runs in Colab:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
# Switch to the assignment directory.
# enter the foldername in your Drive where you have saved the unzipped
# 'cs231n' folder containing the '.py', 'classifiers' and 'datasets' folders.
# e.g. 'cs231n/assignments/assignment1/cs231n/'
FOLDERNAME = 'assignment1/cs231n'  # this is my own Drive folder
assert FOLDERNAME is not None, "[!] Enter the foldername."
%cd drive/My\ Drive
%cp -r $FOLDERNAME ../../
%cd ../../
%cd cs231n/datasets/
!bash get_datasets.sh
%cd ../../
# Run some setup code for this notebook: import packages and configure settings.
import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt
# This is a bit of magic to make matplotlib figures appear inline in the notebook rather than in a new window.
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2
# Load the raw CIFAR-10 data.
cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
# Cleaning up variables to prevent loading data multiple times (which may cause memory issue)
try:
    del X_train, y_train
    del X_test, y_test
    print('Clear previously loaded data.')
except NameError:
    pass
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
# As a sanity check, we print out the size of the training and test data.
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
# Visualize some examples from the dataset.
# We show a few examples of training images from each class.
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes)
samples_per_class = 7
for y, cls in enumerate(classes):
    idxs = np.flatnonzero(y_train == y)
    idxs = np.random.choice(idxs, samples_per_class, replace=False)
    for i, idx in enumerate(idxs):
        plt_idx = i * num_classes + y + 1
        plt.subplot(samples_per_class, num_classes, plt_idx)
        plt.imshow(X_train[idx].astype('uint8'))
        plt.axis('off')
        if i == 0:
            plt.title(cls)
plt.show()
# Subsample the data for more efficient code execution in this exercise
num_training = 5000
mask = list(range(num_training))