Data source:
Kaggle competition: Digit Recognizer, "Learn computer vision fundamentals with the famous MNIST data."
The dataset contains tens of thousands of handwritten-digit images. The goal is to build a model from the labeled images and use it to classify the unlabeled ones. This is the classic entry-level computer vision competition, and working through it teaches the basic workflow for handling unstructured (image) data.
We choose machine learning models suited to the characteristics of image data, tackling the handwritten-digit classification problem with three different approaches: PCA + SVM, KNN, and a convolutional neural network, using common libraries such as sklearn and keras.
# Import the required packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn import neural_network
from sklearn import metrics
import math
import time
from collections import Counter
import keras
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from keras.models import Sequential
import warnings
warnings.filterwarnings('ignore')
# Load the data and inspect its basic information
PATH="E:/kaggle/digit-recognizer/"
train=pd.read_csv(PATH+'train.csv')
print(train.shape)
train.info()
(42000, 785)
train.head()
|   | label | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 785 columns
As we can see, the image data is just pixel values: each image consists of 28×28 = 784 pixels. The MNIST handwritten-digit images are grayscale, so each pixel holds an intensity value from 0 to 255 rather than just 0 or 1. We now want to classify the images from these pixel values; for modeling purposes, the 784 pixels can be treated as 784 features for predicting the target label.
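To build intuition for the data, we can reshape the 784 pixel columns of one row back into a 28×28 image and display it (a minimal sketch, not part of the original code; it assumes `train` is loaded as above):
# Reshape the first row's pixels into a 28x28 image and display it
img = train.drop('label', axis=1).iloc[0].values.reshape(28, 28)
plt.imshow(img, cmap='gray')
plt.title('label = %d' % train['label'].iloc[0])
plt.show()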
First we try a classical approach, classifying the images with an SVM. Before classification we reduce the dimensionality of the data with PCA, which lowers the computational cost.
# Split into training and validation sets
X_train=train.drop(['label'],axis='columns',inplace=False)
y_train=train['label']
X_tr,X_ts,y_tr,y_ts=train_test_split(X_train,y_train,test_size=0.30,random_state=4)
In principal component analysis, `n_components` is the most important parameter: it is the number of principal components to keep. Setting `n_components=16` gives a representation with only 16 values per sample, which greatly reduces the computation time without losing too much accuracy.
n_components = 16
t0 = time.time()
pca = PCA(n_components=n_components, svd_solver='randomized',
          whiten=True).fit(X_train)
print("done in %0.3fs" % (time.time() - t0))
X_train_pca = pca.transform(X_train)
done in 1.828s
# Histogram of the explained variance ratios
plt.hist(pca.explained_variance_ratio_, bins=n_components, log=True)
pca.explained_variance_ratio_.sum()
0.5953435812797994
From the output we can see that keeping the first 16 principal components retains about 59.5% of the variance in the data.
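If we wanted to choose `n_components` more systematically, we could fit PCA with a larger number of components and inspect the cumulative explained variance (a hedged sketch, not in the original; 100 components is an illustrative choice):
# Fit PCA with more components and plot the cumulative explained variance,
# which shows how many components are needed to retain a given share of the variance
pca_full = PCA(n_components=100, svd_solver='randomized', whiten=True).fit(X_train)
plt.plot(np.cumsum(pca_full.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()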
We use the SVM implementation that ships with sklearn to train on the data.
param_grid = { "C" : [0.1]
, "gamma" : [0.1]}
rf = SVC()
gs = GridSearchCV(estimator=rf, param_grid=param_grid, scoring='accuracy', cv=2, n_jobs=-1, verbose=1)
gs = gs.fit(X_train_pca, y_train)
print(gs.best_score_)
print(gs.best_params_)
Fitting 2 folds for each of 1 candidates, totalling 2 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 20.3s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 20.3s finished
0.9430238095238095
{'C': 0.1, 'gamma': 0.1}
bp = gs.best_params_
t0 = time.time()
clf = SVC(C=bp['C'], kernel='rbf', gamma=bp['gamma'])
clf = clf.fit(X_train_pca, y_train)
print("done in %0.3fs" % (time.time() - t0))
done in 18.860s
clf.score(pca.transform(X_ts), y_ts)
0.9568253968253968
We reach about 95.7% accuracy on our held-out data, with SVM parameters C = 0.1 and gamma = 0.1. C is the penalty (regularization) parameter: a smaller C regularizes more strongly and guards against overfitting, so choosing C appropriately gives the model its best generalization performance. gamma is the coefficient of the RBF kernel, K(x, y) = exp(-gamma * ||x - y||^2), and controls how far the influence of a single training example reaches. Note that the PCA and the final SVM above were fit on the full training set, which includes X_ts, so this validation score is somewhat optimistic; the Kaggle leaderboard score below is the more reliable estimate.
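In practice one would search a wider grid over C and gamma rather than a single candidate pair; a minimal sketch (the grid values are illustrative, not tuned results):
# A broader, hypothetical grid over the two RBF-SVM hyperparameters;
# GridSearchCV evaluates every (C, gamma) pair with cross-validation
wide_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
gs_wide = GridSearchCV(SVC(kernel='rbf'), param_grid=wide_grid,
                       scoring='accuracy', cv=3, n_jobs=-1, verbose=1)
# gs_wide.fit(X_train_pca, y_train)  # nine candidates, so much slower than the search above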
Next we produce the output in the required format, i.e. predict the actual label for each unlabeled image. The final performance can then be evaluated on Kaggle's online platform.
val = pd.read_csv(PATH+'test.csv')
pred = clf.predict(pca.transform(val))
# ImageId,Label
val['Label'] = pd.Series(pred)
val['ImageId'] = val.index +1
sub = val[['ImageId','Label']]
sub.to_csv(PATH+'submission1.csv', index=False)
The final model scores 97.1% accuracy on the leaderboard, which makes this a very efficient approach indeed.
KNN is a supervised, instance-based classification method: a sample is assigned to the class that dominates among its nearest neighbours in the sample space. Here we design and implement a simple KNN module.
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0)
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
# Function that loads the data from the raw CSV files
def load_data(data_dir):
    train_data = open(data_dir + "train.csv").read()
    train_data = train_data.split("\n")[1:-1]  # drop the header row and the trailing empty line
    train_data = [i.split(",") for i in train_data]
    X_train = np.array([[int(i[j]) for j in range(1, len(i))] for i in train_data])
    y_train = np.array([int(i[0]) for i in train_data])  # the first column is the label
    test_data = open(data_dir + "test.csv").read()
    test_data = test_data.split("\n")[1:-1]
    test_data = [i.split(",") for i in test_data]
    X_test = np.array([[int(i[j]) for j in range(0, len(i))] for i in test_data])
    return X_train, y_train, X_test
# A simple from-scratch KNN module
class simple_knn():
    def __init__(self):
        pass

    def train(self, X, y):
        # KNN has no real training step; it simply memorizes the data
        self.X_train = X
        self.y_train = y

    def predict(self, X, k=1):
        # Compute the distances between the query samples and all training samples
        dists = self.compute_distances(X)
        num_test = dists.shape[0]
        y_pred = np.zeros(num_test)
        for i in range(num_test):
            labels = self.y_train[np.argsort(dists[i, :])].flatten()
            k_closest_y = labels[:k]  # labels of the k nearest neighbours
            c = Counter(k_closest_y)
            y_pred[i] = c.most_common(1)[0][0]  # majority vote
        return y_pred

    def compute_distances(self, X):
        # Vectorized Euclidean distances via ||x - y||^2 = ||x||^2 - 2*x.y + ||y||^2
        dot_pro = np.dot(X, self.X_train.T)
        sum_square_test = np.square(X).sum(axis=1)
        sum_square_train = np.square(self.X_train).sum(axis=1)
        dists = np.sqrt(-2 * dot_pro + sum_square_train + sum_square_test.reshape(-1, 1))
        return dists
X_train, y_train, X_test = load_data(PATH)
batch_size = 2000
k = 3 # number of neighbours that vote (the KNN hyperparameter)
classifier = simple_knn()
classifier.train(X_train, y_train)
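Before predicting the whole test set, one could sanity-check the module on a small held-out slice of the training data (a hedged sketch, not in the original; it reuses the arrays loaded above):
# Train on all but the last 1000 samples and measure accuracy on the held-out slice
check_knn = simple_knn()
check_knn.train(X_train[:-1000], y_train[:-1000])
check_pred = check_knn.predict(X_train[-1000:], k)
print("held-out accuracy:", (check_pred == y_train[-1000:]).mean())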
We now call the KNN module to predict on the test data.
predictions = []
for i in range(int(len(X_test) / batch_size)):
    print("Computing batch " + str(i+1) + "/" + str(int(len(X_test) / batch_size)) + "...")
    tic = time.time()
    predts = classifier.predict(X_test[i * batch_size:(i+1) * batch_size], k)
    toc = time.time()
    predictions = predictions + list(predts)
    print("Completed this batch in " + str(toc-tic) + " Secs.")
print("Completed predicting the test data.")
Computing batch 1/14...
Completed this batch in 53.51499319076538 Secs.
Computing batch 2/14...
Completed this batch in 43.31397557258606 Secs.
Computing batch 3/14...
Completed this batch in 42.59756851196289 Secs.
Computing batch 4/14...
Completed this batch in 43.00966835021973 Secs.
Computing batch 5/14...
Completed this batch in 43.01448702812195 Secs.
Computing batch 6/14...
Completed this batch in 47.93128275871277 Secs.
Computing batch 7/14...
Completed this batch in 44.85835313796997 Secs.
Computing batch 8/14...
Completed this batch in 44.42547106742859 Secs.
Computing batch 9/14...
Completed this batch in 44.020007610321045 Secs.
Computing batch 10/14...
Completed this batch in 44.085976362228394 Secs.
Computing batch 11/14...
Completed this batch in 43.6392982006073 Secs.
Computing batch 12/14...
Completed this batch in 43.603368282318115 Secs.
Computing batch 13/14...
Completed this batch in 45.03933787345886 Secs.
Computing batch 14/14...
Completed this batch in 44.59685492515564 Secs.
Completed predicting the test data.
with open(PATH + "submission2.csv", "w") as out_file:
    out_file.write("ImageId,Label\n")
    for i in range(len(predictions)):
        out_file.write(str(i+1) + "," + str(int(predictions[i])) + "\n")
This approach reaches 97.114% accuracy, a small improvement.
Next we try the most basic neural network model, the MLP (multi-layer perceptron), using the MLPClassifier module from sklearn's neural network package to handle the image classification problem.
# Load the data
train = pd.read_csv(PATH + "train.csv")
test = pd.read_csv(PATH + "test.csv")
Y = train['label'][:10000]                 # only the first 10,000 rows are used; more rows would train better
X = train.drop(['label'], axis=1)[:10000]
x_train, x_val, y_train, y_val = train_test_split(X, Y, test_size=0.20, random_state=42)
model = neural_network.MLPClassifier(alpha=1e-5, hidden_layer_sizes=(5,), solver='lbfgs', random_state=18)
model.fit(x_train, y_train)
MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(5,), learning_rate='constant',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
random_state=18, shuffle=True, solver='lbfgs', tol=0.0001,
validation_fraction=0.1, verbose=False, warm_start=False)
We have now built the classifier shown above; feeding the validation data into it lets us check the model's performance.
predicted = model.predict(x_val)
print("Classification Report:\n %s:" % (metrics.classification_report(y_val, predicted)))
Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       186
           1       0.97      0.81      0.88       210
           2       0.12      0.99      0.21       220
           3       0.00      0.00      0.00       190
           4       0.00      0.00      0.00       188
           5       0.00      0.00      0.00       194
           6       0.00      0.00      0.00       190
           7       0.00      0.00      0.00       233
           8       0.00      0.00      0.00       197
           9       0.00      0.00      0.00       192

    accuracy                           0.19      2000
   macro avg       0.11      0.18      0.11      2000
weighted avg       0.12      0.19      0.12      2000
The MLP's classification results are poor: precision is near zero for most classes. To be fair, this owes much to the particular configuration (a tiny hidden layer of 5 units and raw, unscaled pixel inputs), but it does suggest that this simple multi-layer perceptron is not well suited to the image classification problem, and it motivates trying a different neural network model to see whether we can do better.
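As a quick check of that diagnosis, one could rescale the pixels to [0, 1] and enlarge the hidden layer (a hedged sketch, not from the original; the hyperparameters are illustrative):
# Scaled inputs and a larger hidden layer usually improve MLP accuracy
# dramatically on MNIST-style data
model2 = neural_network.MLPClassifier(alpha=1e-5, hidden_layer_sizes=(100,),
                                      solver='adam', max_iter=300, random_state=18)
model2.fit(x_train / 255.0, y_train)
print("validation accuracy:", model2.score(x_val / 255.0, y_val))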
To feed the data into the convolutional network in a suitable format, some further preprocessing is needed. In keras's CNN, the convolution and related layers already extract image features automatically, so there is no longer any need for hand-crafted feature rules.
Y = train['label']
X = train.drop(['label'], axis=1)
x_train, x_val, y_train, y_val = train_test_split(X.values, Y.values, test_size=0.10, random_state=42)
We set suitable parameters: num_classes is the number of classes, here the ten digits 0-9; the input images are 28×28 pixels; and the network processes 128 samples per batch.
# network parameters
batch_size = 128
num_classes = 10
epochs = 5 # Further Fine Tuning can be done
# input image dimensions
img_rows, img_cols = 28, 28
# preprocess the train data
x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
x_train = x_train.astype('float32')
x_train /= 255
# preprocess the validation data
x_val = x_val.reshape(x_val.shape[0], img_rows, img_cols, 1)
x_val = x_val.astype('float32')
x_val /= 255
input_shape = (img_rows, img_cols, 1)
# convert the target variable
y_train = keras.utils.to_categorical(y_train, num_classes)
y_val = keras.utils.to_categorical(y_val, num_classes)
# preprocess the test data
Xtest = test.values
Xtest = Xtest.reshape(Xtest.shape[0], img_rows, img_cols, 1)
model = Sequential()
# add first convolutional layer
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
# add second convolutional layer
model.add(Conv2D(64, (3, 3), activation='relu'))
# add one max pooling layer
model.add(MaxPooling2D(pool_size=(2, 2)))
# add one dropout layer
model.add(Dropout(0.25))
# add flatten layer
model.add(Flatten())
# add dense layer
model.add(Dense(128, activation='relu'))
# add another dropout layer
model.add(Dropout(0.5))
# add dense layer
model.add(Dense(num_classes, activation='softmax'))
# compile the model and view its architecture
model.compile(loss=keras.losses.categorical_crossentropy, optimizer=keras.optimizers.Adadelta(), metrics=['accuracy'])
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_3 (Conv2D) (None, 26, 26, 32) 320
_________________________________________________________________
conv2d_4 (Conv2D) (None, 24, 24, 64) 18496
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 12, 12, 64) 0
_________________________________________________________________
dropout_3 (Dropout) (None, 12, 12, 64) 0
_________________________________________________________________
flatten_2 (Flatten) (None, 9216) 0
_________________________________________________________________
dense_3 (Dense) (None, 128) 1179776
_________________________________________________________________
dropout_4 (Dropout) (None, 128) 0
_________________________________________________________________
dense_4 (Dense) (None, 10) 1290
=================================================================
Total params: 1,199,882
Trainable params: 1,199,882
Non-trainable params: 0
_________________________________________________________________
Using these keras building blocks we have assembled a simple CNN; next we feed in the training data and train the convolutional network.
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_data=(x_val, y_val))
accuracy = model.evaluate(x_val, y_val, verbose=0)
print('Test accuracy:', accuracy[1])
Train on 37800 samples, validate on 4200 samples
Epoch 1/5
37800/37800 [==============================] - 64s 2ms/step - loss: 0.3448 - acc: 0.8939 - val_loss: 0.0937 - val_acc: 0.9717
Epoch 2/5
37800/37800 [==============================] - 65s 2ms/step - loss: 0.1069 - acc: 0.9687 - val_loss: 0.0520 - val_acc: 0.9821
Epoch 3/5
37800/37800 [==============================] - 63s 2ms/step - loss: 0.0764 - acc: 0.9774 - val_loss: 0.0507 - val_acc: 0.9826
Epoch 4/5
37800/37800 [==============================] - 61s 2ms/step - loss: 0.0624 - acc: 0.9809 - val_loss: 0.0441 - val_acc: 0.9860
Epoch 5/5
37800/37800 [==============================] - 62s 2ms/step - loss: 0.0533 - acc: 0.9835 - val_loss: 0.0326 - val_acc: 0.9883
Test accuracy: 0.9883333333333333
Model prediction and output:
pred = model.predict(Xtest)
y_classes = pred.argmax(axis=-1)
res = pd.DataFrame()
res['ImageId'] = list(range(1,28001))
res['Label'] = y_classes
res.to_csv(PATH+"submission3.csv", index = False)
This model scores 98.1%, a further improvement over both SVM and KNN.
In the MNIST classification task, all of these models achieve good results. The SVM turns a problem that is linearly inseparable in the original low-dimensional space into one that is linearly separable in a higher-dimensional space, and it operates directly on the image data with good efficiency. KNN classifies each sample by the labels of its nearest neighbours, exploiting the common patterns shared by similar images. CNNs are even more widely used in image processing; in essence they extract increasingly abstract features from the images for classification, and with different layer arrangements and convolution kernels they classify image data very effectively.
The MNIST image data is easy to work with: each pixel is a single grayscale value, which can even be binarized to 0 or 1 (with 1 marking the inked area), and color plays no role in the classification. In practice, however, color images must be classified too; a single pixel may then need several values (for example three RGB channels) to represent it, so building and running the model become correspondingly more complex.
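For example, the same kind of Sequential architecture accepts color input simply by giving the first layer a three-channel input shape (a hedged sketch; the 32×32 size is illustrative, e.g. CIFAR-10-sized images):
# A color image has three channels, so the input shape becomes (rows, cols, 3)
color_model = Sequential()
color_model.add(Conv2D(32, kernel_size=(3, 3), activation='relu',
                       input_shape=(32, 32, 3)))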
Reference kernels:
PCA and SVM on MNIST dataset
kNN from scratch in Python at 97.1%
A Very Comprehensive Tutorial : NN + CNN