Image Processing in Practice | Handwritten Digit Recognition on the MNIST Dataset


1 Data Acquisition and Dataset Overview

Data source:

Kaggle Competition: Digit Recognizer ("Learn computer vision fundamentals with the famous MNIST data").

The dataset contains tens of thousands of handwritten digit images. The goal is to build a model on the labeled images and use it to classify the unlabeled ones. This competition is one of the most beginner-friendly in computer vision, and working through it covers the basic workflow for handling unstructured (image) data.

2 Preprocessing and Feature Extraction

Here we pick suitable machine learning models based on the characteristics of the image data, and tackle the handwritten digit classification problem with three different approaches: PCA + SVM, KNN, and a convolutional neural network (a simple MLP baseline is also tried along the way), using common libraries such as sklearn and keras.

2.1 Data Loading

# Import the required packages

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
from time import time
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn import neural_network
from sklearn import metrics
import math
from collections import Counter
import keras
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from keras.models import Sequential
import warnings
warnings.filterwarnings('ignore')
# Load the data and take a first look at its basic information
PATH = "E:/kaggle/digit-recognizer/"
train = pd.read_csv(PATH + 'train.csv')
print(train.shape)
train.info()
(42000, 785)

train.head()
label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 ... pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 4 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 785 columns

As we can see, an image here is nothing more than its pixel values: each picture consists of 28*28 = 784 pixels. The MNIST handwritten digit images are single-channel grayscale images, so each pixel holds an intensity value between 0 and 255 rather than a colour. We now want to classify the digits from these pixel values; in what follows, the 784 pixels are treated as 784 features of the target.
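To get a feel for what these features represent, a single row can be reshaped back into a 28*28 image and displayed; a small sketch using the modules already imported above:

# Reshape the first training sample back into a 28x28 image and display it
sample = train.drop(['label'], axis=1).iloc[0].values.reshape(28, 28)
plt.imshow(sample, cmap='gray')
plt.title('label = %d' % train['label'].iloc[0])
plt.show()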

2.2 Feature Extraction via PCA Dimensionality Reduction

First we try a traditional approach and classify the images with an SVM. Before classifying, we reduce the dimensionality of the data with PCA so as to cut down the computational cost.

# Train/validation split

X_train = train.drop(['label'], axis='columns', inplace=False)
y_train = train['label']
X_tr, X_ts, y_tr, y_ts = train_test_split(X_train, y_train, test_size=0.30, random_state=4)

In principal component analysis, n_components is the most important parameter: it is the number of principal components to keep. By setting n_components=16 we build a model on only 16 values per image, which drastically reduces the computation time while giving up little accuracy.

n_components = 16
t0 = time()
pca = PCA(n_components=n_components, svd_solver='randomized',
          whiten=True).fit(X_train)
print("done in %0.3fs" % (time() - t0))

X_train_pca = pca.transform(X_train)
done in 1.828s
# Histogram of the explained variance ratios
plt.hist(pca.explained_variance_ratio_, bins=n_components, log=True)
pca.explained_variance_ratio_.sum()
0.5953435812797994

[Figure: histogram of the explained variance ratios of the 16 retained principal components]

From the output we can see that keeping the first 16 principal components retains roughly 59% of the variance in the data.
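To choose n_components more systematically, one option is to fit a full PCA and inspect the cumulative explained variance curve (scikit-learn also accepts a float such as PCA(n_components=0.9) to keep enough components for a target share of the variance). A small sketch, assuming the variables defined above:

# Cumulative explained variance: how many components cover a given share of the variance
pca_full = PCA().fit(X_train)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(cum_var)
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()
print('components needed for 90% of the variance:', np.argmax(cum_var >= 0.9) + 1)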

3 Building the Models

3.1 SVM Classifier

We use the SVC implementation that ships with sklearn to train on the PCA-transformed data.


# The grid has been narrowed to a single candidate here to keep the search quick
param_grid = {"C": [0.1],
              "gamma": [0.1]}
svc = SVC()
gs = GridSearchCV(estimator=svc, param_grid=param_grid, scoring='accuracy', cv=2, n_jobs=-1, verbose=1)
gs = gs.fit(X_train_pca, y_train)

print(gs.best_score_)
print(gs.best_params_)
Fitting 2 folds for each of 1 candidates, totalling 2 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:   20.3s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:   20.3s finished


0.9430238095238095
{'C': 0.1, 'gamma': 0.1}
bp = gs.best_params_
t0 = time()
clf = SVC(C=bp['C'], kernel='rbf', gamma=bp['gamma'])
clf = clf.fit(X_train_pca, y_train)
print("done in %0.3fs" % (time() - t0))
done in 18.860s
clf.score(pca.transform(X_ts), y_ts)
0.9568253968253968

On our held-out split the model reaches about 95.7% accuracy with the parameters C=0.1 and gamma=0.1. C is the penalty (regularization) parameter: lowering C tolerates more margin violations and helps prevent overfitting, so an appropriate C gives the model its best generalization performance. gamma is the RBF kernel coefficient, which controls how far the influence of a single training sample reaches (a larger gamma gives a more local, more complex decision boundary). Note that, because PCA and the SVM were fitted on the full training set, which includes the held-out samples, this score is somewhat optimistic; the Kaggle score below is the more reliable measure.
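For reference, a wider grid over C and gamma can be searched in exactly the same way; the values below are illustrative only, not the ones behind the score reported above:

# Illustrative wider grid search over C and gamma (not the run reported above)
param_grid_wide = {"C": [0.1, 1, 10],
                   "gamma": [0.01, 0.1, 1]}
gs_wide = GridSearchCV(estimator=SVC(kernel='rbf'), param_grid=param_grid_wide,
                       scoring='accuracy', cv=3, n_jobs=-1, verbose=1)
gs_wide = gs_wide.fit(X_train_pca, y_train)
print(gs_wide.best_score_, gs_wide.best_params_)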

Next we produce the output in the required format, i.e. predict an actual label for each unlabeled test image. The final performance can then be evaluated on Kaggle's online platform.

val = pd.read_csv(PATH+'test.csv')
pred = clf.predict(pca.transform(val))
# ImageId,Label

val['Label'] = pd.Series(pred)
val['ImageId'] = val.index +1
sub = val[['ImageId','Label']]
sub.to_csv(PATH+'submission1.csv', index=False)

The submitted model scores 97.1% accuracy on the leaderboard, which makes this a rather efficient approach for how simple it is.

3.2 KNN

KNN (k-nearest neighbours) is a supervised, instance-based classification method: a sample is assigned to the class that is most common among its nearest neighbours in the sample space. Here we implement a simple KNN module from scratch.

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) 
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Function that loads the raw CSV files

def load_data(data_dir):
    train_data = open(data_dir + "train.csv").read()
    train_data = train_data.split("\n")[1:-1]
    train_data = [i.split(",") for i in train_data]
    X_train = np.array([[int(i[j]) for j in range(1,len(i))] for i in train_data])
    y_train = np.array([int(i[0]) for i in train_data])

    test_data = open(data_dir + "test.csv").read()
    test_data = test_data.split("\n")[1:-1]
    test_data = [i.split(",") for i in test_data]
    X_test = np.array([[int(i[j]) for j in range(0,len(i))] for i in test_data])

    return X_train, y_train, X_test

# A simple from-scratch KNN implementation
class simple_knn():

    def __init__(self):
        pass

    def train(self, X, y):
        # KNN is a lazy learner: "training" just stores the data
        self.X_train = X
        self.y_train = y

    def predict(self, X, k=1):
        # distances from every test sample to every training sample
        dists = self.compute_distances(X)
        num_test = dists.shape[0]
        y_pred = np.zeros(num_test)

        for i in range(num_test):
            # training labels sorted by distance to test sample i
            labels = self.y_train[np.argsort(dists[i, :])].flatten()
            # keep the k nearest neighbours and take a majority vote
            k_closest_y = labels[:k]
            c = Counter(k_closest_y)
            y_pred[i] = c.most_common(1)[0][0]

        return y_pred

    def compute_distances(self, X):
        # Vectorised Euclidean distances using the expansion
        # ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]

        dot_pro = np.dot(X, self.X_train.T)
        sum_square_test = np.square(X).sum(axis=1)
        sum_square_train = np.square(self.X_train).sum(axis=1)
        dists = np.sqrt(-2 * dot_pro + sum_square_train + np.matrix(sum_square_test).T)

        return dists
X_train, y_train, X_test = load_data(PATH)
batch_size = 2000
k = 3  # number of neighbours used in the vote (the k in KNN)
classifier = simple_knn()
classifier.train(X_train, y_train)

Use the KNN module to predict on the test data, in batches:

predictions = []
num_batches = int(len(X_test) / batch_size)
for i in range(num_batches):
    print("Computing batch " + str(i+1) + "/" + str(num_batches) + "...")
    tic = time()
    predts = classifier.predict(X_test[i * batch_size:(i+1) * batch_size], k)
    toc = time()
    predictions = predictions + list(predts)
    print("Completed this batch in " + str(toc-tic) + " Secs.")
print("Completed predicting the test data.")
Computing batch 1/14...
Completed this batch in 53.51499319076538 Secs.
Computing batch 2/14...
Completed this batch in 43.31397557258606 Secs.
Computing batch 3/14...
Completed this batch in 42.59756851196289 Secs.
Computing batch 4/14...
Completed this batch in 43.00966835021973 Secs.
Computing batch 5/14...
Completed this batch in 43.01448702812195 Secs.
Computing batch 6/14...
Completed this batch in 47.93128275871277 Secs.
Computing batch 7/14...
Completed this batch in 44.85835313796997 Secs.
Computing batch 8/14...
Completed this batch in 44.42547106742859 Secs.
Computing batch 9/14...
Completed this batch in 44.020007610321045 Secs.
Computing batch 10/14...
Completed this batch in 44.085976362228394 Secs.
Computing batch 11/14...
Completed this batch in 43.6392982006073 Secs.
Computing batch 12/14...
Completed this batch in 43.603368282318115 Secs.
Computing batch 13/14...
Completed this batch in 45.03933787345886 Secs.
Computing batch 14/14...
Completed this batch in 44.59685492515564 Secs.
Completed predicting the test data.
out_file = open(PATH+"submission2.csv", "w")
out_file.write("ImageId,Label\n")
for i in range(len(predictions)):
    out_file.write(str(i+1) + "," + str(int(predictions[i])) + "\n")
out_file.close()

This approach reaches 97.114% accuracy, a slight improvement over the SVM.
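For comparison, a minimal sketch of the same idea using scikit-learn's built-in KNeighborsClassifier (not run here; results may differ slightly from the hand-rolled version depending on how distance ties are broken):

from sklearn.neighbors import KNeighborsClassifier

# Built-in KNN classifier; n_neighbors matches the k=3 used above
knn = KNeighborsClassifier(n_neighbors=3, n_jobs=-1)
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)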

3.3 NN Model

Next we try the most basic neural network model, the MLP (multilayer perceptron), using the MLPClassifier from sklearn's neural_network module for the image classification problem.

# Reload the data
train = pd.read_csv(PATH+"train.csv")
test = pd.read_csv(PATH+"test.csv")

Y = train['label'][:10000]  # only the first 10000 rows are used here; use more rows for more training data
X = train.drop(['label'], axis = 1)[:10000]  # only the first 10000 rows are used here; use more rows for more training data
x_train, x_val, y_train, y_val = train_test_split(X, Y, test_size=0.20, random_state=42)
model = neural_network.MLPClassifier(alpha=1e-5, hidden_layer_sizes=(5,), solver='lbfgs', random_state=18)
model.fit(x_train, y_train)
MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(5,), learning_rate='constant',
              learning_rate_init=0.001, max_iter=200, momentum=0.9,
              n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
              random_state=18, shuffle=True, solver='lbfgs', tol=0.0001,
              validation_fraction=0.1, verbose=False, warm_start=False)

With the classifier above trained, we feed in the validation data to check how well the model performs.

predicted = model.predict(x_val)
print("Classification Report:\n%s" % (metrics.classification_report(y_val, predicted)))
Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00       186
           1       0.97      0.81      0.88       210
           2       0.12      0.99      0.21       220
           3       0.00      0.00      0.00       190
           4       0.00      0.00      0.00       188
           5       0.00      0.00      0.00       194
           6       0.00      0.00      0.00       190
           7       0.00      0.00      0.00       233
           8       0.00      0.00      0.00       197
           9       0.00      0.00      0.00       192

    accuracy                           0.19      2000
   macro avg       0.11      0.18      0.11      2000
weighted avg       0.12      0.19      0.12      2000

The MLP results show that this particular multilayer perceptron does poorly on the image classification problem: the precision scores are very low, which is at least partly due to the tiny hidden layer (only 5 units) and the unscaled pixel inputs. This motivates trying other neural network models to see whether they achieve better results.
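As a rough indication of how far a plain MLP can go, here is a minimal sketch of a more reasonable configuration (hypothetical, not the run reported above): scale the pixels to [0, 1] and use a much larger hidden layer with the default 'adam' solver.

# Hypothetical improved MLP: scaled inputs and a larger hidden layer
x_train_s, x_val_s = x_train / 255.0, x_val / 255.0
mlp = neural_network.MLPClassifier(hidden_layer_sizes=(256,), alpha=1e-4,
                                   solver='adam', max_iter=200, random_state=18)
mlp.fit(x_train_s, y_train)
print(metrics.classification_report(y_val, mlp.predict(x_val_s)))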

3.4 CNN

3.4.1 Data Preparation

Some additional processing is needed to feed the data into the model in the right shape. In a Keras CNN, the convolutional layers learn to extract image features automatically, so there is no longer any need to hand-craft rules for feature extraction.

Y = train['label']
X = train.drop(['label'], axis=1)

# .values is used here in place of the deprecated DataFrame.as_matrix()
x_train, x_val, y_train, y_val = train_test_split(X.values, Y.values, test_size=0.10, random_state=42)

Set suitable parameters: num_classes is the number of classes, here the ten digits 0-9; the input images are 28*28 pixels; and the network processes 128 samples per batch.

# network parameters 
batch_size = 128
num_classes = 10
epochs = 5 # Further Fine Tuning can be done

# input image dimensions
img_rows, img_cols = 28, 28
# preprocess the train data 
x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
x_train = x_train.astype('float32')
x_train /= 255

# preprocess the validation data
x_val = x_val.reshape(x_val.shape[0], img_rows, img_cols, 1)
x_val = x_val.astype('float32')
x_val /= 255

input_shape = (img_rows, img_cols, 1)

# convert the target variable 
y_train = keras.utils.to_categorical(y_train, num_classes)
y_val = keras.utils.to_categorical(y_val, num_classes)

# preprocess the test data (same reshaping and scaling as the training data)
Xtest = test.values
Xtest = Xtest.reshape(Xtest.shape[0], img_rows, img_cols, 1)
Xtest = Xtest.astype('float32')
Xtest /= 255

3.4.2 Building the Network

model = Sequential()

# add first convolutional layer
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))

# add second convolutional layer
model.add(Conv2D(64, (3, 3), activation='relu'))

# add one max pooling layer 
model.add(MaxPooling2D(pool_size=(2, 2)))

# add one dropout layer
model.add(Dropout(0.25))

# add flatten layer
model.add(Flatten())

# add dense layer
model.add(Dense(128, activation='relu'))

# add another dropout layer
model.add(Dropout(0.5))

# add dense layer
model.add(Dense(num_classes, activation='softmax'))

# compile the model and view its architecture
model.compile(loss=keras.losses.categorical_crossentropy, optimizer=keras.optimizers.Adadelta(), metrics=['accuracy'])

model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_3 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 24, 24, 64)        18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 12, 12, 64)        0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 12, 12, 64)        0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 9216)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 128)               1179776   
_________________________________________________________________
dropout_4 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 10)                1290      
=================================================================
Total params: 1,199,882
Trainable params: 1,199,882
Non-trainable params: 0
_________________________________________________________________


With these building blocks from keras we have assembled a simple CNN. Next we feed the training data into the model and train the convolutional network.
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_data=(x_val, y_val))
accuracy = model.evaluate(x_val, y_val, verbose=0)
print('Test accuracy:', accuracy[1])
Train on 37800 samples, validate on 4200 samples
Epoch 1/5
37800/37800 [==============================] - 64s 2ms/step - loss: 0.3448 - acc: 0.8939 - val_loss: 0.0937 - val_acc: 0.9717
Epoch 2/5
37800/37800 [==============================] - 65s 2ms/step - loss: 0.1069 - acc: 0.9687 - val_loss: 0.0520 - val_acc: 0.9821
Epoch 3/5
37800/37800 [==============================] - 63s 2ms/step - loss: 0.0764 - acc: 0.9774 - val_loss: 0.0507 - val_acc: 0.9826
Epoch 4/5
37800/37800 [==============================] - 61s 2ms/step - loss: 0.0624 - acc: 0.9809 - val_loss: 0.0441 - val_acc: 0.9860
Epoch 5/5
37800/37800 [==============================] - 62s 2ms/step - loss: 0.0533 - acc: 0.9835 - val_loss: 0.0326 - val_acc: 0.9883
Test accuracy: 0.9883333333333333

Model prediction and submission output:

pred = model.predict(Xtest)
y_classes = pred.argmax(axis=-1)  # most probable class for each test image
res = pd.DataFrame()
res['ImageId'] = np.arange(1, len(y_classes) + 1)
res['Label'] = y_classes
res.to_csv(PATH+"submission3.csv", index=False)

This model scores 98.1% on the leaderboard, a further improvement over both SVM and KNN.
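A common next step, sketched below but not run here, is to train for more epochs with light data augmentation via Keras' ImageDataGenerator; small random shifts, rotations and zooms often push a CNN of this kind close to or above 99% validation accuracy, though the exact gain depends on the settings.

from keras.preprocessing.image import ImageDataGenerator

# Illustrative augmentation: small geometric perturbations only (digits should not be flipped)
datagen = ImageDataGenerator(rotation_range=10, zoom_range=0.1,
                             width_shift_range=0.1, height_shift_range=0.1)
datagen.fit(x_train)
model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
                    steps_per_epoch=len(x_train) // batch_size,
                    epochs=epochs, validation_data=(x_val, y_val), verbose=1)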

4 Model Summary

On the MNIST classification task, all of these models achieve decent results. The SVM uses the kernel trick to turn a problem that is not linearly separable in the original low-dimensional space into one that is (approximately) linearly separable in a higher-dimensional space, and it can operate directly on the image data quite efficiently. KNN is a simple, instance-based supervised method that labels each sample with the majority class of its nearest neighbours, which is already enough to pick up the common patterns in the data. CNNs are the most widely used models in image processing; in essence they extract increasingly abstract features from the images and classify on top of them, and with different layers and convolution kernels they handle the image data very effectively.

The MNIST images are easy to work with: they are single-channel grayscale images, and could even be binarized into 0/1 values with 1 marking the inked region, so colour plays no role in the classification. In practice, however, colour images cannot be avoided; each pixel then needs several values (for example three RGB channels) to be represented, so building and running the model becomes correspondingly more complex.
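For instance, in a Keras CNN like the one above, a colour input only changes the number of channels in input_shape, but each image then carries three times as much raw pixel data; a hypothetical first layer for 28*28 RGB images would look like this:

# Hypothetical first convolutional layer for 28x28 RGB (3-channel) images
model_rgb = Sequential()
model_rgb.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 3)))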

Reference kernels:

PCA and SVM on MNIST dataset

kNN from scratch in Python at 97.1%

A Very Comprehensive Tutorial : NN + CNN
