Kaggle Digit Recognizer: handwritten digit recognition on MNIST data with sklearn

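1. The handwritten digit dataset

The handwritten digit dataset (MNIST) is one of the best-known datasets in machine learning.

Description and download: http://yann.lecun.com/exdb/mnist/

The training and test sets used here: https://www.kaggle.com/c/digit-recognizer/data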

2. Solution (with detailed comments)

# coding=utf-8
import numpy
from sklearn.decomposition import PCA
from sklearn.svm import SVC

COMPONENT_NUM = 35  # number of dimensions to keep after PCA

print('Read training data...')
with open('train.csv', 'r') as reader:
    reader.readline()  # skip the header row
    train_label = []
    train_data = []
    for line in reader.readlines():
        # map() applies int to each comma-separated field; list() collects the results
        data = list(map(int, line.rstrip().split(',')))
        train_label.append(data[0])
        train_data.append(data[1:])
print('Loaded ' + str(len(train_label)))

print('Reduction...')
train_label = numpy.array(train_label)  # convert the lists to numpy arrays
train_data = numpy.array(train_data)
print('Shape before PCA:', train_data.shape)  # dimensions of the original data set
pca = PCA(n_components=COMPONENT_NUM, whiten=True)
pca.fit(train_data)  # fit the PCA model on the training data
train_data = pca.transform(train_data)  # project the training data onto the fitted components
print('Shape after PCA:', train_data.shape)  # dimensions after the reduction

print('Train SVM...')
svc = SVC()
svc.fit(train_data, train_label)  # train the SVM

print('Read testing data...')
with open('test.csv', 'r') as reader:  # load the test set
    reader.readline()
    test_data = []
    for line in reader.readlines():
        pixels = list(map(int, line.rstrip().split(',')))
        test_data.append(pixels)
print('Loaded ' + str(len(test_data)))

print('Predicting...')
test_data = numpy.array(test_data)
test_data = pca.transform(test_data)
predict = svc.predict(test_data)

print('Saving...')  # save the predictions
with open('predict.csv', 'w') as writer:
    writer.write('"ImageId","Label"\n')
    count = 0
    for p in predict:
        count += 1
        writer.write(str(count) + ',"' + str(p) + '"\n')
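
The script fixes COMPONENT_NUM at 35 without saying how that value was chosen. One common heuristic, which is not necessarily the one used here, is to keep the smallest number of components whose cumulative explained variance reaches some target. The sketch below illustrates this under several assumptions: `components_for_variance` is a hypothetical helper, the 90% target is arbitrary, and sklearn's bundled 8x8 digits stand in for the Kaggle files so the example runs without `train.csv`.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data (assumption): sklearn's bundled 8x8 digits, 64 features per image.
# With the Kaggle files, the 784-column train_data array would be used instead.
X, _ = load_digits(return_X_y=True)


def components_for_variance(data, target=0.90):
    """Hypothetical helper: smallest number of principal components whose
    cumulative explained variance ratio reaches `target`."""
    pca = PCA().fit(data)  # keep every component so the full spectrum is available
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # searchsorted gives the first index whose cumulative value is >= target;
    # add 1 to turn the index into a count, and clamp to the number of components.
    count = int(np.searchsorted(cumulative, target)) + 1
    return min(count, len(cumulative))


n = components_for_variance(X, 0.90)
print('Components needed for 90% of the variance:', n)
```

On the real 784-pixel data the resulting number will differ, so it is worth rerunning the helper on `train_data` rather than reusing a value found on the stand-in set.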

3. Results

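A single training run reaches a prediction accuracy of 98.2%. There are many other ways to approach handwritten digit recognition: training several classifiers and combining them with boosting also works and can be more accurate, and a CNN (convolutional neural network) can achieve very good results as well.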
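
The script itself never measures accuracy, so a figure like the one above has to come from outside it, presumably the Kaggle leaderboard (an assumption). A minimal sketch of estimating accuracy locally is to hold out part of the labeled data and score the same PCA + SVC combination on it. The 80/20 split and the fixed `random_state` are illustrative choices, and sklearn's bundled 8x8 digits again stand in for `train.csv`, so the number printed here is not comparable to the Kaggle score.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Stand-in data (assumption): sklearn's bundled digits instead of the Kaggle train.csv.
X, y = load_digits(return_X_y=True)

# Hold out 20% of the labeled samples; stratify keeps the class balance in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Same two steps as the script: whitened PCA down to 35 components, then a default SVC.
# The pipeline fits PCA on the training part only and reuses it for the held-out part.
model = make_pipeline(PCA(n_components=35, whiten=True, random_state=0), SVC())
model.fit(X_train, y_train)

val_accuracy = model.score(X_val, y_val)  # mean accuracy on the held-out samples
print('Validation accuracy: %.3f' % val_accuracy)
```

Wrapping both steps in one pipeline keeps the held-out samples out of the PCA fit, which mirrors how the script applies the already fitted `pca` to the test set.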
