Original: https://blog.csdn.net/zouxy09/article/details/48903179
Reference: https://www.cnblogs.com/upright/p/4191757.html
1. Overview
Taking classification as an example, algorithms fall roughly into two camps: linear and non-linear. Linear methods include the well-known logistic regression, naive Bayes, and maximum entropy models; non-linear methods include random forests, decision trees, neural networks, kernel machines, and so on. The selling point of linear methods is efficient training and prediction, but the final quality depends heavily on the features: the data must be (roughly) linearly separable in feature space. Using a linear method therefore demands real effort in feature engineering, selecting, transforming, and combining features so that they become discriminative. Non-linear methods are more powerful in that they can model complex decision boundaries and thus fit the data better.
Given a fixed set of features, which machine learning algorithm will give the best result? The machine learning community is a strong one, and the consensus among programmers is: don't reinvent the wheel. For the more mature algorithms there are excellent libraries that can be used directly, saving most of the research effort.
Since Python is what we use most these days, the obvious choice is scikit-learn, the best-known machine learning library in the Python world. It has many strengths: it is simple to use, its interfaces are very well abstracted, and its documentation is genuinely impressive. In this post we wrap a number of its classification algorithms so they can all be tested in a single run, which makes it easy to compare them and pick the best. Of course, for any particular algorithm, hyperparameter tuning also matters a great deal.
2. Scikit-learn in practice with Python
scikit-learn is already bundled with Anaconda; you can also install it from the source package on the official site. The code in this post wraps the following machine learning algorithms; change the data-loading function and you can test them all in one go:
classifiers = {'NB': naive_bayes_classifier,
               'KNN': knn_classifier,
               'LR': logistic_regression_classifier,
               'RF': random_forest_classifier,
               'DT': decision_tree_classifier,
               'SVM': svm_classifier,
               'SVMCV': svm_cross_validation,
               'GBDT': gradient_boosting_classifier
               }
mnist_traintest.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
__Author__ : Du ZH
__Create_time__ : 2018/11/5 上午10:56
__File__ : mnist_traintest.py
__Software__ : PyCharm
用MNIST数据集试用sklearn不同分类算法 OK
"""
import sys
import os
import time
from sklearn import metrics
import numpy as np
import pickle
# Multinomial Naive Bayes Classifier
def naive_bayes_classifier(train_x, train_y):
    from sklearn.naive_bayes import MultinomialNB
    model = MultinomialNB(alpha=0.01)
    model.fit(train_x, train_y)
    return model

# KNN Classifier
def knn_classifier(train_x, train_y):
    from sklearn.neighbors import KNeighborsClassifier
    model = KNeighborsClassifier()
    model.fit(train_x, train_y)
    return model

# Logistic Regression Classifier
def logistic_regression_classifier(train_x, train_y):
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression(penalty='l2')
    model.fit(train_x, train_y)
    return model

# Random Forest Classifier
def random_forest_classifier(train_x, train_y):
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier(n_estimators=8)
    model.fit(train_x, train_y)
    return model

# Decision Tree Classifier
def decision_tree_classifier(train_x, train_y):
    from sklearn import tree
    model = tree.DecisionTreeClassifier()
    model.fit(train_x, train_y)
    return model

# GBDT (Gradient Boosting Decision Tree) Classifier
def gradient_boosting_classifier(train_x, train_y):
    from sklearn.ensemble import GradientBoostingClassifier
    model = GradientBoostingClassifier(n_estimators=200)
    model.fit(train_x, train_y)
    return model

# SVM Classifier
def svm_classifier(train_x, train_y):
    from sklearn.svm import SVC
    model = SVC(kernel='rbf', probability=True)
    model.fit(train_x, train_y)
    return model
# SVM Classifier using cross validation
def svm_cross_validation(train_x, train_y):
    # Note: sklearn.grid_search was removed in scikit-learn 0.20;
    # GridSearchCV now lives in sklearn.model_selection.
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC
    model = SVC(kernel='rbf', probability=True)
    param_grid = {'C': [1e-3, 1e-2, 1e-1, 1, 10, 100, 1000], 'gamma': [0.001, 0.0001]}
    grid_search = GridSearchCV(model, param_grid, n_jobs=1, verbose=1)
    grid_search.fit(train_x, train_y)
    best_parameters = grid_search.best_estimator_.get_params()
    for para, val in best_parameters.items():
        print(para, val)
    # Refit on the full training set using the best hyperparameters found by the grid search
    model = SVC(kernel='rbf', C=best_parameters['C'], gamma=best_parameters['gamma'], probability=True)
    model.fit(train_x, train_y)
    return model
def read_data(data_file):
    import gzip
    f = gzip.open(data_file, "rb")
    # train, val, test = pickle.load(f)
    # Fix: use a pickle._Unpickler with encoding="bytes" so that a pickle file
    # written by Python 2 can be read correctly under Python 3.
    Myunpickle = pickle._Unpickler(file=f, fix_imports=True, encoding="bytes", errors="strict")
    train, val, test = Myunpickle.load()
    f.close()
    train_x = train[0]
    train_y = train[1]
    test_x = test[0]
    test_y = test[1]
    return train_x, train_y, test_x, test_y
if __name__ == '__main__':
    data_file = "/Users/Downloads/mnist.pkl.gz"  # "mnist.pkl.gz"
    thresh = 0.5
    model_save_file = None
    model_save = {}
    test_classifiers = ['NB', 'KNN', 'LR', 'RF', 'DT', 'SVM', 'GBDT']
    classifiers = {'NB': naive_bayes_classifier,
                   'KNN': knn_classifier,
                   'LR': logistic_regression_classifier,
                   'RF': random_forest_classifier,
                   'DT': decision_tree_classifier,
                   'SVM': svm_classifier,
                   'SVMCV': svm_cross_validation,
                   'GBDT': gradient_boosting_classifier
                   }

    print('reading training and testing data...')
    train_x, train_y, test_x, test_y = read_data(data_file)
    num_train, num_feat = train_x.shape
    num_test, num_feat = test_x.shape
    is_binary_class = (len(np.unique(train_y)) == 2)
    print('******************** Data Info *********************')
    print('#training data: %d, #testing_data: %d, dimension: %d' % (num_train, num_test, num_feat))

    for classifier in test_classifiers:
        print('******************* %s ********************' % classifier)
        start_time = time.time()
        model = classifiers[classifier](train_x, train_y)
        print('training took %fs!' % (time.time() - start_time))
        predict = model.predict(test_x)
        if model_save_file is not None:
            model_save[classifier] = model
        if is_binary_class:
            precision = metrics.precision_score(test_y, predict)
            recall = metrics.recall_score(test_y, predict)
            print('precision: %.2f%%, recall: %.2f%%' % (100 * precision, 100 * recall))
        accuracy = metrics.accuracy_score(test_y, predict)
        print('accuracy: %.2f%%' % (100 * accuracy))

    if model_save_file is not None:
        pickle.dump(model_save, open(model_save_file, 'wb'))
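As mentioned above, testing on a different dataset only requires swapping out read_data(). Here is a minimal sketch of a drop-in replacement, using scikit-learn's bundled digits dataset and train_test_split; these are illustrative choices, not what the original post uses:

def read_digits_data(data_file=None):
    # Hypothetical replacement loader: any function that returns
    # (train_x, train_y, test_x, test_y) plugs straight into the test loop above.
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    digits = load_digits()  # 1797 samples of 8x8 grayscale digit images
    train_x, test_x, train_y, test_y = train_test_split(
        digits.data, digits.target, test_size=0.2, random_state=0)
    return train_x, train_y, test_x, test_y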
3. The dataset
This experiment uses the MNIST handwritten-digit dataset, http://deeplearning.net/data/mnist/mnist.pkl.gz, with 50,000 training samples and 10,000 test samples.
MNIST consists of images of handwritten digits, split into 60,000 training examples and 10,000 test examples. In many papers, as well as in this tutorial, the official training set of 60,000 examples is divided into an actual training set of 50,000 examples and a validation set of 10,000 examples (used to select hyperparameters such as the learning rate and the model size). All digit images have been size-normalized and centered in fixed 28×28-pixel images. In the original dataset each pixel is represented by a value between 0 and 255, where 0 is black, 255 is white, and anything in between is a shade of gray.
Figure: some examples of MNIST digits (image omitted here).
For convenience the dataset has been preprocessed to make it easier to use from Python; it can be downloaded at the link above. The pickled file contains a tuple of three lists: the training set, the validation set, and the test set. Each of the three is a pair consisting of a list of images and a list of class labels, one per image. An image is represented as a 1-dimensional array of 784 (28×28) float values between 0 and 1 (0 for black, 1 for white). The labels are digits between 0 and 9 indicating which digit the image represents. The code block below shows how to load the dataset.
import cPickle, gzip, numpy
# Load the dataset (Python 2; under Python 3 use pickle.load(f, encoding='bytes')
# or the _Unpickler approach shown in read_data() above)
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f)
f.close()
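As a quick sanity check (a small sketch; the expected shapes follow directly from the description above), each split is simply a pair of arrays:

train_images, train_labels = train_set
print(train_images.shape)   # expected: (50000, 784), float values in [0, 1]
print(train_labels.shape)   # expected: (50000,), integer labels 0-9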
When using the dataset we usually split it into minibatches (see the notes on stochastic gradient descent). You are encouraged to store the dataset in shared variables and to access a minibatch by its index, given a fixed, known batch size. The reason for shared variables has to do with the GPU: copying data into GPU memory carries a large overhead. If data were copied on demand (each minibatch copied individually whenever it is needed, which is what happens without shared variables), the GPU code would not be much faster than the CPU code, and might even be slower, because of that overhead. If the data sit in a Theano shared variable, however, the entire dataset can be copied to the GPU in a single call when the shared variable is constructed. Afterwards the GPU can access any minibatch by taking a slice of this shared variable, without copying anything from CPU memory, which bypasses the overhead. Because data points and their labels usually have different natures (labels are usually integers while data points are real numbers), we suggest using separate variables for data and labels. We also recommend separate variables for the training, validation, and test sets, which makes the code more readable (giving six shared variables in total).
Since the data now live in one variable and a minibatch is defined as a slice of that variable, it becomes natural to define a minibatch by its index and its size. In our setup the batch size stays constant throughout the run, so a function really only needs the index to identify which data points to work on. The code below shows how to store the data and how to access a minibatch:
import theano
import theano.tensor as T

def shared_dataset(data_xy):
    """ Function that loads the dataset into shared variables

    The reason we store our dataset in shared variables is to allow
    Theano to copy it into the GPU memory (when code is run on GPU).
    Since copying data into the GPU is slow, copying a minibatch every time
    it is needed (the default behaviour if the data is not in a shared
    variable) would lead to a large decrease in performance.
    """
    data_x, data_y = data_xy
    shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
    shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
    # When storing data on the GPU it has to be stored as floats,
    # therefore we will store the labels as ``floatX`` as well
    # (``shared_y`` does exactly that). But during our computations
    # we need them as ints (we use labels as indices, and if they are
    # floats it doesn't make sense), therefore instead of returning
    # ``shared_y`` we will have to cast it to int. This little hack
    # lets us get around this issue.
    return shared_x, T.cast(shared_y, 'int32')

test_set_x, test_set_y = shared_dataset(test_set)
valid_set_x, valid_set_y = shared_dataset(valid_set)
train_set_x, train_set_y = shared_dataset(train_set)

batch_size = 500  # size of the minibatch
# accessing the third minibatch of the training set
data = train_set_x[2 * 500: 3 * 500]
label = train_set_y[2 * 500: 3 * 500]
The data have to be stored on the GPU as floats (the right dtype for GPU storage is given by theano.config.floatX). To get around this limitation for the labels, we store them as floats and then cast them back to int.
------------------------------------------------------------------------
Note: if you run the code on a GPU and the dataset is too large to fit in GPU memory, the code will crash. In that case you can still use shared variables, but store only a sufficiently small chunk of the data (a few minibatches) in the shared variable and use that chunk during training; once you have worked through the chunk, update the values it stores. This minimizes the number of data transfers between CPU memory and GPU memory.
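A minimal sketch of that chunking idea (assuming the Theano shared variables and arrays from above; the chunk size and loop structure are illustrative, not part of the original tutorial):

chunk_size = 10 * batch_size   # a few minibatches per chunk (arbitrary choice)
train_images, train_labels = train_set   # the raw NumPy arrays loaded earlier

# Shared variables holding only the current chunk on the GPU.
chunk_x = theano.shared(numpy.asarray(train_images[:chunk_size], dtype=theano.config.floatX))
chunk_y = theano.shared(numpy.asarray(train_labels[:chunk_size], dtype=theano.config.floatX))

for start in range(0, len(train_images), chunk_size):
    # One device copy per chunk instead of one per minibatch.
    chunk_x.set_value(numpy.asarray(train_images[start:start + chunk_size],
                                    dtype=theano.config.floatX))
    chunk_y.set_value(numpy.asarray(train_labels[start:start + chunk_size],
                                    dtype=theano.config.floatX))
    # ... iterate over the minibatches inside this chunk and call the training function ...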
4. Test results
/Applications/anaconda3/bin/python /Users/PycharmProjects/pic_dir/mnist_traintest.py
reading training and testing data...
******************** Data Info *********************
#training data: 50000, #testing_data: 10000, dimension: 784
******************* NB ********************
training took 0.801910s!
accuracy: 83.69%
******************* KNN ********************
training took 15.772094s!
accuracy: 96.64%
******************* LR ********************
training took 57.367839s!
accuracy: 92.00%
******************* RF ********************
training took 2.246345s!
accuracy: 93.86%
******************* DT ********************
training took 14.266614s!
accuracy: 87.18%
******************* SVM ********************
training took 2796.227941s!
accuracy: 94.35%
******************* GBDT ********************
training took 4174.791769s!
accuracy: 96.18%
Process finished with exit code 0
On this dataset, KNN does quite well because the classes form fairly clean clusters (if you know this dataset, a look at its t-SNE map makes that obvious; the task is so easy that the deep learning community treats MNIST as a toy dataset). GBDT is an excellent algorithm and regularly shows up among the top finishers in Kaggle and other data competitions. The proverb that "three cobblers together beat Zhuge Liang" (an ensemble of weak learners can beat a single strong one) holds up once again, especially when the weak learners complement one another.
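For reference, a minimal sketch of how such a t-SNE map could be produced with scikit-learn (the subsample size of 2000 is an arbitrary choice to keep the run time manageable; this sketch is not part of the original experiment):

from sklearn.manifold import TSNE

# Embed a random subsample of the training images into 2D; plotting the result
# colored by train_y[idx] shows the well-separated digit clusters mentioned above.
idx = np.random.RandomState(0).choice(num_train, 2000, replace=False)
embedding = TSNE(n_components=2, random_state=0).fit_transform(train_x[idx])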
Another approach that works very well in practice is to fuse these classifiers and make the final decision from their combined outputs; even simple voting usually gives good results. It is well worth trying in your own work.
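A minimal sketch of such a fusion with scikit-learn's VotingClassifier (the choice of base models and of hard voting here is illustrative, not the setup used in the original post):

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Majority ('hard') vote over three complementary base classifiers.
voting = VotingClassifier(estimators=[('knn', KNeighborsClassifier()),
                                      ('rf', RandomForestClassifier(n_estimators=8)),
                                      ('gbdt', GradientBoostingClassifier(n_estimators=200))],
                          voting='hard')
voting.fit(train_x, train_y)
print('voting accuracy: %.2f%%' % (100 * metrics.accuracy_score(test_y, voting.predict(test_x))))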