Most of the datasets used below can be loaded directly from sklearn.datasets; the remaining ones (breast cancer, Titanic) are fetched from public URLs in the code.
For evaluation, the F1 score is generally used. As the harmonic mean of precision and recall, it not only averages the two but also gives higher scores to models whose precision and recall are closer to each other.
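A quick numeric sketch of why the harmonic mean behaves this way (the precision/recall values are made up for illustration):

# F1 = 2 * P * R / (P + R): the harmonic mean of precision (P) and recall (R).
def f1(p, r):
    return 2 * p * r / (p + r)

# Two models with the same arithmetic mean of precision and recall (0.5):
print(f1(0.5, 0.5))  # balanced model  -> 0.5
print(f1(0.9, 0.1))  # lopsided model  -> 0.18; the harmonic mean penalizes the imbalance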
A linear classifier makes its class decision by accumulating the products of each feature dimension with its corresponding weight.
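A minimal sketch of that decision rule for the binary case (the weights, bias, and sample here are made-up values, not learned ones):

import numpy as np
w = np.array([0.8, -0.3, 0.5])   # one weight per feature dimension (made-up values)
b = -0.2                         # bias term
x = np.array([1.0, 2.0, 0.5])    # a single sample
score = np.dot(w, x) + b         # weighted sum across all dimensions: 0.25
label = 1 if score >= 0 else 0   # threshold the score to decide the class
print(score, label)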
Linear classifiers are among the most basic and widely used models, and their performance is usually taken as a baseline. The code below uses LogisticRegression and SGDClassifier.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
column_names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
'Normal Nucleoli', 'Mitoses', 'Class']
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/'
'breast-cancer-wisconsin/breast-cancer-wisconsin.data',
names = column_names )
# The raw data marks missing values with '?'; convert them to NaN, then drop incomplete rows.
data = data.replace('?', np.nan)
data = data.dropna(how='any')
# print(data.shape)
# Randomly sample 25% of the data for testing; the remaining 75% builds the training set.
X_train, X_test, y_train, y_test = train_test_split(data[column_names[1:10]],
data[column_names[10]], test_size=0.25, random_state=33)
# print(y_train.value_counts())
# print(y_test.value_counts())
# Standardize features: fit the scaler on the training data only, then apply the same scaling to the test data.
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
# Initialize both linear models with default parameters, then fit and predict.
lr = LogisticRegression()
sgdc = SGDClassifier()
lr.fit(X_train, y_train)
lr_y_predict = lr.predict(X_test)
sgdc.fit(X_train, y_train)
sgdc_y_predict = sgdc.predict(X_test)
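The code above produces predictions but never scores them. A minimal evaluation sketch, reusing the variables defined above (in this dataset class 2 is benign and class 4 malignant):

from sklearn.metrics import classification_report
print('Accuracy of LogisticRegression:', lr.score(X_test, y_test))
print(classification_report(y_test, lr_y_predict, target_names=['Benign', 'Malignant']))
print('Accuracy of SGDClassifier:', sgdc.score(X_test, y_test))
print(classification_report(y_test, sgdc_y_predict, target_names=['Benign', 'Malignant']))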
An SVM searches, according to the distribution of the training samples, for the best among all possible linear classifiers. The data points that actually determine this optimal linear decision boundary are called support vectors.
The data used here is the handwritten digits dataset.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
digits = load_digits()
print(digits.data.shape)
# Randomly sample 25% of the data for testing; the remaining 75% builds the training set.
X_train, X_test, y_train, y_test = train_test_split(digits.data,
digits.target, test_size=0.25, random_state=33)
print(y_train.shape)
print(X_train.shape)
# Standardize: fit the scaler on the training set and reuse the same transform on the test set
# (fitting the scaler on the test set would leak test statistics into the scaling).
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
lsvc = LinearSVC()
lsvc.fit(X_train, y_train)
y_predict = lsvc.predict(X_test)
print('The Accuracy of Linear SVC is', lsvc.score(X_test, y_test))
print(classification_report(y_test,y_predict,target_names=digits.target_names.astype(str)))
Part of what makes support vector machines popular is their elegant model assumption: from massive, even high-dimensional data, they can single out the small number of training samples that matter most for the prediction task. This not only reduces the memory the data requires but also improves predictive performance. The price is a higher computational cost (CPU resources and training time).
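LinearSVC does not expose which samples became support vectors, but sklearn.svm.SVC with a linear kernel does. A small sketch reusing X_train/X_test from the digits example above:

from sklearn.svm import SVC
svc = SVC(kernel='linear')
svc.fit(X_train, y_train)
# Only a fraction of the training samples is retained as support vectors.
print('training samples:', X_train.shape[0])
print('support vectors:', svc.support_vectors_.shape[0])
print('accuracy:', svc.score(X_test, y_test))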
Naive Bayes is a very simple yet highly practical classification model. It considers the conditional probability of each feature dimension independently, then combines these probabilities to predict a class for the feature vector as a whole.
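A toy sketch of that combination step with made-up counts, for two classes and binary word features (a Bernoulli-style illustration, not the MultinomialNB used below):

import numpy as np
# Made-up conditional probabilities P(word | class) for three words, two classes.
p_word_given_spam = np.array([0.8, 0.6, 0.1])
p_word_given_ham = np.array([0.1, 0.3, 0.7])
p_spam, p_ham = 0.4, 0.6         # made-up class priors
x = np.array([1, 1, 0])          # a document containing words 0 and 1 but not word 2

def score(p_word, prior):
    # Multiply each dimension's conditional probability independently (the "naive" step),
    # using p for present words and (1 - p) for absent ones, then weight by the prior.
    probs = np.where(x == 1, p_word, 1 - p_word)
    return prior * probs.prod()

print('spam score:', score(p_word_given_spam, p_spam))  # 0.4 * 0.8 * 0.6 * 0.9 = 0.1728
print('ham score:', score(p_word_given_ham, p_ham))     # 0.6 * 0.1 * 0.3 * 0.3 = 0.0054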
It is well suited to text classification (e.g. internet news categorization and spam filtering).
The example uses the 20 newsgroups text dataset.
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
news = fetch_20newsgroups(subset='all')
print(len(news.data))
print(news.data[0])
X_train, X_test, y_train, y_test = train_test_split(news.data,
news.target, test_size=0.25, random_state=33)
# Convert raw text into bag-of-words count vectors; the vocabulary is fit on the training set only.
vec = CountVectorizer()
X_train = vec.fit_transform(X_train)
X_test = vec.transform(X_test)
mnb = MultinomialNB()
mnb.fit(X_train,y_train)
y_predict = mnb.predict(X_test)
print("Accuracy:",mnb.score(X_test,y_test))
print(classification_report(y_test, y_predict, target_names=news.target_names))
K-nearest neighbors (KNN) is a very intuitive machine learning model. It has no parameter-training phase: classification is deferred to prediction time, when the k closest training samples vote on the label.
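A bare-bones sketch of that idea with NumPy (Euclidean distance plus majority vote; a simplification of what KNeighborsClassifier does):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    # Distances from the query point to every training sample; nothing was "trained".
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]     # indices of the k closest training samples
    votes = Counter(y_train[nearest])   # majority vote among their labels
    return votes.most_common(1)[0][0]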
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# K-nearest neighbors classifier
from sklearn.neighbors import KNeighborsClassifier
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data,
iris.target, test_size=0.25, random_state=33)
# print(iris.data.shape)
# print(iris.DESCR)
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
# Use the K-nearest neighbors classifier
knc = KNeighborsClassifier()
knc.fit(X_train,y_train)
y_predict = knc.predict(X_test)
print("Accuracy:", knc.score(X_test,y_test))
print(classification_report(y_test,y_predict,target_names=iris.target_names))
This part uses gradient boosted decision trees, a random forest classifier, and a single decision tree. The random forest classifier is often used as a baseline system.
Ensemble models can combine multiple kinds of models, or build many instances of a single kind of model; the resulting composite model tends to deliver higher performance and better stability.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')
# Manually select pclass, age, and sex as the features for predicting whether a passenger survived.
# print(titanic.head())
X = titanic[['pclass', 'age', 'sex']].copy()  # .copy() avoids pandas' SettingWithCopyWarning on the fillna below
y = titanic['survived']
# Fill missing ages with the mean age of all passengers: this keeps training feasible
# while distorting the prediction task as little as possible.
X['age'].fillna(X['age'].mean(), inplace=True)
print(X.info())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 33)
# DictVectorizer turns each row's dict of feature values into a numeric vector,
# one-hot encoding the categorical 'pclass' and 'sex' columns.
vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
X_test = vec.transform(X_test.to_dict(orient='records'))
# Fit three models on the identical training split: a single decision tree,
# a random forest, and gradient boosted trees.
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
dtc_y_pred = dtc.predict(X_test)
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
rfc_y_pred = rfc.predict(X_test)
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
gbc_y_pred = gbc.predict(X_test)
# Report the single decision tree's test-set accuracy, plus detailed precision, recall, and F1 metrics.
print('The accuracy of decision tree is', dtc.score(X_test, y_test))
print(classification_report(y_test, dtc_y_pred))
# Report the random forest classifier's test-set accuracy, plus detailed precision, recall, and F1 metrics.
print('The accuracy of random forest classifier is', rfc.score(X_test, y_test))
print(classification_report(y_test, rfc_y_pred))
# Report the gradient boosting classifier's test-set accuracy, plus detailed precision, recall, and F1 metrics.
print('The accuracy of gradient tree boosting is', gbc.score(X_test, y_test))
print(classification_report(y_test, gbc_y_pred))
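As a footnote to the ensemble idea above, here is a minimal sketch (not in the original) of combining the three models' test-set predictions by simple majority vote, reusing the variables defined above:

import numpy as np
from collections import Counter
# Stack the three prediction vectors and take the per-sample majority label.
votes = np.vstack([dtc_y_pred, rfc_y_pred, gbc_y_pred])
ensemble_pred = np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
print('Majority-vote accuracy:', (ensemble_pred == y_test.values).mean())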