I recently came across an excellent machine learning algorithm walkthrough on Kaggle. The workflow was clear and the content thorough, so I translated it to share here. The author has since deleted the original for unknown reasons, so I cannot link to it. The post works through each algorithm mainly via code practice; for the underlying theory, see the reference links in the text (which I added myself) for easy lookup.
We will use the classic Iris dataset. It contains information on three different species of iris flowers: Iris setosa, Iris versicolor, and Iris virginica.
The dataset contains measurements of four variables: sepal length, sepal width, petal length, and petal width, all in centimeters.
The Iris dataset has several interesting properties: it holds 150 samples, 50 per species, and one species (setosa) is linearly separable from the other two, while versicolor and virginica are not linearly separable from each other.
#python {cmd="G:\\Anaconda3\\python.exe" output='html'}
# Packages to load; print the version of each library
import warnings
warnings.filterwarnings('ignore')
# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy as np # linear algebra
print('numpy: {}'.format(np.__version__))
# matplotlib
import matplotlib
import matplotlib.pyplot as plt
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
print('pandas: {}'.format(pd.__version__))
# seaborn
import seaborn as sns
print('seaborn: {}'.format(sns.__version__))
sns.set(color_codes=True)
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
# Metrics used to evaluate the classifiers later in the post
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
Data collection is the process of gathering and measuring data, information, or any variables of interest in a standardized, established way that enables the collector to answer or test hypotheses and evaluate the outcomes of a particular collection.
The Iris dataset consists of three different species of iris (setosa, versicolor, and virginica) with their petal and sepal measurements, stored in a 150x4 numpy.ndarray.
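The same data also ships with scikit-learn, so the 150x4 array is easy to verify. A quick sketch (this uses sklearn.datasets rather than the CSV file read in the rest of the post):

#python {cmd="G:\\Anaconda3\\python.exe" output='html'}
from sklearn.datasets import load_iris
iris = load_iris()
print(type(iris.data))    # <class 'numpy.ndarray'>
print(iris.data.shape)    # (150, 4)
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']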
Reading the data
Read the data and inspect its type, shape, and basic information.
#python {cmd="G:\\Anaconda3\\python.exe" output='html'}
# import Dataset to play with it
import pandas as pd
dataset = pd.read_csv('../data/Iris.csv')
print(type(dataset))
print(dataset.shape)
print(dataset.size)
print(dataset.info())
print(dataset['Species'].unique())
print(dataset["Species"].value_counts())
print(dataset.head(5))   # first rows
print(dataset.tail())    # last rows
print(dataset.sample(5)) # random sample of rows
print(dataset.describe())
print(dataset.where(dataset['Species']=='Iris-setosa'))
print(dataset[dataset['SepalLengthCm']>7.2])
#python {cmd="G:\\Anaconda3\\python.exe" output='html' matplotlib=true}
import pandas as pd
dataset = pd.read_csv('../data/Iris.csv')
import seaborn as sns
import matplotlib.pyplot as plt
# Modify the graph above by assigning each species an individual color.
sns.FacetGrid(dataset, hue="Species", size=5) \
.map(plt.scatter, "SepalLengthCm", "SepalWidthCm") \
.add_legend()
plt.show()
For details of this function, see: http://seaborn.pydata.org/generated/seaborn.FacetGrid.map.html#seaborn.FacetGrid.map
#python {cmd="G:\\Anaconda3\\python.exe" output='html' matplotlib=true}
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_csv('../data/Iris.csv')
dataset.plot(kind='box', subplots=True, layout=(2,3), sharex=False, sharey=False)
plt.show()
For details of this function, see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html?highlight=plot#pandas.DataFrame.plot
#python {cmd="G:\\Anaconda3\\python.exe" output='html' matplotlib=true}
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('../data/Iris.csv')
sns.boxplot(x="Species", y="PetalLengthCm", data=dataset )
plt.show()
#python {cmd="G:\\Anaconda3\\python.exe" output='html' matplotlib=true}
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('../data/Iris.csv')
ax= sns.boxplot(x="Species", y="PetalLengthCm", data=dataset)
ax= sns.stripplot(x="Species", y="PetalLengthCm", data=dataset, jitter=True, edgecolor="gray")
plt.show()
#python {cmd="G:\\Anaconda3\\python.exe" output='html' matplotlib=true}
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('../data/Iris.csv')
ax= sns.boxplot(x="Species", y="PetalLengthCm", data=dataset)
ax= sns.stripplot(x="Species", y="PetalLengthCm", data=dataset, jitter=True, edgecolor="gray")
# ax.artists holds one box per category, in plotting order (index 0 is the first box)
boxtwo = ax.artists[2]    # recolor the third box
boxtwo.set_facecolor('red')
boxtwo.set_edgecolor('black')
boxthree = ax.artists[1]  # recolor the second box
boxthree.set_facecolor('yellow')
boxthree.set_edgecolor('black')
plt.show()
For details of this function, see: http://seaborn.pydata.org/generated/seaborn.boxplot.html#seaborn.boxplot
We can also create a histogram of each input variable to get an idea of its distribution.
#python {cmd="G:\\Anaconda3\\python.exe" output='html' matplotlib=true}
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_csv('../data/Iris.csv')
dataset.hist(figsize=(15,20))
plt.show()
It looks like perhaps two of the input variables have a Gaussian distribution. This is worth noting, because we can use algorithms that exploit this assumption (a numeric check is sketched below).
For details of this function, see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html?highlight=hist#pandas.DataFrame.hist
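To back up that visual impression numerically, one option is D'Agostino's normality test on each measurement column. A minimal sketch (my own addition, using scipy.stats.normaltest; large p-values are consistent with a Gaussian):

#python {cmd="G:\\Anaconda3\\python.exe" output='html'}
import pandas as pd
from scipy import stats
dataset = pd.read_csv('../data/Iris.csv')
for col in ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']:
    # normaltest returns the test statistic and the two-sided p-value
    stat, p = stats.normaltest(dataset[col])
    print('{}: statistic={:.2f}, p={:.3f}'.format(col, stat, p))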
On the diagonal are histograms of each variable; the off-diagonal panels are scatter plots of each pair of variables.
#python {cmd="G:\\Anaconda3\\python.exe" output='html' matplotlib=true}
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_csv('../data/Iris.csv')
pd.plotting.scatter_matrix(dataset,figsize=(10,10))
plt.show()
Note the diagonal grouping of some pairs of attributes. This suggests high correlation and a predictable relationship.
For details of this function, see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.plotting.scatter_matrix.html?highlight=plotting
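Those groupings can be quantified with the correlation matrix. A small sketch (my own addition; the Id column is dropped because it is just a row index):

#python {cmd="G:\\Anaconda3\\python.exe" output='html'}
import pandas as pd
dataset = pd.read_csv('../data/Iris.csv')
# Pairwise Pearson correlations of the four measurements
print(dataset.drop('Id', axis=1).corr())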
A violin plot, also known as a kernel density plot, combines a box plot and a kernel density estimate in a single figure; its shape resembles a violin, hence the name.
#python {cmd="G:\\Anaconda3\\python.exe" output='html' matplotlib=true}
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('../data/Iris.csv')
sns.violinplot(data=dataset,x="Species", y="PetalLengthCm")
plt.show()
For details of this function, see: http://seaborn.pydata.org/generated/seaborn.violinplot.html?highlight=violinplot#seaborn.violinplot
#python {cmd="G:\\Anaconda3\\python.exe" output='html' matplotlib=true}
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('../data/Iris.csv')
sns.pairplot(dataset, hue="Species")
plt.show()
For details of this function, see: http://seaborn.pydata.org/generated/seaborn.pairplot.html?highlight=pairplot#seaborn.pairplot
#python {cmd="G:\\Anaconda3\\python.exe" output='html' matplotlib=true}
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('../data/Iris.csv')
sns.FacetGrid(dataset, hue="Species", size=5).map(sns.kdeplot, "PetalLengthCm").add_legend()
plt.show()
For details of this function, see: http://seaborn.pydata.org/generated/seaborn.FacetGrid.html#seaborn.FacetGrid
#python {cmd="G:\\Anaconda3\\python.exe" output='html' matplotlib=true}
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('../data/Iris.csv')
# Use seaborn's jointplot to make a hexagonal bin plot
#Set desired size and ratio and choose a color.
sns.jointplot(x="SepalLengthCm", y="SepalWidthCm", data=dataset, size=10,ratio=10, kind='hex',color='green')
plt.show()
For details of this function, see: http://seaborn.pydata.org/generated/seaborn.jointplot.html#seaborn.jointplot
Each point $x = (x_1, x_2, \dots, x_d)$ defines a finite Fourier series:

$$f(t) = \frac{x_1}{\sqrt{2}} + x_2\sin(t) + x_3\cos(t) + x_4\sin(2t) + x_5\cos(2t) + \dots$$
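To make the formula concrete, this small sketch (my own addition, not from the original post) evaluates f(t) by hand for the first flower's four measurements x1..x4; this is the curve that pandas' andrews_curves draws for that row:

#python {cmd="G:\\Anaconda3\\python.exe" output='html'}
import numpy as np
import pandas as pd
dataset = pd.read_csv('../data/Iris.csv')
x = dataset.loc[0, ['SepalLengthCm', 'SepalWidthCm',
                    'PetalLengthCm', 'PetalWidthCm']].values.astype(float)
t = np.linspace(-np.pi, np.pi, 200)
# f(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t)
f = x[0] / np.sqrt(2) + x[1] * np.sin(t) + x[2] * np.cos(t) + x[3] * np.sin(2 * t)
print(f[:5])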
#python {cmd="G:\\Anaconda3\\python.exe" output='html' matplotlib=true}
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('../data/Iris.csv')
from pandas.plotting import andrews_curves
andrews_curves(dataset.drop("Id", axis=1), "Species",colormap='rainbow')
plt.show()
For the underlying theory, see: http://www.jucs.org/jucs_11_11/visualization_of_high_dimensional/jucs_11_11_1806_1819_garc_a_osorio.pdf
For details of this function, see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.plotting.andrews_curves.html?highlight=andrews_curves
#python {cmd="G:\\Anaconda3\\python.exe" output='html' matplotlib=true}
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('../data/Iris.csv')
sns.heatmap(dataset.drop("Id", axis=1).corr(), annot=True, cmap='cubehelix_r') # heatmap of the correlation matrix; Id is dropped since it is only a row index
plt.show()
For details of this function, see: http://seaborn.pydata.org/generated/seaborn.heatmap.html?highlight=heatmap#seaborn.heatmap
#python {cmd="G:\\Anaconda3\\python.exe" output='html' matplotlib=true}
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('../data/Iris.csv')
from pandas.plotting import radviz
radviz(dataset.drop("Id", axis=1), "Species")
plt.show()
For details of this function, see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.plotting.radviz.html?highlight=radviz#pandas.plotting.radviz
The main goals of data cleaning
The problems that need to be addressed include:
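As a minimal sketch of what such cleaning typically involves on this dataset (checking for missing values and duplicate rows; not necessarily the original author's exact steps):

#python {cmd="G:\\Anaconda3\\python.exe" output='html'}
import pandas as pd
dataset = pd.read_csv('../data/Iris.csv')
# Missing values per column (the Kaggle Iris CSV is normally complete)
print(dataset.isnull().sum())
# Number of exact duplicate rows
print(dataset.duplicated().sum())
# Drop duplicates, if any, keeping the first occurrence
dataset = dataset.drop_duplicates()
print(dataset.shape)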
In this section, close to 20 learning algorithms are applied; they all rely on the train/test split sketched below. Each model is evaluated with the standard classification terminology and metrics: classification_report, confusion_matrix, and accuracy_score.
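All of the classifier blocks below assume that X_train, X_test, y_train, and y_test have already been prepared. A minimal sketch of such a split (the 70/30 ratio and random_state=0 are assumptions, not necessarily the original author's choices):

#python {cmd="G:\\Anaconda3\\python.exe" output='html'}
import pandas as pd
from sklearn.model_selection import train_test_split

dataset = pd.read_csv('../data/Iris.csv')
# Four measurement columns as features, species as the label
X = dataset[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
y = dataset['Species']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)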
# K-Nearest Neighbours
from sklearn.neighbors import KNeighborsClassifier
Model = KNeighborsClassifier(n_neighbors=8)
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)
# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is',accuracy_score(y_pred,y_test))
For the principles of the k-nearest neighbors algorithm, see: https://www.cnblogs.com/pinard/p/6061661.html
from sklearn.neighbors import RadiusNeighborsClassifier
Model=RadiusNeighborsClassifier(radius=8.0)
Model.fit(X_train,y_train)
y_pred=Model.predict(X_test)
# Summary of the predictions made by the classifier
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
# Accuracy score
print('accuracy is ', accuracy_score(y_test,y_pred))
For the scikit-learn parameters of k-nearest neighbors and radius-based nearest neighbors, see: https://www.cnblogs.com/pinard/p/6065607.html
# LogisticRegression
from sklearn.linear_model import LogisticRegression
Model = LogisticRegression()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)
# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is',accuracy_score(y_pred,y_test))
For the principles of logistic regression, see: https://www.cnblogs.com/pinard/p/6029432.html
from sklearn.linear_model import PassiveAggressiveClassifier
Model = PassiveAggressiveClassifier()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)
# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is',accuracy_score(y_pred,y_test))
For the principles of the Passive Aggressive algorithm, see: http://scikit-learn.org/0.19/modules/linear_model.html#passive-aggressive
In machine learning, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Below is an experiment with Gaussian naive Bayes:
GaussianNB assumes that the conditional probability of each feature given the class is Gaussian:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
Model = GaussianNB()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)
# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is',accuracy_score(y_pred,y_test))
For the principles of naive Bayes, see: https://www.cnblogs.com/pinard/p/6069267.html
# MultinomialNB
from sklearn.naive_bayes import MultinomialNB
Model = MultinomialNB()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)
# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is',accuracy_score(y_pred,y_test))
BernoulliNB is intended for binary/boolean features; its decision rule is based on

$$P(x_i \mid y) = P(i \mid y)\,x_i + (1 - P(i \mid y))(1 - x_i)$$
# BernoulliNB
from sklearn.naive_bayes import BernoulliNB
Model = BernoulliNB()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)
# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is',accuracy_score(y_pred,y_test))
# Support Vector Machine
from sklearn.svm import SVC
Model = SVC()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)
# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is',accuracy_score(y_pred,y_test))
For the principles of support vector machines, see: http://www.cnblogs.com/pinard/p/6097604.html
# Support Vector Machine's
from sklearn.svm import NuSVC
Model = NuSVC()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)
# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is',accuracy_score(y_pred,y_test))
# Linear Support Vector Classification
from sklearn.svm import LinearSVC
Model = LinearSVC()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)
# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is',accuracy_score(y_pred,y_test))
# Decision Tree's
from sklearn.tree import DecisionTreeClassifier
Model = DecisionTreeClassifier()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)
# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is',accuracy_score(y_pred,y_test))
For the principles of decision trees, see: https://www.cnblogs.com/pinard/p/6050306.html
# ExtraTreeClassifier
from sklearn.tree import ExtraTreeClassifier
Model = ExtraTreeClassifier()
Model.fit(X_train, y_train)
y_pred = Model.predict(X_test)
# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
print('accuracy is',accuracy_score(y_pred,y_test))
from sklearn.neural_network import MLPClassifier
Model=MLPClassifier()
Model.fit(X_train,y_train)
y_pred=Model.predict(X_test)
# Summary of the predictions
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
#Accuracy Score
print('accuracy is ',accuracy_score(y_pred,y_test))
For details of this function, see: http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier
A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (the default).
A random forest experiment:
from sklearn.ensemble import RandomForestClassifier
Model=RandomForestClassifier(max_depth=2)
Model.fit(X_train,y_train)
y_pred=Model.predict(X_test)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
#Accuracy Score
print('accuracy is ',accuracy_score(y_pred,y_test))
For the principles of random forests, see: https://www.cnblogs.com/pinard/p/6156009.html
from sklearn.ensemble import BaggingClassifier
Model=BaggingClassifier()
Model.fit(X_train,y_train)
y_pred=Model.predict(X_test)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
#Accuracy Score
print('accuracy is ',accuracy_score(y_pred,y_test))
For the principles of bagging, see: https://www.cnblogs.com/pinard/p/6156009.html
from sklearn.ensemble import AdaBoostClassifier
Model=AdaBoostClassifier()
Model.fit(X_train,y_train)
y_pred=Model.predict(X_test)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
#Accuracy Score
print('accuracy is ',accuracy_score(y_pred,y_test))
For the principles of AdaBoost, see: https://www.cnblogs.com/pinard/p/6133937.html
from sklearn.ensemble import GradientBoostingClassifier
Model=GradientBoostingClassifier()
Model.fit(X_train,y_train)
y_pred=Model.predict(X_test)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
#Accuracy Score
print('accuracy is ',accuracy_score(y_pred,y_test))
For the principles of gradient boosted decision trees (GBDT), see: https://www.cnblogs.com/pinard/p/6140514.html
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
Model=LinearDiscriminantAnalysis()
Model.fit(X_train,y_train)
y_pred=Model.predict(X_test)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
#Accuracy Score
print('accuracy is ',accuracy_score(y_pred,y_test))
For the principles of linear discriminant analysis (LDA), see: https://www.cnblogs.com/pinard/p/6244265.html
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
Model=QuadraticDiscriminantAnalysis()
Model.fit(X_train,y_train)
y_pred=Model.predict(X_test)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
#Accuracy Score
print('accuracy is ',accuracy_score(y_pred,y_test))
For the principles of quadratic discriminant analysis (QDA), see: http://scikit-learn.org/stable/modules/lda_qda.html#dimensionality-reduction-using-linear-discriminant-analysis