版权声明:本套技术专栏是作者(秦凯新)平时工作的总结和升华,通过从真实商业环境抽取案例进行总结和分享,并给出商业应用的调优建议和集群环境容量规划等内容,请持续关注本套博客。QQ邮箱地址:[email protected],如有任何学术交流,可随时联系。
1 数据预处理
-
DF加上表头
5.1,3.5,1.4,0.2,Iris-setosa 4.9,3.0,1.4,0.2,Iris-setosa 4.7,3.2,1.3,0.2,Iris-setosa 4.6,3.1,1.5,0.2,Iris-setosa 5.0,3.6,1.4,0.2,Iris-setosa 5.4,3.9,1.7,0.4,Iris-setosa 4.6,3.4,1.4,0.3,Iris-setosa import pandas as pd import matplotlib.pyplot as plt import numpy as np iris_data = pd.read_csv('C:\\ML\\MLData\\iris.data') iris_data.columns = ['sepal_length_cm', 'sepal_width_cm', 'petal_length_cm', 'petal_width_cm', 'class'] iris_data.head() 复制代码
-
读取图片
from PIL import Image img=Image.open('test.jpg') plt.imshow(img) plt.show() 复制代码
-
数值描述(数值区间)
iris_data.describe() 复制代码
-
高级可视化库pairplot
%matplotlib inline import matplotlib.pyplot as plt import seaborn as sb sb.pairplot(iris_data.dropna(), hue='class') 复制代码
-
高级可视化库 violinplot分布范围(花瓣相对可以区分出不同特征)
plt.figure(figsize=(10, 10)) for column_index, column in enumerate(iris_data.columns): if column == 'class': continue plt.subplot(2, 2, column_index + 1) sb.violinplot(x='class', y=column, data=iris_data) 复制代码
- 版权声明:本套技术专栏是作者(秦凯新)平时工作的总结和升华,通过从真实商业环境抽取案例进行总结和分享,并给出商业应用的调优建议和集群环境容量规划等内容,请持续关注本套博客。QQ邮箱地址:[email protected],如有任何学术交流,可随时联系。
2 构造分类器(sklearn.cross_validation过期)
-
测试集与训练集
from sklearn.model_selection import KFold from sklearn.model_selection import train_test_split all_inputs = iris_data[['sepal_length_cm', 'sepal_width_cm', 'petal_length_cm', 'petal_width_cm']].values all_classes = iris_data['class'].values (training_inputs, testing_inputs, training_classes, testing_classes) = train_test_split(all_inputs, all_classes, train_size=0.75, random_state=1) 复制代码
-
参数设置详解
from sklearn.tree import DecisionTreeClassifier # 1.criterion gini or entropy(基于gini系数和熵值来指定) # 2.splitter best or random 前者是在所有特征中找最好的切分点 后者是在部分特征中(数据量大的时候) # 3.max_features None(所有) 特征小于50的时候一般使用所有的 ,log2,sqrt,N # 4.max_depth 数据少或者特征少的时候可以不管这个值,如果模型样本量多,特征也多的情况下,可以尝试限制下 # 5.min_samples_split 如果某节点的样本数少于min_samples_split,则不会继续再尝试选择最优特征来进行划分 # 如果样本量不大,不需要管这个值。如果样本量数量级非常大,则推荐增大这个值。 # 6.min_samples_leaf 这个值限制了叶子节点最少的样本数,如果某叶子节点数目小于样本数,则会和兄弟节点一起被 # 剪枝,如果样本量不大,不需要管这个值,大些如10W可是尝试下5 # 7.min_weight_fraction_leaf 这个值限制了叶子节点所有样本权重和的最小值,如果小于这个值,则会和兄弟节点一起 # 被剪枝默认是0,就是不考虑权重问题。一般来说,如果我们有较多样本有缺失值, # 或者分类树样本的分布类别偏差很大,就会引入样本权重,这时我们就要注意这个值了。 # 8.max_leaf_nodes 通过限制最大叶子节点数,可以防止过拟合,默认是"None”,即不限制最大的叶子节点数。 # 如果加了限制,算法会建立在最大叶子节点数内最优的决策树。 # 如果特征不多,可以不考虑这个值,但是如果特征分成多的话,可以加以限制 # 具体的值可以通过交叉验证得到。 # 9.class_weight 指定样本各类别的的权重,主要是为了防止训练集某些类别的样本过多 # 导致训练的决策树过于偏向这些类别。这里可以自己指定各个样本的权重 # 如果使用“balanced”,则算法会自己计算权重,样本量少的类别所对应的样本权重会高。 # 10.min_impurity_split 这个值限制了决策树的增长,如果某节点的不纯度 # (基尼系数,信息增益,均方差,绝对差)小于这个阈值 # 则该节点不再生成子节点。即为叶子节点 。 decision_tree_classifier = DecisionTreeClassifier() # Train the classifier on the training set decision_tree_classifier.fit(training_inputs, training_classes) # Validate the classifier on the testing set using classification accuracy decision_tree_classifier.score(testing_inputs, testing_classes) 0.9736842105263158 复制代码
-
版权声明:本套技术专栏是作者(秦凯新)平时工作的总结和升华,通过从真实商业环境抽取案例进行总结和分享,并给出商业应用的调优建议和集群环境容量规划等内容,请持续关注本套博客。QQ邮箱地址:[email protected],如有任何学术交流,可随时联系。
3 交叉验证
from sklearn.model_selection import KFold
# 但目前train_test_split已被cross_validation被废弃了
# 废弃 from sklearn.cross_validation import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import numpy as np
decision_tree_classifier = DecisionTreeClassifier()
# cross_val_score returns a list of the scores, which we can visualize
# to get a reasonable estimate of our classifier's performance
# 10倍交叉验证
cv_scores = cross_val_score(decision_tree_classifier, all_inputs, all_classes, cv=10)
print (cv_scores)
#kde=False
sb.distplot(cv_scores)
plt.title('Average score: {}'.format(np.mean(cv_scores)))
[1. 0.93333333 1. 0.93333333 0.93333333 0.86666667
0.93333333 0.93333333 1. 1. ]
复制代码
decision_tree_classifier = DecisionTreeClassifier(max_depth=1)
cv_scores = cross_val_score(decision_tree_classifier, all_inputs, all_classes, cv=10)
print (cv_scores)
sb.distplot(cv_scores, kde=False)
plt.title('Average score: {}'.format(np.mean(cv_scores)))
复制代码
-
4 参数网格
from sklearn.model_selection import GridSearchCV from sklearn.model_selection import StratifiedKFold decision_tree_classifier = DecisionTreeClassifier() parameter_grid = {'max_depth': [1, 2, 3, 4, 5], 'max_features': [1, 2, 3, 4]} cross_validation = StratifiedKFold(10) grid_search = GridSearchCV(decision_tree_classifier, param_grid=parameter_grid, cv=cross_validation) grid_search.fit(all_inputs, all_classes) print('Best score: {}'.format(grid_search.best_score_)) print('Best parameters: {}'.format(grid_search.best_params_)) 复制代码
-
5 heatmap堆叠热力图使用
grid_visualization = [] for grid_pair in grid_search.cv_results_['mean_test_score']: grid_visualization.append(grid_pair) grid_visualization = np.array(grid_visualization) grid_visualization.shape = (5, 4) sb.heatmap(grid_visualization, cmap='Blues') plt.xticks(np.arange(4) + 0.5, grid_search.param_grid['max_features']) plt.yticks(np.arange(5) + 0.5, grid_search.param_grid['max_depth'][::-1]) plt.xlabel('max_features') plt.ylabel('max_depth') 复制代码
-
6 生成决策树iris_dtc.dot文件
import sklearn.tree as tree from sklearn.externals.six import StringIO with open('C:\\ML\\MLData\\iris_dtc.dot', 'w') as out_file: out_file = tree.export_graphviz(decision_tree_classifier, out_file=out_file) 复制代码
-
7 下载解析器
http://www.graphviz.org/ Graphviz is open source graph visualization software. Graph visualization is a way of representing structural information as diagrams of abstract graphs and networks. It has important applications in networking, bioinformatics, software engineering, database and web design, machine learning, and in visual interfaces for other technical domains. 复制代码
dot -Tpdf iris_dtc.dot -o iris.pdf
复制代码
-
8 多参数网格以及交叉验证(最新版)
from sklearn.ensemble import RandomForestClassifier from sklearn.model_selection import GridSearchCV from sklearn.model_selection import StratifiedKFold from sklearn.model_selection import KFold random_forest_classifier = RandomForestClassifier() parameter_grid = {'n_estimators': [5, 10, 25, 50], 'criterion': ['gini', 'entropy'], 'max_features': [1, 2, 3, 4], 'warm_start': [True, False]} cross_validation = StratifiedKFold(10) grid_search = GridSearchCV(random_forest_classifier, param_grid=parameter_grid, cv=cross_validation) grid_search.fit(all_inputs, all_classes) print('Best score: {}'.format(grid_search.best_score_)) print('Best parameters: {}'.format(grid_search.best_params_)) Best score: 0.9664429530201343 Best parameters: {'criterion': 'gini', 'max_features': 2, 'n_estimators': 5, 'warm_start': False} grid_search.best_estimator_ RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features=2, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=5, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False) 复制代码
4 总结
sklearn多参数网格和交叉验证的使用,版本很重要,不然都运行不了。
版权声明:本套技术专栏是作者(秦凯新)平时工作的总结和升华,通过从真实商业环境抽取案例进行总结和分享,并给出商业应用的调优建议和集群环境容量规划等内容,请持续关注本套博客。QQ邮箱地址:[email protected],如有任何学术交流,可随时联系。
秦凯新 于深圳 201812082235