As the name suggests, a decision tree makes its decisions in the form of a tree. It essentially works like a 'mind reading' guessing game: the questioner may ask n questions, the answerer may only reply yes or no, and after n answers a conclusion is reached.
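A minimal sketch of this question-and-answer idea as nested if/else checks (the feature names and thresholds below are invented for illustration, not learned from any data):
#toy "tree" of depth 2: each if is a yes/no question, the return value is the conclusion
def toy_tree(alcohol, malic_acid):
    if alcohol <= 12.7:            #question 1
        if malic_acid <= 2.0:      #question 2
            return 'class_1'
        return 'class_2'
    return 'class_0'
print(toy_tree(12.0, 1.5))         #two answers lead to the conclusion 'class_1'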
import numpy as np
#import plotting tools
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import tree,datasets
from sklearn.model_selection import train_test_split
wine = datasets.load_wine()
X = wine.data[:,:2]#use only the first 2 features
y = wine.target
X_train,X_test,y_train,y_test = train_test_split(X,y)
clf = tree.DecisionTreeClassifier(max_depth=5)
clf.fit(X_train,y_train)
print(clf.predict_proba([[0.11,0.28]]))
print(clf.score(X_train,y_train))
The 2 features of X act as the questions, the depth of the decision tree is n, and the value of y is the final conclusion.
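To see the questions the fitted tree actually asks, scikit-learn's tree.export_text can print the learned splits (a quick check reusing the clf fitted above; export_text requires scikit-learn 0.21 or newer):
#print the learned yes/no questions (splits) of the fitted tree
print(tree.export_text(clf, feature_names=list(wine.feature_names[:2])))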
cmap_light = ListedColormap(['#FFAAAA','#AAFFAA','#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000','#00FF00','#0000FF'])
x_min, x_max = X[:,0].min() -1,X[:,0].max()+1
y_min, y_max = X[:,1].min() -1,X[:,1].max()+1
xx,yy = np.meshgrid(np.arange(x_min,x_max,.02),np.arange(y_min,y_max,.02))
z = clf.predict(np.c_[xx.ravel(),yy.ravel()])
z = z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx,yy,z,cmap=cmap_light)
plt.scatter(X[:,0],X[:,1],c=y,cmap=cmap_bold,edgecolor='k',s=20)
plt.xlim(xx.min(),xx.max())
plt.ylim(yy.min(),yy.max())
plt.show()
Plotting the result with plt shows that the larger n (the tree depth) is, the higher the decision tree's accuracy on the training data.
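A quick way to verify this is to compare training and test accuracy for several depths (a sketch reusing the train/test split created above); the training score keeps climbing with the depth while the test score levels off, which already hints at overfitting:
#compare training and test accuracy for several tree depths
for depth in (1, 3, 5, 8):
    t = tree.DecisionTreeClassifier(max_depth=depth).fit(X_train, y_train)
    print(depth, t.score(X_train, y_train), t.score(X_test, y_test))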
A random forest, as the name suggests, is built from many trees. When n grows large enough, the predicted probabilities for the sample suddenly become pure 0s and 1s (100%), because a single decision tree overfits easily. A random forest therefore trains each tree in a different random direction, so the trees reach different conclusions, and averaging their predictions alleviates the overfitting problem.
###Random forest
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=6,random_state=3,max_features=1)
forest.fit(X_train,y_train)
print(forest.predict_proba([[0.11,0.28]]))
print(forest.score(X_train,y_train))
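To make the averaging concrete, each fitted tree inside the forest can be queried on its own through forest.estimators_ (a sketch continuing from the forest fitted above); the individual trees disagree, and the forest's probabilities are the mean of theirs:
#each tree in the ensemble gives its own class probabilities
point = [[0.11, 0.28]]
for i, t in enumerate(forest.estimators_):
    print('tree', i, t.predict_proba(point))
#the forest's answer is the average of the trees' probabilities
print('forest', forest.predict_proba(point))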
cmap_light = ListedColormap(['#FFAAAA','#AAFFAA','#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000','#00FF00','#0000FF'])
x_min, x_max = X[:,0].min() -1,X[:,0].max()+1
y_min, y_max = X[:,1].min() -1,X[:,1].max()+1
xx,yy = np.meshgrid(np.arange(x_min,x_max,.02),np.arange(y_min,y_max,.02))
z = forest.predict(np.c_[xx.ravel(),yy.ravel()])
z = z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx,yy,z,cmap=cmap_light)
plt.scatter(X[:,0],X[:,1],c=y,cmap=cmap_bold,edgecolor='k',s=20)
plt.xlim(xx.min(),xx.max())
plt.ylim(yy.min(),yy.max())
plt.show()
First, download the dataset: the adult income (census) data.
The downloaded file is in .data format; here it is renamed to .csv format (so it can be opened with Excel).
Import the pandas library to read the data from the file and process it.
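Strictly speaking the conversion is optional: pandas can parse the comma-separated .data file directly, and renaming it to .csv is mainly for viewing it in Excel. A minimal sketch, assuming the UCI file adult.data sits in the working directory:
#read the raw UCI file without renaming it
import pandas as pd
raw = pd.read_csv('adult.data', header=None)
print(raw.shape)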
#import the pandas library
import pandas as pd
#assign column names (age, workclass, fnlwgt/weight, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, income)
data = pd.read_csv('adult.csv',header=None, index_col=False , names = ['年龄','单位性质','权重','学历','受教育时长',
                   '婚姻状况','职业','家庭情况','种族','性别',
                   '资产所得','资产损失','周工作时长','原籍',
                   '收入'])
data_lite = data[['年龄','单位性质','学历','性别','周工作时长','职业','收入']]  #keep a smaller subset of columns
display(data_lite.head(6))  #display() works in Jupyter/IPython; use print() in a plain script
#get_dummies converts the text (categorical) columns into numeric one-hot columns
data_dummies = pd.get_dummies(data_lite)
display(data_dummies.head(6))
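A tiny self-contained example of what get_dummies does to a text column (the toy data below is made up for illustration):
#a text column becomes one 0/1 column per category
toy = pd.DataFrame({'学历': [' Bachelors', ' Masters', ' Bachelors']})
print(pd.get_dummies(toy))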
features =data_dummies.loc[:,'年龄':'职业_ Transport-moving']  ##select the feature columns from 年龄 through 职业_ Transport-moving
X=features.values
y=data_dummies['收入_ >50K'].values  #target: 1 if income is >50K, else 0
#Random forest
import numpy as np
from sklearn import tree,datasets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y)
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=6,random_state=3,max_features=1)  #random forest
#forest = tree.DecisionTreeClassifier(max_depth=5)  #a single decision tree, for comparison
forest.fit(X_train,y_train)
print(forest.score(X_train,y_train))
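The training score alone says little about generalization; the held-out split gives a fairer picture (continuing the script above):
#accuracy on the held-out test split
print(forest.score(X_test,y_test))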
new = [[37,40,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0]]  #age 37, 40 hours per week, followed by 0/1 flags in the dummy-column order
pre = forest.predict(new)
pre1 = forest.predict_proba(new)[0][1]*100
if pre[0] == 1:
    print('Congratulations! The probability that your income exceeds 50K is as high as {:.2f}%'.format(pre1))
else:
    print('Unfortunately, the probability that your income exceeds 50K is only {:.2f}%'.format(pre1))
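Hand-counting the positions of this long 0/1 vector is error-prone. A safer sketch is to build the sample as a one-row DataFrame, one-hot encode it, and reindex it against features.columns; the column names reuse the ones defined when reading the CSV, and the category values below (with their leading spaces, as they appear in the raw data) are only examples:
#build the sample as a DataFrame and align it with the training feature columns
sample = pd.DataFrame([{'年龄': 37, '单位性质': ' Private', '学历': ' Bachelors',
                        '性别': ' Male', '周工作时长': 40, '职业': ' Tech-support'}])
sample_dummies = pd.get_dummies(sample).reindex(columns=features.columns, fill_value=0)
print(forest.predict_proba(sample_dummies.values.astype(float))[0][1])  #probability of income >50K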