Decision Trees and Random Forests in Python Machine Learning: Salary Prediction

  • The concept of a decision tree:

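As the name suggests, a decision tree reaches a decision by following a tree of questions. It works much like a mind-reading guessing game: the asker may ask n questions, the answerer may only reply yes or no, and after n answers the asker arrives at a conclusion.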
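To make the question game concrete, here is a minimal hand-written sketch of a tree with n = 2 questions. The function name `decide`, the two questions and the outcomes are invented purely for illustration; a learned tree has the same shape, except that the questions are chosen from the data.

```python
def decide(is_raining: bool, has_umbrella: bool) -> str:
    """A hand-written decision tree of depth 2: each `if` is one yes/no question."""
    if is_raining:            # question 1
        if has_umbrella:      # question 2, asked only when the first answer was "yes"
            return "go out with umbrella"
        return "stay home"
    return "go out"           # the first answer was "no", so no further question is needed


print(decide(is_raining=True, has_umbrella=False))
print(decide(is_raining=False, has_umbrella=False))
```

Each path from the first question to a `return` corresponds to one leaf of the tree.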

  • Building a decision tree:

import numpy as np
# Plotting tools
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import tree, datasets

from sklearn.model_selection import train_test_split

wine = datasets.load_wine()
X = wine.data[:, :2]  # keep only the first 2 features (alcohol, malic_acid)
y = wine.target

X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = tree.DecisionTreeClassifier(max_depth=5)
clf.fit(X_train, y_train)

# Class probabilities for one hand-typed sample
print(clf.predict_proba([[0.11, 0.28]]))
# Accuracy on the training split, then on the held-out test split
print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))

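Each node of the tree asks a yes/no question about one of the 2 features in X (is the value at most some threshold?), the depth of the tree (max_depth) is n, and the predicted value of y is the final conclusion.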
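The questions a fitted tree actually asks can be printed with `sklearn.tree.export_text`. The sketch below refits a shallower tree so the output stays short; `max_depth=3` and `random_state=0` are choices made here for readability and reproducibility, not settings from the walkthrough above.

```python
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split

wine = datasets.load_wine()
X = wine.data[:, :2]   # the same two features: alcohol and malic_acid
y = wine.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A shallow tree keeps the printed rules short
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

feature_names = list(wine.feature_names[:2])
rules = tree.export_text(clf, feature_names=feature_names)
print(rules)                                   # one "feature <= threshold" test per node
print("depth actually reached:", clf.get_depth())
```

Every indented line of the printout is one question, and every `class:` line is a conclusion reached at a leaf.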

  • Plotting the result:

cmap_light = ListedColormap(['#FFAAAA','#AAFFAA','#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000','#00FF00','#0000FF'])
                            
x_min, x_max = X[:,0].min() -1,X[:,0].max()+1
y_min, y_max = X[:,1].min() -1,X[:,1].max()+1
xx,yy = np.meshgrid(np.arange(x_min,x_max,.02),np.arange(y_min,y_max,.02))
z = clf.predict(np.c_[xx.ravel(),yy.ravel()])
z = z.reshape(xx.shape)

plt.figure()
plt.pcolormesh(xx,yy,z,cmap=cmap_light)

plt.scatter(X[:,0],X[:,1],c=y,cmap=cmap_bold,edgecolor='k',s=20)
plt.xlim(xx.min(),xx.max())
plt.ylim(yy.min(),yy.max())

plt.show()

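Plotting the decision regions with plt shows that as n (max_depth) increases, the boundaries follow the training points more closely and the training accuracy rises. This does not by itself mean better accuracy on unseen data.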
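The effect of n can also be checked numerically rather than by eye. The following sketch fits trees of several depths and prints training and test accuracy side by side; the depth values and `random_state=0` are arbitrary choices made here, and the exact numbers will differ with another split.

```python
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split

wine = datasets.load_wine()
X, y = wine.data[:, :2], wine.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = []
for depth in (1, 3, 5, 10):
    clf = tree.DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    # Record accuracy on the data the tree has seen and on data it has not
    results.append((depth, clf.score(X_train, y_train), clf.score(X_test, y_test)))

for depth, train_acc, test_acc in results:
    print(f"max_depth={depth:2d}  train={train_acc:.3f}  test={test_acc:.3f}")
```

A growing gap between the two columns is the usual symptom of overfitting.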

  • Random forests:

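A random forest, as the name suggests, is made up of many trees. Once n is large enough, a single tree reports probabilities of exactly 0 and 1 (100%) for its predictions, a sign that decision trees overfit easily. In a random forest each tree is grown differently (in scikit-learn, on a bootstrap sample of the data and with a random subset of features considered at each split), so the trees reach different conclusions. Averaging those conclusions reduces overfitting.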
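The averaging step can be verified directly: in scikit-learn, the fitted trees are exposed as `forest.estimators_`, and the forest's predicted probabilities are the mean of the individual trees' probabilities. The sketch below fits on the full two-feature wine data for brevity, which is a simplification made here.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

wine = datasets.load_wine()
X, y = wine.data[:, :2], wine.target

forest = RandomForestClassifier(n_estimators=6, random_state=3, max_features=1)
forest.fit(X, y)

sample = X[:1]  # one sample, kept 2-D as scikit-learn expects
# One row of class probabilities per tree in the forest
per_tree = np.array([t.predict_proba(sample)[0] for t in forest.estimators_])

print(per_tree)                          # individual trees may disagree
print(per_tree.mean(axis=0))             # average over the 6 trees
print(forest.predict_proba(sample)[0])   # the forest reports the same average
```

Because each tree sees a different resample of the data, a single overconfident tree is diluted by the others.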


# Random forest
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=6, random_state=3, max_features=1)
forest.fit(X_train, y_train)

print(forest.predict_proba([[0.11, 0.28]]))
# Accuracy on the training split, then on the held-out test split
print(forest.score(X_train, y_train))
print(forest.score(X_test, y_test))


cmap_light = ListedColormap(['#FFAAAA','#AAFFAA','#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000','#00FF00','#0000FF'])
                            
x_min, x_max = X[:,0].min() -1,X[:,0].max()+1
y_min, y_max = X[:,1].min() -1,X[:,1].max()+1
xx,yy = np.meshgrid(np.arange(x_min,x_max,.02),np.arange(y_min,y_max,.02))
z = forest.predict(np.c_[xx.ravel(),yy.ravel()])
z = z.reshape(xx.shape)

plt.figure()
plt.pcolormesh(xx,yy,z,cmap=cmap_light)

plt.scatter(X[:,0],X[:,1],c=y,cmap=cmap_bold,edgecolor='k',s=20)
plt.xlim(xx.min(),xx.max())
plt.ylim(yy.min(),yy.max())

plt.show()

  • Example: predicting salary

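First, download the data file: the adult income table.

The downloaded file has a .data extension. Its contents are already comma-separated, so renaming it to .csv is enough (it can then also be opened in Excel).

Import pandas to load the data from the file and preprocess it: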

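# Import pandas
import pandas as pd

# Column names, in order: age, workclass, weight, education,
# years of education, marital status, occupation, family relationship, race, sex,
# capital gain, capital loss, hours worked per week, native country, income
data = pd.read_csv('adult.csv', header=None, index_col=False,
                   names=['年龄','单位性质','权重','学历','受教育时长',
                          '婚姻状况','职业','家庭情况','种族','性别',
                          '资产所得','资产损失','周工作时长','原籍',
                          '收入'])

# Keep age, workclass, education, sex, hours per week, occupation and income
data_lite = data[['年龄','单位性质','学历','性别','周工作时长','职业','收入']]
display(data_lite.head(6))  # display() is available in Jupyter/IPython

# get_dummies one-hot encodes the text columns into numeric indicator columns
data_dummies = pd.get_dummies(data_lite)
display(data_dummies.head(6))

# Features: every column from age through the last occupation indicator
features = data_dummies.loc[:, '年龄':'职业_ Transport-moving']
X = features.values
# Target: the indicator column for income above 50K
y = data_dummies['收入_ >50K'].values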
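What `get_dummies` produces is easier to see on a tiny table. The three rows and the English column names below are made up for illustration and are not taken from the real file; the leading space in each text value mimics the space that appears in names such as `收入_ >50K` above.

```python
import pandas as pd

# Three made-up rows with one numeric and two text columns
toy = pd.DataFrame({
    "age": [39, 50, 28],
    "sex": [" Male", " Female", " Male"],
    "income": [" <=50K", " >50K", " <=50K"],
})

toy_dummies = pd.get_dummies(toy)
# Numeric columns come first and are left alone; each text column becomes
# one indicator column per distinct value, named "<column>_<value>"
print(list(toy_dummies.columns))

# The target is the indicator column for the ">50K" value
toy_y = toy_dummies["income_ >50K"].astype(int).values
print(toy_y)
```

This is why the feature slice in the walkthrough starts at the numeric age column and ends at the last occupation indicator: the income indicators come after it.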

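# Decision tree; uncomment the RandomForestClassifier line to use a random forest instead
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
# The variable is named forest, but the active model is a single decision tree
forest = tree.DecisionTreeClassifier(max_depth=5)
#forest = RandomForestClassifier(n_estimators=6, random_state=3, max_features=1)  # random forest
forest.fit(X_train, y_train)

# Accuracy on the training split, then on the held-out test split
print(forest.score(X_train, y_train))
print(forest.score(X_test, y_test))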

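# One person to score: one value per column of features, in the same order
# (age, hours per week, then the one-hot indicator columns)
new = [[37,40,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0]]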

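pre = forest.predict(new)
# Probability (in percent) of the ">50K" class
pre1 = forest.predict_proba(new)[0][1] * 100

if pre[0] == 1:
    print('Congratulations, the probability that your income exceeds 50K is as high as {:.2f}%'.format(pre1))
else:
    print('Sorry, the probability that your income exceeds 50K is only {:.2f}%'.format(pre1))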
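Typing the long 0/1 vector for `new` by hand is error-prone, because every value must line up with the matching training column. One possible alternative, sketched here on a made-up three-row table with hypothetical English column names, is to one-hot encode the new person and then `reindex` to the training columns, so that missing indicator columns are filled with 0.

```python
import pandas as pd

# Made-up training features (the real table has many more columns)
train = pd.DataFrame({
    "age": [39, 50, 28],
    "hours": [40, 13, 40],
    "sex": [" Male", " Female", " Male"],
})
feature_columns = pd.get_dummies(train).columns

# Describe the new person with readable values instead of a 0/1 vector
person = pd.DataFrame([{"age": 37, "hours": 40, "sex": " Male"}])

# Encode, then align to the training columns; absent indicators become 0
row = pd.get_dummies(person).reindex(columns=feature_columns, fill_value=0)
new = row.astype(float).values

print(list(feature_columns))
print(new)
```

The resulting `new` array has the same column order as the training matrix and can be passed to `predict` and `predict_proba` in the same way as the hand-written list.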

 
