phone_number
phonebook remark (nickname)
birthday
sex
password
national ID number
email
education level
housing status
points
check-in
ip
gps
wifi
application time
login time
ua(user agent)
channel
app version
referrer / contacts
imei
device id
screen resolution
mobile type
os(operating system)
address
company
PBOC credit report
mobile carrier
social security & housing provident fund
Feature selection (feature_selection)
After data preprocessing is done, we need to select meaningful features to feed into the machine learning algorithms and models for training.
Generally speaking, features are chosen from two angles:
Whether the feature diverges: if a feature does not diverge, e.g. its variance is close to 0, the samples barely differ on it, so it is of little use for telling samples apart.
Correlation with the target: this is fairly obvious, features highly correlated with the target should be preferred. Apart from the low-variance filter, every other method introduced here works from the correlation angle.
By the form the selection takes, feature selection methods fall into 3 types: Filter, Wrapper and Embedded.
Feature selection serves two main purposes: reducing the number of features (dimensionality reduction) to improve generalization and lessen overfitting, and gaining a better understanding of the features and the data.
Given a dataset, a single feature selection method can rarely achieve both goals at once. In practice people usually pick the method they are most familiar with or find most convenient (usually aiming at dimensionality reduction, while neglecting the goal of understanding the features and the data). Below we walk through several common feature selection methods, using examples provided by Scikit-learn, together with their pros, cons and pitfalls.
Suppose a feature only takes the values 0 and 1, and in 95% of all samples its value is 1; then this feature is of little use. If it is 1 in 100% of the samples, the feature carries no information at all. This method only applies when the feature values are discrete; continuous variables have to be discretized first. In practice it is rare to have a feature where more than 95% of the samples share one value, so although the method is simple it is not very useful on its own. It works well as a preprocessing step: first drop the features whose values barely change, then pick a suitable method from those introduced below for further selection.
In [3]:
from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
Out[3]:
array([[0, 1],
[1, 0],
[0, 0],
[1, 1],
[1, 0],
[1, 1]])
As expected, VarianceThreshold removed the first column, where the value 0 appears with probability 5/6 (more than 80% of the samples share the same value).
Univariate feature selection computes some statistic for each variable separately and uses it to judge which variables matter, dropping the unimportant ones.
For classification problems (discrete y), you can use: the chi-square test (chi2), f_classif, or mutual information (mutual_info_classif).
For regression problems (continuous y), you can use: the Pearson correlation coefficient, f_regression, or mutual information (mutual_info_regression).
This approach is simple, easy to run and easy to understand, and it usually helps with understanding the data (though it is not necessarily effective for optimizing the feature set or improving generalization).
The classical chi-square test measures the dependence between a categorical independent variable and a categorical dependent variable. For example, we can run a chi2 test on the samples to select the best two features:
In [4]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
iris = load_iris()
X, y = iris.data, iris.target
X.shape
Out[4]:
(150, 4)
In [5]:
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
X_new.shape
Out[5]:
(150, 2)
The Pearson correlation coefficient is one of the simplest ways to understand the relationship between a feature and the response variable. It measures the linear correlation between two variables; the result lies in [-1, 1], where -1 means a perfect negative correlation, +1 a perfect positive correlation, and 0 no linear correlation.
In [8]:
import numpy as np
from scipy.stats import pearsonr
np.random.seed(0)
size = 300
x = np.random.normal(0, 1, size)
# pearsonr(x, y) takes two arrays and returns both the correlation coefficient and the p-value.
print("Lower noise", pearsonr(x, x + np.random.normal(0, 1, size)))
print("Higher noise", pearsonr(x, x + np.random.normal(0, 10, size)))
Lower noise (0.7182483686213841, 7.32401731299835e-49)
Higher noise (0.057964292079338155, 0.3170099388532475)
In this example we compare the variable before and after adding noise. When the noise is small, the correlation is strong and the p-value is low.
We mainly use the Pearson correlation to look at correlations between features, rather than between a feature and the target.
Recursive feature elimination (RFE) uses a base model to train over multiple rounds; after each round, the features with the smallest weights are removed and the next round is trained on the reduced feature set.
For prediction models that assign weights to features (e.g. the coefficients of a linear model), RFE selects features by recursively shrinking the feature set under consideration. First the model is trained on the original features and each feature receives a weight. Then the features with the smallest absolute weights are dropped from the set. This is repeated until the number of remaining features reaches the desired number.
RFECV runs RFE inside cross-validation to choose the best number of features: for a set of d features, the number of non-empty subsets is 2^d - 1. Given an external learning algorithm, such as an SVM, the validation error of the candidate subsets is computed and the subset with the smallest error is chosen as the selected feature set.
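A minimal sketch of the RFECV usage described above (my own illustration, not from the original notebook; the linear-SVC estimator, 5-fold stratified CV and accuracy scoring are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

# RFECV = RFE + cross-validation: it picks the number of features automatically
rfecv = RFECV(estimator=SVC(kernel="linear"), step=1,
              cv=StratifiedKFold(5), scoring="accuracy")
rfecv.fit(X, y)
print("optimal number of features:", rfecv.n_features_)
print("selected mask:", rfecv.support_)
```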
In [29]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
rf = RandomForestClassifier()
iris=load_iris()
X,y=iris.data,iris.target
rfe = RFE(estimator=rf, n_features_to_select=3)
X_rfe = rfe.fit_transform(X,y)
X_rfe.shape
Out[29]:
(150, 3)
In [30]:
X_rfe[:5,:]
Out[30]:
array([[5.1, 1.4, 0.2],
[4.9, 1.4, 0.2],
[4.7, 1.3, 0.2],
[4.6, 1.5, 0.2],
[5. , 1.4, 0.2]])
#### Embedded
Feature selection using SelectFromModel
Linear models penalized with the L1 norm yield sparse solutions: most of the coefficients are 0. When you want to reduce the dimensionality of the data for use with another classifier, you can use feature_selection.SelectFromModel to keep the features with non-zero coefficients.
In particular, sparse estimators commonly used for this purpose are linear_model.Lasso (regression), and linear_model.LogisticRegression and svm.LinearSVC (classification).
In [31]:
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X,y)
model = SelectFromModel(lsvc, prefit=True)
X_embed = model.transform(X)
X_embed.shape
Out[31]:
(150, 3)
In [32]:
X_embed[:5,:]
Out[32]:
array([[5.1, 3.5, 1.4],
[4.9, 3. , 1.4],
[4.7, 3.2, 1.3],
[4.6, 3.1, 1.5],
[5. , 3.6, 1.4]])
Problems you may run into after the model goes live:
Think about which variables the business actually needs.
From the methods above, pick the few that best fit the current scenario.
In [29]:
import pandas as pd
import numpy as np
df_train = pd.read_csv('train.csv')
df_train.head()
Out[29]:
 | PassengerId | label | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
1) Variable importance
2) Collinearity
When building models based on the idea of partitioning the feature space (e.g. tree models), we must pay attention to the correlation between variables. For looking at just two variables, we use the Pearson correlation coefficient.
In [6]:
df_train.corr()
Out[6]:
 | PassengerId | label | Pclass | Age | SibSp | Parch | Fare
---|---|---|---|---|---|---|---|
PassengerId | 1.000000 | -0.005007 | -0.035144 | 0.036847 | -0.057527 | -0.001652 | 0.012658 |
label | -0.005007 | 1.000000 | -0.338481 | -0.077221 | -0.035322 | 0.081629 | 0.257307 |
Pclass | -0.035144 | -0.338481 | 1.000000 | -0.369226 | 0.083081 | 0.018443 | -0.549500 |
Age | 0.036847 | -0.077221 | -0.369226 | 1.000000 | -0.308247 | -0.189119 | 0.096067 |
SibSp | -0.057527 | -0.035322 | 0.083081 | -0.308247 | 1.000000 | 0.414838 | 0.159651 |
Parch | -0.001652 | 0.081629 | 0.018443 | -0.189119 | 0.414838 | 1.000000 | 0.216225 |
Fare | 0.012658 | 0.257307 | -0.549500 | 0.096067 | 0.159651 | 0.216225 | 1.000000 |
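As a quick follow-up sketch (my addition, not from the original; the 0.5 threshold is an arbitrary choice): list the feature pairs whose absolute correlation exceeds a threshold, so one of each pair can be dropped or the pair combined.

```python
import numpy as np

# reuse the correlation matrix computed above
corr = df_train.corr().abs()
# keep only the upper triangle so each pair of features appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()          # (feature_a, feature_b) -> |corr|
print(pairs[pairs > 0.5].sort_values(ascending=False))
# in this table, Pclass / Fare (|corr| ≈ 0.55) would be flagged
```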
3) Monotonicity
# equal-frequency binning into 10 fare buckets
df_train.loc[:,'fare_qcut'] = pd.qcut(df_train['Fare'], 10)
df_train.head()
df_train = df_train.sort_values('Fare')
alist = list(set(df_train['fare_qcut']))
badrate = {}
for x in alist:
    a = df_train[df_train.fare_qcut == x]
    bad = a[a.label == 1]['label'].count()
    good = a[a.label == 0]['label'].count()
    badrate[x] = bad/(bad+good)
f = zip(badrate.keys(),badrate.values())
f = sorted(f,key = lambda x : x[1],reverse = True )
badrate = pd.DataFrame(f)
badrate.columns = pd.Series(['cut','badrate'])
badrate = badrate.sort_values('cut')
print(badrate)
badrate.plot('cut','badrate')
cut badrate
9 (-0.001, 7.55] 0.141304
6 (7.55, 7.854] 0.298851
8 (7.854, 8.05] 0.179245
7 (8.05, 10.5] 0.230769
3 (10.5, 14.454] 0.428571
4 (14.454, 21.679] 0.420455
2 (21.679, 27.0] 0.516854
5 (27.0, 39.688] 0.373626
1 (39.688, 77.958] 0.528090
0 (77.958, 512.329] 0.758621
(Output: line plot of badrate against the fare buckets.)
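A small sketch (my addition) to quantify the monotonicity seen above: the Spearman rank correlation between bucket order and bad rate is close to +1 when the trend rises (almost) monotonically with fare.

```python
import numpy as np
from scipy.stats import spearmanr

# rank correlation between bucket order and bad rate
badrate_sorted = badrate.sort_values('cut').reset_index(drop=True)
rho, p = spearmanr(np.arange(len(badrate_sorted)), badrate_sorted['badrate'])
print(rho, p)   # rho near +1 or -1 suggests a (nearly) monotonic relationship
```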
4) Stability
Split the samples by month, use the slices in turn as training and test sets to train the model, and take the intersection of the variables that enter the model, but be careful with collinear features!
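A rough sketch of this month-split idea (everything below is illustrative: the `df_by_month` frame with its 'month' column and the random-forest selector are assumptions, not part of the original data):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

selected = []
for m, part in df_by_month.groupby('month'):            # hypothetical monthly snapshots
    X_m, y_m = part.drop(columns=['label', 'month']), part['label']
    sel = SelectFromModel(RandomForestClassifier(random_state=0)).fit(X_m, y_m)
    selected.append(set(X_m.columns[sel.get_support()]))

stable_features = set.intersection(*selected)           # variables selected in every month
print(stable_features)
```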
Solution: monitor stability with the PSI (Population Stability Index).
Formula:
$$\mathrm{PSI} = \sum_i (\text{actual}_i - \text{expected}_i) \times \ln\!\left(\frac{\text{actual}_i}{\text{expected}_i}\right)$$
An example from Zhihu:
Say we train a logistic regression model that outputs a probability p at prediction time.
Call the output on the test set p1; sort it from small to large and split it into 10 equal bins, e.g. 0-0.1, 0.1-0.2, …
Use the model to score new samples; call those predictions p2 and split them into the same bins as p1.
Actual proportion: the share of users falling into each bin under p2.
Expected proportion: the share of users falling into each bin under p1.
The intuition: if the model is stable, the user shares per bin should be similar between p1 and p2, i.e. the predicted probabilities should not shift much.
A common rule of thumb: a PSI below 0.1 indicates high stability, 0.1-0.25 is moderate, and above 0.25 indicates poor stability and the model should be rebuilt.
Note that the number of bins affects the PSI value.
PSI is not only for model scores; it works for individual variables as well: simply compute the PSI of each variable binned across time windows.
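A minimal PSI sketch (my addition, not from the original; the decile binning and the 1e-6 smoothing term are implementation choices):

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    # interior bin edges are the deciles of the expected (baseline) scores;
    # dropping the outermost edges lets np.digitize catch values outside the baseline range
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, n_bins + 1)))[1:-1]
    e_idx, a_idx = np.digitize(expected, edges), np.digitize(actual, edges)
    n = len(edges) + 1
    # proportions per bin, with a tiny smoothing term to avoid log(0) for empty bins
    e = np.bincount(e_idx, minlength=n) / len(expected) + 1e-6
    a = np.bincount(a_idx, minlength=n) / len(actual) + 1e-6
    return np.sum((a - e) * np.log(a / e))

# usage: psi(p1, p2), where p1 are test-set scores and p2 are scores on new samples
```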
import pandas as pd
import numpy as np
data = pd.read_excel('oil_data_for_tree.xlsx')
data.head()
 | uid | oil_actv_dt | create_dt | total_oil_cnt | pay_amount_total | class_new | bad_ind | oil_amount | discount_amount | sale_amount | amount | pay_amount | coupon_amount | payment_coupon_amount | channel_code | oil_code | scene | source_app | call_source
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | A8217710 | 2018-08-19 | 2018-08-17 | 275.0 | 48295495.4 | B | 0 | 3308.56 | 1760081.0 | 1796001.0 | 1731081.0 | 8655401.0 | 1.0 | 1.0 | 1 | 3 | 2 | 0 | 3 |
1 | A8217710 | 2018-08-19 | 2018-08-16 | 275.0 | 48295495.4 | B | 0 | 4674.68 | 2487045.0 | 2537801.0 | 2437845.0 | 12189221.0 | 1.0 | 1.0 | 1 | 3 | 2 | 0 | 3 |
2 | A8217710 | 2018-08-19 | 2018-08-15 | 275.0 | 48295495.4 | B | 0 | 1873.06 | 977845.0 | 997801.0 | 961845.0 | 4809221.0 | 1.0 | 1.0 | 1 | 2 | 2 | 0 | 3 |
3 | A8217710 | 2018-08-19 | 2018-08-14 | 275.0 | 48295495.4 | B | 0 | 4837.78 | 2526441.0 | 2578001.0 | 2484441.0 | 12422201.0 | 1.0 | 1.0 | 1 | 2 | 2 | 0 | 3 |
4 | A8217710 | 2018-08-19 | 2018-08-13 | 275.0 | 48295495.4 | B | 0 | 2586.38 | 1350441.0 | 1378001.0 | 1328441.0 | 6642201.0 | 1.0 | 1.0 | 1 | 2 | 2 | 0 | 3 |
In [32]:
set(data.class_new)
Out[32]:
{'A', 'B', 'C', 'D', 'E', 'F'}
org_lst: variables that need no special transformation, just deduplication
agg_lst: numeric variables to be aggregated
dstc_lst: categorical variables, for which we count distinct values
In [33]:
org_lst = ['uid','create_dt','oil_actv_dt','class_new','bad_ind']
agg_lst = ['oil_amount','discount_amount','sale_amount','amount','pay_amount','coupon_amount','payment_coupon_amount']
dstc_lst = ['channel_code','oil_code','scene','source_app','call_source']
Reassemble the data
In [34]:
df = data[org_lst].copy()
df[agg_lst] = data[agg_lst].copy()
df[dstc_lst] = data[dstc_lst].copy()
Check the missing values
In [35]:
df.isna().sum()
Out[35]:
uid 0
create_dt 4944
oil_actv_dt 0
class_new 0
bad_ind 0
oil_amount 4944
discount_amount 4944
sale_amount 4944
amount 4944
pay_amount 4944
coupon_amount 4944
payment_coupon_amount 4946
channel_code 0
oil_code 0
scene 0
source_app 0
call_source 0
dtype: int64
Look at describe() for the base variables
In [7]:
df.describe()
Out[7]:
 | bad_ind | oil_amount | discount_amount | sale_amount | amount | pay_amount | coupon_amount | payment_coupon_amount | channel_code | oil_code | scene | source_app | call_source
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50609.000000 | 45665.000000 | 4.566500e+04 | 4.566500e+04 | 4.566500e+04 | 4.566500e+04 | 45665.000000 | 45663.000000 | 50609.000000 | 50609.000000 | 50609.000000 | 50609.000000 | 50609.000000 |
mean | 0.017764 | 425.376107 | 1.832017e+05 | 1.881283e+05 | 1.808673e+05 | 9.043344e+05 | 0.576853 | 149.395397 | 1.476378 | 1.617894 | 1.906519 | 0.306072 | 2.900729 |
std | 0.132093 | 400.596244 | 2.007574e+05 | 2.048742e+05 | 1.977035e+05 | 9.885168e+05 | 0.494064 | 605.138823 | 1.511470 | 3.074166 | 0.367280 | 0.893682 | 0.726231 |
min | 0.000000 | 1.000000 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 5.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 175.440000 | 6.039100e+04 | 6.200100e+04 | 5.976100e+04 | 2.988010e+05 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 3.000000 |
50% | 0.000000 | 336.160000 | 1.229310e+05 | 1.279240e+05 | 1.209610e+05 | 6.048010e+05 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 3.000000 |
75% | 0.000000 | 557.600000 | 2.399050e+05 | 2.454010e+05 | 2.360790e+05 | 1.180391e+06 | 1.000000 | 100.000000 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 3.000000 |
max | 1.000000 | 7952.820000 | 3.916081e+06 | 3.996001e+06 | 3.851081e+06 | 1.925540e+07 | 1.000000 | 50000.000000 | 6.000000 | 9.000000 | 2.000000 | 3.000000 | 4.000000 |
Fill the missing create_dt values with oil_actv_dt.
Keep only the most recent 6 months of data.
When constructing variables, do not simply accumulate over the entire history; otherwise the variable distributions will drift a lot as time goes by.
In [37]:
def time_isna(x, y):
    # if create_dt is missing (NaT), fall back to oil_actv_dt
    if str(x) == 'NaT':
        x = y
    return x
df2 = df.sort_values(['uid','create_dt'],ascending = False)
df2['create_dt'] = df2.apply(lambda x: time_isna(x.create_dt,x.oil_actv_dt),axis = 1)
df2['dtn'] = (df2.oil_actv_dt - df2.create_dt).apply(lambda x :x.days)
df = df2[df2['dtn']<180]
df.head()
Out[37]:
 | uid | create_dt | oil_actv_dt | class_new | bad_ind | oil_amount | discount_amount | sale_amount | amount | pay_amount | coupon_amount | payment_coupon_amount | channel_code | oil_code | scene | source_app | call_source | dtn
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
50608 | B96436391985035703 | 2018-10-08 | 2018-10-08 | B | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6 | 9 | 2 | 3 | 4 | 0 |
50607 | B96436391984693397 | 2018-10-11 | 2018-10-11 | E | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6 | 9 | 2 | 3 | 4 | 0 |
50606 | B96436391977217468 | 2018-10-17 | 2018-10-17 | B | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6 | 9 | 2 | 3 | 4 | 0 |
50605 | B96436391976480892 | 2018-09-28 | 2018-09-28 | B | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6 | 9 | 2 | 3 | 4 | 0 |
50604 | B96436391972106043 | 2018-10-19 | 2018-10-19 | A | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6 | 9 | 2 | 3 | 4 | 0 |
For the org_lst variables, compute the maximum interval in days of the historical loans, and deduplicate so that each uid keeps one row
In [38]:
base = df[org_lst]
base['dtn'] = df['dtn']
base = base.sort_values(['uid','create_dt'],ascending = False)
base = base.drop_duplicates(['uid'],keep = 'first')
base.shape
Out[38]:
(11099, 6)
In [39]:
gn = pd.DataFrame()
for i in agg_lst:
    # number of records per uid
    tp = pd.DataFrame(df.groupby('uid').apply(lambda df: len(df[i])).reset_index())
    tp.columns = ['uid', i + '_cnt']
    if gn.empty == True:
        gn = tp
    else:
        gn = pd.merge(gn, tp, on='uid', how='left')
    # number of records with a positive value
    tp = pd.DataFrame(df.groupby('uid').apply(lambda df: np.where(df[i] > 0, 1, 0).sum()).reset_index())
    tp.columns = ['uid', i + '_num']
    if gn.empty == True:
        gn = tp
    else:
        gn = pd.merge(gn, tp, on='uid', how='left')
    # sum, ignoring NaN
    tp = pd.DataFrame(df.groupby('uid').apply(lambda df: np.nansum(df[i])).reset_index())
    tp.columns = ['uid', i + '_tot']
    if gn.empty == True:
        gn = tp
    else:
        gn = pd.merge(gn, tp, on='uid', how='left')
    # mean, ignoring NaN
    tp = pd.DataFrame(df.groupby('uid').apply(lambda df: np.nanmean(df[i])).reset_index())
    tp.columns = ['uid', i + '_avg']
    if gn.empty == True:
        gn = tp
    else:
        gn = pd.merge(gn, tp, on='uid', how='left')
    # max, ignoring NaN
    tp = pd.DataFrame(df.groupby('uid').apply(lambda df: np.nanmax(df[i])).reset_index())
    tp.columns = ['uid', i + '_max']
    if gn.empty == True:
        gn = tp
    else:
        gn = pd.merge(gn, tp, on='uid', how='left')
    # min, ignoring NaN
    tp = pd.DataFrame(df.groupby('uid').apply(lambda df: np.nanmin(df[i])).reset_index())
    tp.columns = ['uid', i + '_min']
    if gn.empty == True:
        gn = tp
    else:
        gn = pd.merge(gn, tp, on='uid', how='left')
    # variance, ignoring NaN
    tp = pd.DataFrame(df.groupby('uid').apply(lambda df: np.nanvar(df[i])).reset_index())
    tp.columns = ['uid', i + '_var']
    if gn.empty == True:
        gn = tp
    else:
        gn = pd.merge(gn, tp, on='uid', how='left')
    # range (max - min); this reuses the '_var' suffix, so after the merge the two
    # columns are renamed to i + '_var_x' (variance) and i + '_var_y' (range)
    tp = pd.DataFrame(df.groupby('uid').apply(lambda df: np.nanmax(df[i]) - np.nanmin(df[i])).reset_index())
    tp.columns = ['uid', i + '_var']
    if gn.empty == True:
        gn = tp
    else:
        gn = pd.merge(gn, tp, on='uid', how='left')
In [40]:
gn.columns
Out[40]:
Index(['uid', 'oil_amount_cnt', 'oil_amount_num', 'oil_amount_tot',
'oil_amount_avg', 'oil_amount_max', 'oil_amount_min',
'oil_amount_var_x', 'oil_amount_var_y', 'discount_amount_cnt',
'discount_amount_num', 'discount_amount_tot', 'discount_amount_avg',
'discount_amount_max', 'discount_amount_min', 'discount_amount_var_x',
'discount_amount_var_y', 'sale_amount_cnt', 'sale_amount_num',
'sale_amount_tot', 'sale_amount_avg', 'sale_amount_max',
'sale_amount_min', 'sale_amount_var_x', 'sale_amount_var_y',
'amount_cnt', 'amount_num', 'amount_tot', 'amount_avg', 'amount_max',
'amount_min', 'amount_var_x', 'amount_var_y', 'pay_amount_cnt',
'pay_amount_num', 'pay_amount_tot', 'pay_amount_avg', 'pay_amount_max',
'pay_amount_min', 'pay_amount_var_x', 'pay_amount_var_y',
'coupon_amount_cnt', 'coupon_amount_num', 'coupon_amount_tot',
'coupon_amount_avg', 'coupon_amount_max', 'coupon_amount_min',
'coupon_amount_var_x', 'coupon_amount_var_y',
'payment_coupon_amount_cnt', 'payment_coupon_amount_num',
'payment_coupon_amount_tot', 'payment_coupon_amount_avg',
'payment_coupon_amount_max', 'payment_coupon_amount_min',
'payment_coupon_amount_var_x', 'payment_coupon_amount_var_y'],
dtype='object')
Count the number of distinct values for the dstc_lst variables
In [41]:
gc = pd.DataFrame()
for i in dstc_lst:
    # number of distinct values per uid
    tp = pd.DataFrame(df.groupby('uid').apply(lambda df: len(set(df[i]))).reset_index())
    tp.columns = ['uid', i + '_dstc']
    if gc.empty == True:
        gc = tp
    else:
        gc = pd.merge(gc, tp, on='uid', how='left')
Combine all the variables
In [42]:
fn = pd.merge(base,gn,on= 'uid')
fn = pd.merge(fn,gc,on= 'uid')
fn.shape
Out[42]:
(11099, 67)
In [43]:
fn = fn.fillna(0)
In [14]:
fn.head(100)
Out[14]:
 | uid | create_dt | oil_actv_dt | class_new | bad_ind | dtn | oil_amount_cnt | oil_amount_num | oil_amount_tot | oil_amount_avg | … | payment_coupon_amount_max | payment_coupon_amount_min | payment_coupon_amount_var_x | payment_coupon_amount_var_y | payment_coupon_amount_var | channel_code_dstc | oil_code_dstc | scene_dstc | source_app_dstc | call_source_dstc
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | B96436391985035703 | 2018-10-08 | 2018-10-08 | B | 0 | 0 | 1 | 0 | 0.00 | 0.00 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 1 | 1 | 1 |
1 | B96436391984693397 | 2018-10-11 | 2018-10-11 | E | 0 | 0 | 1 | 0 | 0.00 | 0.00 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 1 | 1 | 1 |
2 | B96436391977217468 | 2018-10-17 | 2018-10-17 | B | 0 | 0 | 1 | 0 | 0.00 | 0.00 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 1 | 1 | 1 |
3 | B96436391976480892 | 2018-09-28 | 2018-09-28 | B | 0 | 0 | 1 | 0 | 0.00 | 0.00 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 1 | 1 | 1 |
100 rows × 74 columns
Train a decision tree model
In [44]:
x = fn.drop(['uid','oil_actv_dt','create_dt','bad_ind','class_new'],axis = 1)
y = fn.bad_ind.copy()
from sklearn import tree
dtree = tree.DecisionTreeRegressor(max_depth = 2,min_samples_leaf = 500,min_samples_split = 5000)
dtree = dtree.fit(x,y)
Render the decision tree and use it to make decisions
In [45]:
import pydotplus
from IPython.display import Image
from io import StringIO  # sklearn.externals.six has been removed from recent scikit-learn versions
import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'
with open("dt.dot", "w") as f:
    tree.export_graphviz(dtree, out_file=f)
dot_data = StringIO()
tree.export_graphviz(dtree, out_file=dot_data,
                     feature_names=x.columns,
                     class_names=['bad_ind'],
                     filled=True, rounded=True,
                     special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
In the rendered tree, the value shown in each node is the mean of bad_ind for that node, i.e. value = badrate.
In [19]:
sum(fn.bad_ind)/len(fn.bad_ind)
Out[19]:
0.04658077304261645
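To make "value = badrate" concrete, a small check (my addition) that groups the training rows by the leaf they fall into; the mean of bad_ind per leaf equals the value shown in the plotted tree:

```python
import pandas as pd

leaf = dtree.apply(x)                   # index of the leaf each sample lands in
print(pd.Series(y.values).groupby(leaf).agg(['mean', 'size']))   # bad rate and size per leaf
```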