Data Binning 6: WOE Transformation of Binning Results

For the WOE formula and its meaning, see: Feature Selection 7: Selecting Features by WOE (Weight of Evidence) / IV (Information Value) (Supervised Selection)

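WOE transformation replaces each original feature value with the WOE of the bin it falls into. In general this does not change predictive accuracy, but it makes the model easier to interpret.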
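To make the replacement concrete, here is a minimal sketch that computes WOE by hand for one already-binned feature using pandas. It assumes the convention WOE_i = ln((pos_i / pos_total) / (neg_i / neg_total)); some references use the inverse ratio, which flips the sign. The toy data, the helper name woe_by_bin, and the smoothing constant eps are illustrative assumptions and are not part of scorecardbundle.

```python
import numpy as np
import pandas as pd


def woe_by_bin(bins: pd.Series, y: pd.Series, eps: float = 1e-10) -> pd.Series:
    """Return the WOE of each bin (hypothetical helper, for illustration only)."""
    counts = pd.crosstab(bins, y)            # rows: bins, columns: class 0 / class 1
    pos_share = counts[1] / counts[1].sum()  # share of all positives that fall in each bin
    neg_share = counts[0] / counts[0].sum()  # share of all negatives that fall in each bin
    # eps (an assumed smoothing constant) keeps the log finite if a bin holds only one class
    return np.log((pos_share + eps) / (neg_share + eps))


# Toy data: one feature that has already been binned into three groups
bins = pd.Series(["low", "low", "low", "mid", "mid", "mid", "high", "high"])
y = pd.Series([0, 0, 1, 0, 1, 1, 1, 0])

woe_map = woe_by_bin(bins, y)
encoded = bins.map(woe_map)  # every original value is replaced by the WOE of its bin
print(woe_map)
print(encoded.tolist())
```

In this toy case the "low" bin holds a quarter of the positives and half of the negatives, so its WOE is about ln(0.5); "mid" is the mirror image at about ln(2); "high" holds equal shares, so its WOE is about 0.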

For a deeper understanding, see:

  • Risk-control models: an in-depth look at understanding and applying the WOE and IV metrics: https://zhuanlan.zhihu.com/p/80134853

Example code

We use a binning library, scorecardbundle, for both the binning and the WOE encoding.
scorecardbundle GitHub home page: https://github.com/Lantianzz/Scorecard-Bundle
scorecardbundle documentation: https://scorecard-bundle.bubu.blue/English/1.intro.html

import pandas as pd
from sklearn.datasets import make_classification
from scorecardbundle.feature_discretization import ChiMerge as cm
from scorecardbundle.feature_encoding import WOE as woe


def get_dataset():
    """Build a toy binary-classification dataset: 6 features, 4 of them informative."""
    data_x, data_y = make_classification(n_samples=1000, n_classes=2, n_features=6, n_informative=4, random_state=0)
    features = pd.DataFrame(data_x, columns=['x1', 'x2', 'x3', 'x4', 'x5', 'x6'])
    label = pd.Series(data_y, name="y_label")
    return features, label


if __name__ == '__main__':
    x_value, y_value = get_dataset()

    # Binning: supervised discretization with ChiMerge
    trans_cm = cm.ChiMerge(max_intervals=10, min_intervals=2, decimal=3, output_dataframe=True)
    result_cm = trans_cm.fit_transform(x_value, y_value)
    boundaries = trans_cm.boundaries_  # bin boundaries of each feature (print to inspect)

    # WOE encoding of the binned features
    trans_woe = woe.WOE_Encoder(output_dataframe=True)
    result_woe = trans_woe.fit_transform(result_cm, y_value)
    print(trans_woe.iv_)  # IV of each feature
    print(trans_woe.result_dict_)  # WOE dictionary and IV value for each feature

Output:

{'x1': 1.6346317045813659, 'x2': 0.8150596934872387, 'x3': 2.1554523388874394, 'x4': 0.09694014765234728, 'x5': 4.0535459433655925, 'x6': 0.7650559687769136}
{'x1': ({'-0.089~0.574': -0.14034198335669795, '-0.49~-0.089': -1.2982402309053493, '-1.819~-0.49': -0.6471710273616801, '-1.856~-1.819': 1.406295027699045, '-3.086~-1.856': -0.2676814057179092, '-inf~-3.086': 1.1351422572511074, '0.574~0.627': -2.177223909860503, '0.627~1.396': 0.02000066670100254, '1.396~3.756': 1.13436131227068, '3.756~inf': 26.041583870099096}, 1.6346317045813659), 'x2': ({'-1.761~0.542': -0.15463536598804814, '-2.661~-1.761': -0.7879220283290467, '-inf~-2.661': -3.643560975704789, '0.542~0.943': 0.6390398750588908, '0.943~1.866': 1.6056279303592254, '1.866~2.056': 0.24314421797227492, '2.056~inf': 2.483853907178159}, 0.8150596934872387), 'x3': ({'-0.832~0.868': 0.046387421872776745, '-1.606~-0.832': -0.8107678916070944, '-1.628~-1.606': 0.8672985270005281, '-1.794~-1.628': -0.8272971935574127, '-1.823~-1.794': 0.8672985270005281, '-2.661~-1.823': -0.827297193555032, '-2.737~-2.661': 1.406295027699045, '-inf~-2.737': -0.3854644413598186, '0.868~2.802': 1.4918172011816269, '2.802~inf': 26.447048978207263}, 2.1554523388874394), 'x4': ({'-0.706~0.394': -0.2729864579462302, '-0.776~-0.706': 1.406295027699045, '-1.904~-0.776': -0.2754635461586352, '-2.181~-1.904': 0.7131478472036047, '-2.424~-2.181': -0.47247581833830427, '-inf~-2.424': 0.27886230059432987, '0.394~0.619': 0.274892916306179, '0.619~0.752': -0.4367577357396309, '0.752~inf': 0.1614123207389391}, 0.09694014765234728), 'x5': ({'-0.81~0.146': -1.4518158675271366, '-1.099~-0.81': -0.5060924291285228, '-1.391~-1.099': 0.38396604387173733, '-1.44~-1.391': -1.0786116217760677, '-2.204~-1.44': 0.32228153854905195, '-2.852~-2.204': -0.935510778169474, '-inf~-2.852': -23.025850929940457, '0.146~0.881': -0.008986870170557375, '0.881~2.227': 1.5973502645022293, '2.227~inf': 3.8041903004751383}, 4.0535459433655925), 'x6': ({'-0.827~0.314': -0.11353072590854493, '-1.026~-0.827': -1.366293694126287, '-1.374~-1.026': -0.15184959020824298, '-1.47~-1.374': -2.1200654960643353, '-1.962~-1.47': -0.5525185259954828, '-4.027~-1.962': 0.6390398750607955, '-inf~-4.027': 25.348436689539152, '0.314~1.844': 0.39212935451196373, '1.844~inf': -0.3664162463898471}, 0.7650559687769136)}

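Here result_woe holds the WOE-encoded features, which is the result we are after; trans_woe is the fitted encoder, whose iv_ and result_dict_ attributes are printed above.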
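A few bins in the output carry WOE values of very large magnitude, for example about 26.04 for x1 and about -23.03 for x5. The value -23.0258... equals ln(1e-10), which suggests (this is an assumption about the library's internals, not something the output confirms) that such bins contain samples of only one class and a small smoothing constant keeps the logarithm finite. Below is a minimal sketch for flagging such bins, assuming a dictionary shaped like trans_woe.result_dict_. The helper name find_extreme_bins and the threshold of 10 are illustrative choices.

```python
def find_extreme_bins(result_dict, threshold=10.0):
    """List (feature, interval, woe) for bins whose |WOE| exceeds threshold.

    Hypothetical helper; expects {feature: ({interval: woe}, iv)}.
    """
    flagged = []
    for feature, (woe_dict, _iv) in result_dict.items():
        for interval, woe_value in woe_dict.items():
            if abs(woe_value) > threshold:
                flagged.append((feature, interval, woe_value))
    return flagged


# A small excerpt of the printed result (values rounded for readability)
sample = {
    "x4": ({"-0.706~0.394": -0.273, "0.752~inf": 0.161}, 0.097),
    "x5": ({"-inf~-2.852": -23.026, "2.227~inf": 3.804}, 4.054),
}
print(find_extreme_bins(sample))  # [('x5', '-inf~-2.852', -23.026)]
```

Flagged bins are usually candidates for merging with a neighboring bin, since an extreme WOE also inflates the feature's IV.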
