数据挖掘实战之信用卡违约率分析

本文通过针对台湾某银行 2005 年 4 月到 9 月的信用卡数据这一数据集构建一个分析信用卡违约率的分类器。

数据来源https://github.com/cystanford/credit_default

数据挖掘实战之信用卡违约率分析_第1张图片

1、数据加载和探索:

数据完整,没有缺失值

数据挖掘实战之信用卡违约率分析_第2张图片

#查看下一个月的违约情况
default = data['default.payment.next.month'].value_counts()
default

df = pd.DataFrame({'default.payment.next.month':default.index,'values':default.values})  #barplot的data参数需要是Dataframe或者array
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
plt.rcParams['font.sans-serif'] = ['SimHei']  #用来正常显示中文标签
plt.figure(figsize = (6,6))
plt.title('信用卡违约率客户\n(违约:1,守约:0)')
sns.set_color_codes('pastel')
sns.barplot(x = 'default.payment.next.month',y = 'values',data = df)
locs,labels = plt.xticks()

数据挖掘实战之信用卡违约率分析_第3张图片

#特征选择
data.drop(['ID'],inplace = True,axis = 1)
target = data['default.payment.next.month'].values
columns = data.columns.tolist()  #data.columns返回array,可以通过tolist(),或者list(array)转换为list,一般tolist()效率更高
columns.remove('default.payment.next.month')
features = data[columns].values
#30%作为测试集,其余作为训练集
from sklearn.cross_validation import train_test_split
train_x,test_x,train_y,test_y = train_test_split(features,target,test_size = 0.30,stratify = target,random_state = 1)

2、分类阶段:因为不确定哪个分类器的分类效果好,构造了SVM、决策树、随机森林、KNN、Adaboost分类器,

然后通过 GridSearchCV 工具,找到每个分类器的最优参数和最优分数,最终找到适合这个项目的分类器和该分类器的参数。

#构造各种分类器
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
classifiers = [
    SVC(random_state = 1,kernel = 'rbf'),
    DecisionTreeClassifier(random_state = 1,criterion = 'gini'),
    RandomForestClassifier(random_state = 1,criterion = 'gini'),
    KNeighborsClassifier(metric = 'minkowski'),
    AdaBoostClassifier(random_state = 1)]


#分类器名称
classifiers_names = ['svc',
                     'decisionTreeClassifier',
                     'randomForestClassifier',
                     'kneighborsClassifier',
                     'adaBoostClassifier']


#分类器参数
classifiers_param_grid = [{'svc__C':[1],'svc__gamma':[0.01]},
                         {'decisionTreeClassifier__max_depth':[6,9,11]},
                         {'randomForestClassifier__n_estimators':[3,5,6]},
                         {'kneighborsClassifier__n_neighbors':[4,6,8]},
                         {'adaBoostClassifier__n_estimators':[10,50,100]}]
#对具体的分类器进行GridSearchCV参数调优
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import learning_curve
def GridSearchCV_work(pipeline,train_x,train_y,test_x,test_y,param_grid,score = 'accuracy'):
    response = {}
    gridsearch = GridSearchCV(estimator = pipeline,param_grid = param_grid,scoring = score)
    search = gridsearch.fit(train_x,train_y)
    print('GridSearch 最优参数:',search.best_params_)
    print('GridSearch 最优分数:%0.4lf' %search.best_score_)
    predict_y = gridsearch.predict(test_x)
    print('准确率 %0.4lf' %accuracy_score(test_y,predict_y))    
    response['predict_y'] = predict_y
    response['accuracy_score'] = accuracy_score(test_y,predict_y)
    return response


for model,model_name,model_param_grid in zip(classifiers,classifiers_names,classifiers_param_grid):
    pipeline = Pipeline([('scaler',StandardScaler()),
                        (model_name,model)])
    result = GridSearchCV_work(pipeline,train_x,train_y,test_x,test_y,model_param_grid,score = 'accuracy')

数据挖掘实战之信用卡违约率分析_第4张图片

可以看到SVM分类器的准确率最高为0.8172,最优分数为0.8174。

 

 

你可能感兴趣的:(数据分析实战)