Python Credit Card Default Prediction Analysis_Kaggle Competition: German Credit Card Default Data Analysis

Data description

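German Credit Data. Let's look at the format of the data. The attribute listing below, and the crx.data file loaded later, correspond to the UCI Credit Approval data set.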

A1 through A15 are 15 features, A16 is the label column, and there are 690 records in total. One record is shown below as an example:

A1  A2     A3  A4  A5  A6  A7  A8    A9  A10  A11  A12  A13  A14    A15  A16
b   30.83  0   u   g   w   v   1.25  t   t    01   f    g    00202  0    +

Attribute Information:

A1: b, a.

A2: continuous.

A3: continuous.

A4: u, y, l, t.

A5: g, p, gg.

A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.

A7: v, h, bb, j, n, z, dd, ff, o.

A8: continuous.

A9: t, f.

A10: t, f.

A11: continuous.

A12: t, f.

A13: g, p, s.

A14: continuous.

A15: continuous.

A16: +,- (class attribute)

Missing Attribute Values:

37 cases (5%) have one or more missing values. The missing values from particular attributes are:

A1: 12

A2: 12

A4: 6

A5: 6

A6: 9

A7: 9

A14: 13
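Counts like these can be checked by loading the file with "?" treated as a missing value. The following is a minimal sketch under stated assumptions: it uses a three-row inline sample instead of the real file, only the first row is the example record shown earlier, and the other two rows are invented to show how "?" is handled. The column names A1 to A16 follow the attribute list.

```python
import io
import pandas as pd

# Tiny inline stand-in for crx.data. Only the first row is the example record
# from the text; the other two rows are invented to demonstrate "?" handling.
sample = io.StringIO(
    "b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+\n"
    "?,?,1.5,u,g,?,?,0.5,f,f,0,f,g,?,10,-\n"
    "a,24.50,0.5,?,?,q,h,1.5,t,f,0,f,g,00280,824,+\n"
)

columns = ["A%d" % i for i in range(1, 17)]

# header=None: the sample has no header row; na_values="?" marks missing entries.
df = pd.read_csv(sample, header=None, names=columns, na_values="?")

# Missing values per attribute, keeping only attributes that have any.
missing = df.isnull().sum()
missing = missing[missing > 0]
print(missing)

# Number of rows with at least one missing value.
rows_with_missing = int(df.isnull().any(axis=1).sum())
print(rows_with_missing)
```

On the real file, the same two expressions should be comparable with the per-attribute counts and the "37 cases" figure listed above.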

Class Distribution

+: 307 (44.5%)

-: 383 (55.5%)
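The percentages follow directly from the counts, and they also give a useful reference point: a model that always predicts the majority class "-" would already reach about 55.5% accuracy. A short arithmetic sketch (the variable names are illustrative only):

```python
# Class counts as listed above.
n_pos, n_neg = 307, 383
total = n_pos + n_neg

pos_share = n_pos / total
neg_share = n_neg / total

# Accuracy of a trivial model that always predicts the majority class "-".
baseline_accuracy = max(pos_share, neg_share)

print(total)                               # 690
print(round(pos_share * 100, 1))           # 44.5
print(round(neg_share * 100, 1))           # 55.5
print(round(baseline_accuracy * 100, 1))   # 55.5
```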

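Data processing and analysis

The data-processing workflow is shown below. It mainly handles missing values, then processes the features separately depending on whether they are continuous or categorical, and finally fits the LogisticRegression class from sklearn. The code is commented in detail.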

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Read the data (crx.data has no header row, so header=None keeps the first record)
data = pd.read_csv("./crx.data", header=None)

# Assign column names
data.columns = ["f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8", "f9", "f10", "f11", "f12", "f13", "f14", "f15", "label"]

# Map the label values to integers
label_mapping = {
    "+": 1,
    "-": 0
}
data["label"] = data["label"].map(label_mapping)

# Handle missing values: "?" marks a missing entry
data = data.replace("?", np.nan)

# Convert the object-typed columns to float
data["f2"] = pd.to_numeric(data["f2"])
data["f14"] = pd.to_numeric(data["f14"])

# For continuous features, replace missing values with the column mean
data["f2"] = data["f2"].fillna(data["f2"].mean())
data["f3"] = data["f3"].fillna(data["f3"].mean())
data["f8"] = data["f8"].fillna(data["f8"].mean())
data["f11"] = data["f11"].fillna(data["f11"].mean())
data["f14"] = data["f14"].fillna(data["f14"].mean())
data["f15"] = data["f15"].fillna(data["f15"].mean())

# For categorical features, replace missing values with a new, distinct category
data["f1"] = data["f1"].fillna("c")
data["f4"] = data["f4"].fillna("s")
data["f5"] = data["f5"].fillna("gp")
data["f6"] = data["f6"].fillna("hh")
data["f7"] = data["f7"].fillna("ee")
data["f13"] = data["f13"].fillna("ps")

# Map the t/f features to 1/0
tf_mapping = {
    "t": 1,
    "f": 0
}
data["f9"] = data["f9"].map(tf_mapping)
data["f10"] = data["f10"].map(tf_mapping)
data["f12"] = data["f12"].map(tf_mapping)

# One-hot encode the categorical features
data = pd.get_dummies(data)
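To make the effect of these preprocessing steps concrete, here is a minimal sketch on a toy frame. The values are invented for illustration; only the column names f2, f4 and label and the placeholder category "s" mirror the code above.

```python
import numpy as np
import pandas as pd

# Toy frame (invented values): one continuous and one categorical feature,
# each with a missing entry, plus a label column.
toy = pd.DataFrame({
    "f2": [20.0, np.nan, 40.0],
    "f4": ["u", "y", np.nan],
    "label": [1, 0, 1],
})

# Continuous feature: replace the missing value with the column mean (30.0).
toy["f2"] = toy["f2"].fillna(toy["f2"].mean())

# Categorical feature: give missing values their own placeholder category.
toy["f4"] = toy["f4"].fillna("s")

# One-hot encode: only the string column is expanded; numeric columns are kept.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))  # ['f2', 'label', 'f4_s', 'f4_u', 'f4_y']
```

Giving missing categorical values their own category lets the model learn whether "missing" is itself informative, at the cost of one extra column per feature.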

from sklearn.linear_model import LogisticRegression

# Shuffle the rows
shuffled_rows = np.random.permutation(data.index)
data = data.loc[shuffled_rows]

# Split into a training set and a local test set (70% / 30%)
highest_train_row = int(data.shape[0] * 0.70)
train = data.iloc[0:highest_train_row]
loc_test = data.iloc[highest_train_row:]

# Every column except label is a feature
features = train.drop(["label"], axis = 1).columns

model = LogisticRegression()
X_train = train[features]
y_train = train["label"] == 1
model.fit(X_train, y_train)

X_test = loc_test[features]
# model.predict returns class predictions (True/False here), not probabilities
test_prob = model.predict(X_test)
test_label = loc_test['label']

# Accuracy on the local test set
accuracy_test = (test_prob == loc_test["label"]).mean()
print(accuracy_test)

0.835748792271
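As an alternative to permuting the index and slicing by hand, scikit-learn's train_test_split shuffles by default and can keep the class ratio the same in both parts. The sketch below uses synthetic data rather than the credit data; test_size=0.3 mirrors the 70/30 split above, and the random_state value is an arbitrary choice for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed data: 100 rows, 3 features,
# with 40 positive and 60 negative labels (values invented for illustration).
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 40 + [0] * 60)

# shuffle=True is the default; stratify=y keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)             # (70, 3) (30, 3)
print(int(y_train.sum()), int(y_test.sum()))   # 28 12
```

Stratifying matters most when the source file is ordered by class or the classes are unbalanced, since a plain positional split can then produce a test set with a very different class mix.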

from sklearn import metrics

# AUC on the validation set
test_auc = metrics.roc_auc_score(test_label, test_prob)
print(test_auc)

0.835748792271
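ROC AUC is normally computed from a continuous score, such as the predicted probability of the positive class, rather than from hard 0/1 predictions. When hard predictions are passed in, the ROC curve has a single interior point and the AUC reduces to balanced accuracy, the mean of the true positive and true negative rates. The following is a sketch on synthetic data, not the author's pipeline; make_classification, max_iter=1000 and the random_state values are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the credit data (illustration only).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Hard 0/1 predictions: suitable for accuracy.
y_pred = model.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)

# Probability of the positive class: the usual input for ROC AUC.
y_score = model.predict_proba(X_test)[:, 1]
auc_from_scores = metrics.roc_auc_score(y_test, y_score)

# AUC computed from hard labels collapses to balanced accuracy.
auc_from_labels = metrics.roc_auc_score(y_test, y_pred)
balanced_acc = metrics.balanced_accuracy_score(y_test, y_pred)

print(accuracy, auc_from_scores, auc_from_labels, balanced_acc)
```

In the listing above, the equivalent change would be to pass model.predict_proba(X_test)[:, 1] to roc_auc_score while keeping model.predict for the accuracy calculation.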

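A simple logistic regression gives an accuracy of 0.835748792271 and an AUC of 0.835748792271 (the AUC is computed from the class predictions returned by model.predict, not from predicted probabilities). This is a decent result; the next step is to tune the model to improve accuracy further.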
