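A binary classification case study for a discrete random variable

Goal: use logistic regression to predict credit card fraud as a binary classification problem

Workflow:

Read the credit card usage dataset from a CSV file, plot it, and get a high-level view of the data

Standardize the columns whose value range differs greatly from the other columns

Based on that analysis, undersample the data

Split the undersampled data into a training set and a test set

Split the training set further into training and validation sets

Train and predict with a logistic regression model and, after cross-validation, pick the value of the regularization parameter (and the model) with the highest mean recall

Use a confusion matrix to analyze the model's predictions on the test set

Extension: oversampling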


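Preparation

python >= 3.7
matplotlib >= 3.3.2
numpy >= 1.19.2
scikit-learn (sklearn) >= 0.23.2
pandas >= 1.1.3

First we read the data and use matplotlib to draw a bar chart of the Class feature (0: no fraud) to get a feel for what the data looks like: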

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('creditcard.csv')
# Count how many records fall into each value of 'Class'
count_classes = data['Class'].value_counts().sort_index()
# Draw a bar chart of the counts
count_classes.plot(kind='bar')
plt.title('Fraud class histogram')
plt.xlabel('Class')
plt.ylabel('Frequency')
plt.show()

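The result looks like this:

Distribution of normal vs. fraudulent credit card records

Wait, what?! There are no fraud records in this data at all? Then what is there to play with!
But after zooming in on the bar for label 1:

Zoomed-in view

Turns out I was overthinking it, so let's get back to work.
The bar chart makes one thing obvious:
In this dataset, normal records (Class == 0) vastly outnumber fraudulent ones (Class == 1). Training a model directly on data this imbalanced, without preprocessing, is asking for trouble, because a model can score well simply by predicting "normal" every time. So we need to make the class distribution more balanced. Two approaches:

Undersampling:

Randomly sample from the majority class until it has the same number of samples as the minority class, so that the two classes are balanced

Oversampling:

Use a data generation strategy to create new minority-class samples until the minority class is as large as the majority class

Here we start with undersampling (oversampling is covered separately later).
Looking at the data, we also notice that the values in the Amount column span a very different range from the other features, so we standardize that column

The Amount column sticks out from the rest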

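from sklearn.preprocessing import StandardScaler

# Keep the features on comparable scales: preprocess Amount (normalization or standardization)
data['normAmout'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1,1))
# Drop the columns we no longer use (axis=0: drop rows, axis=1: drop columns)
data = data.drop(['Time','Amount'],axis=1)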
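
To make the standardization step concrete, here is a minimal sketch on made-up amounts (the values are illustrative and do not come from creditcard.csv). StandardScaler subtracts the column mean and divides by the population standard deviation; the sketch reproduces that formula with plain NumPy.

```python
import numpy as np

# Made-up transaction amounts, for illustration only
amounts = np.array([1.0, 2.5, 10.0, 250.0, 1200.0])

# Z-score: subtract the mean, divide by the population standard deviation (ddof=0),
# which is the same formula StandardScaler applies to each column
standardized = (amounts - amounts.mean()) / amounts.std()

print(standardized.mean())  # approximately 0
print(standardized.std())   # approximately 1
```

After this transform the column has mean 0 and unit variance, which puts it on a scale comparable to the other features.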

Build the features and labels:

X = data.loc[:,data.columns != 'Class'] # all feature columns except Class
y = data.loc[:,data.columns == 'Class'] # the Class column

Undersampling:

# Count the records with Class == 1 (fraud). data.Class == 1 gives a boolean mask that is True where Class is 1; indexing with it selects those rows
number_records_fraud = len(data[data.Class == 1])
# Index labels of the records with Class == 1
fraud_indices = np.array(data[data.Class == 1].index)

# Index labels of the records with Class == 0
normal_indices = data[data.Class == 0].index

# Randomly draw as many Class 0 labels as there are Class 1 records (replace=False: sample without replacement)
random_normal_indices = np.random.choice(normal_indices,number_records_fraud,replace=False)
random_normal_indices = np.array(random_normal_indices)

# Merge the Class 1 labels with the Class 0 labels we just sampled
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

# Select the rows for the merged labels (these are index labels, so .loc is the matching accessor)
under_sample_data = data.loc[under_sample_indices,:]

# All feature columns except Class
X_undersample = under_sample_data.loc[:,under_sample_data.columns != 'Class']
# The Class column
y_undersample = under_sample_data.loc[:,under_sample_data.columns == 'Class']
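
The snippet above draws a different random subset on every run because the generator is not seeded. As a hedged sketch of a reproducible variant, the following uses a seeded NumPy generator on made-up labels (the label array and the seed are assumptions for illustration, not part of the original code).

```python
import numpy as np

# Toy labels: 95 normal records (0) and 5 fraudulent ones (1); made up for illustration
labels = np.array([0] * 95 + [1] * 5)

rng = np.random.default_rng(0)  # fixed seed so the same rows are drawn every run

fraud_idx = np.flatnonzero(labels == 1)
normal_idx = np.flatnonzero(labels == 0)

# Draw as many normal rows as there are fraud rows, without replacement
sampled_normal_idx = rng.choice(normal_idx, size=len(fraud_idx), replace=False)

balanced_idx = np.concatenate([fraud_idx, sampled_normal_idx])
print(np.bincount(labels[balanced_idx]))  # [5 5]
```

With a fixed seed, the confusion matrix further down would come out the same on every run, which makes results easier to compare.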

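Next, we split both the original dataset and the undersampled dataset into test and training sets at a 3:7 ratio

# 30% test, 70% training; random_state=0 makes the shuffle return the same split every run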
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0)
X_train_undersample,X_test_undersample,y_train_undersample,y_test_undersample = train_test_split(X_undersample,y_undersample,test_size=0.3,random_state=0)

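Using KFold, we cut the training data into 5 folds to create validation sets

# Build the splitter: 5 folds, each a 4:1 split of training:validation, with the validation role rotating 5 times
fold = KFold(5, shuffle=False) # shuffle: whether to shuffle before splitting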
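
To see what the splitter actually hands back, here is a small sketch on ten dummy samples (the array is made up for illustration). Each iteration yields a pair of position arrays: the training positions and the validation positions.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy samples, just to show which positions land in each fold
samples = np.arange(10)

fold = KFold(n_splits=5, shuffle=False)
splits = list(fold.split(samples))

for i, (train_idx, val_idx) in enumerate(splits, start=1):
    print('Fold', i, 'train:', train_idx, 'validation:', val_idx)
```

Because shuffle=False, the validation folds are consecutive blocks, and every sample is used for validation exactly once.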

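Next, we apply L1 regularization to the binary classifier and try several penalty settings. Note that in scikit-learn's LogisticRegression, C is the inverse of the regularization strength, so a smaller C means a stronger penalty. The values of C we try are: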

c_param_range = [0.01, 0.1, 1, 10, 100]
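
A rough way to see what these values do: with an L1 penalty, a small C (strong penalty) tends to push more coefficients to exactly zero. The sketch below counts non-zero coefficients for each C on synthetic data from make_classification, which stands in for the credit card set purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the credit card set): 20 features, only 3 informative
X_demo, y_demo = make_classification(n_samples=500, n_features=20, n_informative=3,
                                     n_redundant=0, random_state=0)

nonzero_counts = {}
for c_param in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
    model.fit(X_demo, y_demo)
    # Count coefficients the L1 penalty did not shrink to exactly zero
    nonzero_counts[c_param] = int(np.count_nonzero(model.coef_))

print(nonzero_counts)
```

The exact counts depend on the data, so none are quoted here; the general pattern to look for is fewer surviving coefficients at small C.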

Build the logistic regression model, train it, predict, and compute recall, all in one pass inside the cross-validation for loop:

# c_param, results_table, j, x_train_data and y_train_data come from the enclosing function in the complete code below
recall_accs = []
# Split the data into folds
index = fold.split(y_train_data)
for iteration, indices in enumerate(index, start=1):

    # Build the model with L1 regularization
    lr = LogisticRegression(C = c_param, penalty = 'l1',solver='liblinear')
    # Train on the training part of the fold
    lr.fit(x_train_data.iloc[indices[0], :], y_train_data.iloc[indices[0], :].values.ravel())
    # Predict on the validation part of the fold
    y_pred_undersample = lr.predict(x_train_data.iloc[indices[1], :])

    # Compute recall
    recall_acc = recall_score(y_train_data.iloc[indices[1], :].values, y_pred_undersample)
    recall_accs.append(recall_acc)
    print('Iteration ', iteration, ': recall score = ', recall_acc)

# Mean recall over the 5 folds for this value of C
results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
j += 1
print('')
print('Mean recall score ', np.mean(recall_accs))
print('')

TIPS

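1. Recall: for example, if a group of people contains 10 patients and I detect 2 of them, recall = 2/10. We judge the model by recall rather than accuracy, because on data this imbalanced a model that labels everything as normal still gets very high accuracy
2. TP (true positive): sick people the model correctly flags as sick (hits). Running example: 100 people, 10 of them sick, and the model flags 8 of the sick ones, so TP = 8
3. FP (false positive): healthy people the model wrongly flags as sick (hits that should not have been hits). In the running example, 5 healthy people are flagged, so FP = 5
4. FN (false negative): sick people the model treats as healthy (misses). In the running example, 2 sick people are missed, so FN = 2
5. TN (true negative): healthy people the model correctly leaves alone. In the running example, TN = 85
6. Therefore recall = TP/(TP+FN) = 8/(8+2) = 0.8. The confusion matrix gives all four numbers: TP, FP, FN, TN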
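
The four counts and the recall formula from the tips can be checked on a small hypothetical screening result (100 people, 10 of them sick; the labels below are invented for illustration).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical screening of 100 people: 10 sick (1), 90 healthy (0)
y_true = np.array([1] * 10 + [0] * 90)
# Hypothetical predictions: 8 sick found, 2 sick missed, 5 healthy wrongly flagged
y_pred = np.array([1] * 8 + [0] * 2 + [1] * 5 + [0] * 85)

# For labels 0/1 scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                 # 85 5 2 8
print(tp / (tp + fn))                 # 0.8
print(recall_score(y_true, y_pred))   # 0.8
```

Note that accuracy here would be 93%, yet a model that flagged nobody would still reach 90%, which is why recall is the more telling number for the rare class.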

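At this point we have the mean recall over the 5 cross-validation folds for one value of C.
But there are four more values of C to try, so we wrap everything in an outer for loop; when it finishes we have 5 mean recall scores.
We then pick the C with the highest mean recall (note that this approach ignores false positives, i.e. legitimate cards wrongly flagged):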

best_c = results_table.loc[results_table['Mean recall score'].astype('float64').idxmax()]['C_parameter']
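
For comparison, the same search can be sketched more compactly with scikit-learn's cross_val_score and scoring='recall'. This is an alternative formulation rather than the code used in this article, and it runs on synthetic data from make_classification as a stand-in for the undersampled training set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, roughly balanced stand-in for the undersampled training set
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

fold = KFold(n_splits=5, shuffle=False)
mean_recall = {}
for c_param in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
    # scoring='recall' makes each fold report recall for the positive class (label 1)
    scores = cross_val_score(model, X_demo, y_demo, cv=fold, scoring='recall')
    mean_recall[c_param] = scores.mean()

# Pick the C whose mean recall across the folds is highest
best_c_demo = max(mean_recall, key=mean_recall.get)
print(mean_recall)
print('Best C:', best_c_demo)
```

The hand-written loop has the advantage of printing per-fold scores as it goes; this version trades that for brevity.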

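Now we have the best value of C, and with it the model to use.
So let's predict on the undersampled test set! While we're at it, let's draw a confusion matrix. The plotting code is as follows: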

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap='Blues'):
    """
    This function prints and plots the confusion matrix.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

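Use the confusion matrix to show the prediction results:

# Refit on the whole undersampled training set with the best C, as the complete code below does
lr = LogisticRegression(C = best_c, penalty = 'l1',solver='liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
# Predict on the undersampled test set
y_pred_undersample = lr.predict(X_test_undersample)

# Build the confusion matrix from the true and predicted labels
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)
np.set_printoptions(precision=2) # floating-point print precision
# Compute recall from the confusion matrix: TP / (FN + TP)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))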

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

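The result:

Confusion matrix

The result is pretty good: in this run only 10 fraudulent cards went undetected, and another 16 cards with normal spending were mistakenly flagged as fraud. (The exact counts change from run to run because the undersampling is random.)

Adjusting the sigmoid threshold

As you may know, a binary classifier like this one uses the sigmoid function to split the data into two classes. By default the cut-off (the threshold) is 0.5: a probability above 0.5 is labeled 1 and one below 0.5 is labeled 0. By changing this value we can adjust which way the classifier leans. For example, we could label a record 1 only when its probability is above 0.7. How do we do that?

The sigmoid function

Earlier we used lr.predict(X_test_undersample.values) to get the final labels directly, so every prediction was black or white. To adjust the threshold we need the actual probabilities, which lr.predict_proba(X_test_undersample.values) provides. For a binary problem it returns a matrix with n rows and 2 columns: one row per sample, the first column being the probability of class 0 and the second the probability of class 1, so the two columns in any row always sum to 1.
Then: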

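y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)
threshold = 0.7 # example value from the text above
# Take the second column (probability of class 1) for every row and compare it with the threshold, giving a boolean array
y_test_predictions_high_recall = y_pred_undersample_proba[:,1] > threshold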
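
To get a feel for the trade-off, the following sketch sweeps several thresholds and reports recall and the number of false positives for each. It uses synthetic data from make_classification and a fixed C=1, both of which are assumptions for illustration; in the tutorial you would use the undersampled split and best_c instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real tutorial would use the undersampled split
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

model = LogisticRegression(C=1, penalty='l1', solver='liblinear').fit(X_tr, y_tr)
proba = model.predict_proba(X_te)   # column 0: P(class 0), column 1: P(class 1)

recalls = []
for threshold in [0.1, 0.3, 0.5, 0.7, 0.9]:
    # Label a sample 1 only when its class-1 probability exceeds the threshold
    y_hat = (proba[:, 1] > threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, y_hat, labels=[0, 1]).ravel()
    recalls.append(tp / (tp + fn))
    print('threshold', threshold, 'recall', round(tp / (tp + fn), 3), 'false positives', fp)
```

Raising the threshold can only shrink the set of samples labeled 1, so recall never goes up as the threshold increases, while false positives never increase. Choosing a threshold is a matter of deciding which of the two errors costs more.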

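That covers prediction with undersampled data. Next, a look at oversampling

Extension: the oversampling trick

The idea is to generate synthetic data by randomly perturbing the features of the minority-class samples. The SMOTE algorithm does it as follows:

For each sample x in the minority class, compute its Euclidean distance to every other minority-class sample and take its k nearest neighbors.

Set a sampling ratio based on the class imbalance to determine the sampling rate N. For each minority-class sample x, randomly pick several samples from its k nearest neighbors; call a chosen neighbor xn.

For each randomly chosen neighbor xn, build a new sample from it and the original sample using the following formula.

SMOTE formula: x_new = x + rand(0, 1) * (xn - x)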
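
The three steps can be sketched in plain NumPy. This is a simplified illustration of the interpolation idea, not the implementation in imbalanced-learn; the function name smote_like_samples, its parameters, and the sample points are all hypothetical.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic rows by interpolating toward random k-nearest neighbors."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)           # a sample is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(minority))        # pick a minority sample x
        j = neighbors[i, rng.integers(k)]      # pick one of its k nearest neighbors xn
        gap = rng.random()                     # random number in [0, 1)
        synthetic[row] = minority[i] + gap * (minority[j] - minority[i])
    return synthetic

# Made-up 2-D minority points, for illustration only
minority_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
new_points = smote_like_samples(minority_demo, n_new=4)
print(new_points)
```

Every synthetic point lies on the line segment between a real minority sample and one of its neighbors, which is why the generated data stays inside the region the minority class already occupies.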

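Code:

from imblearn.over_sampling import SMOTE
oversampler=SMOTE(random_state=0) # random_state=0: generate the same data every run
# fit_resample replaces the older fit_sample, which recent imbalanced-learn releases no longer provide
os_X,os_y=oversampler.fit_resample(X_train,y_train)
os_features = pd.DataFrame(os_X)
os_labels = pd.DataFrame(os_y)

os_features and os_labels now contain both the original data and the newly generated data!
Data generation done!

Complete code (without the threshold adjustment and oversampling)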

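'''
Credit card fraud analysis
1. When the Class counts in the csv file differ too much (imbalanced classes), use oversampling or undersampling
2. Undersampling: keep sampling from the majority class until it has as many samples as the minority class, which balances the classes
3. Oversampling: use a data generation strategy to make the minority class as large as the other class
4. In NumPy, calling reshape(-1,2) on a [2,3] array means: keep the total number of elements (2x3=6), fix the second dimension at 2, and let the program work out the dimension marked -1 (here the first one); the result has shape [3,2]
5. Recall: for example, if a group of people contains 10 patients and I detect 2 of them, recall = 2/10. We judge the model by recall rather than accuracy
6. TP (true positive): sick people the model correctly flags as sick (hits). Running example: 100 people, 10 of them sick, and the model flags 8 of the sick ones, so TP = 8
7. FP (false positive): healthy people the model wrongly flags as sick (hits that should not have been hits). In the running example, FP = 5
8. FN (false negative): sick people the model treats as healthy (misses). In the running example, FN = 2
9. TN (true negative): healthy people the model correctly leaves alone. In the running example, TN = 85
10. Therefore recall = TP/(TP+FN) = 8/(8+2) = 0.8. The confusion matrix gives all four numbers: TP, FP, FN, TN
11. Left as an exercise: read up on L1 and L2 regularization
12. You can also evaluate against manually chosen recall and false positive targets by adjusting the sigmoid threshold
13. The SMOTE algorithm implements oversampling
14. With oversampling the false positive rate is very low, but recall is also somewhat lower
15. iloc vs. loc: loc takes index labels, iloc takes integer row positions
'''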
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold,cross_val_score
from sklearn.metrics import confusion_matrix,recall_score,classification_report
import itertools


data = pd.read_csv('creditcard.csv')
# # Count how many records fall into each value of 'Class'
# count_classes = data['Class'].value_counts().sort_index()
# # Draw a bar chart of the counts
# count_classes.plot(kind='bar')
# plt.title('Fraud class histogram')
# plt.xlabel('Class')
# plt.ylabel('Frequency')
# plt.show()

# The analysis shows that records with Class 1 and Class 0 are extremely imbalanced, so we undersample here

# Keep the features on comparable scales: preprocess Amount (normalization or standardization)
data['normAmout'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1,1))
data = data.drop(['Time','Amount'],axis=1) # Drop the columns we no longer use (axis=0: drop rows, axis=1: drop columns)


X = data.loc[:,data.columns != 'Class'] # all feature columns except Class
y = data.loc[:,data.columns == 'Class'] # the Class column

# Count the records with Class == 1 (fraud). data.Class == 1 gives a boolean mask that is True where Class is 1; indexing with it selects those rows
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index) # index labels of the records with Class == 1

normal_indices = data[data.Class == 0].index # index labels of the records with Class == 0

# Randomly draw as many Class 0 labels as there are Class 1 records (replace=False: sample without replacement)
random_normal_indices = np.random.choice(normal_indices,number_records_fraud,replace=False)
random_normal_indices = np.array(random_normal_indices)

# Merge the Class 1 labels with the Class 0 labels we just sampled
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

# Select the rows for the merged labels (these are index labels, so .loc is the matching accessor)
under_sample_data = data.loc[under_sample_indices,:]


X_undersample = under_sample_data.loc[:,under_sample_data.columns != 'Class'] # all feature columns except Class
y_undersample = under_sample_data.loc[:,under_sample_data.columns == 'Class'] # the Class column

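# Cross-validation: split the data into train and test, then split train into equal folds. With three folds ①②③, for example: train on ①+② and validate on ③, train on ②+③ and validate on ①, train on ①+③ and validate on ② (the code below uses five folds)

# Split into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0) # 30% test, 70% training; random_state=0 makes the shuffle return the same split every run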
X_train_undersample,X_test_undersample,y_train_undersample,y_test_undersample = train_test_split(X_undersample,y_undersample,test_size=0.3,random_state=0)


def printing_Kfold_scores(x_train_data, y_train_data):
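    # Build the splitter: 5 folds, each a 4:1 split of training:validation, with the validation role rotating 5 times
    fold = KFold(5, shuffle=False) # shuffle: whether to shuffle before splitting

    print(fold)

    # Five candidate values of C
    c_param_range = [0.01, 0.1, 1, 10, 100]
    # Build the results table: one row per value of C
    results_table = pd.DataFrame(index=range(len(c_param_range)), columns=['C_parameter', 'Mean recall score'])
    # Fill in the values of C
    results_table['C_parameter'] = c_param_range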

    # the k-fold will give 2 lists: train_indices = indices[0], test_indices = indices[1]
    j = 0
    for c_param in c_param_range: # one value of C at a time
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('-------------------------------------------')
        print('')

        recall_accs = []
        # Split the data into folds
        index = fold.split(y_train_data)
        for iteration, indices in enumerate(index, start=1):

            # Build the model with L1 regularization
            lr = LogisticRegression(C = c_param, penalty = 'l1',solver='liblinear')
            # Train on the training part of the fold
            lr.fit(x_train_data.iloc[indices[0], :], y_train_data.iloc[indices[0], :].values.ravel())
            # Predict on the validation part of the fold
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1], :])

            # Compute recall
            recall_acc = recall_score(y_train_data.iloc[indices[1], :].values, y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)

        # Mean recall over the 5 folds for this value of C
        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')



    print(results_table)
    best_c = results_table.loc[results_table['Mean recall score'].astype('float64').idxmax()]['C_parameter']

    # Report the value of C with the highest mean recall among the 5 candidates
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    print('*********************************************************************************')

    return best_c


best_c = printing_Kfold_scores(X_train_undersample,y_train_undersample)

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap='Blues'):
    """
    This function prints and plots the confusion matrix.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')



lr = LogisticRegression(C = best_c, penalty = 'l1',solver='liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
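# Predict on the test set with the model just fitted
y_pred_undersample = lr.predict(X_test_undersample)

# Build the confusion matrix from the true and predicted labels
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)
np.set_printoptions(precision=2) # floating-point print precision
# Compute recall from the confusion matrix: TP / (FN + TP)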
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()