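Finding Evidence of Fraud in Enron Emails

tags: machine learning algorithms

Project overview

In 2000, Enron was one of the largest companies in the United States. By 2002 it had collapsed as a result of widespread corporate fraud. During the federal investigation that followed, a large amount of normally confidential information entered the public record, including tens of thousands of emails involving senior executives and detailed financial data. The goal of this project is to use machine learning to build an algorithm that identifies Enron employees suspected of fraud (persons of interest, POIs) from the public Enron financial and email dataset.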

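Data

The data comes from GitHub. Its features fall into three groups: financial features, email features, and the POI label.

Financial features: ['salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees'] (all in US dollars)

Email features: ['to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi'] (values are generally counts of email messages; the notable exception is 'email_address', which is a string)

POI label: ['poi'] (boolean, represented as an integer)

The next steps are to engineer features, then select and tune an algorithm, and finally test and evaluate the resulting identifier.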

Data processing

1. Load the .pkl dataset and convert it to a DataFrame

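import pickle
import pandas as pd

# Open in binary mode, which pickle needs on Python 3 and which also works on Python 2
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)
df = pd.DataFrame(data_dict)
df = df.T  # transpose: one row per person, one column per feature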

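Dataset characteristics:
a) It contains 21 fields and 146 records, for 3,066 data points in total. There are 18 POIs and 128 non-POIs, so for this project's goal the classes are imbalanced. Accuracy is therefore not a good metric on its own; recall, F1 and precision are used together to choose a classifier.
b) There are 20 features; 'poi' is the label.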
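
To make the point about accuracy concrete, here is a minimal sketch using only the counts quoted above. The "always predict non-POI" baseline is a hypothetical classifier introduced for illustration; it is not part of the project code.

```python
# Counts taken from the dataset description above.
n_poi, n_non_poi = 18, 128
total = n_poi + n_non_poi

# Hypothetical baseline: label every record as non-POI.
true_positives = 0            # no POI is ever flagged
true_negatives = n_non_poi    # every non-POI is "correctly" left alone

baseline_accuracy = (true_positives + true_negatives) / float(total)
baseline_recall = true_positives / float(n_poi)

print("accuracy: %.3f" % baseline_accuracy)
print("recall:   %.3f" % baseline_recall)
```

The baseline scores close to 88% accuracy while finding no POI at all, which is why recall, precision and F1 carry more information here.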

2. Handle missing values

df.replace('NaN', np.nan, inplace = True)  # np.nan is a float; test for it with isnull() in pandas or isnan() in numpy
features_nan_count = df.isnull().sum(axis = 0).sort_values(ascending=False)  # missing values per column (feature)
person_nan_count = df.isnull().sum(axis = 1).sort_values(ascending=False)  # missing values per row (person)

print('The 10 features with the most missing values:')
print(features_nan_count[:10])
print('The 10 people with the most missing values:')
print(person_nan_count[:10])

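The counts show that:
a) The person with the most missing values is "LOCKHART EUGENE E", who is missing 20 of the 21 features;
b) The 10 features loan_advances, director_fees, restricted_stock_deferred, deferral_payments, deferred_income, long_term_incentive, bonus, shared_receipt_with_poi, from_poi_to_this_person and from_messages have the most missing values, each with more than 60.

Handling the missing values:
a) Drop the "LOCKHART EUGENE E" record;
b) Features with many missing values may still help identify suspects, so they are kept and their missing values are filled with 0.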

df = df.fillna(0) 
df = df.drop(['LOCKHART EUGENE E'], axis=0)

3. Handle outliers

plt.scatter( df["salary"], df["bonus"])
plt.xlabel("salary")
plt.ylabel("bonus")
plt.show()

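Salary vs. bonus scatter plot

The salary vs. bonus scatter plot reveals an outlier whose key is 'TOTAL'. A second anomalous record, 'THE TRAVEL AGENCY IN THE PARK', is not a person's name. Both records are dropped.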

df=df.drop(['TOTAL','THE TRAVEL AGENCY IN THE PARK'], axis=0)
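
A scatter plot shows that an extreme point exists but not which record it is. One way to look up its key is to rank the records by the plotted field. The sketch below does this on a plain dict; the records and their values are made up for illustration, and `largest_by` is a hypothetical helper, not part of the project code.

```python
# Made-up records for illustration; only the 'TOTAL' key mirrors the text above.
records = {
    "PERSON A": {"salary": 200000, "bonus": 500000},
    "PERSON B": {"salary": 350000, "bonus": 1200000},
    "TOTAL": {"salary": 550000, "bonus": 1700000},
}

def largest_by(records, field, n=1):
    """Return the n record keys with the largest value for `field`."""
    ranked = sorted(records, key=lambda name: records[name][field], reverse=True)
    return ranked[:n]

print(largest_by(records, "salary"))    # key of the most extreme salary
print(largest_by(records, "bonus", 2))  # the two largest bonuses
```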

4. Build new features

df['from_poi_rate'] = df['from_poi_to_this_person'] / df['to_messages']
df['from_poi_rate'] = df['from_poi_rate'].fillna(0)

df['sent_to_poi_rate'] = df['from_this_person_to_poi'] / df['from_messages']
df['sent_to_poi_rate'] = df['sent_to_poi_rate'].fillna(0)

df['shared_with_poi_rate'] = df['shared_receipt_with_poi'] / df['to_messages']
df['shared_with_poi_rate'] = df['shared_with_poi_rate'].fillna(0)

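Three new email features that seem likely to help identify suspects are built: a. 'from_poi_rate', the share of all received messages that came from POIs; b. 'sent_to_poi_rate', the share of all sent messages that went to POIs; c. 'shared_with_poi_rate', the share of all received messages that share a receipt with a POI.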
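
The ratios above are undefined for people with no email data, because the denominator is 0 once missing values have been filled with 0. The pandas code handles this by filling the resulting NaN with 0. The sketch below shows the same rule for a single pair of counts; `safe_rate` is a hypothetical helper and the counts are made up.

```python
def safe_rate(part, whole):
    """Return part / whole as a float, or 0.0 when whole is zero."""
    if whole == 0:
        return 0.0
    return float(part) / whole

# Hypothetical counts for one person.
print(safe_rate(30, 1200))  # 30 of 1200 received messages came from POIs
print(safe_rate(0, 0))      # no email data at all, so the rate is set to 0.0
```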

5. Remove unused features

'email_address' is of little use for identifying suspects, so it is removed from the features.

6. Feature selection

df_features = df.drop('poi', axis=1)
df_labels = df['poi']
k_best = SelectKBest(k='all')
k_best.fit(df_features, df_labels)
best_score=pd.DataFrame(k_best.scores_,index=df_features.columns.tolist(),columns=['score'])

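SelectKBest (whose default score function is the ANOVA F-value, f_classif) returns a score for how strongly each feature relates to the 'poi' label.
Features ranked by score:
feature score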
exercised_stock_options 24.815080
total_stock_value 24.182899
bonus 20.792252
salary 18.289684
sent_to_poi_rate 16.409713
deferred_income 11.458477
long_term_incentive 9.922186
restricted_stock 9.212811
shared_with_poi_rate 9.101269
total_payments 8.772778
shared_receipt_with_poi 8.589421
loan_advances 7.184056
expenses 6.094173
from_poi_to_this_person 5.243450
other 4.187478
from_poi_rate 3.128092
from_this_person_to_poi 2.382612
director_fees 2.126328
to_messages 1.646341
deferral_payments 0.224611
from_messages 0.169701
restricted_stock_deferred 0.065500
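
Since the SelectKBest call above passes no score function, the scores are presumably one-way ANOVA F-values (the f_classif default). The sketch below computes that statistic by hand for one feature split into POI and non-POI groups. The `anova_f` helper and the toy values are illustrative assumptions, not the project's data.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    all_values = [x for group in groups for x in group]
    n_total, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / float(n_total)
    means = [sum(group) / float(len(group)) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the values around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Toy feature values: the groups are well separated, so F is large.
poi_values = [5, 6, 7]
non_poi_values = [1, 2, 3]
print(anova_f([poi_values, non_poi_values]))
```

A feature whose values look alike in both groups gets an F-value near 0, which matches the low scores at the bottom of the ranking.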

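The 10 highest-scoring features are taken first. Some of them contain many missing values, so those that also rank among the 10 features with the most missing values are removed. The 6 features finally chosen to identify 'poi' are exercised_stock_options, total_stock_value, sent_to_poi_rate, restricted_stock, shared_with_poi_rate and total_payments.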

df_features = df[['exercised_stock_options', 'total_stock_value','sent_to_poi_rate','restricted_stock', 'shared_with_poi_rate','total_payments']]

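Choosing a classifier

1. Classifiers
a) Naive Bayes classifier:
A simple, effective and widely used classification algorithm. It assumes that the features are conditionally independent of one another given the class. The 6 features I chose to flag 'poi' are clearly unlikely to be independent, but it is still worth a try.
b) Support vector machine (SVM):
An SVM looks for a separating hyperplane that maximizes the margin, that is, the distance from the plane to the closest points of the two classes. Because it classifies by distance, feature scaling is required. It is well suited to small samples.
c) Decision tree:
It always splits along individual features and produces a clear tree structure in which different feature tests lead to different predictions.
d) Random forest:
An ensemble algorithm. It builds a large number of decision trees from random subsets of the features and training samples, then combines their results for the final classification. It suits data of relatively low dimensionality (tens of dimensions) when high accuracy is required.
e) AdaBoost:
The core idea is to train a series of weak classifiers on the same training set and combine them into a stronger final classifier. It applies to binary and multi-class problems, is simple to use, and needs little tuning. It is often fairly resistant to overfitting, although it can still overfit noisy data.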

algorithms = [
    GaussianNB(),
    make_pipeline(MinMaxScaler(), SVC(class_weight='balanced', random_state=42)),
    DecisionTreeClassifier(class_weight='balanced', random_state=42),
    RandomForestClassifier(class_weight='balanced', random_state=42),
    AdaBoostClassifier(base_estimator=DecisionTreeClassifier(class_weight='balanced'), random_state=42)
]

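2. Set parameters

Start with widely spaced candidate values. After an automatic tuning run, reset the candidate values around the best parameters found.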

params = [
    { },  # GaussianNB

    {  # SVC pipeline
        'svc__kernel' : ('linear', 'rbf', 'poly', 'sigmoid'),
        'svc__C' : [0.1, 1.0, 10, 100, 1000, 10000],
        'svc__gamma' : [0.001, 0.01, 0.1, 1.0, 10, 100, 1000, 10000]
    },
    {  # DecisionTreeClassifier
        'criterion' : ('gini', 'entropy'),
        'splitter' : ('best', 'random'),
        'min_samples_split' : [7, 8, 9, 10, 11, 12, 13]
    },
    {  # RandomForestClassifier
        'n_estimators' : [2,3,4,5,6,7,8],
        'criterion' : ('gini', 'entropy'),
        'min_samples_split' : [25,30,35,40,45,50,55]
    },
    {  # AdaBoostClassifier
        'n_estimators' : [5, 10, 50],
        'learning_rate' : [0.001, 0.01, 0.1, 1.0]
    }
]
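
The size of each grid determines how long tuning takes, because a grid search tries every combination of values. The sketch below enumerates the SVC grid from the params list above and counts the model fits, assuming 5-fold cross-validation (the cv=5 setting used in the tuning step). `expand_grid` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
from itertools import product

# The SVC grid copied from the params list above.
svc_grid = {
    'svc__kernel': ('linear', 'rbf', 'poly', 'sigmoid'),
    'svc__C': [0.1, 1.0, 10, 100, 1000, 10000],
    'svc__gamma': [0.001, 0.01, 0.1, 1.0, 10, 100, 1000, 10000],
}

def expand_grid(grid):
    """Yield one dict per combination of parameter values."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

candidates = list(expand_grid(svc_grid))
n_folds = 5  # assumption: 5-fold cross-validation

print("combinations:", len(candidates))
print("cross-validation fits:", len(candidates) * n_folds)
```

With 4 kernels, 6 values of C and 8 values of gamma this gives 192 combinations, so narrowing the ranges after a first pass cuts the work considerably.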

3. Tune automatically with GridSearchCV

Split the data into a training set (70%) and a test set (30%).

features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(df_features,df_labels,test_size=0.3, random_state=42)

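F1 is used as the scoring metric for tuning. After a tuning run returns the best parameters, go back to params, reset the candidate values, and tune again, repeating until the f1_score is satisfactory.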

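dataset = df.to_dict(orient='index')  # convert the DataFrame to a dict keyed by person
feature_list = ['poi'] + df_features.columns.tolist()  # 'poi' label first, then the selected features
best_estimator = None
best_score = 0
print('\n\n==============\nSelecting the best model (F1 score):\n')
warnings.filterwarnings('ignore')  # suppress F1 warnings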

for ii, algorithm in enumerate(algorithms):
    model = GridSearchCV(algorithm, params[ii],scoring = 'f1',cv=5) 
    model.fit(features_train,labels_train)

    best_estimator_ii = model.best_estimator_
    best_score_ii = model.best_score_

    print('------------\nF1 Score:',best_score_ii,'\n')

    test_classifier(best_estimator_ii, dataset, feature_list)

    if best_score_ii > best_score:
        best_estimator = best_estimator_ii
        best_score = best_score_ii

clf =  best_estimator

Naive Bayes:
Accuracy: 0.85673 Precision: 0.43270 Recall: 0.23950 F1: 0.30834

Support vector machine:
Accuracy: 0.80047 Precision: 0.31508 Recall: 0.42300 F1: 0.36115

Decision tree:
Accuracy: 0.78600 Precision: 0.23254 Recall: 0.26300 F1: 0.24683

Random forest:
Accuracy: 0.77247 Precision: 0.31155 Recall: 0.58400 F1: 0.40633

AdaBoost:
Accuracy: 0.79200 Precision: 0.18820 Recall: 0.16900 F1: 0.17808

Best-scoring classifier:
Random forest

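Cross-validation

cross_val_score splits the dataset several times (cv=5), validates on each split, and the scores are averaged. This greatly reduces the effect of any one random split on the result.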

f1 = cross_val_score(clf, features_train,labels_train, cv=5,scoring='f1')
acc = cross_val_score(clf, features_train,labels_train, cv=5, scoring='accuracy')
pre = cross_val_score(clf, features_train,labels_train, cv=5, scoring='precision')
rec = cross_val_score(clf, features_train,labels_train, cv=5, scoring='recall')
test_set = (f1.mean(),acc.mean(),pre.mean(),rec.mean())

Validation results:
F1=0.537979797979798
accuracy=0.79864661654135338
precision=0.46547619047619049
recall=0.83333333333333326
Recall is above 0.3, so the cross-validation check passes.
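
To show what splitting with cv=5 means, here is a minimal sketch of plain k-fold index splitting. It is a simplification: for a classifier, scikit-learn's cross_val_score with an integer cv uses stratified folds, which also keep the POI ratio similar in every fold. The helper names and the sample size of 10 are assumptions for illustration.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # first `extra` folds get one more
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def train_validation_splits(n_samples, k):
    """Yield (train, validation) index lists; each fold validates exactly once."""
    folds = k_fold_indices(n_samples, k)
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

for train, validation in train_validation_splits(10, 5):
    print("train:", train, "validation:", validation)
```

Every sample is used for validation once and for training in the other four rounds, and the reported score is the mean over the five rounds.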

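Understanding the evaluation metrics
True Positive (TP): a positive prediction that is correct.
In this project: the number of POIs predicted as POI
False Positive (FP): a positive prediction that is wrong.
In this project: the number of non-POIs predicted as POI
True Negative (TN): a negative prediction that is correct.
In this project: the number of non-POIs predicted as non-POI
False Negative (FN): a negative prediction that is wrong.
In this project: the number of POIs predicted as non-POI

precision = TP / (TP + FP)
Precision: the share of records predicted as POI that really are POIs
recall = TP / (TP + FN)
Recall: the share of all POIs that were found
accuracy = (TP + TN) / (TP + FP + TN + FN)
Accuracy: the share of all records that were classified correctly
F1 Score = 2 * P * R / (P + R)
F1: the harmonic mean of precision and recall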
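
The four metrics can be computed directly from the confusion counts. The sketch below uses the standard definitions, with F1 as the harmonic mean 2PR/(P+R). The function names and the confusion counts are hypothetical; the last line rechecks the random forest F1 from the precision and recall reported earlier.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def classification_metrics(tp, fp, tn, fn):
    """Return precision, recall, accuracy and F1 from confusion counts."""
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / float(tp + fp + tn + fn)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1_from(precision, recall),
    }

# Hypothetical counts: 18 POIs (9 found, 9 missed), 125 non-POIs (10 wrongly flagged).
print(classification_metrics(tp=9, fp=10, tn=115, fn=9))

# Random forest precision and recall as reported above.
print(round(f1_from(0.31155, 0.58400), 5))
```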

Output

dataset = df.to_dict(orient='index')
feature_list = ['poi'] + df_features.columns.tolist()
dump_classifier_and_data(clf, dataset, feature_list )