Machine Learning Classification on the Broward Crime Dataset

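Contents

  • 1 Data import and overview
  • 2 Data visualization
    • 2.1 Histograms
    • 2.2 Box plots
    • 2.3 Correlation heatmap
  • 3 Feature processing
    • 3.1 Log-transforming non-normal data toward a normal distribution
    • 3.2 Computing variance and keeping only high-variance features
  • 4 Machine learning models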

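This post covers detailed data analysis, data visualization, feature processing, and classification with machine learning models; message me privately for details. The full dataset is available at: https://download.csdn.net/download/ww596520206/87940780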

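1 Data import and overview

This article uses the ProPublica COMPAS dataset to build a classification task whose goal is to predict whether a criminal defendant will reoffend. The dataset consists of all criminal defendants screened with COMPAS in Broward County, Florida, between 2013 and 2014; we are interested only in features that can be used to predict a defendant's risk of recidivism. The authors analyzed a subset of the features and arrived at nine that can be used to build a recidivism classifier: charge description (e.g., theft, drug possession), charge degree (misdemeanor or felony), number of prior criminal offenses, juvenile felony offenses, juvenile misdemeanor offenses, other juvenile offenses, and the defendant's age, sex, and race.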

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import KFold, GridSearchCV
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
file_path = r"broward_data.csv"
data = pd.read_csv(file_path)
new_data = data.copy()
new_data.describe()

[Figure 1: output of new_data.describe()]

2 Data visualization

2.1 Histograms

hist_features = ["age_at_current_charge", "age_at_first_charge", "total_convictions", "p_pending_charge"]

for k, feature in enumerate(hist_features):
    plt.subplot(2, 2, k + 1)
    plt.title(feature)
    new_data[feature].plot(kind="hist")
plt.show()

[Figure 2: histograms of the four features]

2.2 Box plots

feature_list = ["p_charges","p_pending_charge","total_convictions","p_arrest"]
for k, feature in enumerate(feature_list):
    plt.subplot(2, 2, k + 1)
    plt.title(feature)
    plt.boxplot(new_data[feature].values)
plt.show()

[Figure 3: box plots of the four features]

2.3 Correlation heatmap

# Each feature appears once; a repeated name would only duplicate a row and column in the heatmap
feature_list = ["p_charges", "p_pending_charge", "total_convictions", "p_arrest"]
feature_corrcoef = np.corrcoef(np.asarray(new_data[feature_list].values).T)
f, ax = plt.subplots(figsize=(10, 15))
ax.xaxis.tick_top()
sns.heatmap(feature_corrcoef, cmap='RdBu', linewidths=0.05, annot=True, ax=ax,
            xticklabels=feature_list, yticklabels=feature_list, fmt='.3f')
plt.show()

[Figure 4: correlation heatmap]
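
The correlation matrix above is built with np.corrcoef on a transposed array. As a minimal sketch, the same Pearson correlations can be obtained from pandas with DataFrame.corr(), which keeps the column labels attached. The data below is synthetic and only stands in for the count features; the values are not from the Broward dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the count features (hypothetical values, not the Broward data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "p_charges": rng.poisson(5, size=200),
    "p_arrest": rng.poisson(3, size=200),
    "total_convictions": rng.poisson(2, size=200),
})

# DataFrame.corr() computes pairwise Pearson correlations and keeps the column labels
corr_pandas = demo.corr()

# np.corrcoef treats each row as a variable, hence the transpose
corr_numpy = np.corrcoef(demo.values.T)

# Both routes should agree up to floating-point error
same = np.allclose(corr_pandas.values, corr_numpy)
print(corr_pandas.round(3))
```

Passing corr_pandas directly to sns.heatmap would also make the explicit xticklabels and yticklabels arguments unnecessary, since seaborn reads the labels from the DataFrame.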

3 Feature processing

3.1 Log-transforming non-normal data toward a normal distribution

# Histograms before the transformation
hist_features = [ "total_convictions", "p_pending_charge"]
for k, feature in enumerate(hist_features):
    plt.subplot(1, 2, k + 1)
    plt.title(feature)
    new_data[feature].plot(kind="hist")
plt.show()

[Figure 5: histograms before the log transform]

# Apply the log transform, then plot the histograms again
new_data["total_convictions"] = np.log(new_data["total_convictions"].values + 1)
new_data["p_pending_charge"] = np.log(new_data["p_pending_charge"].values + 1)
hist_features = [ "total_convictions", "p_pending_charge"]
for k, feature in enumerate(hist_features):
    plt.subplot(1, 2, k + 1)
    plt.title(feature)
    new_data[feature].plot(kind="hist")
plt.show()

[Figure 6: histograms after the log transform]
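
A log transform reduces right skew but does not guarantee a normal distribution, so it can help to check the effect numerically rather than only by eye. The sketch below compares sample skewness before and after the transform on a synthetic right-skewed count feature; the series demo_count and its distribution are assumptions for illustration, not Broward data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical right-skewed count feature (not the Broward data)
counts = pd.Series(rng.negative_binomial(1, 0.1, size=1000), name="demo_count")

# np.log1p(v) equals np.log(v + 1) and is defined at v = 0
transformed = np.log1p(counts)

# Sample skewness: values near 0 indicate a roughly symmetric distribution
skew_before = counts.skew()
skew_after = transformed.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

If the skewness stays far from zero after the transform, the feature may need a different treatment, such as binning.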

3.2 Computing variance and keeping only high-variance features

# Select the feature columns
label_columns = ["general_two_year", "general_six_month", "drug_two_year", "property_two_year", "misdemeanor_two_year",
                 "felony_two_year", "violent_two_year", "drug_six_month", "property_six_month", "misdemeanor_six_month",
                 "felony_six_month", "violent_six_month"]
drop_features = ["person_id"] + label_columns
# Keep every column that is neither the ID nor one of the labels
features_columns = [col for col in new_data.columns if col not in drop_features]
x = new_data.loc[:, features_columns]
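
The code above separates features from labels but does not show the variance computation itself. One possible way to keep only high-variance features is scikit-learn's VarianceThreshold, sketched below on synthetic data. The column names and the threshold value 0.1 are hypothetical and would have to be chosen for the real features.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)

# Synthetic features with clearly different spreads (hypothetical, not the Broward data)
demo_x = pd.DataFrame({
    "high_var": rng.normal(0, 5, size=300),
    "mid_var": rng.normal(0, 1, size=300),
    "near_constant": rng.normal(0, 0.01, size=300),
})

# Population variance (ddof=0), which is what VarianceThreshold compares against
variances = demo_x.var(ddof=0)
print(variances)

threshold = 0.1  # hypothetical cut-off, chosen only for this example
selector = VarianceThreshold(threshold=threshold)
selector.fit(demo_x)

# get_support() returns a boolean mask over the input columns
selected = demo_x.columns[selector.get_support()].tolist()
print(selected)
```

Variance depends on the scale of each feature, so comparing raw variances across features measured in different units can be misleading; rescaling to a common range first is a common precaution.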

4 Machine learning models

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# The 12 binary recidivism labels; every model is trained once per label
label_columns = ["general_two_year", "general_six_month", "drug_two_year", "property_two_year", "misdemeanor_two_year",
                 "felony_two_year", "violent_two_year", "drug_six_month", "property_six_month", "misdemeanor_six_month",
                 "felony_six_month", "violent_six_month"]

model_list = [KNeighborsClassifier(), SVC(probability=True),
              DecisionTreeClassifier(), RandomForestClassifier(),
              LogisticRegression()]
model_name_list = ["knn", "svc", "dt", "rf", "lr"]

# Sequential 70/30 split; iloc is end-exclusive, so no row lands in both sets
num = int(len(new_data) * 0.7)
n_labels, n_models = len(label_columns), len(model_list)
acc_np = np.zeros((n_labels, n_models))
f1_np = np.zeros((n_labels, n_models))
roc_auc_np = np.zeros((n_labels, n_models))
train_x = x.iloc[:num, :].values
test_x = x.iloc[num:, :].values
test_pred_proba_list = []

for i, label in enumerate(label_columns):
    train_y = new_data[label].iloc[:num].values
    test_y = new_data[label].iloc[num:].values
    t_list = []
    for j, model in enumerate(model_list):
        model.fit(train_x, train_y)

        test_pred = model.predict(test_x)
        test_pred_proba = model.predict_proba(test_x)
        t_list.append(test_pred_proba)

        acc_np[i, j] = accuracy_score(test_y, test_pred)
        f1_np[i, j] = f1_score(test_y, test_pred)
        # Column 1 of predict_proba holds the positive-class probability
        roc_auc_np[i, j] = roc_auc_score(test_y, test_pred_proba[:, 1])
    test_pred_proba_list.append(t_list)
acc_df = pd.DataFrame(acc_np, index=label_columns, columns=model_name_list)
f1_df = pd.DataFrame(f1_np, index=label_columns, columns=model_name_list)
roc_auc_df = pd.DataFrame(roc_auc_np, index=label_columns, columns=model_name_list)
roc_auc_df

[Figures 7 and 8: model evaluation results]
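
The evaluation above relies on a single sequential 70/30 split, which can give unstable scores when a label is rare or the rows are ordered. As a hedged alternative, the sketch below uses stratified k-fold cross-validation, which preserves the class ratio in every fold. It runs on a synthetic imbalanced problem generated with make_classification; the sample size, class weights, and the choice of logistic regression are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary problem standing in for one label column (hypothetical data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Shuffled stratified folds keep roughly the same positive rate in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One ROC AUC value per fold
auc_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring="roc_auc"
)
print(f"ROC AUC per fold: {np.round(auc_scores, 3)}")
print(f"mean ROC AUC: {auc_scores.mean():.3f}")
```

The spread of the per-fold scores gives a rough sense of how much a single train/test split could vary.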
