data.shape shows (284807, 31): 284,807 transactions and 31 columns.
data.describe(include='all')
The output shows the summary statistics for Amount. This dataset has no missing values.
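A quick way to confirm the missing-value claim (a standard pandas check):
data.isnull().sum().max()  # 0 means no column contains any missing values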
The class counts:
0 : 284315 (99.8%)
1 : 492 (0.2%)
Fraud (Class = 1) is extremely rare, so the data is highly imbalanced; we will need SMOTE or undersampling later.
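These counts come straight from value_counts (normalize=True gives the percentages):
data['Class'].value_counts()                # 0: 284315, 1: 492
data['Class'].value_counts(normalize=True)  # 0: ~99.8%, 1: ~0.2%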
import seaborn as sns
import matplotlib.pyplot as plt
corr = data.corr()  # pass a subset of columns here, or data.corr() for all of them
plt.figure(figsize=(15,15))
sns.heatmap(corr, square=True)  # annot=True prints the value in each cell; fmt controls its string format
plt.show()
No strongly correlated features stand out, which is most likely a consequence of the extreme class imbalance.
General preprocessing:
from sklearn.preprocessing import StandardScaler
data['Amount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1,1))
If fit_transform receives only a single column, you must add values.reshape(-1,1) to turn it into the 2-D array scikit-learn expects.
import pandas as pd
# Undersampling: keep all of the fraud samples in data (~492 rows), then sample an equal number from the remaining Class = 0 rows
fraud_data = data[data["Class"]==1]  # DataFrame of every fraud row
number_records_fraud = len(fraud_data)  # number of fraud samples: 492
not_fraud_data = data[data["Class"]==0].sample(n=number_records_fraud)
training_data_1 = pd.concat([fraud_data,not_fraud_data])
Tip: the index of the newly created training_data_1 is not reset here, so it is out of order (it keeps the original row labels).
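If a shuffled, clean index is wanted, one option (same variable name; sample(frac=1) shuffles the rows):
training_data_1 = training_data_1.sample(frac=1, random_state=33).reset_index(drop=True)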
Separate the features from the class label:
X_under = training_data_1.drop(['Class'],axis=1)
y_under = training_data_1['Class']
Plot the correlation heatmap again for the undersampled dataset.
This time it is clear that V16-V18 have strong positive correlations, while V1, V2 and a few other features have strong negative correlations.
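A sketch of that plot, reusing the heatmap code from above on training_data_1:
corr_under = training_data_1.corr()
plt.figure(figsize=(15,15))
sns.heatmap(corr_under, square=True)
plt.show()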
Split the original data:
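X and y below are the features and label of the full dataset; assuming they were built the same way as the undersampled versions:
X = data.drop(['Class'], axis=1)
y = data['Class']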
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=33)
Split the undersampled data:
train_x_under, test_x_under, train_y_under, test_y_under = train_test_split(X_under, y_under, test_size=0.3, random_state=33)
import numpy as np
from imblearn.over_sampling import SMOTE, ADASYN
# general pattern: features, labels = SMOTE().fit_resample(features, np.ravel(labels))
X_smote, y_smote = SMOTE(random_state=33).fit_resample(X, np.ravel(y))
X_smote = pd.DataFrame(X_smote, columns=X.columns)  # restore the original feature names
y_smote = pd.DataFrame({"Class": y_smote})
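A quick sanity check that SMOTE balanced the classes:
print(y_smote['Class'].value_counts())  # both classes should now show 284315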
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
parameters = {'penalty':['l1','l2'], 'C':[0.01,0.1,1,3,10]}
# C is the inverse regularization strength: counterintuitively, the smaller C is, the stronger the penalty
clf = LogisticRegression(solver='liblinear')  # liblinear supports both the 'l1' and 'l2' penalties
model = GridSearchCV(estimator=clf, param_grid=parameters, cv=8, n_jobs=-1, scoring='recall')
model.fit(train_x_under, train_y_under)
print(model.best_score_)
print(model.best_params_)
We run the search with different scoring metrics passed to GridSearchCV: recall (as above) and AUC.
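For AUC the scoring string is 'roc_auc'; a sketch of that second search (model_auc is an illustrative name):
model_auc = GridSearchCV(estimator=clf, param_grid=parameters, cv=8, n_jobs=-1, scoring='roc_auc')
model_auc.fit(train_x_under, train_y_under)
print(model_auc.best_score_, model_auc.best_params_)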
AUC = 0.979185745056816 with {'C': 1, 'penalty': 'l1'} (LogisticRegression)
Recall = 0.9105691056910569 with {'C': 0.01, 'penalty': 'l1'} (LogisticRegression)
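The evaluate helper used below is not defined in this section; a minimal sketch consistent with how it is called (predictions, then true labels, then the list of class labels) might be:
from sklearn.metrics import confusion_matrix, classification_report
def evaluate(pred, true, labels):
    # print the confusion matrix plus per-class precision/recall/F1
    print(confusion_matrix(true, pred, labels=labels))
    print(classification_report(true, pred, labels=labels))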
First, train and evaluate on the undersampled split (train_x_under, test_x_under, train_y_under, test_y_under):
clf = LogisticRegression(C=1, penalty='l1', solver='liblinear')
clf.fit(train_x_under, train_y_under)
predict = clf.predict(test_x_under)
evaluate(predict, test_y_under, [0,1])
Then the same model on the original split (train_x, test_x, train_y, test_y):
clf = LogisticRegression(C=1, penalty='l1', solver='liblinear')
clf.fit(train_x,train_y)
predict = clf.predict(test_x)
evaluate(predict,test_y,[0,1])
The recall for fraud is on the low side: of the 151 fraud samples in the test set, only 62% were found.
Next, the SMOTE-resampled data, where 0:1 = ~284k : ~284k:
clf = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')
clf.fit(X_smote, np.ravel(y_smote))  # ravel the one-column label DataFrame into a 1-D array
predict = clf.predict(test_x)
# caveat: SMOTE was applied to the full X/y above, so test_x overlaps the training data (leakage)
The false-positive rate went up considerably: more than 2,000 normal transactions were classified as fraud.
To improve on this, first undersample the non-fraud class down to 700, giving 0:1 = 700:492,
then apply SMOTE so the ratio becomes 700:700, and train on that data.
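A minimal sketch of that combined pipeline, reusing fraud_data from above (the new variable names are illustrative):
not_fraud_700 = data[data["Class"]==0].sample(n=700, random_state=33)
combined = pd.concat([fraud_data, not_fraud_700])  # 0:1 = 700:492
X_comb = combined.drop(['Class'], axis=1)
y_comb = combined['Class']
X_comb_smote, y_comb_smote = SMOTE(random_state=33).fit_resample(X_comb, np.ravel(y_comb))  # now 700:700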