I recently needed KFold() in a project,
so I am writing down some notes for future reference.
In sklearn, KFold() belongs to the model_selection module:
from sklearn.model_selection import KFold
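KFold(n_splits=5, shuffle=False, random_state=None)
Parameters:
n_splits: number of folds to split the data into (at least 2). The default is 5 since scikit-learn 0.22; versions 0.20 and 0.21 showed 'warn' as a placeholder default.
shuffle: whether to shuffle the samples before splitting. Defaults to False (no shuffling).
random_state: seed that fixes the shuffling so the splits are reproducible. Only used when shuffle == True.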
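To make the effect of shuffle and random_state concrete, here is a minimal sketch on a toy array. The array and the helper fold_test_indices are made up for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 6 samples with 2 features each
X_toy = np.arange(12).reshape(6, 2)

def fold_test_indices(kf, data):
    """Collect the test indices of every fold as plain Python lists."""
    return [test.tolist() for _, test in kf.split(data)]

plain = fold_test_indices(KFold(n_splits=3), X_toy)
seeded_a = fold_test_indices(KFold(n_splits=3, shuffle=True, random_state=0), X_toy)
seeded_b = fold_test_indices(KFold(n_splits=3, shuffle=True, random_state=0), X_toy)

# Without shuffling the folds are contiguous blocks: [[0, 1], [2, 3], [4, 5]]
print("no shuffle:     ", plain)
# With shuffling the indices are mixed, but every sample is still tested once
print("shuffle, seed 0:", seeded_a)
# The same seed reproduces exactly the same folds
print("reproducible:   ", seeded_a == seeded_b)
```

With shuffle=False the random_state argument plays no role, which is why it only matters in the shuffled case.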
Methods:
1. get_n_splits([X, y, groups]) returns the number of folds.
2. split(X[, y, groups]) yields, for each fold, the index arrays of the training set and the test set.
Examples:
1. get_n_splits()
import numpy as np
from sklearn.model_selection import KFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
print(kf.get_n_splits(X))
Output: 2
2. split()
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    # X and y are NumPy arrays here, so index them directly (.iloc is pandas-only)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Output:
TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1] TEST: [2 3]
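With enumerate() you also get the fold number. If X and Y are pandas objects (DataFrame / Series) rather than NumPy arrays, select the rows with .iloc:
for k, (train, test) in enumerate(kf.split(X, Y)):
    print(k, (train, test))
    x_train = X.iloc[train]
    x_test = X.iloc[test]
    y_train = Y.iloc[train]
    y_test = Y.iloc[test]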
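Using KFold() to tune the logistic regression parameter C
C is the inverse of the regularization coefficient gamma (C = 1/gamma), so a smaller C means stronger regularization.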
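As a rough illustration of what C does, the sketch below fits LogisticRegression with a few values of C and prints the norm of the learned coefficients. The data comes from make_classification and is purely synthetic; it is an assumption for the demo, not the dataset used in this post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: not the dataset from this post)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

coef_norms = {}
for C in [0.001, 1, 100]:
    # L2 penalty is the default; max_iter is raised only to avoid convergence warnings
    model = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print("C =", C, "coefficient norm =", coef_norms[C])
```

A small C penalizes large weights heavily, so the coefficient norm should shrink as C decreases. The function below searches over the same kind of grid, but scores each C by cross-validated recall.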
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import KFold

def best_C_param(x, y):
    # x and y are expected to be pandas objects (they are indexed with .iloc)
    kf = KFold(n_splits=7, shuffle=True, random_state=0)
    C_param = [0.001, 0.01, 0.1, 1, 10, 100]
    result = []
    for c_parm in C_param:
        print('C_param', c_parm)
        recall_score_lr_kf = []
        for k, (train, test) in enumerate(kf.split(x, y)):
            model = LogisticRegression(C=c_parm, penalty='l2')
            lr_kf = model.fit(x.iloc[train], y.iloc[train].values.ravel())
            pred_lr_kf = lr_kf.predict(x.iloc[test])
            recall_score_lrkf = recall_score(y.iloc[test], pred_lr_kf)
            recall_score_lr_kf.append(recall_score_lrkf)
            print('iteration', k, 'recall score', recall_score_lrkf)
        # mean recall over the 7 folds for this value of C
        result.append(np.mean(recall_score_lr_kf))
        print(c_parm, np.mean(recall_score_lr_kf))
        # best mean recall so far ('bets' is left as-is to match the recorded output)
        print('bets mean recall score', max(result))

print('using whole datasets X and Y')
best_C_param(X, Y)
print('-------------------------')
print('using under sampling data X and Y')
best_C_param(X_underSampling, Y_underSampling)
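For comparison, the same kind of search can be written more compactly with GridSearchCV. This is only a sketch under assumptions: the data is synthetic (make_classification), and max_iter=1000 is added to avoid convergence warnings; neither comes from the experiment above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic, roughly balanced stand-in data (assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=700, random_state=0)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.001, 0.01, 0.1, 1, 10, 100]},
    scoring="recall",  # same metric as recall_score in the manual loop
    cv=KFold(n_splits=7, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

print("best C:", search.best_params_["C"])
print("best mean recall:", search.best_score_)
```

The per-C mean recall values are available in search.cv_results_["mean_test_score"], which corresponds to the result list built in best_C_param.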
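Some observations:
The dataset is heavily imbalanced: the positive class (label 1) makes up only 0.17% of all samples.
After undersampling, the two classes are balanced at 1:1.
Searching for the best C for LR with KFold on the undersampled data:
Method 1: no shuffling, i.e. shuffle=False (the default); everything else as above
kf = KFold(n_splits=7)
For every value of C, some folds had a recall score of 0 (at least 2 of the 7 folds), which pulled the mean recall for each c_parm down to only about 0.5.
Reason: without shuffling, some folds received no positive samples.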
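The effect can be reproduced on a tiny example. The sketch below assumes the labels are ordered by class (all negatives first, then all positives), which is a guess about how the undersampled data was laid out, and counts how many positives land in each test fold. StratifiedKFold is shown only as one possible alternative; it was not part of the experiment in this post.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical toy labels: 7 negatives followed by 7 positives
y_sorted = np.array([0] * 7 + [1] * 7)
X_sorted = np.zeros((14, 1))  # features are irrelevant for the split itself

def positives_per_test_fold(splitter):
    """Number of positive samples in the test part of each fold."""
    return [int(y_sorted[test].sum()) for _, test in splitter.split(X_sorted, y_sorted)]

unshuffled = positives_per_test_fold(KFold(n_splits=7))
shuffled = positives_per_test_fold(KFold(n_splits=7, shuffle=True, random_state=0))
stratified = positives_per_test_fold(StratifiedKFold(n_splits=7))

print("KFold, no shuffle:", unshuffled)   # [0, 0, 0, 1, 2, 2, 2]
print("KFold, shuffled:  ", shuffled)
print("StratifiedKFold:  ", stratified)   # [1, 1, 1, 1, 1, 1, 1]
```

A test fold without any positive sample has no true positives to find, and recall_score reports 0 for it (with a warning), which matches the zeros seen with Method 1.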
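Method 2: shuffle with a fixed random seed
kf = KFold(n_splits=7, shuffle=True, random_state=0)
Output: the results are clearly better on the undersampled data (mean recall of roughly 0.91 to 0.96, versus 0.55 to 0.62 on the full dataset).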
using whole datasets X and Y before under sampling
C_param 0.001
iteration 0 recall score 0.5797101449275363
iteration 1 recall score 0.5223880597014925
iteration 2 recall score 0.5
iteration 3 recall score 0.5571428571428572
iteration 4 recall score 0.6774193548387096
iteration 5 recall score 0.5692307692307692
iteration 6 recall score 0.4567901234567901
0.001 0.5518116156140221
bets mean recall score 0.5518116156140221
C_param 0.01
iteration 0 recall score 0.5942028985507246
iteration 1 recall score 0.5671641791044776
iteration 2 recall score 0.5512820512820513
iteration 3 recall score 0.6142857142857143
iteration 4 recall score 0.6774193548387096
iteration 5 recall score 0.6
iteration 6 recall score 0.5185185185185185
0.01 0.5889818166543136
bets mean recall score 0.5889818166543136
C_param 0.1
iteration 0 recall score 0.6231884057971014
iteration 1 recall score 0.6119402985074627
iteration 2 recall score 0.5512820512820513
iteration 3 recall score 0.6142857142857143
iteration 4 recall score 0.6935483870967742
iteration 5 recall score 0.6153846153846154
iteration 6 recall score 0.5555555555555556
0.1 0.6093121468441821
bets mean recall score 0.6093121468441821
C_param 1
iteration 0 recall score 0.6521739130434783
iteration 1 recall score 0.6119402985074627
iteration 2 recall score 0.5512820512820513
iteration 3 recall score 0.6428571428571429
iteration 4 recall score 0.7096774193548387
iteration 5 recall score 0.6153846153846154
iteration 6 recall score 0.5802469135802469
1 0.6233660505728338
bets mean recall score 0.6233660505728338
C_param 10
iteration 0 recall score 0.6521739130434783
iteration 1 recall score 0.6119402985074627
iteration 2 recall score 0.5512820512820513
iteration 3 recall score 0.6428571428571429
iteration 4 recall score 0.7096774193548387
iteration 5 recall score 0.6153846153846154
iteration 6 recall score 0.5802469135802469
10 0.6233660505728338
bets mean recall score 0.6233660505728338
C_param 100
iteration 0 recall score 0.6521739130434783
iteration 1 recall score 0.6119402985074627
iteration 2 recall score 0.5512820512820513
iteration 3 recall score 0.6428571428571429
iteration 4 recall score 0.7096774193548387
iteration 5 recall score 0.6153846153846154
iteration 6 recall score 0.5802469135802469
100 0.6233660505728338
bets mean recall score 0.6233660505728338
-------------------------
using under sampling data X and Y
C_param 0.001
iteration 0 recall score 0.9558823529411765
iteration 1 recall score 0.9861111111111112
iteration 2 recall score 0.9473684210526315
iteration 3 recall score 0.9466666666666667
iteration 4 recall score 0.9661016949152542
iteration 5 recall score 0.96
iteration 6 recall score 0.9701492537313433
0.001 0.9617542143454548
bets mean recall score 0.9617542143454548
C_param 0.01
iteration 0 recall score 0.9558823529411765
iteration 1 recall score 0.8888888888888888
iteration 2 recall score 0.881578947368421
iteration 3 recall score 0.8666666666666667
iteration 4 recall score 0.9661016949152542
iteration 5 recall score 0.9466666666666667
iteration 6 recall score 0.9253731343283582
0.01 0.9187369073964903
bets mean recall score 0.9617542143454548
C_param 0.1
iteration 0 recall score 0.9264705882352942
iteration 1 recall score 0.8888888888888888
iteration 2 recall score 0.868421052631579
iteration 3 recall score 0.8666666666666667
iteration 4 recall score 0.9661016949152542
iteration 5 recall score 0.9466666666666667
iteration 6 recall score 0.8955223880597015
0.1 0.9083911351520072
bets mean recall score 0.9617542143454548
C_param 1
iteration 0 recall score 0.9264705882352942
iteration 1 recall score 0.9027777777777778
iteration 2 recall score 0.881578947368421
iteration 3 recall score 0.8933333333333333
iteration 4 recall score 0.9491525423728814
iteration 5 recall score 0.9466666666666667
iteration 6 recall score 0.8955223880597015
1 0.9136431776877252
bets mean recall score 0.9617542143454548
C_param 10
iteration 0 recall score 0.9264705882352942
iteration 1 recall score 0.8888888888888888
iteration 2 recall score 0.8947368421052632
iteration 3 recall score 0.9066666666666666
iteration 4 recall score 0.9491525423728814
iteration 5 recall score 0.9466666666666667
iteration 6 recall score 0.8955223880597015
10 0.9154435118564803
bets mean recall score 0.9617542143454548
C_param 100
iteration 0 recall score 0.9264705882352942
iteration 1 recall score 0.9027777777777778
iteration 2 recall score 0.881578947368421
iteration 3 recall score 0.9066666666666666
iteration 4 recall score 0.9661016949152542
iteration 5 recall score 0.96
iteration 6 recall score 0.8805970149253731
100 0.9177418128412552
bets mean recall score 0.9617542143454548