[Machine Learning (10)] Model Evaluation: Dataset Splitting Methods (Hold-out and Cross-Validation)

Dataset splitting methods

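1) Basic rule for splitting: keep the training set and the validation set mutually exclusive
        Why: samples used for validation should, as far as possible, not appear in the training set, so that performance on the validation set reflects how well the model generalizes (for example, a final exam should not reuse the exact problems worked through in class)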

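2) Hold-out method:
        Split the dataset directly into two mutually exclusive sets, one used as the training set and the other as the validation set
        Common split ratios: 7:3, 7.5:2.5, 8:2
        Drawback: the score depends on which samples the random split happens to draw (for example, one exam happens to draw only questions I can answer, while a second happens to draw only ones I cannot), so a single result may not represent my true ability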

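3) Cross-validation (CV)
        Split the dataset into k mutually exclusive subsets of similar size. In each round, train on k-1 subsets and validate on the remaining one, so that every subset serves as the validation set exactly once. After k rounds, report the mean of the k validation scores. This is why the method is also called k-fold cross-validation (k-fold)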
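The k-fold procedure described above can be sketched on toy data. This is a minimal illustration, not part of the original workflow: the 10-row array `X`, the choice of 5 folds, and `random_state=0` are all assumptions made for the example. It checks the two properties the list relies on: training and validation indices are disjoint within each fold, and every sample is validated exactly once across the folds.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples identified by their row position (illustrative only).
X = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

validated = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # Within a fold, training and validation indices never overlap.
    assert set(train_idx).isdisjoint(val_idx)
    validated.extend(val_idx.tolist())
    print(f'fold {fold}: train={train_idx.tolist()} val={val_idx.tolist()}')

# Across all folds, every sample is used for validation exactly once.
assert sorted(validated) == list(range(10))
```

With 10 samples and 5 folds, each round trains on 8 samples and validates on 2.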

Splitting the data with the hold-out method and scoring the model

Here the data is split directly

from sklearn.model_selection import train_test_split

# df is assumed to be the DataFrame prepared earlier; 'average_price' is the target, 'id' is not a feature
training, testing = train_test_split(df, test_size=0.25, random_state=1)

# Separate the features from the target in each split
x_train = training.drop(columns=['average_price', 'id'])
y_train = training['average_price'].copy()
x_test = testing.drop(columns=['average_price', 'id'])
y_test = testing['average_price'].copy()
print(f'the shape of the training set is {x_train.shape}')
print(f'the shape of the testing set is {x_test.shape}')

–> Output (data split 0.75:0.25):

the shape of the training set is (673, 7)
the shape of the testing set is (225, 7)
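The printed shapes imply 673 + 225 = 898 rows in total, and 0.25 × 898 = 224.5 is not a whole number. As a small sketch of where 225 comes from: with a float `test_size`, scikit-learn rounds the test count up and gives the remaining rows to the training set. The `rows` array below is only a placeholder for the real DataFrame.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 673 + 225  # total rows implied by the printed shapes
test_size = 0.25

# With a float test_size, the test count is rounded up
# and the remaining rows go to the training set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

# Cross-check against train_test_split on placeholder row ids.
rows = np.arange(n_samples)
train_rows, test_rows = train_test_split(rows, test_size=test_size, random_state=1)
print(len(train_rows), len(test_rows))
```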

Checking the model's performance on the hold-out validation set

import warnings
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
warnings.filterwarnings('ignore')  # silences all warnings; remove this line while debugging

# pipe_lm is the model pipeline assumed to have been built earlier
pipe_lm.fit(x_train, y_train)        # fit on the training split only
y_predict = pipe_lm.predict(x_test)  # predict on the held-out split
print(f'mean squared error is: {mean_squared_error(y_test,y_predict)}')
print(f'mean absolute error is: {mean_absolute_error(y_test,y_predict)}')
print(f'R Squared is: {r2_score(y_test,y_predict)}')

–> Output:

mean squared error is: 37995892.98761668
mean absolute error is: 4396.432366811368
R Squared is: 0.5719194373282056
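To make the three scores easier to read, here is a minimal sketch that computes them by hand and compares them with the scikit-learn functions used above. The arrays `y_true` and `y_pred` are made-up numbers for illustration, not values from this dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 8.5])

residuals = y_true - y_pred

# MSE: mean of squared residuals (in squared target units).
mse_manual = np.mean(residuals ** 2)
# MAE: mean of absolute residuals (in target units).
mae_manual = np.mean(np.abs(residuals))
# R^2: 1 minus residual sum of squares over total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot
# RMSE: square root of MSE, back in target units.
rmse_manual = np.sqrt(mse_manual)

print(mse_manual, mean_squared_error(y_true, y_pred))
print(mae_manual, mean_absolute_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Because MSE is in squared units, it helps to take its square root before comparing it with MAE: an MSE of about 3.8e7 corresponds to an RMSE of roughly 6,164 in the units of `average_price`, next to an MAE of about 4,396.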

Splitting the data with cross-validation and scoring the model

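from sklearn.model_selection import KFold

# Full feature matrix and target, assumed to use the same columns as the hold-out split
x = df.drop(columns=['average_price', 'id'])
y = df['average_price']

k = 10
kf = KFold(n_splits=k, shuffle=True)  # no random_state, so the folds differ from run to run

mse = []
mae = []
r_s2 = []

for train_index, test_index in kf.split(x):  # split into k folds
    # KFold yields positional indices, so select rows with iloc rather than loc
    x_train, x_test = x.iloc[train_index], x.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    pipe_lm.fit(x_train, y_train)        # refit the pipeline on this fold's training data
    y_predict = pipe_lm.predict(x_test)  # predict on this fold's validation data
    k_mse = mean_squared_error(y_test,y_predict)
    mse.append(k_mse)
    print(f'mean squared error is {k_mse}')
    k_mae = mean_absolute_error(y_test,y_predict)
    mae.append(k_mae)
    print(f'mean absolute error is {k_mae}')
    k_r_s2 = r2_score(y_test,y_predict)
    r_s2.append(k_r_s2)
    print(f'R Squared is {k_r_s2}')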

–> Output (one set of results per fold, ten in total):

mean squared error is 33944053.2775839
mean absolute error is 4091.3521198501926
R Squared is 0.5092029932534381
mean squared error is 35114434.65383525
mean absolute error is 4058.507125221178
R Squared is 0.5689849523694475
mean squared error is 18197156.68175975
mean absolute error is 3369.9995687271
R Squared is 0.776182527685076
mean squared error is 34535243.46299915
mean absolute error is 4192.820971274831
R Squared is 0.5779088133315295
mean squared error is 40233080.92316229
mean absolute error is 4097.433595440258
R Squared is 0.5424187136274303
mean squared error is 30563555.203029394
mean absolute error is 4036.1654178309395
R Squared is 0.599523380094302
mean squared error is 45418168.17460259
mean absolute error is 4849.161828850654
R Squared is 0.5349900325354779
mean squared error is 39047300.59608697
mean absolute error is 4684.974341683951
R Squared is 0.6127954527533661
mean squared error is 40672733.401221015
mean absolute error is 4738.685116862728
R Squared is 0.5256179983368192
mean squared error is 37635022.702838585
mean absolute error is 4099.567148872294
R Squared is 0.4709723577684112

Finally, take the mean of the ten results

import numpy as np
print(f'mean squared error is {np.array(mse).mean()}')
print(f'mean absolute error is {np.array(mae).mean()}')
print(f'R Squared is {np.array(r_s2).mean()}')

–> Output:

mean squared error is 35536074.90771189
mean absolute error is 4221.866723461413
R Squared is 0.5718597221755297
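The manual loop and averaging can also be written more compactly with scikit-learn's `cross_validate`. The sketch below is self-contained and therefore uses stand-ins: `X_demo` and `y_demo` are synthetic data from `make_regression`, and `model` is a hypothetical scaler plus linear regression pipeline, not the `pipe_lm` used in this post. Note that the error metrics are returned negated (scikit-learn scorers follow a "higher is better" convention), so the sign is flipped before reporting.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustrative only).
X_demo, y_demo = make_regression(n_samples=200, n_features=7, noise=10.0, random_state=0)

# Hypothetical stand-in for the pipeline used in the post.
model = make_pipeline(StandardScaler(), LinearRegression())

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(
    model, X_demo, y_demo, cv=cv,
    scoring=['neg_mean_squared_error', 'neg_mean_absolute_error', 'r2'],
)

# Error scorers are negated, so flip the sign to get MSE and MAE.
mean_mse = -scores['test_neg_mean_squared_error'].mean()
mean_mae = -scores['test_neg_mean_absolute_error'].mean()
mean_r2 = scores['test_r2'].mean()

print(f'mean squared error is {mean_mse}')
print(f'mean absolute error is {mean_mae}')
print(f'R Squared is {mean_r2}')
```

Setting `random_state` on `KFold` makes the folds, and therefore the scores, repeatable between runs.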

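Comparing the results

The final scores of the two methods give nearly the same R² (about 0.572 in both cases), while the cross-validation means for MAE and MSE are somewhat lower than the hold-out values (MAE about 4,222 versus 4,396; MSE about 3.55e7 versus 3.80e7). A lower error on its own does not show that one method is better. The more telling evidence is the spread across the ten folds: R² ranges from about 0.47 to 0.78 and MSE from about 1.8e7 to 4.5e7, so a single random split could land anywhere in that range. Averaging over k folds smooths this out, which is why cross-validation can be regarded as reducing the influence of random sampling on the model evaluation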
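One way to probe the effect of random sampling directly is to repeat both procedures with different seeds and compare how much the scores move. The sketch below does this on synthetic data; `X_demo`, `y_demo`, the plain `LinearRegression` model, and the 20 seeds are all assumptions chosen for illustration, and the numbers it prints say nothing about the housing data in this post.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in data (illustrative only).
X_demo, y_demo = make_regression(n_samples=200, n_features=7, noise=50.0, random_state=0)

# Hold-out: one R^2 per random split.
holdout_r2 = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.25, random_state=seed)
    fitted = LinearRegression().fit(X_tr, y_tr)
    holdout_r2.append(r2_score(y_te, fitted.predict(X_te)))
holdout_r2 = np.array(holdout_r2)

# 10-fold CV: one mean R^2 per random fold assignment.
cv_mean_r2 = []
for seed in range(20):
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    fold_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring='r2')
    cv_mean_r2.append(fold_scores.mean())
cv_mean_r2 = np.array(cv_mean_r2)

print(f'hold-out R2: min={holdout_r2.min():.3f} max={holdout_r2.max():.3f} std={holdout_r2.std():.4f}')
print(f'10-fold mean R2: min={cv_mean_r2.min():.3f} max={cv_mean_r2.max():.3f} std={cv_mean_r2.std():.4f}')
```

One would typically expect the cross-validation means to vary less from seed to seed than the single hold-out scores, though the size of the gap depends on the data and the model.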
