from sklearn.model_selection import train_test_split
sklearn.model_selection.train_test_split(*arrays, **options)
Parameters: the arrays to split, plus keyword options such as test_size, train_size, and random_state.
Returns: the train and test splits of each input array, in order.
Features and labels:
feature_data = [[1,3,4],[1,3,4],[1,5,2],[1,5,8]]
label_data = [1,3,2,4]
Split the data 3:1 into a training set and a test set; after the split, features and labels stay paired in order.
x_train, x_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.25)
Test:
print(x_train)
print(y_train)
[[1, 5, 2], [1, 3, 4], [1, 5, 8]]
[2, 1, 4]
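Because the split is shuffled, the exact output above varies between runs. A minimal sketch (reusing the same toy data) of how fixing random_state makes the split reproducible:

```python
from sklearn.model_selection import train_test_split

feature_data = [[1, 3, 4], [1, 3, 4], [1, 5, 2], [1, 5, 8]]
label_data = [1, 3, 2, 4]

# Two calls with the same random_state shuffle identically,
# so they return exactly the same splits.
a = train_test_split(feature_data, label_data, test_size=0.25, random_state=0)
b = train_test_split(feature_data, label_data, test_size=0.25, random_state=0)
print(a == b)  # True
```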
The API also accepts NumPy arrays.
from sklearn.datasets import load_iris  # load the iris dataset
iris = load_iris()  # a dict-like (Bunch) dataset
Inspect the data:
print(iris.keys())  # list the keys
Output:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
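Besides 'data', the keys 'target' and 'target_names' are also worth a look. A quick sketch of what they hold:

```python
from sklearn.datasets import load_iris

iris = load_iris()
# `target` holds the integer class label for each of the 150 samples;
# `target_names` maps those integers to the species names.
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']
print(iris.target.shape)  # (150,)
```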
Inspect the feature values data:
print(iris['data'])
Output:
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
....
[6.5 3. 5.2 2. ]
[6.2 3.4 5.4 2.3]
[5.9 3. 5.1 1.8]]
There are 150 samples in total.
# x_train: training features; x_test: test features; y_train: training labels; y_test: test labels. test_size=0.2 reserves 20% of the data for testing.
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22, test_size=0.2)
print("x_train:\n", x_train.shape)
print("x_test:\n", x_test.shape)
Output shapes:
x_train:
(120, 4)
x_test:
(30, 4)
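One option the section does not cover is stratify, which keeps the class proportions of the full dataset in both splits. A short sketch on the same iris data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
# stratify=iris.target preserves class balance: the full dataset has
# 50 samples per class, so the 30-sample test set gets 10 per class.
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=22, stratify=iris.target)
print(np.bincount(y_test))  # [10 10 10]
```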