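I had never used sklearn before, so I am learning it by practice. I am writing down the usages I rely on most so that I can gradually build up a summary.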
sklearn.model_selection.train_test_split(*arrays, **options)
The examples below walk through the commonly used arguments of train_test_split:
>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split
>>>
>>> namelist = pd.DataFrame({
... "name" : ["Suzuki", "Tanaka", "Yamada", "Watanabe", "Yamamoto",
... "Okada", "Ueda", "Inoue", "Hayashi", "Sato",
... "Hirayama", "Shimada"],
... "age": [30, 40, 55, 29, 41, 28, 42, 24, 33, 39, 49, 53],
... "department": ["HR", "Legal", "IT", "HR", "HR", "IT",
... "Legal", "Legal", "IT", "HR", "Legal", "Legal"],
... "attendance": [1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1]
... })
>>> print(namelist)
age attendance department name
0 30 1 HR Suzuki
1 40 1 Legal Tanaka
2 55 1 IT Yamada
3 29 0 HR Watanabe
4 41 1 HR Yamamoto
5 28 1 IT Okada
6 42 1 Legal Ueda
7 24 0 Legal Inoue
8 33 0 IT Hayashi
9 39 1 HR Sato
10 49 1 Legal Hirayama
11 53 1 Legal Shimada
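Set the test proportion to 0.3 (test_size=0.3) to split the data into a training set and a test set. With 12 rows this gives 8 training rows and 4 test rows.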
>>> namelist_train, namelist_test = train_test_split(namelist, test_size=0.3)
>>> print(namelist_train)
age attendance department name
10 49 1 Legal Hirayama
1 40 1 Legal Tanaka
7 24 0 Legal Inoue
2 55 1 IT Yamada
4 41 1 HR Yamamoto
3 29 0 HR Watanabe
9 39 1 HR Sato
6 42 1 Legal Ueda
>>> print(namelist_test)
age attendance department name
0 30 1 HR Suzuki
8 33 0 IT Hayashi
11 53 1 Legal Shimada
5 28 1 IT Okada
test_size can also be an integer, in which case it is the absolute number of test rows (here 5, leaving 7 for training).
>>> namelist_train, namelist_test = train_test_split(namelist, test_size=5)
>>> print(namelist_train)
age attendance department name
3 29 0 HR Watanabe
4 41 1 HR Yamamoto
6 42 1 Legal Ueda
1 40 1 Legal Tanaka
9 39 1 HR Sato
8 33 0 IT Hayashi
7 24 0 Legal Inoue
>>> print(namelist_test)
age attendance department name
2 55 1 IT Yamada
10 49 1 Legal Hirayama
5 28 1 IT Okada
11 53 1 Legal Shimada
0 30 1 HR Suzuki
Next, set the training proportion to 0.5 (train_size=0.5). With test_size=None, the remaining rows become the test set.
>>> namelist_train, namelist_test = train_test_split(namelist, test_size=None, train_size=0.5)
>>> print(namelist_train)
age attendance department name
5 28 1 IT Okada
2 55 1 IT Yamada
3 29 0 HR Watanabe
4 41 1 HR Yamamoto
10 49 1 Legal Hirayama
0 30 1 HR Suzuki
>>> print(namelist_test)
age attendance department name
6 42 1 Legal Ueda
7 24 0 Legal Inoue
9 39 1 HR Sato
11 53 1 Legal Shimada
8 33 0 IT Hayashi
1 40 1 Legal Tanaka
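The row counts in the three splits above follow from how train_test_split rounds the requested sizes: a float test_size is rounded up and a float train_size is rounded down. Here is a minimal sketch that checks this. The stand-in frame `df` and the fixed `random_state=0` are my own assumptions for reproducibility and are not part of the session above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 12-row stand-in for namelist (only the row count matters here)
df = pd.DataFrame({"age": [30, 40, 55, 29, 41, 28, 42, 24, 33, 39, 49, 53]})

# Float test_size is a fraction of the rows: 12 * 0.3 = 3.6, rounded up to 4
train, test = train_test_split(df, test_size=0.3, random_state=0)
sizes_float = (len(train), len(test))

# Integer test_size is an absolute number of test rows
train, test = train_test_split(df, test_size=5, random_state=0)
sizes_int = (len(train), len(test))

# Float train_size is a fraction of the rows: 12 * 0.5 = 6; the rest is test
train, test = train_test_split(df, train_size=0.5, random_state=0)
sizes_train = (len(train), len(test))

print(sizes_float, sizes_int, sizes_train)  # (8, 4) (7, 5) (6, 6)
```

These counts match the printed splits above: 8/4, 7/5 and 6/6.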
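Next come the shuffle and stratify options. The example below turns shuffling off with shuffle=False: the rows keep their original order, and the default test_size of 0.25 puts the last 3 of the 12 rows in the test set.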
>>> namelist_train, namelist_test = train_test_split(namelist, shuffle=False)
>>> print(namelist_train)
age attendance department name
0 30 1 HR Suzuki
1 40 1 Legal Tanaka
2 55 1 IT Yamada
3 29 0 HR Watanabe
4 41 1 HR Yamamoto
5 28 1 IT Okada
6 42 1 Legal Ueda
7 24 0 Legal Inoue
8 33 0 IT Hayashi
>>> print(namelist_test)
age attendance department name
9 39 1 HR Sato
10 49 1 Legal Hirayama
11 53 1 Legal Shimada
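The stratify option is not exercised above, so here is a minimal sketch of how it might be used with the same namelist data. Stratifying on the department column and fixing `random_state=0` are my own choices for illustration. Note that stratify needs shuffling, so it cannot be combined with shuffle=False.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

namelist = pd.DataFrame({
    "name": ["Suzuki", "Tanaka", "Yamada", "Watanabe", "Yamamoto",
             "Okada", "Ueda", "Inoue", "Hayashi", "Sato",
             "Hirayama", "Shimada"],
    "age": [30, 40, 55, 29, 41, 28, 42, 24, 33, 39, 49, 53],
    "department": ["HR", "Legal", "IT", "HR", "HR", "IT",
                   "Legal", "Legal", "IT", "HR", "Legal", "Legal"],
    "attendance": [1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1],
})

# stratify takes the labels whose proportions should be kept roughly equal
# in the training and test parts (here: the department of each person).
# random_state=0 is an arbitrary seed, added only to make the run repeatable.
namelist_train, namelist_test = train_test_split(
    namelist,
    test_size=0.25,
    stratify=namelist["department"],
    random_state=0,
)

# Count how many people from each department ended up in each part
train_counts = namelist_train["department"].value_counts().to_dict()
test_counts = namelist_test["department"].value_counts().to_dict()
print(train_counts)
print(test_counts)
```

With 12 rows split 9/3 and three departments, every department should show up in the test set, which a plain random split does not guarantee.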