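Feature preprocessing API
sklearn.preprocessing
Why normalize or standardize?
Making features dimensionless
When features differ greatly in unit or magnitude, the feature with the largest values can dominate the final result, so the algorithm cannot learn from the other features.
Normalization transforms the original data and maps it into the range [0, 1] (the default).
In sklearn, MinMaxScaler(feature_range=(0, 1)) performs this transformation.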
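Per column, the mapping behind MinMaxScaler can be written as X' = (x - min) / (max - min), followed by X'' = X' * (mx - mi) + mi, where mi and mx are the lower and upper bounds of feature_range. The sketch below applies that formula by hand with NumPy and compares the result with MinMaxScaler. The small array and the helper name manual_minmax are made up for illustration and are not part of the original example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def manual_minmax(x, feature_range=(0, 1)):
    """Hypothetical helper: column-wise min-max scaling written out by hand."""
    mi, mx = feature_range
    col_min = x.min(axis=0)
    col_max = x.max(axis=0)
    # Scale each column to [0, 1] first, then stretch it to [mi, mx]
    x_unit = (x - col_min) / (col_max - col_min)
    return x_unit * (mx - mi) + mi


# Illustrative data: two features on very different scales
sample = np.array([[160.0, 50.0],
                   [170.0, 65.0],
                   [180.0, 80.0]])

by_hand = manual_minmax(sample, feature_range=(2, 3))
by_sklearn = MinMaxScaler(feature_range=(2, 3)).fit_transform(sample)

print(by_hand)
print(np.allclose(by_hand, by_sklearn))
```

The hand-written version divides by zero when a column is constant, so it is only meant to show the formula; in practice the library transformer is the safer choice.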
Example:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def minmax_demo():
    """
    Normalization (min-max scaling)
    :return: None
    """
    # 1. Load the data
    data = pd.read_csv('test00.csv')
    # Keep only the first three columns
    data = data.iloc[:, :3]
    print("data:\n", data)
    # 2. Instantiate a transformer
    transfer = MinMaxScaler()
    # 3. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    return None


if __name__ == '__main__':
    minmax_demo()
After the transformation, all values fall within the range [0, 1].
data:
height weight chest measurement
0 180 70 0.88877
1 190 80 0.99665
2 168 60 0.65878
3 159 65 0.65598
4 169 56 0.55658
5 173 60 0.46058
6 186 76 0.69978
7 178 60 0.64979
8 175 75 0.89895
9 176 60 0.88488
10 177 90 0.79595
11 168 100 0.48789
12 158 102 0.55646
13 168 60 0.69585
14 179 80 0.65785
15 183 70 0.69578
16 190 66 0.89586
17 196 88 0.96527
18 187 91 0.62488
19 182 90 0.58484
20 158 70 0.58947
21 159 55 0.58484
22 166 55 0.59896
23 178 54 0.48487
24 163 69 0.68745
25 156 55 0.52621
26 189 89 0.66959
27 156 56 0.59595
28 189 98 0.59716
29 169 66 0.65479
30 179 55 0.99598
31 177 68 0.55257
32 166 76 0.69784
33 169 86 0.68745
34 189 89 0.69988
35 188 68 0.78955
36 176 59 0.55999
37 177 60 0.68747
38 196 80 0.64888
data_new:
[[0.6 0.33333333 0.79875762]
[0.85 0.54166667 1. ]
[0.3 0.125 0.36972783]
[0.075 0.22916667 0.36450464]
[0.325 0.04166667 0.17908109]
[0.425 0.125 0. ]
[0.75 0.45833333 0.44621038]
[0.55 0.125 0.35295764]
[0.475 0.4375 0.81774768]
[0.5 0.125 0.79150111]
[0.525 0.75 0.6256086 ]
[0.3 0.95833333 0.05094484]
[0.05 1. 0.17885724]
[0.3 0.125 0.43887925]
[0.575 0.54166667 0.36799299]
[0.675 0.33333333 0.43874867]
[0.85 0.25 0.81198351]
[1. 0.70833333 0.94146287]
[0.775 0.77083333 0.30648982]
[0.65 0.75 0.23179809]
[0.05 0.33333333 0.24043502]
[0.075 0.02083333 0.23179809]
[0.25 0.02083333 0.25813793]
[0.55 0. 0.04531125]
[0.175 0.3125 0.42320966]
[0. 0.02083333 0.12242804]
[0.825 0.72916667 0.38989311]
[0. 0.04166667 0.25252299]
[0.825 0.91666667 0.25478016]
[0.325 0.25 0.36228478]
[0.575 0.02083333 0.99875016]
[0.525 0.29166667 0.17160072]
[0.25 0.45833333 0.44259145]
[0.325 0.66666667 0.42320966]
[0.825 0.72916667 0.44639693]
[0.8 0.29166667 0.61366986]
[0.5 0.10416667 0.1854422 ]
[0.525 0.125 0.42324696]
[1. 0.54166667 0.3512601 ]]
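Drawback of normalization: the result depends on the maximum and minimum, so if either of them is an outlier, the scaled values are strongly affected.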
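To make this sensitivity concrete, the sketch below scales a single column twice, once as-is and once with one extreme value appended. The numbers are invented for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative heights in cm (made-up values)
heights = np.array([[156.0], [166.0], [176.0], [186.0], [196.0]])
# Same column with one extreme value appended, e.g. a data-entry error
with_outlier = np.vstack([heights, [[1960.0]]])

clean_scaled = MinMaxScaler().fit_transform(heights)
outlier_scaled = MinMaxScaler().fit_transform(with_outlier)

# Without the outlier the five heights spread evenly over [0, 1]
print(clean_scaled.ravel())
# With the outlier as the new maximum, the same five heights are
# squeezed into a narrow band near 0
print(outlier_scaled.ravel())
```

Because the outlier becomes the new maximum, it alone maps to 1 and the genuine values lose almost all of their spread.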
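Standardization transforms the original data so that each feature has mean 0 and standard deviation 1.
Standardization is less sensitive to outliers: with a reasonably large sample, a few outliers shift the mean and standard deviation only slightly, so their effect on the final result is small.
In sklearn, the API for this is StandardScaler().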
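Per column, standardization computes X' = (x - mean) / σ, where σ is the column's standard deviation. As a rough sketch, the snippet below computes this by hand and compares it with StandardScaler, which uses the population standard deviation (ddof=0 in NumPy terms). The sample array is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: two features on very different scales
sample = np.array([[160.0, 50.0],
                   [170.0, 65.0],
                   [180.0, 80.0]])

# By hand: subtract the column mean, divide by the column standard deviation
mean = sample.mean(axis=0)
std = sample.std(axis=0)  # ddof=0, the population standard deviation
by_hand = (sample - mean) / std

transfer = StandardScaler()
by_sklearn = transfer.fit_transform(sample)

print(np.allclose(by_hand, by_sklearn))
# The fitted transformer exposes the statistics it learned
print(transfer.mean_, transfer.scale_)
```

The fitted mean_ and scale_ attributes are what a later transform() call reuses, which is why the same transformer object can be applied to new data.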
Example:
import pandas as pd
from sklearn.preprocessing import StandardScaler


def stand_demo():
    """
    Standardization
    :return: None
    """
    # 1. Load the data
    data = pd.read_csv('test00.csv')
    # Keep only the first three columns
    data = data.iloc[:, :3]
    print("data:\n", data)
    # 2. Instantiate a transformer
    # (only this line differs from the normalization example)
    transfer = StandardScaler()
    # 3. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    return None


if __name__ == '__main__':
    stand_demo()
data:
height weight chest measurement
0 180 70 0.88877
1 190 80 0.99665
2 168 60 0.65878
3 159 65 0.65598
4 169 56 0.55658
5 173 60 0.46058
6 186 76 0.69978
7 178 60 0.64979
8 175 75 0.89895
9 176 60 0.88488
10 177 90 0.79595
11 168 100 0.48789
12 158 102 0.55646
13 168 60 0.69585
14 179 80 0.65785
15 183 70 0.69578
16 190 66 0.89586
17 196 88 0.96527
18 187 91 0.62488
19 182 90 0.58484
20 158 70 0.58947
21 159 55 0.58484
22 166 55 0.59896
23 178 54 0.48487
24 163 69 0.68745
25 156 55 0.52621
26 189 89 0.66959
27 156 56 0.59595
28 189 98 0.59716
29 169 66 0.65479
30 179 55 0.99598
31 177 68 0.55257
32 166 76 0.69784
33 169 86 0.68745
34 189 89 0.69988
35 188 68 0.78955
36 176 59 0.55999
37 177 60 0.68747
38 196 80 0.64888
data_new:
[[ 0.40612393 -0.13933189 1.4864856 ]
[ 1.29594603 0.56637508 2.26419106]
[-0.66166258 -0.84503885 -0.17150918]
[-1.46250247 -0.49218537 -0.19169434]
[-0.57268037 -1.12732164 -0.90826759]
[-0.21675154 -0.84503885 -1.60033029]
[ 0.94001719 0.28409229 0.12405926]
[ 0.22815951 -0.84503885 -0.23631797]
[-0.03878712 0.21352159 1.55987308]
[ 0.05019509 -0.84503885 1.45844265]
[ 0.1391773 1.27208204 0.81734748]
[-0.66166258 1.977789 -1.40345287]
[-1.55148468 2.11893039 -0.90913267]
[-0.66166258 -0.84503885 0.09572795]
[ 0.31714172 0.56637508 -0.17821354]
[ 0.67307056 -0.13933189 0.09522332]
[ 1.29594603 -0.42161467 1.53759732]
[ 1.82983928 1.13094065 2.03797306]
[ 1.0289994 1.34265273 -0.41589382]
[ 0.58408835 1.27208204 -0.70454163]
[-1.55148468 -0.13933189 -0.67116403]
[-1.46250247 -1.19789233 -0.70454163]
[-0.839627 -1.19789233 -0.60275075]
[ 0.22815951 -1.26846303 -1.42522401]
[-1.10657363 -0.20990258 0.03517246]
[-1.7294491 -1.19789233 -1.12720451]
[ 1.20696382 1.20151134 -0.09358004]
[-1.7294491 -1.12732164 -0.6244498 ]
[ 1.20696382 1.83664761 -0.61572692]
[-0.57268037 -0.42161467 -0.20027304]
[ 0.31714172 -1.19789233 2.25936103]
[ 0.1391773 -0.28047328 -0.93717563]
[-0.839627 0.28409229 0.11007383]
[-0.57268037 0.98979925 0.03517246]
[ 1.20696382 1.20151134 0.12478016]
[ 1.11798161 -0.28047328 0.77120997]
[ 0.05019509 -0.91560955 -0.88368495]
[ 0.1391773 -0.84503885 0.03531664]
[ 1.82983928 0.56637508 -0.24287815]]
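As a quick check on this output, the sketch below standardizes the height column by hand, using the 39 values copied from the data printed above. Assuming the values were copied faithfully, the result should line up with the first column of data_new, and its mean and standard deviation should be close to 0 and 1.

```python
import numpy as np

# Height column copied from the printed data above
height = np.array([
    180, 190, 168, 159, 169, 173, 186, 178, 175, 176,
    177, 168, 158, 168, 179, 183, 190, 196, 187, 182,
    158, 159, 166, 178, 163, 156, 189, 156, 189, 169,
    179, 177, 166, 169, 189, 188, 176, 177, 196,
], dtype=float)

# Same formula StandardScaler applies: (x - mean) / population std
z = (height - height.mean()) / height.std()

# First few standardized heights, for comparison with data_new
print(z[:3])
# Mean and standard deviation of the standardized column
print(z.mean(), z.std())
```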