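After writing my first blog post, I never got around to a second one. Lately I have been running experiments in Python, and there are many small questions that I keep having to test and verify. There are too many details to keep in my head, so writing them up as notes helps them stick. As the saying goes, the palest ink beats the best memory.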
What exactly is the difference between fit_transform() and transform() in scikit-learn, and can the two be used interchangeably?
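Here I normalize data with a preprocessing.MinMaxScaler() object. For each column (feature) it computes (x - xMin) / (xMax - xMin), which maps the data it was fitted on into the interval [0, 1].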
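To make the formula concrete, here is a minimal sketch that applies it by hand with NumPy to a single column. The values 68, 49, 46, -72 are the first column of the train array that appears later in this post; the variable names are only illustrative.

```python
import numpy as np

# First column of the train array used later in this post.
x = np.array([68.0, 49.0, 46.0, -72.0])

# Min-max normalization: (x - xMin) / (xMax - xMin)
x_min, x_max = x.min(), x.max()
scaled = (x - x_min) / (x_max - x_min)

print(scaled)  # approximately [1.0, 0.864, 0.843, 0.0]
```

The largest value maps to 1 and the smallest to 0, which matches the first column of the MinMaxScaler output for train.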
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# Random integers in [-100, 100). The values shown below come from one particular run, so yours will differ.
data = np.random.randint(-100, 100, size=(6, 4))
data
Out[55]:
array([[ 68, -63, -31, -10],
[ 49, -49, 73, 18],
[ 46, 65, 75, -78],
[-72, 30, 90, -80],
[ 95, -88, 79, -49],
[ 34, -81, 57, 83]])
train = data[:4]
test = data[4:]
train
Out[58]:
array([[ 68, -63, -31, -10],
[ 49, -49, 73, 18],
[ 46, 65, 75, -78],
[-72, 30, 90, -80]])
test
Out[59]:
array([[ 95, -88, 79, -49],
[ 34, -81, 57, 83]])
minmaxTransformer = MinMaxScaler(feature_range=(0,1))
# First call fit_transform() on train: fit finds xMin and xMax for each column, then transform applies the normalization.
train_transformer = minmaxTransformer.fit_transform(train)
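# Normalize test with transform(), reusing the xMin and xMax learned from train.
# (If a value in test is smaller than train's xMin, the original xMin is still used; likewise, if a value in test is larger than train's xMax, the original xMax is still used.
# So, because test is scaled with the same xMin and xMax, its values **may no longer fall inside [0, 1]**.)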
test_transformer = minmaxTransformer.transform(test)
train_transformer
Out[64]:
array([[ 1. , 0. , 0. , 0.71428571],
[ 0.86428571, 0.109375 , 0.85950413, 1. ],
[ 0.84285714, 1. , 0.87603306, 0.02040816],
[ 0. , 0.7265625 , 1. , 0. ]])
test_transformer
Out[65]:
array([[ 1.19285714, -0.1953125 , 0.90909091, 0.31632653],
[ 0.75714286, -0.140625 , 0.72727273, 1.66326531]])
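As a sanity check on the explanation above, the following sketch recomputes the test result by hand from the statistics the scaler stored during fit. It assumes the same train and test arrays as in this session, written out literally so the snippet runs on its own; `data_min_` and `data_max_` are the fitted attributes of MinMaxScaler that hold the per-column minimum and maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[ 68, -63, -31, -10],
                  [ 49, -49,  73,  18],
                  [ 46,  65,  75, -78],
                  [-72,  30,  90, -80]])
test = np.array([[95, -88, 79, -49],
                 [34, -81, 57,  83]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(train)

# Per-column statistics learned from train only.
print(scaler.data_min_)
print(scaler.data_max_)

# Apply the formula to test by hand, using train's min and max.
manual = (test - scaler.data_min_) / (scaler.data_max_ - scaler.data_min_)

# The manual result agrees with scaler.transform(test).
same = np.allclose(manual, scaler.transform(test))
print(same)
```

For example, 95 in the first column becomes (95 - (-72)) / (68 - (-72)) = 167 / 140, which is about 1.193 and therefore outside [0, 1]. If out-of-range values are a problem, scikit-learn 0.24 and later accept `MinMaxScaler(clip=True)`, which clips transformed values to the feature range; whether clipping is appropriate depends on the model.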
# If you skip the fit step and call transform() directly on a freshly created scaler, it raises an error.
minmaxTransformer = MinMaxScaler(feature_range=(0,1))
train_transformer2 = minmaxTransformer.transform(train)
Traceback (most recent call last):
File "" , line 1, in <module>
train_transformer2 = minmaxTransformer.transform(train)
File "D:\Program Files\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py", line 352, in transform
check_is_fitted(self, 'scale_')
File "D:\Program Files\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 690, in check_is_fitted
raise _NotFittedError(msg % {'name': type(estimator).__name__})
NotFittedError: This MinMaxScaler instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
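The error above can also be handled in code. This is a small sketch, assuming a cut-down two-column version of the train array; it catches `NotFittedError` from `sklearn.exceptions`, then fits the scaler and retries.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import MinMaxScaler

# A cut-down, two-column version of train, for illustration only.
train = np.array([[68, -63], [49, -49], [46, 65], [-72, 30]])

scaler = MinMaxScaler()
try:
    scaler.transform(train)  # no fit yet, so this raises
    raised = False
except NotFittedError:
    raised = True

print(raised)

# After fitting, transform works as expected.
scaler.fit(train)
scaled = scaler.transform(train)
```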
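# If you call fit_transform() on test as well, the scaler is refit on test's own xMin and xMax, so the result differs from the one above. For many machine learning algorithms, train and test should be preprocessed in the same way, i.e. with the statistics fitted on train.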
test_transformer2 = minmaxTransformer.fit_transform(test)
test_transformer2
Out[71]:
array([[ 1., 0., 1., 0.],
[ 0., 1., 0., 1.]])
test_transformer
Out[72]:
array([[ 1.19285714, -0.1953125 , 0.90909091, 0.31632653],
[ 0.75714286, -0.140625 , 0.72727273, 1.66326531]])
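To summarize the comparison, here is a minimal sketch using the same train and test arrays as in this session. It illustrates two points: fit_transform(X) gives the same result as fit(X) followed by transform(X), and refitting on test produces different values from reusing the scaler fitted on train. The variable names are my own.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[ 68, -63, -31, -10],
                  [ 49, -49,  73,  18],
                  [ 46,  65,  75, -78],
                  [-72,  30,  90, -80]])
test = np.array([[95, -88, 79, -49],
                 [34, -81, 57,  83]])

# fit_transform(X) is equivalent to fit(X) followed by transform(X).
a = MinMaxScaler().fit_transform(train)
b = MinMaxScaler().fit(train).transform(train)
equivalent = np.allclose(a, b)

# Consistent preprocessing: fit once on train, reuse the scaler for test.
scaler = MinMaxScaler().fit(train)
test_shared = scaler.transform(test)

# Refitting on test uses test's own min and max instead.
# With only two rows, every column collapses to 0 and 1.
test_refit = MinMaxScaler().fit_transform(test)

print(equivalent)
print(test_shared)
print(test_refit)
```

So the two methods are not interchangeable on test data: fit_transform() is for the data the scaler should learn from (train), and transform() is for everything that must be scaled with those same statistics.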