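In practice, data often has missing values, needs to be sampled, or contains features whose ranges differ widely. All of these call for preprocessing. Below are a few simple preprocessing methods, using the preprocessing module of scikit-learn together with NumPy.
1. Imputing missing data: from sklearn.preprocessing import Imputer (removed in scikit-learn 0.22; newer versions use from sklearn.impute import SimpleImputer)
2. Random sampling of data
3. Data scaling: from sklearn.preprocessing import MinMaxScaler
4. Data standardization: from sklearn.preprocessing import scale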
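1. Imputing missing data:
# Load libraries
from sklearn.datasets import load_iris
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer from sklearn.impute is its replacement
from sklearn.impute import SimpleImputer
import numpy as np
import numpy.ma as ma
# 1. Load the Iris dataset
data = load_iris()
x = data['data']
y = data['target']
# Make a copy so the original data stays untouched
x_t = x.copy()
# 2. Create some missing values in the row at index 2 (0 marks a missing entry)
x_t[2,:] = np.repeat(0,x.shape[1])
# 3. Create an imputer object that uses the mean strategy
imputer = SimpleImputer(missing_values=0,strategy="mean")
# Median strategy
#imputer = SimpleImputer(missing_values=0,strategy="median")
x_imputed = imputer.fit_transform(x_t)
# 4. Mask the modified row and compare the column means of the remaining rows
#    with the imputed values
mask = np.zeros_like(x_t)
mask[2,:] = 1
x_t_m = ma.masked_array(x_t,mask)
print(np.mean(x_t_m,axis=0))
print(x_imputed[2,:])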
# Output with the mean strategy
[5.851006711409397 3.053020134228189 3.7751677852349017 1.2053691275167793]
[ 5.85100671 3.05302013 3.77516779 1.20536913]
# Output with the median strategy
[5.851006711409397 3.053020134228189 3.7751677852349017 1.2053691275167793]
[ 5.8 3. 4.4 1.3]
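The example above marks missing entries with 0, which only works because no Iris measurement is actually 0. In many datasets missing values are encoded as NaN instead. As a minimal sketch of what the mean strategy computes, the NumPy-only code below fills each NaN with the mean of the observed values in its column. The function name impute_column_means and the toy array are illustrative assumptions, not part of scikit-learn.

```python
import numpy as np

def impute_column_means(a):
    """Return a copy of `a` with each NaN replaced by its column's mean."""
    a = np.asarray(a, dtype=float)
    col_means = np.nanmean(a, axis=0)        # means over the non-NaN entries only
    filled = a.copy()
    rows, cols = np.where(np.isnan(filled))  # positions of the missing entries
    filled[rows, cols] = col_means[cols]
    return filled

# Toy data (hypothetical): one missing value in each column
toy = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, np.nan]])
toy_filled = impute_column_means(toy)
print(toy_filled)
```

Swapping np.nanmean for np.nanmedian would give the median strategy in the same way.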
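2. Random sampling of data:
from sklearn.datasets import load_iris
import numpy as np
# 1. Load the Iris dataset
data = load_iris()
x = data['data']
# 2. Randomly sample 10 records from the loaded dataset
#    (np.random.choice samples with replacement by default)
no_records = 10
x_sample_idx = np.random.choice(range(x.shape[0]),no_records)
print(x[x_sample_idx,:])
# Output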
[[ 6.4 2.9 4.3 1.3]
[ 6.4 2.8 5.6 2.2]
[ 5.3 3.7 1.5 0.2]
[ 6.7 2.5 5.8 1.8]
[ 5.5 2.6 4.4 1.2]
[ 5.1 3.5 1.4 0.3]
[ 5.1 3.3 1.7 0.5]
[ 6.1 2.8 4.7 1.2]
[ 5.1 3.7 1.5 0.4]
[ 4.4 3. 1.3 0.2]]
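Because np.random.choice samples with replacement by default, the same record can appear more than once in a sample. If distinct records are wanted, one option is to pass replace=False. The sketch below does that and also uses a seeded generator so the sample is reproducible. The stand-in array and the seed value are assumptions made for illustration, so that the code runs without loading Iris.

```python
import numpy as np

# Stand-in data with the same shape as the Iris feature matrix (150 x 4)
x_demo = np.arange(150 * 4, dtype=float).reshape(150, 4)

rng = np.random.default_rng(seed=0)   # seeded for reproducibility
no_records = 10
# replace=False guarantees that no row index is drawn twice
sample_idx = rng.choice(x_demo.shape[0], size=no_records, replace=False)
x_sample = x_demo[sample_idx, :]
print(sample_idx)
print(x_sample.shape)
```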
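3. Data scaling:
Sometimes the columns (dimensions) of a dataset differ greatly in range. In user data, for example, the first column might be the number of items purchased and the second the amount spent, so the two columns cover very different ranges and are awkward to work with together. Scaling the data solves this.
Principle: x_scaled = (x - min(x)) / (max(x) - min(x))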
import numpy as np
# 1. Generate some random data to scale
np.random.seed(10)
x = [np.random.randint(10,25)*1.0 for i in range(10)]
# 2. Define a function that scales the data to the range [0, 1]
def min_max(x):
    return [round((xx-min(x))/(1.0*(max(x)-min(x))),2) for xx in x]
# 3. Scale the given data
print(x)
print(min_max(x))
# Output
[19.0, 23.0, 14.0, 10.0, 11.0, 21.0, 22.0, 19.0, 23.0, 10.0]
[0.69, 1.0, 0.31, 0.0, 0.08, 0.85, 0.92, 0.69, 1.0, 0.0]
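One edge case the hand-written function does not cover: if every value is identical, max(x) - min(x) is 0 and the division fails. A possible guard, sketched below, maps such a constant list to 0.0. That fallback value is an assumption chosen for illustration, and the name min_max_safe is hypothetical.

```python
def min_max_safe(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # No spread to scale by, so fall back to 0.0 for every entry
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

print(min_max_safe([19.0, 23.0, 14.0, 10.0]))
print(min_max_safe([5.0, 5.0, 5.0]))
```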
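# 4. Do the same with sklearn.preprocessing
from sklearn.preprocessing import MinMaxScaler
# MinMaxScaler expects a 2-D array of shape (n_samples, n_features),
# so the 10 values are reshaped into a single column
x1 = np.array([np.random.randint(10,25)*1.0 for i in range(10)]).reshape(-1,1)
minmax = MinMaxScaler(feature_range=(0.0,1.0))
x1_t = minmax.fit_transform(x1)
print(x1.T)
print(x1_t.T)
# Output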
[[ 23. 11. 20. 18. 19. 10. 20. 18. 16. 14.]]
[[ 1. 0.07692308 0.76923077 0.61538462 0.69230769 0.
0.76923077 0.61538462 0.46153846 0.30769231]]
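MinMaxScaler learns the minimum and maximum of each column in fit and applies them in transform, which lets the same scaling be reused on new data. The NumPy-only sketch below mimics that split for a two-column matrix. The helper names fit_min_max and apply_min_max and the toy values are illustrative assumptions, and the sketch assumes no column is constant.

```python
import numpy as np

def fit_min_max(train):
    """Return the per-column minimum and range learned from the training rows."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min   # assumed non-zero in every column
    return col_min, col_range

def apply_min_max(data, col_min, col_range):
    """Scale data with previously learned column statistics."""
    return (data - col_min) / col_range

# Column 0: items purchased, column 1: amount spent (made-up values)
train = np.array([[1.0, 100.0],
                  [3.0, 500.0],
                  [5.0, 900.0]])
new_rows = np.array([[2.0, 300.0],
                     [6.0, 1000.0]])

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
new_scaled = apply_min_max(new_rows, col_min, col_range)
print(train_scaled)
print(new_scaled)
```

Values outside the training range, such as the second new row, end up outside [0, 1], which is worth keeping in mind when scaling unseen data.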
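4. Data standardization:
Standardization transforms the input values so that they have a mean of 0 and a standard deviation of 1.
Principle: x_standardized = (x - mean(x)) / std(x)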
# x_standardized = (x - mean(x)) / std(x)
import numpy as np
from sklearn.preprocessing import scale
# Generate the data
np.random.seed(10)
x = [np.random.randint(10,25)*1.0 for i in range(10)]
# Center only (subtract the mean), then fully standardize (also divide by the standard deviation)
x_centered = scale(x,with_mean=True,with_std=False)
x_standard = scale(x,with_mean=True,with_std=True)
print(x)
print(x_centered)
print(x_standard)
print("original x mean=%0.2f, centered x mean=%0.2f, standard x mean=%0.2f" %(np.mean(x),np.mean(x_centered),np.mean(x_standard)))
# Output
[19.0, 23.0, 14.0, 10.0, 11.0, 21.0, 22.0, 19.0, 23.0, 10.0]
[ 1.8 5.8 -3.2 -7.2 -6.2 3.8 4.8 1.8 5.8 -7.2]
[ 0.35059022 1.12967961 -0.62327151 -1.4023609 -1.20758855 0.74013492
0.93490726 0.35059022 1.12967961 -1.4023609 ]
original x mean=17.20, centered x mean=0.00, standard x mean=0.00
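As a cross-check, the standardized values can be reproduced with plain NumPy. The sketch below assumes the population standard deviation (ddof=0, NumPy's default), which is what the printed output above is consistent with. The variable names are chosen here for illustration.

```python
import numpy as np

# The same ten values as in the printed output above
x = np.array([19.0, 23.0, 14.0, 10.0, 11.0, 21.0, 22.0, 19.0, 23.0, 10.0])

x_centered = x - x.mean()            # subtract the mean only
x_standard = x_centered / x.std()    # x.std() uses ddof=0 by default

print(x_centered)
print(x_standard)
print("mean=%0.2f, std=%0.2f" % (x_standard.mean(), x_standard.std()))
```

Using the sample standard deviation instead (x.std(ddof=1)) would give slightly smaller magnitudes, so the two conventions should not be mixed when comparing results.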