Data Processing: A Few Simple Preprocessing Techniques

When working with real-world data you often run into missing values, need to draw a sample of records, or find that different features sit on wildly different scales. All of these call for preprocessing. Below are a few simple techniques, using the preprocessing module from scikit-learn together with numpy:

1. Imputing missing values   from sklearn.preprocessing import Imputer

2. Random sampling

3. Scaling   from sklearn.preprocessing import MinMaxScaler

4. Standardization   from sklearn.preprocessing import scale


1. Imputing missing values:

# load libraries
from sklearn.datasets import load_iris
from sklearn.preprocessing import Imputer  # removed in scikit-learn 0.22; use sklearn.impute.SimpleImputer there
import numpy as np
import numpy.ma as ma

# 1. load the Iris dataset
data = load_iris()
x = data['data']
y = data['target']

# make a copy
x_t = x.copy()

# 2. fabricate some missing values in row 2
x_t[2,:] = np.repeat(0, x.shape[1])

# 3. create an imputer object with the mean strategy
imputer = Imputer(missing_values=0, strategy="mean")
# median
#imputer = Imputer(missing_values=0, strategy="median")

x_imputed = imputer.fit_transform(x_t)

# mask out row 2 and compare the remaining column means with the imputed values
mask = np.zeros_like(x_t)
mask[2,:] = 1
x_t_m = ma.masked_array(x_t, mask)

print(np.mean(x_t_m, axis=0))
print(x_imputed[2,:])

# mean
[5.851006711409397 3.053020134228189 3.7751677852349017 1.2053691275167793]
[ 5.85100671  3.05302013  3.77516779  1.20536913]

# median
[5.851006711409397 3.053020134228189 3.7751677852349017 1.2053691275167793]
[ 5.8  3.   4.4  1.3]
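Note that `Imputer` was removed in scikit-learn 0.22. A minimal sketch of the same mean-imputation step with its replacement, `SimpleImputer` (which marks missing entries with `NaN` by default), assuming a reasonably recent scikit-learn:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer

x = load_iris()['data'].copy()
x[2, :] = np.nan  # fabricate missing values in row 2

# fill NaNs with the per-column mean of the remaining rows
imputer = SimpleImputer(strategy="mean")  # or strategy="median"
x_imputed = imputer.fit_transform(x)

print(x_imputed[2, :])
```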

2. Random sampling:

from sklearn.datasets import load_iris
import numpy as np

# 1. load the Iris dataset
data = load_iris()
x = data['data']

# 2. draw 10 records at random from the dataset
# note: np.random.choice samples with replacement by default
no_records = 10
x_sample_idx = np.random.choice(range(x.shape[0]), no_records)
print(x[x_sample_idx,:])

#out
[[ 6.4  2.9  4.3  1.3]
 [ 6.4  2.8  5.6  2.2]
 [ 5.3  3.7  1.5  0.2]
 [ 6.7  2.5  5.8  1.8]
 [ 5.5  2.6  4.4  1.2]
 [ 5.1  3.5  1.4  0.3]
 [ 5.1  3.3  1.7  0.5]
 [ 6.1  2.8  4.7  1.2]
 [ 5.1  3.7  1.5  0.4]
 [ 4.4  3.   1.3  0.2]]
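Because `np.random.choice` samples with replacement by default, the ten rows above may contain duplicates. A minimal sketch of duplicate-free sampling, passing `replace=False` and using a seeded generator for reproducibility:

```python
import numpy as np
from sklearn.datasets import load_iris

x = load_iris()['data']

# sample 10 distinct row indices, then select those rows
rng = np.random.RandomState(0)
idx = rng.choice(x.shape[0], size=10, replace=False)
sample = x[idx, :]

print(sample.shape)  # (10, 4)
```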

3. Scaling:

Different columns (dimensions) of a dataset can differ enormously in range. In customer data, for example, one column might count items purchased while another records the amount spent; the gap between the two ranges makes the data awkward to work with. Scaling the data solves this.

Formula: x_scaled = (x - min(x)) / (max(x) - min(x))

import numpy as np

# 1. generate some random data to scale
np.random.seed(10)
x = [np.random.randint(10,25)*1.0 for i in range(10)]

# 2. define a function that scales the data into [0, 1]
def min_max(x):
    return [round((xx-min(x))/(1.0*(max(x)-min(x))), 2) for xx in x]

# 3. scale the given data
print(x)
print(min_max(x))

#OUT
[19.0, 23.0, 14.0, 10.0, 11.0, 21.0, 22.0, 19.0, 23.0, 10.0]
[0.69, 1.0, 0.31, 0.0, 0.08, 0.85, 0.92, 0.69, 1.0, 0.0]

# 4. the same with sklearn.preprocessing
from sklearn.preprocessing import MinMaxScaler
# np.matrix is deprecated and rejected by modern scikit-learn; use a column array
x1 = np.array([np.random.randint(10,25)*1.0 for i in range(10)]).reshape(-1, 1)
minmax = MinMaxScaler(feature_range=(0.0,1.0))
x1_t = minmax.fit_transform(x1)
print(x1.T)
print(x1_t.T)

#OUT
[[ 23.  11.  20.  18.  19.  10.  20.  18.  16.  14.]]
[[ 1.          0.07692308  0.76923077  0.61538462  0.69230769  0.
   0.76923077  0.61538462  0.46153846  0.30769231]]
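An advantage of the `MinMaxScaler` object over the hand-written function is that it stores the fitted min and max, so the same transform can be applied to new data or inverted. A minimal sketch with made-up values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[10.], [25.], [15.]])
scaler = MinMaxScaler(feature_range=(0.0, 1.0)).fit(train)  # learns min=10, max=25

scaled = scaler.transform(train)
restored = scaler.inverse_transform(scaled)  # undo the scaling

print(scaled.ravel())    # 0, 1, and 1/3 of the way between min and max
print(restored.ravel())  # back to the original 10, 25, 15
```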
 


4. Standardization:

Transforms the input values so that they have mean 0 and standard deviation 1.

Formula: x_scaled = (x - mean(x)) / std(x)

# x_scaled = (x - mean(x)) / std(x)
import numpy as np
from sklearn.preprocessing import scale

# generate data
np.random.seed(10)
x = [np.random.randint(10,25)*1.0 for i in range(10)]

# standardize
x_centered = scale(x, with_mean=True, with_std=False)  # center only
x_standard = scale(x, with_mean=True, with_std=True)   # center and scale

print(x)
print(x_centered)
print(x_standard)
print("original x mean=%0.2f, centered x mean=%0.2f, standard x mean=%0.2f" % (np.mean(x), np.mean(x_centered), np.mean(x_standard)))

#out
[19.0, 23.0, 14.0, 10.0, 11.0, 21.0, 22.0, 19.0, 23.0, 10.0]
[ 1.8  5.8 -3.2 -7.2 -6.2  3.8  4.8  1.8  5.8 -7.2]
[ 0.35059022  1.12967961 -0.62327151 -1.4023609  -1.20758855  0.74013492
  0.93490726  0.35059022  1.12967961 -1.4023609 ]
original x mean=17.20, centered x mean=0.00, standard x mean=0.00
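The `scale` function standardizes in one shot. To reuse the same mean and standard deviation on later data (for example, a test set that must be standardized with the training set's statistics), scikit-learn provides the `StandardScaler` class. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[19.], [23.], [14.], [10.], [11.]])
scaler = StandardScaler().fit(train)  # learns mean_ and scale_ from train

train_std = scaler.transform(train)
print(np.mean(train_std), np.std(train_std))  # approximately 0.0 and 1.0

# the stored statistics are applied unchanged to new data
new = np.array([[17.2]])
print(scaler.transform(new))
```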


