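<1> Import the data
<2> Split the data into the inputs and outputs the algorithm expects
<3> Format the input data
<4> Summarize and display how the data changed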
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import MinMaxScaler
# ======================================================================
# 1. Rescale the data
# Load the data
file_name = r'../pima_data.csv'
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
data = read_csv(file_name,names=names)
# Split the data into input features and the output label
array = data.values
print(array)
X = array[:,0:8]
print("="*40)
print(X)
Y = array[:,8]
print("="*40)
print(Y)  # class labels
'''
[[ 6. 148. 72. ..., 0.627 50. 1. ]
[ 1. 85. 66. ..., 0.351 31. 0. ]
[ 8. 183. 64. ..., 0.672 32. 1. ]
...,
[ 5. 121. 72. ..., 0.245 30. 0. ]
[ 1. 126. 60. ..., 0.349 47. 1. ]
[ 1. 93. 70. ..., 0.315 23. 0. ]]
========================================
[[ 6. 148. 72. ..., 33.6 0.627 50. ]
[ 1. 85. 66. ..., 26.6 0.351 31. ]
[ 8. 183. 64. ..., 23.3 0.672 32. ]
...,
[ 5. 121. 72. ..., 26.2 0.245 30. ]
[ 1. 126. 60. ..., 30.1 0.349 47. ]
[ 1. 93. 70. ..., 30.4 0.315 23. ]]
========================================
[1-D array of class labels, each 0. or 1.]
'''
print('='*40)
transformer = MinMaxScaler(feature_range=(0,1))
# Transform the data
newX = transformer.fit_transform(X)
# Set the print precision
set_printoptions(precision=3)
print(newX)
'''
[[ 0.353 0.744 0.59 ..., 0.501 0.234 0.483]
[ 0.059 0.427 0.541 ..., 0.396 0.117 0.167]
[ 0.471 0.92 0.525 ..., 0.347 0.254 0.183]
...,
[ 0.294 0.608 0.59 ..., 0.39 0.071 0.15 ]
[ 0.059 0.633 0.492 ..., 0.449 0.116 0.433]
[ 0.059 0.467 0.574 ..., 0.453 0.101 0.033]]
'''
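As a sanity check on what MinMaxScaler computes, the following minimal sketch reproduces it by hand as (x - column min) / (column max - column min). The array X_toy and its values are made up for illustration and are not taken from pima_data.csv.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values, not from pima_data.csv)
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [4.0, 40.0]])

# Manual min-max scaling, computed column by column
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
manual = (X_toy - col_min) / (col_max - col_min)

# The same transform through scikit-learn
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_toy)

# Both results should agree up to floating-point error
print(np.allclose(manual, scaled))
```

Each column is scaled independently, so a column with a large raw range ends up in the same [0, 1] interval as a column with a small one.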
# ======================================================================
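# 2. Standardize the data
from sklearn.preprocessing import StandardScaler
'''
Standardization is an effective way to prepare data that follows a Gaussian
distribution: each output column has a mean of 0 and a variance of 1, and the
result serves as input to algorithms that assume Gaussian-distributed data.
'''
print('#'*30,'Standardize data','#'*30)
# Load the data
file_name = r'../pima_data.csv'
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
data = read_csv(file_name,names=names)
# Split the data into input features and the output label
array = data.values
X = array[:,0:8]
Y = array[:,8]
transformer = StandardScaler().fit(X)
# Transform the data
newX = transformer.transform(X)
# Set the print precision
set_printoptions(precision=3)
print(newX)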
'''
[[ 0.64 0.848 0.15 ..., 0.204 0.468 1.426]
[-0.845 -1.123 -0.161 ..., -0.684 -0.365 -0.191]
[ 1.234 1.944 -0.264 ..., -1.103 0.604 -0.106]
...,
[ 0.343 0.003 0.15 ..., -0.735 -0.685 -0.276]
[-0.845 0.16 -0.471 ..., -0.24 -0.371 1.171]
[-0.845 -0.873 0.046 ..., -0.202 -0.474 -0.871]]
'''
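To make the mean-0, variance-1 property concrete, here is a minimal sketch that computes (x - column mean) / (column standard deviation) by hand and compares it with StandardScaler. X_toy is a made-up array for illustration, not the Pima data; the sketch relies on np.std using the population standard deviation (ddof=0) by default, which is what StandardScaler uses as well.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, not from pima_data.csv)
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

# Manual standardization: subtract the column mean, divide by the column std
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# The same transform through scikit-learn
scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual, scaled))  # results agree up to floating-point error
print(scaled.mean(axis=0))          # column means, approximately 0
print(scaled.std(axis=0))           # column standard deviations, approximately 1
```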
# ======================================================================
# 3. Normalize the data
from sklearn.preprocessing import Normalizer
'''
Normalization rescales each row so that its vector length is 1 (a unit vector
in linear algebra); this is also known as scaling to unit norm. It is well
suited to sparse data (data with many zeros).
'''
print('#'*30,'Normalize data','#'*30)
# Load the data
file_name = r'../pima_data.csv'
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
data = read_csv(file_name,names=names)
# Split the data into input features and the output label
array = data.values
X = array[:,0:8]
Y = array[:,8]
transformer = Normalizer().fit(X)
# Transform the data
newX = transformer.transform(X)
# Set the print precision
set_printoptions(precision=3)
print(newX)
'''
############################## Normalize data ##############################
[[ 0.034 0.828 0.403 ..., 0.188 0.004 0.28 ]
[ 0.008 0.716 0.556 ..., 0.224 0.003 0.261]
[ 0.04 0.924 0.323 ..., 0.118 0.003 0.162]
...,
[ 0.027 0.651 0.388 ..., 0.141 0.001 0.161]
[ 0.007 0.838 0.399 ..., 0.2 0.002 0.313]
[ 0.008 0.736 0.554 ..., 0.241 0.002 0.182]]
'''
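Unlike the two scalers above, Normalizer works row by row rather than column by column. The minimal sketch below divides each row by its Euclidean (L2) length and checks the result against Normalizer with its default L2 norm. X_toy is a made-up array for illustration, not the Pima data.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Toy data (hypothetical values, not from pima_data.csv)
X_toy = np.array([[3.0, 4.0],
                  [1.0, 0.0],
                  [6.0, 8.0]])

# Manual normalization: divide every row by its own L2 norm
row_norms = np.linalg.norm(X_toy, axis=1, keepdims=True)
manual = X_toy / row_norms

# The same transform through scikit-learn (L2 is the default norm)
scaled = Normalizer(norm='l2').fit_transform(X_toy)

print(np.allclose(manual, scaled))     # results agree up to floating-point error
print(np.linalg.norm(scaled, axis=1))  # every row now has length 1
```

Because each row is handled on its own, calling fit learns nothing from the data; rows [3, 4] and [6, 8] point in the same direction and therefore map to the same unit vector.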
# ======================================================================
# 4. Binarize the data
from sklearn.preprocessing import Binarizer
'''
Binarization uses a threshold to convert the data to binary values: values
greater than the threshold become 1, and values less than or equal to it
become 0. This process is also called thresholding.
'''
print('#'*30,'Binarize data','#'*30)
# Load the data
file_name = r'../pima_data.csv'
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
data = read_csv(file_name,names=names)
# Split the data into input features and the output label
array = data.values
X = array[:,0:8]
Y = array[:,8]
transformer = Binarizer(threshold=0.0).fit(X)
# Transform the data
newX = transformer.transform(X)
# Set the print precision
set_printoptions(precision=3)
print(newX)
'''
############################## Binarize data ##############################
[[ 1. 1. 1. ..., 1. 1. 1.]
[ 1. 1. 1. ..., 1. 1. 1.]
[ 1. 1. 1. ..., 1. 1. 1.]
...,
[ 1. 1. 1. ..., 1. 1. 1.]
[ 1. 1. 1. ..., 1. 1. 1.]
[ 1. 1. 1. ..., 1. 1. 1.]]
'''
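The threshold rule is easy to verify on a small array. This minimal sketch compares Binarizer with the plain comparison X > threshold; X_toy is made up for illustration and deliberately contains zeros and a negative value to show what happens at and below the threshold.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Toy data (hypothetical values, not from pima_data.csv)
X_toy = np.array([[0.0, 1.5],
                  [2.0, 0.0],
                  [-1.0, 3.0]])

threshold = 0.0

# Manual binarization: 1 where the value is strictly above the threshold, else 0
manual = (X_toy > threshold).astype(float)

# The same transform through scikit-learn
binary = Binarizer(threshold=threshold).fit_transform(X_toy)

print(np.array_equal(manual, binary))
print(binary)
```

Values exactly equal to the threshold map to 0, so with threshold=0.0 any zero entries in the input stay 0 while every positive entry becomes 1.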