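After feature extraction the data often still has several problems: missing values, features that are not on the same scale (different units), redundant information, and poorly utilized information.
The examples below use the iris dataset.
Load the iris data: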
from sklearn.datasets import load_iris
iris = load_iris()
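Data preprocessing
Rescaling converts data measured on different scales to a common scale.
Standardization works on the columns of the feature matrix: each feature is shifted and scaled to zero mean and unit variance (a standard normal distribution if the feature was normally distributed to begin with). Here X̄ is the column mean and S is the column standard deviation.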
$$x' = \frac{x - \bar{X}}{S}$$
from sklearn.preprocessing import StandardScaler
StandardScaler().fit_transform(iris.data)
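To make the formula concrete, here is a minimal sketch that recomputes the standardization by hand with NumPy and compares it with the StandardScaler result. The variable names `manual` and `scaled` are illustrative only. It assumes the population standard deviation (ddof=0), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Column-wise mean and population standard deviation (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = StandardScaler().fit_transform(X)

# Both routes should agree, and every column should end up
# with mean 0 and standard deviation 1.
print(np.allclose(manual, scaled))
print(np.allclose(scaled.mean(axis=0), 0.0))
print(np.allclose(scaled.std(axis=0), 1.0))
```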
Interval scaling can be done in several ways; a common one rescales each feature using its minimum and maximum values.
$$x' = \frac{x - Min}{Max - Min}$$
from sklearn.preprocessing import MinMaxScaler
MinMaxScaler().fit_transform(iris.data)
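The min-max formula can be checked the same way. The sketch below (variable names are illustrative) applies it column by column and compares the result with MinMaxScaler at its default feature range of (0, 1).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = load_iris().data

# Per-column minimum and maximum
col_min = X.min(axis=0)
col_max = X.max(axis=0)
manual = (X - col_min) / (col_max - col_min)

scaled = MinMaxScaler().fit_transform(X)

# The two results should match, and each column should span [0, 1].
print(np.allclose(manual, scaled))
print(scaled.min(axis=0), scaled.max(axis=0))
```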
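Normalization processes the data along the rows of the feature matrix. Its purpose is to give sample vectors a common standard when their similarity is computed with a dot product or another kernel function; in other words, every sample is converted to a "unit vector". The L2 normalization formula is: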
$$x' = \frac{x}{\sqrt{\sum_{i}^{n} x_i^2}}$$
from sklearn.preprocessing import Normalizer
Normalizer().fit_transform(iris.data)
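Unlike the two scalers above, Normalizer works per row. A minimal sketch, with illustrative variable names, that divides every sample by its L2 norm and confirms that each row of the result has length 1:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import Normalizer

X = load_iris().data

# L2 norm of every sample (row), kept as a column for broadcasting
row_norms = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
manual = X / row_norms

# Normalizer defaults to norm='l2'
normalized = Normalizer().fit_transform(X)

print(np.allclose(manual, normalized))
print(np.allclose(np.linalg.norm(normalized, axis=1), 1.0))
```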
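Binarization of quantitative features: choose a threshold; values greater than the threshold are set to 1, and values less than or equal to the threshold are set to 0.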
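As a rough sketch of the thresholding rule, the snippet below applies Binarizer with threshold=3 (the same value used in the full scripts later in this post) and compares it with a plain NumPy comparison. The names `binary` and `manual` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import Binarizer

X = load_iris().data

# Values strictly greater than 3 become 1, everything else becomes 0
binary = Binarizer(threshold=3).fit_transform(X)
manual = (X > 3).astype(X.dtype)

print(np.array_equal(binary, manual))
print(binary[:5])
```

A single threshold applied to all four features is crude, because the features have different ranges; in practice one would likely binarize each column with its own threshold.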
Apply standardization, interval scaling, normalization, and binarization to the iris dataset in turn:
# Feature engineering
import numpy as np
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
iris = load_iris()
x = np.arange(len(iris.data))
iris0 = [iris.data[i][0] for i in range(len(iris.data))]
iris1 = [iris.data[i][1] for i in range(len(iris.data))]
iris2 = [iris.data[i][2] for i in range(len(iris.data))]
iris3 = [iris.data[i][3] for i in range(len(iris.data))]
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import Binarizer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode the class labels (the result is a sparse matrix)
print(OneHotEncoder().fit_transform(iris.target.reshape((-1, 1))))
# Top panel: the four raw features
plt.subplot(511), plt.plot(x, iris0, 'ys', x, iris1, 'g^', x, iris2, 'co', x, iris3, 'r*')
for index in range(4):
    if index == 0:
        # Standardization: needs the mean and standard deviation of each feature
        iris_standard = StandardScaler().fit_transform(iris.data)
    elif index == 1:
        # Interval scaling: commonly done with the minimum and maximum values
        iris_standard = MinMaxScaler().fit_transform(iris.data)
    elif index == 2:
        # Normalization: scales each sample (row) to unit length
        iris_standard = Normalizer().fit_transform(iris.data)
    elif index == 3:
        # Binarization: values above the threshold become 1, the rest 0
        iris_standard = Binarizer(threshold=3).fit_transform(iris.data)
    iris_standard0 = [iris_standard[i][0] for i in range(len(iris_standard))]
    iris_standard1 = [iris_standard[i][1] for i in range(len(iris_standard))]
    iris_standard2 = [iris_standard[i][2] for i in range(len(iris_standard))]
    iris_standard3 = [iris_standard[i][3] for i in range(len(iris_standard))]
    # Panels 2 to 5: one panel per transformation
    plt.subplot(512 + index), plt.plot(x, iris_standard0, 'ys', x, iris_standard1, 'g^', x, iris_standard2, 'co', x, iris_standard3, 'r*')
plt.show()
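Output:
Using the first, second, and third features of the iris data as the x, y, and z coordinates, and the fourth feature as the color of each point, gives a rough picture of how the iris data is distributed, as shown in the figure: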
# Feature engineering
import numpy as np
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older matplotlib
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import Binarizer
iris = load_iris()
# Change index to choose which version of the data to plot
index = 0
if index == 0:
    # Raw data
    iris_standard = iris.data
elif index == 1:
    # Standardization: needs the mean and standard deviation of each feature
    iris_standard = StandardScaler().fit_transform(iris.data)
elif index == 2:
    # Interval scaling: commonly done with the minimum and maximum values
    iris_standard = MinMaxScaler().fit_transform(iris.data)
elif index == 3:
    # Normalization
    iris_standard = Normalizer().fit_transform(iris.data)
elif index == 4:
    # Binarization of quantitative features: values above the threshold become 1, the rest 0
    iris_standard = Binarizer(threshold=3).fit_transform(iris.data)
iris0 = [iris_standard[i][0] for i in range(len(iris_standard))]
iris1 = [iris_standard[i][1] for i in range(len(iris_standard))]
iris2 = [iris_standard[i][2] for i in range(len(iris_standard))]
iris3 = [iris_standard[i][3] for i in range(len(iris_standard))]
# iris3 = iris.target
fig = plt.figure()
# facecolor replaces the axisbg argument, which newer matplotlib versions no longer accept
ax = fig.add_subplot(111, projection='3d', facecolor='k')
xs = iris0
ys = iris1
zs = iris2
# plt.get_cmap replaces plt.cm.get_cmap, which newer matplotlib versions no longer provide
cm = plt.get_cmap('RdYlBu')
ax.scatter(xs, ys, zs, c=iris3, vmin=min(iris3), vmax=max(iris3), s=10, cmap=cm, marker='o', linewidth=2, antialiased=True)
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')
plt.show()
Reference: https://www.cnblogs.com/jasonfreak/p/5448385.html