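Abstract: This article covers a basic part of machine learning: data preprocessing. Before studying algorithms, we need to be able to process data. The main topics are an overview of the preprocessing steps, a simple code example, the output it produces, and detailed code comments.
Import the required libraries -> import the dataset -> handle missing data -> encode categorical data -> split the dataset -> feature scaling
Here is the full content of the Data.csv file: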
Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
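Before running the full pipeline, it can help to look at where the gaps in this file are. The following is a minimal sketch, not part of the original code: it inlines the same ten rows as a string (the name CSV_TEXT is made up for illustration) so that it runs without Data.csv, counts the missing entries per column, and computes the column means that mean imputation would fill in.

```python
import io

import pandas as pd

# The same ten rows as Data.csv, inlined so the sketch runs without the file.
# CSV_TEXT is a hypothetical name used only in this example.
CSV_TEXT = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""

dataset = pd.read_csv(io.StringIO(CSV_TEXT))

# Count missing entries per column: Age and Salary each have one gap.
missing_counts = dataset.isna().sum()
print(missing_counts)

# Column means over the non-missing values; mean imputation fills
# each gap with these numbers.
age_mean = dataset["Age"].mean()
salary_mean = dataset["Salary"].mean()
print(round(age_mean, 2), round(salary_mean, 2))  # 38.78 63777.78
```

These two means are the values that appear in the "after handling missing data" output further down (38.77777777777778 and 63777.77777777778).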
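# Import libraries
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Load the dataset
def getData():
    dataset = pd.read_csv('Data.csv')
    X = dataset.iloc[:, :-1].values  # all rows, every column except the last one (Purchased)
    Y = dataset.iloc[:, 3].values    # column 3 (Purchased) is the result
    return X, Y

# Handle missing data (numeric columns only)
def proMissingData(X):
    imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
    # Columns 1 and 2 (Age, Salary): replace NaN with the column mean.
    # X is modified in place, so nothing is returned.
    imp_mean = imp_mean.fit(X[:, 1:3])
    X[:, 1:3] = imp_mean.transform(X[:, 1:3])

# Encode categorical data as numbers
def parseData(X, Y):
    # Column 0 of X takes 3 country values; represent it as a 3-dimensional vector.
    # OneHotEncoder(categorical_features=[0]) was removed in scikit-learn 0.22,
    # so ColumnTransformer picks the column here. OneHotEncoder accepts strings
    # directly, so no LabelEncoder pass over X is needed.
    ct = ColumnTransformer([('country', OneHotEncoder(), [0])],
                           remainder='passthrough', sparse_threshold=0)
    X = np.asarray(ct.fit_transform(X), dtype=float)
    # Label-encode Y (No: 0; Yes: 1)
    le_y = LabelEncoder()
    Y = le_y.fit_transform(Y)
    return X, Y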
# Split the dataset into a training set and a test set
def divide(X, Y):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
    return X_train, X_test, Y_train, Y_test

# Feature scaling
def standardScale(X_train, X_test):
    sc_X = StandardScaler()
    # Fit the scaler on the training set only, then reuse it for the test set
    X_train = sc_X.fit_transform(X_train)
    X_test = sc_X.transform(X_test)
    return X_train, X_test
if __name__ == '__main__':
    X, Y = getData()
    print("1. After loading the data:")
    print("X")
    print(X)
    print("Y")
    print(Y)
    print("=============================================================================")
    proMissingData(X)
    print("2. After handling missing data:")
    print("X")
    print(X)
    print("Y")
    print(Y)
    print("=============================================================================")
    X, Y = parseData(X, Y)
    print("3. After converting the data to numeric values:")
    print("X")
    print(X)
    print("Y")
    print(Y)
    print("=============================================================================")
    X_train, X_test, Y_train, Y_test = divide(X, Y)
    print("4. After splitting the data into training and test sets:")
    print("X_train=================================")
    print(X_train)
    print("X_test=================================")
    print(X_test)
    print("Y_train=================================")
    print(Y_train)
    print("Y_test=================================")
    print(Y_test)
    print("=============================================================================")
    X_train, X_test = standardScale(X_train, X_test)
    print("5. After feature scaling:")
    print("X_train=====================================")
    print(X_train)
    print("X_test=====================================")
    print(X_test)
The output is shown below:
1. After loading the data:
X
[['France' 44.0 72000.0]
['Spain' 27.0 48000.0]
['Germany' 30.0 54000.0]
['Spain' 38.0 61000.0]
['Germany' 40.0 nan]
['France' 35.0 58000.0]
['Spain' nan 52000.0]
['France' 48.0 79000.0]
['Germany' 50.0 83000.0]
['France' 37.0 67000.0]]
Y
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
=============================================================================
2. After handling missing data:
X
[['France' 44.0 72000.0]
['Spain' 27.0 48000.0]
['Germany' 30.0 54000.0]
['Spain' 38.0 61000.0]
['Germany' 40.0 63777.77777777778]
['France' 35.0 58000.0]
['Spain' 38.77777777777778 52000.0]
['France' 48.0 79000.0]
['Germany' 50.0 83000.0]
['France' 37.0 67000.0]]
Y
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
=============================================================================
3. After converting the data to numeric values:
X
[[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01
7.20000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01
4.80000000e+04]
[0.00000000e+00 1.00000000e+00 0.00000000e+00 3.00000000e+01
5.40000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01
6.10000000e+04]
[0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01
6.37777778e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01
5.80000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01
5.20000000e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01
7.90000000e+04]
[0.00000000e+00 1.00000000e+00 0.00000000e+00 5.00000000e+01
8.30000000e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01
6.70000000e+04]]
Y
[0 1 0 0 1 1 0 1 0 1]
=============================================================================
4. After splitting the data into training and test sets:
X_train=================================
[[0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01
6.37777778e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01
6.70000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01
4.80000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01
5.20000000e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01
7.90000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01
6.10000000e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01
7.20000000e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01
5.80000000e+04]]
X_test=================================
[[0.0e+00 1.0e+00 0.0e+00 3.0e+01 5.4e+04]
[0.0e+00 1.0e+00 0.0e+00 5.0e+01 8.3e+04]]
Y_train=================================
[1 1 1 0 1 0 0 1]
Y_test=================================
[0 0]
=============================================================================
5. After feature scaling:
X_train=====================================
[[-1. 2.64575131 -0.77459667 0.26306757 0.12381479]
[ 1. -0.37796447 -0.77459667 -0.25350148 0.46175632]
[-1. -0.37796447 1.29099445 -1.97539832 -1.53093341]
[-1. -0.37796447 1.29099445 0.05261351 -1.11141978]
[ 1. -0.37796447 -0.77459667 1.64058505 1.7202972 ]
[-1. -0.37796447 1.29099445 -0.0813118 -0.16751412]
[ 1. -0.37796447 -0.77459667 0.95182631 0.98614835]
[ 1. -0.37796447 -0.77459667 -0.59788085 -0.48214934]]
X_test=====================================
[[-1. 2.64575131 -0.77459667 -1.45882927 -0.90166297]
[-1. 2.64575131 -0.77459667 1.98496442 2.13981082]]
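The scaled numbers above can be checked by hand. The sketch below is an illustration rather than part of the original code: it takes only the Age column from the step 4 output (the variable names are made up for this example), applies z = (x - mean) / std using the training set's mean and standard deviation, and compares the result with StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age column of the training and test sets, copied from the step 4 output.
# 349 / 9 is the imputed mean that prints as 3.87777778e+01.
age_train = np.array([40.0, 37.0, 27.0, 349 / 9, 48.0, 38.0, 44.0, 35.0]).reshape(-1, 1)
age_test = np.array([30.0, 50.0]).reshape(-1, 1)

mu = age_train.mean()
sigma = age_train.std()  # population standard deviation (ddof=0)

manual_train = (age_train - mu) / sigma
manual_test = (age_test - mu) / sigma  # the test set reuses the training mean and std

# StandardScaler fitted on the training column gives the same values.
scaler = StandardScaler().fit(age_train)
assert np.allclose(manual_train, scaler.transform(age_train))
assert np.allclose(manual_test, scaler.transform(age_test))

print(manual_train[0, 0])   # first training row, Age column
print(manual_test.ravel())  # both test rows, Age column
```

The first training value comes out at about 0.263 and the two test values at about -1.459 and 1.985, matching the fourth column of the step 5 output. Because the test rows are scaled with the training statistics, their values need not have zero mean.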
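Notes on related concepts:
"""
Sometimes a feature's values are strings rather than numbers. Mapping each string directly to a single number gives the feature an ordering (a size relationship) that it does not really have.
In that case, use one-hot encoding.
There are two ways to do the conversion:
pandas.get_dummies(): the common choice; powerful and simple to use;
sklearn.preprocessing.OneHotEncoder(): more complicated to use and error-prone, so it is used less often.
categorical_features specifies which features to encode,
selected either by index or by boolean mask.
For example, OneHotEncoder(categorical_features=[0, 2]) is equivalent to [True, False, True],
i.e. columns 0 and 2 are encoded.
PS: if the original data has three columns and columns 0 and 2 are selected for encoding, each output row places the unencoded column 1
last.
Note: categorical_features was deprecated in scikit-learn 0.20 and removed in 0.22; in newer versions, select the columns with sklearn.compose.ColumnTransformer instead.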
"""
"""
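Q1: How does converting categorical data to numbers work?
Sometimes a feature's values are strings rather than numbers. Mapping each string directly to a single number gives the feature an ordering that it does not really have.
In that case, use one-hot encoding.
Column 0 of X takes 3 country values, so it is represented as a 3-dimensional vector.
Y is label-encoded (No: 0; Yes: 1).
Q2: What do the parameters of train_test_split mean?
Parameters:
train_data: the sample features to split (passed as the first positional argument; X in the code above)
train_target: the sample results to split (passed as the second positional argument; Y in the code above)
test_size: the proportion of samples placed in the test set; if it is an integer, it is the number of test samples
random_state: the seed for the random number generator
cross_validation: cross-validation (older scikit-learn versions provided train_test_split in the sklearn.cross_validation module; it now lives in sklearn.model_selection)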
"""
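This article is fairly simple and mainly serves as a record of my own learning. Some warnings may appear when the code runs, but they do not affect execution.
References:
WeChat official account: 机器学习算法与python实践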