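Machine Learning Notes 01: Data Preprocessing

Abstract: This post covers a foundation of machine learning: data preprocessing. Before studying algorithms, we need to be able to handle data. The main topics are the steps of data preprocessing, a simple code example, its output, and detailed notes on the code.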

Contents

  • 1. Steps of data preprocessing
  • 2. Code example
    • 2.1 Data file
    • 2.2 Complete code
    • 2.3 Results
  • 3. Notes
  • 4. Summary

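1. Steps of data preprocessing

Import the required libraries → load the dataset → handle missing data → encode categorical data → split the dataset → scale the features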

2. Code example

2.1 Data file

This is the full content of the Data.csv file:

Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
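
Before imputing, it can help to confirm where the gaps are. The following is a minimal sketch, assuming the file content above is embedded as a string so that the snippet runs without Data.csv on disk. The names `CSV_TEXT`, `missing_counts` and `column_means` are introduced here for illustration and are not part of the original code.

```python
import io

import pandas as pd

# The content of Data.csv, embedded so the snippet is self-contained.
CSV_TEXT = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""

dataset = pd.read_csv(io.StringIO(CSV_TEXT))

# Count the missing values in each column.
missing_counts = dataset.isna().sum()
print(missing_counts)

# Column means over the non-missing entries (pandas skips NaN by default).
# These are the values that mean imputation would fill in.
column_means = dataset[["Age", "Salary"]].mean()
print(column_means)
```

Age and Salary each have one missing entry, and the two means (349/9 for Age, 574000/9 for Salary) are the values that show up in the step 2 output in section 2.3.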

2.2 Complete code

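# Import libraries
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Load the dataset
def getData():
    dataset = pd.read_csv('Data.csv')
    X = dataset.iloc[:, :-1].values  # all rows, every column except the last (Purchased)
    Y = dataset.iloc[:, 3].values    # the target column (Purchased)
    return X, Y

# Handle missing data (numeric columns only)
def proMissingData(X):
    imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
    # Fit on columns 1 and 2 (Age, Salary), then replace NaN with the column mean.
    # X is modified in place, so nothing is returned.
    imp_mean = imp_mean.fit(X[:, 1:3])
    X[:, 1:3] = imp_mean.transform(X[:, 1:3])

# Encode categorical data as numbers
def parseData(X, Y):
    # One-hot encode column 0 of X (Country): its 3 values become a 3-dimensional 0/1 vector.
    # OneHotEncoder(categorical_features=[0]) was removed in scikit-learn 0.22, so the
    # column is selected with ColumnTransformer; the other columns pass through unchanged.
    ct = ColumnTransformer([('country', OneHotEncoder(), [0])],
                           remainder='passthrough', sparse_threshold=0)
    X = ct.fit_transform(X).astype(float)
    # Label-encode Y (No: 0; Yes: 1)
    le_y = LabelEncoder()
    Y = le_y.fit_transform(Y)
    return X, Y

# Split the dataset into a training set and a test set
def divide(X, Y):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
    return X_train, X_test, Y_train, Y_test

# Feature scaling
def standardScale(X_train, X_test):
    sc_X = StandardScaler()
    X_train = sc_X.fit_transform(X_train)  # learn mean and std from the training set
    X_test = sc_X.transform(X_test)        # reuse the training statistics on the test set
    return X_train, X_test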

if __name__ == '__main__':
    X,Y = getData()
    print("1. After loading the data:")
    print("X")
    print(X)
    print("Y")
    print(Y)
    print("=============================================================================")
    proMissingData(X)
    print("2. After handling missing data:")
    print("X")
    print(X)
    print("Y")
    print(Y)
    print("=============================================================================")
    X,Y = parseData(X, Y)
    print("3. After converting the data to numeric values:")
    print("X")
    print(X)
    print("Y")
    print(Y)
    print("=============================================================================")
    X_train, X_test, Y_train, Y_test = divide(X,Y)
    print("4. After splitting the data into training and test sets:")
    print("X_train=================================")
    print(X_train)
    print("X_test=================================")
    print(X_test)
    print("Y_train=================================")
    print(Y_train)
    print("Y_test=================================")
    print(Y_test)
    print("=============================================================================")
    X_train,X_test = standardScale(X_train, X_test)
    print("5. After feature scaling:")
    print("X_train=====================================")
    print(X_train)
    print("X_test=====================================")
    print(X_test)
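
As an alternative to the step-by-step functions above, the same steps can be chained with scikit-learn's `Pipeline` and `ColumnTransformer`. This is only a sketch under stated assumptions, not the method used in this post. It differs in three ways: it splits first, so the imputer, encoder and scaler are fitted on the training rows only; it leaves the one-hot columns unscaled; and it embeds the CSV content as a string (`CSV_TEXT`, a name introduced here) instead of reading Data.csv.

```python
import io

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# The content of Data.csv, embedded so the snippet is self-contained.
CSV_TEXT = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""

dataset = pd.read_csv(io.StringIO(CSV_TEXT))
X = dataset[["Country", "Age", "Salary"]]
y = (dataset["Purchased"] == "Yes").astype(int)  # No -> 0, Yes -> 1

# Split before fitting any transformer.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Numeric columns: fill NaN with the column mean, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Country is one-hot encoded; Age and Salary go through numeric_steps.
preprocess = ColumnTransformer(
    [
        ("country", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
        ("numeric", numeric_steps, ["Age", "Salary"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train_prepared = preprocess.fit_transform(X_train)  # fit on training rows only
X_test_prepared = preprocess.transform(X_test)        # reuse the fitted statistics

print(X_train_prepared.shape, X_test_prepared.shape)
```

Because the imputer here sees only the training rows, the filled-in values are not the same as the ones shown in section 2.3.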

2.3 Results

The output is as follows:

1. After loading the data:
X
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
Y
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
=============================================================================
2. After handling missing data:
X
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
Y
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
=============================================================================
3. After converting the data to numeric values:
X
[[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01
  7.20000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01
  4.80000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 3.00000000e+01
  5.40000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01
  6.10000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01
  6.37777778e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01
  5.80000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01
  5.20000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01
  7.90000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 5.00000000e+01
  8.30000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01
  6.70000000e+04]]
Y
[0 1 0 0 1 1 0 1 0 1]
=============================================================================
4. After splitting the data into training and test sets:
X_train=================================
[[0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01
  6.37777778e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01
  6.70000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01
  4.80000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01
  5.20000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01
  7.90000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01
  6.10000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01
  7.20000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01
  5.80000000e+04]]
X_test=================================
[[0.0e+00 1.0e+00 0.0e+00 3.0e+01 5.4e+04]
 [0.0e+00 1.0e+00 0.0e+00 5.0e+01 8.3e+04]]
Y_train=================================
[1 1 1 0 1 0 0 1]
Y_test=================================
[0 0]
=============================================================================
5. After feature scaling:
X_train=====================================
[[-1.          2.64575131 -0.77459667  0.26306757  0.12381479]
 [ 1.         -0.37796447 -0.77459667 -0.25350148  0.46175632]
 [-1.         -0.37796447  1.29099445 -1.97539832 -1.53093341]
 [-1.         -0.37796447  1.29099445  0.05261351 -1.11141978]
 [ 1.         -0.37796447 -0.77459667  1.64058505  1.7202972 ]
 [-1.         -0.37796447  1.29099445 -0.0813118  -0.16751412]
 [ 1.         -0.37796447 -0.77459667  0.95182631  0.98614835]
 [ 1.         -0.37796447 -0.77459667 -0.59788085 -0.48214934]]
X_test=====================================
[[-1.          2.64575131 -0.77459667 -1.45882927 -0.90166297]
 [-1.          2.64575131 -0.77459667  1.98496442  2.13981082]]
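
The scaled values above can be reproduced by hand: each column has the training mean subtracted and is divided by the training standard deviation, and the test rows reuse those same training statistics. The following sketch checks this for the Age and Salary columns. The numbers are copied from the step 4 output, and the variable names are introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary of the 8 training rows, copied from the step 4 output.
train = np.array([
    [40.0, 63777.7778],
    [37.0, 67000.0],
    [27.0, 48000.0],
    [38.7777778, 52000.0],
    [48.0, 79000.0],
    [38.0, 61000.0],
    [44.0, 72000.0],
    [35.0, 58000.0],
])
# Age and Salary of the 2 test rows.
test = np.array([[30.0, 54000.0], [50.0, 83000.0]])

mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

# z = (x - mean) / std, with mean and std taken from the training set only.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

# Compare with StandardScaler fitted on the same training rows.
scaler = StandardScaler().fit(train)
print(np.allclose(train_scaled, scaler.transform(train)))
print(np.allclose(test_scaled, scaler.transform(test)))
print(test_scaled)
```

The test rows are scaled with the training mean and standard deviation, which is why `standardScale` calls `fit_transform` on `X_train` but only `transform` on `X_test`.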

3. Notes

Notes on the concepts used:

""" 
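Sometimes a feature is not numeric but a string. Mapping each string directly to a single integer gives the feature an ordering (for example France < Germany < Spain) that the original categories do not have.
In that case, use one-hot encoding.
Two ways to do the conversion:
pandas.get_dummies(): the common choice, a single function call that is simple to use;
sklearn.preprocessing.OneHotEncoder(): more setup and easier to get wrong, but a fitted encoder can be reapplied to new data.

categorical_features told older versions of OneHotEncoder which features to encode,
 given either as indices or as a boolean mask,
 e.g. OneHotEncoder(categorical_features=[0, 2]) was equivalent to the mask [True, False, True],
 i.e. columns 0 and 2 were encoded.
 The parameter was deprecated in scikit-learn 0.20 and removed in 0.22; select the columns with sklearn.compose.ColumnTransformer instead.
 PS: if the original data has three columns and columns 0 and 2 are selected for encoding, each output row
 places the un-encoded column 1 at the end.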
"""

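To make the difference between the two encodings concrete, here is a small sketch on a four-row toy table. The table and the variable names are made up for illustration. It contrasts label encoding, which produces one integer per category, with `pandas.get_dummies`, which produces one 0/1 column per category.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A toy table, not the full Data.csv.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44, 27, 30, 38],
})

# Label encoding: one integer per category, assigned in sorted order
# (France -> 0, Germany -> 1, Spain -> 2). The integers imply an order.
labels = LabelEncoder().fit_transform(df["Country"])
print(labels)

# One-hot encoding: one 0/1 column per category, with no implied order.
# The un-encoded column (Age) comes first, followed by the new columns.
dummies = pd.get_dummies(df, columns=["Country"], dtype=int)
print(dummies)
```

Label encoding is fine for the target `Y`, where the integers are only class ids. For an input feature such as Country, the one-hot form avoids suggesting that Spain is "greater than" France.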
"""
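Q1: How does converting categorical data to numbers work?
    Country is a string feature. Mapping it to a single integer would impose an ordering that the categories do not have, so it is one-hot encoded.
    Column 0 of X holds 3 countries and is represented as a 3-dimensional vector.
    Y is label-encoded (No: 0; Yes: 1).

Q2: What do the train_test_split parameters mean?
    train_data: the sample features to split (passed positionally, here X)
    train_target: the sample labels to split (passed positionally, here Y)
    test_size: the fraction of samples placed in the test set; if an integer, the number of test samples
    random_state: the seed for the random number generator
    cross_validation: the module that train_test_split was imported from in old scikit-learn releases (sklearn.cross_validation); it now lives in sklearn.model_selection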
"""

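The effect of `test_size` and `random_state` can be seen on a small synthetic array. This is a sketch with made-up data, 10 samples as in Data.csv, and the variable names are introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)                 # one label per sample

# Float test_size: fraction of the samples in the test set (0.2 of 10 -> 2).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)

# Integer test_size: absolute number of test samples.
_, X_test_int, _, _ = train_test_split(X, y, test_size=3, random_state=0)
print(X_test_int.shape)

# The same random_state gives the same split on every call.
_, _, _, y_test_again = train_test_split(X, y, test_size=0.2, random_state=0)
print(np.array_equal(y_test, y_test_again))
```

Fixing `random_state` makes the split reproducible, which is why the output in section 2.3 can be regenerated exactly.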
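4. Summary

This post is fairly simple and mainly records my own learning process. Depending on the scikit-learn version, running the code may print some warnings; they do not affect the results.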

References:
WeChat official account: 机器学习算法与python实践
