100 Days of Machine Learning Practice (1): Data Preprocessing

The contents of Data.csv are shown below:

Country   Age        Salary     Purchased
France    44         72000      No
Spain     27         48000      Yes
Germany   30         54000      No
Spain     38         61000      No
Germany   40         (missing)  Yes
France    35         58000      Yes
Spain     (missing)  52000      No
France    48         79000      Yes
Germany   50         83000      No
France    37         67000      Yes
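To follow along without the file on disk, the table can be rebuilt in memory. This is only a sketch: it assumes Data.csv is comma-separated and that the two gaps are an empty Salary field in the fifth row and an empty Age field in the seventh row, which is consistent with the imputed output shown later. The name CSV_TEXT is made up for this example.

```python
import io

import pandas as pd

# Assumed layout of Data.csv: comma-separated, empty fields for missing values.
CSV_TEXT = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""

# read_csv accepts any file-like object, so StringIO stands in for the file.
dataset = pd.read_csv(io.StringIO(CSV_TEXT))

print(dataset.shape)         # rows, columns
print(dataset.isna().sum())  # number of missing values per column
```

Empty fields are parsed as NaN, which is what the imputation step later looks for.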

Step 1: Import the libraries

import numpy as np
import pandas as pd

Step 2: Import the dataset

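dataset = pd.read_csv('Data.csv')   # read the CSV file
X = dataset.iloc[ : , :-1].values   # .iloc[rows, columns]: all rows, every column except the last
Y = dataset.iloc[ : , 3].values     # ':' selects all rows or columns; [a] selects row or column a; [a, b, c] selects rows or columns a, b, c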

See also: using loc and iloc in pandas

.values converts a DataFrame into a NumPy array.
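The indexing forms mentioned above can be tried on a small stand-in frame. The sketch below uses three made-up rows with the same column layout as Data.csv; the variable names are illustrative only.

```python
import pandas as pd

# A tiny stand-in frame (hypothetical rows, same column layout as Data.csv).
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

features = df.iloc[:, :-1].values   # all rows, every column except the last
labels = df.iloc[:, 3].values       # all rows, the column at position 3
first_row = df.iloc[0]              # a single row by position, returned as a Series
some_rows = df.iloc[[0, 2], 1:3]    # rows 0 and 2, columns at positions 1 and 2
by_label = df.loc[:, "Age"]         # loc selects by label instead of position

print(type(features).__name__, features.shape)
print(labels)
```

iloc always works with integer positions, while loc works with row and column labels, so the two only coincide when the index happens to be 0, 1, 2, ...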

Step 3: Handle missing data

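# The original code below no longer works: Imputer was deprecated in scikit-learn 0.20 and removed in 0.22
# from sklearn.preprocessing import Imputer
# imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
# imputer = imputer.fit(X[ : , 1:3])
# X[ : , 1:3] = imputer.transform(X[ : , 1:3])
# Replacement using SimpleImputer (note that it uses "most_frequent" rather than "mean")
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = "most_frequent")
X[:,1:3] = imputer.fit_transform(X[:,1:3])
# Resulting X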
array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 48000.0],
       ['France', 35.0, 58000.0],
       ['Spain', 27.0, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
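The filled-in values 27.0 and 48000.0 may look surprising. A short sketch on the two numeric columns alone, comparing "most_frequent" with "mean", shows where they come from. The array name num is made up here, and the position of the two gaps is assumed from the output above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns from the table, with the two gaps as NaN.
num = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, np.nan],
    [35, 58000], [np.nan, 52000], [48, 79000], [50, 83000], [37, 67000],
], dtype=float)

by_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent").fit_transform(num)
by_mean = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(num)

# Every observed value occurs exactly once, so "most_frequent" has a tie in
# each column and returns the smallest of the tied values.
print(by_mode[6, 0], by_mode[4, 1])
# "mean" fills each gap with the average of the observed values instead.
print(by_mean[6, 0], by_mean[4, 1])
```

Because every age and salary is unique, "most_frequent" effectively fills the gaps with the column minimum; on data like this, "mean" or "median" is usually the more sensible choice.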

sklearn.impute.SimpleImputer(*, missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)

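SimpleImputer official documentation

Parameters:

1.missing_values: number, string, np.nan (default) or None
2.strategy: string, default='mean'
  • mean: the column mean (numeric data only)
  • median: the column median (numeric data only)
  • most_frequent: the most frequent value (numeric or string data)
  • constant: a fixed value taken from fill_value (numeric or string data)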
3.fill_value: string or numerical value, default=None
4.verbose: integer, default=0

Controls the verbosity of the imputer.

5.copy: boolean, default=True

If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if copy=False:

  • If X is not an array of floating values;
  • If X is encoded as a CSR matrix;
  • If add_indicator=True.
6.add_indicator: boolean, default=False

If True, a MissingIndicator transform will stack onto output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.
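Two of the parameters listed above, strategy="constant" with fill_value, and add_indicator, are easier to see on a toy array. The following is a minimal sketch on made-up data; X_demo is not part of the original example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical 3x2 array with one gap in each column.
X_demo = np.array([[1.0, np.nan],
                   [np.nan, 4.0],
                   [3.0, 6.0]])

# strategy="constant" replaces every gap with fill_value.
constant = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X_demo)

# add_indicator=True appends one 0/1 column for each feature that had
# missing values at fit time, marking where the gaps were.
with_flags = SimpleImputer(strategy="median", add_indicator=True).fit_transform(X_demo)

print(constant)
print(with_flags)
```

The indicator columns let a downstream model distinguish an imputed value from an observed one.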

Step 4: Encode categorical data

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
Labelencoder = LabelEncoder()
X[ : , 0] = Labelencoder.fit_transform(X[ : , 0])
# Resulting X: the countries have been encoded as 3 integer classes
array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 48000.0],
       [0, 35.0, 58000.0],
       [2, 27.0, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)
      

Create dummy variables

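# The categorical_features argument of OneHotEncoder used in the original tutorial has been removed in newer versions; ColumnTransformer replaces it
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
    [('one_hot_encoder', OneHotEncoder(categories='auto'), [0])],   # The column numbers to be transformed (here is [0] but can be [0, 1, 3])
    remainder='passthrough'                                         # Leave the rest of the columns untouched
)  # One-hot encode only the first column; the second and third columns are not categorical and should not be converted
X = ct.fit_transform(X)
Y = Labelencoder.fit_transform(Y)
# Resulting X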
array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 48000.0],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 27.0, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)
# Resulting Y
array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

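sklearn.preprocessing.LabelEncoder converts discrete values into integers between 0 and n−1, where n is the number of distinct values in the input, that is, the number of distinct values the feature takes. This example has three countries, so the codes run from 0 to 2.
LabelEncoder official documentation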
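The mapping from labels to the integers 0 to n−1 can be inspected directly. The snippet below is a small sketch on a made-up list of countries; it relies on classes_ and inverse_transform, which LabelEncoder provides.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the Country column.
countries = np.array(["France", "Spain", "Germany", "Spain", "Germany"])

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)
restored = encoder.inverse_transform(codes)

print(encoder.classes_)  # sorted unique labels; a label's position is its code
print(codes)             # integer code for each input value
print(restored)          # codes mapped back to the original strings
```

The codes follow the sorted order of the labels, which is why France, Germany, and Spain become 0, 1, and 2 in the output shown earlier.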

class sklearn.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')
OneHotEncoder official documentation

sklearn.compose.ColumnTransformer(transformers, *, remainder='drop', sparse_threshold=0.3, n_jobs=None, transformer_weights=None, verbose=False)
ColumnTransformer official documentation
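Integer codes such as 0, 1, 2 suggest an ordering between countries that does not exist, which is the usual reason for expanding them into dummy columns. As a rough sketch, here is OneHotEncoder applied on its own to a made-up country column, without ColumnTransformer; the default sparse output is converted with toarray() so it can be printed.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects 2-D input: here one column with five hypothetical rows.
country_col = np.array([["France"], ["Spain"], ["Germany"], ["Spain"], ["Germany"]])

encoder = OneHotEncoder(categories="auto", handle_unknown="error")
# The default output is a SciPy sparse matrix; toarray() makes it dense.
one_hot = encoder.fit_transform(country_col).toarray()

print(encoder.categories_[0])  # category order, which is also the column order
print(one_hot)                 # exactly one 1.0 per row
```

Each row contains a single 1.0 in the column of its category, matching the first three columns of the transformed X above.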

Step 5: Split the dataset into a training set and a test set

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split( X , Y , test_size = 0.2, random_state = 0)

General form: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
train_test_split official documentation

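Name          Description
X             feature set to be split
y             labels to be split
test_size     if a float between 0 and 1, the fraction of the samples placed in the test set; if an integer, the number of test samples
random_state  random seed
X_train       training-set features (return value)
X_test        test-set features (return value)
y_train       training-set labels (return value)
y_test        test-set labels (return value)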
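The effect of test_size and random_state described in the table can be checked on a toy array. This is a sketch with made-up data; X_demo and y_demo are illustrative names only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)   # 10 hypothetical samples, 2 features
y_demo = np.arange(10)                  # one label per sample

# A float test_size is a fraction of the samples ...
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
# ... while an integer test_size is an absolute number of samples.
_, X_te3, _, y_te3 = train_test_split(X_demo, y_demo, test_size=3, random_state=0)
# Using the same random_state again reproduces the same split.
_, _, _, y_te_again = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

print(X_tr.shape, X_te.shape, X_te3.shape)
print(np.array_equal(y_te, y_te_again))
```

Features and labels are shuffled together, so each row of X_te still lines up with the corresponding entry of y_te.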

Step 6: Feature scaling

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
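  • Why the training data uses fit_transform while the test data uses transform: fit_transform learns the mean and standard deviation from the training set and applies them, and transform reuses those same statistics on the test set, so both sets share one scale and nothing is learned from the test data.
# X_train output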
[[-1.          2.64575131 -0.77459667  0.4330127  -1.1851228 ]
 [ 1.         -0.37796447 -0.77459667  0.          0.59842834]
 [-1.         -0.37796447  1.29099445 -1.44337567 -1.1851228 ]
 [-1.         -0.37796447  1.29099445 -1.44337567 -0.80963835]
 [ 1.         -0.37796447 -0.77459667  1.58771324  1.72488169]
 [-1.         -0.37796447  1.29099445  0.14433757  0.03520167]
 [ 1.         -0.37796447 -0.77459667  1.01036297  1.0677839 ]
 [ 1.         -0.37796447 -0.77459667 -0.28867513 -0.24641167]]
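# X_test output (note: these values match what a scaler fitted on X_test alone would give; sc_X.transform(X_test), which reuses the training statistics, would not produce all-zero one-hot columns)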
 [[ 0.  0.  0. -1. -1.]
 [ 0.  0.  0.  1.  1.]]
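The difference between reusing the training statistics and refitting on the test set can be seen on a single made-up feature. The sketch below is illustrative only; the arrays train and test are not taken from Data.csv.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[10.0], [20.0], [30.0]])   # hypothetical training feature
test = np.array([[20.0], [40.0]])            # hypothetical test feature

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # learns mean_ and scale_ from train
test_scaled = scaler.transform(test)         # reuses the training statistics

# Fitting a second scaler on the test set alone uses different statistics,
# so the two sets would no longer be on the same scale.
test_refit = StandardScaler().fit_transform(test)

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
print(test_refit.ravel())
```

With only two test samples, refitting always maps them to -1 and 1 regardless of their actual values, which is the pattern visible in the X_test output above.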

sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True)
StandardScaler official documentation

  • The standard score is computed as
    $z = (x - u) / s$
    where u is the mean of the training samples (or 0 if with_mean=False) and s is the standard deviation of the training samples (or 1 if with_std=False).
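As a quick check of the formula, z can be computed by hand with NumPy and compared with what StandardScaler returns. This is a sketch on a made-up column of ages; note that s here is the population standard deviation (ddof=0), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[27.0], [35.0], [37.0], [44.0], [48.0]])   # hypothetical ages

u = x.mean(axis=0)         # per-column mean
s = x.std(axis=0)          # per-column population standard deviation (ddof=0)
z_manual = (x - u) / s     # z = (x - u) / s

z_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))
```

After scaling, the column has mean 0 and standard deviation 1, which puts features such as Age and Salary on a comparable scale.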
