Data.csv
The data looks like this:
Country | Age | Salary | Purchased |
---|---|---|---|
France | 44 | 72000 | No |
Spain | 27 | 48000 | Yes |
Germany | 30 | 54000 | No |
Spain | 38 | 61000 | No |
Germany | 40 | | Yes |
France | 35 | 58000 | Yes |
Spain | | 52000 | No |
France | 48 | 79000 | Yes |
Germany | 50 | 83000 | No |
France | 37 | 67000 | Yes |
import numpy as np
import pandas as pd
dataset = pd.read_csv('Data.csv')  # read the CSV file
X = dataset.iloc[:, :-1].values  # .iloc[rows, columns]
Y = dataset.iloc[:, 3].values  # ":" selects all rows or columns; [a] selects row or column a
# [a, b, c] selects rows or columns a, b and c
Using loc and iloc in pandas
.values converts a DataFrame into a NumPy array.
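As a minimal sketch of the indexing rules above, the following builds a small DataFrame by hand instead of reading Data.csv. The name `demo` and its three rows are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame standing in for Data.csv
demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Purchased": ["No", "Yes", "No"],
})

features = demo.iloc[:, :-1].values   # all rows, every column except the last
labels = demo.iloc[:, 2].values       # all rows, the column at position 2
picked = demo.iloc[[0, 2], [0, 1]]    # rows 0 and 2, columns 0 and 1

print(type(features), features.shape)  # .values returns a NumPy array
print(labels)
print(picked)
```

Note that `iloc` is purely position-based, while `loc` selects by label; `picked` stays a DataFrame because `.values` was not applied to it.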
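# This code no longer works: Imputer has been removed from sklearn.preprocessing in newer scikit-learn releases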
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
imputer = imputer.fit(X[ : , 1:3])
X[ : , 1:3] = imputer.transform(X[ : , 1:3])
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = "most_frequent")
X[:,1:3] = imputer.fit_transform(X[:,1:3])
# Resulting X
array([['France', 44.0, 72000.0],
['Spain', 27.0, 48000.0],
['Germany', 30.0, 54000.0],
['Spain', 38.0, 61000.0],
['Germany', 40.0, 48000.0],
['France', 35.0, 58000.0],
['Spain', 27.0, 52000.0],
['France', 48.0, 79000.0],
['Germany', 50.0, 83000.0],
['France', 37.0, 67000.0]], dtype=object)
sklearn.impute.SimpleImputer(*, missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)
SimpleImputer official documentation
missing_values: number, string, np.nan (default) or None
strategy: string, default='mean'
fill_value: string or numerical value, default=None
verbose: integer, default=0. Controls the verbosity of the imputer.
copy: boolean, default=True. If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that in some cases a new copy will always be made, even if copy=False.
add_indicator: boolean, default=False. If True, a MissingIndicator transform will stack onto the output of the imputer's transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won't appear on the missing indicator even if there are missing values at transform/test time.
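To see how the `strategy` and `add_indicator` parameters behave, here is a small sketch using only the Age and Salary columns of Data.csv, typed in by hand. The variable names (`numeric`, `mean_filled`, `mode_filled`, `flagged`) are illustrative and not part of the original walkthrough.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary from Data.csv; the two missing cells are written as np.nan
numeric = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, np.nan],
    [35, 58000], [np.nan, 52000], [48, 79000], [50, 83000], [37, 67000],
], dtype=float)

# Fill each gap with the column mean
mean_filled = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(numeric)
# Fill each gap with the most frequent value of the column
mode_filled = SimpleImputer(missing_values=np.nan, strategy="most_frequent").fit_transform(numeric)
# Mean imputation plus one extra 0/1 column per feature that had missing values
flagged = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(numeric)

print(mean_filled[6, 0], mean_filled[4, 1])
print(mode_filled[6, 0], mode_filled[4, 1])
print(flagged.shape)
```

Every value in these two columns occurs exactly once, and when several values tie for most frequent, SimpleImputer returns the smallest one. That is why the "most_frequent" result above contains 27.0 and 48000.0 in the filled cells, whereas "mean" would give the column averages.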
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
Labelencoder = LabelEncoder()
X[ : , 0] = Labelencoder.fit_transform(X[ : , 0])
# Resulting X: the countries have been encoded into 3 classes
array([[0, 44.0, 72000.0],
[2, 27.0, 48000.0],
[1, 30.0, 54000.0],
[2, 38.0, 61000.0],
[1, 40.0, 48000.0],
[0, 35.0, 58000.0],
[2, 27.0, 52000.0],
[0, 48.0, 79000.0],
[1, 50.0, 83000.0],
[0, 37.0, 67000.0]], dtype=object)
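# The categorical_features parameter of OneHotEncoder used in the original article has been removed and no longer works in newer versions; use the following instead
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
[('one_hot_encoder', OneHotEncoder(categories='auto'), [0])], # The column numbers to be transformed (here is [0] but can be [0, 1, 3])
remainder='passthrough' # Leave the rest of the columns untouched
)  # One-hot encode only the first column; the second and third columns are not categorical and should not be converted
X = ct.fit_transform(X)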
Y = Labelencoder.fit_transform(Y)
# Resulting X
array([[1.0, 0.0, 0.0, 44.0, 72000.0],
[0.0, 0.0, 1.0, 27.0, 48000.0],
[0.0, 1.0, 0.0, 30.0, 54000.0],
[0.0, 0.0, 1.0, 38.0, 61000.0],
[0.0, 1.0, 0.0, 40.0, 48000.0],
[1.0, 0.0, 0.0, 35.0, 58000.0],
[0.0, 0.0, 1.0, 27.0, 52000.0],
[1.0, 0.0, 0.0, 48.0, 79000.0],
[0.0, 1.0, 0.0, 50.0, 83000.0],
[1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)
# Resulting Y
array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
sklearn.preprocessing.LabelEncoder
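Converts discrete (categorical) values into integers between 0 and n-1, where n is the number of distinct values in the list, i.e. the number of distinct values the feature takes. This example has three countries, so the codes range from 0 to 2.
LabelEncoder official documentation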
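A short sketch of that mapping, using a hand-written list of countries (the names `countries`, `encoder` and `codes` are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

countries = ["France", "Spain", "Germany", "Spain", "Germany"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)  # learn the classes, then encode

print(encoder.classes_)                   # learned categories, in sorted order
print(codes)                              # integer code of each input value
print(encoder.inverse_transform(codes))   # map the codes back to the labels
```

The codes follow the sorted order of the class names, which is why France, Germany and Spain map to 0, 1 and 2 in the output of X above.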
class sklearn.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=
OneHotEncoder official documentation
sklearn.compose.ColumnTransformer(transformers, *, remainder='drop', sparse_threshold=0.3, n_jobs=None, transformer_weights=None, verbose=False)
ColumnTransformer official documentation
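The sketch below shows one possible way to combine the two classes on a tiny hand-built DataFrame. It assumes a scikit-learn version in which OneHotEncoder accepts string columns directly, so no LabelEncoder pass is applied to the features first. The names `frame`, `demo_ct` and `encoded` are made up for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical three-row frame with one categorical and two numeric columns
frame = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
})

demo_ct = ColumnTransformer(
    [("one_hot_encoder", OneHotEncoder(categories="auto"), ["Country"])],
    remainder="passthrough",  # keep Age and Salary unchanged
    sparse_threshold=0,       # always return a dense array
)
encoded = demo_ct.fit_transform(frame)

# Order of the one-hot columns: the sorted category names
print(demo_ct.named_transformers_["one_hot_encoder"].categories_)
print(encoded)
```

The one-hot columns come first in the result, followed by the passed-through columns, which matches the column layout of the X output shown earlier.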
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split( X , Y , test_size = 0.2, random_state = 0)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size, random_state=random_state)
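train_test_split official documentation
Name | Description |
---|---|
X | Feature set of the samples to be split |
y | Labels of the samples to be split |
test_size | If a float between 0 and 1, the ratio of test samples to the total number of samples; if an integer, the number of test samples. |
random_state | Seed for the random number generator |
X_train | Training features (return value) |
X_test | Test features (return value) |
y_train | Training labels (return value) |
y_test | Test labels (return value) |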
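The following sketch illustrates the two forms of `test_size` and the effect of `random_state` on placeholder data. The arrays and the variable names (`features`, `labels`, `X_tr`, `X_te3` and so on) are invented for the example and are not taken from Data.csv.

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(20).reshape(10, 2)  # 10 placeholder samples, 2 features each
labels = np.arange(10)

# Float test_size: fraction of the samples placed in the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=0)

# Integer test_size: absolute number of test samples
X_tr3, X_te3, y_tr3, y_te3 = train_test_split(
    features, labels, test_size=3, random_state=0)

# Reusing the same random_state reproduces the same split
_, _, _, y_te_again = train_test_split(
    features, labels, test_size=0.2, random_state=0)

print(X_tr.shape, X_te.shape)
print(X_tr3.shape, X_te3.shape)
print(y_te, y_te_again)
```

Fixing `random_state` is what makes a split reproducible between runs; leaving it unset gives a different shuffle each time.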
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
The training data uses fit_transform, while the test data uses transform.
# X_train output
[[-1. 2.64575131 -0.77459667 0.4330127 -1.1851228 ]
[ 1. -0.37796447 -0.77459667 0. 0.59842834]
[-1. -0.37796447 1.29099445 -1.44337567 -1.1851228 ]
[-1. -0.37796447 1.29099445 -1.44337567 -0.80963835]
[ 1. -0.37796447 -0.77459667 1.58771324 1.72488169]
[-1. -0.37796447 1.29099445 0.14433757 0.03520167]
[ 1. -0.37796447 -0.77459667 1.01036297 1.0677839 ]
[ 1. -0.37796447 -0.77459667 -0.28867513 -0.24641167]]
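# X_test output (note: these values correspond to scaling X_test on its own; sc_X.transform(X_test) with the scaler fitted on X_train gives different numbers)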
[[ 0. 0. 0. -1. -1.]
[ 0. 0. 0. 1. 1.]]
sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True)
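StandardScaler official documentation
The standard score of a sample x is computed as z = (x - u) / s, where u is the mean of the training samples, or 0 if with_mean=False, and s is the standard deviation of the training samples, or 1 if with_std=False.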
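As a quick check of the standard score z = (x - u) / s that StandardScaler applies, the sketch below compares the scaler's output with a manual NumPy computation. The small `train` and `test` arrays are hand-picked Age and Salary pairs, and all variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical Age/Salary samples used only for this check
train = np.array([[40.0, 48000.0], [37.0, 67000.0], [27.0, 48000.0], [48.0, 79000.0]])
test = np.array([[30.0, 54000.0], [50.0, 83000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns u and s from the training data
test_scaled = scaler.transform(test)        # reuses the training u and s

u = train.mean(axis=0)
s = train.std(axis=0)  # population standard deviation (ddof=0)
manual_test = (test - u) / s

print(np.allclose(test_scaled, manual_test))
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

The fitted values are also available as `scaler.mean_` and `scaler.scale_`. Because the test rows are scaled with the training statistics, their scaled columns generally do not have mean 0 and standard deviation 1.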