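Datasets often contain a lot of non-numeric data, so let's look at how to use it.
In this lesson you will learn what a categorical variable is, along with three approaches for handling this kind of data.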
X_train.head()
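The object dtype indicates that a column contains text (in theory it could hold other things, but that is unimportant for our purposes). In this dataset, the columns with text are the categorical variables.
# Get list of categorical variables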
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)
print("Categorical variables:")
print(object_cols)
Categorical variables:
['Type', 'Method', 'Regionname']
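To see how the dtype check picks out the text columns, here is a minimal sketch on a made-up DataFrame. The toy frame, its values, and the name toy_object_cols are assumptions for illustration and are not part of the lesson's dataset. The text columns are created with an explicit object dtype so that the comparison behaves the same way across pandas versions.

```python
import pandas as pd

# Hypothetical toy data: column names echo the lesson, values are made up.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Type": pd.Series(["h", "u", "t"], dtype="object"),
    "Regionname": pd.Series(["North", "West", "South"], dtype="object"),
})

# Boolean Series indexed by column name: True where the dtype is object.
is_object = (toy.dtypes == "object")

# Indexing the Series with itself keeps only the True entries;
# their index labels are the names of the text columns.
toy_object_cols = list(is_object[is_object].index)
print(toy_object_cols)
```

The numeric column Rooms is filtered out because its dtype is an integer type rather than object.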
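We define a function score_dataset() to compare the three approaches to handling categorical variables. It reports the mean absolute error (MAE) of a random forest model. In general, we want the MAE to be as low as possible.
from sklearn.ensemble import RandomForestRegressor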
from sklearn.metrics import mean_absolute_error
# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
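As a rough illustration of what the returned MAE measures, the sketch below computes it by hand for three made-up predictions. The helper name mean_absolute_error_manual and the numbers are assumptions for illustration only; the lesson itself relies on scikit-learn's mean_absolute_error.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up values: the absolute errors are 10, 10 and 30.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

mae = mean_absolute_error_manual(actual, predicted)
print(mae)  # (10 + 10 + 30) / 3
```

Because every error counts by its absolute size, a lower MAE means the predictions sit closer to the true values on average.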
Score from Approach 1 (Drop Categorical Variables)
We drop the object-type columns with the select_dtypes() method.
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
MAE from Approach 1 (Drop categorical variables):
175703.48185157913
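The effect of select_dtypes(exclude=['object']) can be checked on a small made-up frame. This is only a sketch: the toy data and the names numeric_only and remaining_cols are assumptions, and the text column is given an explicit object dtype so the result does not depend on the pandas version.

```python
import pandas as pd

# Hypothetical toy data with two numeric columns and one text column.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Distance": [2.5, 6.4, 11.2],
    "Type": pd.Series(["h", "u", "t"], dtype="object"),
})

# Returns a new DataFrame without the object columns; toy itself is unchanged.
numeric_only = toy.select_dtypes(exclude=["object"])
remaining_cols = list(numeric_only.columns)
print(remaining_cols)
```

Dropping columns this way is simple, but any information carried by the removed text columns is lost to the model.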
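Score from Approach 2 (Ordinal Encoding)
Scikit-learn has an OrdinalEncoder class that can be used to get ordinal encodings. We fit the encoder on the categorical columns of the training data and then apply it to both the training and validation data, so that each categorical column is encoded separately.
from sklearn.preprocessing import OrdinalEncoder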
# Make copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()
# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])
print("MAE from Approach 2 (Ordinal Encoding):")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))
MAE from Approach 2 (Ordinal Encoding):
165936.40548390493
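To make the encoding concrete, here is a small sketch of OrdinalEncoder on a single made-up column. The category values and variable names are assumptions for illustration; plain lists are used instead of the lesson's DataFrames to keep the example self-contained.

```python
from sklearn.preprocessing import OrdinalEncoder

# One made-up categorical column, given as rows with a single feature each.
train_rows = [["h"], ["u"], ["t"], ["h"]]
valid_rows = [["t"], ["h"]]

encoder = OrdinalEncoder()

# fit_transform learns the categories from the training rows and encodes them.
train_codes = encoder.fit_transform(train_rows)

# transform reuses the mapping learned from the training rows.
valid_codes = encoder.transform(valid_rows)

print(encoder.categories_)  # learned categories, sorted: h, t, u
print(train_codes.ravel())  # h -> 0.0, t -> 1.0, u -> 2.0
print(valid_codes.ravel())
```

The integer codes follow the sorted order of the category values, which is arbitrary unless the categories have a natural ranking. With the default settings, calling transform on a value that never appeared during fitting raises an error, so the validation data should only contain categories seen in training.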
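Score from Approach 3 (One-Hot Encoding)
We use the OneHotEncoder class from scikit-learn to get one-hot encodings. It has a number of parameters that can be used to customize its behavior.
We set handle_unknown='ignore' to avoid errors when the validation data contains categories that do not appear in the training data.
We set sparse=False so that the encoded columns are returned as a NumPy array rather than a sparse matrix (scikit-learn 1.2 renamed this parameter to sparse_output, and the old name was removed in version 1.4).
To encode the training data, we pass X_train[object_cols] to the encoder. Since object_cols contains all of the categorical columns, every categorical column is one-hot encoded.
from sklearn.preprocessing import OneHotEncoder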
# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))
# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
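# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
# Ensure all column names are strings (the encoded columns are named 0, 1, 2, ...;
# recent scikit-learn versions reject a mix of string and integer column names)
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)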
print("MAE from Approach 3 (One-Hot Encoding):")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
MAE from Approach 3 (One-Hot Encoding):
166089.4893009678
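The role of handle_unknown='ignore' is easiest to see on a tiny made-up example in which the validation rows contain a category that was never seen during fitting. This is a sketch under stated assumptions: the category values and variable names are invented, and the encoder is left at its default sparse output and converted with toarray() so the snippet does not depend on the sparse / sparse_output parameter name.

```python
from sklearn.preprocessing import OneHotEncoder

# One made-up categorical column, given as rows with a single feature each.
train_rows = [["h"], ["u"], ["t"]]
valid_rows = [["u"], ["x"]]  # "x" does not appear in the training rows

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_rows)

# The default output is a SciPy sparse matrix; toarray() makes it dense.
valid_encoded = encoder.transform(valid_rows).toarray()

print(encoder.categories_)  # learned categories, sorted: h, t, u
print(valid_encoded)        # "u" -> [0, 0, 1]; unseen "x" -> [0, 0, 0]
```

Each learned category gets its own column, and a category that was not seen during fitting is encoded as a row of zeros instead of raising an error.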
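Which approach performed best?
In this run, dropping the categorical columns (Approach 1) gave the highest MAE, about 175,703, so it performed worst. Ordinal encoding (about 165,936) and one-hot encoding (about 166,089) produced scores that are very close to each other.
The real world is full of categorical data. Knowing how to handle this common data type will make you a much more effective data scientist.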