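When preprocessing data, you will sometimes find features whose values are strings. These need to be encoded before use, and they fall into two main classes: values with no inherent order among them, and values that are ranked.
The sections below describe how to encode each of these two types.
The House Prices data from Kaggle serves as the example.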
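import pandas as pd
# Load the Kaggle House Prices training set
df = pd.read_csv('./data/train.csv')
# Keep one unordered column (MSZoning) and one ranked column (ExterQual)
columns = ['MSZoning','ExterQual']
df_used = df[columns]
print(df_used)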
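The two columns used are described below. The MSZoning values clearly have no ordering among them, whereas ExterQual rates the quality of the exterior material and therefore has ranked levels.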
MSZoning: Identifies the general zoning classification of the sale.
A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low Density Park
RM Residential Medium Density
ExterQual: Evaluates the quality of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
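The four methods below handle unordered features such as MSZoning.
1. pd.get_dummies()
Looking at the source: it converts categorical variables into indicator (dummy) variables.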
def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
                columns=None, sparse=False, drop_first=False):
    """
    Convert categorical variable into dummy/indicator variables
    Parameters
    ----------
    data : array-like, Series, or DataFrame
    prefix : string, list of strings, or dict of strings, default None
        String to append DataFrame column names
        Pass a list with length equal to the number of columns
        when calling get_dummies on a DataFrame. Alternatively, `prefix`
        can be a dictionary mapping column names to prefixes.
    prefix_sep : string, default '_'
        If appending prefix, separator/delimiter to use. Or pass a
        list or dictionary as with `prefix.`
    dummy_na : bool, default False
        Add a column to indicate NaNs, if False NaNs are ignored.
    columns : list-like, default None
        Column names in the DataFrame to be encoded.
        If `columns` is None then all the columns with
        `object` or `category` dtype will be converted.
    sparse : bool, default False
        Whether the dummy columns should be sparse or not. Returns
        SparseDataFrame if `data` is a Series or if all columns are included.
        Otherwise returns a DataFrame with some SparseBlocks.
        .. versionadded:: 0.16.1
    drop_first : bool, default False
        Whether to get k-1 dummies out of n categorical levels by removing the
        first level.
        .. versionadded:: 0.18.0
    Returns
    -------
    dummies : DataFrame or SparseDataFrame
    """
df_used = pd.get_dummies(df_used, columns=['MSZoning'])
print(df_used.head())
  ExterQual  MSZoning_C (all)  MSZoning_FV  MSZoning_RH  MSZoning_RL  \
0        Gd               0.0          0.0          0.0          1.0
1        TA               0.0          0.0          0.0          1.0
2        Gd               0.0          0.0          0.0          1.0
3        TA               0.0          0.0          0.0          1.0
4        Gd               0.0          0.0          0.0          1.0
   MSZoning_RM
0          0.0
1          0.0
2          0.0
3          0.0
4          0.0
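As the result shows, it creates a separate column for each distinct value of the feature and one-hot encodes it. Also, when a DataFrame is converted directly, each new column name is prefixed with the original column name.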
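To make the column naming and the drop_first option concrete, here is a minimal sketch on a toy DataFrame. The toy values are made up for illustration and are not the real House Prices rows, and the dtype=int argument assumes a reasonably recent pandas release.

```python
import pandas as pd

# Toy data standing in for the House Prices columns (values are illustrative)
toy = pd.DataFrame({
    'MSZoning': ['RL', 'RM', 'FV', 'RL'],
    'ExterQual': ['Gd', 'TA', 'Gd', 'Ex'],
})

# One-hot encode only MSZoning; dtype=int requests 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=['MSZoning'], dtype=int)
print(encoded.columns.tolist())

# drop_first=True keeps k-1 indicator columns by dropping the first level
reduced = pd.get_dummies(toy, columns=['MSZoning'], drop_first=True, dtype=int)
print(reduced.columns.tolist())
```

Dropping the first level is mainly useful for linear models, where the full set of indicator columns is perfectly collinear with the intercept.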
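2. sklearn.preprocessing.LabelEncoder
If you are familiar with sklearn, you have probably used sklearn.preprocessing.OneHotEncoder. In scikit-learn releases before 0.20, OneHotEncoder could only encode numeric values, whereas LabelEncoder can encode string values directly.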
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
result = le.fit_transform(df_used['MSZoning'])
df_used['MSZoning'] = result
print(df_used.head())
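This triggers pandas' SettingWithCopyWarning (http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy), because df_used was taken as a slice of df. Creating df_used with df[columns].copy() avoids the warning.
The warning does not affect the result here, which is as follows.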
   MSZoning ExterQual
0         3        Gd
1         3        TA
2         3        Gd
3         3        TA
4         3        Gd
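The following sketch shows one way to sidestep the copy warning and to inspect what LabelEncoder learned. It uses a toy DataFrame with made-up values in place of the real train.csv, so the integer codes differ from the output above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for df (values are illustrative, not the real data)
df = pd.DataFrame({'MSZoning': ['RL', 'RM', 'FV', 'RL'],
                   'ExterQual': ['Gd', 'TA', 'Gd', 'Ex']})

# .copy() makes df_used an independent frame, so assigning to it does not warn
df_used = df[['MSZoning', 'ExterQual']].copy()

le = LabelEncoder()
df_used['MSZoning'] = le.fit_transform(df_used['MSZoning'])

# classes_ holds the sorted original labels; the label at position i is encoded as i
print(list(le.classes_))
print(df_used['MSZoning'].tolist())

# inverse_transform recovers the original strings from the integer codes
restored = le.inverse_transform(df_used['MSZoning'])
print(list(restored))
```

Because the codes follow the sorted order of the labels, they carry no meaning beyond identity; a model that treats them as magnitudes will see an ordering that does not exist in the data.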
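3. Using map
Use the built-in enumerate to assign an index to each distinct value in the column, then replace every value with its index. Because a set has no defined order, the specific integer assigned to each value can differ between runs.
# Build {value: index}; iteration order of a set is arbitrary
map_MSZoning = {key : value for value, key in enumerate(set(df['MSZoning']))}
df_used['MSZoning'] = df_used['MSZoning'].map(map_MSZoning)
print(df_used.head())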
   MSZoning ExterQual
0         2        Gd
1         2        TA
2         2        Gd
3         2        TA
4         2        Gd
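If a reproducible mapping is wanted, one option is to sort the distinct values before enumerating them. The sketch below assumes a toy Series with made-up values, including one missing entry to show how map treats values that are not in the dictionary.

```python
import pandas as pd

# Toy column with one missing value (values are illustrative)
s = pd.Series(['RL', 'RM', None, 'FV', 'RL'], name='MSZoning')

# Sort the distinct non-missing values so the mapping is the same on every run
categories = sorted(s.dropna().unique())
mapping = {value: index for index, value in enumerate(categories)}

# Values absent from the mapping (here the missing entry) become NaN
encoded = s.map(mapping)
print(mapping)
print(encoded.tolist())
```

Note that the presence of NaN turns the result into a float column, so missing values are usually filled or dropped before the integers are used.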
4. Using pd.factorize()
Unlike pd.get_dummies(), pd.factorize() does not expand one feature into several; it only encodes the values within that feature as integers.
def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
    """
    Encode input values as an enumerated type or categorical variable
    Parameters
    ----------
    values : ndarray (1-d)
        Sequence
    sort : boolean, default False
        Sort by values
    na_sentinel : int, default -1
        Value to mark "not found"
    size_hint : hint to the hashtable sizer
    Returns
    -------
    labels : the indexer to the original array
    uniques : ndarray (1-d) or Index
        the unique values. Index is returned when passed values is Index or
        Series
    note: an array of Periods will ignore sort as it returns an always sorted
    PeriodIndex
    """
df['MSZoning'] = pd.factorize(df['MSZoning'])[0]
print(df['MSZoning'])
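As a small illustration of the two return values and of the sort option, here is a sketch on a toy Series. The values are made up, and the missing entry is included only to show the default -1 sentinel.

```python
import pandas as pd

# Toy column with one missing value (values are illustrative)
s = pd.Series(['RL', 'RM', None, 'FV', 'RL'])

# Codes follow the order of first appearance; missing values get -1 by default
codes, uniques = pd.factorize(s)
print(codes.tolist())
print(list(uniques))

# sort=True assigns codes in sorted order of the unique values instead
sorted_codes, sorted_uniques = pd.factorize(s, sort=True)
print(sorted_codes.tolist())
print(list(sorted_uniques))
```

Keeping uniques around makes it possible to translate codes back to the original strings later, for example with uniques[codes] on rows whose code is not -1.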
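For a ranked feature such as ExterQual, use map with a hand-written dictionary so that the integers preserve the ranking.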
map_ExterQual = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1}
df_used['ExterQual'] = df_used['ExterQual'].map(map_ExterQual)
print(df_used.head())
  MSZoning  ExterQual
0       RL          4
1       RL          3
2       RL          4
3       RL          3
4       RL          4
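An alternative to a hand-written dictionary is an ordered pandas Categorical, where the ranking is stated once as a list of levels. This is a sketch on a toy Series with made-up values, not the approach used in the original code.

```python
import pandas as pd

# Toy column (values are illustrative)
s = pd.Series(['Gd', 'TA', 'Ex', 'Fa', 'Gd'], name='ExterQual')

# List the levels from worst to best so that codes increase with quality
levels = ['Po', 'Fa', 'TA', 'Gd', 'Ex']
cat = pd.Categorical(s, categories=levels, ordered=True)

# .codes is 0-based; adding 1 reproduces the 1..5 scale of the dictionary above
scores = pd.Series(cat.codes + 1, index=s.index, name=s.name)
print(scores.tolist())
```

Values that are not listed in levels receive the code -1, so they end up as 0 after the shift and are easy to spot.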