Data Analysis: Encoding Text Features During Data Cleaning

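When preprocessing data, you sometimes encounter features whose values are strings, and these values need to be encoded before they can be used. Such features fall into two main categories:

  • Values with no inherent relationship to each other (nominal), such as ['red', 'green', 'blue'].
  • Values with an inherent order (ordinal), such as ['Excellent', 'Good', 'Normal', 'Bad'].

The sections below show how to encode each type.

The examples use the House Prices data from Kaggle.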

import pandas as pd

df = pd.read_csv('./data/train.csv')
columns = ['MSZoning','ExterQual']
df_used = df[columns]
print(df_used)

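The meanings of the two columns are given below. The MSZoning values clearly have no inherent order, whereas ExterQual rates the quality of the exterior material and is therefore ranked.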

MSZoning: Identifies the general zoning classification of the sale.

       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park
       RM	Residential Medium Density

ExterQual: Evaluates the quality of the material on the exterior

       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor

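I. Feature values with no inherent relationship

The four methods below handle this case.

1. pd.get_dummies()

From the source code: it converts a categorical variable into dummy/indicator variables.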

def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
                columns=None, sparse=False, drop_first=False):
    """
    Convert categorical variable into dummy/indicator variables

    Parameters
    ----------
    data : array-like, Series, or DataFrame
    prefix : string, list of strings, or dict of strings, default None
        String to append DataFrame column names
        Pass a list with length equal to the number of columns
        when calling get_dummies on a DataFrame. Alternatively, `prefix`
        can be a dictionary mapping column names to prefixes.
    prefix_sep : string, default '_'
        If appending prefix, separator/delimiter to use. Or pass a
        list or dictionary as with `prefix.`
    dummy_na : bool, default False
        Add a column to indicate NaNs, if False NaNs are ignored.
    columns : list-like, default None
        Column names in the DataFrame to be encoded.
        If `columns` is None then all the columns with
        `object` or `category` dtype will be converted.
    sparse : bool, default False
        Whether the dummy columns should be sparse or not.  Returns
        SparseDataFrame if `data` is a Series or if all columns are included.
        Otherwise returns a DataFrame with some SparseBlocks.

        .. versionadded:: 0.16.1
    drop_first : bool, default False
        Whether to get k-1 dummies out of n categorical levels by removing the
        first level.

        .. versionadded:: 0.18.0

    Returns
    -------
    dummies : DataFrame or SparseDataFrame
    """

df_used = pd.get_dummies(df_used, columns=['MSZoning'])
print(df_used.head())

Output:
  ExterQual  MSZoning_C (all)  MSZoning_FV  MSZoning_RH  MSZoning_RL  \
0        Gd               0.0          0.0          0.0          1.0   
1        TA               0.0          0.0          0.0          1.0   
2        Gd               0.0          0.0          0.0          1.0   
3        TA               0.0          0.0          0.0          1.0   
4        Gd               0.0          0.0          0.0          1.0   

   MSZoning_RM  
0          0.0  
1          0.0  
2          0.0  
3          0.0  
4          0.0  

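As the output shows, get_dummies creates a separate column for each distinct value of the feature and one-hot encodes it. In addition, when a DataFrame is converted directly, each new column name is prefixed with the name of the original column.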
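
To make the effect of the prefix and drop_first parameters concrete, here is a minimal sketch on a small hand-made DataFrame. The toy values are an assumption for illustration, not rows from the Kaggle file, and only parameters listed in the docstring above are used.

```python
import pandas as pd

# Toy data standing in for the two House Prices columns (illustrative values only).
toy = pd.DataFrame({
    'MSZoning': ['RL', 'RM', 'FV', 'RL'],
    'ExterQual': ['Gd', 'TA', 'Gd', 'Ex'],
})

# One indicator column per distinct MSZoning value; ExterQual is left untouched.
full = pd.get_dummies(toy, columns=['MSZoning'])

# drop_first=True keeps k-1 indicators by dropping the first level (here 'FV').
reduced = pd.get_dummies(toy, columns=['MSZoning'], drop_first=True)

# A custom prefix replaces the original column name in the new column names.
renamed = pd.get_dummies(toy, columns=['MSZoning'], prefix='zone')

print(list(full.columns))
print(list(reduced.columns))
print(list(renamed.columns))
```

Dropping one level is mainly useful for linear models, where the full set of indicator columns is perfectly collinear with the intercept.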

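2. sklearn.preprocessing.LabelEncoder

If you know scikit-learn, you have probably used sklearn.preprocessing.OneHotEncoder. In scikit-learn versions before 0.20, OneHotEncoder could only encode integer values, whereas LabelEncoder can encode strings. (From 0.20 onward, OneHotEncoder accepts string categories directly.) Note that LabelEncoder assigns integers according to the sorted order of the labels, so the resulting numbers do not express any ranking.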

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
result = le.fit_transform(df_used['MSZoning'])
df_used['MSZoning'] = result
print(df_used.head())

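This may trigger pandas' SettingWithCopyWarning (see http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy), because df_used was created by selecting columns from df. The warning can be avoided by creating df_used with df[columns].copy().

It does not affect the result here, which is shown below.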

      MSZoning ExterQual
0            3        Gd
1            3        TA
2            3        Gd
3            3        TA
4            3        Gd
5            3        TA
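
The integers produced by LabelEncoder are only meaningful together with the fitted encoder. The following is a minimal sketch, on made-up values rather than the Kaggle data, of how the mapping can be inspected through classes_ and reversed with inverse_transform.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values only; not rows from the Kaggle file.
zoning = ['RL', 'RM', 'FV', 'RL']

le = LabelEncoder()
codes = le.fit_transform(zoning)

# classes_ holds the distinct labels in sorted order; a label's position is its code.
print(list(le.classes_))
print(codes.tolist())

# inverse_transform maps integer codes back to the original strings.
decoded = le.inverse_transform(codes)
print(decoded.tolist())
```

Keeping the fitted encoder around makes it possible to apply the same mapping to new data with le.transform and to translate model output back into readable labels.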

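3. Using map

Use the built-in enumerate function to assign an index to every distinct value in the column, then replace each value with its index. Because a set has no guaranteed iteration order, the integer assigned to each value can differ between runs.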

map_MSZoning = {key : value for value, key in enumerate(set(df['MSZoning']))}
df_used['MSZoning'] = df_used['MSZoning'].map(map_MSZoning)
print(df_used.head())

Output:
  MSZoning ExterQual
0         2        Gd
1         2        TA
2         2        Gd
3         2        TA
4         2        Gd
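
If a reproducible mapping is needed, one option is to sort the distinct values before enumerating them. The sketch below uses made-up values, and the -1 sentinel for unseen values is an arbitrary choice for illustration, not something taken from the original code.

```python
import pandas as pd

# Illustrative values only; 'A' in the second Series was not seen when the map was built.
train = pd.Series(['RL', 'RM', 'FV', 'RL'])
new = pd.Series(['RM', 'A'])

# Sorting the distinct values makes the value -> index mapping reproducible.
mapping = {value: index for index, value in enumerate(sorted(train.unique()))}

encoded_train = train.map(mapping)

# Values missing from the mapping become NaN; fill them with a sentinel (-1 is arbitrary).
encoded_new = new.map(mapping).fillna(-1).astype(int)

print(mapping)
print(encoded_train.tolist())
print(encoded_new.tolist())
```

Checking for NaN after map is a cheap way to detect categories that the mapping does not cover.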

4. Using pd.factorize()

Unlike pd.get_dummies(), pd.factorize() does not expand one feature into several; it only encodes the values within that feature as integers.

def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
    """
    Encode input values as an enumerated type or categorical variable

    Parameters
    ----------
    values : ndarray (1-d)
        Sequence
    sort : boolean, default False
        Sort by values
    na_sentinel : int, default -1
        Value to mark "not found"
    size_hint : hint to the hashtable sizer

    Returns
    -------
    labels : the indexer to the original array
    uniques : ndarray (1-d) or Index
        the unique values. Index is returned when passed values is Index or
        Series

    note: an array of Periods will ignore sort as it returns an always sorted
    PeriodIndex
    """

df['MSZoning'] = pd.factorize(df['MSZoning'])[0]
print(df['MSZoning'])
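
pd.factorize() returns two objects, the integer codes and the unique values they refer to. A minimal sketch on made-up values (including one missing entry, which is an assumption added to show how missing values are coded):

```python
import numpy as np
import pandas as pd

# Illustrative values only, including one missing entry.
zoning = pd.Series(['RL', 'RM', np.nan, 'RL', 'FV'])

# Default: codes follow the order of first appearance; missing values get -1.
codes, uniques = pd.factorize(zoning)
print(codes.tolist())
print(list(uniques))

# sort=True: the unique values are sorted, so the codes follow sorted order instead.
sorted_codes, sorted_uniques = pd.factorize(zoning, sort=True)
print(sorted_codes.tolist())
print(list(sorted_uniques))
```

Keeping the second return value allows the codes to be translated back into the original strings later.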

II. Feature values with an inherent order

Use map with an explicit dictionary so that the integers preserve the ranking.

map_ExterQual = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1}
df_used['ExterQual'] = df_used['ExterQual'].map(map_ExterQual)
print(df_used.head())

Output:
  MSZoning  ExterQual
0       RL          4
1       RL          3
2       RL          4
3       RL          3
4       RL          4
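
An alternative to a hand-written dictionary is an ordered pandas Categorical, which stores the ranking alongside the data. This is a sketch on made-up values, not the author's method; the levels list simply restates the ExterQual scale from the data description, from worst to best.

```python
import pandas as pd

# Illustrative values only; not rows from the Kaggle file.
quality = pd.Series(['Gd', 'TA', 'Ex', 'Fa'])

# Categories listed from worst to best; ordered=True records the ranking.
levels = ['Po', 'Fa', 'TA', 'Gd', 'Ex']
ordered = pd.Series(pd.Categorical(quality, categories=levels, ordered=True))

# cat.codes are 0-based positions in `levels`; adding 1 gives the same 1..5 scale as the map.
scores = ordered.cat.codes + 1
print(scores.tolist())

# Ordered categoricals also support comparisons against a level.
at_least_good = ordered >= 'Gd'
print(at_least_good.tolist())
```

Values that are not in levels receive code -1 (a score of 0 here), so it is worth checking for them before modelling.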

 
