Convert categorical variable into dummy/indicator variables
data : array-like, Series, or DataFrame
prefix : string, list of strings, or dict of strings, default None
String to append DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.
prefix_sep : string, default '_'
If appending prefix, separator/delimiter to use. Or pass a list or dictionary as with prefix.
dummy_na : bool, default False
Add a column to indicate NaNs, if False NaNs are ignored.
columns : list-like, default None
Column names in the DataFrame to be encoded. If columns is None then all the columns with object or category dtype will be converted.
sparse : bool, default False
Whether the dummy columns should be sparse or not. Returns SparseDataFrame if data is a Series or if all columns are included. Otherwise returns a DataFrame with some SparseBlocks.
drop_first : bool, default False
Whether to get k-1 dummies out of k categorical levels by removing the first level. New in version 0.18.0.
dtype : dtype, default np.uint8
Data type for new columns. Only a single dtype is allowed. New in version 0.23.0.
import pandas as pd
dataframe = pd.DataFrame({"A": ["China", "Japan", "India"], "B": ["UK", "French", "Germany"]})
print(dataframe)
       A        B
0  China       UK
1  Japan   French
2  India  Germany
Basic usage: encode the nominal (categorical) features with one-hot encoding.
dataframe = pd.get_dummies(dataframe)
print(dataframe)
Output:
   A_China  A_India  A_Japan  B_French  B_Germany  B_UK
0        1        0        0         0          0     1
1        0        0        1         1          0     0
2        0        1        0         0          1     0
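The drop_first parameter described in the parameter list has no example here, so the following is a minimal sketch of it on the same data. It assumes a pandas version that supports dtype (0.23.0 or later) and passes dtype=int so the printed values stay integers regardless of the installed version's default. The names full and reduced are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"A": ["China", "Japan", "India"],
                   "B": ["UK", "French", "Germany"]})

# Full encoding: one indicator column per level (3 + 3 = 6 columns).
full = pd.get_dummies(df, dtype=int)

# drop_first=True removes the first level of each feature in sorted order
# (A_China and B_French here), leaving k-1 columns per feature.
reduced = pd.get_dummies(df, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

A row that is 0 in every remaining column of a feature then stands for the dropped level, which avoids perfectly collinear columns in linear models.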
import numpy as np
dataframe = pd.DataFrame({"A":["China", "Japan", np.nan], "B":["UK", "French", "Germany"]})
print(dataframe)
df_nan_ignore = pd.get_dummies(dataframe)
print(df_nan_ignore)
Output:
       A        B
0  China       UK
1  Japan   French
2    NaN  Germany

   A_China  A_Japan  B_French  B_Germany  B_UK
0        1        0         0          0     1
1        0        1         1          0     0
2        0        0         0          1     0
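By default, get_dummies() does not handle missing values: the NaN in row 2 of column A gets no indicator column of its own, so that row is 0 in every A_* column.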
df_nan_as_type = pd.get_dummies(dataframe, dummy_na=True)
print(df_nan_as_type)
Output:
   A_China  A_Japan  A_nan  B_French  B_Germany  B_UK  B_nan
0        1        0      0         0          0     1      0
1        0        1      0         1          0     0      0
2        0        0      1         0          1     0      0
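One way to see the difference between the two settings is to sum the indicator columns per row. The sketch below is only an illustration, using a single-column frame and dtype=int (an assumption made here so the sums are plain integers on any recent pandas version). The names ignored and flagged are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["China", "Japan", np.nan]})

# Default: the NaN row gets no indicator, so its row sum is 0.
ignored = pd.get_dummies(df, dtype=int)

# dummy_na=True: NaN is treated as its own level (column A_nan),
# so every row has exactly one indicator set.
flagged = pd.get_dummies(df, dummy_na=True, dtype=int)

print(ignored.sum(axis=1).tolist())
print(flagged.sum(axis=1).tolist())
```

Note that dummy_na=True adds a *_nan column for every encoded feature, even one with no missing values, which is why B_nan appears above filled with zeros.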
dataframe = pd.DataFrame({"A": ["China", "Japan", np.nan], "B": ["UK", "French", "Germany"]})
df_prefix = pd.get_dummies(dataframe, prefix=['Asia', 'Europe'])
print(df_prefix)
Output:
   Asia_China  Asia_Japan  Europe_French  Europe_Germany  Europe_UK
0           1           0              0               0          1
1           0           1              1               0          0
2           0           0              0               1          0
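Compared with the earlier results, the new features are named using the values given by the prefix parameter; by default, the original column names are used as prefixes.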
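The parameter list also says prefix can be a dictionary mapping column names to prefixes. A minimal sketch of that form follows; the prefix strings and the variable name named are chosen here for illustration, and dtype=int is passed only to keep the values as integers.

```python
import pandas as pd

df = pd.DataFrame({"A": ["China", "Japan", "India"],
                   "B": ["UK", "French", "Germany"]})

# Map each source column to its prefix. The keys are looked up by column
# name, so their order in the dict does not have to match the column order.
named = pd.get_dummies(df, prefix={"B": "Europe", "A": "Asia"}, dtype=int)

print(list(named.columns))
```

The dictionary form can be less error-prone than a list when the DataFrame has many columns, since a list has to match the encoded columns by position.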
df_prefix_sep = pd.get_dummies(dataframe, prefix=['Asia', 'Europe'], prefix_sep='.')
print(df_prefix_sep)
Output:
   Asia.China  Asia.Japan  Europe.French  Europe.Germany  Europe.UK
0           1           0              0               0          1
1           0           1              1               0          0
2           0           0              0               1          0
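The columns and dtype parameters are described in the parameter list but not demonstrated, so here is a small sketch combining them. Column "C" is a hypothetical numeric column added only for this illustration, and float is an arbitrary choice of dtype.

```python
import pandas as pd

df = pd.DataFrame({"A": ["China", "Japan", "India"],
                   "B": ["UK", "French", "Germany"],
                   "C": [1, 2, 3]})  # hypothetical numeric column

# Encode only column "A". "B" and "C" pass through unchanged and come
# first in the result; the new indicator columns are appended after them.
encoded = pd.get_dummies(df, columns=["A"], dtype=float)

print(encoded)
print(encoded.dtypes)
```

Listing columns explicitly is useful when a string column such as "B" should be left as-is, since by default every object or category column is converted.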