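A categorical variable represents a category or label. Unlike a numeric variable, its values cannot be ordered relative to one another, so it is also called a nonordinal variable.
One-hot encoding is usually used for features whose categories have no ordinal relationship. It uses a group of bits to represent the categories, with each bit standing for one possible category. A categorical variable with k possible categories is therefore encoded as a feature vector of length k. If the variable cannot belong to several categories at once, exactly one bit in the group is "on".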
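The mapping from categories to bit positions can be sketched without any library. This is a minimal illustration, not how pandas implements it; the helper name `one_hot_encode` and the sample list `cities` are assumptions made for the example.

```python
def one_hot_encode(values):
    """Return (categories, vectors): one length-k 0/1 vector per input value."""
    # Sort the distinct categories so that bit positions are reproducible.
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        bits = [0] * len(categories)
        bits[position[value]] = 1  # exactly one bit is "on"
        vectors.append(bits)
    return categories, vectors


# Hypothetical sample input with k = 3 distinct categories.
cities = ['SF', 'NYC', 'Seattle', 'SF']
categories, encoded = one_hot_encode(cities)
print(categories)  # ['NYC', 'SF', 'Seattle']
print(encoded)     # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Each vector has length k = 3 and sums to 1, which is the "only one bit is on" property described above.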
Pros and cons of one-hot encoding:
import pandas as pd
from sklearn import linear_model
df = pd.DataFrame({'city':['SF','SF','SF','NYC','NYC','NYC','Seattle','Seattle','Seattle'],
'Rent':[3999, 4000, 4001, 3499, 3500, 3501, 2499, 2500, 2501]})
df['Rent'].mean()
3333.3333333333335
# Convert the categorical variable to a one-hot encoding and fit a linear regression model
one_hot_df = pd.get_dummies(df, prefix=['city'])
one_hot_df
| | Rent | city_NYC | city_SF | city_Seattle |
|---|---|---|---|---|
| 0 | 3999 | 0 | 1 | 0 |
| 1 | 4000 | 0 | 1 | 0 |
| 2 | 4001 | 0 | 1 | 0 |
| 3 | 3499 | 1 | 0 | 0 |
| 4 | 3500 | 1 | 0 | 0 |
| 5 | 3501 | 1 | 0 | 0 |
| 6 | 2499 | 0 | 0 | 1 |
| 7 | 2500 | 0 | 0 | 1 |
| 8 | 2501 | 0 | 0 | 1 |
model = linear_model.LinearRegression()
model.fit(one_hot_df[['city_NYC', 'city_SF', 'city_Seattle']],
one_hot_df['Rent'])
model.coef_  # coefficients of the linear regression model
array([ 166.66666667, 666.66666667, -833.33333333])
model.intercept_  # intercept of the linear regression model
3333.3333333333335
model.score(one_hot_df[['city_NYC', 'city_SF', 'city_Seattle']],
one_hot_df['Rent'])  # goodness of fit (R^2) of the model
0.9999982857172245
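With one-hot encoding, the intercept represents the global mean of the target variable Rent, and each linear coefficient represents how much the mean Rent of the corresponding city differs from the global mean.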
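This reading of the fitted values can be checked by hand from the raw rents. The sketch below is only an illustration using the standard library; the names `offsets`, `predict`, and `shift` are assumptions. It also shows why one-hot encoding carries an extra degree of freedom: moving a constant from the coefficients into the intercept leaves every prediction unchanged, so the values printed above are one of many equally good solutions.

```python
from statistics import mean

# Rent data copied from the example above, grouped by city.
rents = {
    'NYC': [3499, 3500, 3501],
    'SF': [3999, 4000, 4001],
    'Seattle': [2499, 2500, 2501],
}

all_rents = [rent for values in rents.values() for rent in values]
global_mean = mean(all_rents)  # matches the intercept reported above

# Difference between each city's mean rent and the global mean.
offsets = {city: mean(values) - global_mean for city, values in rents.items()}


def predict(intercept, coefs, city):
    """Prediction of a one-hot linear model: intercept plus the city's coefficient."""
    return intercept + coefs[city]


# Shift an arbitrary constant from the coefficients into the intercept.
shift = 100
shifted_coefs = {city: w - shift for city, w in offsets.items()}
same_predictions = all(
    abs(predict(global_mean, offsets, city)
        - predict(global_mean + shift, shifted_coefs, city)) < 1e-9
    for city in rents
)

print(global_mean)
print(offsets)
print(same_predictions)
```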
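Dummy coding uses only k-1 features in the representation, which removes the extra degree of freedom. The category that gets no feature of its own is represented by the all-zeros vector and is called the reference category. Both dummy coding and one-hot encoding can be produced with pandas.get_dummies.
# Train a linear regression model on dummy coding; the drop_first flag generates the dummy coding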
dummy_df = pd.get_dummies(df, prefix=['city'], drop_first=True)
dummy_df
| | Rent | city_SF | city_Seattle |
|---|---|---|---|
| 0 | 3999 | 1 | 0 |
| 1 | 4000 | 1 | 0 |
| 2 | 4001 | 1 | 0 |
| 3 | 3499 | 0 | 0 |
| 4 | 3500 | 0 | 0 |
| 5 | 3501 | 0 | 0 |
| 6 | 2499 | 0 | 1 |
| 7 | 2500 | 0 | 1 |
| 8 | 2501 | 0 | 1 |
model.fit(dummy_df[['city_SF', 'city_Seattle']], dummy_df['Rent'])
model.coef_
array([ 500., -1000.])
model.intercept_
3500.0
model.score(dummy_df[['city_SF', 'city_Seattle']], dummy_df['Rent'])
0.9999982857172245
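With dummy coding, the bias coefficient (the intercept) represents the mean of the response variable y for the reference category, which in this example is city_NYC. The coefficient of the i-th feature equals the difference between the mean of the i-th category and the mean of the reference category.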
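The same numbers can be recovered directly from the group means, which makes the interpretation concrete. This is a small sketch under the assumption that NYC is the reference category, as in the example above; `dummy_vector` and `predict` are hypothetical helper names.

```python
from statistics import mean

# Rent data copied from the example above, grouped by city.
rents = {
    'NYC': [3499, 3500, 3501],
    'SF': [3999, 4000, 4001],
    'Seattle': [2499, 2500, 2501],
}
reference = 'NYC'

# The intercept is the mean of the reference category.
intercept = mean(rents[reference])

# Each coefficient is a category mean minus the reference mean.
coefs = {city: mean(values) - intercept
         for city, values in rents.items() if city != reference}


def dummy_vector(city):
    """k-1 features; the reference category maps to the all-zeros vector."""
    return [1 if city == other else 0 for other in coefs]


def predict(city):
    """Intercept plus the dot product of coefficients and dummy features."""
    return intercept + sum(w * x for w, x in zip(coefs.values(), dummy_vector(city)))


print(intercept)            # 3500
print(coefs)                # {'SF': 500, 'Seattle': -1000}
print(dummy_vector('NYC'))  # [0, 0]
print(predict('Seattle'))   # 2500
```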
Effect coding is very similar to dummy coding, except that the reference category is represented by a vector of all -1s.
effect_df = dummy_df.copy()
effect_df.loc[3:5, ['city_SF','city_Seattle']]= -1.0
effect_df
| | Rent | city_SF | city_Seattle |
|---|---|---|---|
| 0 | 3999 | 1.0 | 0.0 |
| 1 | 4000 | 1.0 | 0.0 |
| 2 | 4001 | 1.0 | 0.0 |
| 3 | 3499 | -1.0 | -1.0 |
| 4 | 3500 | -1.0 | -1.0 |
| 5 | 3501 | -1.0 | -1.0 |
| 6 | 2499 | 0.0 | 1.0 |
| 7 | 2500 | 0.0 | 1.0 |
| 8 | 2501 | 0.0 | 1.0 |
model.fit(effect_df[['city_SF', 'city_Seattle']], effect_df['Rent'])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
model.coef_
array([ 666.66666667, -833.33333333])
model.intercept_
3333.3333333333335
model.score(effect_df[['city_SF', 'city_Seattle']], effect_df['Rent'])
0.9999982857172245
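The effect-coded fit can also be reproduced from the group means. In the sketch below (standard library only, with the hypothetical helpers `effect_vector` and `predict`), the intercept is the mean of the group means, which equals the global mean here because every city has the same number of rows. Each coefficient is a city's offset from that value, and the offset of the reference category NYC is the negative sum of the other coefficients.

```python
from statistics import mean

# Rent data copied from the example above, grouped by city.
rents = {
    'NYC': [3499, 3500, 3501],
    'SF': [3999, 4000, 4001],
    'Seattle': [2499, 2500, 2501],
}
reference = 'NYC'

group_means = {city: mean(values) for city, values in rents.items()}

# Mean of the group means; equal to the global mean for equally sized groups.
intercept = mean(list(group_means.values()))

coefs = {city: m - intercept
         for city, m in group_means.items() if city != reference}

# The reference category has no coefficient of its own; its effect is implied.
reference_effect = -sum(coefs.values())


def effect_vector(city):
    """k-1 features; the reference category maps to a vector of all -1s."""
    if city == reference:
        return [-1] * len(coefs)
    return [1 if city == other else 0 for other in coefs]


def predict(city):
    """Intercept plus the dot product of coefficients and effect features."""
    return intercept + sum(w * x for w, x in zip(coefs.values(), effect_vector(city)))


print(intercept)
print(coefs)
print(reference_effect)
print({city: predict(city) for city in rents})
```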
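A hash function is a deterministic function that maps a potentially unbounded integer to a finite integer range [1, m]. Because the input domain can be larger than the output range, several inputs may map to the same output, which is called a collision.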
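A minimal sketch of the idea follows, using only the standard library. It is not scikit-learn's implementation: it swaps in SHA-256 from `hashlib` as the hash function, uses 0-based buckets in [0, m) instead of [1, m], and the helper names `hash_bucket` and `hashed_features` as well as the choice m = 8 are assumptions made for illustration.

```python
import hashlib


def hash_bucket(value, m):
    """Deterministically map a string to a bucket index in [0, m)."""
    digest = hashlib.sha256(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % m


def hashed_features(values, m):
    """Build one length-m count vector; colliding values share a bucket."""
    vector = [0] * m
    for value in values:
        vector[hash_bucket(value, m)] += 1
    return vector


# A few identifiers used as sample input; m = 8 is an arbitrary small range.
ids = ['9yKzy9PApeiPPOUJEtnvkg', 'ZRJwVLyzEJq1VAihDhYiow', '6oRAC4uyJCsJl1X0WZpVSA']
m = 8
buckets = [hash_bucket(business_id, m) for business_id in ids]
features = hashed_features(ids, m)
print(buckets)
print(features)
```

The output dimension is fixed at m no matter how many distinct identifiers appear, which is what makes hashing attractive for high-cardinality categorical variables; the price is that collisions merge unrelated categories.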
import pandas as pd
import json
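# Read the first 10,000 reviews (one JSON object per line)
js = []
with open('yelp_academic_dataset_review.json') as f:
    for i in range(10000):
        js.append(json.loads(f.readline()))
# The with block closes the file automatically, so no explicit f.close() is needed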
review_df = pd.DataFrame(js)
# Define m as the number of unique business_id values
m = len(review_df.business_id.unique())
m
4174
from sklearn.feature_extraction import FeatureHasher
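h = FeatureHasher(n_features=m, input_type='string')
# With input_type='string', each sample must be an iterable of strings, so wrap every ID in a list
f = h.transform([[business_id] for business_id in review_df['business_id']])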
review_df['business_id'].unique().tolist()[0:5]
['9yKzy9PApeiPPOUJEtnvkg',
'ZRJwVLyzEJq1VAihDhYiow',
'6oRAC4uyJCsJl1X0WZpVSA',
'_1QQZuf4zZOyFCvXc0o6Vg',
'6ozycU1RpktNG2-1BroVtw']
f.toarray()
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
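from sys import getsizeof
print('Our pandas Series, in bytes: ', getsizeof(review_df['business_id']))
# Note: f is a SciPy sparse matrix, and getsizeof reports only the size of the
# wrapper object, not of the data arrays it refers to
print('Our hashed numpy array, in bytes: ', getsizeof(f))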
Our pandas Series, in bytes: 790152
Our hashed numpy array, in bytes: 56
import pandas as pd
df = pd.read_csv('train_subset.csv')
len(df['device_id'].unique())  # number of unique device_id values in the training set
1075
df.head()
| | id | click | hour | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | ... | device_type | device_conn_type | C14 | C15 | C16 | C17 | C18 | C19 | C20 | C21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000009418151094273 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 2 | 15706 | 320 | 50 | 1722 | 0 | 35 | -1 | 79 |
| 1 | 10000169349117863715 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 0 | 15704 | 320 | 50 | 1722 | 0 | 35 | 100084 | 79 |
| 2 | 10000371904215119486 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 0 | 15704 | 320 | 50 | 1722 | 0 | 35 | 100084 | 79 |
| 3 | 10000640724480838376 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 0 | 15706 | 320 | 50 | 1722 | 0 | 35 | 100084 | 79 |
| 4 | 10000679056417042096 | 0 | 14102100 | 1005 | 1 | fe8cc448 | 9166c161 | 0569f928 | ecad2386 | 7801e8d9 | ... | 1 | 0 | 18993 | 320 | 50 | 2161 | 0 | 35 | -1 | 157 |
5 rows × 24 columns
def click_counting(x, bin_column):
    # Count clicked and non-clicked rows for every value of bin_column
    clicks = pd.Series(
        x[x['click'] > 0][bin_column].value_counts(), name='clicks')
    no_clicks = pd.Series(
        x[x['click'] < 1][bin_column].value_counts(), name='no_clicks')
    # Values missing from one of the two Series become '0' and are cast to int64 below
    counts = pd.DataFrame([clicks, no_clicks]).T.fillna('0')
    counts['total'] = counts['clicks'].astype(
        'int64') + counts['no_clicks'].astype('int64')
    return counts
def bin_counting(counts):
    # N+ is the fraction of rows with a click, N- the fraction without
    counts['N+'] = counts['clicks'].astype('int64').divide(
        counts['total'].astype('int64'))
    counts['N-'] = counts['no_clicks'].astype('int64').divide(
        counts['total'].astype('int64'))
    # Despite its name, log_N+ holds the odds ratio N+ / N-; no logarithm is taken
    counts['log_N+'] = counts['N+'].divide(counts['N-'])
    # If we wanted to only return bin-counting properties, we would filter here
    bin_counts = counts.filter(items=['N+', 'N-', 'log_N+'])
    return counts, bin_counts
bin_column = 'device_id'
device_clicks = click_counting(df.filter(items = [bin_column, 'click']), bin_column)
device_all, device_bin_counts = bin_counting(device_clicks)
len(device_bin_counts)
1075
device_all.sort_values(by = 'total', ascending = False).head(4)
| | clicks | no_clicks | total | N+ | N- | log_N+ |
|---|---|---|---|---|---|---|
| a99f214a | 1561 | 7163 | 8724 | 0.178932 | 0.821068 | 0.217925 |
| c357dbff | 2 | 15 | 17 | 0.117647 | 0.882353 | 0.133333 |
| a167aa83 | 0 | 9 | 9 | 0.000000 | 1.000000 | 0.000000 |
| 3c0208dc | 0 | 9 | 9 | 0.000000 | 1.000000 | 0.000000 |
from sys import getsizeof
print('Our pandas Series, in bytes: ', getsizeof(df.filter(items=['device_id', 'click'])))
print('Our bin-counting feature, in bytes: ', getsizeof(device_bin_counts))
Our pandas Series, in bytes: 730152
Our bin-counting feature, in bytes: 95699
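The click_counting and bin_counting functions above replace each device_id with statistics of the target computed for that value: the click rate, the non-click rate, and their ratio. The same computation can be sketched without pandas. The function name `bin_counts`, the key `odds`, and the toy rows below are assumptions made for illustration, not data from train_subset.csv.

```python
from collections import Counter


def bin_counts(rows):
    """rows: iterable of (category, clicked) pairs; returns per-category statistics."""
    clicks, totals = Counter(), Counter()
    for category, clicked in rows:
        totals[category] += 1
        if clicked:
            clicks[category] += 1

    stats = {}
    for category, total in totals.items():
        p_click = clicks[category] / total   # corresponds to N+
        p_no_click = 1 - p_click             # corresponds to N-
        # Odds ratio N+ / N-; guard against division by zero when every row is a click.
        odds = p_click / p_no_click if p_no_click > 0 else float('inf')
        stats[category] = {'N+': p_click, 'N-': p_no_click, 'odds': odds}
    return stats


# Hypothetical toy data: category 'a' has 1 click in 4 rows, 'b' has 0 clicks in 2 rows.
rows = [('a', 1), ('a', 0), ('a', 0), ('a', 0), ('b', 0), ('b', 0)]
stats = bin_counts(rows)
print(stats['a'])
print(stats['b'])
```

However many distinct categories exist, each row ends up with the same small number of numeric features, which is why this representation stays compact for high-cardinality variables such as device_id.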
Reference:
Alice Zheng and Amanda Casari, Feature Engineering for Machine Learning (Chinese edition: 精通特征工程)