Python Feature Engineering

1. LabelEncoder

Simply put, LabelEncoder assigns integer codes to discontinuous numbers or text labels.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit([1, 5, 67, 100])           # learn the sorted unique values: 1->0, 5->1, 67->2, 100->3
le.transform([1, 1, 100, 67, 5])  # map each value to its learned code

Output: array([0, 0, 3, 2, 1])
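LabelEncoder also handles text labels, and inverse_transform maps the codes back; a minimal sketch (the string labels below are made up for illustration):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['paris', 'tokyo', 'paris', 'amsterdam'])
print(le.classes_)                   # classes_ are sorted: ['amsterdam' 'paris' 'tokyo']
print(codes)                         # [1 2 1 0]
print(le.inverse_transform(codes))   # ['paris' 'tokyo' 'paris' 'amsterdam']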


2. OneHotEncoder

OneHotEncoder expands categorical data into extra dimensions, one column per category:

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
ohe.fit([[1], [2], [3], [4]])                  # learn the category set {1, 2, 3, 4}
ohe.transform([[2], [3], [1], [4]]).toarray()  # each value becomes a one-hot row

Output: [[0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1]]
This is analogous to keras.utils.to_categorical(y_train, num_classes) in Keras.
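Newer scikit-learn versions (0.20+) also accept string categories directly, so a preceding LabelEncoder step is not required; a minimal sketch with made-up values:

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
ohe.fit([['cat'], ['dog'], ['bird']])
print(ohe.categories_)                                        # [array(['bird', 'cat', 'dog'], dtype=object)]
print(ohe.transform([['dog'], ['bird'], ['cat']]).toarray())
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]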

3. Normalizer

from sklearn.preprocessing import Normalizer

nor = Normalizer(norm='l2')
nor.transform([[1, 2, 3, 4],
               [5, 4, 3, 2],
               [1, 3, 5, 2],
               [2, 4, 1, 5]])

Here nor.fit is kept only so the estimator follows the usual fit/transform API; it has no effect. transform rescales each row (each sample) into a unit vector under the chosen norm, L2 in this case.
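A quick sanity check of that behaviour against a manual L2 normalization (illustrative only):

import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[1, 2, 3, 4]], dtype=float)
scaled = Normalizer(norm='l2').fit_transform(X)
manual = X / np.linalg.norm(X, axis=1, keepdims=True)  # divide each row by its L2 norm
print(np.allclose(scaled, manual))                     # True: the row now has unit length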

4. Standardization (StandardScaler)

from sklearn.preprocessing import StandardScaler

std = StandardScaler()
std.fit(sample_train[feature_list])                                    # learn mean and std on the training set only
sample_train[feature_list] = std.transform(sample_train[feature_list])
sample_test[feature_list] = std.transform(sample_test[feature_list])   # reuse the training statistics on the test set
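A self-contained sketch of the same pattern; sample_train, sample_test and feature_list above come from the original project and are not defined here, so the data below is made up:

import pandas as pd
from sklearn.preprocessing import StandardScaler

feature_list = ['f1', 'f2']    # hypothetical feature names
sample_train = pd.DataFrame({'f1': [1.0, 2.0, 3.0], 'f2': [10.0, 20.0, 30.0]})
sample_test = pd.DataFrame({'f1': [2.5], 'f2': [25.0]})

std = StandardScaler()
std.fit(sample_train[feature_list])                                    # statistics estimated from training data only
sample_train[feature_list] = std.transform(sample_train[feature_list])
sample_test[feature_list] = std.transform(sample_test[feature_list])
print(sample_train[feature_list].mean().round(6).tolist())             # ~[0.0, 0.0] after standardization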

5. Featuretools

Ref: https://docs.featuretools.com/

Featuretools is a tool for deriving features automatically. You define Entities and Relationships over the input tables yourself, then let the framework and its built-in primitive functions do the computation.

Primitive functions can also be user-defined, for example:

import pandas as pd
from featuretools.primitives import make_agg_primitive, make_trans_primitive
from featuretools.variable_types import Datetime, Numeric

def Hour(datetime):
    # extract the hour of day from a datetime column
    return pd.DatetimeIndex(datetime).hour.values

hour = make_trans_primitive(function=Hour,
                            input_types=[Datetime],
                            return_type=Numeric)
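A quick check of the underlying function, independent of Featuretools (the timestamps are made up):

import pandas as pd

def Hour(datetime):
    return pd.DatetimeIndex(datetime).hour.values

print(Hour(['2019-01-01 08:30:00', '2019-01-01 23:05:00']))  # [ 8 23]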

Here is another example from my own experiments (since SQL was used upstream, there is only one input table, already joined). There is a time-window problem here that I did not manage to solve inside Featuretools, so the input table is already restricted to records whose START_TIME falls within the 90 days before BIND_TIME; a possible workaround is sketched after the example.

import pandas as pd
import featuretools as ft
from datetime import datetime

sample = pd.read_csv('***')
cell_call = sample

# hand-derived helper features: hour of call, late-night flag, weekend flag, working-hours flag
cell_call['call_hour'] = list(map(lambda x: int(x[11:13]), cell_call['start_time']))  # 'YYYY-MM-DD HH:MM:SS' -> hour
cell_call['if_midnight_call'] = list(map(lambda x: 1 if x <= 6 else 0, cell_call['call_hour']))
cell_call['if_weekends'] = list(map(lambda x: 1 if datetime.strptime(x, '%Y-%m-%d %H:%M:%S').weekday() >= 5 else 0, cell_call['start_time']))
cell_call['if_workhours'] = list(map(lambda x, y: 1 if (y == 0 and 8 <= x <= 18) else 0, cell_call['call_hour'], cell_call['if_weekends']))

# parent entity: one row per uid, with bind_time as its time index
uid_dt = cell_call[['uid', 'bind_time']].drop_duplicates()
uid_dt.columns = ['uid', 'time']

es = ft.EntitySet(id='uid')
es.entity_from_dataframe(
    entity_id='uid',
    dataframe=uid_dt,
    index='uid',
    time_index='time'
)
# child entity: one row per call record, with explicit variable types for the categorical/boolean columns
es.entity_from_dataframe(
    entity_id='cell_call',
    dataframe=cell_call,
    make_index=True,
    index='index',
    time_index='start_time',
    variable_types={
        'init_type': ft.variable_types.Categorical,
        'other_cell_phone': ft.variable_types.Categorical,
        'if_midnight_call': ft.variable_types.Boolean,
        'if_weekends': ft.variable_types.Boolean,
        'if_workhours': ft.variable_types.Boolean
    }
)

# one-to-many relationship: each uid has many call records
es = es.add_relationship(
    ft.Relationship(es['uid']['uid'], es['cell_call']['uid'])
)

# interesting values drive the "where" clauses in DFS, e.g. COUNT(cell_call WHERE if_midnight_call = 1)
es['cell_call']['init_type'].interesting_values = [1, 2]
es['cell_call']['if_midnight_call'].interesting_values = [1]
es['cell_call']['if_weekends'].interesting_values = [1]
es['cell_call']['if_workhours'].interesting_values = [1]

# Deep Feature Synthesis: aggregate call-level rows up to the uid level
feature_matrix, features_defs = ft.dfs(
    entityset=es,
    target_entity="uid",
    agg_primitives=['count', 'mean', 'std', 'num_unique', 'skew', 'percent_true', 'num_true'],
    trans_primitives=[],
    where_primitives=['count', 'mean', 'std', 'num_unique', 'skew', 'percent_true', 'num_true'],
#     seed_features=[whether_night_call],
    max_depth=5,
    ignore_variables={'cell_call': ['mob', 'uid', 'aid', 'bind_time', 'subtotal', 'if_cuishou', 'if_contact', 'call_type']}
)

Note that before running Featuretools I derived a few features myself (call_hour, if_midnight_call, etc.) to feed the subsequent feature stacking.
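On the time-window problem mentioned above: ft.dfs exposes cutoff_time and training_window arguments intended for exactly this case, computing features only from records in a window before each instance's cutoff time. I have not verified this against the exact Featuretools version used here, so the following is only a sketch; check the docs for the precise cutoff-time column naming convention:

# sketch only: use bind_time as the per-uid cutoff and keep the 90 days before it,
# instead of pre-filtering the input table in SQL
cutoff_times = uid_dt.copy()          # columns: 'uid' (instance id) and 'time' (cutoff)

feature_matrix, features_defs = ft.dfs(
    entityset=es,
    target_entity='uid',
    cutoff_time=cutoff_times,         # only records with start_time <= the uid's bind_time
    training_window='90 days',        # ...and no more than 90 days before it
    agg_primitives=['count', 'mean', 'num_unique'],
    trans_primitives=[]
)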
