In short, LabelEncoder assigns consecutive integer codes to discontinuous numbers or text labels:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit([1,5,67,100])
le.transform([1,1,100,67,5])
Output: array([0, 0, 3, 2, 1])
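LabelEncoder works on text labels the same way: fit sorts the unique classes and numbers them. A small sketch (the city names here are made up for illustration):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit collects the unique classes, sorts them, and assigns each an integer
le.fit(["paris", "tokyo", "amsterdam", "tokyo"])
print(list(le.classes_))                  # ['amsterdam', 'paris', 'tokyo']
codes = le.transform(["tokyo", "paris"])
print(list(codes))                        # [2, 1]
print(list(le.inverse_transform(codes)))  # back to ['tokyo', 'paris']
```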
OneHotEncoder expands categorical data into extra (one-hot) dimensions:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit([[1],[2],[3],[4]])
ohe.transform([[2],[3],[1],[4]]).toarray()
Output: [[0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1]]
This is just like keras.utils.to_categorical(y_train, num_classes) in Keras.
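to_categorical itself amounts to indexing an identity matrix; a minimal NumPy sketch, assuming integer labels starting at 0:

```python
import numpy as np

def to_categorical(y, num_classes):
    # each label selects the corresponding row of the identity matrix,
    # which is exactly its one-hot encoding
    return np.eye(num_classes)[np.asarray(y, dtype=int)]

print(to_categorical([1, 0, 2], 3))
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]]
```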
from sklearn.preprocessing import Normalizer
nor = Normalizer(norm='l2')
nor.transform([[1, 2, 3, 4],
               [5, 4, 3, 2],
               [1, 3, 5, 2],
               [2, 4, 1, 5]])
Here nor.fit exists only to keep the API consistent; it does nothing. transform turns each row into a unit vector.
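Concretely, L2 normalization divides each row by its Euclidean norm. A quick NumPy check, using the first two rows from above, that the result rows really have norm 1:

```python
import numpy as np

X = np.array([[1, 2, 3, 4],
              [5, 4, 3, 2]], dtype=float)
# divide each row by its L2 norm -- what Normalizer(norm='l2') does
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
print(np.linalg.norm(X_norm, axis=1))  # every row now has norm 1
```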
from sklearn.preprocessing import StandardScaler
std = StandardScaler()
std.fit(sample_train[feature_list])  # learn mean/std from the training set only
sample_train[feature_list] = std.transform(sample_train[feature_list])
sample_test[feature_list] = std.transform(sample_test[feature_list])  # reuse the training statistics
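The pattern above matters: the mean and standard deviation are learned from the training set only and then reused on the test set, so no test-set information leaks into the scaling. In plain NumPy the same pattern looks like this (toy data assumed):

```python
import numpy as np

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

# statistics come from the training set only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std  # same mean/std, never refit on test
```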
Ref: https://docs.featuretools.com/
featuretools is a tool for automatically deriving features. You define the Entities and Relationships over your input tables yourself, then let the framework and its built-in primitive functions do the computation.
The primitive functions can also be custom-defined, for example:
import pandas as pd
import featuretools as ft
from featuretools.primitives import make_trans_primitive
from featuretools.variable_types import Datetime, Numeric
def Hour(datetime):
    return pd.DatetimeIndex(datetime).hour.values

hour = make_trans_primitive(function=Hour,
                            input_types=[Datetime],
                            return_type=Numeric)
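Outside featuretools, the body of this Hour primitive is plain pandas; the timestamps below are made up for illustration:

```python
import pandas as pd

times = ["2019-03-01 23:15:00", "2019-03-02 06:30:00"]
# same logic as the Hour primitive: parse to datetimes, take the hour
hours = pd.DatetimeIndex(pd.to_datetime(times)).hour.values
print(hours)  # [23  6]
```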
Here is also an example from my own experiments (since this is combined with SQL, there is only one input table, already joined). There is a time-window problem that I could not solve, so I restricted the input table to rows where START_TIME falls within the 90 days before BIND_TIME.
sample = pd.read_csv('***')
cell_call = sample
from datetime import datetime
# hour of day, sliced from the 'YYYY-MM-DD HH:MM:SS' string
cell_call['call_hour'] = list(map(lambda x: int(x[11:13]), cell_call['start_time']))
# calls placed between midnight and 6 a.m.
cell_call['if_midnight_call'] = list(map(lambda x: 1 if x <= 6 else 0, cell_call['call_hour']))
# weekday() >= 5 means Saturday or Sunday
cell_call['if_weekends'] = list(map(lambda x: 1 if datetime.strptime(x, '%Y-%m-%d %H:%M:%S').weekday() >= 5 else 0, cell_call['start_time']))
# weekday calls between 8:00 and 18:00
cell_call['if_workhours'] = list(map(lambda x, y: 1 if (y == 0 and x >= 8 and x <= 18) else 0, cell_call['call_hour'], cell_call['if_weekends']))
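The same derivations can be written more idiomatically (and run much faster) with pandas' vectorized .dt accessor instead of map/lambda; a sketch with made-up timestamps:

```python
import pandas as pd

cell_call = pd.DataFrame({'start_time': ['2019-03-01 23:15:00',   # Friday night
                                         '2019-03-02 10:00:00',   # Saturday morning
                                         '2019-03-04 09:30:00',   # Monday, work hours
                                         '2019-03-04 03:00:00']}) # Monday, 3 a.m.
ts = pd.to_datetime(cell_call['start_time'])
cell_call['call_hour'] = ts.dt.hour
cell_call['if_midnight_call'] = (cell_call['call_hour'] <= 6).astype(int)
cell_call['if_weekends'] = (ts.dt.weekday >= 5).astype(int)
cell_call['if_workhours'] = ((cell_call['if_weekends'] == 0)
                             & cell_call['call_hour'].between(8, 18)).astype(int)
```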
uid_dt = cell_call[['uid', 'bind_time']].drop_duplicates()
uid_dt.columns = ['uid', 'time']
es = ft.EntitySet(id='uid')
es.entity_from_dataframe(
    entity_id='uid',
    dataframe=uid_dt,
    index='uid',
    time_index='time'
)
es.entity_from_dataframe(
    entity_id='cell_call',
    dataframe=cell_call,
    make_index=True,
    index='index',
    time_index='start_time',
    variable_types={
        'init_type': ft.variable_types.Categorical,
        'other_cell_phone': ft.variable_types.Categorical,
        'if_midnight_call': ft.variable_types.Boolean,
        'if_weekends': ft.variable_types.Boolean,
        'if_workhours': ft.variable_types.Boolean
    }
)
es = es.add_relationship(
    ft.Relationship(es['uid']['uid'], es['cell_call']['uid'])
)
es['cell_call']['init_type'].interesting_values = [1, 2]
es['cell_call']['if_midnight_call'].interesting_values = [1]
es['cell_call']['if_weekends'].interesting_values = [1]
es['cell_call']['if_workhours'].interesting_values = [1]
feature_matrix, features_defs = ft.dfs(
    entityset=es,
    target_entity='uid',
    agg_primitives=['count', 'mean', 'std', 'num_unique', 'skew', 'percent_true', 'num_true'],
    trans_primitives=[],
    where_primitives=['count', 'mean', 'std', 'num_unique', 'skew', 'percent_true', 'num_true'],
    # seed_features=[whether_night_call],
    max_depth=5,
    ignore_variables={'cell_call': ['mob', 'uid', 'aid', 'bind_time', 'subtotal', 'if_cuishou', 'if_contact', 'call_type']}
)
Note that before running ft I derived some features myself, to help the subsequent feature stacking.