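Introduction to Featuretools
1. Brief Introduction
Featuretools is a framework for automated feature engineering. It excels at transforming interrelated datasets into feature matrices for machine learning.
2. Quick Start
Below is an example of using Deep Feature Synthesis (DFS) to perform automated feature engineering. In this example, we apply DFS to a multi-table dataset of timestamped customer transactions.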
## Import Featuretools
In [1]: import featuretools as ft
## Load the mock data
In [2]: data = ft.demo.load_mock_customer()
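Prepare the data
This toy dataset contains three tables. In Featuretools, each table is called an entity.
customers: unique customers who have sessions
sessions: unique sessions and their associated attributes
transactions: the list of events in each session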
In [3]: customers_df = data["customers"]
In [4]: customers_df
Out[4]:
   customer_id zip_code  join_date
0            1    60091 2008-01-01
1            2    02139 2008-02-20
2            3    02139 2008-04-10
3            4    60091 2008-05-30
4            5    02139 2008-07-19
In [5]: sessions_df = data["sessions"]
In [6]: sessions_df.sample(5)
Out[6]:
    session_id  customer_id   device       session_start
16          17            4   mobile 2014-01-01 04:02:40
34          35            1  desktop 2014-01-01 08:45:25
4            5            2   tablet 2014-01-01 01:10:25
0            1            1  desktop 2014-01-01 00:00:00
29          30            4  desktop 2014-01-01 07:29:35
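The minimal input to DFS is a set of entities, a list of relationships, and the target_entity for which to calculate features. The output of DFS is a feature matrix and the corresponding list of feature definitions.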
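The ft.dfs calls in this walkthrough pass `entities` and `relationships`, but their definitions are not shown here. The following is a minimal sketch of what the relationship list could look like, assuming the pre-1.0 Featuretools tuple order (parent entity, parent variable, child entity, child variable). The `transaction_id` column name is an assumption, and both `entity_keys` and the `descendants` helper are hypothetical; they only illustrate how the list encodes the table hierarchy and are not part of Featuretools.

```python
# Relationship tuples, assuming the pre-1.0 Featuretools order:
# (parent_entity, parent_variable, child_entity, child_variable).
relationships = [
    ("sessions", "session_id", "transactions", "session_id"),
    ("customers", "customer_id", "sessions", "customer_id"),
]

# Index column (and optional time index) per entity. In Featuretools itself,
# each value would be a tuple that starts with the DataFrame, for example
# (sessions_df, "session_id", "session_start"). "transaction_id" is assumed.
entity_keys = {
    "customers": ("customer_id",),
    "sessions": ("session_id", "session_start"),
    "transactions": ("transaction_id", "transaction_time"),
}


def descendants(entity, relationships):
    """Hypothetical helper: list every entity below `entity` in the hierarchy."""
    found = []
    for parent, _, child, _ in relationships:
        if parent == entity:
            found.append(child)
            found.extend(descendants(child, relationships))
    return found


print(descendants("customers", relationships))  # ['sessions', 'transactions']
```

Reading the list this way shows why customers can receive features computed from transactions: the two relationships chain customers to sessions and sessions to transactions.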
First, create a feature matrix with one row for each customer in the data.
In [11]: feature_matrix_customers, features_defs = ft.dfs(entities=entities,
   ....:                                                  relationships=relationships,
   ....:                                                  target_entity="customers")
   ....:
In [12]: feature_matrix_customers
Out[12]:
zip_code COUNT(transactions) COUNT(sessions) SUM(transactions.amount) MODE(sessions.device) MIN(transactions.amount) MAX(transactions.amount) YEAR(join_date) SKEW(transactions.amount) DAY(join_date) ... SUM(sessions.MIN(transactions.amount)) MAX(sessions.SKEW(transactions.amount)) MAX(sessions.MIN(transactions.amount)) SUM(sessions.MEAN(transactions.amount)) STD(sessions.SUM(transactions.amount)) STD(sessions.MEAN(transactions.amount)) SKEW(sessions.MEAN(transactions.amount)) STD(sessions.MAX(transactions.amount)) NUM_UNIQUE(sessions.DAY(session_start)) MIN(sessions.SKEW(transactions.amount))
customer_id ...
1 60091 131 10 10236.77 desktop 5.60 149.95 2008 0.070041 1 ... 169.77 0.610052 41.95 791.976505 175.939423 9.299023 -0.377150 5.857976 1 -0.395358
2 02139 122 8 9118.81 mobile 5.81 149.15 2008 0.028647 20 ... 114.85 0.492531 42.96 596.243506 230.333502 10.925037 0.962350 7.420480 1 -0.470007
3 02139 78 5 5758.24 desktop 6.78 147.73 2008 0.070814 10 ... 64.98 0.645728 21.77 369.770121 471.048551 9.819148 -0.244976 12.537259 1 -0.630425
4 60091 111 8 8205.28 desktop 5.73 149.56 2008 0.087986 30 ... 83.53 0.516262 17.27 584.673126 322.883448 13.065436 -0.548969 12.738488 1 -0.497169
5 02139 58 4 4571.37 tablet 5.91 148.17 2008 0.085883 19 ... 73.09 0.830112 27.46 313.448942 198.522508 8.950528 0.098885 5.599228 1 -0.396571
[5 rows x 69 columns]
We now have many new features that describe each customer's behavior.
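To make the column names in the customer feature matrix easier to read, here is a plain-Python sketch of what three of them mean: COUNT(sessions), SUM(transactions.amount), and the stacked feature MAX(sessions.MIN(transactions.amount)). The rows are hypothetical toy values rather than the mock dataset, and the function `customer_features` is an illustration only, not how Featuretools computes features internally.

```python
from collections import defaultdict

# Hypothetical toy rows; every session has at least one transaction.
sessions = [
    {"session_id": 1, "customer_id": 1},
    {"session_id": 2, "customer_id": 1},
    {"session_id": 3, "customer_id": 2},
]
transactions = [
    {"session_id": 1, "amount": 10.0},
    {"session_id": 1, "amount": 30.0},
    {"session_id": 2, "amount": 5.0},
    {"session_id": 3, "amount": 20.0},
]


def customer_features(sessions, transactions):
    # Depth 1: group transaction amounts under their parent session.
    amounts = defaultdict(list)
    for t in transactions:
        amounts[t["session_id"]].append(t["amount"])

    # Depth 2: roll the session-level values up to each customer.
    grouped = defaultdict(list)
    for s in sessions:
        grouped[s["customer_id"]].append(amounts[s["session_id"]])

    features = {}
    for customer_id, per_session in grouped.items():
        features[customer_id] = {
            "COUNT(sessions)": len(per_session),
            "SUM(transactions.amount)": sum(sum(a) for a in per_session),
            # Stacked feature: MIN within each session, then MAX across sessions.
            "MAX(sessions.MIN(transactions.amount))": max(min(a) for a in per_session),
        }
    return features


features = customer_features(sessions, transactions)
print(features[1]["COUNT(sessions)"])                         # 2
print(features[1]["SUM(transactions.amount)"])                # 45.0
print(features[1]["MAX(sessions.MIN(transactions.amount))"])  # 10.0
```

The stacked feature is where the "deep" in Deep Feature Synthesis comes from: an aggregation over sessions is applied to a value that is itself an aggregation over transactions.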
Change the target entity
One of the reasons DFS is so powerful is that it can create a feature matrix for any entity in the data.
In [13]: feature_matrix_sessions, features_defs = ft.dfs(entities=entities,
   ....:                                                 relationships=relationships,
   ....:                                                 target_entity="sessions")
   ....:
In [14]: feature_matrix_sessions.head(5)
Out[14]:
customer_id device WEEKDAY(session_start) MONTH(session_start) MODE(transactions.product_id) MEAN(transactions.amount) customers.zip_code DAY(session_start) MIN(transactions.amount) NUM_UNIQUE(transactions.product_id) ... customers.MODE(sessions.device) customers.WEEKDAY(join_date) MODE(transactions.MONTH(transaction_time)) customers.COUNT(transactions) customers.MAX(transactions.amount) customers.MIN(transactions.amount) MODE(transactions.YEAR(transaction_time)) MODE(transactions.WEEKDAY(transaction_time)) NUM_UNIQUE(transactions.MONTH(transaction_time)) NUM_UNIQUE(transactions.DAY(transaction_time))
session_id ...
1 1 desktop 2 1 2 77.846250 60091 1 5.60 5 ... desktop 1 1 131 149.95 5.60 2014 2 1 1
2 1 desktop 2 1 3 89.533000 60091 1 8.67 4 ... desktop 1 1 131 149.95 5.60 2014 2 1 1
3 5 mobile 2 1 5 67.130000 02139 1 20.91 5 ... tablet 5 1 58 148.17 5.91 2014 2 1 1
4 3 mobile 2 1 1 82.172800 02139 1 8.70 5 ... desktop 3 1 78 147.73 6.78 2014 2 1 1
5 2 tablet 2 1 1 65.031818 02139 1 6.29 5 ... mobile 2 1 122 149.15 5.81 2014 2 1 1
[5 rows x 40 columns]
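The session feature matrix mixes two further kinds of columns: transforms of a session's own values, such as WEEKDAY(session_start), and values pulled down from the parent customer, such as customers.zip_code. The plain-Python sketch below reproduces a few of them for sessions 1 and 5, using the customer IDs, zip codes, and start times shown in the tables above. The function `session_features` is hypothetical and is not Featuretools' implementation.

```python
from datetime import datetime

# Values copied from the customers and sessions tables above.
customers = {
    1: {"zip_code": "60091"},
    2: {"zip_code": "02139"},
}
sessions = [
    {"session_id": 1, "customer_id": 1,
     "session_start": datetime(2014, 1, 1, 0, 0, 0)},
    {"session_id": 5, "customer_id": 2,
     "session_start": datetime(2014, 1, 1, 1, 10, 25)},
]


def session_features(session, customers):
    start = session["session_start"]
    parent = customers[session["customer_id"]]
    return {
        # Transform features: computed from the session's own columns.
        "WEEKDAY(session_start)": start.weekday(),  # Monday is 0
        "MONTH(session_start)": start.month,
        "DAY(session_start)": start.day,
        # Direct feature: looked up on the parent customer row.
        "customers.zip_code": parent["zip_code"],
    }


for s in sessions:
    print(s["session_id"], session_features(s, customers))
```

For session 1 this gives weekday 2, month 1, day 1, and zip code 60091, which agrees with the first row of the feature matrix above.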
More updates to follow.