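Source: https://docs.featuretools.com/
Reference: https://blog.csdn.net/q337100/article/details/80804887
Featuretools is a framework for automated feature engineering. It transforms temporal and relational datasets into feature matrices that can be used for machine learning.
Below is an example of using Deep Feature Synthesis (DFS) to perform automated feature engineering. In this example, we apply DFS to a multi-table dataset of timestamped customer transactions.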
In [1]: import featuretools as ft
In [2]: data = ft.demo.load_mock_customer()
The dataset used in this example contains three tables. Featuretools refers to each table as an entity. The three entities are shown below:
In [3]: customers_df = data["customers"]
In [4]: customers_df
Out[4]:
   customer_id zip_code           join_date date_of_birth
0            1    60091 2011-04-17 10:48:33    1994-07-18
1            2    13244 2012-04-15 23:31:04    1986-08-18
2            3    13244 2011-08-13 15:42:34    2003-11-21
3            4    60091 2011-04-08 20:08:14    2006-08-15
4            5    60091 2010-07-17 05:27:50    1984-07-28
In [5]: sessions_df = data["sessions"]
In [6]: sessions_df.sample(5)
Out[6]:
    session_id  customer_id   device       session_start
13          14            1   tablet 2014-01-01 03:28:00
6            7            3   tablet 2014-01-01 01:39:40
1            2            5   mobile 2014-01-01 00:17:20
28          29            1   mobile 2014-01-01 07:10:05
24          25            3  desktop 2014-01-01 05:59:40
In [7]: transactions_df = data["transactions"]
In [8]: transactions_df.sample(5)
Out[8]:
     transaction_id  session_id    transaction_time  product_id  amount
74              232           5 2014-01-01 01:20:10           1  139.20
231              27          17 2014-01-01 04:10:15           2   90.79
434              36          31 2014-01-01 07:50:10           3   62.35
420              56          30 2014-01-01 07:35:00           3   72.70
54              444           4 2014-01-01 00:58:30           4   43.59
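First, we specify a dictionary containing all the entities in the dataset. Each entry maps an entity name to a tuple of its DataFrame, its index column and, where one exists, its time index column.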
In [9]: entities = {
...: "customers" : (customers_df, "customer_id"),
...: "sessions" : (sessions_df, "session_id", "session_start"),
...: "transactions" : (transactions_df, "transaction_id", "transaction_time")
...: }
...:
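Second, we specify how the entities are related. When two entities have a one-to-many relationship, the "one" side is called the parent entity and the "many" side the child entity: a single record in the parent corresponds to multiple records in the child. For example, the customers entity (customer_id, zip_code, join_date, date_of_birth) is the parent of the sessions entity (session_id, customer_id, device, session_start), because one customer can have many sessions. A parent-child relationship is defined as a tuple of the form: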
(parent_entity, parent_variable, child_entity, child_variable)
The example dataset has the following relationships:
In [10]: relationships = [("sessions", "session_id", "transactions", "session_id"),
....: ("customers", "customer_id", "sessions", "customer_id")]
....:
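To make the parent-child idea concrete, here is a minimal sketch in plain Python (no Featuretools required) of what a valid one-to-many relationship implies: the parent key is unique, and every child row points at an existing parent. The tiny tables and the helper name `check_one_to_many` are hypothetical and exist only for illustration; this is not how Featuretools validates relationships internally.

```python
# Hypothetical miniature tables; the values are made up for illustration.
customers = [{"customer_id": 1}, {"customer_id": 2}]
sessions = [
    {"session_id": 10, "customer_id": 1},
    {"session_id": 11, "customer_id": 1},
    {"session_id": 12, "customer_id": 2},
]


def check_one_to_many(parent_rows, parent_key, child_rows, child_key):
    """Group child rows under their parent key, raising if the link is invalid."""
    parent_ids = [row[parent_key] for row in parent_rows]
    if len(parent_ids) != len(set(parent_ids)):
        raise ValueError("parent key must be unique")
    children = {pid: [] for pid in parent_ids}
    for row in child_rows:
        if row[child_key] not in children:
            raise ValueError(f"child row has no parent: {row}")
        children[row[child_key]].append(row)
    return children


# Mirrors the tuple ("customers", "customer_id", "sessions", "customer_id").
children_by_customer = check_one_to_many(
    customers, "customer_id", sessions, "customer_id"
)
sessions_per_customer = {
    cid: len(rows) for cid, rows in children_by_customer.items()
}
print(sessions_per_customer)  # {1: 2, 2: 1}
```

The grouping returned here is the structure that aggregation features such as COUNT(sessions) are computed over: one bucket of child rows per parent record.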
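The minimal input to DFS is a set of entities, a list of relationships, and the target_entity for which features should be computed. The output of DFS is a feature matrix and the corresponding list of feature definitions.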
In [11]: feature_matrix_customers, features_defs = ft.dfs(entities=entities,
....: relationships=relationships,
....: target_entity="customers")
....:
In [12]: feature_matrix_customers
Out[12]:
zip_code COUNT(sessions) NUM_UNIQUE(sessions.device) MODE(sessions.device) SUM(transactions.amount) STD(transactions.amount) MAX(transactions.amount) SKEW(transactions.amount) MIN(transactions.amount) MEAN(transactions.amount) COUNT(transactions) NUM_UNIQUE(transactions.product_id) MODE(transactions.product_id) DAY(join_date) DAY(date_of_birth) YEAR(join_date) YEAR(date_of_birth) MONTH(join_date) MONTH(date_of_birth) WEEKDAY(join_date) WEEKDAY(date_of_birth) SUM(sessions.STD(transactions.amount)) SUM(sessions.MAX(transactions.amount)) SUM(sessions.SKEW(transactions.amount)) SUM(sessions.MIN(transactions.amount)) SUM(sessions.MEAN(transactions.amount)) SUM(sessions.NUM_UNIQUE(transactions.product_id)) STD(sessions.SUM(transactions.amount)) STD(sessions.MAX(transactions.amount)) STD(sessions.SKEW(transactions.amount)) STD(sessions.MIN(transactions.amount)) STD(sessions.MEAN(transactions.amount)) STD(sessions.COUNT(transactions)) STD(sessions.NUM_UNIQUE(transactions.product_id)) MAX(sessions.SUM(transactions.amount)) MAX(sessions.STD(transactions.amount)) MAX(sessions.SKEW(transactions.amount)) MAX(sessions.MIN(transactions.amount)) MAX(sessions.MEAN(transactions.amount)) MAX(sessions.COUNT(transactions)) MAX(sessions.NUM_UNIQUE(transactions.product_id)) SKEW(sessions.SUM(transactions.amount)) SKEW(sessions.STD(transactions.amount)) SKEW(sessions.MAX(transactions.amount)) SKEW(sessions.MIN(transactions.amount)) SKEW(sessions.MEAN(transactions.amount)) SKEW(sessions.COUNT(transactions)) SKEW(sessions.NUM_UNIQUE(transactions.product_id)) MIN(sessions.SUM(transactions.amount)) MIN(sessions.STD(transactions.amount)) MIN(sessions.MAX(transactions.amount)) MIN(sessions.SKEW(transactions.amount)) MIN(sessions.MEAN(transactions.amount)) MIN(sessions.COUNT(transactions)) MIN(sessions.NUM_UNIQUE(transactions.product_id)) MEAN(sessions.SUM(transactions.amount)) MEAN(sessions.STD(transactions.amount)) MEAN(sessions.MAX(transactions.amount)) 
MEAN(sessions.SKEW(transactions.amount)) MEAN(sessions.MIN(transactions.amount)) MEAN(sessions.MEAN(transactions.amount)) MEAN(sessions.COUNT(transactions)) MEAN(sessions.NUM_UNIQUE(transactions.product_id)) NUM_UNIQUE(sessions.MODE(transactions.product_id)) NUM_UNIQUE(sessions.DAY(session_start)) NUM_UNIQUE(sessions.YEAR(session_start)) NUM_UNIQUE(sessions.MONTH(session_start)) NUM_UNIQUE(sessions.WEEKDAY(session_start)) MODE(sessions.MODE(transactions.product_id)) MODE(sessions.DAY(session_start)) MODE(sessions.YEAR(session_start)) MODE(sessions.MONTH(session_start)) MODE(sessions.WEEKDAY(session_start))
customer_id
1 60091 8 3 mobile 9025.62 40.442059 139.43 0.019698 5.81 71.631905 126 5 4 17 18 2011 1994 4 7 6 0 312.745952 1057.97 -0.476122 78.59 582.193117 40 279.510713 7.322191 0.589386 6.954507 13.759314 4.062019 0.000000 1613.93 46.905665 0.640252 26.36 88.755625 25 5 0.778170 -0.312355 -0.780493 2.440005 -0.424949 1.946018 0.000000 809.97 30.450261 118.90 -1.038434 50.623125 12 5 1128.202500 39.093244 132.246250 -0.059515 9.823750 72.774140 15.750000 5.000000 4 1 1 1 1 4 1 2014 1 2
2 13244 7 3 desktop 7200.28 37.705178 146.81 0.098259 8.73 77.422366 93 5 4 15 18 2012 1986 4 8 6 0 258.700528 931.63 -0.277640 154.60 548.905851 35 251.609234 17.221593 0.509798 15.874374 11.477071 3.450328 0.000000 1320.64 47.935920 0.755711 56.46 96.581000 18 5 -0.440929 0.013087 -1.539467 2.154929 0.235296 -0.303276 0.000000 634.84 27.839228 100.04 -0.763603 61.910000 8 5 1028.611429 36.957218 133.090000 -0.039663 22.085714 78.415122 13.285714 5.000000 4 1 1 1 1 3 1 2014 1 2
3 13244 6 3 desktop 6236.62 43.683296 149.15 0.418230 5.89 67.060430 93 5 1 13 21 2011 2003 8 11 5 4 257.299895 847.63 2.286086 66.21 405.237462 29 219.021420 10.724241 0.429374 5.424407 11.174282 2.428992 0.408248 1477.97 50.110120 0.854976 20.06 82.109444 18 5 2.246479 -0.245703 -0.941078 1.000771 0.678544 -1.507217 -2.449490 889.21 35.704680 126.74 -0.289466 55.579412 11 4 1039.436667 42.883316 141.271667 0.381014 11.035000 67.539577 15.500000 4.833333 4 1 1 1 1 1 1 2014 1 2
4 60091 8 3 mobile 8727.68 45.068765 149.95 -0.036348 5.73 80.070459 109 5 2 8 15 2011 2006 4 8 4 1 356.125829 1157.99 0.002764 131.51 649.657515 37 235.992478 3.514421 0.387884 16.960575 13.027258 3.335416 0.517549 1351.46 54.293903 0.382868 54.83 110.450000 18 5 -0.391805 -1.065663 0.027256 2.103510 1.980948 0.282488 -0.644061 771.68 29.026424 139.20 -0.711744 70.638182 10 4 1090.960000 44.515729 144.748750 0.000346 16.438750 81.207189 13.625000 4.625000 5 1 1 1 1 1 1 2014 1 2
5 60091 6 3 mobile 6349.66 44.095630 149.02 -0.025941 7.55 80.375443 79 5 5 17 28 2010 1984 7 7 5 5 259.873954 839.76 0.014384 86.49 472.231119 30 402.775486 7.928001 0.415426 4.961414 11.007471 3.600926 0.000000 1700.67 51.149250 0.602209 20.65 94.481667 18 5 0.472342 0.204548 -0.333796 -0.470410 0.335175 -0.317685 0.000000 543.18 36.734681 128.51 -0.539060 66.666667 8 5 1058.276667 43.312326 139.960000 0.002397 14.415000 78.705187 13.166667 5.000000 5 1 1 1 1 3 1 2014 1 2
As the output above shows, we now have dozens of features describing each customer's behavior.
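Column names such as MEAN(sessions.SUM(transactions.amount)) describe stacked aggregations: a feature is first computed on a child entity and then aggregated again on its parent. The following sketch reproduces that idea by hand in plain Python on a tiny, hypothetical dataset. It illustrates only what the feature names mean and is not Featuretools' implementation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data, made up for illustration.
sessions = {10: 1, 11: 1, 12: 2}  # session_id -> customer_id
transactions = [  # (session_id, amount)
    (10, 20.0),
    (10, 30.0),
    (11, 70.0),
    (12, 10.0),
    (12, 40.0),
]

# Depth 1: SUM(transactions.amount) for each session.
session_sum = defaultdict(float)
for session_id, amount in transactions:
    session_sum[session_id] += amount

# Depth 2: collect the per-session sums under each customer.
sums_by_customer = defaultdict(list)
for session_id, customer_id in sessions.items():
    sums_by_customer[customer_id].append(session_sum[session_id])

# MEAN(sessions.SUM(transactions.amount)) and COUNT(sessions) per customer.
mean_session_sum = {cid: mean(v) for cid, v in sums_by_customer.items()}
count_sessions = {cid: len(v) for cid, v in sums_by_customer.items()}

print(mean_session_sum)  # {1: 60.0, 2: 50.0}
print(count_sessions)    # {1: 2, 2: 1}
```

Customer 1 has two sessions with totals 50.0 and 70.0, so the stacked feature is their mean, 60.0. DFS generates features of this kind automatically by walking the relationships between entities.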
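One reason DFS is so powerful is that it can create a feature matrix for any entity in the data. For example, we can build features for sessions in the same way: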
In [13]: feature_matrix_sessions, features_defs = ft.dfs(entities=entities,
....: relationships=relationships,
....: target_entity="sessions")
....:
In [14]: feature_matrix_sessions.head(5)
Out[14]:
customer_id device SUM(transactions.amount) STD(transactions.amount) MAX(transactions.amount) SKEW(transactions.amount) MIN(transactions.amount) MEAN(transactions.amount) COUNT(transactions) NUM_UNIQUE(transactions.product_id) MODE(transactions.product_id) DAY(session_start) YEAR(session_start) MONTH(session_start) WEEKDAY(session_start) customers.zip_code NUM_UNIQUE(transactions.DAY(transaction_time)) NUM_UNIQUE(transactions.YEAR(transaction_time)) NUM_UNIQUE(transactions.MONTH(transaction_time)) NUM_UNIQUE(transactions.WEEKDAY(transaction_time)) MODE(transactions.DAY(transaction_time)) MODE(transactions.YEAR(transaction_time)) MODE(transactions.MONTH(transaction_time)) MODE(transactions.WEEKDAY(transaction_time)) customers.COUNT(sessions) customers.NUM_UNIQUE(sessions.device) customers.MODE(sessions.device) customers.SUM(transactions.amount) customers.STD(transactions.amount) customers.MAX(transactions.amount) customers.SKEW(transactions.amount) customers.MIN(transactions.amount) customers.MEAN(transactions.amount) customers.COUNT(transactions) customers.NUM_UNIQUE(transactions.product_id) customers.MODE(transactions.product_id) customers.DAY(join_date) customers.DAY(date_of_birth) customers.YEAR(join_date) customers.YEAR(date_of_birth) customers.MONTH(join_date) customers.MONTH(date_of_birth) customers.WEEKDAY(join_date) customers.WEEKDAY(date_of_birth)
session_id
1 2 desktop 1229.01 41.600976 141.66 0.295458 20.91 76.813125 16 5 3 1 2014 1 2 13244 1 1 1 1 1 2014 1 2 7 3 desktop 7200.28 37.705178 146.81 0.098259 8.73 77.422366 93 5 4 15 18 2012 1986 4 8 6 0
2 5 mobile 746.96 45.893591 135.25 -0.160550 9.32 74.696000 10 5 5 1 2014 1 2 60091 1 1 1 1 1 2014 1 2 6 3 mobile 6349.66 44.095630 149.02 -0.025941 7.55 80.375443 79 5 5 17 28 2010 1984 7 7 5 5
3 4 mobile 1329.00 46.240016 147.73 -0.324012 8.70 88.600000 15 5 1 1 2014 1 2 60091 1 1 1 1 1 2014 1 2 8 3 mobile 8727.68 45.068765 149.95 -0.036348 5.73 80.070459 109 5 2 8 15 2011 2006 4 8 4 1
4 1 mobile 1613.93 40.187205 129.00 0.234349 6.29 64.557200 25 5 5 1 2014 1 2 60091 1 1 1 1 1 2014 1 2 8 3 mobile 9025.62 40.442059 139.43 0.019698 5.81 71.631905 126 5 4 17 18 2011 1994 4 7 6 0
5 4 mobile 777.02 48.918663 139.20 0.336381 7.43 70.638182 11 5 5 1 2014 1 2 60091 1 1 1 1 1 2014 1 2 8 3 mobile 8727.68 45.068765 149.95 -0.036348 5.73 80.070459 109 5 2 8 15 2011 2006 4 8 4 1
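Columns prefixed with `customers.` in the session feature matrix, such as customers.zip_code or customers.COUNT(sessions), are values taken from each session's parent customer. The sketch below shows that lookup by hand in plain Python. The miniature tables are hypothetical, and the code illustrates only the meaning of these column names, not how Featuretools computes them.

```python
from collections import Counter

# Hypothetical miniature tables; values are illustrative only.
customers = {
    1: {"zip_code": "60091"},
    2: {"zip_code": "13244"},
}
sessions = [
    {"session_id": 10, "customer_id": 1, "device": "mobile"},
    {"session_id": 11, "customer_id": 1, "device": "tablet"},
    {"session_id": 12, "customer_id": 2, "device": "desktop"},
]

# Aggregation on the parent: COUNT(sessions) per customer.
count_sessions = Counter(s["customer_id"] for s in sessions)

# Each session copies feature values from its parent customer.
session_features = {}
for s in sessions:
    parent_id = s["customer_id"]
    session_features[s["session_id"]] = {
        "device": s["device"],
        "customers.zip_code": customers[parent_id]["zip_code"],
        "customers.COUNT(sessions)": count_sessions[parent_id],
    }

for session_id, row in session_features.items():
    print(session_id, row)
```

Sessions 10 and 11 belong to the same customer, so they share identical `customers.` values. The same pattern is visible in the output above, where sessions 3 and 5 both belong to customer 4.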