Feature Engineering: An Introduction to Featuretools, a Tool for Automated Feature Generation (Automated Feature Derivation)

Source: https://docs.featuretools.com/

Reference: https://blog.csdn.net/q337100/article/details/80804887

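Featuretools is a framework for automated feature generation. It transforms temporal and relational datasets into feature matrices that can be used for machine learning.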

5-Minute Quick Start

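Below is an example of using Deep Feature Synthesis (DFS) to perform automated feature engineering. In this example, we apply DFS to a timestamped customer transaction dataset made up of multiple tables.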

In [1]: import featuretools as ft

Load the mock data

In [2]: data = ft.demo.load_mock_customer()

Prepare the data

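The dataset in this example contains three tables. Featuretools calls a table an entity. The three entities in this example are:

  • customers: one record per customer; a customer can have multiple sessions
  • sessions: one record per session; each session record has several attributes
  • transactions: one record per transaction; a session can contain multiple transaction events
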
In [3]: customers_df = data["customers"]

In [4]: customers_df
Out[4]: 
   customer_id zip_code           join_date date_of_birth
0            1    60091 2011-04-17 10:48:33    1994-07-18
1            2    13244 2012-04-15 23:31:04    1986-08-18
2            3    13244 2011-08-13 15:42:34    2003-11-21
3            4    60091 2011-04-08 20:08:14    2006-08-15
4            5    60091 2010-07-17 05:27:50    1984-07-28

In [5]: sessions_df = data["sessions"]

In [6]: sessions_df.sample(5)
Out[6]: 
    session_id  customer_id   device       session_start
13          14            1   tablet 2014-01-01 03:28:00
6            7            3   tablet 2014-01-01 01:39:40
1            2            5   mobile 2014-01-01 00:17:20
28          29            1   mobile 2014-01-01 07:10:05
24          25            3  desktop 2014-01-01 05:59:40

In [7]: transactions_df = data["transactions"]

In [8]: transactions_df.sample(5)
Out[8]: 
     transaction_id  session_id    transaction_time product_id  amount
74              232           5 2014-01-01 01:20:10          1  139.20
231              27          17 2014-01-01 04:10:15          2   90.79
434              36          31 2014-01-01 07:50:10          3   62.35
420              56          30 2014-01-01 07:35:00          3   72.70
54              444           4 2014-01-01 00:58:30          4   43.59

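First, we specify a dictionary containing all the entities in the dataset. Each value is a tuple of the DataFrame, its index column and, where one exists, its time index column.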

In [9]: entities = {
   ...:    "customers" : (customers_df, "customer_id"),
   ...:    "sessions" : (sessions_df, "session_id", "session_start"),
   ...:    "transactions" : (transactions_df, "transaction_id", "transaction_time")
   ...: }
   ...: 

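Next, we specify how the entities are related. When two entities have a one-to-many relationship, the "one" side is the parent entity and the "many" side is the child entity: one record in the parent corresponds to multiple records in the child. For example, the customers entity (customer_id, zip_code, join_date, date_of_birth) is the parent of the sessions entity (session_id, customer_id, device, session_start), because one customer can have many session records. A parent-child relationship is defined as a tuple of the form: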

(parent_entity, parent_variable, child_entity, child_variable)

The example dataset has the following two relationships:

In [10]: relationships = [("sessions", "session_id", "transactions", "session_id"),
   ....:                  ("customers", "customer_id", "sessions", "customer_id")]
   ....: 
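
A one-to-many relationship only makes sense if the parent key is unique and every child row points at an existing parent. The sketch below checks both conditions by hand using only the standard library. It is not a Featuretools API: the helper name check_relationship and the toy rows are made up for illustration.

```python
# Hand-rolled check of a one-to-many (parent-child) relationship.
# Not part of Featuretools; the rows are toy data, not the mock dataset.
customers = [{"customer_id": 1}, {"customer_id": 2}]
sessions = [
    {"session_id": 1, "customer_id": 2},
    {"session_id": 2, "customer_id": 1},
    {"session_id": 3, "customer_id": 1},
]


def check_relationship(parent_rows, parent_key, child_rows, child_key):
    """Return a list of problems; an empty list means the link looks valid."""
    problems = []
    parent_ids = [row[parent_key] for row in parent_rows]
    # The parent side of a one-to-many link must identify rows uniquely.
    if len(parent_ids) != len(set(parent_ids)):
        problems.append("parent key is not unique")
    # Every child row must reference a parent that actually exists.
    known = set(parent_ids)
    orphans = [row for row in child_rows if row[child_key] not in known]
    if orphans:
        problems.append(f"{len(orphans)} child row(s) reference a missing parent")
    return problems


# Same argument order as (parent_entity, parent_variable, child_entity, child_variable).
problems = check_relationship(customers, "customer_id", sessions, "customer_id")
print(problems)
```

Running a check like this before calling DFS is a judgment call; it is cheap on small tables and catches mismatched keys early.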

Run Deep Feature Synthesis

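The minimal input to DFS is a set of entities, a list of relationships, and the target_entity for which to compute features. The output of DFS is a feature matrix and the corresponding list of feature definitions.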

In [11]: feature_matrix_customers, features_defs = ft.dfs(entities=entities,
   ....:                                                  relationships=relationships,
   ....:                                                  target_entity="customers")
   ....: 

In [12]: feature_matrix_customers
Out[12]: 
            zip_code  COUNT(sessions)  NUM_UNIQUE(sessions.device) MODE(sessions.device)  SUM(transactions.amount)  STD(transactions.amount)  MAX(transactions.amount)  SKEW(transactions.amount)  MIN(transactions.amount)  MEAN(transactions.amount)  COUNT(transactions)  NUM_UNIQUE(transactions.product_id)  MODE(transactions.product_id)  DAY(join_date)  DAY(date_of_birth)  YEAR(join_date)  YEAR(date_of_birth)  MONTH(join_date)  MONTH(date_of_birth)  WEEKDAY(join_date)  WEEKDAY(date_of_birth)  SUM(sessions.STD(transactions.amount))  SUM(sessions.MAX(transactions.amount))  SUM(sessions.SKEW(transactions.amount))  SUM(sessions.MIN(transactions.amount))  SUM(sessions.MEAN(transactions.amount))  SUM(sessions.NUM_UNIQUE(transactions.product_id))  STD(sessions.SUM(transactions.amount))  STD(sessions.MAX(transactions.amount))  STD(sessions.SKEW(transactions.amount))  STD(sessions.MIN(transactions.amount))  STD(sessions.MEAN(transactions.amount))  STD(sessions.COUNT(transactions))  STD(sessions.NUM_UNIQUE(transactions.product_id))  MAX(sessions.SUM(transactions.amount))  MAX(sessions.STD(transactions.amount))  MAX(sessions.SKEW(transactions.amount))  MAX(sessions.MIN(transactions.amount))  MAX(sessions.MEAN(transactions.amount))  MAX(sessions.COUNT(transactions))  MAX(sessions.NUM_UNIQUE(transactions.product_id))  SKEW(sessions.SUM(transactions.amount))  SKEW(sessions.STD(transactions.amount))  SKEW(sessions.MAX(transactions.amount))  SKEW(sessions.MIN(transactions.amount))  SKEW(sessions.MEAN(transactions.amount))  SKEW(sessions.COUNT(transactions))  SKEW(sessions.NUM_UNIQUE(transactions.product_id))  MIN(sessions.SUM(transactions.amount))  MIN(sessions.STD(transactions.amount))  MIN(sessions.MAX(transactions.amount))  MIN(sessions.SKEW(transactions.amount))  MIN(sessions.MEAN(transactions.amount))  MIN(sessions.COUNT(transactions))  MIN(sessions.NUM_UNIQUE(transactions.product_id))  MEAN(sessions.SUM(transactions.amount))  MEAN(sessions.STD(transactions.amount))  
MEAN(sessions.MAX(transactions.amount))  MEAN(sessions.SKEW(transactions.amount))  MEAN(sessions.MIN(transactions.amount))  MEAN(sessions.MEAN(transactions.amount))  MEAN(sessions.COUNT(transactions))  MEAN(sessions.NUM_UNIQUE(transactions.product_id))  NUM_UNIQUE(sessions.MODE(transactions.product_id))  NUM_UNIQUE(sessions.DAY(session_start))  NUM_UNIQUE(sessions.YEAR(session_start))  NUM_UNIQUE(sessions.MONTH(session_start))  NUM_UNIQUE(sessions.WEEKDAY(session_start))  MODE(sessions.MODE(transactions.product_id))  MODE(sessions.DAY(session_start))  MODE(sessions.YEAR(session_start))  MODE(sessions.MONTH(session_start))  MODE(sessions.WEEKDAY(session_start))
customer_id                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
1              60091                8                            3                mobile                   9025.62                 40.442059                    139.43                   0.019698                      5.81                  71.631905                  126                                    5                              4              17                  18             2011                 1994                 4                     7                   6                       0                              312.745952                                 1057.97                                -0.476122                                   78.59                               582.193117                                                 40                              279.510713                                7.322191                                 0.589386                                6.954507                                13.759314                           4.062019                                           0.000000                                 1613.93                               46.905665                                 0.640252                                   26.36                                88.755625                                 25                                                  5                                 0.778170                                -0.312355                                -0.780493                                 2.440005                                 -0.424949                            1.946018                                           0.000000                                   809.97                               30.450261                                  118.90                                -1.038434                                50.623125                                 12                                                  5                              1128.202500                                39.093244              
                 132.246250                                 -0.059515                                 9.823750                                 72.774140                           15.750000                                           5.000000                                                   4                                         1                                         1                                          1                                            1                                             4                                  1                                2014                                    1                                      2
2              13244                7                            3               desktop                   7200.28                 37.705178                    146.81                   0.098259                      8.73                  77.422366                   93                                    5                              4              15                  18             2012                 1986                 4                     8                   6                       0                              258.700528                                  931.63                                -0.277640                                  154.60                               548.905851                                                 35                              251.609234                               17.221593                                 0.509798                               15.874374                                11.477071                           3.450328                                           0.000000                                 1320.64                               47.935920                                 0.755711                                   56.46                                96.581000                                 18                                                  5                                -0.440929                                 0.013087                                -1.539467                                 2.154929                                  0.235296                           -0.303276                                           0.000000                                   634.84                               27.839228                                  100.04                                -0.763603                                61.910000                                  8                                                  5                              1028.611429                                36.957218              
                 133.090000                                 -0.039663                                22.085714                                 78.415122                           13.285714                                           5.000000                                                   4                                         1                                         1                                          1                                            1                                             3                                  1                                2014                                    1                                      2
3              13244                6                            3               desktop                   6236.62                 43.683296                    149.15                   0.418230                      5.89                  67.060430                   93                                    5                              1              13                  21             2011                 2003                 8                    11                   5                       4                              257.299895                                  847.63                                 2.286086                                   66.21                               405.237462                                                 29                              219.021420                               10.724241                                 0.429374                                5.424407                                11.174282                           2.428992                                           0.408248                                 1477.97                               50.110120                                 0.854976                                   20.06                                82.109444                                 18                                                  5                                 2.246479                                -0.245703                                -0.941078                                 1.000771                                  0.678544                           -1.507217                                          -2.449490                                   889.21                               35.704680                                  126.74                                -0.289466                                55.579412                                 11                                                  4                              1039.436667                                42.883316              
                 141.271667                                  0.381014                                11.035000                                 67.539577                           15.500000                                           4.833333                                                   4                                         1                                         1                                          1                                            1                                             1                                  1                                2014                                    1                                      2
4              60091                8                            3                mobile                   8727.68                 45.068765                    149.95                  -0.036348                      5.73                  80.070459                  109                                    5                              2               8                  15             2011                 2006                 4                     8                   4                       1                              356.125829                                 1157.99                                 0.002764                                  131.51                               649.657515                                                 37                              235.992478                                3.514421                                 0.387884                               16.960575                                13.027258                           3.335416                                           0.517549                                 1351.46                               54.293903                                 0.382868                                   54.83                               110.450000                                 18                                                  5                                -0.391805                                -1.065663                                 0.027256                                 2.103510                                  1.980948                            0.282488                                          -0.644061                                   771.68                               29.026424                                  139.20                                -0.711744                                70.638182                                 10                                                  4                              1090.960000                                44.515729              
                 144.748750                                  0.000346                                16.438750                                 81.207189                           13.625000                                           4.625000                                                   5                                         1                                         1                                          1                                            1                                             1                                  1                                2014                                    1                                      2
5              60091                6                            3                mobile                   6349.66                 44.095630                    149.02                  -0.025941                      7.55                  80.375443                   79                                    5                              5              17                  28             2010                 1984                 7                     7                   5                       5                              259.873954                                  839.76                                 0.014384                                   86.49                               472.231119                                                 30                              402.775486                                7.928001                                 0.415426                                4.961414                                11.007471                           3.600926                                           0.000000                                 1700.67                               51.149250                                 0.602209                                   20.65                                94.481667                                 18                                                  5                                 0.472342                                 0.204548                                -0.333796                                -0.470410                                  0.335175                           -0.317685                                           0.000000                                   543.18                               36.734681                                  128.51                                -0.539060                                66.666667                                  8                                                  5                              1058.276667                                43.312326              
                 139.960000                                  0.002397                                14.415000                                 78.705187                           13.166667                                           5.000000                                                   5                                         1                                         1                                          1                                            1                                             3                                  1                                2014                                    1                                      2
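
The DAY, YEAR, MONTH and WEEKDAY columns in the matrix above are computed from a single datetime value per row. As a rough illustration, the sketch below reproduces those four values for customer 1 with the standard library. The helper name calendar_features is hypothetical and this is not how Featuretools implements them internally; the two dates are copied from the customers table shown earlier.

```python
from datetime import date, datetime

# Values for customer 1, copied from the customers table.
join_date = datetime(2011, 4, 17, 10, 48, 33)
date_of_birth = date(1994, 7, 18)


def calendar_features(name, value):
    """Derive calendar parts from one date or datetime value."""
    return {
        f"DAY({name})": value.day,
        f"YEAR({name})": value.year,
        f"MONTH({name})": value.month,
        # Python counts Monday as 0 and Sunday as 6.
        f"WEEKDAY({name})": value.weekday(),
    }


row = {
    **calendar_features("join_date", join_date),
    **calendar_features("date_of_birth", date_of_birth),
}
print(row)
```

The resulting values (17, 2011, 4, 6 for join_date and 18, 1994, 7, 0 for date_of_birth) agree with the first row of the feature matrix.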

As the result above shows, we now have dozens of features describing each customer's behavior.
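
Column names such as COUNT(sessions) and MEAN(sessions.SUM(transactions.amount)) describe aggregations that are stacked along the relationships: transactions are first summarized per session, then the per-session values are summarized per customer. The following is a minimal standard-library sketch of that idea, assuming made-up toy rows rather than the mock dataset. It illustrates the concept only and is not the Featuretools implementation.

```python
from collections import defaultdict
from statistics import mean

# Toy rows made up for illustration (not the mock customer dataset).
sessions = [
    {"session_id": 1, "customer_id": 1},
    {"session_id": 2, "customer_id": 1},
    {"session_id": 3, "customer_id": 2},
]
transactions = [
    {"transaction_id": 1, "session_id": 1, "amount": 10.0},
    {"transaction_id": 2, "session_id": 1, "amount": 30.0},
    {"transaction_id": 3, "session_id": 2, "amount": 20.0},
    {"transaction_id": 4, "session_id": 3, "amount": 5.0},
]

# Depth 1: aggregate transactions up to their parent session.
amounts_by_session = defaultdict(list)
for t in transactions:
    amounts_by_session[t["session_id"]].append(t["amount"])
session_sum = {sid: sum(amounts) for sid, amounts in amounts_by_session.items()}

# Depth 2: aggregate the per-session sums up to the parent customer.
sums_by_customer = defaultdict(list)
for s in sessions:
    # In this sketch a session without transactions contributes 0.0.
    sums_by_customer[s["customer_id"]].append(session_sum.get(s["session_id"], 0.0))

features = {
    cid: {
        "COUNT(sessions)": len(sums),
        "SUM(transactions.amount)": sum(sums),
        "MEAN(sessions.SUM(transactions.amount))": mean(sums),
    }
    for cid, sums in sums_by_customer.items()
}
print(features)
```

Each extra level of stacking multiplies the number of candidate features, which is why three small tables yield such a wide matrix.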

Change the target entity

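One of the reasons DFS is so powerful is that it can create a feature matrix for any entity in the data. For example, we can just as easily build features for sessions: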

In [13]: feature_matrix_sessions, features_defs = ft.dfs(entities=entities,
   ....:                                                 relationships=relationships,
   ....:                                                 target_entity="sessions")
   ....: 

In [14]: feature_matrix_sessions.head(5)
Out[14]: 
            customer_id   device  SUM(transactions.amount)  STD(transactions.amount)  MAX(transactions.amount)  SKEW(transactions.amount)  MIN(transactions.amount)  MEAN(transactions.amount)  COUNT(transactions)  NUM_UNIQUE(transactions.product_id)  MODE(transactions.product_id)  DAY(session_start)  YEAR(session_start)  MONTH(session_start)  WEEKDAY(session_start) customers.zip_code  NUM_UNIQUE(transactions.DAY(transaction_time))  NUM_UNIQUE(transactions.YEAR(transaction_time))  NUM_UNIQUE(transactions.MONTH(transaction_time))  NUM_UNIQUE(transactions.WEEKDAY(transaction_time))  MODE(transactions.DAY(transaction_time))  MODE(transactions.YEAR(transaction_time))  MODE(transactions.MONTH(transaction_time))  MODE(transactions.WEEKDAY(transaction_time))  customers.COUNT(sessions)  customers.NUM_UNIQUE(sessions.device) customers.MODE(sessions.device)  customers.SUM(transactions.amount)  customers.STD(transactions.amount)  customers.MAX(transactions.amount)  customers.SKEW(transactions.amount)  customers.MIN(transactions.amount)  customers.MEAN(transactions.amount)  customers.COUNT(transactions)  customers.NUM_UNIQUE(transactions.product_id)  customers.MODE(transactions.product_id)  customers.DAY(join_date)  customers.DAY(date_of_birth)  customers.YEAR(join_date)  customers.YEAR(date_of_birth)  customers.MONTH(join_date)  customers.MONTH(date_of_birth)  customers.WEEKDAY(join_date)  customers.WEEKDAY(date_of_birth)
session_id                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
1                     2  desktop                   1229.01                 41.600976                    141.66                   0.295458                     20.91                  76.813125                   16                                    5                              3                   1                 2014                     1                       2              13244                                               1                                                1                                                 1                                                  1                                          1                                       2014                                           1                                             2                          7                                      3                         desktop                             7200.28                           37.705178                              146.81                             0.098259                                8.73                            77.422366                             93                                              5                                        4                        15                            18                       2012                           1986                           4                               8                             6                                 0
2                     5   mobile                    746.96                 45.893591                    135.25                  -0.160550                      9.32                  74.696000                   10                                    5                              5                   1                 2014                     1                       2              60091                                               1                                                1                                                 1                                                  1                                          1                                       2014                                           1                                             2                          6                                      3                          mobile                             6349.66                           44.095630                              149.02                            -0.025941                                7.55                            80.375443                             79                                              5                                        5                        17                            28                       2010                           1984                           7                               7                             5                                 5
3                     4   mobile                   1329.00                 46.240016                    147.73                  -0.324012                      8.70                  88.600000                   15                                    5                              1                   1                 2014                     1                       2              60091                                               1                                                1                                                 1                                                  1                                          1                                       2014                                           1                                             2                          8                                      3                          mobile                             8727.68                           45.068765                              149.95                            -0.036348                                5.73                            80.070459                            109                                              5                                        2                         8                            15                       2011                           2006                           4                               8                             4                                 1
4                     1   mobile                   1613.93                 40.187205                    129.00                   0.234349                      6.29                  64.557200                   25                                    5                              5                   1                 2014                     1                       2              60091                                               1                                                1                                                 1                                                  1                                          1                                       2014                                           1                                             2                          8                                      3                          mobile                             9025.62                           40.442059                              139.43                             0.019698                                5.81                            71.631905                            126                                              5                                        4                        17                            18                       2011                           1994                           4                               7                             6                                 0
5                     4   mobile                    777.02                 48.918663                    139.20                   0.336381                      7.43                  70.638182                   11                                    5                              5                   1                 2014                     1                       2              60091                                               1                                                1                                                 1                                                  1                                          1                                       2014                                           1                                             2                          8                                      3                          mobile                             8727.68                           45.068765                              149.95                            -0.036348                                5.73                            80.070459                            109                                              5                                        2                         8                            15                       2011                           2006                           4                               8                             4                                 1
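
Columns prefixed with customers. in this matrix, such as customers.zip_code and customers.COUNT(sessions), carry a value from the parent customer down to each of its sessions. The sketch below shows that lookup by hand for sessions 1 and 4, with the values copied from the output above. It is a plain-Python illustration and not how Featuretools computes these columns internally.

```python
# Parent-level values, copied from the customers feature matrix.
customers = {
    1: {"zip_code": "60091", "COUNT(sessions)": 8},
    2: {"zip_code": "13244", "COUNT(sessions)": 7},
}
# Two child rows, copied from the sessions feature matrix.
sessions = [
    {"session_id": 1, "customer_id": 2},
    {"session_id": 4, "customer_id": 1},
]

session_features = []
for s in sessions:
    parent = customers[s["customer_id"]]
    row = dict(s)
    # Copy every parent value onto the child row under a "customers." prefix.
    for name, value in parent.items():
        row[f"customers.{name}"] = value
    session_features.append(row)

for row in session_features:
    print(row)
```

Sessions that belong to the same customer receive identical values in these columns, which is visible for sessions 3 and 5 in the matrix above.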

 
