网络数据挖掘-L10

title: 网络数据挖掘-L10
date: 2017-07-26 11:54:52
categories: DataMining
mathjax: true
tags: [WebDataMining]

L10 Association Rules关联规则 (Groups and Topics for Course Project confirmed)

参考博客
参考博客2

Association Rule Mining 基础定义：

Paste_Image.png

Itemset
- A collection of one or more items
  - Example: {Milk, Bread, Diaper}
- k-itemset
  - An itemset that contains k items
Support count (\delta)
- Frequency of occurrence of an itemset
- E.g. $\delta$ ({Milk, Bread,Diaper}) = 2
Support
- Fraction of transactions that contain an itemset
- E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
- An itemset whose support is greater than or equal to a minsup threshold
Association Rule
- An implication expression of the form
  X -> Y, where X and Y are itemsets
- Example: {Milk, Diaper} -> {Beer}
Rule Evaluation Metrics
- Support (s) 支持度
  - Fraction of transactions that contain both X and Y
- Confidence (c) 置信度
  - Measures how often items in Y
    appear in transactions that contain X
- Paste_Image.png
Maximal Frequent Itemset：An itemset is maximal frequent if none of its immediate supersets is frequent 子集都不是高频的
Closed Itemset：An itemset is closed if none of its immediate supersets has the same support as the itemset 子集的support都比自己小
小结论：Frequent Itemsets > Closed Frequent Itemsets> Maximal Frequent Itemsets

Association Rule Mining Task

大多数关联规则挖掘算法通常采用的一种策略是，将关联规则挖掘任务分解为如下两个主要的子任务。

频繁项集产生：其目标是发现满足最小支持度阈值的所有项集，这些项集称作频繁项集（frequent itemset）。
规则的产生：其目标是从上一步发现的频繁项集中提取所有高置信度的规则，这些规则称作强规则（strong rule）。

通常，频繁项集产生所需的计算开销远大于产生规则所需的计算开销。

最容易想到、也最直接的进行关联关系挖掘的方法或许就是暴力搜索（Brute-force）的方法：

Brute-force 蛮力法
- List all possible association rules
- Compute the support and confidence for each rule
- Prune rules that fail the minsup and minconf thresholds 剪枝

然而，由于Brute-force的计算量过大，所以采样这种方法并不现实！

格结构（Lattice structure）常被用来枚举所有可能的项集。如下图所示为I={a,b,c,d,e}的项集格。一般来说，排除空集后，一个包含k个项的数据集可能产生 $2^k-1$ 个频繁项集。由于在实际应用中k的值可能非常大，需要探查的项集搜索空集可能是指数规模的。

Paste_Image.png

图上所有的候选项集，如果被transactions包含，则数量增加。这样需要计算 $O(NMw)$ ,其中 $M=2^d$

Paste_Image.png

Apriori算法

Apriori定律1：如果一个集合是频繁项集，则它的所有子集都是频繁项集。

Apriori定律2：如果一个集合不是频繁项集，则它的所有超集都不是频繁项集。

举例：假设集合{A}不是频繁项集，即A出现的次数小于 min_support，则它的任何超集如{A,B}出现的次数必定小于min_support，因此其超集必定也不是频繁项集。

下图表示当我们发现{A,B}是非频繁集时，就代表所有包含它的超级也是非频繁的，即可以将它们都剪除。

Paste_Image.png

效果：

Paste_Image.png

6+15+20=41

Apriori Algorithm 算法描述

Let k=1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are
identified
- Generate length (k+1) candidate itemsets from length k
  frequent itemsets
- Prune candidate itemsets containing subsets of length k
  that are infrequent
- Count the support of each candidate by scanning the DB
- Eliminate candidates that are infrequent, leaving only
  those that are frequent

加快统计

加快support的计算：candidate itemsets加入到一个hash tree中

Paste_Image.png

加快condition的计算： subset operation 到 subset operation using hash tree

Paste_Image.png

Paste_Image.png

FP-Tree算法

Paste_Image.png

示例：

如何通过FP树提取规则

规则的提取由最后的结点e开始，按照与建树相反的顺序进行提取。（假定支持度计数为2）
以结点e为例：

1、收集包含e结点的所有路径，这些初始的路径称为前缀路径（prefixpath）。
2、如图所示的前缀路径，通过把与结点e相关联的支持度计数相加得到e的支持度计数。假定最小的支持度为2，因为{e}的支持度是3所以它是频繁项集。
3、由于{e}是频繁的，因此算法必须解决发现以de、ce、be和ae结尾的频繁项集的子问题。在解决这些子问题之前，必须先将前缀路径转换成条件fp树。除了用于发现以某特定后缀结尾的频繁项集之外，条件fp树的结果与fp树类似。
- 条件fp树通过以下步骤得到：
- a、首先，必须更新前缀路径上的支持度计数。因为某些计数包含哪些不含项e的事务。例如图中最右边路径null->b:2->c:2->e:1,包含并不包含项e事务{b、c}。因此，必须将该前缀路径上的计数调整为1，以反映包含{b，c，e}的事务的实际个数。
- b、删除e的结点，修剪前缀路径.删除这些结点是因为，沿这些前缀路径的支持度计数已经更新，以反映包含e的那些事务，并且以发现de、ce、be和ae结尾的频繁项集问题的子问题不再需要结点e信息。
- c、更新沿前缀路径上的支持度计数后，某些项可能不是频繁的，删除那些不是频繁的项。
4、FP增长使用e的条件fp树来解决发现以de、ce、be和ae结尾的频繁项子问题。为了发现以de结尾的频繁项集，从项e的条件fp树来收集d的所有前缀路径。通过将与结点d相关联的频度计数求和，得到项集{d、e}的支持度计数。因为{d、e}支持度为2，是频繁的，保留。再构建{de}的条件FP树，该树只包含一个支持度等于最小支持度的项a，所以提取出{a、d、e}。用同样的方法去判断{c、e}、{b、e}、{a、e}。

提取结果如下：

后缀频繁项集
e {e},{d,e},{a,d,e},{c,e},{a,e}
d {d},{c,d},{b,c,d},{b,d},{a,b,d},{a,d}

c {c},{b,c},{a,b,c},{a,c}
b {b},{a,b}
a {a}

FP算法特点：

FP树是一种输入数据的压缩表示，它通过逐个读入事务，并把每个事务映射到FP树种的一条路径来
构造。由于不同的事务可能会有若干个相同的项，因此它们的路径可能部分重叠。路径相互重叠的
越多，使用FP树结构获得的压缩效果越好。如果FP树足够小，则能够存放在内存中，就可以直接从
这个内存中的结构提出频繁项集，而不必重复扫描存放在硬盘上的数据。

评估

Association rule algorithms tend to produce too many rules
- many of them are uninteresting or redundant
- Redundant if {A,B,C} -> {D} and {A,B} -> {D} have same support & confidence
Interestingness measures can be used to prune/rank the derived patterns
In the original formulation of association rules, support &
confidence are the only measures used