目标: 发现事务数据库不同项之间的联系, 这些联系构成的规则, 可以帮助找到某些行为特征,帮忙决策
场景: 超市, 网络浏览偏好, 入侵检测, 生产等领域
与序列挖掘的不同:
全局项I: I={ i 1 , i 2 , . . . , i j , . . . , i m i_{1},i_{2},...,i_{j},...,i_{m} i1,i2,...,ij,...,im}
事务数据库T: T={ t 1 , t 2 , . . . , t i , . . . , t n t_{1},t_{2},...,t_{i},...,t_{n} t1,t2,...,ti,...,tn}
项集: 非空不重复 例如 I 1 = i 1 , i 2 , i 3 I_{1}={i_{1},i_{2},i_{3}} I1=i1,i2,i3
事务 t i t_{i} ti: 每个事务对应一个项集
项集的支持度
关联规则的支持度
定义:其支持度大于或等于min_sup,则称 I i I_{i} Ii为频繁项集
缺点: m个项产生非空项集 2 m 2^m 2m,不具备样本可扩展性
提升度表示含有Y的条件下,同时含有X的概率,与X总体发生的概率之比,即:
Lift ( X ⇐ Y ) = P ( X ∣ Y ) / P ( X ) = Con fidence ( X ⇐ Y ) / P ( X ) \operatorname{Lift}(X \Leftarrow Y)=P(X | Y) / P(X)=\text { Con fidence }(X \Leftarrow Y) / P(X) Lift(X⇐Y)=P(X∣Y)/P(X)= Con fidence (X⇐Y)/P(X)
找到强关联规则的常用判断标准
挖掘强关联规则的两个基本步骤:
方法
n=|D|;
for (I的每个子集c)
{ i=0;
for (对于D中的每个事务t)
{ if (c是t的子集)
i++;
}
if (i/n≥min_sup)
L=L∪{c}; //将c添加到频繁项集集合L中;
}
1)扫描整个数据集,得到所有出现过的数据,作为候选频繁1项集(k=1),频繁0项集为空集。
2)挖掘频繁k项集
a) 扫描数据计算候选频繁k项集的支持度
b) 去除候选频繁k项集中支持度低于阈值的数据集,得到频繁k项集。如果得到的频繁k项集为空,则直接返回频繁k-1项集的集合作为算法结果,算法结束。如果得到的频繁k项集只有一项,则直接返回频繁k项集的集合作为算法结果,算法结束。
c) 基于频繁k项集,连接生成候选频繁k+1项集。
3) 令k=k+1,转入步骤2
缺点:从算法的步骤可以看出,Aprior算法每轮迭代都要扫描数据集,因此在数据集很大,数据种类很多的时候,算法效率很低。
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
store_data = pd.read_csv(r'E:\ai\ai_lab\ai_case\ai_data\ml\store_data.csv', header=None)
store_data.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | shrimp | almonds | avocado | vegetables mix | green grapes | whole weat flour | yams | cottage cheese | energy drink | tomato juice | low fat yogurt | green tea | honey | salad | mineral water | salmon | antioxydant juice | frozen smoothie | spinach | olive oil |
1 | burgers | meatballs | eggs | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | chutney | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | turkey | avocado | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | mineral water | milk | energy bar | whole wheat rice | green tea | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
records = []
for i in range(0, store_data.shape[0]):
records.append([str(store_data.values[i,j]) for j in range(0, store_data.shape[1])])
from apyori import apriori
"""
def apriori(transactions, **kwargs):
Arguments:
transactions:数据集
min_support:最小支持度
min_confidence:最小置信度
min_lift:最小提升度
max_length:最小项数.
"""
association_rules = apriori(records, min_support=0.005, min_confidence=0.2, min_lift=4, min_length=2)
association_results = list(association_rules)
print(len(association_results) )
8
for item in association_results:
pair = item[0]
items = [x for x in pair]
# 输出关联
print("Rule: " + items[0] + " -> " + items[1])
# 输出支持度
print("Support: " + str(item[1]))
# 输出确信度
print("Confidence: " + str(item[2][0][2]))
# 输出提升度
print("Lift: " + str(item[2][0][3]))
print("=====================================")
Rule: escalope -> pasta
Support: 0.005865884548726837
Confidence: 0.3728813559322034
Lift: 4.700811850163794
=====================================
Rule: whole wheat pasta -> olive oil
Support: 0.007998933475536596
Confidence: 0.2714932126696833
Lift: 4.122410097642296
=====================================
Rule: pasta -> shrimp
Support: 0.005065991201173177
Confidence: 0.3220338983050847
Lift: 4.506672147735896
=====================================
Rule: escalope -> pasta
Support: 0.005865884548726837
Confidence: 0.3728813559322034
Lift: 4.700811850163794
=====================================
Rule: ground beef -> herb & pepper
Support: 0.006399146780429276
Confidence: 0.3934426229508197
Lift: 4.004359721511667
=====================================
Rule: whole wheat pasta -> olive oil
Support: 0.007998933475536596
Confidence: 0.2714932126696833
Lift: 4.122410097642296
=====================================
Rule: pasta -> nan
Support: 0.005065991201173177
Confidence: 0.3220338983050847
Lift: 4.506672147735896
=====================================
Rule: ground beef -> herb & pepper
Support: 0.006399146780429276
Confidence: 0.3934426229508197
Lift: 4.004359721511667
=====================================
from pyfpgrowth import *
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
store_data = pd.read_csv(r'E:\ai\ai_lab\ai_case\ai_data\ml\store_data.csv', header=None)
store_data.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | shrimp | almonds | avocado | vegetables mix | green grapes | whole weat flour | yams | cottage cheese | energy drink | tomato juice | low fat yogurt | green tea | honey | salad | mineral water | salmon | antioxydant juice | frozen smoothie | spinach | olive oil |
1 | burgers | meatballs | eggs | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | chutney | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | turkey | avocado | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | mineral water | milk | energy bar | whole wheat rice | green tea | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
records = []
for i in range(0, store_data.shape[0]):
records.append([str(store_data.values[i,j]) for j in range(0, store_data.shape[1])])
patterns = pyfpgrowth.find_frequent_patterns(records, 2)
rules = pyfpgrowth.generate_association_rules(patterns, 0.7)
for i in range(1,5):
key=list(rules.keys())[i]
value=rules[key][0]
confidence=rules[key][1]
print(" %s -> %s: %f"%(key,value,confidence))
('frozen vegetables', 'ground beef', 'mineral water', 'nan', 'oil', 'shrimp', 'spaghetti') -> ('low fat yogurt',): 0.789474
('avocado', 'cottage cheese', 'green tea', 'honey', 'low fat yogurt', 'salmon', 'spinach', 'tomato juice', 'vegetables mix', 'whole weat flour') -> ('mineral water',): 1.000000
('burgers', 'chocolate', 'french fries', 'ham', 'low fat yogurt', 'spaghetti') -> ('eggs', 'nan'): 8.000000
('cider', 'french wine', 'milk') -> ('green tea', 'nan'): 5.666667