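The previous post covered the basics of affinity analysis, but it only computed the support and confidence of a single rule. This post shows how to compute the support and confidence of every rule.
Data Mining | Affinity Analysis (Part 1)
First, create three dictionaries: one counting the cases where a rule holds, one counting the cases where it does not, and one counting how often each premise occurs.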
# Dictionaries that count valid and invalid rule occurrences
from collections import defaultdict
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)  # number of times each premise occurs
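As a side note on why defaultdict(int) is convenient here: a missing key starts at 0, so a count can be incremented without checking whether the key exists. A minimal sketch follows; the name `counts` and the sample keys are illustrative only and are not part of the original code.

```python
from collections import defaultdict

counts = defaultdict(int)  # hypothetical counter, for illustration only
counts[(1, 2)] += 1        # key is created with value 0, then incremented
counts[(1, 2)] += 1
missing = counts[(0, 3)]   # reading a missing key returns 0 and also inserts it
print(dict(counts))        # {(1, 2): 2, (0, 3): 0}
```

Note the last point: merely reading a missing key adds it to the dictionary, which matters when looking up rules later.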
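With the dictionaries in place, loop over the samples, check whether each rule holds, and record the result in the matching dictionary.
Each key is a (premise, conclusion) tuple. For example, the key (1, 2) in valid_rules stands for "bought milk and also bought cheese", and a value of 7 would mean that 7 customers bought both.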
# Keys are tuples such as (1, 2): milk is the premise, cheese the conclusion; values are counts
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0: continue  # premise: the customer bought this item
        num_occurences[premise] += 1  # premise holds, so count one more occurrence
        for conclusion in range(n_features):  # conclusion: what else was bought
            if premise == conclusion:  # skip rules whose premise equals the conclusion
                continue
            if sample[conclusion] == 1:  # rule holds: count it in valid_rules
                valid_rules[(premise, conclusion)] += 1
            else:  # rule fails: count it in invalid_rules
                invalid_rules[(premise, conclusion)] += 1
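The same counts can be cross-checked with a matrix product, since for a 0/1 matrix the product of its transpose with itself counts co-purchases. This is only a sketch of an alternative, not the method used in this post; `X_demo` is a small made-up dataset and all names below are assumptions.

```python
import numpy as np

# Hypothetical dataset: 4 customers, 3 items (1 = bought, 0 = not bought)
X_demo = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
])

# co_counts[i, j] = number of customers who bought both item i and item j
co_counts = X_demo.T @ X_demo
# The diagonal holds the number of customers who bought each item
premise_counts = np.diag(co_counts)

# Rule (0 -> 1): valid when both were bought, invalid when only item 0 was
valid_0_1 = co_counts[0, 1]
invalid_0_1 = premise_counts[0] - valid_0_1
print(valid_0_1, invalid_0_1)  # 2 1
```

The explicit loops are easier to follow for a first pass; the matrix form is mainly useful as a check on larger datasets.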
Printing the three dictionaries (valid_rules, invalid_rules, then num_occurences) gives the following:
defaultdict(<class 'int'>, {(1, 2): 8, (1, 4): 20, (2, 1): 8, (2, 4): 29, (4, 1): 20, (4, 2): 29, (2, 3): 19, (3, 2): 19, (3, 4): 24, (4, 3): 24, (1, 3): 15, (3, 1): 15, (0, 1): 23, (0, 2): 4, (0, 3): 11, (0, 4): 21, (1, 0): 23, (2, 0): 4, (3, 0): 11, (4, 0): 21})
defaultdict(<class 'int'>, {(1, 0): 23, (1, 3): 31, (2, 0): 32, (2, 3): 17, (4, 0): 40, (4, 3): 37, (1, 2): 38, (1, 4): 26, (2, 1): 28, (3, 0): 28, (3, 1): 24, (4, 1): 41, (3, 2): 20, (3, 4): 15, (2, 4): 7, (0, 1): 20, (0, 3): 32, (4, 2): 32, (0, 2): 39, (0, 4): 22})
defaultdict(<class 'int'>, {1: 46, 2: 36, 4: 61, 3: 39, 0: 43})
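A quick sanity check on these numbers: whenever a premise holds, the rule is counted as either valid or invalid, so the two counts should add up to the premise's occurrence count. The sketch below copies a few entries from the printed dictionaries above; the variable names are illustrative.

```python
# A subset of the counts printed above, copied by hand
valid = {(0, 1): 23, (1, 2): 8, (2, 4): 29}
invalid = {(0, 1): 20, (1, 2): 38, (2, 4): 7}
occurrences = {0: 43, 1: 46, 2: 36}

for (premise, conclusion), n_valid in valid.items():
    total = n_valid + invalid[(premise, conclusion)]
    # Each time the premise holds, the rule is either valid or invalid
    assert total == occurrences[premise]
    print(premise, conclusion, total)
```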
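With these counts in hand, we still need the support and confidence of each rule, so we build a support dictionary and a confidence dictionary.
# Compute support and confidence for every rule
support = valid_rules  # support: number of times the rule holds (same object as valid_rules)
confidence = defaultdict(float)  # confidence: fraction of premise occurrences where the rule holds
for premise, conclusion in valid_rules.keys():
    rule = (premise, conclusion)
    confidence[rule] = valid_rules[rule] / num_occurences[premise]
Printing support and confidence gives: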
defaultdict(<class 'int'>, {(1, 2): 8, (1, 4): 20, (2, 1): 8, (2, 4): 29, (4, 1): 20, (4, 2): 29, (2, 3): 19, (3, 2): 19, (3, 4): 24, (4, 3): 24, (1, 3): 15, (3, 1): 15, (0, 1): 23, (0, 2): 4, (0, 3): 11, (0, 4): 21, (1, 0): 23, (2, 0): 4, (3, 0): 11, (4, 0): 21})
defaultdict(<class 'float'>, {(1, 2): 0.17391304347826086, (1, 4): 0.43478260869565216, (2, 1): 0.2222222222222222, (2, 4): 0.8055555555555556, (4, 1): 0.32786885245901637, (4, 2): 0.47540983606557374, (2, 3): 0.5277777777777778, (3, 2): 0.48717948717948717, (3, 4): 0.6153846153846154, (4, 3): 0.39344262295081966, (1, 3): 0.32608695652173914, (3, 1): 0.38461538461538464, (0, 1): 0.5348837209302325, (0, 2): 0.09302325581395349, (0, 3): 0.2558139534883721, (0, 4): 0.4883720930232558, (1, 0): 0.5, (2, 0): 0.1111111111111111, (3, 0): 0.28205128205128205, (4, 0): 0.3442622950819672})
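Written as a formula, the confidence computed here is the number of samples where both premise and conclusion hold, divided by the number of samples where the premise holds. The count notation below is my own shorthand, not the original author's; the worked value uses the counts printed earlier for rule (0, 1).

```latex
\[
\mathrm{confidence}(X \Rightarrow Y)
  = \frac{\mathrm{count}(X \wedge Y)}{\mathrm{count}(X)},
\qquad
\mathrm{confidence}(0 \Rightarrow 1) = \frac{23}{43} \approx 0.535
\]
```

This agrees with the printed entry (0, 1): 0.5348837209302325.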
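With these dictionaries we can look up the support and confidence of any rule.
First define a list of feature names so the output is readable, then define a function that prints a rule together with its support and confidence:
features = ["bread", "milk", "cheese", "apples", "bananas"]
def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    rule = (premise, conclusion)
    print("Rule: If a customer buys {0}, they will also buy {1}".format(premise_name, conclusion_name))
    print("Support: {0}".format(support[rule]))
    print("Confidence: {0:.3f}".format(confidence[rule]))
if __name__ == '__main__':
    premise = 0
    conclusion = 1
    print_rule(premise, conclusion, support, confidence, features)
This looks up the rule "if a customer buys bread, they will also buy milk". The output is:
Rule: If a customer buys bread, they will also buy milk
Support: 23
Confidence: 0.535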
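One thing to watch when querying: both dictionaries are defaultdicts, so indexing a rule that never occurred silently inserts a zero entry. A possible way around this is to read with `.get`, which does not trigger the default factory. The helper `lookup_rule` below is hypothetical, and the two small dictionaries are stand-ins built from the values printed earlier.

```python
from collections import defaultdict

# Stand-ins for the dictionaries built earlier (values from the printed output)
support = defaultdict(int, {(0, 1): 23})
confidence = defaultdict(float, {(0, 1): 23 / 43})

def lookup_rule(premise, conclusion, support, confidence):
    """Hypothetical helper: return (support, confidence) without adding keys."""
    rule = (premise, conclusion)
    # dict.get does not call defaultdict's default factory
    return support.get(rule, 0), confidence.get(rule, 0.0)

print(lookup_rule(0, 1, support, confidence))
print(lookup_rule(0, 0, support, confidence))  # unseen rule: (0, 0.0)
print(len(support))  # 1, no key was inserted
```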
Full code:
# coding: utf-8
import numpy as np
from collections import defaultdict
# Name of the dataset file
dataset_filename = "affinity_dataset.txt"
# Load the dataset
X = np.loadtxt(dataset_filename)
n_samples, n_features = X.shape
# Goal: find rules of the form "if a customer buys X, they may also buy Y"
# Rule quality: support (number of times the rule holds) and confidence (how often it is right)
# A rule consists of a premise and a conclusion
# Dictionaries that count valid and invalid rule occurrences
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)  # number of times each premise occurs
# Keys are tuples such as (1, 2): milk is the premise, cheese the conclusion; values are counts
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0: continue  # premise: the customer bought this item
        num_occurences[premise] += 1  # premise holds, so count one more occurrence
        for conclusion in range(n_features):  # conclusion: what else was bought
            if premise == conclusion:  # skip rules whose premise equals the conclusion
                continue
            if sample[conclusion] == 1:  # rule holds: count it in valid_rules
                valid_rules[(premise, conclusion)] += 1
            else:  # rule fails: count it in invalid_rules
                invalid_rules[(premise, conclusion)] += 1
# Compute support and confidence for every rule
support = valid_rules  # support: number of times the rule holds
confidence = defaultdict(float)  # confidence: fraction of premise occurrences where the rule holds
for premise, conclusion in valid_rules.keys():
    rule = (premise, conclusion)
    confidence[rule] = valid_rules[rule] / num_occurences[premise]
features = ["bread", "milk", "cheese", "apples", "bananas"]
def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    rule = (premise, conclusion)
    print("Rule: If a customer buys {0}, they will also buy {1}".format(premise_name, conclusion_name))
    print("Support: {0}".format(support[rule]))
    print("Confidence: {0:.3f}".format(confidence[rule]))
if __name__ == '__main__':
    premise = 0
    conclusion = 1
    print_rule(premise, conclusion, support, confidence, features)
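The script expects a file named affinity_dataset.txt, which is not included in this post. If you do not have it, something like the sketch below can produce a stand-in so the code runs. Everything here is an assumption: `make_demo_dataset` is a hypothetical helper, the 100 rows are an arbitrary choice (only the 5 columns follow from the five feature names), and random data will not reproduce the numbers shown in this post.

```python
import os
import tempfile
import numpy as np

def make_demo_dataset(path, n_samples=100, n_features=5, seed=0):
    """Hypothetical helper: write a random 0/1 purchase matrix to `path`."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n_samples, n_features))  # values are 0 or 1
    np.savetxt(path, X, fmt="%d")  # whitespace-separated, readable by np.loadtxt
    return X

# Written to a temporary directory here; pass "affinity_dataset.txt" as the
# path instead to create the file in the working directory.
demo_path = os.path.join(tempfile.mkdtemp(), "affinity_dataset.txt")
X_written = make_demo_dataset(demo_path)
X_loaded = np.loadtxt(demo_path)
print(X_loaded.shape)  # (100, 5)
```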
We now have the support and confidence of every rule. The next post covers how to sort the rules and pick out the best ones.