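Data Mining | Affinity Analysis (Part 2)

The previous post walked through a simple affinity analysis, but it only computed the support and confidence of a single rule. This post shows how to compute the support and confidence of every rule.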

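Previous post: Data Mining | Affinity Analysis (Part 1)

First, create three dictionaries: one counting the cases where a rule holds, one counting the cases where it fails, and one counting how many times each premise occurs.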

# Dictionaries that count, per rule, the cases where it holds and where it fails
from collections import defaultdict
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)   # how many times each premise occurs
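
As a quick illustration of why defaultdict(int) is convenient here (a minimal sketch; the name `counts` and the keys are only for demonstration): a missing key starts at 0, so a counter can be incremented without first checking whether the key exists.

```python
from collections import defaultdict

counts = defaultdict(int)   # illustrative counter, not part of the article's code
counts[(1, 2)] += 1         # missing key starts at 0, then becomes 1
counts[(1, 2)] += 1         # now 2
missing = counts[(0, 3)]    # reading a missing key returns 0 and inserts the key
print(counts[(1, 2)], missing, len(counts))
```

Note that merely reading a missing key adds it to the dictionary, which is worth keeping in mind when looking up rules that never occurred.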

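With the dictionaries in place, go through the samples, decide for each rule whether it holds, and record the result in the matching dictionary.

Each key is a (premise, conclusion) pair of feature indices. The key (1, 2), for example, stands for "bought milk and also bought cheese"; a value of 7 under that key would mean that 7 customers bought milk and also bought cheese.

The nested loop below checks each condition in turn and increments the corresponding counter.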

# A key such as (1, 2) means "bought milk and also bought cheese"; the value is the count
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0: continue  # premise: the customer bought this product
        num_occurences[premise] += 1       # premise satisfied: count one more occurrence
        for conclusion in range(n_features):  # conclusion: what else was bought given the premise
            if premise == conclusion:      # skip rules whose premise and conclusion are the same
                continue
            if sample[conclusion] == 1:    # rule holds: count it in valid_rules
                valid_rules[(premise, conclusion)] += 1
            else:                          # rule fails: count it in invalid_rules
                invalid_rules[(premise, conclusion)] += 1
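
To make the counting easy to follow by hand, here is a small self-contained sketch of the same loop on a made-up dataset. The 4 x 3 matrix `X_toy` and the names `valid`, `invalid` and `occurrences` are assumptions for illustration only; they are not the article's data.

```python
from collections import defaultdict

# Hypothetical toy data: 4 shoppers (rows), 3 products (columns 0, 1, 2)
X_toy = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
]
n_features_toy = 3

valid = defaultdict(int)        # rule held for this shopper
invalid = defaultdict(int)      # premise held but conclusion did not
occurrences = defaultdict(int)  # how often each premise was satisfied

for sample in X_toy:
    for premise in range(n_features_toy):
        if sample[premise] == 0:
            continue
        occurrences[premise] += 1
        for conclusion in range(n_features_toy):
            if premise == conclusion:
                continue
            if sample[conclusion] == 1:
                valid[(premise, conclusion)] += 1
            else:
                invalid[(premise, conclusion)] += 1

# Product 0 was bought 3 times; in 2 of those baskets product 1 was bought too
print(occurrences[0], valid[(0, 1)], invalid[(0, 1)])  # prints: 3 2 1
```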

Printing the three dictionaries (valid_rules, invalid_rules, then num_occurences) gives the following:

defaultdict(<class 'int'>, {(1, 2): 8, (1, 4): 20, (2, 1): 8, (2, 4): 29, (4, 1): 20, (4, 2): 29, (2, 3): 19, (3, 2): 19, (3, 4): 24, (4, 3): 24, (1, 3): 15, (3, 1): 15, (0, 1): 23, (0, 2): 4, (0, 3): 11, (0, 4): 21, (1, 0): 23, (2, 0): 4, (3, 0): 11, (4, 0): 21})
defaultdict(<class 'int'>, {(1, 0): 23, (1, 3): 31, (2, 0): 32, (2, 3): 17, (4, 0): 40, (4, 3): 37, (1, 2): 38, (1, 4): 26, (2, 1): 28, (3, 0): 28, (3, 1): 24, (4, 1): 41, (3, 2): 20, (3, 4): 15, (2, 4): 7, (0, 1): 20, (0, 3): 32, (4, 2): 32, (0, 2): 39, (0, 4): 22})
defaultdict(<class 'int'>, {1: 46, 2: 36, 4: 61, 3: 39, 0: 43})
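
One way to sanity-check these numbers: every basket that satisfies a premise is counted exactly once for each conclusion, either as valid or as invalid, so the two counts for a rule should add up to the number of occurrences of its premise. The sketch below checks this on a handful of entries copied from the output above; the names `valid_subset`, `invalid_subset` and `mismatches` are hypothetical.

```python
# A few entries copied from the printed dictionaries above
valid_subset = {(1, 2): 8, (2, 4): 29, (0, 1): 23, (3, 4): 24, (4, 0): 21}
invalid_subset = {(1, 2): 38, (2, 4): 7, (0, 1): 20, (3, 4): 15, (4, 0): 40}
occurrences = {1: 46, 2: 36, 4: 61, 3: 39, 0: 43}

# valid + invalid should equal the number of times the premise occurred
mismatches = [
    rule
    for rule in valid_subset
    if valid_subset[rule] + invalid_subset[rule] != occurrences[rule[0]]
]
print(mismatches)  # prints: []
```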

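With these counts in hand, we still need the support and confidence of each rule, so we build a support dictionary and a confidence dictionary. Here support is simply the number of times a rule holds, and confidence is that number divided by the number of times the rule's premise occurs.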

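# Compute support and confidence as dictionaries
support = valid_rules # number of times each rule holds (same object as valid_rules)
confidence = defaultdict(float) # fraction of premise occurrences in which the rule holds
for premise, conclusion in valid_rules.keys():
    rule = (premise, conclusion)
    confidence[rule] = valid_rules[rule] / num_occurences[premise]

Printing support and then confidence gives: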

defaultdict(<class 'int'>, {(1, 2): 8, (1, 4): 20, (2, 1): 8, (2, 4): 29, (4, 1): 20, (4, 2): 29, (2, 3): 19, (3, 2): 19, (3, 4): 24, (4, 3): 24, (1, 3): 15, (3, 1): 15, (0, 1): 23, (0, 2): 4, (0, 3): 11, (0, 4): 21, (1, 0): 23, (2, 0): 4, (3, 0): 11, (4, 0): 21})
defaultdict(<class 'float'>, {(1, 2): 0.17391304347826086, (1, 4): 0.43478260869565216, (2, 1): 0.2222222222222222, (2, 4): 0.8055555555555556, (4, 1): 0.32786885245901637, (4, 2): 0.47540983606557374, (2, 3): 0.5277777777777778, (3, 2): 0.48717948717948717, (3, 4): 0.6153846153846154, (4, 3): 0.39344262295081966, (1, 3): 0.32608695652173914, (3, 1): 0.38461538461538464, (0, 1): 0.5348837209302325, (0, 2): 0.09302325581395349, (0, 3): 0.2558139534883721, (0, 4): 0.4883720930232558, (1, 0): 0.5, (2, 0): 0.1111111111111111, (3, 0): 0.28205128205128205, (4, 0): 0.3442622950819672})
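
A small worked check, using only numbers from the output above, shows one property worth noticing: a rule and its reverse share the same support, but their confidence differs because each is divided by a different premise count. The variable names here are illustrative.

```python
# Bread (0) and milk (1) were bought together 23 times
support_bread_milk = 23
occurrences = {0: 43, 1: 46}  # premise counts taken from the output above

conf_bread_to_milk = support_bread_milk / occurrences[0]  # rule (0, 1)
conf_milk_to_bread = support_bread_milk / occurrences[1]  # rule (1, 0)
print(f"{conf_bread_to_milk:.3f} {conf_milk_to_bread:.3f}")  # prints: 0.535 0.500
```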

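With these dictionaries we can look up the support and confidence of any rule.

First define a list of feature names so the output is easy to read, then define a function that prints the rule together with its support and confidence.

The code is as follows: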

features = ["bread", "milk", "cheese", "apples", "bananas"]

def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    rule = (premise, conclusion)
    print("Rule: if a customer buys {0}, they will also buy {1}".format(premise_name, conclusion_name))
    print("Support: {0}".format(support[rule]))
    print("Confidence: {0:.3f}".format(confidence[rule]))

if __name__ == '__main__':
    premise = 0
    conclusion = 1
    print_rule(premise, conclusion, support, confidence, features)

This looks up the support and confidence of the rule "bought bread and also bought milk". The result:

Rule: if a customer buys bread, they will also buy milk
Support: 23
Confidence: 0.535
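
The same lookup can be repeated for every conclusion of a given premise. The sketch below does this for bread, using the support counts and premise count printed earlier; `format_rule` is a hypothetical helper that returns a string instead of printing, and the English feature names are a translation choice.

```python
features = ["bread", "milk", "cheese", "apples", "bananas"]

# Support counts for rules with bread (0) as the premise, from the output above
support = {(0, 1): 23, (0, 2): 4, (0, 3): 11, (0, 4): 21}
bread_count = 43  # number of baskets containing bread

def format_rule(premise, conclusion):
    """Hypothetical helper: describe one rule as a single line of text."""
    rule = (premise, conclusion)
    conf = support[rule] / bread_count
    return "If a customer buys {0}, they also buy {1} (support {2}, confidence {3:.3f})".format(
        features[premise], features[conclusion], support[rule], conf
    )

lines = [format_rule(0, conclusion) for conclusion in range(1, len(features))]
for line in lines:
    print(line)
```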

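Complete code:

#coding: utf-8
import numpy as np
# Name of the dataset file
dataset_filename = "affinity_dataset.txt"
# Load the dataset
X = np.loadtxt(dataset_filename)
n_samples, n_features = X.shape

# Goal: find rules of the form "if a customer buys X, they may also buy Y"
# Rule quality is judged by support (how many times the rule holds) and confidence (how often it holds when the premise is met)
# A rule consists of two parts: a premise and a conclusion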


# Dictionaries that count, per rule, the cases where it holds and where it fails
from collections import defaultdict
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)   # how many times each premise occurs

# A key such as (1, 2) means "bought milk and also bought cheese"; the value is the count
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0: continue  # premise: the customer bought this product
        num_occurences[premise] += 1       # premise satisfied: count one more occurrence
        for conclusion in range(n_features):  # conclusion: what else was bought given the premise
            if premise == conclusion:      # skip rules whose premise and conclusion are the same
                continue
            if sample[conclusion] == 1:    # rule holds: count it in valid_rules
                valid_rules[(premise, conclusion)] += 1
            else:                          # rule fails: count it in invalid_rules
                invalid_rules[(premise, conclusion)] += 1

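# Compute support and confidence as dictionaries
support = valid_rules # number of times each rule holds (same object as valid_rules)
confidence = defaultdict(float) # fraction of premise occurrences in which the rule holds
for premise, conclusion in valid_rules.keys():
    rule = (premise, conclusion)
    confidence[rule] = valid_rules[rule] / num_occurences[premise]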

features = ["bread", "milk", "cheese", "apples", "bananas"]

def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    rule = (premise, conclusion)
    print("Rule: if a customer buys {0}, they will also buy {1}".format(premise_name, conclusion_name))
    print("Support: {0}".format(support[rule]))
    print("Confidence: {0:.3f}".format(confidence[rule]))


if __name__ == '__main__':
    premise = 0
    conclusion = 1
    print_rule(premise, conclusion, support, confidence, features)
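
The script expects affinity_dataset.txt to be present. For readers who do not have the file, the following rough sketch writes a random placeholder in a layout that np.loadtxt can read (whitespace-separated 0/1 values, one basket per line). The 100 x 5 shape and the uniformly random purchases are assumptions, so the resulting counts will not match the output shown in this post.

```python
import os
import random
import tempfile

random.seed(0)  # fixed seed so the placeholder file is reproducible

n_rows, n_cols = 100, 5  # assumed shape: 100 baskets, 5 products
rows = [[random.randint(0, 1) for _ in range(n_cols)] for _ in range(n_rows)]

# Written to a temporary directory here; copy it next to the script to use it
path = os.path.join(tempfile.mkdtemp(), "affinity_dataset.txt")
with open(path, "w") as f:
    for row in rows:
        f.write(" ".join(str(value) for value in row) + "\n")

print(path)
```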

With the support and confidence of every rule now computed, the next post will cover how to sort the rules and pick out the best ones.
