Computing Support and Confidence for a Simple Ranking of Rules with Pandas

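Section 1.3.4 in Chapter 1 of 《Python数据挖掘与实战》 deals with rules of the form "a customer who bought product X is likely to buy product Y". To compute support and confidence in Python, the author uses the numpy library, but the code is long and clunky, and tiring to read.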

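The long-winded code is shown below (skip it if you don't want to read it). It relies on X, n_features and features having been defined earlier: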
from collections import defaultdict
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)

for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0: continue
        # Record that the premise was bought in another transaction
        num_occurences[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion:  # It makes little sense to measure if X -> X.
                continue
            if sample[conclusion] == 1:
                # This person also bought the conclusion item
                valid_rules[(premise, conclusion)] += 1
            else:
                # This person bought the premise, but not the conclusion
                invalid_rules[(premise, conclusion)] += 1
support = valid_rules
confidence = defaultdict(float)
for premise, conclusion in valid_rules.keys():
    confidence[(premise, conclusion)] = valid_rules[(premise, conclusion)] / num_occurences[premise]
for premise, conclusion in confidence:
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")
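
For reference, the two quantities that the loop above accumulates can be written out explicitly. This is only a restatement of the code: support here is a raw count rather than a fraction, and the notation $x_{i,X}$ (the 0/1 entry of sample $i$ for item $X$) is introduced here for illustration.

```latex
\mathrm{support}(X \to Y) = \bigl|\{\, i : x_{i,X} = 1 \text{ and } x_{i,Y} = 1 \,\}\bigr|,
\qquad
\mathrm{confidence}(X \to Y) = \frac{\mathrm{support}(X \to Y)}{\bigl|\{\, i : x_{i,X} = 1 \,\}\bigr|}
```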

Now let's compute support and confidence with Pandas instead:

First, import the libraries and load the data file

import numpy as np
import pandas as pd
dataset_filename = 'affinity_dataset.txt'
x = np.loadtxt(dataset_filename)
# Columns: 面包 = bread, 牛奶 = milk, 奶酪 = cheese, 苹果 = apples, 香蕉 = bananas
df = pd.DataFrame(x,columns=['面包','牛奶','奶酪','苹果','香蕉'])
df.head()

The first five rows of the data file look like this:

[Figure: output of df.head()]

A value of 0 means the item was not purchased; a value of 1 means it was purchased.

Next, prepare two DataFrames to hold the support and confidence values

# Create two 5x5 tables of zeros: df2 will hold support, df3 will hold confidence
items = ['面包','牛奶','奶酪','苹果','香蕉']
df2 = pd.DataFrame(np.zeros([5,5]),index=items,columns=items)
df3 = pd.DataFrame(np.zeros([5,5]),index=items,columns=items)
df2

[Figure: df2, a 5x5 table of zeros]

Step three: compute the support (only 7 lines of code)

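The loop visits every sample (row) and every item in it, skipping items that were not purchased (value 0). When item j was purchased (value 1), an inner loop runs over the items after it (k > j, so an item is never paired with itself). If item k was also purchased, 1 is added to both the [j,k] and [k,j] cells of the support DataFrame, which keeps the table symmetric.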

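for i in x:
    for j in range(5):
        # Skip items that were not purchased (value 0)
        if not i[j] : continue
        # Item j was purchased: check each later item k, and count the pair if k was purchased too
        for k in range(j+1,5):
            if not i[k] : continue
            df2.iloc[j,k] += 1
            df2.iloc[k,j] += 1
# Show the support table
df2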

The resulting support table:

[Figure: support table df2]
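
The same pairwise counts can also be obtained without explicit loops. As a minimal sketch, assuming the data is a 0/1 matrix like `x` above: the matrix product of the transposed data with itself counts, for every pair of columns, the rows in which both are 1, and the diagonal (an item paired with itself) is then zeroed to match the loop version. The small `demo` array is made up for illustration and is not the book's dataset.

```python
import numpy as np
import pandas as pd

items = ['面包', '牛奶', '奶酪', '苹果', '香蕉']

# Hypothetical 4-transaction sample (not the real affinity_dataset.txt)
demo = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 1, 1],
    [1, 1, 0, 1, 0],
])

# Entry [j, k] counts the rows where items j and k were both bought
co_counts = demo.T @ demo
np.fill_diagonal(co_counts, 0)  # ignore X -> X, as the loop version does

support_demo = pd.DataFrame(co_counts, index=items, columns=items)
print(support_demo)
```

On a dataset this small the loop is perfectly fine; the matrix form mainly saves typing and scales better as the number of rows grows.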

Final step: compute the confidence (3 lines of code)

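# Confidence = support divided by the number of transactions that contain the premise item
for j in range(5):
    # .iloc picks the j-th item count by position (df.sum() is indexed by item name)
    df3.iloc[j] = df2.iloc[j] / df.sum().iloc[j]

df3.round(3) # show the confidence table rounded to 3 decimals

[Figure: confidence table df3, rounded to 3 decimals]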
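
The row-by-row division can also be written as a single call, and the resulting table can be flattened to rank the rules. This is a sketch under assumptions: `support_demo` and `bought` hold made-up numbers for three items, not results from the dataset. `DataFrame.div(..., axis=0)` divides each row by the count of the matching premise item.

```python
import pandas as pd

items = ['面包', '牛奶', '奶酪']

# Hypothetical support counts (rows = premise, columns = conclusion)
support_demo = pd.DataFrame(
    [[0, 2, 1],
     [2, 0, 0],
     [1, 0, 0]],
    index=items, columns=items,
)
# Hypothetical number of transactions containing each item
bought = pd.Series([4, 5, 2], index=items)

# confidence[premise, conclusion] = support / count(premise)
confidence_demo = support_demo.div(bought, axis=0)

# Flatten to one entry per (premise, conclusion) rule and drop X -> X
rules = confidence_demo.stack()
premises = rules.index.get_level_values(0)
conclusions = rules.index.get_level_values(1)
rules = rules[premises != conclusions]

# Rank the rules from highest to lowest confidence
top_rules = rules.sort_values(ascending=False)
print(top_rules.round(3))
```

Sorting by confidence is just one possible ranking; the same flattened series could be built from the support table to rank by support instead.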

All done, hahaha, perfect~
