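Section 1.3.4 in Chapter 1 of 《Python数据挖掘与实战》 deals with rules of the form "a customer who bought product X is likely to also buy product Y". To compute the support and confidence of such rules in Python, the author uses the numpy library, but the code is long, clunky, and tiring to read.

The long, clunky code is shown below (skip it if you don't want to read it):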

from collections import defaultdict
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)

for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0: continue
        # Record that the premise was bought in another transaction
        num_occurences[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion:  # It makes little sense to measure if X -> X.
                continue
            if sample[conclusion] == 1:
                # This person also bought the conclusion item
                valid_rules[(premise, conclusion)] += 1
            else:
                # This person bought the premise, but not the conclusion
                invalid_rules[(premise, conclusion)] += 1
support = valid_rules
confidence = defaultdict(float)
for premise, conclusion in valid_rules.keys():
    confidence[(premise, conclusion)] = valid_rules[(premise, conclusion)] / num_occurences[premise]

for premise, conclusion in confidence:
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")

Now let's compute the support and confidence with the Pandas library instead:

First, import the libraries and load the data file

import numpy as np
import pandas as pd
dataset_filename = 'affinity_dataset.txt'
x = np.loadtxt(dataset_filename)
df = pd.DataFrame(x,columns=['面包','牛奶','奶酪','苹果','香蕉'])
df.head()

The first five rows of the data file look like this:

[Figure: output of df.head()]

A value of 0 means the item was not bought; a value of 1 means it was bought.
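If affinity_dataset.txt is not at hand, a small stand-in with the same layout can be built for experimenting. This is only a sketch: the four rows below are made up for illustration and are not the book's data, and the names `toy`, `toy_df`, and `item_counts` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for affinity_dataset.txt: one row per transaction,
# one column per item, 1.0 = bought, 0.0 = not bought (values are made up).
items = ['面包', '牛奶', '奶酪', '苹果', '香蕉']
toy = np.array([
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=float)
toy_df = pd.DataFrame(toy, columns=items)

# Column sums give the number of transactions that contain each item.
item_counts = toy_df.sum()
print(item_counts)
```

These per-item counts are the denominators used later when the confidence is computed.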

Next, prepare two data frames to hold the support and confidence values

# Create two tables: one to hold the support, one to hold the confidence
df2 = pd.DataFrame(np.zeros([5,5]),index=['面包','牛奶','奶酪','苹果','香蕉'],columns=['面包','牛奶','奶酪','苹果','香蕉'])
df3 = pd.DataFrame(np.zeros([5,5]),index=['面包','牛奶','奶酪','苹果','香蕉'],columns=['面包','牛奶','奶酪','苹果','香蕉'])
df2

[Figure: df2, a 5x5 table of zeros indexed by the five items]

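Step 3: compute the support (only 7 lines of code)

Iterate over every sample and every item in it, skipping items that were not bought (value 0). When item j was bought (value 1), keep scanning from the next item, j+1, so that an item is never paired with itself and each pair is visited only once. For every other item k that was also bought, add 1 to both the [j,k] and [k,j] cells of the support data frame, which keeps the table symmetric.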

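for i in x:
    for j in range(5):
        # Skip items that were not bought (value 0)
        if not i[j] : continue
        # Item j was bought: scan the remaining items and count each one that was also bought
        for k in range(j+1,5):
            if not i[k] : continue
            df2.iloc[j,k] += 1
            df2.iloc[k,j] += 1
# Show the resulting support table
df2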

The resulting support table:

[Figure: the support table df2]
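The same pairwise counts can also be obtained without explicit loops. The following is a sketch, not the method used above, and it assumes the data is strictly 0/1: multiplying the transposed table by the table itself counts, for every pair of items, the transactions that contain both. The diagonal (an item paired with itself) is then zeroed to match the loop version. The small `demo` table is made up for illustration.

```python
import numpy as np
import pandas as pd

items = ['面包', '牛奶', '奶酪', '苹果', '香蕉']
# Made-up 0/1 transactions, for illustration only
demo = pd.DataFrame(
    [[1, 1, 0, 0, 0],
     [0, 1, 1, 0, 1],
     [1, 0, 0, 1, 1],
     [0, 0, 1, 1, 1]],
    columns=items, dtype=float)

# (items x transactions) . (transactions x items) -> co-occurrence counts
cooccurrence = demo.T.dot(demo)

# Zero the diagonal: the rule X -> X is not of interest
values = cooccurrence.to_numpy(copy=True)
np.fill_diagonal(values, 0)
support = pd.DataFrame(values, index=items, columns=items)
print(support)
```

Before it is zeroed, the diagonal of the product holds the per-item purchase counts, so one matrix product yields both the pair counts and the item counts.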

Final step: compute the confidence (3 lines of code)

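# Divide the support by the number of transactions containing the premise item to get the confidence
for j in range(5):
    df3.iloc[j] = df2.iloc[j] / df.sum().iloc[j]

df3.round(3) # Show the confidence table rounded to 3 decimal places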

[Figure: the confidence table df3, rounded to 3 decimal places]
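The row-by-row division can also be written as a single call. This is a sketch under the same assumptions as before, using a made-up 0/1 table: `DataFrame.div` with `axis=0` divides each row of the support table by the number of transactions that contain that row's item (the premise). Rows are premises and columns are conclusions.

```python
import numpy as np
import pandas as pd

items = ['面包', '牛奶', '奶酪', '苹果', '香蕉']
# Made-up 0/1 transactions, for illustration only
demo = pd.DataFrame(
    [[1, 1, 0, 0, 0],
     [0, 1, 1, 0, 1],
     [1, 0, 0, 1, 1],
     [0, 0, 1, 1, 1]],
    columns=items, dtype=float)

# Pairwise co-occurrence counts with the diagonal zeroed (X -> X is skipped)
values = demo.T.dot(demo).to_numpy(copy=True)
np.fill_diagonal(values, 0)
support = pd.DataFrame(values, index=items, columns=items)

# Number of transactions containing each item (the premise counts)
premise_counts = demo.sum()

# axis=0 aligns premise_counts with the row labels, so row j is divided
# by the count of item j. An item that was never bought would give NaN here.
confidence = support.div(premise_counts, axis=0)
print(confidence.round(3))
```

Unlike the support table, the confidence table is not symmetric, because the rules X -> Y and Y -> X are divided by different premise counts.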

Computing support and confidence for simple rule ranking with Pandas

All done, hahaha, perfect~
