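Before describing the Apriori algorithm, a few concepts need to be introduced.
An item is the smallest independent unit in a dataset. For example, beer on a shopping list is an item.
An itemset is a set of items. For example, {milk, beer, diapers} is an itemset.
An association rule X->Y states that when X occurs, Y tends to occur as well.
The support of an itemset is the frequency with which that itemset appears in the dataset. For example, take the following dataset: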
1,3,4,5
6,3,4,2
3,4,2,6
1,3,6
1,2,3,4
The itemset {2,3} appears in 3 of the 5 transactions, so its support is 3/5=0.6.
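The calculation above can be sketched in a few lines of Python. This is only an illustration: it assumes each transaction is stored as a set, and the names `transactions` and `support` are made up here rather than taken from the script given later.

```python
# Toy dataset from the text, one set per transaction.
transactions = [
    {1, 3, 4, 5},
    {6, 3, 4, 2},
    {3, 4, 2, 6},
    {1, 3, 6},
    {1, 2, 3, 4},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({2, 3}, transactions))  # 3 of 5 transactions -> 0.6
```

Using the subset test `itemset <= t` keeps the check independent of the order in which items are written in a transaction.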
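Confidence can be understood as the reliability of an association rule. For a rule X->Y:
confidence(X->Y)=P(Y|X)=P(X,Y)/P(X), where P(X,Y) is the support of X and Y occurring together and P(X) is the support of X.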
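As a rough sketch, confidence can be computed from two support values. The helper names below (`support`, `confidence`) are illustrative assumptions, the data is the toy dataset used for the support example, and the code assumes X occurs in at least one transaction so the division is defined.

```python
transactions = [
    {1, 3, 4, 5},
    {6, 3, 4, 2},
    {3, 4, 2, 6},
    {1, 3, 6},
    {1, 2, 3, 4},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """confidence(X -> Y) = support(X and Y together) / support(X)."""
    x, y = set(x), set(y)
    return support(x | y, transactions) / support(x, transactions)

# Every transaction that contains 2 also contains 3.
print(confidence({2}, {3}, transactions))  # 1.0
# Only 3 of the 5 transactions that contain 3 also contain 2.
print(confidence({3}, {2}, transactions))  # 0.6
```

The two results differ, which shows that confidence is not symmetric: X->Y and Y->X are different rules.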
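A k-itemset is an itemset that contains k items. "Frequent" means just what it says: an itemset that appears often. To decide whether an itemset is frequent, we introduce a new concept, the minimum support. As noted above, every itemset has a support, so we fix a threshold, say 1/2 or 1/3, and call a k-itemset a frequent k-itemset if its support is greater than or equal to that threshold.
Checking every possible itemset by brute force is clearly impractical, because a dataset with n distinct items has 2^n-1 non-empty itemsets. The Apriori algorithm was proposed to find the frequent k-itemsets quickly.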
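First law of Apriori:
If an r-itemset does not meet the minimum support, then no (r+1)-itemset formed by extending it (a superset) can meet it either.
For example, if {1,2} does not meet it, then {1,2,3} and {1,2,4} cannot meet it either.
Second law of Apriori:
If an r-itemset meets the minimum support, then every (r-1)-itemset contained in it (a subset) meets it as well.
For example, if {2,3,4} meets it, then {2,3}, {3,4} and {2,4} all meet it.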
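Both laws can be checked on the small dataset used earlier. The sketch below is only a sanity check, not part of the algorithm; the threshold of 0.6 and the helper names `support` and `all_subsets_frequent` are assumptions chosen for illustration.

```python
from itertools import combinations

transactions = [
    {1, 3, 4, 5},
    {6, 3, 4, 2},
    {3, 4, 2, 6},
    {1, 3, 6},
    {1, 2, 3, 4},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def all_subsets_frequent(itemset, transactions, min_support):
    """Second law: every non-empty proper subset of a frequent itemset is frequent."""
    items = sorted(itemset)
    for size in range(1, len(items)):
        for subset in combinations(items, size):
            if support(subset, transactions) < min_support:
                return False
    return True

# {2, 3, 4} has support 0.6, so all of its subsets reach 0.6 as well.
print(all_subsets_frequent({2, 3, 4}, transactions, 0.6))  # True

# First law: {1, 2} has support 0.2, and its superset {1, 2, 3} cannot exceed that.
print(support({1, 2}, transactions), support({1, 2, 3}, transactions))  # 0.2 0.2
```

Both laws follow from the same observation: any transaction that contains an itemset also contains all of its subsets, so adding an item can never increase support.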
Given the dataset:
1,3,4,5
6,3,4,2
3,4,2,6
1,3,6
1,2,3,4
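The algorithm proceeds as follows (with a minimum support of 0.7):
1. Scan the whole dataset to collect every item that occurs. These are the candidate frequent 1-itemsets.
1-itemset | Support |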
---|---|
{1} | 0.6 |
{2} | 0.6 |
{3} | 1 |
{4} | 0.8 |
{5} | 0.2 |
{6} | 0.6 |
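The table can be reproduced with a single pass over the data. This is a minimal sketch that assumes transactions are stored as sets, so each item counts once per transaction; the variable names are illustrative.

```python
from collections import Counter

transactions = [
    {1, 3, 4, 5},
    {6, 3, 4, 2},
    {3, 4, 2, 6},
    {1, 3, 6},
    {1, 2, 3, 4},
]

# Count how many transactions each item appears in.
counts = Counter(item for t in transactions for item in t)
n = len(transactions)
supports = {item: counts[item] / n for item in sorted(counts)}
print(supports)
# {1: 0.6, 2: 0.6, 3: 1.0, 4: 0.8, 5: 0.2, 6: 0.6}

# Keep only the 1-itemsets that reach the minimum support of 0.7.
min_support = 0.7
frequent_1 = [{item} for item, s in supports.items() if s >= min_support]
print(frequent_1)  # [{3}, {4}]
```

With a minimum support of 0.7, only {3} and {4} survive as frequent 1-itemsets.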
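2. Find the frequent k-itemsets.
a. Compute the support of every candidate k-itemset and remove those below the minimum support; what remains are the frequent k-itemsets. If none remain, return the frequent (k-1)-itemsets. If exactly one remains, return it directly. In either case the algorithm ends.
b. If more than one remains, join them into (k+1)-itemsets and discard the joins whose confidence is below the minimum confidence.
3. Set k=k+1 and go to step 2.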
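The join in step 2 can be sketched as below. Note that this sketch swaps in the subset-based pruning that follows from the first law (a candidate is dropped if any of its k-item subsets is not frequent) rather than the confidence filter described in the steps. The function name `generate_candidates` is hypothetical, and the input is the set of frequent 2-itemsets of the toy dataset at an assumed minimum support of 0.6, chosen so that the join produces something non-trivial.

```python
from itertools import combinations

def generate_candidates(frequent_k):
    """Join frequent k-itemsets into (k+1)-item candidates, then prune by the first law."""
    frequent_k = {frozenset(s) for s in frequent_k}
    candidates = set()
    for a, b in combinations(frequent_k, 2):
        union = a | b
        # Join only pairs that differ in exactly one item.
        if len(union) != len(a) + 1:
            continue
        # Prune: every k-item subset of the candidate must itself be frequent.
        if all(frozenset(sub) in frequent_k
               for sub in combinations(union, len(a))):
            candidates.add(union)
    return candidates

# Frequent 2-itemsets of the toy dataset at min_support = 0.6 (illustrative threshold).
frequent_2 = [{1, 3}, {2, 3}, {2, 4}, {3, 4}, {3, 6}]
print([sorted(c) for c in generate_candidates(frequent_2)])  # [[2, 3, 4]]
```

Candidates such as {1,3,4} are rejected without touching the data, because their subset {1,4} is not frequent. Only the surviving candidates need a support count in the next pass.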
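Continuing with the steps above: the frequent 1-itemsets are {3} and {4}, and joining them gives the 2-itemsets:
2-itemset | Support |
---|---|
{3,4} | 0.8 |
{3,4} meets the minimum support and is the only frequent 2-itemset, so by step 2a the algorithm returns it and ends.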
# Apriori.py

# Support of an itemset. This returns the number of transactions that
# contain the itemset (a count), not the ratio.
def support_rate(data, project):
    result = 0
    for i in data:
        bl = True
        for x in project:
            if x not in i:
                bl = False
        if bl:
            result += 1
    return result


# Confidence of the rule X->Y
def confidence(X, Y, data):
    XY = list(set(X + Y))
    return support_rate(data, XY) / support_rate(data, X)


# Decide whether an itemset should be pruned (rate is a count threshold)
def ispruning(data, project, rate):
    return support_rate(data, project) < rate


# Collect every item that occurs in the dataset as a 1-itemset
def find_first_map(data):
    result = []
    for i in range(len(data)):
        for j in range(len(data[i])):
            if [data[i][j]] not in result:
                result.append([data[i][j]])
    return result


# Pruning: keep only the itemsets that reach the minimum support count
def pruning_map(data, projects, rate):
    return [p for p in projects if not ispruning(data, p, rate)]


# Join step
def connection(projects, now, min_confidence, data):
    new_pro = []
    for i in range(len(projects)):
        for j in range(i + 1, len(projects)):
            if now < 2:  # 1-itemsets can be joined directly
                pro = list(set(projects[i] + projects[j]))
                if pro not in new_pro:
                    new_pro.append(pro)
            else:
                count = 0
                for one in projects[i]:
                    if one in projects[j]:
                        count += 1
                # e.g. two 3-itemsets join into a 4-itemset only if they share two items
                if count == now - 1:
                    pro = list(set(projects[i] + projects[j]))
                    if pro not in new_pro and confidence(projects[i], projects[j], data) >= min_confidence:
                        new_pro.append(pro)
    new_pro2 = []
    for i in new_pro:
        if i not in new_pro2:
            new_pro2.append(i)
    return new_pro2


# The Apriori algorithm
def Apriori(data, min_support, min_confidence):
    pros = find_first_map(data)
    oldpro = pros
    newpro = oldpro
    min_support = len(data) * min_support  # convert the ratio into a count
    i = 1
    while True:
        newpro = pruning_map(data, newpro, min_support)
        if len(newpro) == 0:
            return oldpro
        if len(newpro) == 1:
            return newpro
        oldpro = newpro
        print("k=", i, " ", oldpro)
        newpro = connection(oldpro, i, min_confidence, data)
        i += 1


# Each line of mushroom.txt is one transaction of whitespace-separated integers
data = []
with open("mushroom.txt", 'r') as file:
    for line in file:
        data.append(list(map(int, line.split())))

result = Apriori(data=data, min_support=0.50, min_confidence=0.8)
for i in result:
    print("Set of items:", i, " support:{:.3f}".format(support_rate(data, i) / len(data)))
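The output is shown below. To keep the number of results manageable, the minimum support is set to 0.5. Each line lists the frequent k-itemsets for one value of k.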
k= 1 [[34], [36], [59], [63], [67], [76], [85], [86], [90], [2], [39], [24], [53]]
k= 2 [[34, 36], [34, 59], [34, 63], [34, 67], [34, 76], [34, 85], [34, 86], [34, 90], [34, 39], [24, 34], [34, 53], [59, 36], [36, 63], [36, 85], [36, 86], [90, 36], [36, 39], [59, 63], [59, 85], [59, 86], [90, 59], [85, 63], [86, 63], [90, 63], [67, 85], [67, 86], [76, 85], [76, 86], [85, 86], [90, 85], [2, 85], [85, 39], [24, 85], [53, 85], [90, 86], [86, 39], [24, 86], [53, 86], [90, 39], [24, 90], [90, 53]]
k= 3 [[34, 36, 85], [34, 36, 86], [34, 36, 90], [34, 59, 85], [34, 59, 86], [34, 59, 90], [34, 59, 36], [34, 85, 63], [34, 86, 63], [34, 90, 63], [34, 67, 85], [34, 67, 86], [34, 76, 85], [34, 76, 86], [34, 85, 86], [34, 90, 85], [34, 90, 86], [34, 85, 90], [34, 36, 39], [34, 85, 39], [34, 86, 39], [34, 90, 39], [24, 34, 85], [24, 34, 86], [24, 34, 90], [34, 85, 53], [34, 53, 86], [34, 90, 53], [59, 36, 85], [59, 36, 86], [90, 59, 36], [36, 85, 63], [36, 85, 86], [90, 36, 85], [90, 36, 86], [36, 85, 39], [36, 86, 39], [90, 36, 39], [59, 85, 63], [59, 85, 86], [90, 59, 85], [90, 59, 86], [85, 86, 63], [90, 85, 63], [90, 86, 63], [67, 85, 86], [76, 85, 86], [90, 85, 86], [85, 86, 39], [90, 85, 39], [24, 85, 86], [24, 90, 85], [85, 53, 86], [90, 85, 53], [90, 86, 39], [24, 90, 86], [90, 53, 86]]
k= 4 [[34, 36, 85, 86], [34, 90, 36, 85], [34, 36, 90, 86], [34, 85, 36, 90], [34, 59, 85, 86], [90, 34, 59, 85], [34, 59, 36, 85], [34, 59, 90, 86], [34, 59, 36, 86], [34, 59, 85, 90], [34, 85, 86, 63], [34, 90, 85, 63], [34, 90, 86, 63], [34, 85, 90, 63], [34, 67, 85, 86], [34, 76, 85, 86], [34, 90, 85, 86], [34, 85, 90, 86], [34, 85, 90, 36], [34, 36, 85, 39], [34, 36, 86, 39], [34, 85, 86, 39], [34, 90, 85, 39], [34, 90, 86, 39], [34, 85, 90, 39], [24, 34, 85, 86], [24, 34, 90, 85], [24, 34, 90, 86], [24, 34, 85, 90], [34, 53, 85, 86], [90, 34, 53, 85], [34, 90, 53, 86], [34, 85, 53, 86], [34, 53, 85, 90], [34, 53, 90, 86], [59, 36, 85, 86], [90, 59, 36, 85], [90, 36, 85, 86], [36, 85, 86, 39], [90, 36, 85, 39], [90, 59, 85, 86], [90, 85, 86, 63], [90, 85, 86, 39], [24, 90, 85, 86], [90, 53, 85, 86]]
k= 5 [[34, 36, 85, 86, 90], [34, 85, 86, 90, 59], [34, 36, 85, 86, 59], [34, 85, 86, 90, 63], [34, 36, 39, 85, 86], [34, 39, 85, 86, 90], [34, 85, 86, 24, 90], [34, 85, 53, 86, 90], [34, 53, 85, 86, 90]]
Set of items: [34, 36, 85, 86, 90] support:0.772
Set of items: [34, 85, 86, 90, 59] support:0.559
Set of items: [34, 36, 85, 86, 59] support:0.521
Set of items: [34, 85, 86, 90, 63] support:0.530
Set of items: [34, 36, 39, 85, 86] support:0.535
Set of items: [34, 39, 85, 86, 90] support:0.589
Set of items: [34, 85, 86, 24, 90] support:0.518
Set of items: [34, 85, 53, 86, 90] support:0.567
Set of items: [34, 53, 85, 86, 90] support:0.567
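The last two itemsets in the listing contain the same five items in a different order. This can happen because the script builds itemsets with `list(set(...))` and compares them as Python lists, where element order matters. One possible post-processing step, shown here as a sketch with a hypothetical helper `dedupe_itemsets` that is not part of the script above, is to compare itemsets as frozensets:

```python
def dedupe_itemsets(itemsets):
    """Drop itemsets that contain the same items in a different order."""
    seen = set()
    unique = []
    for items in itemsets:
        key = frozenset(items)
        if key not in seen:
            seen.add(key)
            unique.append(sorted(items))
    return unique

# Last three 5-itemsets from the output above.
result = [
    [34, 85, 86, 24, 90],
    [34, 85, 53, 86, 90],
    [34, 53, 85, 86, 90],
]
print(dedupe_itemsets(result))
# [[24, 34, 85, 86, 90], [34, 53, 85, 86, 90]]
```

Sorting each itemset on output also makes the listing easier to compare by eye.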
That is all.