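Data analysis sometimes requires aggregating data over several dimensions at once. Computing the full cube over n dimensions means materializing $2^n$ group-bys, which is too expensive, and many cells hold so little data that we do not care about them. The aggregation therefore needs pruning so that only the cells that meet a threshold are computed. That result is the iceberg cube.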
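To make the $2^n$ figure concrete, here is a minimal sketch that lists every group-by (cuboid) of a cube. The helper name `cuboids` and the dimension names are made up for illustration.

```python
from itertools import combinations

def cuboids(dims):
    """Return every subset of dims; each subset is one group-by (cuboid)."""
    return [combo
            for k in range(len(dims) + 1)
            for combo in combinations(dims, k)]

dims = ["A", "B", "C"]
all_cuboids = cuboids(dims)
print(len(all_cuboids))  # 8, i.e. 2 ** 3
```

Each extra dimension doubles the list, which is why the full cube becomes impractical quickly.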
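A brute-force approach computes the full cube first and prunes afterwards, but that does not reduce the cost at all. The pruning has to happen during the aggregation itself, and that is what the BUC algorithm does.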
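For comparison, the brute-force approach might look like the sketch below. It is plain Python rather than pandas, the function name `full_cube_then_filter` is hypothetical, and the three rows are the small sample that appears commented out in the code further down.

```python
from collections import Counter
from itertools import combinations

def full_cube_then_filter(rows, min_sup):
    """Count every cell of the full cube, then keep cells with count >= min_sup."""
    n = len(rows[0])
    counts = Counter()
    for row in rows:
        # Each row falls into 2^n cells: keep any subset of its values
        # and replace the other positions with '*'.
        for k in range(n + 1):
            for kept in combinations(range(n), k):
                cell = tuple(row[i] if i in kept else "*" for i in range(n))
                counts[cell] += 1
    # The pruning happens only here, after all the work has been done.
    return {cell: c for cell, c in counts.items() if c >= min_sup}

rows = [("a1", "b1", "c1"), ("a1", "b1", "c2"), ("a2", "b1", "c1")]
iceberg = full_cube_then_filter(rows, min_sup=2)
for cell, c in iceberg.items():
    print(cell, c)
```

Every cell is counted before any of them is discarded, so the threshold saves output space but no computation.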
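The idea resembles the Apriori algorithm: a descendant cell can be frequent only if its parent is frequent. Suppose A has 4 distinct values, B has 4, and C has 2, so the finest level holds at most 4*4*2 = 32 cells. BUC computes (a1,*,*) first, then (a1,b1,*), (a1,b1,c1), (a1,b1,c2), (a1,b2,*), (a1,b2,c1), and so on. If the aggregate for (a1,b1,*) fails the threshold, that branch is pruned and none of the (a1,b1,c) cells need to be computed.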
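The pruning is safe because a count can only shrink as more dimensions are fixed. A small sketch of that property, using a hypothetical helper `support` and the same three-row sample that is commented out in the code below:

```python
rows = [("a1", "b1", "c1"), ("a1", "b1", "c2"), ("a2", "b1", "c1")]

def support(rows, cell):
    """Number of rows that match a cell; '*' matches any value."""
    return sum(
        all(c == "*" or c == v for c, v in zip(cell, row))
        for row in rows
    )

parent = ("a2", "*", "*")
children = [("a2", "b1", "*"), ("a2", "b1", "c1")]

print(parent, support(rows, parent))
for child in children:
    # A child fixes more dimensions, so it matches a subset of the parent's rows.
    print(child, support(rows, child))
```

With a threshold of 2, the parent (a2,*,*) already fails, so neither child can pass and both can be skipped without being counted.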
import pandas as pd
import random

# Possible values for each dimension
A = ["a1", "a2", "a3", "a4"]
B = ["b1", "b2", "b3", "b4"]
C = ["c1", "c2", "c3", "c4"]
D = ["d1", "d2"]
ALL = [A, B, C, D]


def sample(size=6):
    """Build a random table: one value per dimension plus a count column of 1s."""
    li = []
    for _ in range(size):
        tmp = []
        for d in ALL:
            tmp.append(random.choice(d))
        tmp.append(1)
        li.append(tmp)
    li.sort()
    return pd.DataFrame(li)


def sample_fixed():
    """Build a fixed table so that every run gives the same result."""
    data = [
        ["a1", "b1", "c1", "d1"],
        ["a1", "b1", "c1", "d2"],
        ["a1", "b1", "c2", "d1"],
        ["a1", "b1", "c2", "d2"],
        ["a2", "b1", "c1", "d2"],
        ["a2", "b2", "c2", "d2"],
        ["a3", "b3", "c1", "d1"],
        ["a4", "b4", "c1", "d2"],
    ]
    # Smaller three-dimensional alternative:
    # data = [
    #     ["a1", "b1", "c1"],
    #     ["a1", "b1", "c2"],
    #     ["a2", "b1", "c1"],
    # ]
    for row in data:
        row.append(1)  # count column
    return pd.DataFrame(data)
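

candidate = {}  # output: cell -> count
min_sup = 2  # minimum support threshold


def BUC(data, prefix=None):
    # Stop when only the count column is left. Passing the total number of
    # dimensions and the current dimension as parameters would also work.
    if len(data.columns) == 1:
        return
    if prefix is None:  # values fixed so far on the way to this partition
        prefix = []
    for i in data.columns:  # run BUC on every remaining dimension
        if i == data.columns[-1]:  # skip the trailing count column
            continue
        # Rows per value of dimension i; fco is the count column label,
        # assigned at module level before the first call.
        count = data.groupby(i)[fco].count()
        count = count[count >= min_sup]  # drop partitions below min_sup
        for j in count.index:  # visit every surviving partition
            subdata = data[data[i] == j].drop(columns=i)  # select the sub-partition
            BUC(subdata, prefix + [str(j)])  # recurse on the remaining dimensions
            candidate[",".join(prefix + [str(j)])] = int(count[j])  # record the cell
            # print(",".join(prefix + [str(j)]))  # with min_sup = 1 this shows the BUC order
        data = data.drop(columns=i)  # dimension i is done; later dimensions skip it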


data = sample_fixed()
fco = data.columns[-1]  # label of the count column
BUC(data)  # returns nothing; results are collected in candidate
for k, v in candidate.items():
    print(k, v)
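The first comment in `BUC` mentions that the column check could be replaced by passing the total and the current dimension as parameters. A minimal sketch of that variant in plain Python follows. The lowercase `buc` signature, the `start` parameter, and the tuple-based rows are assumptions made for illustration, not the code above rewritten line for line.

```python
from collections import defaultdict

def buc(rows, n_dims, min_sup, start=0, cell=None, out=None):
    """Bottom-up iceberg cube over tuples; start is the first dimension still open."""
    if cell is None:
        cell = ["*"] * n_dims
    if out is None:
        out = {}
    for d in range(start, n_dims):
        # Partition the current rows by their value on dimension d.
        parts = defaultdict(list)
        for row in rows:
            parts[row[d]].append(row)
        for value, part in parts.items():
            if len(part) < min_sup:
                continue  # prune: no descendant of this cell can reach min_sup
            cell[d] = value
            out[tuple(cell)] = len(part)
            # Only dimensions after d are expanded, so no cell is produced twice.
            buc(part, n_dims, min_sup, d + 1, cell, out)
            cell[d] = "*"
    return out

rows = [("a1", "b1", "c1"), ("a1", "b1", "c2"), ("a2", "b1", "c1")]
result = buc(rows, n_dims=3, min_sup=2)
for cell, c in result.items():
    print(cell, c)
```

Tracking `start` avoids copying the table without the finished column at each level, at the cost of carrying two extra arguments through the recursion. Like the pandas version, this sketch does not emit the apex cell (*,*,*).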