The BUC Algorithm and a Python Implementation

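Data analysis sometimes requires aggregating data over several dimensions. Computing the full cube over n dimensions means computing $2^n$ group-bys, which is too expensive, and some cells cover so little data that we do not care about them. We therefore prune the aggregation and compute only the cells that satisfy a condition. The result is called an iceberg cube.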
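To make the $2^n$ figure concrete, here is a minimal sketch that lists every group-by of a full cube. The function name `cuboids` and the dimension labels are made up for illustration and are not part of the implementation further down.

```python
from itertools import combinations

def cuboids(dimensions):
    """Return every subset of the dimensions, i.e. every group-by of the full cube."""
    result = []
    for k in range(len(dimensions) + 1):
        # all ways of choosing k dimensions to group by
        result.extend(combinations(dimensions, k))
    return result

dims = ["A", "B", "C", "D"]  # hypothetical dimension names
all_cuboids = cuboids(dims)
print(len(all_cuboids))  # 16, which is 2 ** 4
```

The empty tuple stands for the apex group-by, which aggregates over all rows.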

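A brute-force approach is to compute the full cube first and prune afterwards, but this does not reduce the complexity at all. The pruning has to happen during aggregation, and that is exactly what the BUC algorithm does.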
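For comparison, the brute-force approach might look like the sketch below. It uses plain tuples rather than pandas, counts as the aggregate, and `*` for an aggregated dimension. The name `naive_iceberg` is an assumption of this sketch. Every cell of the full cube is counted before anything is thrown away.

```python
from collections import Counter
from itertools import combinations

def naive_iceberg(rows, min_sup):
    """Count every cell of the full cube, then keep cells with count >= min_sup."""
    n = len(rows[0])
    counts = Counter()
    for row in rows:
        # each row contributes to one cell per non-empty subset of dimensions
        for k in range(1, n + 1):
            for dims in combinations(range(n), k):
                cell = tuple(row[d] if d in dims else "*" for d in range(n))
                counts[cell] += 1
    # pruning happens only here, after all the work has been done
    return {cell: c for cell, c in counts.items() if c >= min_sup}

rows = [
    ("a1", "b1", "c1"),
    ("a1", "b1", "c2"),
    ("a2", "b1", "c1"),
]
iceberg = naive_iceberg(rows, min_sup=2)
for cell, c in iceberg.items():
    print(cell, c)
```

On these three rows only five cells reach a count of 2, yet the function touched every cell of the cube to find them.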

Algorithm Description

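BUC is similar in spirit to the Apriori algorithm: a cell can be frequent only if its parent cell is frequent, so once a parent fails the condition, all of its descendants can be skipped. Suppose dimension A has 4 distinct values, B has 4, and C has 2. At the finest level there are then at most 4*4*2 = 32 cells (a,b,c) to visit. BUC computes (a1,*,*) first, then (a1,b1,*), (a1,b1,c1), (a1,b1,c2), (a1,b2,*), (a1,b2,c1), and so on. If the aggregate for (a1,b1,*) does not satisfy the condition, that branch is pruned and no cell of the form (a1,b1,c) needs to be computed.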
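The traversal order and the pruning step can be sketched without pandas. This is only an illustration under simplifying assumptions: rows are tuples, the aggregate is a row count, and the names `buc`, `start`, `cell`, and `out` belong to this sketch rather than to the implementation below.

```python
def buc(rows, min_sup, start=0, cell=None, out=None):
    """Visit cells bottom-up, descending only into partitions with >= min_sup rows."""
    n = len(rows[0])
    if cell is None:
        cell = ["*"] * n
    if out is None:
        out = []
    for d in range(start, n):
        # partition the current rows by their value in dimension d
        partitions = {}
        for row in rows:
            partitions.setdefault(row[d], []).append(row)
        for value in sorted(partitions):
            part = partitions[value]
            if len(part) < min_sup:
                continue  # prune: no descendant of this cell can be frequent
            cell[d] = value
            out.append((tuple(cell), len(part)))
            buc(part, min_sup, d + 1, cell, out)  # refine on later dimensions only
        cell[d] = "*"  # dimension d goes back to "aggregated"
    return out

rows = [
    ("a1", "b1", "c1"),
    ("a1", "b1", "c2"),
    ("a2", "b1", "c1"),
]
cells = buc(rows, min_sup=2)
for cell, count in cells:
    print(cell, count)
```

On this data the sketch visits (a1,\*,\*), (a1,b1,\*), (\*,b1,\*), (\*,b1,c1), and (\*,\*,c1), in that order. The partition for a2 holds a single row, so nothing below it is ever counted.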

Implementation

import pandas as pd
import random

A = ["a1","a2","a3","a4"]
B = ["b1","b2","b3","b4"]
C = ["c1","c2","c3","c4"]
D = ["d1","d2"]
ALL = [A,B,C,D]
def sample(size=6):
    """Build `size` random rows: one value per dimension plus a count of 1."""
    rows = []
    for _ in range(size):
        row = [random.choice(d) for d in ALL]
        row.append(1)  # count column
        rows.append(row)
    rows.sort()
    return pd.DataFrame(rows)

def sample_fixed():
    data = [
    ["a1","b1","c1","d1"],
    ["a1","b1","c1","d2"],
    ["a1","b1","c2","d1"],
    ["a1","b1","c2","d2"],
    ["a2","b1","c1","d2"],
    ["a2","b2","c2","d2"],
    ["a3","b3","c1","d1"],
    ["a4","b4","c1","d2"],
    ]

    # data = [
    # ["a1","b1","c1"],
    # ["a1","b1","c2"],
    # ["a2","b1","c1"],
    # ]
    # append a count of 1 to every row; the last column is the count column
    for row in data:
        row.append(1)

    data = pd.DataFrame(data)
    return data



candidate = {}  # output: comma-joined cell values -> count
min_sup = 2  # minimum support (iceberg threshold)

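def BUC(data, prefix=None):
    # Stop when only the count column is left. Passing the total number of
    # dimensions and the current dimension as parameters would also work.
    if len(data.columns) == 1:
        return

    if prefix is None:  # values already fixed on the way to this partition
        prefix = []

    for i in data.columns:  # run BUC on every dimension
        if i == data.columns[-1]:  # skip the last column, which holds the counts
            continue
        count = data.groupby([i]).count()  # count rows per value of this dimension
        count = count[count >= min_sup].dropna()  # discard partitions below min_sup

        for j in count.index:  # iterate over the partitions that survive
            subdata = data[data[i] == j]
            subdata = subdata.drop(columns=i)  # keep only the remaining dimensions

            BUC(subdata, prefix + [str(j)])  # recurse on the smaller partition
            # record the cell, i.e. the output; fco is the global count-column label
            candidate[",".join(prefix + [str(j)])] = int(count[fco][j])
            # print(",".join(prefix+[str(j)]))  # with min_sup = 1 this shows the order in which BUC visits cells

        data = data.drop(columns=i)  # later dimensions must not revisit this one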

data = sample_fixed()
fco = data.columns[-1]  # label of the count column, read inside BUC

BUC(data)  # fills `candidate` in place and returns nothing
for k, v in candidate.items():
    print(k, v)
