(Original) Mining Crowdsourced Behavior Patterns in Bugzilla

The paper is titled:

The Firefox Temporal Defect Dataset

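This paper, by Canadian researchers, reports mining results obtained last summer from the Firefox bug reports filed on Bugzilla from 1998 to 2014. Calling them "results" may be generous: the main finding is that, while bugs are being handled, certain communication patterns recur fairly often among developers. What these patterns look like is described later in this post. This is very close to my current research topic, so I am writing a post to record it.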


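First, the data format. The data is in CSV (comma-separated values) format, which is intuitive and concise and easy for developers to pick up quickly. It looks like Figure 1 below:

Dataset columns: column A: bug_id; column B: the time the activity occurred; column C: the type of activity; column D: the number of days between the bug being reported and this activity.

Each type of activity is denoted by a single letter.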
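To make the four-column layout concrete, here is a minimal sketch that groups rows by bug_id and concatenates the activity letters into one string per bug. The two bug ids come from this post; the timestamps, letters and day counts in `SAMPLE_CSV`, and the function name `load_sequences`, are made up for illustration and are not taken from the real dataset.

```python
import csv
import io

# Hypothetical sample rows: bug_id, activity time, activity letter, days since the bug was reported.
SAMPLE_CSV = """\
322067,2006-01-01 10:00:00,N,0
322067,2006-01-02 09:30:00,C,1
322067,2006-01-05 14:00:00,W,4
322067,2006-01-06 08:15:00,Z,5
322075,2006-01-01 11:00:00,N,0
322075,2006-01-01 16:45:00,Z,0
"""

def load_sequences(csv_text):
    """Return {bug_id: activity string}, keeping the row order within each bug."""
    sequences = {}
    for bug_id, _timestamp, activity, _days in csv.reader(io.StringIO(csv_text)):
        sequences[bug_id] = sequences.get(bug_id, "") + activity
    return sequences

sequences = load_sequences(SAMPLE_CSV)
print(sequences)
```

Each bug then becomes a short string of letters, which is the form that sequence mining works on.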


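The meaning of each letter is shown in Figure 2 below:

[Figure 2: meaning of each activity letter]

I use the same color for activities of the same type. A single bug usually involves only a few of these activities. For example, in rows 1-4 of Figure 1, the bug with bug_id=322067 was resolved after just four activities, while the bug with bug_id=322075 was resolved within one day after only two. Bugs clearly differ in difficulty, and so do the processes for handling them.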


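What did the original authors do?

Because these activities are ordered in time, the original authors ran sequential (time-ordered) frequent pattern mining on them and found a number of frequently occurring activity patterns, shown in Figure 3 below:

[Figure 3: frequent activity patterns found by the original authors]

As we can see, two-item patterns occur very frequently. For example, the N-C pattern means that activity N occurred and then activity C occurred, strictly in that order; it occurred 1952 times over the 16 years. The five-item pattern N-C-W-C-Z likewise has a strict order and occurred 143 times. I expect that six-item and seven-item sequential patterns exist as well, only with lower and lower frequency.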
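As a small illustration of what "occurring in strict order" means when counting, the sketch below counts in how many sequences a pattern appears as an ordered subsequence. It assumes that gaps between the matched activities are allowed (which is how the `frequentNum` function in the listing further down behaves); whether the paper counts patterns the same way is not confirmed here. The toy sequences and the function names are invented.

```python
def contains_in_order(sequence, pattern):
    """True if the letters of `pattern` occur in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # `letter in remaining` consumes the iterator up to the first match,
    # so each following letter must be found after the previous one.
    return all(letter in remaining for letter in pattern)

def support(sequences, pattern):
    """Number of sequences that contain `pattern` as an ordered subsequence."""
    return sum(contains_in_order(s, pattern) for s in sequences)

toy = ["NCWCZ", "NCZ", "CNZ", "NZ"]  # made-up activity strings
print(support(toy, "NC"))     # 2
print(support(toy, "NCWCZ"))  # 1
```

Note that "CNZ" does not count toward N-C, because its C comes before its N.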

These frequent patterns can be found easily with the GSP algorithm; see: http://blog.csdn.net/chixujohnny/article/details/47335911
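For readers who do not want to follow the link, here is a simplified level-wise search in the spirit of GSP. It is only a sketch: real GSP generates candidates by joining frequent patterns and supports time constraints, whereas this version simply appends one frequent letter at a time and keeps candidates that reach `min_support`. The name `mine_frequent` and the toy data are assumptions for illustration.

```python
def is_subsequence(pattern, sequence):
    """True if `pattern` occurs in `sequence` in order, gaps allowed."""
    remaining = iter(sequence)
    return all(letter in remaining for letter in pattern)

def mine_frequent(sequences, min_support, max_len=5):
    """Return {pattern: support} for all patterns up to max_len with support >= min_support."""
    def support(pattern):
        return sum(is_subsequence(pattern, s) for s in sequences)

    # Level 1: frequent single activities.
    frequent = {}
    level = []
    for letter in sorted(set("".join(sequences))):
        count = support(letter)
        if count >= min_support:
            frequent[letter] = count
            level.append(letter)
    items = list(level)

    # Level k+1: extend every frequent pattern of length k by one frequent letter.
    length = 1
    while level and length < max_len:
        next_level = []
        for prefix in level:
            for letter in items:
                candidate = prefix + letter
                count = support(candidate)
                if count >= min_support:
                    frequent[candidate] = count
                    next_level.append(candidate)
        level = next_level
        length += 1
    return frequent

toy = ["NCWCZ", "NCZ", "CNZ", "NZ"]  # made-up activity strings
result = mine_frequent(toy, min_support=3)
print(result)
```

Only frequent patterns are extended, because a pattern can never be more frequent than any of its prefixes.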


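What do I want to do?

What I want to do is an extended form of sequential frequent pattern mining that yields more patterns. For example, the two-item pattern C-C, the three-item pattern C-C-C and the five-item pattern C-C-C-C-C can all be reduced to the pattern (C), meaning that activity C occurred several times in a row. For the algorithm, see: http://blog.csdn.net/chixujohnny/article/details/50569430

Beyond this kind of pattern, I also want to add an "out-of-order" pattern. That algorithm is not implemented yet; if I do implement it, I will add a link below.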
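The reduction to (C) can be sketched with a regular expression that collapses consecutive repeats of a unit into one bracketed unit. This is a simplified, greedy stand-in and not the `Brief` function from the listing further down: it tries unit lengths 1, 2 and 3 in turn and assumes activities are single uppercase letters. The name `collapse_repeats` is hypothetical.

```python
import re

def collapse_repeats(sequence, max_unit=3):
    """Replace runs of a repeated unit (e.g. 'CCC', 'ABABAB') with '(unit)'."""
    for unit in range(1, max_unit + 1):
        # ([A-Z]{n}) captures a unit of n letters; \1+ requires at least one repeat.
        pattern = r"([A-Z]{%d})\1+" % unit
        sequence = re.sub(pattern, r"(\1)", sequence)
    return sequence

print(collapse_repeats("CC"))          # (C)
print(collapse_repeats("CCCCC"))       # (C)
print(collapse_repeats("DCCCABABAB"))  # D(C)(AB)
```

Because shorter units are collapsed first, a string such as "AABAAB" becomes "(A)B(A)B" rather than "(AAB)"; choosing the shortest overall abbreviation needs the extra bookkeeping that `Brief` attempts.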


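Expected results?

As for the final result, I am not sure I can mine anything of landmark significance, but in research you have to at least try.

The expected result is that the extended sequential frequent pattern mining algorithm I have worked out finds the patterns I am looking for, and that those patterns have a high support count; only then can we call them frequent. (If the support is too low the pattern means little, and the mining was essentially wasted.)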



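OK, that is about it. Time to write the algorithm; I will keep updating below when there are results:

3.5 update: source code for version 1.3:

I originally thought no pruning was possible, but after digging in I found there is something to prune after all: branches whose support is 0 (min_support == 0), which is a bit embarrassing. This pruning does not look like it reduces the time complexity much, but until now I had only tested the code on random sequences. This time I fed in more than 10,000 records, and data that would not have finished in my lifetime actually finished. The notion of a time threshold has not been added yet, so the code is still incomplete; the time-threshold version will be released in version 2.x.

Enough talk, here is the code, shared for learning. Just run it directly in PyCharm.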
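Why pruning zero-support branches is safe: if a pattern occurs in no sequence, neither can any longer pattern built by appending to it. The sketch below counts how many candidates get evaluated with and without that pruning on toy data. It is an illustration under assumptions (gaps allowed when matching, made-up sequences, hypothetical function names), not a measurement of the listing that follows.

```python
def is_subsequence(pattern, sequence):
    """True if `pattern` occurs in `sequence` in order, gaps allowed."""
    remaining = iter(sequence)
    return all(letter in remaining for letter in pattern)

def support(sequences, pattern):
    return sum(is_subsequence(pattern, s) for s in sequences)

def count_candidates(sequences, alphabet, max_len, prune):
    """Return how many candidates of length 2..max_len are evaluated."""
    level = list(alphabet)
    evaluated = 0
    for _ in range(max_len - 1):
        next_level = []
        for prefix in level:
            for letter in alphabet:
                candidate = prefix + letter
                evaluated += 1
                # With pruning, a candidate with support 0 is never extended again.
                if not prune or support(sequences, candidate) > 0:
                    next_level.append(candidate)
        level = next_level
    return evaluated

toy = ["NCWCZ", "NCZ", "CNZ", "NZ"]  # made-up activity strings
without_pruning = count_candidates(toy, "CNZ", max_len=4, prune=False)
with_pruning = count_candidates(toy, "CNZ", max_len=4, prune=True)
print(without_pruning, with_pruning)
```

Without pruning the candidate count grows as a power of the alphabet size, while with pruning it is bounded by the patterns that actually occur in the data, which is consistent with the large speed-up on real records compared to random sequences.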


#coding:utf-8

__author__ = 'ChiXu_15s103144_HIT'

import copy
import sys
import csv
#----------------------------------------------------------#
#                   Compute Frequent_1                     #
#----------------------------------------------------------#
def freq1(dataSet, freq_num):
    freq_1 = [] ; Ele = []
    for i in range(len(dataSet)):
        SID = splitToAlphabet(dataSet[i])
        setSID = list(set(SID)) # count each letter at most once per sequence
        Ele += setSID
    setEle = list(set(Ele))
    for item in setEle:
        if Ele.count(item) >= freq_num: # number of sequences containing this letter
            freq_1.append(item)
    print('Frequent 1-itemsets: %s' %freq_1)
    return freq_1

#----------------------------------------------------------#
#                 Compute Frequent_more                    #
#----------------------------------------------------------#
def freq_more(dataSet, freq_num, freq_1):
    x = []
    queue = []
    itemAppear = []
    itemAppearNum = []
    freqItem = []

    for a in freq_1:
        x.append(a)
        queue.append(x)# queue = [['A'],['B'],['D'],['F']]
        x = []

    while queue != []: # handle the repeated (abbreviated) forms first
        queueDemo = extendMember(queue, freq_1) # extend the queue members, example:[['A', 'A'], ['A', 'B'], ['A', 'D'], ['B', 'A'], ['B', 'B'], ['B', 'D'], ['D', 'A'], ['D', 'B'], ['D', 'D']]
        queue = []

        for item in queueDemo: # item:['M','W','M']
            itemBrief = combinToString(item)
            lenth = 3 # look for repeated units of length 3 at most
            for i in range(lenth):
                itemBrief = Brief(itemBrief, i+1) # i+1 is the unit length to look for, itemBrief: 'D(C)(AB)'

            itemFreqNum = frequentNum(dataSet, item)

            if '(' in itemBrief: # the item has an abbreviated form
                itemNoBracket = noBracket(itemBrief) # the abbreviated form without brackets

                if itemFreqNum == 0:
                    sys.stdout.write('.')
                    continue
                elif len(itemNoBracket) == 1: # it reduces to a frequent 1-itemset
                    freqItem.append(itemNoBracket)
                    print ('')
                    sys.stdout.write('Found frequent 1-itemset: %s' %combinToString(itemBrief))
                    continue
                elif itemNoBracket in itemAppear:
                    itemAppear.append(combinToString(item))
                    itemAppearNum.append(itemFreqNum)
                    if itemFreqNum + itemAppearNum[itemAppear.index(itemNoBracket)] >= freq_num:
                        print ('')
                        sys.stdout.write('Abbreviated frequent pattern: %s -> %s' %(combinToString(item), itemBrief))
                    else:
                        sys.stdout.write('.')

            else: # no abbreviated form
                itemAppear.append(combinToString(item))
                itemAppearNum.append(itemFreqNum)

                if itemFreqNum >= freq_num:
                    print ('')
                    sys.stdout.write('Unabbreviated frequent pattern: %s' %combinToString(item))
                elif itemFreqNum == 0:
                    sys.stdout.write('.')
                    continue
                else: # 0 < itemFreqNum < freq_num
                    sys.stdout.write('.')
                    pass

            if itemFreqNum >0: # items with support 0 are pruned: they are not extended further
                queue.append(item)

#----------------------------------------------------------#
#                 Extend the queue members                 #
#----------------------------------------------------------#
def extendMember(queue, freq_1): #queueDemo
    queueDemo = []

    for item in queue:
        itemString = combinToString(item)
        for alphabet in freq_1:
            String = itemString + alphabet
            queueDemo.append(splitToAlphabet(String))
    #print(queueDemo)
    return  queueDemo

#----------------------------------------------------------#
#               Compute the support of item                #
#----------------------------------------------------------#
def frequentNum(dataSet, item): #freq_num
    # item: ['A','B','B','D']
    flag = 0
    alphabetAppearTimes = 0
    freq_num = 0

    for SID in dataSet:
        SIDalphabetList = splitToAlphabet(SID) # split this SID into a list of letters
        for alphabet in item:
            if alphabet in SIDalphabetList: # the letter exists somewhere in the SID
                while flag <= len(SIDalphabetList)-1: # scan forward from the last matched position
                    if SIDalphabetList[flag] == alphabet:
                        flag += 1
                        alphabetAppearTimes += 1 # count how many letters of item were matched in order
                        break
                    else:
                        flag += 1
            else:
                break # a letter of item is missing from the SID, so stop checking this SID
        if alphabetAppearTimes == len(item): # every letter of item was matched in order
            freq_num += 1
        flag = 0
        alphabetAppearTimes = 0
    return freq_num

#----------------------------------------------------------#
#          Return True if item occurs at least once        #
#----------------------------------------------------------#
def frequentNumOnlyAppear (dataSet, item):
    # item: ['A','B','B','D']
    flag = 0
    alphabetAppearTimes = 0

    for SID in dataSet:
        SIDalphabetList = splitToAlphabet(SID) # split this SID into a list of letters
        for alphabet in item:
            if alphabet in SIDalphabetList: # the letter exists somewhere in the SID
                while flag <= len(SIDalphabetList)-1: # scan forward from the last matched position
                    if SIDalphabetList[flag] == alphabet:
                        flag += 1
                        alphabetAppearTimes += 1 # count how many letters of item were matched in order
                        break
                    else:
                        flag += 1
            else:
                break # a letter of item is missing from the SID, so stop checking this SID
        if alphabetAppearTimes == len(item): # every letter of item was matched in order
            return True
        flag = 0
        alphabetAppearTimes = 0
    return False

#----------------------------------------------------------#
#                          Brief                           #
#----------------------------------------------------------#
def Brief(item, lenth): # item:'D(C)ABABAB'  return:'D(C)(AB)'  lenth: length of each repeated unit

    itemSplit = []
    breakFlag = 0 # flag for leaving the while loop early
    groupNewDemoList = [] # all candidate abbreviations; the shortest one is chosen as the best
    finalString = []
    minItem = ''

    if '(' in item:
        key = 0 # a pointer into item
        itemLen = len(item)
        while key < itemLen:
            string = ''
            while key < itemLen and item[key] != '(':
                string += item[key]
                key += 1
            if string != '':
                itemSplit.append(string)
            key += 1
            string = ''
            while key < itemLen and item[key] != ')':
                string += item[key].lower()
                key += 1
            if string != '':
                itemSplit.append(string)
            key += 1

    else:
        itemSplit.append(item)
    #  ['DCCCABABAB']

    for x in itemSplit:
        flag = 0 # start offset used when splitting into units
        while flag<lenth:
            if len(x) >= 2*lenth: # only then can a unit repeat at all
                alphabetList = splitToAlphabet(x)
                alphabetList.append('')
                alphabetList.insert(0, '') # ['','A','B','A','B','A','B','']
                groupNewDemo = []
                group = makeGroup(alphabetList, lenth, flag) # split into groups of lenth letters; flag is the start position
                longestNum = longestItemNum(group, lenth) # how many full-length groups there are
                if longestNum == 1: # a single group cannot repeat, so stop
                    breakFlag = 1 # flag for leaving the while loop early
                    break
                if longestNum > 1:
                    groupNew = copy.deepcopy(group)
                    j = flag + 1
                    while j<len(groupNew)-1:
                        if groupNew[j]==groupNew[j+1] and groupNew[j]!=groupNew[j-1] and len(groupNew[j])==len(groupNew[j+1])==lenth: # insert an opening bracket
                            groupNew.insert(j, '(')
                            j += 2
                        elif groupNew[j]!=groupNew[j+1] and groupNew[j]==groupNew[j-1] and len(groupNew[j])==len(groupNew[j-1])==lenth: # insert a closing bracket
                            groupNew.insert(j+1, ')')
                            j += 2
                        else:
                            j += 1
                    # example: groupNew = ['','A','(','BD','BD',')','F','']
                    sign = 1
                    if '(' in groupNew:
                        while sign<len(groupNew)-1: # while groupNew still has unprocessed groups
                            if groupNew[sign]!='(':
                                groupNewDemo.append(groupNew[sign])
                                sign += 1
                            else: # reached a '('
                                groupNewDemo.append('(')
                                groupNew.remove('(')
                                groupNewDemo.append(groupNew[sign])
                                groupNewDemo.append(')')
                                positionBracket = groupNew.index(')')
                                groupNew.remove(')')
                                sign = positionBracket
                        for i in range(sign, len(groupNew)-1):
                            groupNewDemo.append(groupNew[sign]) # groupNewDemo = ['A','(','BD',')','F']
                        groupNewDemoList.append(combinToString(groupNewDemo)) # ['(AB)','A(BA)B']
                    minLen = 10000
                    ti = 0
                    while ti < len(groupNewDemoList):
                        if len(groupNewDemoList[ti]) < minLen:
                            minLen = len(groupNewDemoList[ti])
                            minItem = groupNewDemoList[ti]
                            ti += 1
                        else:
                            ti += 1
                else:
                    breakFlag=1
                    break
            else:
                break
            if breakFlag == 1:
                break
            flag += 1
        if minItem != '':
            finalString.append(minItem)
            minItem = ''
        else:
            finalString.append(x)
            minItem = ''
    # finalString = ['D','c','(AB)'], joined below into 'D(C)(AB)'
    flag = 0 # pointer
    final = '' #return
    while flag < len(finalString):
        if finalString[flag][0].islower() == True: # lowercase marks a part that was already bracketed, so restore the brackets
            final = final + '(' + finalString[flag].upper() + ')'
            flag += 1
        else: # uppercase letters or brackets
            final += finalString[flag]
            flag += 1
    #print(final)
    return final # 'D(C)(AB)'

#----------------------------------------------------------#
#           Remove all brackets from itemBrief             #
#----------------------------------------------------------#
def noBracket(itemBrief):
    alphabetList = splitToAlphabet(itemBrief)
    flag = 0

    while flag < len(alphabetList):
        if alphabetList[flag] != '(' and alphabetList[flag] != ')':
            flag += 1
        else:
            alphabetList.pop(flag)
    #print combinToString(alphabetList)
    return combinToString(alphabetList)

#----------------------------------------------------------#
#   Count occurrences of item in the uppercased special list   #
#----------------------------------------------------------#
def changeSpecialNum(changeSpecial, item): #appearTimes
    if item not in changeSpecial:
        return 0
    else:
        appearTimes = changeSpecial.count(item)
        return appearTimes
#----------------------------------------------------------#
#               Split a string into letters                #
#----------------------------------------------------------#
def splitToAlphabet(item): #alphabetList
    return list(item) # 'ABD' -> ['A','B','D']
#----------------------------------------------------------#
#               Join letters into a string                 #
#----------------------------------------------------------#
def combinToString(briefItemList): #briefItem
    return ''.join(briefItemList) # ['A','(','BD',')'] -> 'A(BD)'

#----------------------------------------------------------#
#               Split the string into groups               # alphabetList=['','A','B','B','B','D','']
#----------------------------------------------------------#
def makeGroup(alphabetList, num, flag): # group     num: number of letters per group
    alphabet = ''
    alphabetListNew = []

    #alphabetList = ['','A','B','B','B','D','']
    if num == 1:
        #print(alphabetList)
        return alphabetList
    else:
        alphabetList.pop(0)
        alphabetList.pop(-1) # drop the empty placeholders at both ends
        for i in range(flag):
            alphabetListNew.append(alphabetList[i])
        while len(alphabetList) - flag >= num:
            for i in range(num):
                alphabet += alphabetList[flag+i]
            alphabetListNew.append(alphabet)
            flag = flag + num # move the start position forward by num
            alphabet = ''
        for i in range(flag, len(alphabetList)): # append the leftover letters one by one
            alphabetListNew.append(alphabetList[i])
        alphabetListNew.insert(0, '')
        alphabetListNew.append('')
        # alphabetListNew = ['','AB','BB','D','']
        #print(alphabetListNew)
        alphabetList.append('')
        alphabetList.insert(0, '')
        return alphabetListNew
#----------------------------------------------------------#
#         Count the groups that have the full length       #
#----------------------------------------------------------#
def longestItemNum(group, lenth):
    longest = 0
    itemNum = 0

    if lenth == 1:
        return len(group) - 2
    else:
        for item in group:
            if len(item) == longest:
                itemNum += 1
            elif len(item) > longest:
                itemNum = 1
                longest = len(item)
            else:
                continue
        return itemNum


# main
CSV_PATH = '/Users/John/Desktop/temporal_activity.csv'
print('File path: %s' % CSV_PATH)

# Read the csv file into dataset, grouping consecutive rows that share an issue id
dataset = []
with open(CSV_PATH, newline='') as csvfile: # text mode, as the csv module expects in Python 3
    for line in csv.reader(csvfile):
        issueDetail = line[1:] # issueDetail = ['2015-01-11 22:22:48', 'Z', '12']
        if dataset == [] or line[0] != dataset[-1][0]: # a new issue_id
            dataset.append([line[0], issueDetail]) # dataset = [['1116689', ['2015-01-11 22:22:48', 'Z', '12']]]
        else: # still the previous issue_id
            dataset[-1].append(issueDetail) # dataset = [['1116689', ['2015-01-11 22:22:48', 'Z', '12'], ['2015-01-11 22:22:48', 'C', '12']]]

# Extract the activity letters of each issue into one string
datasetActions = []
for item in dataset:
    datasetActionsItem = ''
    for i in range(1, len(item)):
        datasetActionsItem += item[i][1]
    datasetActions.append(datasetActionsItem)
print(datasetActions)

print('Preprocessing done! Total bugs: %d' %len(datasetActions))

# Split the first 10000 bugs by sequence length (the slice also avoids an IndexError on smaller files)
datasetActions_less = []
datasetActions_more = []
for actions in datasetActions[:10000]:
    if len(actions) <= 16:
        datasetActions_less.append(actions)
    else:
        datasetActions_more.append(actions)
print('Activity sequences of length <= 16: %d' %len(datasetActions_less))
print('Activity sequences of length > 16: %d' %len(datasetActions_more))


freq1_num = int(input('Enter the minimum support for freq1: '))
freq_1 = freq1(datasetActions_less, freq1_num)
freqMore_num = int(input('Enter the minimum support for freq_more: '))
freq_more(datasetActions_less, freqMore_num, freq_1)

