Python Data Mining Reading Notes (1): the Apriori Algorithm for Association Rule Mining, with Examples

Overview

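1. The KDD process: problem statement, data collection and storage, data cleaning, data mining, representation and visualization, problem resolution.

2. Frequent itemset: an itemset is a set of items, and it is frequent when it appears in at least a minimum share of all transactions. In this post each transaction is a shopping basket.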

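3. Support (S): the fraction of all baskets in which the antecedent and the consequent appear together.

4. Confidence (C): the percentage of baskets that contain both the antecedent and the consequent, divided by the percentage of baskets that contain the antecedent.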

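5. Association rule: a simple example first.

vanilla wafers -> bananas, whipped cream [support = 1%, confidence = 40%]

We can read this rule as: 1% of all baskets (itemsets) contain vanilla wafers, bananas and whipped cream together, and 40% of the customers who bought vanilla wafers also bought bananas and whipped cream.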

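6. Added value (AV): the rule's confidence minus the support of its right-hand side. A large positive value means the rule is useful; a value near zero means it is of little use; a large negative value means the two sides are negatively correlated.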

I. Example: a store's shopping baskets

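The author, Megan Squire, lives in North America and sets up the example with two assumptions:
a. People like banana pudding (its ingredients are vanilla wafers and bananas)
b. Bananas are popular
The store's 10 purchase records are shown in the table below:
[Figure 1: table of the 10 purchase records (image not included)]
The support of each individual item is shown in the table below:

[Figure 2: table of per-item support (image not included)]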

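Take the itemset {vanilla wafers, bananas} as an example.

vanilla wafers -> bananas

The confidence of this rule = support(vanilla wafers ∪ bananas) / support(vanilla wafers)

c = 4 / 6 ≈ 0.67
av = c - s(bananas)

The association rule can then be written as

vanilla wafers -> bananas [support = 40%, confidence = 67%]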
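The arithmetic above can be checked with a few lines of Python. This is only a sketch: the original table of purchases is not reproduced in these notes, so the ten baskets below are made up to match the counts quoted in the text (vanilla wafers in 6 of 10 baskets, wafers and bananas together in 4). The function names are my own, and the support of bananas is an assumption, so the added value printed here is illustrative only.

```python
# Hypothetical baskets, invented to match the counts quoted in the text:
# vanilla wafers appear in 6 of 10 baskets, wafers + bananas in 4 of 10.
baskets = [
    {"vanilla wafers", "bananas", "dog food"},
    {"vanilla wafers", "bananas"},
    {"vanilla wafers", "bananas", "bread"},
    {"vanilla wafers", "bananas", "milk"},
    {"vanilla wafers", "milk"},
    {"vanilla wafers", "bread"},
    {"bananas", "bread"},
    {"bananas", "milk"},
    {"bread", "milk"},
    {"dog food"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def added_value(antecedent, consequent, baskets):
    """Rule confidence minus the support of the right-hand side."""
    return confidence(antecedent, consequent, baskets) - support(consequent, baskets)


s = support({"vanilla wafers", "bananas"}, baskets)
c = confidence({"vanilla wafers"}, {"bananas"}, baskets)
av = added_value({"vanilla wafers"}, {"bananas"}, baskets)
print(f"support = {s:.0%}, confidence = {c:.0%}, added value = {av:.2f}")
```

Representing each basket as a Python set makes "the basket contains this itemset" a plain subset test (`<=`), which keeps all three measures to one line each.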

II. Finding association rules in software project tags

This section applies the idea behind the Apriori algorithm.

https://baike.baidu.com/pic/APRIORI/2000746/0/32bb9c8b07d15384fc1f100c?fr=lemma&ct=single#aid=0&pic=32bb9c8b07d15384fc1f100c

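1. First, set a support threshold (used to discard infrequent itemsets)

2. Build a list of 1-itemsets, labeled candidatesingletonslist
Compute the support of every singleton in this list
Add the singletons that meet the support threshold to singletonslist

3. Build a list of 2-itemsets
Build the list of all possible pairs of items in singletonslist, labeled candidatedoubletonslist
Add the pairs that meet the support threshold to doubletonslist

4. Build a list of 3-itemsets
Build a list of every single item that appears in doubletonslist, and match each one against every pair in doubletonslist to form triples, labeled candidatetripletonlist
Add the triples that meet the support threshold to tripletonlist

5. Repeat step 4 until no frequent itemsets are left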

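My own understanding: the mining works by repeatedly building itemsets of size {1, 2, … , n - 1} and filtering each level against the threshold.
Megan's code is attached below (my copy may differ slightly from the original):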
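The steps above can be sketched in memory, without a database. This is not the book's code: the function name `apriori`, its parameters, and the sample tag baskets are all hypothetical, and it assumes an itemset counts as frequent when its count is greater than or equal to the threshold.

```python
from itertools import combinations


def apriori(baskets, min_support_count, max_size=3):
    """Level-wise search: frequent k-itemsets are built from frequent (k-1)-itemsets."""
    baskets = [frozenset(b) for b in baskets]

    def count(itemset):
        # Number of baskets that contain every item of itemset
        return sum(itemset <= b for b in baskets)

    # Level 1: keep the singletons that meet the threshold
    all_items = {item for b in baskets for item in b}
    current = {}
    for item in all_items:
        single = frozenset([item])
        n = count(single)
        if n >= min_support_count:
            current[single] = n

    frequent = {}
    size = 1
    while current:
        frequent.update(current)
        if size == max_size:
            break
        size += 1
        # Candidates are drawn only from items that survived the previous level
        survivors = sorted({item for itemset in current for item in itemset})
        next_level = {}
        for candidate in combinations(survivors, size):
            candidate = frozenset(candidate)
            # Prune: every (size - 1)-subset must itself be frequent
            if all(frozenset(sub) in current
                   for sub in combinations(candidate, size - 1)):
                n = count(candidate)
                if n >= min_support_count:
                    next_level[candidate] = n
        current = next_level
    return frequent


# Hypothetical project-tag baskets
tag_baskets = [
    {"Web", "Internet", "PHP"},
    {"Web", "Internet"},
    {"Web", "Internet", "PHP"},
    {"Web", "PHP"},
    {"Linux"},
]
result = apriori(tag_baskets, min_support_count=2)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

The pruning test is what saves work: a candidate is only counted against the baskets if all of its smaller subsets were already found to be frequent.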

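# Apriori algorithm

# TODO: count the baskets (transactions) and use the minimum support threshold to find the frequent singletons

import pymysql
import itertools

# minimum support, as a percentage of all baskets
# for example, 5% of the baskets (transactions) = 2325

MINSUPPORTPCT = 5 # 5 percent
allSingletonTags = []
allDoubletonTags = set()  # set() creates an unordered collection of unique elements; it supports membership tests, removes duplicates, and can compute intersections, differences and unions
doubletonSet = set()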

# open local db connection
db = pymysql.connect(host = 'localhost',
                     db = 'test',
                     user = 'root',
                     passwd = '123456',
                     port = 3306,
                     charset = 'utf8')  # writing 'utf-8' here raised an error for me; use 'utf8'

cursor = db.cursor()

# count the baskets
querBaskets = "select count(distinct project_id) from fc_project_tags;"
cursor.execute(querBaskets)
baskets = cursor.fetchone()[0]  # number of baskets; fetchone() returns a one-element tuple such as (46510,), so [0] extracts the number
#print(baskets)


# compute the basket count that corresponds to the minimum support percentage: 5%
minsupport = baskets*(MINSUPPORTPCT/100)
print("Minimum support count :", minsupport, "(", MINSUPPORTPCT, "% of", baskets,")")
#print("Minimum support count: %s (%s percent of %s)" % (str(minsupport), str(MINSUPPORTPCT), str(baskets)))

# get the tags that meet the minimum support
cursor.execute("select distinct tag_name \
               from fc_project_tags \
               group by 1 \
               having count(project_id) >= %s \
               order by tag_name ",(minsupport,))

singletons = cursor.fetchall()

for singleton in singletons:
    allSingletonTags.append(singleton[0])
#print(allSingletonTags)



def findBoubletons():
    print("=====")
    print("Frequent doubletons found:")
    print("=====")
    # use the list of allSingletonTags to make the doubleton candidates
    doubletonCandidates = list(itertools.combinations(allSingletonTags, 2)) # combinations('ABCD', 2) --> AB AC AD BC BD CD
                                                                            # order is ignored; the similar permutations() takes order into account
    for (index, candidate) in enumerate(doubletonCandidates):  # enumerate yields (index, item) pairs
        # figure out whether this doubleton candidate is frequent
        tag1 = candidate[0]
        tag2 = candidate[1]
        cursor.execute("select count(fpt1.project_id) \
                        from fc_project_tags fpt1 \
                        inner join fc_project_tags fpt2 \
                        on fpt1.project_id = fpt2.project_id \
                        where fpt1.tag_name = %s \
                        and fpt2.tag_name = %s ",(tag1, tag2))
        count = cursor.fetchone()[0]
        # add frequent doubleton to database
        if count > minsupport:
            print(tag1, tag2, "[", count, "]")
            cursor.execute("insert into fc_project_tag_pairs (tag1, tag2, num_projs) values (%s, %s, %s)", (tag1, tag2, count))
            # save the frequent doubleton to our final list
            doubletonSet.add(candidate)
            #add terms to a set of all doubleton terms (no duplicates)
            allDoubletonTags.add(tag1)
            allDoubletonTags.add(tag2)

def findTripletons():
    print("======")
    print("Frequent tripletons found:")
    print("======")
    # use the list of allDoubletonTags to make the tripleton candidates
    tripletonCandidates = list(itertools.combinations(allDoubletonTags, 3))
    # sort each candidate tuple and add these to a new sorted candidate list
    tripletonCandidatesSorted = []
    for tc in tripletonCandidates:
        tripletonCandidatesSorted.append(sorted(tc))
#    print(tripletonCandidatesSorted)

    #figure out if this tripleton candidate is frequent
    for (index, candidate) in enumerate(tripletonCandidatesSorted):
        # all doubletons inside this tripleton candidate must be frequent
        doubletonInsideTripleton = list(itertools.combinations(candidate, 2))
        tripletonCandidateRejected = 0
        for (index, doubleton) in enumerate(doubletonInsideTripleton):
            if doubleton not in doubletonSet:
                tripletonCandidateRejected = 1
                break
        # set up query
        getTripletonFrequencyQuery = "SELECT count(fpt1.project_id) \
                                             FROM fc_project_tags fpt1 \
                                             INNER JOIN fc_project_tags fpt2 \
                                             ON fpt1.project_id = fpt2.project_id \
                                             INNER JOIN fc_project_tags fpt3 \
                                             ON fpt2.project_id = fpt3.project_id \
                                             WHERE (fpt1.tag_name = %s \
                                             AND fpt2.tag_name = %s \
                                             AND fpt3.tag_name = %s)"

        insertTripletonQuery = "INSERT INTO fc_project_tag_triples \
                                     (tag1, tag2, tag3, num_projs) \
                                     VALUES (%s,%s,%s,%s)"
        if tripletonCandidateRejected == 0:
            cursor.execute(getTripletonFrequencyQuery, (candidate[0],
                                                        candidate[1],
                                                        candidate[2]))
            count = cursor.fetchone()[0]
            if count > minsupport:
                print(candidate[0], ",",
                      candidate[1], ",",
                      candidate[2],
                      "[", count, "]")
                cursor.execute(insertTripletonQuery, (candidate[0],
                                                      candidate[1],
                                                      candidate[2],
                                                      count))

def generateRules():
    print("======")
    print("Association Rules:")
    print("=====")
    # pull final list of tripletons to make the rules
    getFinalListQuery = "select tag1, tag2, tag3, num_projs \
                         from fc_project_tag_triples  "
    cursor.execute(getFinalListQuery)
    triples = cursor.fetchall()
    for (triple) in triples:
        tag1 = triple[0]
        tag2 = triple[1]
        tag3 = triple[2]
        ruleSupport = triple[3]
        calcSCAV(tag1, tag2, tag3, ruleSupport)
        calcSCAV(tag1, tag3, tag2, ruleSupport)
        calcSCAV(tag2, tag3, tag1, ruleSupport)


def calcSCAV(tagA, tagB, tagC, rulesupport):
    #support
    ruleSuppprtpct = round((rulesupport / baskets), 2)  # rule support as a fraction of all baskets, rounded to 2 decimal places

    #Confidence
    query1 = "select num_projs \
             from fc_project_tag_pairs \
             where (tag1 = %s and tag2 = %s) \
             or (tag2 = %s and tag1 = %s)"
    cursor.execute(query1, (tagA, tagB, tagB, tagA))
    pairSupport = cursor.fetchone()[0]
    confidence = round((rulesupport/pairSupport), 2)

    #added value
    query2 = "select count(*) \
             from fc_project_tags \
             where tag_name = %s"
    cursor.execute(query2, (tagC,))
    supportTagC = cursor.fetchone()[0]
    supportTagCPct = supportTagC/baskets
    addedValue = round((confidence - supportTagCPct), 2)

    #result
    print(tagA, ",", tagB, "->", tagC,
          "[S = ", ruleSuppprtpct,
          ", C = ", confidence,
          ", AV = ", addedValue,
          "]")


findBoubletons()

findTripletons()

generateRules()

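db.commit()  # pymysql leaves autocommit off by default, so on a transactional engine the inserted pairs and triples would otherwise be rolled back on close
db.close()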
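The self-join that counts how many projects carry two tags at once can be tried without a MySQL server. The sketch below is not from the book: it loads a few made-up rows into an in-memory SQLite database (Python's built-in `sqlite3` module) and runs the same join. The table and column names follow the listing above; the rows are hypothetical.

```python
import sqlite3

# Hypothetical (project_id, tag_name) rows
rows = [
    (1, "Internet"), (1, "Web"),
    (2, "Internet"), (2, "Web"), (2, "PHP"),
    (3, "Web"), (3, "PHP"),
    (4, "Internet"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table fc_project_tags (project_id integer, tag_name text)")
conn.executemany("insert into fc_project_tags values (?, ?)", rows)

# Join the table to itself on project_id: each matching row pair is one
# project that carries both tags.
# sqlite3 uses ? placeholders where pymysql uses %s.
pair_query = """
    select count(fpt1.project_id)
    from fc_project_tags fpt1
    inner join fc_project_tags fpt2
    on fpt1.project_id = fpt2.project_id
    where fpt1.tag_name = ?
    and fpt2.tag_name = ?
"""
pair_count = conn.execute(pair_query, ("Internet", "Web")).fetchone()[0]
print("Internet + Web:", pair_count)
conn.close()
```

The count is only correct if each (project, tag) row appears once in the table; duplicate rows would be multiplied by the join.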

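After getting the results, I noticed that the added value between Web and Internet is relatively high (0.8), so we look further at the association between these two tags. The code is as follows: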

# TODO: measure the association between Web and Internet
# x -> y
# s = (baskets containing both x and y) / all baskets
# c = s(x and y) / s(x)
# av = c - s(y)


import pymysql

#specific tag
X = "Internet"
Y = "Web"

# open local db connection
db = pymysql.connect(host = 'localhost',
                     db = 'test',
                     user = 'root',
                     passwd = '123456',
                     port = 3306,
                     charset = 'utf8')  # writing 'utf-8' here raised an error for me; use 'utf8'

cursor = db.cursor()

# get basic counts from the database
# total number of baskets
numBasketsQuery = "select count(distinct project_id) from fc_project_tags;"
cursor.execute(numBasketsQuery)
numBaskets = cursor.fetchone()[0]

supportForXYQuery = "select count(*) from fc_project_tags where tag_name = %s"
# the support of X
cursor.execute(supportForXYQuery, (X,))
supportForX = cursor.fetchone()[0]
# the support of Y
cursor.execute(supportForXYQuery, (Y,))
supportForY = cursor.fetchone()[0]
# X and Y appearing together (the first script stores each pair once, with tag1 before tag2 in sorted order, so X must sort before Y)
pairSupportQuery = "select num_projs from fc_project_tag_pairs where tag1 = %s and tag2 = %s"
cursor.execute(pairSupportQuery, (X, Y))
pairSupport = cursor.fetchone()[0]

#get support percent
#X and Y
pairsupportAsPct = pairSupport / numBaskets

#calculate confidence of X -> Y
supportForXAsPct = supportForX / numBaskets
confidenceXY = pairsupportAsPct / supportForXAsPct

#calculate confidence of Y -> X
supportForYAsPct = supportForY / numBaskets
confidenceYX = pairsupportAsPct / supportForYAsPct

# calculate added value
AVXY = confidenceXY - supportForYAsPct
AVYX = confidenceYX - supportForXAsPct

print("Support for ", X, "U", Y, ":", round(pairsupportAsPct, 4))
print("Conf.", X, "->", Y, ":", round(confidenceXY, 4))
print("Conf.", Y, "->", X, ":", round(confidenceYX, 4))
print("AV.", X, "->", Y, ":", round(AVXY, 4))
print("AV.", Y, "->", X, ":", round(AVYX, 4))
db.close()

The results are:

Support for  Internet U Web : 0.1285
Conf. Internet -> Web : 0.738
Conf. Web -> Internet : 0.9539
AV. Internet -> Web : 0.6033
AV. Web -> Internet : 0.7797

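We conclude that Web and Internet are closely related.
Web projects could probably be given the Internet tag as well;
Web projects could also be recommended to people who follow Internet projects.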
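As a sanity check on the printed numbers, the individual supports of the two tags can be recovered in two independent ways, using only the formulas from the script's header comments. The variable names below are my own; the values are copied from the results above.

```python
# Values copied from the printed results
support_xy = 0.1285    # Internet and Web together
conf_x_to_y = 0.738    # Internet -> Web
conf_y_to_x = 0.9539   # Web -> Internet
av_x_to_y = 0.6033
av_y_to_x = 0.7797

# av = c - s(consequent), so s(consequent) = c - av
support_web = conf_x_to_y - av_x_to_y
support_internet = conf_y_to_x - av_y_to_x

# c = s(x and y) / s(x), so s(x) = s(x and y) / c
support_internet_check = support_xy / conf_x_to_y
support_web_check = support_xy / conf_y_to_x

print("s(Internet):", round(support_internet, 4), "vs", round(support_internet_check, 4))
print("s(Web):", round(support_web, 4), "vs", round(support_web_check, 4))
```

Both routes agree to about three decimal places: roughly 17% of projects carry the Internet tag and roughly 13% carry the Web tag. Web is the rarer tag, which is why almost every Web project is also an Internet project while the reverse confidence is lower.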

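III. Summary

Well, here is what I took away.
1. I now have a basic feel for the Apriori algorithm; the idea of generating itemsets and then checking them level by level is impressive.

2. cursor.fetchone()[0] in pymysql is very handy, and seeing it was a small revelation.
The same goes for itertools.permutations():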

import itertools

l = ['a', 'b', 'c', 'd']

c = list(itertools.permutations(l, 2))  # permutations: order matters
for i in enumerate(c):
    print(i)

(0, ('a', 'b'))
(1, ('a', 'c'))
(2, ('a', 'd'))
(3, ('b', 'a'))
(4, ('b', 'c'))
(5, ('b', 'd'))
(6, ('c', 'a'))
(7, ('c', 'b'))
(8, ('c', 'd'))
(9, ('d', 'a'))
(10, ('d', 'b'))
(11, ('d', 'c'))

and itertools.combinations():

import itertools

l = ['a', 'b', 'c', 'd']

c = list(itertools.combinations(l, 2))  # combinations: order is ignored
for i in enumerate(c):
    print(i)

(0, ('a', 'b'))
(1, ('a', 'c'))
(2, ('a', 'd'))
(3, ('b', 'c'))
(4, ('b', 'd'))
(5, ('c', 'd'))
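The relationship between the two outputs above can be checked directly. In this small sketch (variable names are my own), each ordered pair from permutations() is collapsed to a sorted tuple; after removing duplicates, what remains is exactly the output of combinations().

```python
import itertools

items = ['a', 'b', 'c', 'd']
perms = list(itertools.permutations(items, 2))
combs = list(itertools.combinations(items, 2))

# Collapse each ordered pair to a sorted tuple and drop duplicates;
# only the unordered pairs are left.
collapsed = sorted({tuple(sorted(p)) for p in perms})

print(len(perms), len(combs))
print(collapsed == combs)
```

This is why the Apriori code uses combinations(): an itemset has no order, so ('a', 'b') and ('b', 'a') are the same candidate and should be counted only once.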

If you spot any problems, please point them out.

Reference book:
Megan Squire, 《Python数据挖掘:概念、方法与实践》, translated by 姚军
