11.2 Association Rule Example

Example 3. The example below uses dvdtrans.csv, found in the csv directory of the rattle package. The file records DVD purchases.

1. Load the packages and the data:

> library(rattle)

> library(arules)

> dvd <- read.csv("F:\\R\\R-3.2.2\\library\\rattle\\csv/dvdtrans.csv", header=T)    # the file sits under the directory of the installed package

> class(dvd)

[1] "data.frame"             # dvd is a data.frame; it has to be converted to the "transactions" class that arules can process

> str(dvd)

'data.frame':   30 obs. of  2 variables:

 $ ID  : int  1 1 1 1 1 2 2 2 3 3 ...

 $ Item: Factor w/ 10 levels "Braveheart","Gladiator",..: 10 7 4 3 8 2 9 1 7 8 ...
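The absolute path in the read.csv() call is specific to one machine. A minimal sketch of a more portable lookup, assuming rattle is installed and ships csv/dvdtrans.csv as described: system.file() returns an empty string when the package or file is missing. Since R 4.0.0, read.csv() no longer converts strings to factors by default, so stringsAsFactors = TRUE is passed here to reproduce the Factor column seen in the str() output.

```r
# Locate dvdtrans.csv inside the installed rattle package instead of
# hard-coding an absolute path. system.file() returns "" if not found.
csv_path <- system.file("csv", "dvdtrans.csv", package = "rattle")

if (nzchar(csv_path)) {
  # stringsAsFactors = TRUE reproduces the Factor column on R >= 4.0.0.
  dvd <- read.csv(csv_path, header = TRUE, stringsAsFactors = TRUE)
  str(dvd)
} else {
  message("dvdtrans.csv not found; is the rattle package installed?")
}
```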

 

2. The data is an ordinary data frame and has to be converted into the format that apriori() and eclat() accept:

> data <- as(split(dvd[, 'Item'], dvd[, 'ID']), 'transactions')

> class(data)

[1] "transactions"

attr(,"package")

[1] "arules"
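What split() hands to as() may be easier to see on a small piece of the data. The sketch below uses base R only and the first three transactions of this example (the variable names dvd_small, baskets and basket_sizes are illustrative, not part of the original code): split() turns the long ID/Item table into one character vector per transaction ID.

```r
# First three transactions of the example, in long (ID, Item) form.
dvd_small <- data.frame(
  ID   = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  Item = c("Sixth Sense", "LOTR1", "Harry Potter1", "Green Mile", "LOTR2",
           "Gladiator", "Patriot", "Braveheart",
           "LOTR1", "LOTR2"),
  stringsAsFactors = FALSE
)

# split() groups the Item column by ID: one character vector per transaction.
baskets <- split(dvd_small$Item, dvd_small$ID)
str(baskets)

# Number of items in each basket, named by transaction ID.
basket_sizes <- lengths(baskets)
print(basket_sizes)
```

A named list of this shape is one of the structures that as(..., 'transactions') can convert.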

 

Inspect the converted data:

> inspect(data[1:3])

  items           transactionID

1 {Green Mile,                

   Harry Potter1,             

   LOTR1,                     

   LOTR2,                     

   Sixth Sense}               1

2 {Braveheart,                

   Gladiator,                 

   Patriot}                   2

3 {LOTR1,                     

   LOTR2}                     3

> dim(data)

[1] 10 10

> summary(data)

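transactions as itemMatrix in sparse format with

 10 rows (elements/itemsets/transactions) and

 10 columns (items) and a density of 0.3

There are 10 transactions and 10 distinct items; density = 0.3 is the share of cells equal to 1 in the sparse matrix (30 purchases out of 10 × 10 cells).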

 

most frequent items:

    Gladiator       Patriot   Sixth Sense    Green Mile Harry Potter1       (Other)
            7             6             6             2             2             7

The most frequent items and their counts. Gladiator is clearly the most popular film, followed by Patriot and Sixth Sense.

 

element (itemset/transaction) length distribution:

sizes

2 3 4 5

3 5 1 1

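   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2.00    2.25    3.00    3.00    3.00    5.00

The number of items per transaction, with its five-number summary and mean. Three transactions contain 2 items, five contain 3 items, and one each contains 4 and 5 items. The first quartile is 2.25 (three of the ten transactions contain only 2 items), and the mean shows that a transaction contains 3 items on average.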

 

includes extended item information - examples:

      labels

1 Braveheart

2  Gladiator

3 Green Mile

includes extended transaction information - examples:

  transactionID

1             1

2             2

3             3

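If the data set contains columns other than the transaction ID and Item (for example the transaction time or a user ID), they are shown here. This example has no extra columns; labels are simply the item names.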
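The density and the length statistics reported by summary(data) can be re-derived by hand. A small base R sketch, using only figures that appear in the output above; the density calculation assumes that no (ID, Item) pair is repeated in the raw file.

```r
# Figures taken from the output above: 30 (ID, Item) rows, 10 transactions,
# 10 distinct items.
n_purchases    <- 30
n_transactions <- 10
n_items        <- 10

# Density = filled cells / all cells of the transaction-by-item matrix.
# Assumes no (ID, Item) pair is repeated in the raw data.
density <- n_purchases / (n_transactions * n_items)
print(density)

# Rebuild the per-transaction sizes from the length distribution:
# 3 transactions of size 2, 5 of size 3, 1 of size 4, 1 of size 5.
sizes <- rep(c(2, 3, 4, 5), times = c(3, 5, 1, 1))
size_summary <- summary(sizes)
print(size_summary)
```

The quartiles come from R's default interpolation rule, which is why the first quartile is 2.25 rather than a whole number.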

 

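3. Now apriori() and eclat() can be applied to data to mine association rules. Because there are only a few transactions, the minimum support is set to 0.3 and the minimum confidence to 0.5; the other parameters are left unchanged.

> rules <- apriori(data=data, parameter=list(support=0.3, confidence=0.5))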

Apriori

Parameter specification:

 confidence minval smax arem  aval originalSupport support minlen maxlen
        0.5    0.1    1 none FALSE            TRUE     0.3      1     10
 target   ext
  rules FALSE

 

Algorithmic control:

 filter tree heap memopt load sort verbose

    0.1 TRUE TRUE  FALSE TRUE   2    TRUE

 

Absolute minimum support count: 3

 

set item appearances...[0 item(s)] done [0.00s].

set transactions...[10 item(s), 10 transaction(s)] done [0.00s].

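sorting and recoding items ... [3 item(s)] done [0.00s].

creating transaction tree ... done [0.00s].

checking subsets of size 1 2 3 done [0.00s].

writing ... [12 rule(s)] done [0.00s].

creating S4 object  ... done [0.00s].

The output lists the minimum support and confidence specified for the model, the minimum and maximum number of items per itemset, the type of result, and details of how the algorithm ran.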

 

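The appearance and control arguments are easiest to see on a different data set. The following call uses titanic.raw, which is not part of this example; the result is assigned to its own variable so that rules is not overwritten:

> rules.titanic <- apriori(titanic.raw, control=list(verbose=F), parameter=list(minlen=2, supp=0.005, conf=0.8), appearance=list(rhs=c("Survived=No","Survived=Yes"), default="lhs"))

apriori(data, parameter=list(support, minlen, maxlen, confidence), appearance, control) in the arules package mines association rules from frequent itemsets. The Apriori algorithm finds the frequent itemsets level by level.

data is a transactions object, or any data structure that can be converted to one; common binary matrices and data frames can be converted with as().

parameter is an APparameter object, or a list, holding the mining parameters: support is the minimum support (default 0.1); confidence is the minimum confidence (default 0.8); minlen is the minimum number of items per itemset (default 1; for rules, setting it to 2 drops the rules with an empty left-hand side); maxlen is the maximum number of items (default 10); target selects the type of result, rules or frequent itemsets.

appearance restricts which items may appear in the antecedent X and the consequent Y (usually the items the analyst is interested in); by default no restriction is applied. Setting rhs=c("Survived=No","Survived=Yes") ensures that only "Survived=No" and "Survived=Yes" appear on the right-hand side. With default="lhs", all remaining items may appear on the left-hand side only; default="both" would allow them on either side.

control tunes how the algorithm runs, for example whether items are sorted in ascending or descending order and whether progress is reported. verbose=F suppresses the progress output.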
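To make the effect of appearance, minlen and verbose concrete, here is a minimal sketch on a made-up data set. The five baskets below are hypothetical and are not the dvdtrans.csv data; the names baskets, toy, toy_rules and rhs_items are illustrative. The sketch assumes the arules package is installed.

```r
library(arules)

# Hypothetical toy baskets, made up for illustration only.
baskets <- list(
  c("Gladiator", "Patriot"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Sixth Sense", "Green Mile"),
  c("Gladiator", "Sixth Sense"),
  c("Gladiator", "Patriot")
)
toy <- as(baskets, "transactions")

# Keep only rules whose right-hand side is Gladiator; every other item
# may appear on the left-hand side only. minlen = 2 drops rules with an
# empty left-hand side, and verbose = FALSE silences the progress log.
toy_rules <- apriori(
  toy,
  parameter  = list(support = 0.3, confidence = 0.5, minlen = 2),
  appearance = list(rhs = "Gladiator", default = "lhs"),
  control    = list(verbose = FALSE)
)
inspect(toy_rules)

# Items that occur on the right-hand side of the mined rules.
rhs_items <- unique(unlist(as(rhs(toy_rules), "list")))
print(rhs_items)
```

Restricting the right-hand side this way is useful when only one outcome matters, since it avoids mining and then discarding rules about other items.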

 

> rules

set of 12 rules      # the algorithm produced 12 association rules

 

> inspect(sort(rules, by="lift", decreasing=T)[1:3])       # sorted by lift; support or confidence can be used as well

   lhs                      rhs         support confidence lift
6  {Patriot}             => {Gladiator} 0.6     1.0000000  1.428571
7  {Gladiator}           => {Patriot}   0.6     0.8571429  1.428571
10 {Patriot,Sixth Sense} => {Gladiator} 0.4     1.0000000  1.428571

> inspect(rules)

   lhs                        rhs           support confidence lift
1  {}                      => {Patriot}     0.6     0.6000000  1.000000
2  {}                      => {Sixth Sense} 0.6     0.6000000  1.000000
3  {}                      => {Gladiator}   0.7     0.7000000  1.000000
4  {Patriot}               => {Sixth Sense} 0.4     0.6666667  1.111111
5  {Sixth Sense}           => {Patriot}     0.4     0.6666667  1.111111
6  {Patriot}               => {Gladiator}   0.6     1.0000000  1.428571
7  {Gladiator}             => {Patriot}     0.6     0.8571429  1.428571
8  {Sixth Sense}           => {Gladiator}   0.5     0.8333333  1.190476
9  {Gladiator}             => {Sixth Sense} 0.5     0.7142857  1.190476
10 {Patriot,Sixth Sense}   => {Gladiator}   0.4     1.0000000  1.428571
11 {Gladiator,Patriot}     => {Sixth Sense} 0.4     0.6666667  1.111111
12 {Gladiator,Sixth Sense} => {Patriot}     0.4     0.8000000  1.333333

These are the details of all 12 rules. When the data set is very large and many rules are produced, printing them all reveals little. In that case sort() can order the rules by support, confidence, or lift so that the top-ranked rules can be picked out.
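The confidence and lift columns follow directly from the support values, which can be checked by hand. A base R sketch for the pair Patriot and Gladiator, using only numbers read off the table above (the variable names are illustrative):

```r
# Supports read off the inspect(rules) table.
supp_patriot   <- 0.6   # rule 1: {} => {Patriot}
supp_gladiator <- 0.7   # rule 3: {} => {Gladiator}
supp_both      <- 0.6   # rule 6: support of {Patriot, Gladiator}

# confidence(X => Y) = support(X and Y) / support(X)
conf_p_g <- supp_both / supp_patriot
conf_g_p <- supp_both / supp_gladiator

# lift(X => Y) = confidence(X => Y) / support(Y); symmetric in X and Y
lift_p_g <- conf_p_g / supp_gladiator
lift_g_p <- conf_g_p / supp_patriot

print(c(conf_p_g = conf_p_g, conf_g_p = conf_g_p))
print(c(lift_p_g = lift_p_g, lift_g_p = lift_g_p))
```

The two confidences differ while the two lifts coincide, which matches rules 6 and 7 in the table: lift does not depend on the direction of the rule.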

 

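4. Controlling the strength of the generated rules

For data sets with many transactions, one can generally start from the default parameters of apriori() and then adjust them to the business context, for example by setting the thresholds very low at first and raising them step by step until the rules and their strength are satisfactory. Strength is controlled through the support, confidence, and lift thresholds.

Adjusting support and confidence together:

> rules1 <- apriori(data=data, parameter=list(support=0.5, confidence=0.6))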

Apriori

Parameter specification:

 confidence minval smax arem  aval originalSupport support minlen maxlen

        0.6   0.1    1 none FALSE            TRUE     0.5     1     10

 target  ext

  rules FALSE

 

Algorithmic control:

 filter tree heap memopt load sort verbose

    0.1 TRUE TRUE  FALSE TRUE   2    TRUE

 

Absolute minimum support count: 5

 

set item appearances...[0 item(s)] done [0.00s].

set transactions...[10 item(s), 10 transaction(s)] done [0.00s].

sorting and recoding items ... [3 item(s)] done [0.00s].

creating transaction tree ... done [0.00s].

checking subsets of size 1 2 done [0.00s].

writing ... [7 rule(s)] done [0.00s].

creating S4 object  ... done [0.00s].

> rules1

set of 7 rules       # 7 association rules were produced

 

> rules2 <- apriori(data=data, parameter=list(support=0.3, confidence=0.75))

Apriori

Parameter specification:

 confidence minval smax arem  aval originalSupport support minlen maxlen

       0.75   0.1    1 none FALSE            TRUE     0.3     1     10

 target  ext

  rules FALSE

 

Algorithmic control:

 filter tree heap memopt load sort verbose

    0.1 TRUE TRUE  FALSE TRUE   2    TRUE

 

Absolute minimum support count: 3

 

set item appearances...[0 item(s)] done [0.00s].

set transactions...[10 item(s), 10 transaction(s)] done [0.00s].

sorting and recoding items ... [3 item(s)] done [0.00s].

creating transaction tree ... done [0.00s].

checking subsets of size 1 2 3 done [0.00s].

writing ... [5 rule(s)] done [0.00s].

creating S4 object  ... done [0.00s].

> rules2

set of 5 rules     # 5 association rules were produced

 

> rules3 <- apriori(data=data, parameter=list(support=0.5, confidence=0.75))

Apriori

Parameter specification:

 confidence minval smax arem  aval originalSupport support minlen maxlen

       0.75   0.1    1 none FALSE            TRUE     0.5     1     10

 target  ext

  rules FALSE

 

Algorithmic control:

 filter tree heap memopt load sort verbose

    0.1 TRUE TRUE  FALSE TRUE   2    TRUE

 

Absolute minimum support count: 5

 

set item appearances...[0 item(s)] done [0.00s].

set transactions...[10 item(s), 10 transaction(s)] done [0.00s].

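sorting and recoding items ... [3 item(s)] done [0.00s].

creating transaction tree ... done [0.00s].

checking subsets of size 1 2 done [0.00s].

writing ... [3 rule(s)] done [0.00s].

creating S4 object  ... done [0.00s].

> rules3

set of 3 rules       # 3 association rules were produced

When tuning support and confidence: raise the support threshold if you care more about how large a share of all transactions an itemset covers, and raise the confidence threshold if you care more about the reliability of the rule itself. Lift is arguably the most reliable measure for screening rules, because it compares a rule's confidence with the baseline support of its consequent, and it often leads to the more interesting findings.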
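Because raising a threshold can only remove rules, the counts for rules1, rules2 and rules3 can be cross-checked by filtering the 12 rules mined at support 0.3 and confidence 0.5. The base R sketch below copies support and confidence from the inspect(rules) output into a plain data frame; rule_table and count_rules are illustrative names. With an actual arules rules object, subset() offers the same kind of filtering.

```r
# Support and confidence of the 12 rules, copied from inspect(rules).
rule_table <- data.frame(
  rule       = 1:12,
  support    = c(0.6, 0.6, 0.7, 0.4, 0.4, 0.6, 0.6, 0.5, 0.5, 0.4, 0.4, 0.4),
  confidence = c(0.6, 0.6, 0.7, 0.6666667, 0.6666667, 1.0, 0.8571429,
                 0.8333333, 0.7142857, 1.0, 0.6666667, 0.8)
)

# Count the rules that survive a given pair of thresholds.
count_rules <- function(min_support, min_confidence) {
  keep <- rule_table$support >= min_support &
          rule_table$confidence >= min_confidence
  sum(keep)
}

n_rules1 <- count_rules(0.5, 0.6)
n_rules2 <- count_rules(0.3, 0.75)
n_rules3 <- count_rules(0.5, 0.75)
print(c(rules1 = n_rules1, rules2 = n_rules2, rules3 = n_rules3))
```

Mining once with loose thresholds and filtering afterwards can be convenient while exploring, although on large data sets a low support threshold makes the mining step itself expensive.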

 

5. Using eclat() to find the products best suited to bundling

> freq <- eclat(data=data, parameter=list(minlen=2, maxlen=3, support=0.3, target='frequent itemsets'), control=list(sort=-1))

Eclat

parameter specification:

 tidLists support minlen maxlen            target   ext

    FALSE    0.3      2      3 frequent itemsets FALSE

 

algorithmic control:

 sparse sort verbose

      7  -1    TRUE

 

Absolute minimum support count: 3

 

create itemset ...

set transactions...[10 item(s), 10 transaction(s)] done [0.00s].

sorting and recoding items ... [3 item(s)] done [0.00s].

creating bit matrix ... [3 row(s), 10 column(s)] done [0.00s].

writing  ... [4 set(s)] done [0.00s].

Creating S4 object  ... done [0.00s].

> freq

set of 4 itemsets

> inspect(freq)

  items                           support
1 {Gladiator,Patriot,Sixth Sense} 0.4
2 {Gladiator,Patriot}             0.6
3 {Patriot,Sixth Sense}           0.4
4 {Gladiator,Sixth Sense}         0.5
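Eclat works on a vertical layout of the data: each item is stored with the set of transactions that contain it, and the support of an itemset is obtained by intersecting those sets. The base R sketch below illustrates that idea only; it is not how arules implements eclat(). The transaction IDs in tid_lists are assumptions, chosen to be consistent with the counts reported in this example, since the full file is not listed here.

```r
# Tid-lists: for each item, the IDs of the transactions that contain it.
# These lists are assumptions, chosen to agree with the counts above.
tid_lists <- list(
  "Gladiator"   = c(2, 4, 5, 6, 8, 9, 10),
  "Patriot"     = c(2, 4, 5, 6, 8, 9),
  "Sixth Sense" = c(1, 4, 5, 6, 9, 10)
)
n_transactions <- 10

# Support of an itemset = size of the intersection of its tid-lists,
# divided by the number of transactions.
itemset_support <- function(items) {
  common <- Reduce(intersect, tid_lists[items])
  length(common) / n_transactions
}

# All candidate itemsets of size 2 and 3 over the three frequent items.
candidates <- c(combn(names(tid_lists), 2, simplify = FALSE),
                combn(names(tid_lists), 3, simplify = FALSE))
supports <- vapply(candidates, itemset_support, numeric(1))
names(supports) <- vapply(candidates, paste, character(1), collapse = ",")

# Keep the itemsets that reach the minimum support of 0.3.
print(supports[supports >= 0.3])
```

Under these assumed tid-lists, Gladiator and Patriot form the pair with the highest support, which is why that pair is the natural first candidate for a bundle.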
