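Example 3. This example uses the dvdtrans.csv file from the csv directory of the rattle package. The data records DVD purchases.
1. Load the packages and the data:
> library(rattle)
> library(arules)
> dvd <- read.csv("F:\\R\\R-3.2.2\\library\\rattle\\csv/dvdtrans.csv", header=T)   # the file sits under the package's installation directory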
> class(dvd)
[1] "data.frame"
dvd is a data.frame, so it must be converted to the "transactions" class that the arules package works with.
> str(dvd)
'data.frame': 30 obs. of 2 variables:
 $ ID  : int 1 1 1 1 1 2 2 2 3 3 ...
 $ Item: Factor w/ 10 levels "Braveheart","Gladiator",..: 10 7 4 3 8 2 9 1 7 8 ...
2. The data is in ordinary data frame format and must be converted to the format accepted by the apriori() and eclat() functions:
> data <- as(split(dvd[, 'Item'], dvd[, 'ID']), 'transactions')
> class(data)
[1] "transactions"
attr(,"package")
[1] "arules"
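To see what the split() call hands to as(), here is a minimal base-R sketch on a small made-up data frame. The data frame toy and its values are hypothetical and are not taken from dvdtrans.csv; only the split() pattern mirrors the call above.

```r
# Hypothetical long-format purchase records: one row per (ID, Item) pair.
toy <- data.frame(
  ID   = c(1, 1, 2, 3, 3, 3),
  Item = c("A", "B", "A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# split() groups the Item column by ID, giving one character vector per transaction.
baskets <- split(toy[, "Item"], toy[, "ID"])
print(baskets)

# Each list element is one basket; its name is the transaction ID.
basket_sizes <- lengths(baskets)
print(basket_sizes)
```

A named list of item vectors in this shape is what as(..., 'transactions') coerces into a transactions object.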
Inspect the converted data:
> inspect(data[1:3])
items transactionID
1 {Green Mile,
Harry Potter1,
LOTR1,
LOTR2,
Sixth Sense} 1
2 {Braveheart,
Gladiator,
Patriot} 2
3 {LOTR1,
LOTR2} 3
> dim(data)
[1] 10 10
> summary(data)
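transactions as itemMatrix in sparse format with
10 rows (elements/itemsets/transactions) and
10 columns (items) and a density of 0.3
There are 10 transactions and 10 distinct items in total. density=0.3 is the proportion of 1s in the sparse transaction-by-item matrix.
most frequent items:
Gladiator Patriot Sixth Sense Green Mile Harry Potter1 (Other)
7 6 6 2 2 7
These are the most frequent items and their counts. Gladiator is clearly the most popular film, followed by Patriot and Sixth Sense.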
element (itemset/transaction) length distribution:
sizes
2 3 4 5
3 5 1 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 2.25 3.00 3.00 3.00 5.00
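This is the number of items per transaction, together with the five-number summary (minimum, quartiles, maximum) and the mean. Three transactions contain 2 items, five contain 3 items, and one transaction each contains 4 and 5 items. The first quartile is 2.25, so roughly a quarter of the transactions contain no more than 2 items; the mean shows that a transaction contains 3 items on average.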
includes extended item information - examples:
labels
1 Braveheart
2 Gladiator
3 Green Mile
includes extended transaction information - examples:
transactionID
1 1
2 2
3 3
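If the data set contained columns other than the transaction ID and the item (for example the time of the transaction or a user ID), they would be listed here. This example has no such extra columns: labels are simply the item names.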
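The density and the size statistics in the summary can be checked by hand. The sketch below rebuilds the basket sizes from the "sizes" table printed above (3 baskets of 2 items, 5 of 3, 1 of 4, 1 of 5); the variable names are illustrative only.

```r
# Basket sizes as reported by summary(data): 3 baskets of 2 items,
# 5 of 3 items, 1 of 4 items and 1 of 5 items.
sizes <- c(rep(2, 3), rep(3, 5), 4, 5)

n_transactions <- length(sizes)  # number of rows of the sparse matrix
n_items <- 10                    # number of distinct items (columns)

# Density = filled cells / all cells of the transaction-by-item matrix.
density <- sum(sizes) / (n_transactions * n_items)
print(density)

# The same summary statistics that summary(data) prints for the sizes.
print(summary(sizes))
print(quantile(sizes, 0.25))
```

The 30 purchase records spread over a 10 x 10 matrix give the density of 0.3, and the first quartile comes out at 2.25, as in the printed summary.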
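3. The apriori() and eclat() functions can now be applied to data to mine association rules. Because there are few transactions, the minimum support is set to 0.3 and the minimum confidence to 0.5; all other parameters keep their defaults.
> rules <- apriori(data=data, parameter=list(support=0.3, confidence=0.5))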
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport support minlen
0.5 0.1 1 none FALSE TRUE 0.3 1
maxlen target ext
10 rules FALSE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 3
set item appearances...[0 item(s)] done [0.00s].
set transactions...[10 item(s), 10 transaction(s)] done [0.00s].
sorting and recoding items ... [3 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [12 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
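The printout lists the parameters used by the model (minimum support, minimum confidence, minimum and maximum number of items, type of output) followed by details of the algorithm's execution.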
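The general form of the call in the arules package is apriori(data, parameter=list(support, minlen, maxlen, confidence), appearance, control). Apriori mines association rules from frequent itemsets, searching for the frequent itemsets level by level.
data is a transactions object, or any data structure that can be coerced to one; common binary matrices and data frames can be converted with as().
parameter is a list (coerced to an APparameter object) holding the thresholds of the model: support is the minimum support (default 0.1); confidence is the minimum confidence (default 0.8); minlen is the minimum number of items (default 1; setting minlen=2 drops rules with an empty left-hand side); maxlen is the maximum number of items (default 10); target selects the type of result, association rules or frequent itemsets.
appearance restricts which items may appear in the antecedent X (lhs) and the consequent Y (rhs), usually to items the analyst is interested in; by default no restriction is applied.
control tunes how the algorithm runs, for example whether items are sorted in ascending or descending order and whether progress is reported; verbose=F suppresses the progress output.
As an illustration of appearance and control, here is a call on a different data set, titanic.raw (not loaded in this example); the result is stored under its own name so that it does not overwrite rules:
> rules.titanic <- apriori(titanic.raw, control=list(verbose=F), parameter=list(minlen=2, supp=0.005, conf=0.8), appearance=list(rhs=c("Survived=No","Survived=Yes"), default="lhs"))
Setting rhs=c("Survived=No","Survived=Yes") ensures that only "Survived=No" and "Survived=Yes" appear on the right-hand side of the rules; default="lhs" lets all other items appear on the left-hand side, while default="both" would allow them on both sides.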
> rules
set of 12 rules
The algorithm produced 12 association rules.
> inspect(sort(rules, by="lift", decreasing=T)[1:3])   # sorted by lift; sorting by support or confidence works the same way
   lhs                      rhs         support confidence lift
6  {Patriot}             => {Gladiator} 0.6     1.0000000  1.428571
7  {Gladiator}           => {Patriot}   0.6     0.8571429  1.428571
10 {Patriot,Sixth Sense} => {Gladiator} 0.4     1.0000000  1.428571
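The confidence and lift values of rules 6 and 7 can be reproduced by hand from numbers already printed: the item counts in summary(data) and the support of 0.6 for {Patriot, Gladiator}. This is a minimal sketch using the standard definitions of confidence and lift; the variable names are illustrative.

```r
n <- 10               # number of transactions
count_gladiator <- 7  # from "most frequent items"
count_patriot <- 6
supp_both <- 0.6      # support of {Patriot, Gladiator}, from the rule table

supp_patriot <- count_patriot / n
supp_gladiator <- count_gladiator / n

# confidence(X => Y) = support(X and Y) / support(X)
conf_patriot_gladiator <- supp_both / supp_patriot
conf_gladiator_patriot <- supp_both / supp_gladiator

# lift(X => Y) = confidence(X => Y) / support(Y); it is symmetric in X and Y.
lift_patriot_gladiator <- conf_patriot_gladiator / supp_gladiator
lift_gladiator_patriot <- conf_gladiator_patriot / supp_patriot

print(c(conf_patriot_gladiator, conf_gladiator_patriot))
print(c(lift_patriot_gladiator, lift_gladiator_patriot))
```

Both directions of the rule share the same lift because lift only depends on the joint support and the two individual supports; the confidences differ because each is divided by a different antecedent support.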
> inspect(rules)
   lhs                        rhs           support confidence lift
1  {}                      => {Patriot}     0.6     0.6000000  1.000000
2  {}                      => {Sixth Sense} 0.6     0.6000000  1.000000
3  {}                      => {Gladiator}   0.7     0.7000000  1.000000
4  {Patriot}               => {Sixth Sense} 0.4     0.6666667  1.111111
5  {Sixth Sense}           => {Patriot}     0.4     0.6666667  1.111111
6  {Patriot}               => {Gladiator}   0.6     1.0000000  1.428571
7  {Gladiator}             => {Patriot}     0.6     0.8571429  1.428571
8  {Sixth Sense}           => {Gladiator}   0.5     0.8333333  1.190476
9  {Gladiator}             => {Sixth Sense} 0.5     0.7142857  1.190476
10 {Patriot,Sixth Sense}   => {Gladiator}   0.4     1.0000000  1.428571
11 {Gladiator,Patriot}     => {Sixth Sense} 0.4     0.6666667  1.111111
12 {Gladiator,Sixth Sense} => {Patriot}     0.4     0.8000000  1.333333
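This lists all 12 rules in detail. When the data set is large and many rules are generated, printing everything rarely reveals anything useful; instead, sort the rules by support, confidence or lift with sort() and keep the top-ranked ones.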
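As a rough sketch of how a larger rule set might be narrowed down, the code below mines rules from a few made-up baskets, keeps the rules with lift above 1 using subset(), and orders them with sort(). The toy baskets, the thresholds and the names trans, rules_toy and strong are assumptions chosen only to keep the example self-contained; they are not part of the DVD data.

```r
library(arules)

# Hypothetical toy baskets, used only so the sketch is self-contained.
toy <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk")
)
trans <- as(toy, "transactions")

# Mine rules quietly, with loose thresholds and no empty left-hand sides.
rules_toy <- apriori(trans,
                     parameter = list(support = 0.2, confidence = 0.5, minlen = 2),
                     control = list(verbose = FALSE))

# Keep only rules whose lift exceeds 1, then order them by confidence.
strong <- subset(rules_toy, subset = lift > 1)
strong <- sort(strong, by = "confidence", decreasing = TRUE)

# quality() returns the measures as a data frame, convenient for further filtering.
print(quality(strong)[, c("support", "confidence", "lift")])
```

The same subset() and sort() calls could be applied to the rules object from the DVD example; which lift cut-off is appropriate depends on the data.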
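4. Controlling the strength of the generated rules
For large transaction data sets, one can start with the default parameters of apriori() and then adjust them to fit the business problem, for example by setting the thresholds very low at first and raising them step by step until the rules reach the desired number and strength. Strength is controlled through the minimum support and minimum confidence passed to apriori(); lift is not a threshold of apriori() itself, but the resulting rules can be sorted or filtered by it.
Adjusting support and confidence together:
> rules1 <- apriori(data=data, parameter=list(support=0.5, confidence=0.6))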
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport support minlen maxlen
0.6 0.1 1 none FALSE TRUE 0.5 1 10
target ext
rules FALSE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 5
set item appearances...[0 item(s)] done [0.00s].
set transactions...[10 item(s), 10 transaction(s)] done [0.00s].
sorting and recoding items ... [3 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [7 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
> rules1
set of 7 rules
Seven association rules are produced.
> rules2 <- apriori(data=data, parameter=list(support=0.3, confidence=0.75))
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport support minlen maxlen
0.75 0.1 1 none FALSE TRUE 0.3 1 10
target ext
rules FALSE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 3
set item appearances...[0 item(s)] done [0.00s].
set transactions...[10 item(s), 10 transaction(s)] done [0.00s].
sorting and recoding items ... [3 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [5 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
> rules2
set of 5 rules
Five association rules are produced.
> rules3 <- apriori(data=data, parameter=list(support=0.5, confidence=0.75))
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport support minlen maxlen
0.75 0.1 1 none FALSE TRUE 0.5 1 10
target ext
rules FALSE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 5
set item appearances...[0 item(s)] done [0.00s].
set transactions...[10 item(s), 10 transaction(s)] done [0.00s].
sorting and recoding items ... [3 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [3 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
> rules3
set of 3 rules
Three association rules are produced.
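When tuning support and confidence: raise the support threshold if you care more about how large a share of all transactions the itemset covers, and raise the confidence threshold if you care more about the reliability of the rule itself. Lift is often the most useful measure for screening rules, because it compares a rule's confidence with the baseline frequency of its right-hand side (a lift above 1 indicates a positive association), and it can also surface interesting findings.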
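The rule counts of rules1, rules2 and rules3 can be cross-checked without rerunning apriori(), because each of those threshold pairs is at least as strict as the original (0.3, 0.5) run and so selects a subset of the 12 rules listed earlier. The sketch below copies the support and confidence columns from that table; rule_table and count_rules() are illustrative helpers, not arules functions.

```r
# Support and confidence of the 12 rules, copied from the inspect(rules) output.
rule_table <- data.frame(
  support    = c(0.6, 0.6, 0.7, 0.4, 0.4, 0.6, 0.6, 0.5, 0.5, 0.4, 0.4, 0.4),
  confidence = c(0.6, 0.6, 0.7, 0.6666667, 0.6666667, 1.0, 0.8571429,
                 0.8333333, 0.7142857, 1.0, 0.6666667, 0.8)
)

# Count the rules that meet a given pair of minimum thresholds.
count_rules <- function(min_support, min_confidence) {
  sum(rule_table$support >= min_support &
        rule_table$confidence >= min_confidence)
}

print(count_rules(0.5, 0.60))  # thresholds used for rules1
print(count_rules(0.3, 0.75))  # thresholds used for rules2
print(count_rules(0.5, 0.75))  # thresholds used for rules3
```

The three counts match the 7, 5 and 3 rules reported above, which shows how raising either threshold prunes the rule set.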
5. Use the eclat() function to find the products best suited to bundled sales
> freq <- eclat(data=data, parameter=list(minlen=2, maxlen=3, support=0.3, target='frequent itemsets'), control=list(sort=-1))
Eclat
parameter specification:
tidLists support minlen maxlen target ext
FALSE 0.3 2 3 frequent itemsets FALSE
algorithmic control:
sparse sort verbose
7 -1 TRUE
Absolute minimum support count: 3
create itemset ...
set transactions...[10 item(s), 10 transaction(s)] done [0.00s].
sorting and recoding items ... [3 item(s)] done [0.00s].
creating bit matrix ... [3 row(s), 10 column(s)] done [0.00s].
writing ... [4 set(s)] done [0.00s].
Creating S4 object ... done [0.00s].
> freq
set of 4 itemsets
> inspect(freq)
  items                           support
1 {Gladiator,Patriot,Sixth Sense} 0.4
2 {Gladiator,Patriot}             0.6
3 {Patriot,Sixth Sense}           0.4
4 {Gladiator,Sixth Sense}         0.5
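One simple way to read a bundling candidate off this result is to rank the itemsets by support. The sketch below copies the four itemsets from the inspect(freq) output and assumes that support alone decides the ranking, which is a simplification; the data frame itemsets and the names ranked and best_bundle are illustrative.

```r
# Frequent itemsets and supports, copied from the inspect(freq) output.
itemsets <- data.frame(
  items = c("{Gladiator,Patriot,Sixth Sense}", "{Gladiator,Patriot}",
            "{Patriot,Sixth Sense}", "{Gladiator,Sixth Sense}"),
  support = c(0.4, 0.6, 0.4, 0.5),
  stringsAsFactors = FALSE
)

# Rank candidate bundles by how often they were bought together.
ranked <- itemsets[order(-itemsets$support), ]
print(ranked)

# The top-ranked itemset is the most frequently co-purchased combination.
best_bundle <- ranked$items[1]
print(best_bundle)
```

Under this criterion {Gladiator,Patriot} comes first with a support of 0.6, followed by {Gladiator,Sixth Sense} at 0.5.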