Big Data Project 2: A Predictive Model for Big Data under Memory Constraints



I. Project Overview: Regression Trees for Classification Prediction

1. About the Data Set

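Building random forests with the randomForest package versus the party package: randomForest cannot handle data that contains missing values, nor categorical variables with more than 32 levels (the limit is 53 in more recent releases of the package).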

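This example builds a predictive model under memory constraints. The training set is too large to build a decision tree on directly in R, so several subsets are first drawn from it by random sampling and a decision tree is built on each subset. Only the variables that appear in those trees are kept, which shrinks the training set. At scoring time, the data to be scored is likewise split into several subsets so that the job can run within the available memory.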

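About the data: the goal of the KDD Cup 1998 competition was to estimate the return from a direct mailing in order to maximize donations. The data files are comma-delimited. The learning data set "cup98lrn.txt" contains 95412 records and 481 fields, and the validation data set "cup98val.txt" contains 96367 records and 479 fields. Every record has a CONTROLN field, which is the unique identifier of the record. There are two target variables, TARGET_B and TARGET_D: TARGET_B is a binary variable indicating whether the record responded to the mailing, and TARGET_D holds the donation amount. The learning and validation data sets share the same format, except that the validation data set does not include TARGET_B and TARGET_D.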

2. Methodology

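The records in this example fall into two classes, target customers and non-target customers, coded as 1 and 0, which is similar to customer risk modeling.

Decision trees are used here as well, because they are easier for business people and managers to understand and their rules are simpler. Compared with SVMs or neural networks, decision trees are more readily accepted and acted on in a business setting. They also support a mix of categorical and numeric variables and can handle missing values. In particular, the party package provides the function ctree() for building decision trees.

Training a model on big data takes a long time, especially when categorical variables have many levels. One option is to train on a small sample. A different approach is used here, one that trains on as much data as possible. First, 20 random samples are drawn from the training data and a decision tree is built on each sample. Each tree contains roughly 20 to 30 variables, and many trees share the same variables. Next, all variables that appear in any of the trees are collected, about 60 in total. Finally, a model is trained on the original data restricted to those variables. This way every training record is used for the final model, while only the variables found in the sampled trees are kept. (The run shown below builds 10 trees and ends up with 20 selected variables.)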

II. Project Workflow

1. Loading and Inspecting the Data

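#1) Load the data
#stringsAsFactors=TRUE keeps text columns as factors, which the is.factor() checks below rely on (it is no longer the default since R 4.0)
cup98 <- read.csv("F:\\R\\Rworkspace\\cup98lrn/cup98lrn.txt", stringsAsFactors=TRUE)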
dim(cup98)
## [1] 95412   481
n.missing <- rowSums(is.na(cup98))
sum(n.missing > 0)  #number of rows that contain at least one NA
## [1] 95412
#2) Select variables
varSet <- c(
 #demographics
 "ODATEDW", "OSOURCE", "STATE", "ZIP", "PVASTATE", "DOB", "RECINHSE", "MDMAUD",
 "DOMAIN", "CLUSTER", "AGE", "HOMEOWNR", "CHILD03", "CHILD07", "CHILD12", "CHILD18",
 "NUMCHLD", "INCOME", "GENDER", "WEALTH1", "HIT",
 #donor interests
 "COLLECT1", "VETERANS", "BIBLE", "CATLG", "HOMEE", "PETS", "CDPLAY", "STEREO",
 "PCOWNERS", "PHOTO", "CRAFTS", "FISHER", "GARDENIN", "BOATS", "WALKER", "KIDSTUFF",
 "CARDS", "PLATES", "PEPSTRFL",
 #summary variables of promotion history
 "CARDPROM", "MAXADATE", "NUMPROM", "CARDPM12", "NUMPRM12",
 #summary variables of giving history
 "RAMNTALL", "NGIFTALL", "CARDGIFT", "MINRAMNT", "MAXRAMNT", "LASTGIFT", "LASTDATE",
 "FISTDATE", "TIMELAG", "AVGGIFT",
 #ID & targets
 "CONTROLN", "TARGET_B", "TARGET_D", "HPHONE_D", 
 #RFA
 "RFA_2F", "RFA_2A", "MDMAUD_R", "MDMAUD_F", "MDMAUD_A",
 #OTHERS
 "CLUSTER2", "GEOCODE2")

#Drop the ID (CONTROLN) and TARGET_D
vars <- setdiff(varSet, c("CONTROLN", "TARGET_D"))
cup98 <- cup98[, vars]
dim(cup98)
## [1] 95412    64

2. Building a Model with Random Forest

Check for missing values and for categorical variables with more than 10 levels.

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
#model <- randomForest(TARGET_B~., data=cup98)
#The call above fails with an error about missing values

#1) Check for missing values
n.missing <- rowSums(is.na(cup98))
(tab.missing <- table(n.missing))
## n.missing
##     0     1     2     3     4     5     6     7 
##  6782 36864 23841 13684 11716  2483    41     1
#Proportion of rows without any missing value
round(tab.missing["0"]/nrow(cup98), digits=2)
##    0 
## 0.07
#2) Find the categorical variables with more than 10 levels
(idx.cat <- which(sapply(cup98, is.factor)))
##  OSOURCE    STATE      ZIP PVASTATE RECINHSE   MDMAUD   DOMAIN HOMEOWNR 
##        2        3        4        5        7        8        9       12 
##  CHILD03  CHILD07  CHILD12  CHILD18   GENDER COLLECT1 VETERANS    BIBLE 
##       13       14       15       16       19       22       23       24 
##    CATLG    HOMEE     PETS   CDPLAY   STEREO PCOWNERS    PHOTO   CRAFTS 
##       25       26       27       28       29       30       31       32 
##   FISHER GARDENIN    BOATS   WALKER KIDSTUFF    CARDS   PLATES PEPSTRFL 
##       33       34       35       36       37       38       39       40 
##   RFA_2A MDMAUD_R MDMAUD_F MDMAUD_A GEOCODE2 
##       59       60       61       62       64
all.levels <- sapply(names(idx.cat), function(x) nlevels(cup98[, x]))
(levels10 <- all.levels[all.levels > 10])
## OSOURCE   STATE     ZIP  MDMAUD  DOMAIN 
##     896      57   19938      28      17
#3) Create the training and test sets:
ind <- sample(1:2, nrow(cup98), prob=c(80, 20), replace = T)
trainData <- cup98[ind==1, ]
testData <- cup98[ind==2, ]

#4) Build a random forest with cforest() from the party package: fails because of the memory limit
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
(time1 <- Sys.time())
## [1] "2016-02-16 11:54:18 CST"
#cf <- cforest(TARGET_B~., data=trainData, controls=cforest_unbiased(mtry=2, ntree=50))
#Error: cannot allocate vector of size 11.4 Gb
(time2 <- Sys.time())
## [1] "2016-02-16 11:54:18 CST"
(time2-time1)
## Time difference of 0.06253886 secs
#print(object.size(cf), units="Mb")
#Note: the cforest() call above cannot run and fails with an out-of-memory error, because ZIP has 19938 levels and OSOURCE has 896 levels.

#5) One way to reduce memory demand is to group or remove categorical variables with many levels. Here ZIP and OSOURCE are removed and the training and test sets are rebuilt
cup <- cup98[, setdiff(names(cup98), c("ZIP", "OSOURCE"))]
train <- cup[ind==1, ]
test <- cup[ind==2, ]

#Build the model
#(time1 <- Sys.time())
#cf <- cforest(TARGET_B~., data=train, controls = cforest_unbiased(mtry=2, ntree=50))
#print(object.size(cf), units="Mb")
#(time2 <- Sys.time())
#(time2 - time1)

#Predict
#myPrediction <- predict(cf, newdata=test)
#(time3 <- Sys.time())
#print(object.size(myPrediction), units="Mb")
#time3 -time2
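#Summary: about 100,000 records and 62 fields, with at most 57 levels in any categorical variable; building the model on 80% of the data took nearly an hour, and predicting on the other 20% took more than 10 minutes (with ZIP, 19938 levels, and OSOURCE, 896 levels, removed).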

3. Working around the Memory Limit

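One way to reduce memory demand is to group or remove categorical variables that have many levels. To decide which variables to use for modeling, this section repeats the tree-building process 10 times, collects every variable that appears in any of the resulting trees, and uses the collected variables to build the final model.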
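The post only removes ZIP and OSOURCE; grouping is mentioned but not shown. As a minimal sketch of what grouping could look like, assuming synthetic data and a made-up helper named collapseLevels (neither is from the original project), the code below keeps the most frequent levels of a factor and merges everything else into a single "OTHER" level.

```r
# Hypothetical helper: keep the `keep` most frequent levels, merge the rest into `other`
collapseLevels <- function(f, keep = 10, other = "OTHER") {
  f <- as.factor(f)
  counts <- sort(table(f), decreasing = TRUE)
  top <- names(counts)[seq_len(min(keep, length(counts)))]
  out <- as.character(f)
  # NA stays NA; any non-NA value outside the top levels is relabelled
  out[!is.na(out) & !(out %in% top)] <- other
  factor(out)
}

set.seed(1)
# Synthetic high-cardinality factor standing in for a column such as ZIP
zip <- factor(sample(sprintf("%05d", 1:500), 5000, replace = TRUE))
zipGrouped <- collapseLevels(zip, keep = 10)

nlevels(zip)         # number of levels before grouping
nlevels(zipGrouped)  # 10 kept levels plus "OTHER"
```

Whether grouping by frequency is appropriate depends on the variable; for something like ZIP, grouping by region may carry more signal than grouping by count.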

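#1) Create the training and test sets: split the data into three subsets, 30% for training, 20% for testing, and the remainder. Setting part of the data aside keeps the training and test sets small enough for training and testing to run in a memory-limited environment.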
library(party)
trainPercentage <- 30
testPercentage <- 20
restPercentage <- 100 - trainPercentage - testPercentage
(fileName <- paste("cup98-ctree", trainPercentage, testPercentage, sep="-"))
## [1] "cup98-ctree-30-20"
(vars <- setdiff(varSet, c("TARGET_D", "CONTROLN", "ZIP", "OSOURCE")))
##  [1] "ODATEDW"  "STATE"    "PVASTATE" "DOB"      "RECINHSE" "MDMAUD"  
##  [7] "DOMAIN"   "CLUSTER"  "AGE"      "HOMEOWNR" "CHILD03"  "CHILD07" 
## [13] "CHILD12"  "CHILD18"  "NUMCHLD"  "INCOME"   "GENDER"   "WEALTH1" 
## [19] "HIT"      "COLLECT1" "VETERANS" "BIBLE"    "CATLG"    "HOMEE"   
## [25] "PETS"     "CDPLAY"   "STEREO"   "PCOWNERS" "PHOTO"    "CRAFTS"  
## [31] "FISHER"   "GARDENIN" "BOATS"    "WALKER"   "KIDSTUFF" "CARDS"   
## [37] "PLATES"   "PEPSTRFL" "CARDPROM" "MAXADATE" "NUMPROM"  "CARDPM12"
## [43] "NUMPRM12" "RAMNTALL" "NGIFTALL" "CARDGIFT" "MINRAMNT" "MAXRAMNT"
## [49] "LASTGIFT" "LASTDATE" "FISTDATE" "TIMELAG"  "AVGGIFT"  "TARGET_B"
## [55] "HPHONE_D" "RFA_2F"   "RFA_2A"   "MDMAUD_R" "MDMAUD_F" "MDMAUD_A"
## [61] "CLUSTER2" "GEOCODE2"
ind <- sample(3, nrow(cup98), replace = T, prob=c(trainPercentage, testPercentage, restPercentage))
trainData <- cup98[ind==1, vars]
testData <- cup98[ind==2, vars]

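#2) Check whether the distribution of the target variable in the sampled training and test sets matches the original data; if it does not, stratified sampling can be used instead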
round(prop.table(table(cup98$TARGET_B)), digits = 3)
## 
##     0     1 
## 0.949 0.051
round(prop.table(table(trainData$TARGET_B)), digits = 3)
## 
##     0     1 
## 0.951 0.049
round(prop.table(table(testData$TARGET_B)), digits = 3)
## 
##    0    1 
## 0.95 0.05
#rm(cup98, ind)
gc()
##           used (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  578656 31.0     940480  50.3   750400  40.1
## Vcells 9117436 69.6   83217080 634.9 91593727 698.9
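Stratified sampling is mentioned above but not needed in this run, since the class proportions already match. For reference, here is a minimal sketch of a stratified 80/20 split on a synthetic 0/1 target; the function name stratifiedSplit and the data are assumptions made for illustration, not part of the original project.

```r
# Hypothetical helper: sample the same fraction from every class of y
stratifiedSplit <- function(y, trainFrac = 0.8) {
  isTrain <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    nTrain <- round(trainFrac * length(idx))
    # sample.int avoids the sample(x) pitfall when a class has a single member
    isTrain[idx[sample.int(length(idx), nTrain)]] <- TRUE
  }
  isTrain
}

set.seed(7)
y <- rbinom(10000, 1, 0.05)  # synthetic target with roughly 5% positives
isTrain <- stratifiedSplit(y, trainFrac = 0.8)

round(prop.table(table(y)), digits = 3)            # overall distribution
round(prop.table(table(y[isTrain])), digits = 3)   # training part
round(prop.table(table(y[!isTrain])), digits = 3)  # test part
```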
#3) Build a decision tree
myCtree <- NULL
startTime <- Sys.time()
myCtree <- ctree(TARGET_B~., data=trainData)
Sys.time() - startTime
## Time difference of 5.178561 secs
print(object.size(myCtree), units="Mb")
## 4.4 Mb
memory.size()  #Windows-only; reports the memory in use in Mb
## [1] 417.92
pdf(paste("F:\\R\\Rworkspace\\", fileName, ".pdf", sep=""))
plot(myCtree, type="simple",  ip_args=list(pval=F), ep_args=list(digits=0), main=fileName)
graphics.off()

#4) Build 10 decision trees with a custom script
#source('F:/R/Rworkspace/ctreeN.R')
#ctreeN(10)
#Takes about 6 minutes
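The ctreeN.R script itself is not shown in this post. As a rough sketch of the idea only, and not the author's script, the code below draws several random subsamples, fits one tree per subsample, and takes the union of the variables used in the splits. It runs on synthetic data and uses rpart (shipped with R) as a stand-in for party's ctree(); the names toy, treeVars and vars.union are made up for illustration.

```r
library(rpart)  # stand-in for party::ctree so the sketch runs on a default R install

set.seed(42)
# Synthetic data (hypothetical): 10 predictors, only x1 and x2 drive the target
n <- 2000
toy <- as.data.frame(matrix(rnorm(n * 10), ncol = 10))
names(toy) <- paste0("x", 1:10)
toy$target <- factor(as.integer(toy$x1 + toy$x2 + rnorm(n, sd = 0.5) > 0))

# Fit one tree on a random subsample and return the variables used in its splits
treeVars <- function(data, frac = 0.3) {
  idx <- sample(nrow(data), size = floor(frac * nrow(data)))
  fit <- rpart(target ~ ., data = data[idx, ], method = "class")
  # fit$frame$var lists the split variable of every node; leaves are marked "<leaf>"
  setdiff(unique(as.character(fit$frame$var)), "<leaf>")
}

nTrees <- 10
varsPerTree <- lapply(seq_len(nTrees), function(i) treeVars(toy))

# Union of the variables over all trees: the candidate set for the final model
vars.union <- sort(unique(unlist(varsPerTree)))
print(vars.union)
```

The final model is then trained on all rows but only on the columns in the collected set, which is what keeps the memory footprint manageable.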

4. Building the Model with the Selected Variables

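After the 10 decision trees above have been built, all the variables they contain are used to build the final model. This time the whole data set is used: 80% as the training set and 20% as the test set.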

#1) Select variables
vars.selected <- c("CARDS", "CARDGIFT", "CARDPM12", "CHILD12", "CLUSTER2", "DOMAIN", "GENDER", "GEOCODE2", "HIT", "HOMEOWNR", "INCOME", "LASTDATE", "MINRAMNT", "NGIFTALL", "PEPSTRFL", "RECINHSE", "RFA_2A", "RFA_2F", "STATE", "WALKER")

#2) Create the training and test sets
trainPercentage <- 80
testPercentage <- 20
(fileName <- paste("cup98-ctree", trainPercentage, testPercentage, sep="-"))
## [1] "cup98-ctree-80-20"
vars <- c("TARGET_B", vars.selected)
ind <- sample(2, nrow(cup98), replace=T, prob=c(trainPercentage, testPercentage))
trainData <- cup98[ind==1, vars]
testData <- cup98[ind==2, vars]
round(100*prop.table(table(trainData$TARGET_B)), digits = 1)
## 
##  0  1 
## 95  5
round(100*prop.table(table(testData$TARGET_B)), digits = 1)
## 
##    0    1 
## 94.7  5.3
#3) Build the model
myCtree <- ctree(TARGET_B~., data=trainData)
print(object.size(myCtree), units="Mb")
## 43.6 Mb
memory.size()
## [1] 344.93
print(myCtree)
## 
##   Conditional inference tree with 23 terminal nodes
## 
## Response:  TARGET_B 
## Inputs:  CARDS, CARDGIFT, CARDPM12, CHILD12, CLUSTER2, DOMAIN, GENDER, GEOCODE2, HIT, HOMEOWNR, INCOME, LASTDATE, MINRAMNT, NGIFTALL, PEPSTRFL, RECINHSE, RFA_2A, RFA_2F, STATE, WALKER 
## Number of observations:  76081 
## 
## 1) RFA_2A == {D, E}; criterion = 1, statistic = 416.197
##   2) LASTDATE <= 9611; criterion = 1, statistic = 79.624
##     3) RFA_2F <= 2; criterion = 1, statistic = 69.366
##       4) INCOME <= 6; criterion = 0.997, statistic = 49.32
##         5)*  weights = 7159 
##       4) INCOME > 6
##         6)*  weights = 429 
##     3) RFA_2F > 2
##       7) WALKER == {Y}; criterion = 1, statistic = 58.471
##         8)*  weights = 1762 
##       7) WALKER == { }
##         9) CARDPM12 <= 4; criterion = 0.999, statistic = 55.405
##           10)*  weights = 1295 
##         9) CARDPM12 > 4
##           11) PEPSTRFL == {X}; criterion = 0.998, statistic = 37.816
##             12) LASTDATE <= 9512; criterion = 0.978, statistic = 37.025
##               13)*  weights = 3794 
##             12) LASTDATE > 9512
##               14)*  weights = 4693 
##           11) PEPSTRFL == { }
##             15)*  weights = 3310 
##   2) LASTDATE > 9611
##     16) RFA_2F <= 2; criterion = 0.962, statistic = 29.529
##       17)*  weights = 237 
##     16) RFA_2F > 2
##       18)*  weights = 363 
## 1) RFA_2A == {F, G}
##   19) PEPSTRFL == {X}; criterion = 1, statistic = 109.472
##     20) LASTDATE <= 9607; criterion = 1, statistic = 59.983
##       21) RFA_2F <= 1; criterion = 1, statistic = 55.059
##         22) MINRAMNT <= 13; criterion = 0.993, statistic = 37.24
##           23) INCOME <= 2; criterion = 0.964, statistic = 34.578
##             24)*  weights = 1929 
##           23) INCOME > 2
##             25)*  weights = 6252 
##         22) MINRAMNT > 13
##           26) RFA_2A == {F}; criterion = 0.999, statistic = 24.021
##             27)*  weights = 76 
##           26) RFA_2A == {G}
##             28)*  weights = 250 
##       21) RFA_2F > 1
##         29) GENDER == { , A, J}; criterion = 0.999, statistic = 54.434
##           30) GENDER == {A, J}; criterion = 0.994, statistic = 32.28
##             31)*  weights = 36 
##           30) GENDER == { }
##             32)*  weights = 316 
##         29) GENDER == {F, M, U}
##           33)*  weights = 8015 
##     20) LASTDATE > 9607
##       34) CARDPM12 <= 10; criterion = 1, statistic = 27.286
##         35)*  weights = 874 
##       34) CARDPM12 > 10
##         36)*  weights = 109 
##   19) PEPSTRFL == { }
##     37) CARDGIFT <= 3; criterion = 1, statistic = 90.392
##       38) CLUSTER2 <= 42; criterion = 1, statistic = 100.831
##         39) STATE == {AA, AE, AP, AZ, CA, CO, CT, HI, ID, ND, NE, OK, OR, PA, SC, SD, WY}; criterion = 0.985, statistic = 90.333
##           40)*  weights = 7563 
##         39) STATE == {AK, AL, AR, AS, DE, FL, GA, IA, IL, IN, KS, KY, LA, MA, MD, ME, MI, MN, MO, MS, MT, NC, NJ, NM, NV, NY, OH, RI, TN, TX, UT, VA, VI, VT, WA, WI}
##           41)*  weights = 12950 
##       38) CLUSTER2 > 42
##         42)*  weights = 9404 
##     37) CARDGIFT > 3
##       43) CLUSTER2 <= 20; criterion = 0.959, statistic = 46.778
##         44)*  weights = 2153 
##       43) CLUSTER2 > 20
##         45)*  weights = 3112
#4) Save the tree to an .rdata file and its plot to a PDF file
save(myCtree, file=paste("F:\\R\\Rworkspace/项目/", fileName, ".rdata", sep=""))
#pdf(paste("F:\\R\\Rworkspace/项目/", fileName, ".pdf", sep=""),width=12, height=9, paper="a4r", pointsize=6)
#plot(myCtree, type="simple", ip_args=list(pval=F), ep_args=list(digits=0),main=fileName)
#plot(myCtree, terminal_panel=node_barplot(myCtree), ip_args=list(pval=F), ep_args=list(digits=0),main=fileName)
#graphics.off()

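#5) Predict on the test data to evaluate the tree
myPrediction <- predict(myCtree, testData)  #predicted scores (overwritten on the next line)
myPrediction <- predict(myCtree, testData, type="node")  #terminal-node ID of each test record, used for the per-node table below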
str(myPrediction)
##  int [1:19331] 45 42 41 41 8 5 41 41 41 33 ...
(testResult <- table(myPrediction, testData$TARGET_B))
##             
## myPrediction    0    1
##           5  1778  108
##           6   103    8
##           8   399   38
##           10  262   43
##           13  911   73
##           14 1150  110
##           15  808   45
##           17   70    8
##           18   86    9
##           24  446   19
##           25 1504   68
##           27   17    2
##           28   54    5
##           31   10    0
##           32   47    5
##           33 1944  114
##           35  205   16
##           36   27    7
##           40 1827   84
##           41 3119  123
##           42 2241   75
##           44  505   26
##           45  798   34
(percentageOfOne <- round(100*testResult[, 2]/(testResult[, 1] + testResult[, 2]), digits=1))
##    5    6    8   10   13   14   15   17   18   24   25   27   28   31   32 
##  5.7  7.2  8.7 14.1  7.4  8.7  5.3 10.3  9.5  4.1  4.3 10.5  8.5  0.0  9.6 
##   33   35   36   40   41   42   44   45 
##  5.5  7.2 20.6  4.4  3.8  3.2  4.9  4.1
(testResult <- cbind(testResult, percentageOfOne))
##       0   1 percentageOfOne
## 5  1778 108             5.7
## 6   103   8             7.2
## 8   399  38             8.7
## 10  262  43            14.1
## 13  911  73             7.4
## 14 1150 110             8.7
## 15  808  45             5.3
## 17   70   8            10.3
## 18   86   9             9.5
## 24  446  19             4.1
## 25 1504  68             4.3
## 27   17   2            10.5
## 28   54   5             8.5
## 31   10   0             0.0
## 32   47   5             9.6
## 33 1944 114             5.5
## 35  205  16             7.2
## 36   27   7            20.6
## 40 1827  84             4.4
## 41 3119 123             3.8
## 42 2241  75             3.2
## 44  505  26             4.9
## 45  798  34             4.1
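#myPrediction holds terminal-node IDs at this point, so recompute the predicted scores before plotting and ranking
myPrediction <- as.vector(predict(myCtree, newdata=testData))
#Box plot of the predicted scores for TARGET_B = 0 and 1
boxplot(myPrediction~testData$TARGET_B, xlab="TARGET_B", ylab="Prediction", ylim=c(0, 0.25))

#Model evaluation
s1 <- sort(myPrediction, decreasing = T, method="quick", index.return=T)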
testSize <- nrow(testData)
TotalNumOfTarget <- sum(testData$TARGET_B)
NumOfTarget <- rep(0, testSize)
NumOfTarget[1] <- (testData$TARGET_B)[s1$ix[1]]
for(i in 2:testSize) {
  NumOfTarget[i] <- NumOfTarget[i-1] + testData$TARGET_B[s1$ix[i]]
}
plot(1:testSize, NumOfTarget, pty=".", type="l", lty="solid", col="red", ylab="Count Of Responses in Top k", xlab="Top K", main=fileName)
grid(col="gray", lty="dotted")

percentile <- 100*(1:testSize)/testSize
percentileTarget <- 100*NumOfTarget/TotalNumOfTarget
plot(percentile, percentileTarget, pty=".", type="l", lty="solid", col="red", ylab="Percentage of Predicted Donations(%)", xlab="Percentage of Pool", main=fileName)
grid(col="gray", lty="dotted")
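The running count built with the for loop above is a cumulative sum over the targets sorted by descending score, so it can also be written with cumsum(). The sketch below checks that equivalence on synthetic scores and targets; the data and the variable names are made up for illustration.

```r
set.seed(3)
# Synthetic scores and 0/1 targets (hypothetical); higher scores respond more often
score <- runif(1000)
target <- rbinom(1000, 1, score * 0.2)

# Rank records from highest to lowest score
ord <- order(score, decreasing = TRUE)

# Vectorised version: responders captured in the top k records
numOfTarget <- cumsum(target[ord])

# Loop version, as in the code above, for comparison
loopCount <- numeric(length(ord))
loopCount[1] <- target[ord[1]]
for (i in 2:length(ord)) {
  loopCount[i] <- loopCount[i - 1] + target[ord[i]]
}
all(numOfTarget == loopCount)

# Cumulative gains: share of all responders captured in the top x% of the pool
percentile <- 100 * seq_along(ord) / length(ord)
gain <- 100 * numOfTarget / sum(target)
plot(percentile, gain, type = "l", col = "red",
     xlab = "Percentage of Pool", ylab = "Percentage of Responders Captured (%)")
abline(0, 1, lty = "dotted")  # what random ordering would give
```

The further the red curve sits above the dotted diagonal, the better the ranking concentrates responders at the top of the list.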

5. Scoring

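Scoring a large data set with a large decision tree can run out of memory. To reduce memory consumption, the scoring data is split into several subsets, the predictive model is applied to each subset separately, and the scores are then combined.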
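The chunking pattern can be sketched on its own, independently of the tree. The example below is only an illustration under stated assumptions: it uses synthetic data and a linear model as a stand-in for the ctree model, and predictInChunks is a hypothetical helper name. It checks that predicting chunk by chunk gives the same scores as predicting in one call.

```r
set.seed(11)
# Synthetic data and a linear model standing in for the tree (both hypothetical)
d <- data.frame(x = rnorm(1003))
d$y <- 2 * d$x + rnorm(1003)
fit <- lm(y ~ x, data = d)

# Apply predict() to consecutive row blocks and concatenate the results in order
predictInChunks <- function(model, newdata, nChunks = 10) {
  n <- nrow(newdata)
  chunkSize <- ceiling(n / nChunks)
  chunks <- split(seq_len(n), ceiling(seq_len(n) / chunkSize))
  preds <- lapply(chunks, function(idx) {
    as.vector(predict(model, newdata = newdata[idx, , drop = FALSE]))
  })
  unlist(preds, use.names = FALSE)
}

chunked <- predictInChunks(fit, d, nChunks = 10)
full <- as.vector(predict(fit, newdata = d))
all.equal(chunked, full)
```

Only one chunk of intermediate results is held at a time, so peak memory is driven by the chunk size rather than by the full scoring set.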

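#1) Load the scoring data
#stringsAsFactors=TRUE keeps text columns as factors so that the level alignment below applies (it is no longer the default since R 4.0)
cup98val <- read.csv("F:\\R\\Rworkspace\\cup98lrn/cup98val.txt", stringsAsFactors=TRUE)
cup98 <- read.csv("F:\\R\\Rworkspace\\cup98lrn/cup98lrn.txt", stringsAsFactors=TRUE)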
library(party)
treeFileName <- "cup98-ctree-80-20"
splitNum <- 10

#2) Set the factor levels of the scoring data: make the levels of the categorical variables in scoreData match those in trainData
trainData <- cup98[, vars]
vars2 <- setdiff(c(vars, "CONTROLN"), "TARGET_B")

scoreData <- cup98val[, vars2]
#rm(cup98, cup98val)
trainNames <- names(trainData)
scoreNames <- names(scoreData)
newScoreData <- scoreData

variableList <- intersect(trainNames, scoreNames)

for(i in 1:length(variableList)) {
   varname <- variableList[i]
   trainLevels <- levels(trainData[, varname])
   scoreLevels <- levels(newScoreData[, varname])
   #Re-level only factor columns whose level sets differ between training and scoring data
   if(is.factor(trainData[, varname]) && !setequal(trainLevels, scoreLevels)) {
      cat("Warning: new values found in score data, and they will be changed to NA!\n")
      cat(varname, "\n")
      cat("train:", length(trainLevels), ", ", trainLevels, "\n")
      cat("score:", length(scoreLevels), ", ", scoreLevels, "\n\n")
      newScoreData[, varname] <- factor(newScoreData[, varname], levels=trainLevels)
   }
}
## Warning: new values found in score data, and they will be changed to NA!
## GENDER 
## train: 7 ,    A C F J M U 
## score: 5 ,    F J M U 
## 
## Warning: new values found in score data, and they will be changed to NA!
## STATE 
## train: 57 ,  AA AE AK AL AP AR AS AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VI VT WA WI WV WY 
## score: 59 ,  AA AE AK AL AP AR AS AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR PW RI SC SD TN TX UT VA VI VT WA WI WV WY
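In the output above, STATE has two levels in the scoring data (PR and PW) that never occur in the training data. A small toy example, with made-up vectors, shows what re-levelling with factor(..., levels=trainLevels) does to such values: they become NA, while values known from training are kept.

```r
# Toy factors (hypothetical values) standing in for STATE in the two data sets
trainState <- factor(c("CA", "NY", "TX"))
scoreState <- factor(c("CA", "PR", "TX", "PW"))

# Re-level the scoring factor so it uses exactly the training levels
aligned <- factor(scoreState, levels = levels(trainState))

levels(aligned)    # "CA" "NY" "TX"
is.na(aligned)     # FALSE TRUE FALSE TRUE: PR and PW were not seen in training
as.character(aligned)
```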
#3) Load the decision tree model and check its size
load(paste("F:\\R\\Rworkspace/项目/", treeFileName, ".rdata", sep=""))
print(object.size(trainData), units="Mb")
## 8 Mb
print(object.size(scoreData), units="Mb")
## 8.1 Mb
print(object.size(newScoreData), units="Mb")
## 8.1 Mb
print(object.size(myCtree), units="Mb")
## 43.6 Mb
#Reclaim memory
memory.size()
## [1] 1086.55
gc() 
##             used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells    702415  37.6    1442291   77.1   1168576   62.5
## Vcells 113783508 868.1  172988376 1319.8 172951373 1319.6
#4) Split the scoring data into several subsets and apply the tree to each subset in turn, to reduce memory consumption
nScore <- dim(newScoreData)[1]
(splitSize <- round(nScore/splitNum))
## [1] 9637
myPred <- NULL
for(i in 1:splitNum) {
 startPos <- 1 + (i-1)*splitSize
 if(i==splitNum) { 
  endPos <- nScore
 }  else{
  endPos <- i*splitSize
 }
 print(paste("Predictions:", startPos, "-", endPos))
 tmpPred <- predict(myCtree, newdata=newScoreData[startPos:endPos, ])
 myPred <- c(myPred, tmpPred)
}
## [1] "Predictions: 1 - 9637"
## [1] "Predictions: 9638 - 19274"
## [1] "Predictions: 19275 - 28911"
## [1] "Predictions: 28912 - 38548"
## [1] "Predictions: 38549 - 48185"
## [1] "Predictions: 48186 - 57822"
## [1] "Predictions: 57823 - 67459"
## [1] "Predictions: 67460 - 77096"
## [1] "Predictions: 77097 - 86733"
## [1] "Predictions: 86734 - 96367"
#Count the predictions per score and compute their percentages
length(myPred)
## [1] 96367
(rankedLevels <- table(round(myPred, digits=4)))
## 
## 0.0262 0.0295   0.03 0.0402 0.0443 0.0467 0.0515  0.055 0.0553  0.056 
##  11904   2358  16415   3978   7848   9650   8997   4226  10208    311 
## 0.0595 0.0651 0.0665 0.0789  0.084 0.0862 0.0928 0.1061 0.1127 0.1928 
##   2628   4705    367   1138   6122    552    358   2172   1623    465 
## 0.1944 0.2105 0.2294 
##     68    133    141
#Reverse rankedLevels so the highest score comes first
rankedLevels <- rankedLevels[length(rankedLevels):1]
(levelNum <- length(rankedLevels)) 
## [1] 23
cumCnt <- rep(0, levelNum)
cumCnt[1] <- rankedLevels[1]
for(i in 2:levelNum) {
 cumCnt[i] <- cumCnt[i-1] + rankedLevels[i]
}

(cumPercent <- 100*cumCnt/nScore)
##  [1]   0.1463156   0.2843297   0.3548933   0.8374236   2.5216101
##  [6]   4.7754937   5.1469902   5.7198003  12.0725975  13.2534996
## [11]  13.6343354  18.5167122  21.2437868  21.5665114  32.1593492
## [16]  36.5446678  45.8808513  55.8946527  64.0385194  68.1664885
## [21]  85.2003279  87.6472236 100.0000000
cumPercent <- round(cumPercent,digits=1)
percent <- 100*rankedLevels/nScore
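#Note: the original line here assigned round(percent, digits=1) to a misspelled name (precent), so percent stays unrounded in the table below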
cumRanking <- data.frame(rankedLevels,  cumCnt, percent, cumPercent)
names(cumRanking) <- c("Frequency", "CumFrequency", "Percentage", "CumPercentage")
print(cumRanking)
##        Frequency CumFrequency  Percentage CumPercentage
## 0.2294       141          141  0.14631565           0.1
## 0.2105       133          274  0.13801405           0.3
## 0.1944        68          342  0.07056357           0.4
## 0.1928       465          807  0.48253033           0.8
## 0.1127      1623         2430  1.68418650           2.5
## 0.1061      2172         4602  2.25388359           4.8
## 0.0928       358         4960  0.37149647           5.1
## 0.0862       552         5512  0.57281019           5.7
## 0.084       6122        11634  6.35279712          12.1
## 0.0789      1138        12772  1.18090218          13.3
## 0.0665       367        13139  0.38083576          13.6
## 0.0651      4705        17844  4.88237675          18.5
## 0.0595      2628        20472  2.72707462          21.2
## 0.056        311        20783  0.32272458          21.6
## 0.0553     10208        30991 10.59283780          32.2
## 0.055       4226        35217  4.38531863          36.5
## 0.0515      8997        44214  9.33618355          45.9
## 0.0467      9650        53864 10.01380141          55.9
## 0.0443      7848        61712  8.14386668          64.0
## 0.0402      3978        65690  4.12796912          68.2
## 0.03       16415        82105 17.03383938          85.2
## 0.0295      2358        84463  2.44689572          87.6
## 0.0262     11904        96367 12.35277637         100.0
#5) Save the results
#write.csv(cumRanking, "F:\\R\\Rworkspace/项目/cup98-cumulative-ranking.csv", row.names=T)
#pdf(paste("F:\\R\\Rworkspace/项目/cup98-score-distribution.pdf", sep=""))
#plot(rankedLevels, x=names(rankedLevels), type="h", xlab="Score", ylab="# of Customers")
#graphics.off()

#6) Rank the customers by predicted score and save the result to a .csv file
s1 <- sort(myPred, decreasing=T, method="quick", index.return=T)
varToOutput <-  c("CONTROLN")
score <- round(myPred[s1$ix], digits=4)
table(score, useNA="ifany")
## score
## 0.0262 0.0295   0.03 0.0402 0.0443 0.0467 0.0515  0.055 0.0553  0.056 
##  11904   2358  16415   3978   7848   9650   8997   4226  10208    311 
## 0.0595 0.0651 0.0665 0.0789  0.084 0.0862 0.0928 0.1061 0.1127 0.1928 
##   2628   4705    367   1138   6122    552    358   2172   1623    465 
## 0.1944 0.2105 0.2294 
##     68    133    141
result <- data.frame(cbind(newScoreData[s1$ix, varToOutput]), score)
names(result) <- c(varToOutput, "score")
#write.csv(result, "cup98-predicted-score.csv", row.names=F)
