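Checking whether factors have more than 10 levels
Dataset download: http://www.sigkdd.org/kddcup/index.php?section=1998&method=data
1. Convert attributes whose type in the scoring data differs from the training data to the training data's type (factor). See the R project "Customer Response Prediction and Profit Maximization".
1) Read the data to be scored: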
> cup98val <- read.csv("F:\\R\\Rworkspace\\cup98lrn/cup98val.txt")
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
embedded nul(s) found in input
> cup98val <- cup98val[, c("CONTROLN", varSet2)]
> dim(cup98val)
[1] 96367 30
2) Find the attributes of the training data that are missing from the scoring data:
> trainNames <- names(cup98)
> scoreNames <- names(cup98val)
> idx <- which(trainNames %in% scoreNames)
> idx
[1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
> print(trainNames[-idx])
[1] "TARGET_D"
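The same comparison can be written with set operations on the column names, which also catches columns that exist only in the scoring data. A minimal sketch on two toy data frames (`train_df`, `score_df` and their contents are made up for illustration, not taken from the KDD Cup files):

```r
# Toy stand-ins for the training and scoring data (hypothetical values)
train_df <- data.frame(CONTROLN = 1:3,
                       GENDER = c("F", "M", "F"),
                       TARGET_D = c(0, 5, 0))
score_df <- data.frame(CONTROLN = 4:5,
                       GENDER = c("M", "F"))

# Columns in the training data that the scoring data lacks
missing_in_score <- setdiff(names(train_df), names(score_df))
# Columns in the scoring data that the training data lacks
extra_in_score <- setdiff(names(score_df), names(train_df))

print(missing_in_score)
print(extra_in_score)
```

Unlike indexing with `-idx`, `setdiff` still behaves sensibly when no names match: `x[-integer(0)]` returns an empty vector rather than all of `x`.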
3) Make sure every factor has the same levels in the training and scoring data: any factor in the scoring data whose levels differ is re-coded with the training levels.
> scoreData <- cup98val
> vars <- intersect(trainNames, scoreNames)
> for (i in 1:length(vars)) {
+ varname <- vars[i]
+ trainLevels <- levels(cup98[, varname]) # factor levels in the training data
+ scoreLevels <- levels(scoreData[, varname])
+ if (is.factor(cup98[, varname]) && !setequal(trainLevels, scoreLevels)) { # a factor whose level sets differ
+ cat("Warning: new values found in score data, and they will be changed to NA!\n")
+ cat(varname, "\n")
+ cat("train:", length(trainLevels), ", ", trainLevels, "\n")
+ cat("score:", length(scoreLevels), ", ", scoreLevels, "\n\n")
+ scoreData[, varname] <- factor(scoreData[, varname], levels = trainLevels) # re-code with the training levels
+ }
+ }
Warning: new values found in score data, and they will be changed to NA!
GENDER
...
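The loop above can be wrapped in a small reusable function. This is only a sketch: the name `align_factor_levels` and the toy data are hypothetical, and it silently turns unseen values into `NA` instead of printing a warning.

```r
# Re-code every shared factor column in `score` with the levels found in `train`.
# Values that do not occur among the training levels become NA.
align_factor_levels <- function(train, score) {
  for (v in intersect(names(train), names(score))) {
    if (is.factor(train[[v]])) {
      # as.character() first, so this also works if the score column is not a factor
      score[[v]] <- factor(as.character(score[[v]]), levels = levels(train[[v]]))
    }
  }
  score
}

# Toy data (hypothetical): "U" occurs only in the scoring data
train_df <- data.frame(GENDER = factor(c("F", "M", "F")), AGE = c(30, 40, 50))
score_df <- data.frame(GENDER = factor(c("M", "U", "F")), AGE = c(25, 35, 45))

aligned <- align_factor_levels(train_df, score_df)
print(levels(aligned$GENDER))
print(is.na(aligned$GENDER))
```

Only the second row, which held the unseen value "U", ends up as `NA`; the other rows keep their original values.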
4) Count the levels of each factor:
> idx.cat <- which(sapply(cup98, is.factor))
> level_count <- sapply(names(idx.cat), function(x) nlevels(cup98[, x]))
> level_count
 OSOURCE    STATE      ZIP PVASTATE RECINHSE   MDMAUD   DOMAIN
     896       57    19938        3        2       28       17
...
> level_count[level_count > 10] # variables with more than 10 factor levels
OSOURCE   STATE     ZIP  MDMAUD  DOMAIN
    896      57   19938      28      17
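The level count can also be computed directly on the factor columns, without going through column indices. A minimal sketch on a made-up data frame (`df` and its columns are hypothetical; the threshold of 10 is the one used above):

```r
# Toy data frame: one factor with 26 levels, one with 2, and one integer column
df <- data.frame(
  id = 1:26,
  letter = factor(letters),
  group = factor(rep(c("a", "b"), 13))
)

# Keep only the factor columns, then count the levels of each
level_count <- sapply(Filter(is.factor, df), nlevels)
print(level_count)

# Names of the factors with more than 10 levels
high_card <- names(level_count[level_count > 10])
print(high_card)
```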
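2. Make sure every factor has the same levels in the training and test sets:
> train <- read.table("F:\\R\\Rworkspace\\RandomForest/train.csv", header = TRUE, sep = ",")
> test <- read.table("F:\\R\\Rworkspace\\RandomForest/test.csv", header = TRUE, sep = ",")
Note: the training and test sets come from different files, so check that each factor has the same levels in both. Otherwise, predicting on the test set with a model fitted on the training set raises an error.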
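A failure of this kind can be reproduced without the Titanic files. The sketch below is an illustration under assumptions: it swaps in base R's `lm` because the original does not show the model-fitting call, the data are made up, and the exact error text depends on the model function and the locale.

```r
# Toy data (hypothetical): the test set contains a level "c" never seen in training
train_df <- data.frame(y = c(1, 2, 3, 4), f = factor(c("a", "b", "a", "b")))
test_df <- data.frame(f = factor(c("a", "c")))

fit <- lm(y ~ f, data = train_df)

# predict() stops with an error because of the unseen level; capture the message
result <- tryCatch(
  predict(fit, newdata = test_df),
  error = function(e) conditionMessage(e)
)
print(result)
```

Other model functions may be stricter; some also reject test factors whose level set is merely smaller than the training one, which is the situation with `Embarked` here.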
> str(train_impute)
'data.frame': 891 obs. of 8 variables:
$ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
> str(test)
'data.frame': 418 obs. of 8 variables:
$ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
$ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
$ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
$ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
$ Fare : num 7.83 7 9.69 8.66 12.29 ...
$ Embarked: Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
$ Id : chr "1" "2" "3" "4" … # this attribute is not in the training set
> levels(train_impute$Sex)
[1] "female" "male"
> levels(test$Sex)
[1] "female" "male"
> levels(test$Embarked)
[1] "C" "Q" "S" # 3 levels
> levels(train_impute$Embarked)
[1] "" "C" "Q" "S" # 4 levels
> test$Embarked <- factor(test$Embarked, levels = levels(train_impute$Embarked)) # give the test factor the training levels; assigning to levels(test$Embarked) directly would rename "C", "Q", "S" to "", "C", "Q"
> levels(test$Embarked)
[1] "" "C" "Q" "S"
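Note that assigning a longer vector to `levels(x)` renames the existing levels by position instead of adding unused ones, whereas `factor(x, levels = ...)` keeps each observation's value. A small sketch with made-up values (the vector `embarked` is hypothetical, not read from test.csv):

```r
embarked <- factor(c("Q", "S", "C"))   # levels: "C", "Q", "S"
train_levels <- c("", "C", "Q", "S")   # level set of the training factor

# Positional renaming: "C" -> "", "Q" -> "C", "S" -> "Q"
relabelled <- embarked
levels(relabelled) <- train_levels
print(as.character(relabelled))

# Re-coding: the values stay the same, only the level set is extended
recoded <- factor(embarked, levels = train_levels)
print(as.character(recoded))
print(levels(recoded))
```

Both results report the same four levels, so checking `levels()` alone does not reveal the corruption; comparing the values, for example with `table()`, before and after does.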