R | Data Preprocessing, Part 2: Factor Types in Training and Test Data

Checking whether a factor has more than 10 levels

Dataset download: http://www.sigkdd.org/kddcup/index.php?section=1998&method=data

 

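1. Convert the attributes of the scoring data whose type differs from the training data to the training data's type (factor). See the R project "Customer Response Prediction and Profit Optimization".

1) Read the data to be scored:

> cup98val <- read.csv("F:\\R\\Rworkspace\\cup98lrn/cup98val.txt")

Warning message:

In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :

  embedded nul(s) found in input

> cup98val <- cup98val[, c("CONTROLN", varSet2)]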

> dim(cup98val)

[1] 96367    30

 

2) Find the attributes of the training data that are missing from the scoring data:

> trainNames <- names(cup98)

> scoreNames <- names(cup98val)

> idx <- which(trainNames %in% scoreNames)

> idx

 [1]  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

> print(trainNames[-idx])

[1] "TARGET_D"
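
The same comparison can also be written with `setdiff()`, which works on the names directly. It also sidesteps an edge case of negative indexing: if `which()` returns an empty vector, `trainNames[-idx]` selects nothing instead of everything. A minimal sketch, assuming two toy data frames (`train_toy` and `score_toy`, with made-up columns) in place of `cup98` and `cup98val`:

```r
# Toy stand-ins for the training and scoring data (hypothetical columns)
train_toy <- data.frame(CONTROLN = 1:3, AGE = c(30, 41, 57), TARGET_D = c(0, 5, 0))
score_toy <- data.frame(CONTROLN = 4:6, AGE = c(25, 62, 48))

# Columns present in the training data but absent from the scoring data
only_in_train <- setdiff(names(train_toy), names(score_toy))

# Columns present in the scoring data but absent from the training data
only_in_score <- setdiff(names(score_toy), names(train_toy))

print(only_in_train)
print(only_in_score)
```

Here `only_in_train` holds the target column and `only_in_score` is empty, which mirrors the situation in the transcript above.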

 

3) Make sure all factor levels are the same in the training and scoring data: every scoring attribute whose levels differ from the training data is re-encoded with the training data's factor levels.

> scoreData <- cup98val

> vars <- intersect(trainNames, scoreNames)

> for (i in 1:length(vars)) {

+   varname <- vars[i]

+   trainLevels <- levels(cup98[, varname])        # get the factor levels

+   scoreLevels <- levels(scoreData[, varname])

+   if (is.factor(cup98[, varname]) && !setequal(trainLevels, scoreLevels)) {   # a factor in training whose level set differs

+     cat("Warning: new values found in score data, and they will be changed to NA!\n")

+     cat(varname, "\n")

+     cat("train:", length(trainLevels), ", ", trainLevels, "\n")

+     cat("score:", length(scoreLevels), ", ", scoreLevels, "\n\n")

+     scoreData[, varname] <- factor(scoreData[, varname], levels = trainLevels)   # re-encode the scoring data with the training levels

+   }

+ }

Warning: new values found in score data, and they will be changed to NA!

GENDER

...
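
To make the effect of `factor(x, levels = trainLevels)` concrete, here is a small sketch on made-up values. The vectors `train_gender` and `score_gender` are hypothetical and only stand in for a column such as GENDER:

```r
# Toy example (made-up values): the scoring data contains a value "A"
# that never occurs in the training data
train_gender <- factor(c("F", "M", "F", "U"))
score_gender <- factor(c("M", "A", "F", "M"))

# Re-encode the scoring column with the training levels;
# values that are not among the training levels become NA
score_fixed <- factor(score_gender, levels = levels(train_gender))

print(levels(score_fixed))
print(sum(is.na(score_fixed)))  # how many values were turned into NA
```

Note that since R 4.0.0 `read.csv()` no longer converts strings to factors by default, so on a recent R the data would presumably need to be read with `stringsAsFactors = TRUE` for `levels()` to return anything.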

 

4) Count the levels of each factor variable:

> idx.cat <- which(sapply(cup98, is.factor))

> level_count <- sapply(names(idx.cat), function(x) nlevels(cup98[, x]))

> level_count

 OSOURCE    STATE      ZIP PVASTATE RECINHSE   MDMAUD   DOMAIN

     896       57    19938        3        2       28       17

...

> level_count[level_count > 10]    # variables whose factors have more than 10 levels

OSOURCE   STATE     ZIP  MDMAUD  DOMAIN

    896      57   19938      28      17
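
The level count can be reproduced on any data frame. A minimal sketch, assuming a toy data frame `toy` with made-up columns and a threshold of 2 instead of 10 so that the filter has something to show:

```r
# Toy data frame (hypothetical columns); stringsAsFactors = TRUE turns strings into factors
toy <- data.frame(
  STATE  = c("CA", "NY", "TX", "CA"),
  GENDER = c("F", "M", "F", "M"),
  AGE    = c(34, 51, 27, 45),
  stringsAsFactors = TRUE
)

# Keep only the factor columns and count the levels of each one
is_cat <- sapply(toy, is.factor)
level_count <- sapply(toy[is_cat], nlevels)
print(level_count)

# Names of the factors with more levels than the chosen threshold
threshold <- 2
print(names(level_count)[level_count > threshold])
```

Applying `sapply()` to the data frame itself rather than to column names gives the same named vector with one lookup less.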

 

 

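2. Make sure all factor levels are the same in the training and test data:

> train <- read.table("F:\\R\\Rworkspace\\RandomForest/train.csv", header = TRUE, sep = ",")

> test <- read.table("F:\\R\\Rworkspace\\RandomForest/test.csv", header = TRUE, sep = ",")

Note: the training and test sets come from different files, so make sure every factor in the test set has the same levels as in the training set. Otherwise, predicting on the test set with a model fitted on the training set will raise an error.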

> str(train_impute)

'data.frame':   891 obs. of 8 variables:

 $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...

 $ Pclass  : int  3 1 3 1 3 3 1 3 3 2 ...

 $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...

 $ Age     : num  22 38 26 35 35 ...

 $ SibSp   : int  1 1 0 1 0 0 0 3 0 1 ...

 $ Parch   : int  0 0 0 0 0 0 0 1 2 0 ...

 $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...

 $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

> str(test)

'data.frame':   418 obs. of 8 variables:

 $ Pclass  : int  3 3 2 3 3 3 3 2 3 3 ...

 $ Sex     : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...

 $ Age     : num  34.5 47 62 27 22 14 30 26 18 21 ...

 $ SibSp   : int  0 1 0 0 1 0 0 1 0 2 ...

 $ Parch   : int  0 0 0 0 1 0 0 1 0 0 ...

 $ Fare    : num  7.83 7 9.69 8.66 12.29 ...

 $ Embarked: Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...

 $ Id      : chr  "1" "2" "3" "4" ...               # this attribute does not exist in the training set

> levels(train_impute$Sex)

[1] "female" "male"

> levels(test$Sex)

[1] "female" "male"

> levels(test$Embarked)

[1] "C" "Q" "S"                                           # 3 factor levels

> levels(train_impute$Embarked)

[1] ""  "C" "Q" "S"                                       # 4 factor levels

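> test$Embarked <- factor(test$Embarked, levels = levels(train_impute$Embarked))    # re-encode the test factor with the training levels; levels(x) <- value would relabel the existing levels by position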

> levels(test$Embarked)

[1] ""  "C" "Q" "S"
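
One caveat worth checking when aligning levels: assigning to `levels()` relabels the existing levels by position rather than matching them by name, so the level vector looks right afterwards while the values have shifted. Re-encoding with `factor(x, levels = ...)` matches by value instead. A minimal sketch on made-up values, where `train_emb` and `test_emb` are hypothetical stand-ins for the two Embarked columns:

```r
# Toy stand-ins (made-up values) for the Embarked columns
train_emb <- factor(c("S", "C", "", "Q"))   # levels: "" "C" "Q" "S"
test_emb  <- factor(c("Q", "S", "C", "S"))  # levels: "C" "Q" "S"

# Positional relabelling: old level 1 ("C") becomes "", level 2 ("Q") becomes "C", and so on
relabelled <- test_emb
levels(relabelled) <- levels(train_emb)

# Re-encoding by value: each observation keeps its original label
recoded <- factor(test_emb, levels = levels(train_emb))

print(as.character(test_emb))    # original values
print(as.character(relabelled))  # values after positional relabelling
print(as.character(recoded))     # values after re-encoding

# Both results report the same level set, so checking levels() alone hides the difference
print(identical(levels(relabelled), levels(recoded)))
```

Comparing `as.character()` of the column before and after the change is a cheap way to confirm that no observation was silently renamed.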
