Data preprocessing (mostly from lecture 2)

I. Data preprocessing (Data Mining lec 01 and 02)

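1. Selecting columns: select() returns only the specified columns

library(dplyr)
sub_data <- select(iris, Sepal.Length, Sepal.Width)

The first argument is the data source; the remaining arguments are the columns to keep.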
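
As a small sketch of column selection on the built-in iris data: the starts_with() helper and the minus-sign form are dplyr features that these notes do not cover, so treat them as illustrative extras rather than part of the lecture.

```r
library(dplyr)

# Keep two columns by name.
sub_data <- select(iris, Sepal.Length, Sepal.Width)

# Keep every column whose name starts with "Sepal".
sepal_cols <- select(iris, starts_with("Sepal"))

# Drop a single column by prefixing it with a minus sign.
no_species <- select(iris, -Species)

names(sub_data)
names(no_species)
```

Both sub_data and sepal_cols end up with the same two columns here, because iris has exactly two columns whose names begin with "Sepal".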

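2. Filtering rows: filter() returns only the rows that satisfy the given conditions

filter(iris, Species == "setosa")

The first argument is the data source; the remaining arguments are the filter conditions. Conditions separated by commas must all hold (AND); use "|" for OR, for example between two values of the same attribute.
e.g. filter(iris, Species == "setosa" | Species == "versicolor", Sepal.Length > 4.5)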
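
To make the AND/OR behaviour concrete, here is a minimal sketch on iris. The %in% form is an equivalent way to write the OR over one attribute; it is not in the original notes and is shown only for comparison.

```r
library(dplyr)

# Comma-separated conditions are combined with AND.
a <- filter(iris,
            Species == "setosa" | Species == "versicolor",
            Sepal.Length > 4.5)

# The same filter written with %in% and an explicit &.
b <- filter(iris,
            Species %in% c("setosa", "versicolor") & Sepal.Length > 4.5)

nrow(a) == nrow(b)
```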

3. Grouping: group_by() groups the data source by a given attribute
g_a <- group_by(iris, Species)  ## group the iris data by species
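
group_by() on its own only records the grouping; the effect shows up when a later verb works per group. A minimal sketch, assuming the usual follow-up is summarise() (the column names n and mean_sepal_length are made up for this example):

```r
library(dplyr)

g_a <- group_by(iris, Species)

# One output row per species: the group size and the mean sepal length.
by_species <- summarise(g_a,
                        n = n(),
                        mean_sepal_length = mean(Sepal.Length))
by_species
```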

4. Sampling: sample_n() and sample_frac()
e.g. sample_n(iris, 15, replace = TRUE)  ## draw 15 rows, with replacement
e.g. sample_frac(iris, 0.1, replace = FALSE)  ## draw 10% of the rows (15 of the 150 in iris), without replacement
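
The difference between the two calls is only how the sample size is given: a row count for sample_n(), a fraction of the rows for sample_frac(). A short sketch on iris, with an arbitrary seed added so the draw can be repeated:

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only to make the draw repeatable

# 15 rows drawn with replacement: the same row may appear more than once.
s_n <- sample_n(iris, 15, replace = TRUE)

# 10% of the 150 rows drawn without replacement: each row appears at most once.
s_frac <- sample_frac(iris, 0.1, replace = FALSE)

nrow(s_n)
nrow(s_frac)
```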

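5. Remove columns with zero or near-zero variance (a zero-variance column has the same value in every row)

# remove some variables: low variance
library(caret)
zerovar <- nearZeroVar(mydata)
# guard: mydata[, -integer(0)] would drop every column
newdata1 <- if (length(zerovar) > 0) mydata[, -zerovar] else mydata
summary(mydata[, 31:34])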
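
The following sketch shows nearZeroVar() on a tiny made-up data frame (toy and its column names are invented for illustration; they are not the lecture's mydata). It also shows why an empty result needs a guard before negative indexing.

```r
library(caret)

# Toy data: `constant` has a single value, so its variance is zero.
toy <- data.frame(
  x = c(1.2, 3.4, 2.2, 5.1, 4.8, 0.7),
  y = c(10, 12, 9, 15, 14, 8),
  constant = rep(1, 6)
)

zerovar <- nearZeroVar(toy)  # indices of the columns to drop
names(toy)[zerovar]

# With an empty result, toy[, -zerovar] would select no columns at all.
cleaned <- if (length(zerovar) > 0) toy[, -zerovar, drop = FALSE] else toy
names(cleaned)
```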

6. Remove rows that contain missing values

# remove rows with NA
flights_and_weather <- na.omit(flights_and_weather)
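
Since na.omit() drops a whole row as soon as any column is missing, it helps to check how many rows are lost. A minimal sketch on a made-up data frame (df and its columns are hypothetical stand-ins for flights_and_weather):

```r
# Hypothetical data with one missing value in each of two different rows.
df <- data.frame(
  id   = 1:4,
  temp = c(20.5, NA, 18.0, 22.1),
  wind = c(3, 5, NA, 4)
)

colSums(is.na(df))         # missing values per column

complete <- na.omit(df)    # keeps only rows with no NA in any column
nrow(df) - nrow(complete)  # number of rows dropped
```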

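7. Remove redundant columns (columns that are linear combinations of, or highly correlated with, other columns and so add no new information)

① First pass: exact linear combinations

# remove some variables: linear combinations
comboInfo <- findLinearCombos(newdata1)
# $remove is NULL when nothing is redundant, so guard the negative index
newdata2 <- if (length(comboInfo$remove) > 0) newdata1[, -comboInfo$remove] else newdata1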
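
To see what findLinearCombos() flags, the sketch below builds a made-up data frame in which the third column is exactly the sum of the first two (toy and its column names are invented for this example).

```r
library(caret)

set.seed(1)  # arbitrary seed
a <- rnorm(20)
b <- rnorm(20)
toy <- data.frame(a = a, b = b, a_plus_b = a + b)

comboInfo <- findLinearCombos(as.matrix(toy))
comboInfo$remove  # indices of columns that can be dropped

reduced <- if (length(comboInfo$remove) > 0) {
  toy[, -comboInfo$remove, drop = FALSE]
} else {
  toy
}
ncol(reduced)
```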

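② Second pass (you choose the correlation cutoff)

# remove some variables: high correlation
descrCorr <- cor(newdata2)  # computed on newdata2 so the indices match the data frame they are applied to
highCorr <- findCorrelation(descrCorr, 0.90)
# cor(mydata[,5],mydata[,12])
# cor(mydata[,12],mydata[,13])
newdata3 <- if (length(highCorr) > 0) newdata2[, -highCorr] else newdata2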
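
A small sketch of the correlation filter on made-up data: x2 is built as a noisy copy of x1, so the pair exceeds the 0.90 cutoff and findCorrelation() suggests dropping one of the two. All names here are invented for illustration.

```r
library(caret)

set.seed(1)  # arbitrary seed
x1 <- rnorm(100)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(100, sd = 0.05),  # almost a copy of x1
  x3 = rnorm(100)                   # unrelated column
)

descrCorr <- cor(toy)
highCorr <- findCorrelation(descrCorr, cutoff = 0.90)
names(toy)[highCorr]  # one of x1 / x2

reduced <- if (length(highCorr) > 0) toy[, -highCorr, drop = FALSE] else toy
names(reduced)
```

The correlation matrix and the data frame being indexed must have the same columns in the same order; otherwise the returned indices point at the wrong columns.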

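The remaining preprocessing steps are all from Data Mining lec 02. I have not fully understood them yet, so they are listed here as placeholders to fill in later.
8. The preProcess function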


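# normalization / missing-value imputation / feature reduction
# scaling:    preProcess(data, method = "range") or preProcess(data, method = c("center", "scale"))
# imputation: preProcess(data, method = "knnImpute") or preProcess(data, method = "medianImpute")

# rescale every column to [0, 1]
Process <- preProcess(newdata3, method = "range")
newdata4 <- predict(Process, newdata3)
summary(as.data.frame(newdata4[, 1:6]))

# alternative: center to mean 0 and scale to standard deviation 1 (this overwrites newdata4)
Process <- preProcess(newdata3, method = c("center", "scale"))
newdata4 <- predict(Process, newdata3)
summary(as.data.frame(newdata4[, 1:6]))

# replace missing values with the column median
Process <- preProcess(newdata4, method = "medianImpute")
newdata4 <- predict(Process, newdata4)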
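
preProcess() only estimates the transformation; predict() is what applies it. The sketch below walks through the three methods used above on a tiny made-up data frame (toy, height and weight are invented), imputing first so the scaling steps see complete data. That ordering is a choice made for this example, not something the lecture prescribes.

```r
library(caret)

toy <- data.frame(
  height = c(150, 160, 170, 180, 190),
  weight = c(50, NA, 70, 80, 90)
)

# Fill the missing weight with the median of the observed weights.
imp <- preProcess(toy, method = "medianImpute")
filled <- predict(imp, toy)

# Rescale every column to the interval [0, 1].
rng <- preProcess(filled, method = "range")
scaled01 <- predict(rng, filled)

# Center to mean 0 and scale to standard deviation 1.
std <- preProcess(filled, method = c("center", "scale"))
standardized <- predict(std, filled)

summary(scaled01)
```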

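9. K-fold cross-validation

# K-fold
TenFold <- createFolds(newdata4$AMW, k = 10, list = TRUE, returnTrain = FALSE)  # row indices of each held-out fold
TenFold2 <- createMultiFolds(newdata4$AMW, k = 10, times = 2)  # 10-fold repeated twice; these are training indices
Fold_1 <- newdata4[TenFold$Fold01, ]   # rows in the first fold (held out)
Fold_2 <- newdata4[-TenFold$Fold01, ]  # all remaining rows (the other nine folds)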
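
As a sketch of how the fold list is typically used, the loop below visits each fold once, treating it as the held-out set and the rest as training data. The data frame dat is made up and stands in for newdata4; the model fitting itself is left as a comment.

```r
library(caret)

set.seed(1)  # arbitrary seed
dat <- data.frame(y = rnorm(100), x = rnorm(100))  # stand-in for newdata4

folds <- createFolds(dat$y, k = 10, list = TRUE, returnTrain = FALSE)

# Each list element holds the row indices of one held-out fold.
sizes <- sapply(folds, length)
sizes

for (i in seq_along(folds)) {
  held_out <- dat[folds[[i]], ]
  training <- dat[-folds[[i]], ]
  # fit a model on `training` and evaluate it on `held_out` here
}
```

Every row lands in exactly one fold, so each observation is used for evaluation once and for training nine times.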

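10. Data separation (train/test split)

# data separation
inTrain <- createDataPartition(newdata4$AMW, p = 3/4, list = FALSE)  # about 75% of the rows for training
trainx <- newdata4[inTrain, ]
testx <- newdata4[-inTrain, ]
# mdrrClass is the label vector; its order must still match the rows of newdata4
trainy <- mdrrClass[inTrain]
testy <- mdrrClass[-inTrain]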
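
A minimal sketch of the same split on made-up data. Note one deliberate difference: here the partition is stratified on the class labels instead of on a feature column such as AMW, which keeps the class proportions similar in both parts. The names features and labels are hypothetical.

```r
library(caret)

set.seed(1)  # arbitrary seed
labels <- factor(rep(c("Active", "Inactive"), times = c(60, 40)))
features <- data.frame(f1 = rnorm(100), f2 = rnorm(100))

inTrain <- createDataPartition(labels, p = 3/4, list = FALSE)

trainx <- features[inTrain, ]
testx  <- features[-inTrain, ]
trainy <- labels[inTrain]
testy  <- labels[-inTrain]

table(trainy)
table(testy)
```

The same index vector is applied to both the features and the labels, which is what keeps each row paired with its own label.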
