读入数据,格式转化
> ws<-read.csv(choose.files())
> head(ws)
Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 2 3 12669 9656 7561 214 2674 1338
2 2 3 7057 9810 9568 1762 3293 1776
3 2 3 6353 8808 7684 2405 3516 7844
4 1 3 13265 1196 4221 6404 507 1788
5 2 3 22615 5410 7198 3915 1777 5185
6 2 3 9413 8259 5126 666 1795 1451
> > summary(ws)
Channel Region Fresh Milk Grocery Frozen Detergents_Paper
Min. :1.000 Min. :1.000 Min. : 3 Min. : 55 Min. : 3 Min. : 25.0 Min. : 3.0
1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 3128 1st Qu.: 1533 1st Qu.: 2153 1st Qu.: 742.2 1st Qu.: 256.8
Median :1.000 Median :3.000 Median : 8504 Median : 3627 Median : 4756 Median : 1526.0 Median : 816.5
Mean :1.323 Mean :2.543 Mean : 12000 Mean : 5796 Mean : 7951 Mean : 3071.9 Mean : 2881.5
3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.: 16934 3rd Qu.: 7190 3rd Qu.:10656 3rd Qu.: 3554.2 3rd Qu.: 3922.0
Max. :2.000 Max. :3.000 Max. :112151 Max. :73498 Max. :92780 Max. :60869.0 Max. :40827.0
Delicassen
Min. : 3.0
1st Qu.: 408.2
Median : 965.5
Mean : 1524.9
3rd Qu.: 1820.2
Max. :47943.0
> colnames(ws)
[1] "Channel" "Region" "Fresh" "Milk" "Grocery" "Frozen"
[7] "Detergents_Paper" "Delicassen"
> ws$Channel<-factor(ws$Channel,labels=c('horeca','retail'))
> ws$Region<-factor(ws$Region,labels=c('Lisbon','Oporto','Other Region'))
> dim(ws)
[1] 440 8
散点矩阵图
> plot(ws[,3:8],col=(ws$Region),pch=as.numeric(ws$Region))
数据划分
> sampa<-sample(1:440,220)
> table(ws[sampa,2])
Lisbon Oporto Other Region
35 27 158
> sampb<-(1:440)[-sampa]
> table(ws[sampb,2])
Lisbon Oporto Other Region
42 20 158
预测
> library(class)
> ws.knn<-knn(ws[sampa,3:8],ws[sampb,3:8],ws[sampa,2]) #knn函数预测测试数据的类别(即ws[sampb,2])
交叉统计,计算误分类率。
> table(ws.knn,ws[sampb,2])
ws.knn Lisbon Oporto Other Region
Lisbon 11 4 27
Oporto 6 4 21
Other Region 29 14 104
> cr<-table(ws.knn,ws[sampb,2])
> cr
ws.knn Lisbon Oporto Other Region
Lisbon 11 4 27
Oporto 6 4 21
Other Region 29 14 104
> (sum(cr)-sum(diag(cr)))/sum(cr)
[1] 0.4590909
误差率较高。k 取值一般不大于记录总数的平方根,所以取k15,对k=3到k=20循环预测,比较误差率。
> sqrt(220)
[1] 14.8324
> mcrate<-numeric(20)
> mcrate
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> for (i in 3:20){ws.knn<-knn(ws[sampa,3:8],ws[sampb,3:8],ws[sampa,2],k=i)
+ cr<-table(ws.knn,ws[sampb,2])
+ mcrate[i]<-(sum(cr)-sum(diag(cr)))/sum(cr)}
> plot(3:20,mcrate[3:20],type='l')
> which(mcrate==min(mcrate[3:15]))
[1] 11 13 14 15 16 17 18 19 20
> mcrate[which(mcrate==min(mcrate[3:15]))]
[1] 0.3090909 0.3090909 0.3090909 0.3090909 0.3090909 0.3090909 0.3090909 0.3090909 0.3090909 0.3090909
实际使用KNN法时需要对变量进行基准化或标准化。
**基准化:**使各变量在0-1之间取值。
> normalize<-function(x){return ((x-min(x))/(max(x)-min(x)))}
> ws.n[,3]<-normalize(ws.n[,3])
> summary(ws.n[,3])
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.02786 0.07580 0.10698 0.15097 1.00000
> for (i in 4:8){ws.n[,i]<-normalize(ws.n[,i])}
> ws.n.knn<-knn(ws.n[sampa,3:8],ws.n[sampb,3:8],ws.n[sampa,2],k=15)
> table(ws.n[sampb,2],ws.n.knn)
ws.n.knn
Lisbon Oporto Other Region
Lisbon 0 0 43
Oporto 0 0 22
Other Region 0 1 154
> (sum(cr)-sum(diag(cr)))/sum(cr)
[1] 0.3
标准化:
> ws.z<-ws
> ws.z[,3]<-scale(ws.z[,3])
> c(mean(ws.z[,3]),sd(ws.z[,3]))
[1] -3.394982e-17 1.000000e+00
> for (i in 4:8){ws.z[,i]<-scale(ws.z[,i])}
> ws.z.knn<-knn(ws.z[sampa,3:8],ws.z[sampb,3:8],ws.z[sampa,2],k=15)
> cr<-table(ws.z.knn,ws.z[sampb,2])
> cr
ws.z.knn Lisbon Oporto Other Region
Lisbon 0 0 0
Oporto 0 0 0
Other Region 43 22 155
> (sum(cr)-sum(diag(cr)))/sum(cr)
[1] 0.2954545