R语言k最近邻法(KNN)

读入数据,格式转化

> ws<-read.csv(choose.files())
> head(ws)
  Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1       2      3 12669 9656    7561    214             2674       1338
2       2      3  7057 9810    9568   1762             3293       1776
3       2      3  6353 8808    7684   2405             3516       7844
4       1      3 13265 1196    4221   6404              507       1788
5       2      3 22615 5410    7198   3915             1777       5185
6       2      3  9413 8259    5126    666             1795       1451
> > summary(ws)
    Channel          Region          Fresh             Milk          Grocery          Frozen        Detergents_Paper 
 Min.   :1.000   Min.   :1.000   Min.   :     3   Min.   :   55   Min.   :    3   Min.   :   25.0   Min.   :    3.0  
 1st Qu.:1.000   1st Qu.:2.000   1st Qu.:  3128   1st Qu.: 1533   1st Qu.: 2153   1st Qu.:  742.2   1st Qu.:  256.8  
 Median :1.000   Median :3.000   Median :  8504   Median : 3627   Median : 4756   Median : 1526.0   Median :  816.5  
 Mean   :1.323   Mean   :2.543   Mean   : 12000   Mean   : 5796   Mean   : 7951   Mean   : 3071.9   Mean   : 2881.5  
 3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.: 16934   3rd Qu.: 7190   3rd Qu.:10656   3rd Qu.: 3554.2   3rd Qu.: 3922.0  
 Max.   :2.000   Max.   :3.000   Max.   :112151   Max.   :73498   Max.   :92780   Max.   :60869.0   Max.   :40827.0  
   Delicassen     
 Min.   :    3.0  
 1st Qu.:  408.2  
 Median :  965.5  
 Mean   : 1524.9  
 3rd Qu.: 1820.2  
 Max.   :47943.0  
> colnames(ws)
[1] "Channel"          "Region"           "Fresh"            "Milk"             "Grocery"          "Frozen"          
[7] "Detergents_Paper" "Delicassen" 
> ws$Channel<-factor(ws$Channel,labels=c('horeca','retail'))
> ws$Region<-factor(ws$Region,labels=c('Lisbon','Oporto','Other Region'))
> dim(ws)
[1] 440   8

散点矩阵图

> plot(ws[,3:8],col=(ws$Region),pch=as.numeric(ws$Region))

R语言k最近邻法(KNN)_第1张图片

数据划分

> sampa<-sample(1:440,220)
> table(ws[sampa,2])

      Lisbon       Oporto Other Region 
          35           27          158 
> sampb<-(1:440)[-sampa]
> table(ws[sampb,2])

      Lisbon       Oporto Other Region 
          42           20          158 

预测

> library(class)
> ws.knn<-knn(ws[sampa,3:8],ws[sampb,3:8],ws[sampa,2]) #knn函数预测测试数据的类别(即ws[sampb,2])

交叉统计,计算误分类率。

> table(ws.knn,ws[sampb,2])
              
ws.knn         Lisbon Oporto Other Region
  Lisbon           11      4           27
  Oporto            6      4           21
  Other Region     29     14          104

> cr<-table(ws.knn,ws[sampb,2])
> cr
              
ws.knn         Lisbon Oporto Other Region
  Lisbon           11      4           27
  Oporto            6      4           21
  Other Region     29     14          104
> (sum(cr)-sum(diag(cr)))/sum(cr)
[1] 0.4590909

误差率较高。k 取值一般不大于记录总数的平方根,所以取k15,对k=3到k=20循环预测,比较误差率。

> sqrt(220)
[1] 14.8324
> mcrate<-numeric(20)
> mcrate
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> for (i in 3:20){ws.knn<-knn(ws[sampa,3:8],ws[sampb,3:8],ws[sampa,2],k=i)
+ cr<-table(ws.knn,ws[sampb,2])
+ mcrate[i]<-(sum(cr)-sum(diag(cr)))/sum(cr)}
> plot(3:20,mcrate[3:20],type='l')

R语言k最近邻法(KNN)_第2张图片

> which(mcrate==min(mcrate[3:15]))
[1] 11 13 14 15 16 17 18 19 20
> mcrate[which(mcrate==min(mcrate[3:15]))]
 [1] 0.3090909 0.3090909 0.3090909 0.3090909 0.3090909 0.3090909 0.3090909 0.3090909 0.3090909 0.3090909

变量的基准化和标准化

实际使用KNN法时需要对变量进行基准化或标准化。
**基准化:**使各变量在0-1之间取值。

> normalize<-function(x){return ((x-min(x))/(max(x)-min(x)))}
> ws.n[,3]<-normalize(ws.n[,3])
> summary(ws.n[,3])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00000 0.02786 0.07580 0.10698 0.15097 1.00000 
> for (i in 4:8){ws.n[,i]<-normalize(ws.n[,i])}
> ws.n.knn<-knn(ws.n[sampa,3:8],ws.n[sampb,3:8],ws.n[sampa,2],k=15)
> table(ws.n[sampb,2],ws.n.knn)
              ws.n.knn
               Lisbon Oporto Other Region
  Lisbon            0      0           43
  Oporto            0      0           22
  Other Region      0      1          154
> (sum(cr)-sum(diag(cr)))/sum(cr)
[1] 0.3

标准化:

> ws.z<-ws
> ws.z[,3]<-scale(ws.z[,3])
> c(mean(ws.z[,3]),sd(ws.z[,3]))
[1] -3.394982e-17  1.000000e+00
> for (i in 4:8){ws.z[,i]<-scale(ws.z[,i])}
> ws.z.knn<-knn(ws.z[sampa,3:8],ws.z[sampb,3:8],ws.z[sampa,2],k=15)
> cr<-table(ws.z.knn,ws.z[sampb,2])
> cr             
ws.z.knn       Lisbon Oporto Other Region
  Lisbon            0      0            0
  Oporto            0      0            0
  Other Region     43     22          155
> (sum(cr)-sum(diag(cr)))/sum(cr)
[1] 0.2954545

你可能感兴趣的:(数据分析及挖掘,R语言,数据分析)