【R语言】白葡萄酒的EDA分析

白葡萄酒的EDA分析

  • 1.项目相关信息
    • 1.1 评估标准
    • 1.2 项目模板
    • 1.3 数据集列表
    • 1.4 项目示例
    • 1.5 数据选择
      • 1.5.1 选择
      • 1.5.2 详细数据说明
      • 1.5.3 有关项目提交的常见问题
  • 2.环境准备
    • 2.1 导入相关包
    • 2.2 加载数据集
    • 2 数据整理
    • 2.1 数据评估
      • 2.1.1 质量类问题
      • 2.1.2 结构性问题
    • 2.1 数据清洗
      • 2.1.1 质量类问题整理
      • 2.1.2 结构性问题整理
  • 3.数据探索
    • 3.1 数据概要认识
      • 3.1.1 数据集介绍
      • 3.1.2 数据字段概要描述
      • 3.1.3 数据字段类型
      • 3.1.4 数据整理
    • 3.2 单变量分析
      • 3.2.1 fixed.acidity:非挥发性酸度
        • 3.2.1.1 histogram分布
        • 3.2.1.2 箱型图
      • 3.2.2 volatile.acidity:挥发性酸度
        • 3.2.2.1 histogram分布
        • 3.2.2.2 箱型图
      • 3.2.3 citric.acid:柠檬酸
        • 3.2.3.1 histogram分布
        • 3.2.3.2 箱型图
      • 3.2.4 residual.sugar:剩余糖分
      • 3.2.5 chlorides:含盐量
      • 3.2.6 free.sulfur.dioxide:游离二氧化硫
      • 3.2.7 total.sulfur.dioxide:总二氧化硫
      • 3.2.8 density:密度
      • 3.2.9 pH:酸碱度
      • 3.2.10 sulphates:硫酸盐
      • 3.2.11 alcohol:酒精浓度
      • 3.2.12 quality:质量评分
      • 3.2.13 反思
        • 反思问题1
    • 3.3 分析两个变量
      • 3.3.1 酸度与pH值
        • 3.3.1.1 pH与各种酸的关系
      • 3.3.2 酸度与成分
        • 3.3.2.2 citric.acid与fixed.acidity,volatile.acidity之间关系
        • 3.3.2.3 sulphates,free.sulfur.dioxide,total.sulfur.dioxide三者的关系
      • 3.3.3 白葡萄酒密度density
        • 3.3.3.1 density与alcohol,sulphates,citric.acid,total.sulfur.dioxide的关系
      • 3.3.4 质量评分
      • 3.3.5 散点矩阵图分析
        • 3.3.5.1 散点矩阵图观察
        • 3.3.5.1 free.sulfur.dioxide与total.sulfur.dioxide的关系
      • 3.3.6 反思
    • 3.4 分析多个变量
      • 3.4.1 total.sulfur.dioxide+free.sulfur.dioxide对quality评分的影响
      • 3.4.2 total.sulfur.dioxide+density对quality评分的影响
      • 3.4.3 density+residual.sugar对quality评分的影响
      • 3.4.4 density+alcohol对quality评分的影响
      • 3.4.5 quality评分预测模型
  • 4 最终结论与反思
    • 4.1 结论
      • 4.1.1 图1
        • 4.1.1.1 绘图代码
        • 4.1.1.2 定稿图
        • 4.1.1.3 结论描述
      • 4.1.2 图2
        • 4.1.2.1 绘图代码
        • 4.1.2.2 定稿图
        • 4.1.2.3 结论描述
      • 4.1.3 图3
        • 4.1.3.1 绘图代码
        • 4.1.3.2 定稿图
        • 4.1.3.3 结论描述
    • 4.2 反思
      • 4.2.1 数据分析与模型预测的关系
    • 4.3 写在结尾

1.项目相关信息

1.1 评估标准

1.2 项目模板

1.3 数据集列表

1.4 项目示例

大西洋飓风气象学
美国音乐家地理学
克里斯·萨登的《钻石探险》

1.5 数据选择

1.5.1 选择

白葡萄酒质量数据

1.5.2 详细数据说明

1.5.3 有关项目提交的常见问题

· 数据处理或转换(创建分类变量)应包含在 RMD 文件和最终拼合的 HTML 输出中。
· 未包含反思部分。
· 反思部分不是 RMD 文件中的最后一个部分。
· 未包含最终图形部分。
· 最终图形部分未处于 RMD 文件的结尾(在反思部分之前)。
· 最终图形部分未包含三个图形。
· 最终图形部分中的一个或多个图形未揭示出数据集的结果或模式。
· 最终图形未经修饰,并且缺失标题或单位。
· 为最终图形部分中的数据选择了不合适的图形。

2.环境准备

2.1 导入相关包

2.2 加载数据集

2 数据整理

2.1 数据评估

2.1.1 质量类问题

2.1.2 结构性问题

2.1 数据清洗

2.1.1 质量类问题整理

2.1.2 结构性问题整理

3.数据探索

3.1 数据概要认识

3.1.1 数据集介绍

##  [1] "rows=4898"
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## [1] 6 5 7 8 4 3 9

数据集包含4898种⽩葡萄酒,及11个量化每种酒化学成分的变量。
⾄少3名葡萄酒专家对每种酒的质量进⾏了评分,分数在 0(⾮常差)和 10(⾮常好)之间。

nrow(wineQualityWhites)  
names(wineQualityWhites)  
unique(wineQualityWhites$quality)

3.1.2 数据字段概要描述

· fixed.acidity(g/l):非挥发性酸度, 该变量指的是葡萄酒中的固定或者非挥发性酸度
· volatile.acidity(g/l):挥发性酸度,葡萄酒中的醋酸含量过高,会导致醋的味道不愉快
· citric.acid(g/l):柠檬酸,柠檬酸含量小,能给葡萄酒增添新鲜感和风味
· residual.sugar(g/l):剩余糖分,发酵结束后剩下的糖分,很少发现低于1克/升的葡萄酒,超过45克/升的葡萄酒被认为是甜的
· chlorides(g/l):酒中的盐量
· free.sulfur.dioxide(mg/l):游离二氧化硫, 酒中带硫元素的离子,它可以防止微生物的生长和葡萄酒的氧化
· total.sulfur.dioxide(mg/l): 总二氧化硫,低浓度时检测不到,当浓度超过50 ppm时用鼻子可以闻到
· density(g/ml):密度,大致接近于水,具体取决于酒精和糖的含量
· pH:用于描述酒的酸碱度
· sulphates(g/l):硫酸盐,葡萄酒的添加剂,用于控制二氧化硫比例
· alcohol(% by volume):酒中的酒精浓度
· quality:酒的质量,从0到10分不等

3.1.3 数据字段类型

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

总计13个(包括index)字段:
11个num类型字段: fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol
2个int类型字段: index, quality

str(wineQualityWhites)

3.1.4 数据整理

## [1] 6 5 7 8 4 3 9

quality评分是一个在3~9之间取值离散变量,应该转换为factor类型。
另外,’X’变量是索引,在分析中没有任何意义,提前删除。

unique(wineQualityWhites$quality)
#将原有评分复制为quality_score
wineQualityWhites$quality_score = wineQualityWhites$quality
#将quality转换为factor类型
wineQualityWhites$quality = factor(wineQualityWhites$quality)
#去除在分析中不用的索引字段
wineQualityWhites = subset(wineQualityWhites, select = -c(X))

3.2 单变量分析

3.2.1 fixed.acidity:非挥发性酸度

3.2.1.1 histogram分布

ggplot(data = wineQualityWhites, aes(x=fixed.acidity))+  
       geom_histogram(binwidth = 0.2)+  
       scale_x_continuous(limits = c(0, 15), breaks = seq(0, 15, 1))

变量fixed.acidity呈正态分布,集中分布在[4, 10]这个取值区间内
【R语言】白葡萄酒的EDA分析_第1张图片

3.2.1.2 箱型图

p1 = ggplot(data=wineQualityWhites, aes(x="", y=fixed.acidity))+
     geom_boxplot()
p2 = ggplot(data=wineQualityWhites, aes(x="", y=fixed.acidity))+
     geom_boxplot()+
     coord_cartesian(ylim = c(6, 8))
grid.arrange(p1, p2, ncol=2)

从箱型图中,可以看出变量fixed.acidity超过90%的值分布在[6.0, 7.5]区间内
【R语言】白葡萄酒的EDA分析_第2张图片

3.2.2 volatile.acidity:挥发性酸度

3.2.2.1 histogram分布

p1 = ggplot(data=wineQualityWhites, aes(x=volatile.acidity))+
     geom_histogram(binwidth = 0.2)+
     scale_x_continuous(limits = c(0, 15), breaks = seq(0, 15, 1))
p2 = ggplot(data=wineQualityWhites, aes(x=volatile.acidity))+
     geom_histogram(binwidth = 0.05)+
     scale_x_log10()
grid.arrange(p1, p2, ncol=2)

变量volatile.acidity呈长尾分布,通过log10转换后呈正太分布,主要集中分布在[0.1, 1]这个取值区间内
【R语言】白葡萄酒的EDA分析_第3张图片

3.2.2.2 箱型图

p1 = ggplot(data=wineQualityWhites, aes(x="", y=volatile.acidity))+
  	 geom_boxplot()
p2 = ggplot(data=wineQualityWhites, aes(x="", y=volatile.acidity))+
 	 geom_boxplot()+
  	 coord_cartesian(ylim = c(0.2, 0.35))
grid.arrange(p1, p2, ncol=2)

从箱型图中,可以看出变量volatile.acidity超过90%的值分布在[0.2, 0.35]区间内
【R语言】白葡萄酒的EDA分析_第4张图片

3.2.3 citric.acid:柠檬酸

3.2.3.1 histogram分布

ggplot(data=wineQualityWhites, aes(x=citric.acid))+
 	   geom_histogram(binwidth = 0.1)

可见大部分白葡萄酒的柠檬酸含量偏低,主要集中在[0,1.0]这个区间内。柠檬酸含量超过1.0的白葡萄酒极少
【R语言】白葡萄酒的EDA分析_第5张图片

3.2.3.2 箱型图

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600
summary(wineQualityWhites$citric.acid)
p1 = ggplot(data=wineQualityWhites, aes(x="", y=citric.acid))+
 			geom_boxplot()
p2 = ggplot(data=wineQualityWhites, aes(x="", y=citric.acid))+
  			geom_boxplot()+
  			coord_cartesian(ylim = c(0.25, 0.4))
grid.arrange(p1, p2, ncol=2)

变量citric.acid均值为0.3342,从箱型图可以看出,超过90%的值分布在[0.25, 0.4]区间内
【R语言】白葡萄酒的EDA分析_第6张图片

3.2.4 residual.sugar:剩余糖分

summary(wineQualityWhites$residual.sugar)
p1 = ggplot(data=wineQualityWhites, aes(x=residual.sugar))+
		    geom_histogram(binwidth = 0.5)
p2 = ggplot(data=wineQualityWhites, aes(x=residual.sugar))+
  			geom_histogram(binwidth = 0.05)+
  			scale_x_log10()
p3 = ggplot(data=wineQualityWhites, aes(x="", y=residual.sugar))+
  			geom_boxplot()+
  			coord_cartesian(ylim=c(0, 12))
grid.arrange(p1, p2, p3, ncol=2)

可见大部分白葡萄酒的剩余糖分偏低,超过90%集中在[0,12]这个区间,剩余糖分含量超过20的白葡萄酒极少。median值5.200小于mean值6.391也反映了这个事实。
另外通过log10转换后,residual.sugar分布呈双正态分布。也许是不同的人群对于两种级别甜度的白葡萄酒有偏好,因此葡萄酒生产商会偏向于生产这两种剩余糖分的白葡萄酒。
【R语言】白葡萄酒的EDA分析_第7张图片

3.2.5 chlorides:含盐量

p1 = ggplot(data=wineQualityWhites, aes(x=chlorides))+
 			geom_histogram(binwidth = 0.005)+
  			scale_x_continuous(breaks = seq(0, 0.35, 0.02))
p2 = ggplot(data=wineQualityWhites, aes(x="", y=chlorides))+
  			geom_boxplot()+
  			coord_cartesian(ylim=c(0.03, 0.06))
grid.arrange(p1, p2, ncol=2)

大部分白葡萄酒的含盐量很低,绝对部分集中在[0.03,0.06]这个区间。酒基本不会有咸味,这与我们日常生活体验是相符的。

chlorides的分布似乎呈长尾分布,通过log10转换得到的分布更贴近正态分布。这可能是因为盐分含量本身低,而人们对盐度的感知基本要靠“数量级”层次的变化才能有明确的感知。因此在葡萄酒制造时,盐分相关的物质添加时也是按照这一规律来控制的。
【R语言】白葡萄酒的EDA分析_第8张图片

3.2.6 free.sulfur.dioxide:游离二氧化硫

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00
## [1] "[Q1-3IQR, [Q1-1.5IQR, Q3+1.5IQR], Q3+3IQR]=[-46.000000, [-11.500000, 80.500000], 115.000000]"
q = summary(wineQualityWhites$free.sulfur.dioxide)
print(q)
sprintf("[Q1-3IQR, [Q1-1.5IQR, Q3+1.5IQR], Q3+3IQR]=[%f, [%f, %f], %f]",
        q['1st Qu.']-3*(q['3rd Qu.']-q['1st Qu.']), 
        q['1st Qu.']-1.5*(q['3rd Qu.']-q['1st Qu.']), 
        q['3rd Qu.']+1.5*(q['3rd Qu.']-q['1st Qu.']),
        q['3rd Qu.']+3*(q['3rd Qu.']-q['1st Qu.']))
p1 = ggplot(data = wineQualityWhites, aes(x = free.sulfur.dioxide))+
  			geom_histogram(binwidth = 10)+
  			scale_x_continuous(breaks = seq(0, 300, 20))
p2 = ggplot(data = wineQualityWhites, aes(x = "", y = free.sulfur.dioxide))+
 		    geom_boxplot()+
  			geom_hline(yintercept = 115.000000, col = "#FF6600")+
  			geom_hline(yintercept = 80.500000, col = "#6699FF")+
  			geom_hline(yintercept = -11.500000, col = "#6699FF")+
 			geom_hline(yintercept = -46.000000, col = "#FF6600")
grid.arrange(p1, p2, ncol = 2)

白葡萄酒的free.sulfur.dioxide超过95%集中在[0,80.500000]的区间内,基本符合正态分布。
wineQualityWhites数据的free.sulfur.dioxide存在极端异常值。这些异常值在后续模型和分析中需要处理。
【R语言】白葡萄酒的EDA分析_第9张图片

3.2.7 total.sulfur.dioxide:总二氧化硫

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0
## [1] "[Q1-3IQR, [Q1-1.5IQR, Q3+1.5IQR], Q3+3IQR]=[-69.000000, [19.500000, 255.500000], 344.000000]"
q = summary(wineQualityWhites$total.sulfur.dioxide)
print(q)
sprintf("[Q1-3IQR, [Q1-1.5IQR, Q3+1.5IQR], Q3+3IQR]=[%f, [%f, %f], %f]",
        q['1st Qu.']-3*(q['3rd Qu.']-q['1st Qu.']), 
        q['1st Qu.']-1.5*(q['3rd Qu.']-q['1st Qu.']), 
        q['3rd Qu.']+1.5*(q['3rd Qu.']-q['1st Qu.']),
        q['3rd Qu.']+3*(q['3rd Qu.']-q['1st Qu.']))
p1 = ggplot(data = wineQualityWhites, aes(x = total.sulfur.dioxide))+
  			geom_histogram(binwidth = 10)+
  			scale_x_continuous(breaks = seq(0, 450, 25))
p2 = ggplot(data = wineQualityWhites, aes(x = "", y = total.sulfur.dioxide))+
 			geom_boxplot()+
  			geom_hline(yintercept = 344.000000, col = "#FF6600")+
  			geom_hline(yintercept = 255.500000, col = "#6699FF")+
  			geom_hline(yintercept = 19.500000, col = "#6699FF")+
  			geom_hline(yintercept = -69.000000, col = "#FF6600")
grid.arrange(p1, p2, ncol = 2)

白葡萄酒的total.sulfur.dioxide超过95%集中在[19.5,255.500000]的区间内,基本呈正态分布。
wineQualityWhites数据的total.sulfur.dioxide存在少量极大异常值,超出了Q3+3IQR范围。这些异常值在后续模型和分析中需要处理。
【R语言】白葡萄酒的EDA分析_第10张图片

3.2.8 density:密度

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390
## [1] "[Q1-3IQR, [Q1-1.5IQR, Q3+1.5IQR], Q3+3IQR]=[0.978590, [0.985156, 1.002666], 1.009232]"
q = summary(wineQualityWhites$density)
print(q)
sprintf("[Q1-3IQR, [Q1-1.5IQR, Q3+1.5IQR], Q3+3IQR]=[%f, [%f, %f], %f]",
        q['1st Qu.']-3*(q['3rd Qu.']-q['1st Qu.']), 
        q['1st Qu.']-1.5*(q['3rd Qu.']-q['1st Qu.']), 
        q['3rd Qu.']+1.5*(q['3rd Qu.']-q['1st Qu.']),
        q['3rd Qu.']+3*(q['3rd Qu.']-q['1st Qu.']))
p1 = ggplot(data = wineQualityWhites, aes(x = density))+
  			geom_histogram(binwidth = 0.001)+
  			scale_x_continuous(breaks = seq(0, 1.04, 0.005))
p2 = ggplot(data = wineQualityWhites, aes(x = "", y = density))+
 			geom_boxplot()+
  			coord_cartesian(ylim = c(0.985, 1.005))
grid.arrange(p1, p2, ncol = 2)

大部分白葡萄酒的密度略低于水密度1,酒中包含酒精和一些矿物质、维生素等成分。其中酒精密度低于水,矿物质和维生素成分高于水,但可能由于酒精含量高于矿物质+维生素成分,所以拉低了葡萄酒密度。
另外通过boxplot可知,wineQualityWhites数据的density存在极端异常值。这些异常值在后续模型和分析中需要处理。
【R语言】白葡萄酒的EDA分析_第11张图片

3.2.9 pH:酸碱度

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820
## [1] "[Q1-3IQR, [Q1-1.5IQR, Q3+1.5IQR], Q3+3IQR]=[2.520000, [2.805000, 3.565000], 3.850000]"
q = summary(wineQualityWhites$pH)
print(q)
sprintf("[Q1-3IQR, [Q1-1.5IQR, Q3+1.5IQR], Q3+3IQR]=[%f, [%f, %f], %f]",
        q['1st Qu.']-3*(q['3rd Qu.']-q['1st Qu.']), 
        q['1st Qu.']-1.5*(q['3rd Qu.']-q['1st Qu.']), 
        q['3rd Qu.']+1.5*(q['3rd Qu.']-q['1st Qu.']),
        q['3rd Qu.']+3*(q['3rd Qu.']-q['1st Qu.']))
p1 = ggplot(data = wineQualityWhites, aes(x = pH))+
  			geom_histogram(binwidth = 0.02)+
  			scale_x_continuous(breaks = seq(2.7, 3.9, 0.1))
p2 = ggplot(data=wineQualityWhites, aes(x="", y=pH))+
  			geom_boxplot()+
  			geom_hline(yintercept = 3.850000, col = "#FF6600")+
  			geom_hline(yintercept = 3.565000, col = "#6699FF")+
  			geom_hline(yintercept = 2.805000, col = "#6699FF")+
  			geom_hline(yintercept = 2.520000, col = "#FF6600")
grid.arrange(p1, p2, ncol = 2)

根据数据的结构可知葡萄酒中含有酸性成分,因此它们的pH均低于7。并且葡萄酒基本呈酸味,因此可以猜测葡萄酒呈酸性,这也与“酸性物质pH低于7”的常识相符。
该变量包含了少数1.5倍IQR的异常值,但没有3倍IQR异常值。
【R语言】白葡萄酒的EDA分析_第12张图片

3.2.10 sulphates:硫酸盐

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800
## [1] "[Q1-3IQR, [Q1-1.5IQR, Q3+1.5IQR], Q3+3IQR]=[-0.010000, [0.200000, 0.760000], 0.970000]"
q = summary(wineQualityWhites$sulphates)
print(q)
sprintf("[Q1-3IQR, [Q1-1.5IQR, Q3+1.5IQR], Q3+3IQR]=[%f, [%f, %f], %f]",
        q['1st Qu.']-3*(q['3rd Qu.']-q['1st Qu.']), 
        q['1st Qu.']-1.5*(q['3rd Qu.']-q['1st Qu.']), 
        q['3rd Qu.']+1.5*(q['3rd Qu.']-q['1st Qu.']),
        q['3rd Qu.']+3*(q['3rd Qu.']-q['1st Qu.']))
p1 = ggplot(data = wineQualityWhites, aes(x = sulphates))+
  			geom_histogram(binwidth = 0.02)+
  			scale_x_continuous(breaks = seq(0.2, 1.1, 0.1))
p2 = ggplot(data = wineQualityWhites, aes(x = "", y = sulphates))+
  			geom_boxplot()+
  			geom_hline(yintercept = 0.97, col = "#FF6600")+
  			geom_hline(yintercept = 0.76, col = "#6699FF")+
  			geom_hline(yintercept = 0.2, col = "#6699FF")+
  			geom_hline(yintercept = -0.01, col = "#FF6600")
grid.arrange(p1, p2, ncol = 2)

白葡萄酒的sulphates超过95%集中在[0.20,0.76]的区间内,基本呈正态分布。
wineQualityWhites数据的sulphates存在较多超出Q3+1.5IQR的异常值,及少量的超出Q3+3IQR范围的极大异常值。这些异常值在后续模型和分析中需要处理。
【R语言】白葡萄酒的EDA分析_第13张图片

3.2.11 alcohol:酒精浓度

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20
## [1] "[Q1-3IQR, [Q1-1.5IQR, Q3+1.5IQR], Q3+3IQR]=[3.800000, [6.650000, 14.250000], 17.100000]"
q = summary(wineQualityWhites$alcohol)
print(q)
sprintf("[Q1-3IQR, [Q1-1.5IQR, Q3+1.5IQR], Q3+3IQR]=[%f, [%f, %f], %f]",
        q['1st Qu.']-3*(q['3rd Qu.']-q['1st Qu.']), 
        q['1st Qu.']-1.5*(q['3rd Qu.']-q['1st Qu.']), 
        q['3rd Qu.']+1.5*(q['3rd Qu.']-q['1st Qu.']),
        q['3rd Qu.']+3*(q['3rd Qu.']-q['1st Qu.']))
p1 = ggplot(data = wineQualityWhites, aes(x = alcohol))+
  			geom_histogram(binwidth = 0.1)+
  			scale_x_continuous(breaks = seq(0, 15, 0.5))
p2 = ggplot(data = wineQualityWhites, aes(x = "", y = alcohol))+
  			geom_boxplot()+
  			geom_hline(yintercept = 17.100000, col = "#FF6600")+
  			geom_hline(yintercept = 14.250000, col = "#6699FF")+
  			geom_hline(yintercept = 6.650000, col = "#6699FF")+
  			geom_hline(yintercept = 3.800000, col = "#FF6600")
grid.arrange(p1, p2, ncol = 2)

白葡萄酒一般酒精浓度较低,从图中可以看出超过14°以上的白葡萄酒极为稀少,绝大部分白葡萄酒在8°~13°之间。这与葡萄酒一般没有烈酒的常识相符。
该变量基本没有异常值。
【R语言】白葡萄酒的EDA分析_第14张图片

3.2.12 quality:质量评分

ggplot(data=wineQualityWhites, aes(x=quality))+
  	   geom_bar()

白葡萄酒quality评分一般是0~10分之间。但wineQualityWhites数据说明分数在[0, 3]和[9, 10]的白葡萄酒极为稀少。大多数白葡萄酒评分集中在[5, 7]内。酒评分基本符合正态分布。
【R语言】白葡萄酒的EDA分析_第15张图片

3.2.13 反思

反思问题1

对于boxplot中的一些异常值,该如何处理。 是需要等到模型生成的时候进行删除?还是说少量的异常值可以忽略不计?

3.3 分析两个变量

3.3.1 酸度与pH值

3.3.1.1 pH与各种酸的关系

## [1] "relevance between fixed.acidity and pH is : -0.425858"
## [1] "relevance between volatile.acidity and pH is : -0.031915"
## [1] "relevance between citric.acid and pH is : -0.163748"
p1 = ggplot(data = wineQualityWhites, aes(x = fixed.acidity, y = pH))+
  			geom_point()+
  			geom_smooth(method = 'lm')
p2 = ggplot(data = wineQualityWhites, aes(x = volatile.acidity, y = pH))+
  			geom_point()+
  			geom_smooth(method = 'lm')
p3 = ggplot(data = wineQualityWhites, aes(x = citric.acid, y = pH))+
  			geom_point()+
  			geom_smooth(method = 'lm')
grid.arrange(p1, p2, p3, ncol = 1)

fixed.acidity.pH.rel = with(wineQualityWhites, cor.test(fixed.acidity, pH), method = c("pearson"))
volatile.acidity.pH.rel = with(wineQualityWhites, cor.test(volatile.acidity, pH), method = c("pearson"))
citric.acid.pH.rel = with(wineQualityWhites, cor.test(citric.acid, pH), method = c("pearson"))
sprintf("relevance between fixed.acidity and pH is : %f",fixed.acidity.pH.rel['estimate'])
sprintf("relevance between volatile.acidity and pH is : %f", volatile.acidity.pH.rel['estimate'])
sprintf("relevance between citric.acid and pH is : %f", citric.acid.pH.rel['estimate'])

从图中可以看出来pH值与volatile.acidity没有太大相关性。
可能因为是挥发性酸,因此水溶性不好,不容易生成H离子,因此对酸度影响不大。
而pH与fixed.acidity, citric.acid呈负相关性。即fixed.acidity,citric.acid含量高的白葡萄酒,其pH也高。
酸性物质含量高pH值越低,这与化学原理相符

通过相关性测试值
relevance between fixed.acidity and pH is : -0.425858
relevance between volatile.acidity and pH is : -0.031915
relevance between citric.acid and pH is : -0.163748
也可以得出相应结论。
并且fixed.acidity与pH相关性值为-0.42585,对于pH影响最大。
【R语言】白葡萄酒的EDA分析_第16张图片

3.3.2 酸度与成分

3.3.2.2 citric.acid与fixed.acidity,volatile.acidity之间关系

## [1] "relevance between citric.acid and volatile.acidity is : -0.149472"
## [1] "relevance between citric.acid and fixed.acidity is : 0.289181"

稳定性酸与挥发性酸有没有可能是柠檬酸的不同成分而产生的?
这里通过柠檬酸与稳定性酸、挥发性酸的相关性来得出结论。

citric_and_volatile.acidity.rel = with(wineQualityWhites, 
                                     cor.test(citric.acid, volatile.acidity))
citric_and_fixed.acidity.rel = with(wineQualityWhites, 
                                  cor.test(citric.acid, fixed.acidity))
sprintf("relevance between citric.acid and volatile.acidity is : %f",
        citric_and_volatile.acidity.rel['estimate'])
sprintf("relevance between citric.acid and fixed.acidity is : %f",
        citric_and_fixed.acidity.rel['estimate'])

p1 = ggplot(data = wineQualityWhites, aes(x = citric.acid, y = volatile.acidity))+
  			geom_point()
p2 = ggplot(data = wineQualityWhites, aes(x = citric.acid, y = fixed.acidity))+
  			geom_point()
grid.arrange(p1, p2, ncol = 1)

从图和相关性测试值 relevance between citric.acid and volatile.acidity is : -0.149472
relevance between citric.acid and fixed.acidity is : 0.289181
可以看出citric.acid与volatile.acidity呈微弱的负相关性,与fixed.acidity呈正的弱相关性。
挥发性酸应该不是柠檬酸的成分。而柠檬酸应该是稳定性酸的一部分。
【R语言】白葡萄酒的EDA分析_第17张图片

3.3.2.3 sulphates,free.sulfur.dioxide,total.sulfur.dioxide三者的关系

猜想1:由于sulphates是用于控制二氧化硫的,那么sulphates与total.sulfur.dioxide应该呈负相关性
猜想2:sulphates这一化学成分应该是free.sulfur.dioxide主要构成部分

## [1] "relevance between sulphates and total.sulfur.dioxide is : 0.134562"
## [1] "relevance between sulphates and free.sulfur.dioxide is : 0.059217"
total.sulfur_and_sulphates.rel = with(wineQualityWhites, 
                                    cor.test(sulphates, total.sulfur.dioxide))
free.sulfur_and_sulphates.rel = with(wineQualityWhites, 
                                   cor.test(sulphates, free.sulfur.dioxide))
sprintf("relevance between sulphates and total.sulfur.dioxide is : %f",
        total.sulfur_and_sulphates.rel['estimate'])
sprintf("relevance between sulphates and free.sulfur.dioxide is : %f",
        free.sulfur_and_sulphates.rel['estimate'])

p1 = ggplot(data = wineQualityWhites, aes(x = total.sulfur.dioxide, y = sulphates)) + geom_point()
p2 = ggplot(data = wineQualityWhites, aes(x = free.sulfur.dioxide, y = sulphates)) + geom_point()
grid.arrange(p1, p2, ncol = 1)

从图和相关性测试值 relevance between sulphates and total.sulfur.dioxide is : 0.134562
relevance between sulphates and free.sulfur.dioxide is : 0.059217
可以看出sulphates与total.sulfur.dioxide, free.sulfur.dioxide的相关性都很弱。
即sulphates硫酸盐添加剂对葡萄酒中的二氧化硫含量影响不大,对free.sulfur.dioxide游离硫离子的影响更是微乎其微。
【R语言】白葡萄酒的EDA分析_第18张图片

3.3.3 白葡萄酒密度density

3.3.3.1 density与alcohol,sulphates,citric.acid,total.sulfur.dioxide的关系

疑问1:density会受到酒精含量、硫酸盐、二氧化硫、柠檬酸等成分的影响,但主要影响成分是哪些?
疑问2:是否因为酒精含量明显高于其它化学成分,才导致了density低于1

## [1] "relevance between density and alcohol is : -0.780138"
## [1] "relevance between density and sulphates is : 0.074493"
## [1] "relevance between density and citric.acid is : 0.149503"
## [1] "relevance between density and total.sulfur.dioxide is : 0.529881"
## [1] "relevance between density and residual.sugar is : 0.838966"
# 由于alcohol的密度是0.789/ml,并且wineQualityWhites中给出的单位是% by volume
# 因此对alcohol变量作了789*alcohol/100的转换,转成了g/l单位
p1 = ggplot(data = wineQualityWhites, aes(x = 789*alcohol/100, y = density))+
  			geom_point(alpha = 1/5, position = position_jitter(h = 0))+ xlab("alcohol(g/l)")
p2 = ggplot(data = wineQualityWhites, aes(x = sulphates, y = density))+
  			geom_point(alpha = 1/5, position = position_jitter(h = 0))+
  xlab("sulphates(g/l)")
p3 = ggplot(data = wineQualityWhites, aes(x = citric.acid, y = density))+
  			geom_point(alpha = 1/10, position = position_jitter(h = 0))+
  xlab("citric.acid(g/l)")
#wineQualityWhites中给出的单位是mg/dm^3,因此需要除以1000转换成单位(g/dm^3)
p4 = ggplot(data = wineQualityWhites, aes(x = total.sulfur.dioxide/1000, y = density))+ geom_point(alpha = 1/10, position = position_jitter(h = 0))+
  xlab("total.sulfur.dioxide(g/l)")
p5 = ggplot(data = wineQualityWhites, aes(x = residual.sugar, y = density))+
  geom_point(alpha = 1/10, position = position_jitter(h = 0))+
  xlab("residual.sugar(g/l)")
grid.arrange(p1, p2, p3, p4, p5, ncol = 2)

density.alcohol.rel = with(wineQualityWhites, cor.test(density, alcohol))
density.residual.sugar.rel = with(wineQualityWhites, cor.test(density, residual.sugar))
density.sulphates.rel = with(wineQualityWhites, cor.test(density, sulphates))
density.citric.acid.rel = with(wineQualityWhites, cor.test(density, citric.acid))
density.total.sulfur.dioxide.rel = with(wineQualityWhites, cor.test(density, total.sulfur.dioxide))
sprintf("relevance between density and alcohol is : %f", 
        density.alcohol.rel['estimate'])
sprintf("relevance between density and sulphates is : %f", 
        density.sulphates.rel['estimate'])
sprintf("relevance between density and citric.acid is : %f", 
        density.citric.acid.rel['estimate'])
sprintf("relevance between density and total.sulfur.dioxide is : %f",
        density.total.sulfur.dioxide.rel['estimate'])
sprintf("relevance between density and residual.sugar is : %f",
        density.residual.sugar.rel['estimate'])

把alcohol,sulphates,citric.acid,total.sulfur.dioxide,residual.sugar五种化学成分都转化成g/l单位,再绘制它们与density的scatter图。
其中alcohol, residual.sugar, 与density有较高相关性,total.sulfur.dioxide与密度有一定相关性,sulphates, citric.acid与density相关性几乎可以忽略。 从中容易看出含量高的成分对density影响大。

另外alcohol与密度呈负相关性,residual.sugar, total.sulfur.dioxide与density呈正相关性。 这个现象是因为alcohol密度低于水,因此含量越高,酒密度越低。
而residual.sugar, total.sulfur.dioxide密度高于水,因此含量越高酒密度越高。
【R语言】白葡萄酒的EDA分析_第19张图片

3.3.4 质量评分

## [1] "relevance between quality and volatile.acidity is : -0.194723"
## [1] "relevance between quality and total.sulfur.dioxide is : -0.174737"
## [1] "relevance between quality and alcohol is : 0.435575"
## [1] "relevance between quality and citric.acid is : -0.009209"
## [1] "relevance between quality and chlorides is : -0.209934"
## [1] "relevance between quality and residual.sugar is : -0.097577"

人对酒的感知一般通过气味,口感这两个维度来感知。 而影响气味的主要包括volatile.acidity,alcohol,total.sulfur.dioxide,citric.acid这四个成分。影响口感则主要是citric.acid, chlorides,residual.sugar和alcohol。
疑问:酒的质量评分quality是否与volatile.acidity, total.sulfur.dioxide,citric.acid, chlorides, residual.sugar, alcohol存在什么关系呢?

p1 = ggplot(data = wineQualityWhites, aes(x = quality, y = volatile.acidity))+
  			geom_jitter(alpha = 0.3) +
  			geom_boxplot(alpha = 0.3, color = 'blue')+
  			stat_summary(fun.y = 'mean', geom = 'point', color = 'red')+ geom_smooth(method = 'lm', aes(group = 1))
p2 = ggplot(data = wineQualityWhites, aes(x = quality, y = total.sulfur.dioxide))+
  			geom_jitter(alpha = 0.3) +
  			geom_boxplot(alpha = 0.3, color = 'blue')+
  			stat_summary(fun.y = 'mean', geom = 'point', color = 'red')+
  			geom_smooth(method = 'lm', aes(group = 1))
p3 = ggplot(data = wineQualityWhites, aes(x = quality, y = alcohol))+
  			geom_jitter(alpha = 0.3) +
  			geom_boxplot(alpha = 0.3, color = 'blue')+
  			stat_summary(fun.y = 'mean', geom = 'point', color = 'red')+
  			geom_smooth(method = 'lm', aes(group = 1))
p4 = ggplot(data = wineQualityWhites, aes(x = quality, y = citric.acid))+
  			geom_jitter(alpha = 0.3) +
  			geom_boxplot(alpha = 0.3, color = 'blue')+
  			stat_summary(fun.y = 'mean', geom = 'point', color = 'red')+
 	 		geom_smooth(method = 'lm', aes(group = 1))
p5 = ggplot(data = wineQualityWhites, aes(x = quality, y = chlorides))+
  			geom_jitter(alpha = 0.3) +
  			geom_boxplot(alpha = 0.3, color = 'blue')+
  			stat_summary(fun.y = 'mean', geom = 'point', color = 'red')+
  			geom_smooth(method = 'lm', aes(group = 1))
p6 = ggplot(data = wineQualityWhites, aes(x = quality, y = residual.sugar))+
  			geom_jitter(alpha = 0.3) +
  			geom_boxplot(alpha = 0.3, color = 'blue')+
  			stat_summary(fun.y = 'mean', geom = 'point', color = 'red')+
  			geom_smooth(method = 'lm', aes(group = 1))
grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 2)

quality.volatile.acidity.rel = with(wineQualityWhites, cor.test(quality_score, volatile.acidity))
quality.total.sulfur.dioxide.rel = with(wineQualityWhites, cor.test(quality_score, total.sulfur.dioxide))
quality.alcohol.rel = with(wineQualityWhites, cor.test(quality_score, alcohol))
quality.citric.acid.rel = with(wineQualityWhites, cor.test(quality_score, citric.acid))
quality.chlorides.rel = with(wineQualityWhites, cor.test(quality_score, chlorides))
quality.sugar.rel = with(wineQualityWhites, cor.test(quality_score, residual.sugar))

sprintf("relevance between quality and volatile.acidity is : %f",
        quality.volatile.acidity.rel['estimate'])
sprintf("relevance between quality and total.sulfur.dioxide is : %f", 
        quality.total.sulfur.dioxide.rel['estimate'])
sprintf("relevance between quality and alcohol is : %f", 
        quality.alcohol.rel['estimate'])
sprintf("relevance between quality and citric.acid is : %f", 
        quality.citric.acid.rel['estimate'])
sprintf("relevance between quality and chlorides is : %f", 
        quality.chlorides.rel['estimate'])
sprintf("relevance between quality and residual.sugar is : %f", 
        quality.sugar.rel['estimate'])

从图中可以看出quality与volatile.acidity,citric.acid,chlorides相关性微弱,只有与alcohol有一定相关性。 并且当quality在5分及以上时,这个特征体现的特别明显,即alcohol浓度越高quality分数也越高,呈明显的正相关性。

相关性测试,也能说明这一点。

relevance between quality and volatile.acidity is : -0.194723
relevance between quality and total.sulfur.dioxide is : -0.174737
relevance between quality and alcohol is : 0.435575
relevance between quality and citric.acid is : -0.009209
relevance between quality and chlorides is : -0.209934
relevance between quality and residual.sugar is : -0.097577

【R语言】白葡萄酒的EDA分析_第20张图片

3.3.5 散点矩阵图分析

3.3.5.1 散点矩阵图观察

最后通过散点矩阵图看下还是否存在其它较强的变量间相关性。

theme_set(theme_minimal(20))
ggpairs(wineQualityWhites[sample.int(nrow(wineQualityWhites), 1000), ], 
        axisLabels = 'internal')+
        theme_gray()

ggcorr(wineQualityWhites, method = c("all.obs", 'spearman'),
       nbreaks = 4, palette = 'PuOr', label = TRUE,
       name = "spearman corr coeff.(rho)",
       hjust = 0.8, angle = -70, size = 3)+
  	   ggtitle("Pearson Correlation coefficient Matrix")+
	   theme_gray()

从图中可以看出之前分析的结论在图中均有较好的体现,散点矩阵图图左下角的散点图与两个变量关系分析中的分析是一致的。
除了之前分析的一些变量关系之外,从散点矩阵图的左下角散点图中能够发现free.sulfur.dioxide和total.sulfur.dioxide也存在一定的线性关系,这是之前的分析中没有发现的。
这也是散点矩阵图的意义,即帮助我们发现一些之前可能没有意识到的一些变量关系。

这一发现以及之前分析的几个变量的关系,通过散点矩阵图右上角的相关性数值也能够有体现出来。
提取变量相关性值大于0.5的双变量列举如下:

relevance between total.sulfur.dioxide and free.sulfur.dioxide is : 0.607
relevance between total.sulfur.dioxide and density is : 0.524
relevance between density and residual.sugar is : 0.825
relevance between density and alcohol is : -0.794

图Pearson Correlation coefficient Matrix更清晰地描述了各个变量之间的相关性,与散点矩阵图中的相关性数值四舍五入之后一致。
【R语言】白葡萄酒的EDA分析_第21张图片
【R语言】白葡萄酒的EDA分析_第22张图片

3.3.5.1 free.sulfur.dioxide与total.sulfur.dioxide的关系

最后看下free.sulfur.dioxide与total.sulfur.dioxide的关系。

## [1] "relevance between free.sulfur.dioxide and total.sulfur.dioxide is : 0.615501"
free.total.sulfur.rel = with(wineQualityWhites, cor.test(free.sulfur.dioxide, total.sulfur.dioxide))
sprintf("relevance between free.sulfur.dioxide and total.sulfur.dioxide is : %f",
        free.total.sulfur.rel['estimate'])

ggplot(data = wineQualityWhites, 
       aes(x = total.sulfur.dioxide, y = free.sulfur.dioxide))+
       geom_point()

首先
relevance between free.sulfur.dioxide and total.sulfur.dioxide is : 0.615501
可以知道free.sulfur.dioxide与total.sulfur.dioxide存在较强的相关性。
scatter图也较好地体现了这一相关性。
二氧化硫总含量越高,点解化到葡萄酒中的游离态的二氧化硫也越高,这是符合化学规律的。
【R语言】白葡萄酒的EDA分析_第23张图片

3.3.6 反思

问题1:从两个变量的分析来看,没有找到与quality有特别明显关系的变量。
可能quality的影响因素角度,需要通过更多变量的组合关系来进行分析。
接下来会希望尝试多个变量组合来进行quality的分析和预测。

3.4 分析多个变量

在3.3两个变量分析的过程中,没有发现对质量评分quality起决定性因素的变量。
但从3.3.5中可知
relevance between total.sulfur.dioxide and free.sulfur.dioxide is : 0.607
relevance between total.sulfur.dioxide and density is : 0.524
relevance between density and residual.sugar is : 0.825
relevance between density and alcohol is : -0.794
接下来尝试通过组合更多的变量来探索与quality评分的关系。

3.4.1 total.sulfur.dioxide+free.sulfur.dioxide对quality评分的影响

p1 = ggplot(data = wineQualityWhites, 
            aes(x = total.sulfur.dioxide, y = free.sulfur.dioxide,color = quality))+
  			geom_point( alpha = 0.3, size = 1, position = 'jitter')+
 			scale_color_brewer(type='Qual', palette='Accent',
    		guide = guide_legend(title = 'quality', reverse = T,
    		override.aes = list(alpha = 1, size = 2)))+
  			ggtitle("sulfur.dioxide&quality picture 1")

sulfur.dioxide = wineQualityWhites %>% group_by(free.sulfur.dioxide) %>%
  				 summarise(total_mean = median(total.sulfur.dioxide),
            	 total_median = median(total.sulfur.dioxide),
            	 quality_mean = mean(quality_score)) %>% arrange(free.sulfur.dioxide)
p2 = ggplot(data = sulfur.dioxide, aes(x = total_mean, y = free.sulfur.dioxide, color = quality_mean))+
  			geom_point( alpha = 0.7, size = 2, position = 'jitter')+
  			scale_colour_distiller(type='div', palette='Spectral',
    		guide = guide_legend(title = 'quality_mean', reverse = T))+ 		
    		ggtitle("sulfur.dioxide&quality picture 2")
grid.arrange(p1, p2, ncol = 1)

从图sulfur.dioxide&quality pictures 1中可以高quality评分大体集中在total,free sulfur.dioxide不太高也不太低的中间区段。但这个区段内的quality评分分界不是太明显。
对free.sulfur.dioxide进行分组,计算quality和total.sulfur.dioxide的平均值。并描绘对应散点图,能够有更清晰的验证。
【R语言】白葡萄酒的EDA分析_第24张图片

3.4.2 total.sulfur.dioxide+density对quality评分的影响

p1 = ggplot(data = wineQualityWhites, 
            aes(x = total.sulfur.dioxide, y = density,color = quality))+
  			geom_point(alpha = 0.1, size = 1, position = 'jitter')+
  			scale_color_brewer(type = 'Qual', palette = 'Accent',
    		guide = guide_legend(title = 'quality', reverse = T,
    		override.aes = list(alpha = 1, size = 2)))+
  			ggtitle("picture 1")

density_sulfur.dioxide = wineQualityWhites %>% group_by(density) %>%
  						 summarise(total_mean = median(total.sulfur.dioxide),
            			 total_median = median(total.sulfur.dioxide),
            			 quality_mean = mean(quality_score)) %>% arrange(density)
p2 = ggplot(data = density_sulfur.dioxide, 
            aes(x = total_mean, y = density, color = quality_mean))+
  			geom_point( alpha = 0.5, size = 2, position = 'jitter')+
  			scale_colour_distiller(type = 'div', palette = 'Spectral',
    		guide = guide_legend(title = 'quality_mean', reverse = T))+
		    ggtitle("picture 2")
grid.arrange(p1, p2, ncol = 1)

从图picture 1中可以看出高quality评分[>=7]的散点集中在高density和低density两端,而低quality评分[<=6]的散点集中在density中段位置。
对density进行分组,计算quality和total.sulfur.dioxide的平均值。并描绘对应散点图picture 2,能够验证这一猜测。
【R语言】白葡萄酒的EDA分析_第25张图片

3.4.3 density+residual.sugar对quality评分的影响

p1 = ggplot(data = wineQualityWhites, 
          	aes(x = residual.sugar,y = density,color = quality))+
  			coord_cartesian(xlim = c(0, 30))+
  			geom_point( alpha = 0.3, size = 1, position = 'jitter')+
  			scale_color_brewer(type='Qual', palette='Accent',
    		guide = guide_legend(title = 'quality', reverse = T,
    		override.aes = list(alpha = 1, size = 2)))+
  			ggtitle("residual.sugar+density~quality picture 1")+
			theme_dark()

wineQualityWhites$quality_level = with(wineQualityWhites, 
                                     ifelse(quality_score>6, 'high', 'low'))
wineQualityWhites$quality_level = factor(wineQualityWhites$quality_level)
p2 = ggplot(data = wineQualityWhites, 
       		aes(x = residual.sugar, y = density,color = quality_level))+
  			geom_point( alpha = 0.3, size = 1, position = 'jitter')+
  			scale_color_brewer(type='Qual', palette='Paired',
    		guide = guide_legend(title = 'quality_level', reverse = T,
    		override.aes = list(alpha = 1, size = 2)))+
  			ggtitle("residual.sugar+density~quality picture 2")+
  			theme_dark()+
  			coord_cartesian(xlim = c(0, 30))

p3 = ggplot(data = wineQualityWhites, aes(x = alcohol, y = density,color = quality))+
  			geom_point( alpha = 0.3, size = 1, position = 'jitter')+
 			scale_color_brewer(type='Qual', palette='Accent',
    		guide = guide_legend(title = 'quality', reverse = T,
    		override.aes = list(alpha = 1, size = 2)))+
  			ggtitle("alcohol+density~quality picture")+
  			theme_dark()
grid.arrange(p1, p2, p3, ncol = 2)

从图residual.sugar+density~quality picture 1中可以看出高quality评分[>=7]的低quality评分[<=6]有较为明显的区分。
高quality评分[>=7]的散点集中在低desnity低residual.sugar和高density高residual.sugar区域。而低quality评分[<=6]则正好相反。
将高quality评分[>=7]和低quality评分[<=6]分成两种不同的颜色标识,得到图residual.sugar+density~quality picture 2。
可以清晰的看到高quality评分[>=7]和低quality评分[<=6]的白葡萄酒density与residual.sugar呈线性相关。并且高quality评分[>=7]的斜率明显高于低quality评分[<=6]的白葡萄酒。
这可能是因为高quality评分[>=7]的葡萄酒的acohol浓度较为集中,低quality评分[<=6]的alcohol浓度较为分散,见图alcohol+density~quality picture。而density的主要影响因素就是alcohol和residual.sugar。
因此高quality评分[>=7]白葡萄酒的residual.sugar含量对于density影响更明显,故具有更高的斜率,即residual.sugar含量增加一个单位,density提升的密度量也要更高一些。
【R语言】白葡萄酒的EDA分析_第26张图片

3.4.4 density+alcohol对quality评分的影响

p1 = ggplot(data = wineQualityWhites, aes(x = alcohol, y = density,color = quality))+
  			geom_point(alpha = 0.6, size = 1, position = position_jitter(h = 0))+
  			scale_color_brewer(type = 'Qual', palette = 'Accent')+
  			ggtitle("alcohol+density~quality picture 1")+
  			theme_dark()

wineQualityWhites$quality_level = with(wineQualityWhites, 
                                     ifelse(quality_score>6, 'high', 'low'))
wineQualityWhites$quality_level = factor(wineQualityWhites$quality_level)
p2 = ggplot(data = wineQualityWhites, 
          	aes(x = alcohol, y = density,color = quality_level))+
  			geom_point(alpha = 0.5, size = 1, position = 'jitter')+
  			scale_color_brewer(type='Qual', palette='Accent',
    		guide = guide_legend(title = 'quality_level', reverse = T,
    		override.aes = list(alpha = 1, size = 2)))+
  			ggtitle("alcohol+density~quality picture 2")+
  			theme_dark()
grid.arrange(p1, p2, ncol = 1)

ggplot(data = wineQualityWhites, aes(x = alcohol, y = density,color = quality))+
   	   geom_point( alpha = 0.6, size = 1, position = 'jitter')+
  	   scale_color_brewer(type='Qual', palette='Accent',
       guide = guide_legend(title = 'quality', reverse = T,
       override.aes = list(alpha = 1, size = 2)))+
  	   facet_wrap(~quality)+
  	   ggtitle("alcohol+density~quality picture 3")+
  	   theme_grey()

从图alcohol+density~quality picture 1中可以看出高quality评分[>=7]的白葡萄酒大体分布在相对的低density高alcohol的区域以及少量的de高nsity低alcohol,而低quality评分[<=6]的散点集中在相对的高desnity低alcoho的区域。
将高quality评分[>=7]和低quality评分[<=6]分成两种不同的颜色标识,得到图alcohol+density~quality picture 2,有更明显的呈现。
这可能是因为density本身与alcohol呈负相关,而高quality评分[>=7]在alcohol的分布就更偏向于高浓度的alcohol,因此会出现这样的情况。
同时,从图alcohol+density~quality picture 3可以看出,alcohol与density不论在哪种quality评分下,都具有负向相关性,即酒精浓度越高,密度越低。【R语言】白葡萄酒的EDA分析_第27张图片

3.4.5 quality评分预测模型

从双变量和多变量都没有找到quality评分的决定性因素,仍然希望尝试通过机器学习算法对quality进行预测。
通过查看paperUsing Data Mining for Wine Quality Assessment可以知道,通过svm模型能够取得较好的quality预测效果。
因此接下来是对svm模型训练和预测检验。

## Call:
## svm(formula = quality_level ~ . - quality - quality_score, data = train[, 
##     -1])
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.1 
## 
## Number of Support Vectors:  3361
## 
##  ( 765 1016 1580 )
## 
## Number of Classes:  3 
## 
## Levels: 
##  bad good normal
##                 Length Class  Mode     
## call                3  -none- call     
## type                1  -none- character
## predicted        3918  factor numeric  
## err.rate         2000  -none- numeric  
## confusion          12  -none- numeric  
## votes           11754  matrix numeric  
## oob.times        3918  -none- numeric  
## classes             3  -none- character
## importance         10  -none- numeric  
## importanceSD        0  -none- NULL     
## localImportance     0  -none- NULL     
## proximity           0  -none- NULL     
## ntree               1  -none- numeric  
## mtry                1  -none- numeric  
## forest             14  -none- list     
## y                3918  factor numeric  
## test                0  -none- NULL     
## inbag               0  -none- NULL     
## terms               3  terms  call
## # weights:  703
## initial  value 4734.393371 
## iter  10 value 4076.043102
## iter  20 value 4032.743039
## iter  30 value 3975.438697
## iter  40 value 3861.607333
## iter  50 value 3819.895916
## iter  60 value 3691.288394
## iter  70 value 3629.001959
## iter  80 value 3552.538244
## iter  90 value 3444.420014
## iter 100 value 3411.465065
## final  value 3411.465065 
## stopped after 100 iterations
## a 10-50-3 network with 703 weights
## options were - softmax modelling  decay=0.001
##   b->h1  i1->h1  i2->h1  i3->h1  i4->h1  i5->h1  i6->h1  i7->h1  i8->h1 
##    0.68    1.18    0.15    1.92    0.78   -1.02    0.38    0.70    1.59 
##  i9->h1 i10->h1 
##   -0.30   -2.88 
##   b->h2  i1->h2  i2->h2  i3->h2  i4->h2  i5->h2  i6->h2  i7->h2  i8->h2 
##    0.05    0.91    0.01   -0.48   -0.65    0.15   -0.39    0.30    1.37 
##  i9->h2 i10->h2 
##    0.34    1.40 
##   b->h3  i1->h3  i2->h3  i3->h3  i4->h3  i5->h3  i6->h3  i7->h3  i8->h3 
##   -0.34    0.17    0.58    0.38   -0.09    1.17    0.69   -0.29   -0.34 
##  i9->h3 i10->h3 
##    0.50   -0.92 
##   b->h4  i1->h4  i2->h4  i3->h4  i4->h4  i5->h4  i6->h4  i7->h4  i8->h4 
##   -0.42   -0.58   -0.09   -0.21   -0.05   -0.31    0.52   -0.40   -0.81 
##  i9->h4 i10->h4 
##    0.17   -0.54 
##   b->h5  i1->h5  i2->h5  i3->h5  i4->h5  i5->h5  i6->h5  i7->h5  i8->h5 
##    0.26    0.13   -0.34   -0.19    0.43    1.08    1.13   -0.32    0.15 
##  i9->h5 i10->h5 
##    0.29    1.59 
##   b->h6  i1->h6  i2->h6  i3->h6  i4->h6  i5->h6  i6->h6  i7->h6  i8->h6 
##    0.27    0.25   -0.44   -0.64   -0.65    1.26    0.36   -0.03   -0.11 
##  i9->h6 i10->h6 
##    0.15   -1.29 
##   b->h7  i1->h7  i2->h7  i3->h7  i4->h7  i5->h7  i6->h7  i7->h7  i8->h7 
##   -0.32   -0.44   -0.67   -0.66    0.43   -0.17    0.50    0.63    0.38 
##  i9->h7 i10->h7 
##    0.27   -0.01 
##   b->h8  i1->h8  i2->h8  i3->h8  i4->h8  i5->h8  i6->h8  i7->h8  i8->h8 
##   -0.29   -0.03   -0.34   -0.82    0.26    0.47    0.45   -0.24    0.11 
##  i9->h8 i10->h8 
##   -0.44    0.25 
##   b->h9  i1->h9  i2->h9  i3->h9  i4->h9  i5->h9  i6->h9  i7->h9  i8->h9 
##    3.43    7.77    0.32   -0.09    1.31    0.03   -0.01    4.24    0.20 
##  i9->h9 i10->h9 
##   -2.40   -0.86 
##   b->h10  i1->h10  i2->h10  i3->h10  i4->h10  i5->h10  i6->h10  i7->h10 
##     0.05    -0.23     0.63    -0.55    -0.38    -1.19    -4.02    -0.09 
##  i8->h10  i9->h10 i10->h10 
##    -1.28     0.30    -4.04 
##   b->h11  i1->h11  i2->h11  i3->h11  i4->h11  i5->h11  i6->h11  i7->h11 
##     0.42    -0.09    -0.34    -0.54    -0.20     3.52    -1.98    -0.82 
##  i8->h11  i9->h11 i10->h11 
##    -1.10    -0.06    -1.58 
##   b->h12  i1->h12  i2->h12  i3->h12  i4->h12  i5->h12  i6->h12  i7->h12 
##     0.50     0.11     0.60     1.95    -0.17     1.11    -0.16    -0.01 
##  i8->h12  i9->h12 i10->h12 
##    -0.34     0.06    -2.99 
##   b->h13  i1->h13  i2->h13  i3->h13  i4->h13  i5->h13  i6->h13  i7->h13 
##     0.65     0.43    -0.58     0.52    -0.37     0.10     1.27     0.60 
##  i8->h13  i9->h13 i10->h13 
##    -0.48    -0.33    -0.02 
##   b->h14  i1->h14  i2->h14  i3->h14  i4->h14  i5->h14  i6->h14  i7->h14 
##     0.44    -0.02     0.32    -0.30     0.48    -0.55    -0.18    -0.16 
##  i8->h14  i9->h14 i10->h14 
##    -0.32    -0.37    -0.36 
##   b->h15  i1->h15  i2->h15  i3->h15  i4->h15  i5->h15  i6->h15  i7->h15 
##    -0.33    -0.34    -0.47    -0.82    -0.25    -0.16     0.03     0.59 
##  i8->h15  i9->h15 i10->h15 
##    -0.22    -0.22     0.32 
##   b->h16  i1->h16  i2->h16  i3->h16  i4->h16  i5->h16  i6->h16  i7->h16 
##     0.54    -0.66    -0.21     0.80    -0.03    -0.05     1.59    -0.58 
##  i8->h16  i9->h16 i10->h16 
##    -0.20     0.38     0.24 
##   b->h17  i1->h17  i2->h17  i3->h17  i4->h17  i5->h17  i6->h17  i7->h17 
##     0.00    -0.57    -0.55    -0.66    -0.09     0.04    -0.10    -0.64 
##  i8->h17  i9->h17 i10->h17 
##     0.43    -0.35    -0.67 
##   b->h18  i1->h18  i2->h18  i3->h18  i4->h18  i5->h18  i6->h18  i7->h18 
##     0.47    -0.20     0.27    -0.59    -0.22    -0.57    -0.69    -0.34 
##  i8->h18  i9->h18 i10->h18 
##    -0.20     0.64     0.53 
##   b->h19  i1->h19  i2->h19  i3->h19  i4->h19  i5->h19  i6->h19  i7->h19 
##     0.46     0.52    -0.08    -0.06    -0.62     0.67     0.27    -0.16 
##  i8->h19  i9->h19 i10->h19 
##     0.64    -0.64    -0.22 
##   b->h20  i1->h20  i2->h20  i3->h20  i4->h20  i5->h20  i6->h20  i7->h20 
##     2.58     2.82     0.58    -0.13    -0.14    -0.19    -0.05     2.99 
##  i8->h20  i9->h20 i10->h20 
##     2.92     0.22    -0.97 
##   b->h21  i1->h21  i2->h21  i3->h21  i4->h21  i5->h21  i6->h21  i7->h21 
##    -0.16    -0.34     0.20     1.06    -0.25     0.15    -0.35    -0.29 
##  i8->h21  i9->h21 i10->h21 
##     0.29    -0.29     3.92 
##   b->h22  i1->h22  i2->h22  i3->h22  i4->h22  i5->h22  i6->h22  i7->h22 
##     0.06     0.32    -0.37    -0.59    -0.14    -0.04    -0.74     0.30 
##  i8->h22  i9->h22 i10->h22 
##     0.09     0.43     0.53 
##   b->h23  i1->h23  i2->h23  i3->h23  i4->h23  i5->h23  i6->h23  i7->h23 
##    -0.46     0.60    -0.02     0.38    -0.01     0.13     0.15     0.22 
##  i8->h23  i9->h23 i10->h23 
##    -0.38    -0.67     0.44 
##   b->h24  i1->h24  i2->h24  i3->h24  i4->h24  i5->h24  i6->h24  i7->h24 
##    -0.04     0.27    -0.56     0.14     0.32    -0.11     0.28    -0.28 
##  i8->h24  i9->h24 i10->h24 
##    -0.07    -0.36     0.42 
##   b->h25  i1->h25  i2->h25  i3->h25  i4->h25  i5->h25  i6->h25  i7->h25 
##     0.51    -0.42    -0.36     0.15     0.37    -0.34    -0.94     0.57 
##  i8->h25  i9->h25 i10->h25 
##    -0.48     0.22     0.72 
##   b->h26  i1->h26  i2->h26  i3->h26  i4->h26  i5->h26  i6->h26  i7->h26 
##     0.18     0.19     0.23    -0.67     0.22     0.00    -0.37    -0.65 
##  i8->h26  i9->h26 i10->h26 
##    -0.22    -0.66     0.32 
##   b->h27  i1->h27  i2->h27  i3->h27  i4->h27  i5->h27  i6->h27  i7->h27 
##    -0.32     0.41    -0.35     0.53     0.56    -0.61    -0.39     0.41 
##  i8->h27  i9->h27 i10->h27 
##    -0.32     0.22    -0.32 
##   b->h28  i1->h28  i2->h28  i3->h28  i4->h28  i5->h28  i6->h28  i7->h28 
##    -0.59    -0.45     0.01     0.46     0.11     0.06    -0.39    -0.31 
##  i8->h28  i9->h28 i10->h28 
##    -0.27     0.11    -0.17 
##   b->h29  i1->h29  i2->h29  i3->h29  i4->h29  i5->h29  i6->h29  i7->h29 
##    -0.48    -0.41    -0.37    -0.07     0.29    -0.20    -0.16    -0.42 
##  i8->h29  i9->h29 i10->h29 
##    -0.31    -0.56    -0.38 
##   b->h30  i1->h30  i2->h30  i3->h30  i4->h30  i5->h30  i6->h30  i7->h30 
##    -0.27     0.64     0.47     0.67     0.30    -0.51     0.05     0.67 
##  i8->h30  i9->h30 i10->h30 
##    -0.70     0.52     0.14 
##   b->h31  i1->h31  i2->h31  i3->h31  i4->h31  i5->h31  i6->h31  i7->h31 
##     0.01     0.21    -0.44     0.84    -0.12    -1.34    -2.86    -0.09 
##  i8->h31  i9->h31 i10->h31 
##    -0.79    -0.33    -5.24 
##   b->h32  i1->h32  i2->h32  i3->h32  i4->h32  i5->h32  i6->h32  i7->h32 
##     0.26    -0.55     0.40    -1.24    -0.66     0.51    -1.06    -0.02 
##  i8->h32  i9->h32 i10->h32 
##    -0.26     0.57    -1.31 
##   b->h33  i1->h33  i2->h33  i3->h33  i4->h33  i5->h33  i6->h33  i7->h33 
##    -0.63    -0.11     0.52     0.10     0.50    -0.04     0.25    -0.11 
##  i8->h33  i9->h33 i10->h33 
##    -0.64     0.64     0.63 
##   b->h34  i1->h34  i2->h34  i3->h34  i4->h34  i5->h34  i6->h34  i7->h34 
##     0.16    -0.52     0.30    -0.53     0.36     0.31    -0.02    -0.38 
##  i8->h34  i9->h34 i10->h34 
##    -0.38     0.15     0.12 
##   b->h35  i1->h35  i2->h35  i3->h35  i4->h35  i5->h35  i6->h35  i7->h35 
##     0.10     0.02     0.63     0.59    -0.41     0.29    -0.68     0.25 
##  i8->h35  i9->h35 i10->h35 
##     0.51     0.24     0.08 
##   b->h36  i1->h36  i2->h36  i3->h36  i4->h36  i5->h36  i6->h36  i7->h36 
##     0.64     0.14     0.07    -0.33    -0.35    -0.54    -0.06    -0.41 
##  i8->h36  i9->h36 i10->h36 
##    -0.65     0.00    -0.44 
##   b->h37  i1->h37  i2->h37  i3->h37  i4->h37  i5->h37  i6->h37  i7->h37 
##    -0.46     0.64    -0.60    -0.11     0.35    -0.67    -0.20     0.27 
##  i8->h37  i9->h37 i10->h37 
##    -0.20    -0.09    -0.27 
##   b->h38  i1->h38  i2->h38  i3->h38  i4->h38  i5->h38  i6->h38  i7->h38 
##     0.31     0.31    -0.63     0.46     0.15    -0.63    -0.60     0.11 
##  i8->h38  i9->h38 i10->h38 
##     0.04    -0.03    -0.14 
##   b->h39  i1->h39  i2->h39  i3->h39  i4->h39  i5->h39  i6->h39  i7->h39 
##    -0.30    -0.35     0.03     0.11     0.56    -0.46    -0.35    -0.41 
##  i8->h39  i9->h39 i10->h39 
##     0.59    -0.27    -0.03 
##   b->h40  i1->h40  i2->h40  i3->h40  i4->h40  i5->h40  i6->h40  i7->h40 
##    -1.24    -0.47    -0.91    -1.68    -0.11     1.82     1.25    -0.65 
##  i8->h40  i9->h40 i10->h40 
##    -2.28    -0.17    -7.45 
##   b->h41  i1->h41  i2->h41  i3->h41  i4->h41  i5->h41  i6->h41  i7->h41 
##     0.13    -0.09     0.14    -0.44     0.07    -0.58    -0.64    -0.60 
##  i8->h41  i9->h41 i10->h41 
##     0.05    -0.18    -0.58 
##   b->h42  i1->h42  i2->h42  i3->h42  i4->h42  i5->h42  i6->h42  i7->h42 
##    -0.05     0.73     0.18    -0.70     0.65     1.17    -1.85     0.57 
##  i8->h42  i9->h42 i10->h42 
##     0.68     0.37     2.63 
##   b->h43  i1->h43  i2->h43  i3->h43  i4->h43  i5->h43  i6->h43  i7->h43 
##     4.30     1.90     1.52     2.28    -0.04     0.19     0.32     4.12 
##  i8->h43  i9->h43 i10->h43 
##     9.78     0.48   -11.27 
##   b->h44  i1->h44  i2->h44  i3->h44  i4->h44  i5->h44  i6->h44  i7->h44 
##    -0.17     0.59     0.43     0.60    -0.32    -0.61     0.66     0.37 
##  i8->h44  i9->h44 i10->h44 
##     0.48    -0.49    -0.02 
##   b->h45  i1->h45  i2->h45  i3->h45  i4->h45  i5->h45  i6->h45  i7->h45 
##     0.28    -0.27     0.44     0.05     0.15     0.68     0.16     0.16 
##  i8->h45  i9->h45 i10->h45 
##     0.36    -0.59     0.08 
##   b->h46  i1->h46  i2->h46  i3->h46  i4->h46  i5->h46  i6->h46  i7->h46 
##     0.48     0.66     0.60    -1.89     0.58     0.18     0.78    -0.23 
##  i8->h46  i9->h46 i10->h46 
##    -0.24    -0.44    -0.51 
##   b->h47  i1->h47  i2->h47  i3->h47  i4->h47  i5->h47  i6->h47  i7->h47 
##     0.35    -0.36    -0.24    -0.20    -0.59     0.03     0.65     0.08 
##  i8->h47  i9->h47 i10->h47 
##     0.59    -0.07    -0.04 
##   b->h48  i1->h48  i2->h48  i3->h48  i4->h48  i5->h48  i6->h48  i7->h48 
##     0.23     0.47    -0.46    -0.52     0.11    -0.44     0.46    -0.21 
##  i8->h48  i9->h48 i10->h48 
##     0.23     0.19     0.01 
##   b->h49  i1->h49  i2->h49  i3->h49  i4->h49  i5->h49  i6->h49  i7->h49 
##    -0.23     0.27     0.12    -0.51    -0.15     0.15     0.63     0.46 
##  i8->h49  i9->h49 i10->h49 
##     0.13    -0.59     0.09 
##   b->h50  i1->h50  i2->h50  i3->h50  i4->h50  i5->h50  i6->h50  i7->h50 
##     0.08     0.24    -0.21     0.04    -0.36     0.11     1.14    -0.14 
##  i8->h50  i9->h50 i10->h50 
##     0.53    -0.53    -0.53 
##   b->o1  h1->o1  h2->o1  h3->o1  h4->o1  h5->o1  h6->o1  h7->o1  h8->o1 
##    0.29    1.10   -1.56   -0.45   -0.56    0.37    0.06    0.01   -0.28 
##  h9->o1 h10->o1 h11->o1 h12->o1 h13->o1 h14->o1 h15->o1 h16->o1 h17->o1 
##    4.32    0.72    0.83    0.68    0.21   -0.33    0.29   -1.17    0.67 
## h18->o1 h19->o1 h20->o1 h21->o1 h22->o1 h23->o1 h24->o1 h25->o1 h26->o1 
##    0.67   -0.43    2.50   -1.06    1.87   -0.42    0.10    2.15    0.39 
## h27->o1 h28->o1 h29->o1 h30->o1 h31->o1 h32->o1 h33->o1 h34->o1 h35->o1 
##    0.50    0.08    0.00    0.72    0.62    0.49   -0.03    0.13    0.35 
## h36->o1 h37->o1 h38->o1 h39->o1 h40->o1 h41->o1 h42->o1 h43->o1 h44->o1 
##    0.53    0.19    0.21    0.35   -1.09   -0.19    1.92    0.38    0.10 
## h45->o1 h46->o1 h47->o1 h48->o1 h49->o1 h50->o1 
##   -0.09    0.96   -0.34   -0.16    0.39   -0.43 
##   b->o2  h1->o2  h2->o2  h3->o2  h4->o2  h5->o2  h6->o2  h7->o2  h8->o2 
##    0.01   -0.59    0.81    0.05   -0.10   -0.09    0.32    0.01   -0.24 
##  h9->o2 h10->o2 h11->o2 h12->o2 h13->o2 h14->o2 h15->o2 h16->o2 h17->o2 
##   -4.50   -0.45   -0.20    0.13    0.31   -0.39   -0.97   -0.20    0.60 
## h18->o2 h19->o2 h20->o2 h21->o2 h22->o2 h23->o2 h24->o2 h25->o2 h26->o2 
##    0.56   -0.18   -3.43    0.16   -0.58   -0.72    0.28   -0.29    0.44 
## h27->o2 h28->o2 h29->o2 h30->o2 h31->o2 h32->o2 h33->o2 h34->o2 h35->o2 
##    0.62    0.28   -0.33    0.21    0.10   -0.62    0.12   -0.24   -0.11 
## h36->o2 h37->o2 h38->o2 h39->o2 h40->o2 h41->o2 h42->o2 h43->o2 h44->o2 
##    0.50   -0.23    0.61   -0.25    0.13    0.30    0.01   -0.12   -0.18 
## h45->o2 h46->o2 h47->o2 h48->o2 h49->o2 h50->o2 
##   -0.04    0.48   -0.19   -0.43    0.59    0.43 
##   b->o3  h1->o3  h2->o3  h3->o3  h4->o3  h5->o3  h6->o3  h7->o3  h8->o3 
##    0.41    0.39    0.85   -0.29   -0.77   -0.53   -0.24    0.60   -1.07 
##  h9->o3 h10->o3 h11->o3 h12->o3 h13->o3 h14->o3 h15->o3 h16->o3 h17->o3 
##    0.05   -0.88   -0.01    0.48   -0.15   -0.33   -0.25    2.22    0.45 
## h18->o3 h19->o3 h20->o3 h21->o3 h22->o3 h23->o3 h24->o3 h25->o3 h26->o3 
##   -0.67    0.37   -0.59   -0.57   -1.15   -0.12   -0.05   -0.35    0.13 
## h27->o3 h28->o3 h29->o3 h30->o3 h31->o3 h32->o3 h33->o3 h34->o3 h35->o3 
##    0.45   -0.02    0.59   -0.22   -0.65   -0.46    0.30   -0.21   -0.12 
## h36->o3 h37->o3 h38->o3 h39->o3 h40->o3 h41->o3 h42->o3 h43->o3 h44->o3 
##    0.44    0.23    0.11   -0.14    0.14   -0.31   -1.65   -0.01    0.64 
## h45->o3 h46->o3 h47->o3 h48->o3 h49->o3 h50->o3 
##    0.26   -1.69   -0.69   -0.09    0.68    0.01
##   accuracy.svm accuracy.rf accuracy.nn
## 1    0.6551812   1.0000000   0.5566616
## 2    0.6193878   0.7285714   0.5704082

将所有的变量加入到预测模型中,对quality进行预测,为了对比效果训练了svm、randomForest、nnet三个模型。
从上述模型训练效果来说,随机森林在测试集上效果最好,准确率达到70%以上。
上述模型没有进行参数调优,效果还有优化空间。
说明quality大体上还是可以通过各个变量来预测的。

另外值得注意的是,随机森林由于搜索空间大,容易出现过拟合。该模型在训练集准确度达到100%,但测试集上的准确率只有72%。可能需要更大量的测试集进一步验证其准确性。

#步骤1:首先将quality按照评分分成三个level,bad, noarmal, good
wineQualityWhites$quality_level = with(wineQualityWhites, 
                                     ifelse(quality_score>6, 'good', 'bad'))
wineQualityWhites$quality_level[wineQualityWhites$quality==6] = 'normal'
wineQualityWhites$quality_level = as.factor(wineQualityWhites$quality_level)

#步骤2:生成训练数据集和测试数据集
set.seed(123)
samp = sample(nrow(wineQualityWhites), 0.8 * nrow(wineQualityWhites))
train = wineQualityWhites[samp, ]
test = wineQualityWhites[-samp, ]

#步骤3:训练svm模型
library(e1071)#加载e1071包
#利用svm建立支持向量机分类模型,采用默认kernel函数rbf
svm.model = svm(quality_level~.-quality-quality_score, data = train[,-1])
summary(svm.model)

#步骤4:效果评估
#训练集效果
confusion.train.svm = table(train$quality_level,
                          predict(svm.model,train,type="class"))
accuracy.train.svm = sum(diag(confusion.train.svm))/sum(confusion.train.svm)

#测试集效果
confusion.test.svm = table(test$quality_level,predict(svm.model,test,type =  "class"))
accuracy.test.svm = sum(diag(confusion.test.svm))/sum(confusion.test.svm)
accuracy.svm = c(accuracy.train.svm, accuracy.test.svm)

#步骤5:同类算法randomForest效果对比
#1 randomForest模型训练
randomForest.model = randomForest(quality_level~.-quality-quality_score, train[,-1])
summary(randomForest.model)
#2 randomForest模型效果评估
#训练集
confusion.train.rf = table(train$quality_level,
                         predict(randomForest.model,train,type = "class"))
accuracy.train.rf = sum(diag(confusion.train.rf))/sum(confusion.train.rf)
#测试集效果
confusion.test.rf = table(test$quality_level,
                        predict(randomForest.model,test,type = "class"))
accuracy.test.rf = sum(diag(confusion.test.rf))/sum(confusion.test.rf)
accuracy.rf = c(accuracy.train.rf,accuracy.test.rf)

#步骤6:同类算法-神经网络效果对比
#1 nnet模型训练
nnet.model = nnet(quality_level~.-quality-quality_score, train[,-1], size = 50, decay = .001)
summary(nnet.model)
#2 nnet模型效果评估
#训练集
confusion.train.nn = table(train$quality_level,
                         predict(nnet.model,train,type = "class"))
accuracy.train.nn = sum(diag(confusion.train.nn))/sum(confusion.train.nn)
#测试集效果
confusion.test.nn = table(test$quality_level,
                        predict(nnet.model,test,type = "class"))
accuracy.test.nn = sum(diag(confusion.test.nn))/sum(confusion.test.nn)
accuracy.nn = c(accuracy.train.nn,accuracy.test.nn)

#步骤7:对比支持向量机、随机森林、人工神经网络算法的准确率
accuracy.data = data.frame(accuracy.svm,accuracy.rf,accuracy.nn)
print(accuracy.data)

4 最终结论与反思

4.1 结论

4.1.1 图1

4.1.1.1 绘图代码

ggplot(data = wineQualityWhites, aes(x = quality, y = alcohol))+
  	   geom_jitter(alpha = 0.3) +
  	   geom_boxplot(alpha = 0.3, color = 'blue')+
	   stat_summary(fun.y = 'mean', geom = 'point', color = 'red')+
	   geom_smooth(method = 'lm', aes(group = 1))+
	   labs(y = 'alcohol (% by volume)',
       title = 'alcohol vs. quality scatter plot')+
	   theme_grey()

4.1.1.2 定稿图

【R语言】白葡萄酒的EDA分析_第28张图片

4.1.1.3 结论描述

从图中我们可以发现,酒精浓度越高往往对应的红酒质量越高。

4.1.2 图2

4.1.2.1 绘图代码

ggplot(data = wineQualityWhites, aes(x = alcohol, y = density,color = quality))+
  	   geom_point( alpha = 0.6, size = 1, position = 'jitter')+
  	   scale_color_brewer(type='Qual', palette='Accent',
       guide = guide_legend(title = 'quality', reverse = T,
       override.aes = list(alpha = 1, size = 2)))+
  	   facet_wrap(~quality)+
  	   theme_grey()+
  	   labs(y = "density in (mg/l)", x = 'alcohol (% by volume)',
       title = 'alcohol+density~quality analysis')

4.1.2.2 定稿图

【R语言】白葡萄酒的EDA分析_第29张图片

4.1.2.3 结论描述

分析发现对于不同quality评分的红酒,几乎都满足随着alcohol提高,红酒的density不断降低的特点。

4.1.3 图3

4.1.3.1 绘图代码

wineQualityWhites$quality_level = with(wineQualityWhites, ifelse(quality_score>6, 'high', 'low')  )
wineQualityWhites$quality_level = factor(wineQualityWhites$quality_level)
ggplot(data = wineQualityWhites, 
       aes(x = residual.sugar, y = density,color = quality_level))+
  	   geom_point( alpha = 0.3, size = 1, position = 'jitter')+
  	   scale_color_brewer(type='Seq', palette='Paired',
       guide = guide_legend(title = 'quality_level', reverse = T,
       override.aes = list(alpha = 1, size = 2)))+
  	   geom_smooth(method = 'lm', se = FALSE, size=1)+
  	   theme_dark()+
  	   labs(x = 'residual.sugar in (g/l)', 
       y = 'density in (g/ml)', 
       title = 'residual.sugar+density~quality analysis')

4.1.3.2 定稿图

【R语言】白葡萄酒的EDA分析_第30张图片

4.1.3.3 结论描述

density的主要影响因素是alcohol和residual.sugar。但因为高quality评分[>=7]的葡萄酒的acohol浓度较为集中,低quality评分[<=6]的alcohol浓度较为分散。
因此高quality评分[>=7]白葡萄酒的residual.sugar含量对于density影响更明显,故具有更高的斜率,即residual.sugar含量增加一个单位,density提升的密度量也要更高一些。

4.2 反思

4.2.1 数据分析与模型预测的关系

当一些分析目标比较简单时,可以通过图形绘制、相关性测试来找出数据间的联系。
但如果分析目标的影响因素比较复杂,这个时候就无法通过形象化的图形和简单的相关性测试来找到自变量与预测变量之间的关系。
这个时候就需要更进一步的机器学习算法来进行目标分析。
所以真正要精通数据分析,需要在数据挖掘相关的技能上更进一步。

4.3 写在结尾

好久没写R了,最近要建模故复盘下曾经的项目熟悉熟悉,代码语言得经常练,尤其是多门语言时,不像游泳等生存本能技能长时间不写容易混淆出错,自勉。

原创不易,转载请注意出处:
https://blog.csdn.net/weixin_41613094/article/details/128284695

你可能感兴趣的:(R语言学习,r语言,算法,数据分析,数据挖掘)