R for Data Science, Chapter 3: Data visualisation [reading notes]

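Read online:
R for Data Science
GitHub repository: https://github.com/hadley/r4ds
I have been teaching myself R for a while through video courses and blog posts. Learning in fits and starts got me past the beginner stage, but my knowledge still feels unsystematic, so I want to work through R properly and streamline my own workflow.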

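Learning goals. Application area: crop genetics and breeding. Data types: mainly analysis of transcriptome or resequencing data (no large-scale raw data processing), statistical analysis of phenotypes from field surveys of agronomic traits, and statistics on lab markers together with their linkage analysis against phenotypes. Focus: solving biological problems rather than showing off technique; being able to mine key genes from resequencing data, visualise data, and produce figures for publication.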

Data visualisation

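The go-to tool for data visualisation is of course the well-known ggplot2, which fully implements the grammar of graphics.
Installing the tidyverse package brings in the R packages most commonly used in data analysis in one go, which saves time and effort. (You could also write a similar function for the packages you use most often, so that one line of code loads them all.)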

install.packages("tidyverse")
library(tidyverse)
#> ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 3.1.0.9000     ✔ purrr   0.2.5     
#> ✔ tibble  2.0.0          ✔ dplyr   0.7.8     
#> ✔ tidyr   0.8.2          ✔ stringr 1.3.1     
#> ✔ readr   1.3.1          ✔ forcats 0.3.0
#> ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()

When function names conflict, use the package::function() form to call the function from a specific package, for example stats::filter() or dplyr::filter().
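To see why the package prefix matters, here is a minimal sketch. The data frame df is made up purely for illustration; the two functions share a name but do unrelated things.

```r
library(dplyr)

# A small made-up data frame, for illustration only
df <- data.frame(x = 1:6, g = c("a", "b", "a", "b", "a", "b"))

# dplyr::filter() keeps the rows that satisfy a condition
kept <- dplyr::filter(df, g == "a")

# stats::filter() is a different function: it applies a linear filter
# to a series (here a 3-point moving average)
smoothed <- stats::filter(df$x, rep(1 / 3, 3))
```

Writing the prefix explicitly means the result does not depend on which package happened to be attached last.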

First steps: making a plot

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

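When a ggplot2 call spans several lines, the "+" must go at the end of a line, never at the start of the next one.
The first argument of ggplot(), data, is the dataset to plot; ggplot() on its own creates an empty plot. Layers are then added with geom_xxx() functions. Every geom function takes a mapping argument that defines how variables in the data are mapped to visual properties, and mapping is always paired with aes(). You can display additional variables through further aesthetics such as color (colour), size and shape, and so show several dimensions of the data in one plot. You can also set an aesthetic manually to a fixed value: in that case it goes outside aes(), as an ordinary argument of the geom function.
But note the following:
- Using size for a discrete variable is not advised.
- Using alpha for a discrete variable is not advised.
- The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult to discriminate. With more than 6 levels, ggplot2 leaves the extra groups undrawn.
- A continuous variable can not be mapped to shape.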
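The difference between mapping an aesthetic and setting it by hand can be sketched as follows, using the mpg data that ships with ggplot2. The object names p_mapped and p_fixed are my own.

```r
library(ggplot2)

# Mapping: colour varies with the class variable, so it goes inside aes()
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = class))

# Setting: every point gets the same fixed appearance, so the values go
# outside aes(), as ordinary arguments of the geom
p_fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy),
             colour = "blue", size = 2, shape = 17)
```

Printing p_mapped gives one colour per class plus a legend; p_fixed draws every point as a blue triangle with no legend.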

Make sure the values you set for an aesthetic make sense:

  • The name of a color is a character string.
  • The size of a point is in mm.
  • The shape of a point is a number, as shown in Figure 3.1.



Figure 3.1: R has 25 built in shapes that are identified by numbers. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the colour and fill aesthetics. The hollow shapes (0–14) have a border determined by colour and cannot be filled; the solid shapes (15–18) are filled with colour; the filled shapes (21–24) have a border of colour and are filled with fill.
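A quick way to see the three kinds of square side by side is the sketch below. The data frame shapes is made up for this purpose, and scale_shape_identity() is used so that the numbers are taken as shape codes directly.

```r
library(ggplot2)

# One hollow (0), one solid (15) and one filled (22) square
shapes <- data.frame(x = 1:3, y = 1, shape = c(0, 15, 22))

p_shapes <- ggplot(shapes, aes(x = x, y = y)) +
  geom_point(aes(shape = shape), size = 8,
             colour = "red", fill = "blue", stroke = 1.5) +
  scale_shape_identity()  # use the numbers as shape codes as-is
```

In the printed plot only shape 22 picks up the blue fill; shapes 0 and 15 are drawn entirely in the red colour.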

Facets

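Besides mapping through aes(), you can split a plot into facets with facet_wrap(), drawing a separate panel for each level of a discrete variable.

Facet the plot by the class variable: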

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

Use facet_grid() to facet on two variables:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

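You can also replace one of the variables with "." to facet in a single dimension, e.g. + facet_grid(. ~ cyl) or + facet_grid(drv ~ .). The former arranges the panels in columns, the latter in rows.
facet_wrap(~class, scales = "free") ## panels can use different axis scales; the default is scales = "fixed".
facet_wrap(c("cyl", "drv"), labeller = "label_both") ## with c(), facet_wrap() can also facet on two variables; the labeller argument makes each label show both the variable name and its value.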
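The two fragments above only run as part of a full plot. A minimal sketch with the mpg data (object names are my own):

```r
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Each panel gets its own x and y range instead of shared axes
p_free <- base + facet_wrap(~ class, scales = "free")

# Facet on two variables; labels read like "cyl: 4" and "drv: f"
p_both <- base + facet_wrap(c("cyl", "drv"), labeller = "label_both")
```

Free scales make within-panel patterns easier to see but comparisons across panels harder, so fixed scales remain the safer default.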

To repeat the same data in every panel, simply construct a data frame that does not contain the faceting variable.

ggplot(mpg, aes(displ, hwy)) +
  geom_point(data = transform(mpg, class = NULL), colour = "grey85") +
  geom_point() +
  facet_wrap(~class)

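Statistical transformations

Every geom has a default stat, and you can draw the same plot with the corresponding stat_xxx() function instead.
You can also override the default stat in a geom call with stat = "xx".
You can also plot a variable computed by the stat instead of the default one by wrapping its name in "..", as in y = ..prop.. (newer ggplot2 releases write this as after_stat(prop)).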
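The three points above can be sketched with the diamonds data. This assumes ggplot2 3.3.0 or later for after_stat(); on older versions the same variable is written ..density... The helper table cut_counts is my own.

```r
library(ggplot2)

# geom_bar() uses stat_count() by default, so both calls draw the same bars
p_geom <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
p_stat <- ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))

# Heights already computed: override the default stat with "identity"
cut_counts <- as.data.frame(table(cut = diamonds$cut))  # columns: cut, Freq
p_identity <- ggplot(data = cut_counts) +
  geom_bar(mapping = aes(x = cut, y = Freq), stat = "identity")

# Plot a variable computed by the stat: density instead of the default count
p_density <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat, y = after_stat(density)),
                 binwidth = 0.1)
```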

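Position adjustments

dodge ## side by side
fill ## stacked to 100% (proportions)
identity ## drawn exactly where the values fall, with no adjustment
jitter ## random noise added to avoid overplotting
stack ## stacked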
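A sketch of the five adjustments, one plot each, using the diamonds and mpg data (object names are my own):

```r
library(ggplot2)

bars <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity))

p_stack    <- bars + geom_bar(position = "stack")     # default for bars
p_fill     <- bars + geom_bar(position = "fill")      # every bar scaled to height 1
p_dodge    <- bars + geom_bar(position = "dodge")     # groups side by side
p_identity <- bars + geom_bar(position = "identity",  # bars overlap, so make
                              alpha = 0.2)            # them semi-transparent

# Jitter is mostly used with points that would otherwise overlap
p_jitter <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(position = "jitter")
```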

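Coordinate systems

The most common coordinate system is Cartesian (an x axis and a y axis), but other coordinate systems are occasionally useful.
coord_flip() ## swaps the x and y axes
coord_quickmap() ## sets the aspect ratio correctly for maps
coord_polar() ## uses polar coordinates; useful for pie charts, ring (donut) charts and Coxcomb charts
coord_fixed() ## fixes the aspect ratio so that one unit on the x axis is as long as one unit on the y axis (ratio = 1 by default)
Exercises: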
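A sketch of three of these coordinate systems on the built-in datasets. coord_quickmap() is left out because it needs map data from an extra package; the object names are my own.

```r
library(ggplot2)

# Horizontal boxplots: easier to read long class labels
p_box  <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()
p_flip <- p_box + coord_flip()

# A bar chart in polar coordinates becomes a Coxcomb chart
p_polar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut),
           show.legend = FALSE, width = 1) +
  coord_polar()

# Equal units on both axes, so the reference line y = x sits at 45 degrees
p_fixed <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline() +
  coord_fixed()
```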

  1. What’s gone wrong with this code? Why are the points not blue?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

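A manually set aesthetic has to go outside aes(). Inside aes(), "blue" is treated as a constant variable and mapped through the colour scale, so the points do not come out blue.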

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
  1. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

  2. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
    A continuous variable cannot be mapped to shape; color and size both accept one.

  3. What happens if you map the same variable to multiple aesthetics?
    It works: the same variable is simply shown through several aesthetics at once.

  4. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)
    The stroke aesthetic sets the line width of a shape's outline; it applies to the shapes that have an outline, not to the solid ones.

  5. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.
    A computed or logical expression can be mapped directly as well; ggplot2 evaluates it and treats the result as a variable.
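    A minimal sketch of this answer, using the expression from the question (the object name p_expr is my own):

```r
library(ggplot2)

# displ < 5 is evaluated row by row; the resulting TRUE/FALSE values
# are mapped to two colours and get their own legend
p_expr <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = displ < 5))
```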

  6. What happens if you facet on a continuous variable?
    Faceting still works, but it produces one panel per distinct value, which is too many to be useful.

  7. What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = drv, y = cyl))
  1. What plots does the following code make? What does . do?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)
  1. Take the first faceted plot in this section:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)
  1. What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

  2. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

  3. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

  4. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
    geom_line(), geom_boxplot(), geom_histogram(), geom_area()

  5. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

```
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)
```
  1. What does show.legend = FALSE do? What happens if you remove it?
    Why do you think I used it earlier in the chapter?
    show.legend = FALSE hides the legend for that layer. The default is NA, which shows a legend whenever the layer maps an aesthetic.
  2. What does the se argument to geom_smooth() do?
    The se argument controls whether the confidence interval around the smooth line is drawn; the default is TRUE, which draws it.
  3. Will these two graphs look different? Why/why not?
```
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
```
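One way to check this question without relying on the eye is to compare the data that ggplot2 computes for each layer. This is only a sketch; the object names are my own, and it relies on layer_data() from ggplot2.

```r
library(ggplot2)

# Data and mapping given once, inherited by both layers
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# The same data and mapping repeated in every layer
p_local <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

# Compare the computed data of layer 1 (points) and layer 2 (smooth)
same_points <- isTRUE(all.equal(layer_data(p_global, 1), layer_data(p_local, 1)))
same_smooth <- isTRUE(all.equal(layer_data(p_global, 2), layer_data(p_local, 2)))
```

If both comparisons come out TRUE, the two plots draw the same thing, and the second version just repeats what the first one inherits from ggplot().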
  1. Recreate the R code necessary to generate the following graphs.


    [Image 3]
ggplot(mpg,aes(x=displ,y=hwy))+geom_point(size=3)+
geom_smooth(size=2,se=FALSE)
[Image 4]
ggplot(mpg,aes(x=displ,y=hwy))+geom_point(size=3)+
geom_smooth(aes(group=drv),size=2,se=FALSE)
[Image 5]
ggplot(mpg,aes(x=displ,y=hwy,color=drv))+geom_point(size=3)+
geom_smooth(size=2,se=FALSE)
[Image 6]
ggplot(mpg,aes(x=displ,y=hwy))+geom_point(aes(color=drv),size=3)+
geom_smooth(size=2,se=FALSE)
[Image 7]
ggplot(mpg,aes(x=displ,y=hwy))+geom_point(aes(color=drv),size=3)+
geom_smooth(aes(linetype=drv),size=2,se=FALSE)
[Image 8]
ggplot(mpg,aes(x=displ,y=hwy))+
geom_point(aes(fill=drv),size=3,shape=21,stroke=2.5,color="white")
  1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?
    The default geom of stat_summary() is geom = "pointrange".

  2. What does geom_col() do? How is it different to geom_bar()?
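    geom_col() also draws bar charts, but its default is stat = "identity", so bar heights come straight from the y values in the data;
    geom_bar() defaults to stat = "count", so bar heights are the number of rows in each group.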

  3. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

  4. What variables does stat_smooth() compute? What parameters control its behaviour?
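    It works on continuous variables and computes the predicted value (y) together with ymin, ymax and se for the confidence band; the method argument (along with formula, span and level) controls how the smooth is fitted.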

  5. In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop..))
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))

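By default geom_bar() groups by x, so each bar's proportion is computed within its own group and ..prop.. is 1 for every bar. To get proportions that are not all 1, the bars have to be computed as one group, which group = 1 does. The group aesthetic has no visual representation of its own, so grouping by it usually leaves no visible difference between groups in the plot, but it does change the value of ..prop.. .
For the second plot: dropping ", y = ..prop.." gives a stacked bar chart coloured by color, and adding position = "fill" gives a 100% stacked bar chart.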

ggplot(data = diamonds) + 
    geom_bar(mapping = aes(x = cut, fill=color))
ggplot(data = diamonds) + 
    geom_bar(mapping = aes(x = cut, fill=color),position="fill")
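The effect of group = 1 on the proportions can be checked directly. A sketch, assuming ggplot2 3.3.0 or later for after_stat(prop), which is the newer spelling of ..prop..; the object names are my own.

```r
library(ggplot2)

# Without group = 1, each bar is its own group, so every proportion is 1
p_wrong <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

# With group = 1, all bars share one group and the proportions sum to 1
p_right <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

prop_wrong <- layer_data(p_wrong)$prop
prop_right <- layer_data(p_right)$prop
```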
  1. What parameters to geom_jitter() control the amount of jittering?

  2. Compare and contrast geom_jitter() with geom_count().
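    geom_jitter() ## shows the raw observations, with a little random noise added
    geom_count() ## shows summarised data: overlapping points are counted and the count is mapped to point size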
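    A side-by-side sketch on the mpg data. The width and height arguments of geom_jitter() set how far points may move in each direction; the values 0.3 are arbitrary choices of mine.

```r
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy))

# Every observation is drawn, nudged by up to 0.3 units in x and y
p_jitter <- base + geom_jitter(width = 0.3, height = 0.3)

# One point per distinct (cty, hwy) pair, sized by how many rows share it
p_count <- base + geom_count()
```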

  3. What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.
    The default position for geom_boxplot() is "dodge2".
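    One possible visualisation for this exercise (the choice of variables is mine): mapping a second discrete variable to colour makes the dodging visible, because the boxes sharing an x value are placed next to each other instead of overlapping.

```r
library(ggplot2)

# Boxes with the same drv but different class are dodged side by side
p_dodge2 <- ggplot(data = mpg, mapping = aes(x = drv, y = hwy, colour = class)) +
  geom_boxplot()
```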

  4. What’s the difference between coord_quickmap() and coord_map()?
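    coord_quickmap() ## a quick approximation that is faster to compute than coord_map(); suited to maps of small areas.
    coord_map() ## a full map projection; slower to compute and uses more memory.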
