[R] Visualization with the ggplot2 package: R for Data Science, part 1

Study notes on the Data visualisation chapters (chapters 2 and 3) of R for Data Science.

References

  1. R for Data Science
  2. R数据科学 (the Chinese edition of R for Data Science)
  3. The Layered Grammar of Graphics.
  4. ggplot2: Points

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey
“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey

A graphing template

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
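As a minimal sketch, the template could be filled in as follows. The concrete choices (geom_point(), coord_cartesian(), faceting by drv) are illustrative assumptions rather than part of the template, and stat and position are simply set to the values geom_point() already uses by default.

```r
library(ggplot2)

# Each template slot filled with one concrete (illustrative) choice
p <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    stat = "identity",       # default stat of geom_point()
    position = "identity"    # default position of geom_point()
  ) +
  coord_cartesian() +        # the default coordinate system
  facet_wrap(~ drv)          # one panel per drive type
```

Calling print(p) draws the plot. In everyday use most slots are left out, because every geom comes with a sensible default stat and position.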

Aesthetic mappings

library(ggplot2)
library(patchwork)  # lets `p1 + p2` place plots side by side

# Left
p1 <- ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))

# Right
p2 <- ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))  

p1 + p2
# Warning messages:
# 1: Using alpha for a discrete variable is not advised. 
# 2: The shape palette can deal with a maximum of 6 discrete values
# because more than 6 becomes difficult to discriminate; you have
# 7. Consider specifying shapes manually if you must have them. 
# 3: Removed 62 rows containing missing values (geom_point). 

ggplot2 will only use six shapes at a time. By default, additional groups will go unplotted when you use the shape aesthetic.
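The warning suggests specifying shapes manually. One possible way to do that, sketched here under the assumption that the shape codes 0 to 6 are acceptable for the seven classes, is scale_shape_manual():

```r
library(ggplot2)

# mpg$class has 7 distinct values, one more than the default shape palette
# covers, so 7 shape codes are supplied by hand (0:6 are open and line shapes)
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:6)
```

With seven values supplied, no class is left unplotted, although that many shapes are still hard to tell apart.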


- How do these aesthetics behave differently for categorical vs. continuous variables?

'''
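color (an ordered aesthetic)
1. Mapped to a categorical variable: each level gets its own distinct color
2. Mapped to a continuous variable: values are placed on a color gradient with a fixed range

size (an ordered aesthetic)
1. Mapped to a categorical variable: each level gets its own point size, but the sizes carry no meaning, and ggplot2 issues a warning
2. Mapped to a continuous variable: point size grows with the value (by default ggplot2 scales the point area)

shape (an unordered aesthetic)
1. Mapped to a categorical variable: each level gets its own shape; at most 6 shapes are shown at once, further levels are not plotted, and ggplot2 issues a warning
2. Mapped to a continuous variable: not possible; ggplot2 raises an error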
'''
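The notes above can be checked with a small sketch. It assumes cty as the continuous variable and class as the categorical one, and uses ggplot_build() to trigger the scale setup without drawing anything.

```r
library(ggplot2)

# color accepts both kinds of variable
p_discrete   <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) + geom_point()
p_continuous <- ggplot(mpg, aes(x = displ, y = hwy, color = cty)) + geom_point()

# shape does not: building a plot with a continuous variable on shape fails
shape_result <- tryCatch(
  ggplot_build(ggplot(mpg, aes(x = displ, y = hwy, shape = cty)) + geom_point()),
  error = function(e) e
)
shape_failed <- inherits(shape_result, "error")
```

Here shape_failed records whether the build was rejected, and conditionMessage(shape_result) shows the message ggplot2 gives.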

- Variable types in mpg
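A quick way to list the variable types is sketched below; sapply() with class() is just one option, and str(mpg) or printing the tibble would show the same information.

```r
library(ggplot2)

# One entry per column: character columns behave as categorical variables,
# integer and numeric columns as continuous ones
var_types <- sapply(mpg, class)
var_types
```

Character columns such as class and drv are treated as categorical when mapped to an aesthetic, while displ, cty and hwy are treated as continuous.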

  • The stroke aesthetic (border width of a point)
p1 <- ggplot(mpg,aes(x = displ, y = hwy)) +
  geom_point(shape = 1)

p2 <- ggplot(mpg,aes(x = displ, y = hwy)) +
  geom_point(shape = 1,stroke = 2)

p1 + p2

Facets

- Wrapped facets: facet_wrap()

ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

facet_wrap() takes further arguments (see ?facet_wrap). For example:
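The argument names can be listed from R itself. This is only a sketch for looking them up; the exact set depends on the installed ggplot2 version, and ?facet_wrap remains the place for their meaning.

```r
library(ggplot2)

# Names of the formal arguments of facet_wrap() in the installed version
facet_wrap_args <- names(formals(facet_wrap))
facet_wrap_args
```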


# strip.position sets which side of each panel the facet label sits on
p1 <- ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2, strip.position = 'bottom')

p2 <- ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2, strip.position = 'right')

p1 + p2

- Showing the full data set in every facet

ggplot(mpg, aes(displ, hwy)) +
  geom_point(data = transform(mpg, class = NULL), 
             colour = "grey85") +
  geom_point() +
  facet_wrap(~ class)

- Grid facets: facet_grid()

# A . means no faceting along that dimension (rows or columns)
p1 <- ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .) # rows ~ columns

p2 <- ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

p1 + p2

Geometric objects

- Hiding the legend and the confidence interval

p1 <- ggplot(mpg) +
  geom_smooth(aes(x = displ, y = hwy))

p2 <- ggplot(mpg,aes(x = displ, y = hwy, group = drv)) +
  geom_smooth(se = FALSE)

p3 <- ggplot(mpg) +
  geom_smooth(
    aes(x = displ, y = hwy, color = drv),
    show.legend = FALSE)

p1 + p2 + p3

- Combining with filter()

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point(aes(color = class)) + 
  geom_smooth(data = dplyr::filter(mpg, class == "subcompact"), se = FALSE)  # dplyr::filter(), not stats::filter()

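- Fine-tuning point appearance

Both plots draw colored points with a white outline. In the first, white shows up inside clusters where points overlap; in the second, the white layer sits entirely beneath the colored points, so no white appears inside a cluster.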

p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(fill=drv),shape=21,color='white',size=2.5,stroke=1.5)

p2 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color='white',size=3.5)+
  geom_point(aes(color=drv),shape=16,size=2.3)

p1 + p2

Statistical transformations

barcharts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.
smoothers fit a model to your data and then plot predictions from the model.
boxplots compute a robust summary of the distribution and then display a specially formatted box.
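To see what a stat actually computes, the transformed data can be pulled out of a plot. The sketch below uses layer_data(), a ggplot2 helper that returns a layer's data after the stat has run; the bar chart of cut is chosen here only as an illustration.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# One row per bar, with the variables computed by stat_count()
computed <- layer_data(p)
computed[, c("x", "count", "prop")]
```

The count column is what geom_bar() puts on the y axis by default; prop is the alternative computed variable.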


- Common interchangeable geoms and stats

You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using stat_count() instead of geom_bar():

ggplot(data = diamonds) + 
  stat_count(mapping = aes(x = cut))
# equivalent to
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut), stat = 'count') # 'count' is the default stat, so it can be omitted

# fun.ymin / fun.ymax / fun.y were renamed to fun.min / fun.max / fun in ggplot2 3.3.0
ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )
# equivalent to
ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )

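# The same plot can also be rebuilt by hand
# (needs library(dplyr) for %>%, group_by() and summarise())
ggplot(diamonds, aes(cut, depth)) + 
  geom_line(size = 1) +   # with a discrete x, each line runs from min to max depth within a cut
  # a layer that uses different data has to name it explicitly: data = xxx
  geom_point(data = diamonds %>%   
               group_by(cut) %>% 
               summarise(depth_median = median(depth)),
             aes(cut, depth_median), size = 2) 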

- Overriding the default mapping

ggplot(diamonds) + 
  geom_bar(aes(x = cut, y = stat(prop), group = 1, fill = stat(prop)))
# equivalent to (since ggplot2 3.3.0 the preferred spelling is after_stat(prop))
p1 <- ggplot(diamonds) + 
  geom_bar(aes(x = cut, y = ..prop.., group = 1, fill = ..prop..))

p2 <- ggplot(diamonds) + 
  geom_bar(aes(x = cut, y = ..prop.., group = color, fill = color))

p1 + p2

- What does geom_col() do? How is it different to geom_bar()?

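  1. geom_col() also draws bar charts, but with stat "identity": no statistical transformation is applied, so bar heights come straight from the y values in the data
  2. geom_bar() uses stat "count" by default, so bar heights are the number of rows in each group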
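The difference can be illustrated with a short sketch: counting by hand and passing the result to geom_col() gives the same bars as geom_bar() on the raw data. The helper table cut_counts and its column n are names made up for this example.

```r
library(ggplot2)

# Count the rows per cut by hand (base R); cut_counts and n are example names
cut_counts <- as.data.frame(table(cut = diamonds$cut), responseName = "n")

# geom_bar() counts for us; geom_col() plots the precomputed counts as they are
p_bar <- ggplot(diamonds) + geom_bar(aes(x = cut))
p_col <- ggplot(cut_counts) + geom_col(aes(x = cut, y = n))
```

Both plots show identical bar heights; only the place where the counting happens differs.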

- Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?


Position adjustments

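position = "identity" draws each object exactly where it falls, so objects overlap; this is a poor choice for showing bars
position = "fill" stacks the bars and scales each stack to the same height (a proportional stacked bar chart)
position = "dodge" places the bars side by side
position = "stack" stacks the bars on top of one another
position = "jitter" adds random noise to each point; mostly used for scatterplots

An example borrowed from 刘博 (Dr. Liu):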
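For bars, the first four adjustments can be compared on the diamonds data. This is a sketch along the lines of the book's examples; the alpha value for the "identity" version is an arbitrary choice so that the overlapping bars stay visible.

```r
library(ggplot2)

base <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity))

p_stack    <- base + geom_bar(position = "stack")     # the default for geom_bar()
p_fill     <- base + geom_bar(position = "fill")      # every stack scaled to height 1
p_dodge    <- base + geom_bar(position = "dodge")     # bars side by side
p_identity <- base + geom_bar(position = "identity", alpha = 0.2)  # bars overlap
```

With patchwork loaded, (p_stack + p_fill) / (p_dodge + p_identity) shows the four variants together.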

library(ggplot2)
library(patchwork)

v <- data.frame(x = 1:20, 
                y = runif(40,min = 10,max = 20),
                z = rep(c("A","B"),each = 20))
                
p1 <- ggplot(v, aes(x, y, fill = z))+
  geom_area(position = position_dodge(), alpha = 0.5) +
  labs(title = "position_dodge()")

p2 <- ggplot(v, aes(x, y, fill = z))+
  geom_area(position = position_fill(), alpha = 0.5) +
  labs(title = "position_fill()")

p3 <- ggplot(v, aes(x, y, fill = z))+
  geom_area(position = position_stack(), alpha = 0.5) +
  labs(title = "position_stack()")

p4 <- ggplot(v, aes(x, y, fill = z))+
  geom_area(position = position_identity(), alpha = 0.5) +
  labs(title = "position_identity()")

p5 <- ggplot(v, aes(x, y, fill = z))+
  geom_area(position = position_jitter(), alpha = 0.5) +
  labs(title = "position_jitter(), usually for point")

(p1 + p2 + p3)/(p4 + p5) 
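  • geom_jitter(): jittering

geom_jitter() adds a small amount of random noise to each point
geom_count() counts the observations at each location and maps the count to point size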

p1 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point()

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter()
# equivalent to
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(position = position_jitter())
# equivalent to
p2 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(position = 'jitter')

# geom_count()
p3 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()

p1 + p2 + p3
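Because the jitter is random, the plot changes on every redraw. A sketch of how the amount of noise could be controlled and made reproducible with position_jitter() follows; the width, height and seed values are arbitrary choices for illustration.

```r
library(ggplot2)

# width/height give the maximum shift in each direction (in data units);
# seed makes the random shifts the same on every draw
p_jit <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(position = position_jitter(width = 0.3, height = 0.3, seed = 42))
```

Smaller widths keep points closer to their true values; larger ones separate overlapping points more clearly at the cost of accuracy.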

Coordinate systems

- coord_flip()

coord_flip() switches the x and y axes. This is useful (for example), if you want horizontal boxplots. It’s also useful for long labels: it’s hard to get them to fit without overlapping on the x-axis.

p1 <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()

p2 <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() +
  coord_flip()

p1 + p2 

- coord_quickmap()

coord_quickmap() sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2.

nz <- map_data("nz")  # requires the maps package to be installed

p1 <- ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")

p2 <- ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap()

p1 + p2 

- coord_polar()

bar <- ggplot(data = diamonds) + 
  geom_bar(
    mapping = aes(x = cut, fill = cut), 
    show.legend = FALSE,
    width = 1
  ) + 
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

p1 <- bar + coord_flip()
p2 <- bar + coord_polar()

p1 + p2 

Going further:

- Turn a stacked bar chart into a pie chart using coord_polar()

p1 <- ggplot(diamonds) +
  geom_bar(aes(x = cut, fill = clarity)) + 
  coord_polar()

p2 <- ggplot(diamonds) +
  geom_bar(aes(x = cut, fill = clarity),
           position = 'fill') + 
  coord_polar()

# theta names the variable to map to angle (x or y)
# i.e. each value's share is computed first and then mapped to an angle
p3 <- ggplot(diamonds) +
  geom_bar(aes(x = cut, fill = clarity),
           position = 'fill') + 
  coord_polar(theta = "y")

p1 + p2 + p3
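For a single pie, one possible approach is to collapse everything into one stacked bar and then map y to the angle. The sketch below does this for clarity across all diamonds; the constant factor(1) on the x axis is a placeholder introduced for this example.

```r
library(ggplot2)

# One stacked bar (constant x), wrapped around the circle via theta = "y"
pie <- ggplot(diamonds) +
  geom_bar(aes(x = factor(1), fill = clarity), width = 1) +
  coord_polar(theta = "y") +
  labs(x = NULL, y = NULL)
```

Each slice's angle is proportional to the number of diamonds with that clarity.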

- What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

'''
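City and highway fuel efficiency are positively correlated.
coord_fixed() fixes the ratio between the units on the x and y axes (1:1 by default), so the reference line is drawn at a true 45 degrees.
geom_abline() draws a straight reference line; by default slope = 1 and intercept = 0.
Both can be set explicitly with intercept and slope.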
'''

p1 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()

p2 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline(intercept=-5,slope=1) +
  coord_fixed()

p1 + p2