R for Data Science, Chapter 5

library(tidyverse)

Visualizing the distribution of a variable

For a categorical variable, use a bar chart:

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

The number of observations in each category can be computed with dplyr::count():

diamonds %>%
  count(cut)
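
As a cross-check, count() can be thought of as shorthand for grouping and then tallying rows. A minimal sketch, assuming the dplyr and ggplot2 packages are installed; the names by_count and by_group are made up for this example:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# count() tallies the rows in each level of cut
by_count <- diamonds %>% count(cut)

# The same tally written out with group_by() + summarise()
by_group <- diamonds %>%
  group_by(cut) %>%
  summarise(n = n())

# Compare the two sets of counts
all.equal(by_count$n, by_group$n)
```

The heights of the bars drawn by geom_bar() correspond to the n column of either result.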

Continuous variables: use a histogram

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

You can also compute the bin counts by hand by combining dplyr::count() with ggplot2::cut_width():
diamonds %>% count(cut_width(carat, 0.5))
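
To see what cut_width() does on its own, it can help to run it on a handful of values. This is only a sketch: carat_demo holds made-up numbers, and bins and bin_counts are names chosen here for illustration.

```r
library(ggplot2)

# A few made-up carat values, for illustration only
carat_demo <- c(0.2, 0.3, 0.7, 1.0, 1.2)

# cut_width() assigns each value to an interval that is 0.5 wide
# and returns the intervals as a factor
bins <- cut_width(carat_demo, width = 0.5)

# table() then counts how many values fall into each interval
bin_counts <- table(bins)
bin_counts
```

count() on the result of cut_width() does the same tallying inside a data frame.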

Choose the bin width to suit your purpose. For example, keep only the diamonds smaller than 3 carats and use a narrower bin width:

smaller <- diamonds %>%
  filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

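To overlay several histograms in one plot, use geom_freqpoly() instead of geom_histogram(). It bins the data the same way but draws the counts as lines rather than bars: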

ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
  geom_histogram(binwidth = 0.1)
ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
  geom_freqpoly(binwidth = 0.1)

ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)
ggplot(data = faithful, mapping = aes(x = eruptions)) +
  geom_histogram(binwidth = 0.25)

5.3 Unusual values

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)

The rare values are invisible at this scale, so use coord_cartesian() to zoom in on the counts close to 0:

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0,50))
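
coord_cartesian() only zooms the view; the bars are still computed from all of the data. ggplot2's ylim() works differently: values outside the limits are dropped before drawing, so bars taller than the limit disappear and a warning is issued. A sketch for comparing the two, where base_plot, p_zoom and p_clip are names invented here:

```r
library(ggplot2)

base_plot <- ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)

# Zoom in: tall bars are cut off at the top of the panel but still drawn
p_zoom <- base_plot + coord_cartesian(ylim = c(0, 50))

# Clip: bars whose count exceeds 50 are removed, with a warning
p_clip <- base_plot + ylim(0, 50)

p_zoom
p_clip
```

For spotting rare values, the zoomed version is usually the one you want, because the common values stay visible as context.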

The zoomed plot shows three unusual values. Use filter() from dplyr to pull them out:

unusual <- diamonds %>%
  filter(y < 3 | y > 20) %>%
  arrange(y)
unusual


ggplot(unusual) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0,50))

How many diamonds are 0.99 carat and how many are 1 carat? What explains the difference?

m0.99 <- diamonds %>%
  filter(carat == 0.99)
m0.99
m1 <- diamonds %>%
  filter(carat == 1)
m1
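
Printing both tibbles in full makes the comparison hard to read. Counting the rows for each carat value around 1 puts the numbers side by side. A sketch, where near_one is a name chosen here and the 0.9 to 1.1 window is an arbitrary choice:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Count the diamonds at each carat value in a small window around 1
near_one <- diamonds %>%
  filter(carat >= 0.9, carat <= 1.1) %>%
  count(carat)

near_one
```

Looking at the neighbouring values as well shows whether the jump at exactly 1 carat is an isolated spike or part of a wider pattern.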

Missing values

One option is to drop every row that contains an unusual value:

diamonds2 <- diamonds %>% 
  filter(between(y, 3, 20)) %>%
  arrange(y)
diamonds2

Alternatively, keep the rows and replace the unusual values with NA using mutate():

diamonds3 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))
diamonds3
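
Base ifelse() works fine here. dplyr also offers if_else(), a stricter variant that checks that both branches have compatible types, which is why the sketch below uses NA_real_ instead of a bare NA. The name diamonds3_alt is hypothetical:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Same replacement as above, written with dplyr::if_else()
diamonds3_alt <- diamonds %>%
  mutate(y = if_else(y < 3 | y > 20, NA_real_, y))

# Number of y values that were turned into NA
sum(is.na(diamonds3_alt$y))
```

The number of NA values should match the number of rows that ggplot2 later reports as removed.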

ggplot2 does not let missing values disappear silently: it leaves them out of the plot, but warns that they were removed.

ggplot(data = diamonds3, mapping = aes(x = x, y = y)) +
  geom_point()
### Warning message:
### Removed 9 rows containing missing values (geom_point).

The warning says that rows with missing values were dropped. Set na.rm = TRUE to suppress it:

ggplot(data = diamonds3, mapping = aes(x = x, y = y)) +
  geom_point(na.rm = TRUE)

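I did not fully understand the next part. The book says that sometimes you want to understand what makes observations with missing values different from observations with recorded values. For example, in nycflights13::flights, a missing value in dep_time means the flight was cancelled, so you may want to compare the scheduled departure times of cancelled and non-cancelled flights. You can do this by creating a new variable with is.na():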

nycflights13::flights %>% 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60 
  ) %>%
  ggplot(mapping = aes(sched_dep_time)) + 
  geom_freqpoly(
    mapping = aes(color = cancelled),
    binwidth = 1/4)

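Covariation is the tendency for two or more variables to vary together in a related way.

A boxplot can show the distribution of a continuous variable broken down by a categorical variable

Use geom_boxplot() to see how price is distributed for each cut quality: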

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot(mapping = aes(color = cut))

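Use reorder() to sort the categories, here by the median of hwy. The first plot uses the default order and the second uses the reordered one: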

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()
ggplot(data = mpg, mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_boxplot()
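
reorder() is a base R function, so its effect can be checked on a tiny example without any plotting. The vectors class_demo and hwy_demo below are made-up data, not values from mpg:

```r
# Made-up data: three groups with clearly different medians
class_demo <- c("a", "a", "b", "b", "c", "c")
hwy_demo   <- c(30, 32, 10, 12, 20, 22)

# reorder() returns a factor whose levels are sorted by the
# median of hwy_demo within each group (a = 31, b = 11, c = 21)
ordered_class <- reorder(class_demo, hwy_demo, FUN = median)

levels(ordered_class)
# → "b" "c" "a"
```

ggplot2 draws the categories in level order, which is why the reordered boxplots appear sorted by median.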

Use coord_flip() to rotate the plot by 90 degrees, which helps when the category names are long:

ggplot(data = mpg, mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_boxplot() +
  coord_flip()

These are personal study notes. They are rough and not very detailed, so please go easy on me.
