R语言操作(UDA)-- Part4: 探索多个变量

4.1 Third Qualitative Variable 第三个定性变量

ggplot(aes(x = gender, y = age),
data = subset(pf, !is.na(gender))) + geom_boxplot() +
stat_summary(fun.y = mean, geom = 'point', shape = 4)

ggplot(aes(x = age, y = friend_count),
data = subset(pf, !is.na(gender))) +
geom_line(aes(color = gender), stat = 'summary', fun.y = median)

library(dplyr)
####### 方法1：
age_gender_groups <- group_by(pf, age,gender)
pf.fc_by_age_gender <- summarise(age_gender_groups,

mean_friend_count = mean(friend_count),
median_friend_count = median(friend_count),
n = n())
####### 方法2：chain functions together %>%
pf.fc_by_age_gender <- pf %>%
filter(!is.na(gender)) %>%
group_by(age, gender) %>%
summarise(mean_friend_count = mean(friend_count),
median_friend_count = median(friend_count),
n = n()) %>%
ungroup() %>%
arrange(age)
head(pf.fc_by_age_gender)

4.2 Plotting Conditional Summaries Solution

ggplot(aes(x = age, y = median_friend_count),
data = pf.fc_by_age_gender) +
geom_line(aes(color = gender))

Reshaping Data 重塑数据

install.packages('reshape2')
library(reshape2)

pf.fc_by_age_gender.wide <- dcast(pf.fc_by_age_gender,
age ~ gender,
value.var = 'median_friend_count')
head(pf.fc_by_age_gender.wide)

dcast 从long变成wide
nill 从wide变成long

4.3 Ratio Plot Solution 比率图

ggplot(aes(x = age, y = female / male),
data = pf.fc_by_age_gender.wide) +
geom_line() +
geom_hline(yintercept = 1, alpha = 0.3, linetype = 2)

可见，年轻女性的朋友数量中位数甚至超过同龄男性的2.5倍

4.4 第三个定量变量

之前我们添加了第三个定型变量，现在看看添加一个新的定量变量：

pf$year_joined <- floor(2014 - pf$tenure/365)
我们在这里进行了向下取整

Cut a Variable 切割一个变量:

summary(pf$year_joined)
table(pf$year_joined)

(2004-2009]
(2009-2011]
(2011-2012]
(2012-2014]

pf$year_joined.bucket <- cut(pf$year_joined,
c(2004, 2009, 2011, 2012, 2014))

4.5 Plotting it All Together 绘制在一起

table(pf$year_joined.bucket, useNA = 'ifany')
table(pf$year_joined.bucket)

x轴:age, y轴:friend_count, 颜色变量:gender:

ggplot(aes(x = age, y = friend_count),
data = subset(pf, !is.na(gender))) +
geom_line(aes(color = gender), stat = 'summary', fun.y = median)

颜色用year_joined.bucket来分割:

ggplot(aes(x = age, y = friend_count),
data = subset(pf, !is.na(year_joined.bucket))) +
geom_line(aes(color = year_joined.bucket),
stat = 'summary',
fun.y = median)

4.6 练习任务：绘制总均值

(1) Add another geom_line to code below

to plot the grand mean of the friend count vs age:

ggplot(aes(x = age, y = friend_count),
data = subset(pf, !is.na(year_joined.bucket))) +
geom_line(aes(color = year_joined.bucket),
stat = 'summary',
fun.y = mean) +

geom_line(stat = 'summary',
fun.y = mean,
linetype = 2)

4.7 练习任务：Friending Rate Solution

with(subset(pf, tenure >= 1), summary(friend_count / tenure))

4.8 练习任务：申请好友数

Create a line graph of mean of friendships_initiated per day (of tenure)
vs. tenure colored by year_joined.bucket.
You need to make use of the variables tenure,
friendships_initiated, and year_joined.bucket.
You also need to subset the data to only consider user with at least
one day of tenure.

ggplot(aes(x = tenure, y = friendships_initiated/tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket), stat = 'summary', fun.y = mean)

我们发现，噪音蛮强的，
下面通过调整bin_width来降低噪音

ggplot(aes(x = round(tenure/7)7, y = friendships_initiated/tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket), stat = 'summary', fun.y = mean)

ggplot(aes(x = round(tenure/30)30, y = friendships_initiated/tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket), stat = 'summary', fun.y = mean)

ggplot(aes(x = round(tenure/90)*90, y = friendships_initiated/tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket), stat = 'summary', fun.y = mean)

4.9 Bias-Variance Tradeoff Revisited 偏差方差折中

将平滑器添加到该图表：

ggplot(aes(x = tenure, y = friendships_initiated / tenure),
data = subset(pf, tenure >= 1)) +
geom_smooth(aes(color = year_joined.bucket))

在平滑版本中，我们依然看见新好友数随着使用时间的增加而下降

4.10 Yogurt!

Histograms Revisited

yo <- read.csv('yogurt.csv')
str(yo)

Change the id from an int to a factor 重访直方图

yo$id <- factor(yo$id)
str(yo)
qplot(data = yo, x = price, fill = I('#F79420'))

上图直方图太过于离散
调整一下组距

qplot(data = yo, x = price, fill = I('#F79420'), binwidth = 10)

调整之后的图形，会错过相邻价格的一些空白空间的观测值
这个直方图是一个有偏差的模型

4.11 Number of Purchases

summary(yo)
length(unique(yo$price))
table(yo$price)

str(yo)
yo$all.purchases <- yo$strawberry + yo$blueberry + yo$pina.colada + yo$plain + yo$mixed.berry

或者嫌上面的语句太繁琐的话，就用transform语句来做
yo <- transform(yo, all.purchases = strawberry + blueberry + pina.colada + plain + mixed.berry)

创建新的直方图，用我们上面刚创造的变量：总购买量

qplot(x = all.purchases, data = yo, binwidth = 1,
fill = I('#099DD9'))

从上图可见，大多数的家庭每次只买1-2份

qplot(x = time, y = price, data = yo)

上图的重叠点太多，改用ggplot作图，用jitter抖动来减少重叠

ggplot(aes(x = time, y = price), data = yo) +
geom_jitter(alpha = 0.25)

可见，随着时间的推移，酸奶的价钱在涨，但是也不乏一些低价酸奶，有可能是商家活动

4.12 Looking at Samples of Households 查看家庭样本

Set the seed for reproducible results

set.seed(4239)
sample.ids <- sample(levels(yo$id), 16)

ggplot(aes(x = time, y = price),
data = subset(yo, id %in% sample.ids)) +
facet_wrap( ~ id) +
geom_line() +
geom_point(aes(size = all.purchases), pch = 1)

4.13 Scatterplot Matrices 散点图矩阵

library(GGally)
theme_set(theme_minimal(20))

Set the seed for reproducible results

set.seed(1836)
散点图对于数值变量来说很好，但是对于分类变量来说没啥意思，刨除掉2个分类变量：
pf_subset <- pf[, c(2:15)]
names(pf_subset)
ggpairs(pf_subset[sample.int(nrow(pf_subset), 1000), ])

上图可见，对角线以上的是相关系数

4.14 Even More Variable 更多变量

nci <- read.table('nci.tsv')
changing the columns to produce a nicer plot:
colnames(nci) <- c(1:64)

Create a Heat Map!热图

melt the data to long format:

library(reshape2)
nci.long.samp <- melt(as.matrix(nci[1:200, ]))
names(nci.long.samp) <- c('gene', 'case', 'value')
head(nci.long.samp)

make the heat map:

ggplot(aes(y = gene, x = case, fill = value),
data = nci.long.samp) +
geom_tile() +
scale_fill_gradientn(colours = colorRampPalette(c('blue','red'))(100))

小结：探索多变量这一块中学到了什么

我们采取你以前课程中学习的许多基本技巧，并对其扩展，以便一次调查多个变量的模式
我们从简单的散点图入手，并绘制前面学过的条件总结，例如为多个组添加总结
然后我们尝试采用一些技术来一次检查大量的变量，例如散点图矩阵和热图
我们还学会了如何重塑数据：从每种情况一行的广泛数据变成每个变量组合一行的总和数据
将数据在long和wide格式之间往返移动

author: 快乐自由拉菲犬Celine Zhang

R语言操作(UDA)-- Part4: 探索多个变量

4.1 Third Qualitative Variable 第三个定性变量

4.2 Plotting Conditional Summaries Solution

Reshaping Data 重塑数据

4.3 Ratio Plot Solution 比率图

4.4 第三个定量变量

4.5 Plotting it All Together 绘制在一起

x轴:age, y轴:friend_count, 颜色变量:gender:

颜色用year_joined.bucket来分割:

4.6 练习任务：绘制总均值

(1) Add another geom_line to code below

4.7 练习任务：Friending Rate Solution

4.8 练习任务：申请好友数

4.9 Bias-Variance Tradeoff Revisited 偏差方差折中

将平滑器添加到该图表：

4.10 Yogurt!

Histograms Revisited

Change the id from an int to a factor 重访直方图

4.11 Number of Purchases

4.12 Looking at Samples of Households 查看家庭样本

Set the seed for reproducible results

4.13 Scatterplot Matrices 散点图矩阵

Set the seed for reproducible results

4.14 Even More Variable 更多变量

Create a Heat Map!热图

make the heat map:

小结：探索多变量这一块中学到了什么

你可能感兴趣的:(R语言操作(UDA)-- Part4: 探索多个变量)