R Learning Notes - Analyzing One Variable

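Today I studied data analysis with R on Udacity. I had learned some R before, but never this systematically, so I am posting today’s notes and exercises here. If you are interested, you can follow the course link to take the lessons.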

Lesson 3


What to Do First?

Notes:


Pseudo-Facebook User Data

Notes:

getwd()
## [1] "C:/Users/HH/Desktop/R Data analyst"
list.files()
##  [1] "07-tidy-data.pdf"         "demystifying.R"          
##  [3] "demystifyingR2_v3.html"   "demystifyingR2_v3.Rmd"   
##  [5] "EDA_Course_Materials.zip" "lesson3_student.html"    
##  [7] "lesson3_student.rmd"      "pseudo_facebook.tsv"     
##  [9] "reddit.csv"               "stateData.csv"           
## [11] "tidy-data.pdf"
pf<-read.delim('pseudo_facebook.tsv')
names(pf)
##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"
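
As a minimal sketch of what `read.delim` does in the step above, the toy example below parses tab-separated text from an inline string instead of a file. The three rows and their values are invented for illustration; only the column names are taken from the output above.

```r
# Toy tab-separated data; the values are invented, only the column
# names mirror pseudo_facebook.tsv.
tsv_text <- "userid\tage\tgender\tfriend_count\n1\t14\tmale\t0\n2\t21\tfemale\t120\n3\t35\tNA\t48"

# read.delim() defaults to sep = "\t" and header = TRUE.
toy <- read.delim(text = tsv_text)

names(toy)   # column names taken from the header row
dim(toy)     # number of rows and columns
str(toy)     # column types as inferred by R
```

The literal string `NA` in the last row is read as a missing value, which is how the missing genders show up later in these notes.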

Histogram of Users’ Birthdays

Notes:

library(ggplot2)
qplot(x=dob_day,data=pf)+
  scale_x_continuous(breaks=1:31)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
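
The `stat_bin()` message appears because no bin width was given, so ggplot2 falls back to 30 bins. For a day-of-month variable, one bin per day is a natural choice. The sketch below uses `ggplot()` with `geom_histogram(binwidth = 1)` on an invented `toy` data frame (not the course data); `qplot()` accepts the same `binwidth` argument.

```r
library(ggplot2)

# Invented birthdays: day 1 is deliberately over-represented.
toy <- data.frame(dob_day = c(rep(1, 20), rep(2:31, each = 3)))

# One bin per calendar day, so the default-bins message is not printed.
p <- ggplot(toy, aes(x = dob_day)) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(breaks = 1:31)

print(p)
```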

What are some things that you notice about this histogram?

Response: It is unusual that so many users have a birthday on the 1st of the month.


Moira’s Investigation

Notes:


Estimating Your Audience Size

Notes:


Think about a time when you posted a specific message or shared a photo on Facebook. What was it?

Response:

How many of your friends do you think saw that post?

Response:

Think about what percent of your friends on Facebook see any posts or comments that you make in a month. What percent do you think that is?

Response:


Perceived Audience Size

Notes:


Faceting

Notes:

qplot(x=dob_day,data=pf)+
  scale_x_continuous(breaks=1:31)+
  facet_wrap(~dob_month,ncol=3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let’s take another look at our plot. What stands out to you here?

Response:


Be Skeptical - Outliers and Anomalies

Notes:


Moira’s Outlier

Notes:

Which case do you think applies to Moira’s outlier?

Response:


Friend Count

Notes:

What code would you enter to create a histogram of friend counts?

qplot(friend_count,data=pf)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

How is this plot similar to Moira’s first plot?

Response:


Limiting the Axes

Notes:

qplot(friend_count,data=pf)+
  scale_x_continuous(limits=c(0,1000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
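
The warning reports how many rows fall outside the limits: setting `limits` on a scale turns out-of-range values into `NA` before binning, and those rows are then dropped. A base R sketch with invented friend counts shows the same bookkeeping:

```r
# Invented friend counts; two of them exceed the upper limit of 1000.
friend_count <- c(5, 120, 999, 1000, 1001, 4923)
limits <- c(0, 1000)

# Values outside [0, 1000] are the ones a scale with these limits discards.
outside <- friend_count < limits[1] | friend_count > limits[2]

sum(outside)              # number of rows that would be removed
friend_count[!outside]    # values that remain available for the histogram
```

On the course data, the analogous check would be `sum(pf$friend_count > 1000, na.rm = TRUE)`.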

Exploring with Bin Width

Notes:


Adjusting the Bin Width

Notes:

Faceting Friend Count

# What code would you add to facet the histogram by gender?
# Add it to the code below.
qplot(x = friend_count, data = pf, binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000),
                     breaks = seq(0, 1000, 50))+
  facet_wrap(~gender)
## Warning: Removed 2951 rows containing non-finite values (stat_bin).

Omitting NA Values

Notes:

qplot(friend_count,data=subset(pf,!is.na(gender)),binwidth=25)+
  scale_x_continuous(limits=c(0,1000),breaks=seq(0,1000,50))+
facet_wrap(~gender)
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
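
`subset(pf, !is.na(gender))` drops only the rows where `gender` is missing. `na.omit()` is stricter: it drops every row that has an `NA` in any column. The toy data frame below (invented values) shows the difference. As an aside, the warning count falling from 2951 to 2949 suggests that two of the users without a gender had more than 1000 friends.

```r
# Invented data: one row lacks gender, another lacks tenure.
toy <- data.frame(
  gender       = c("male", "female", NA, "female"),
  tenure       = c(266, NA, 13, 1201),
  friend_count = c(0, 120, 48, 300)
)

by_gender <- subset(toy, !is.na(gender))  # removes only the NA-gender row
complete  <- na.omit(toy)                 # removes rows with any NA

nrow(by_gender)
nrow(complete)
```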

Statistics ‘by’ Gender

Notes:

table(pf$gender)
## 
## female   male 
##  40254  58574
by(pf$friend_count,pf$gender,summary)
## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      37      96     242     244    4923 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      27      74     165     182    4917

Who on average has more friends: men or women?

Response: Women.

What’s the difference between the median friend count for women and men?

Response: 22

Why would the median be a better measure than the mean?

Response: The median is robust: it changes little when the data contain extreme values.
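
The summaries above show this: for both genders the mean (242 and 165) sits far above the median (96 and 74), because a few users with thousands of friends pull the mean upward. A small sketch on invented numbers makes the effect explicit:

```r
# Invented friend counts for five typical users.
typical <- c(10, 20, 30, 40, 50)

# The same sample plus one extreme user.
with_outlier <- c(typical, 5000)

mean(typical)
median(typical)

mean(with_outlier)     # pulled far upward by the single extreme value
median(with_outlier)   # barely moves
```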

Tenure

Notes:

qplot(x=tenure,data=pf, binwidth=30,
  color=I('black'), fill=I('#099DD9'))
## Warning: Removed 2 rows containing non-finite values (stat_bin).

How would you create a histogram of tenure by year?

qplot(x=tenure/365,data=pf, binwidth=.25,
  color=I('black'), fill=I('#F79420'))+
  scale_x_continuous(breaks=seq(1,7,1),limits=c(0,7))
## Warning: Removed 26 rows containing non-finite values (stat_bin).

Labeling Plots

Notes:

qplot(x=tenure/365,data=pf, 
      xlab='No. of years using FB',
      ylab='No. of users in sample',
      binwidth=.25,
  color=I('black'), fill=I('#F79420'))+
  scale_x_continuous(breaks=seq(1,7,1),limits=c(0,7))
## Warning: Removed 26 rows containing non-finite values (stat_bin).

User Ages

Notes:

qplot(x=age,data=pf,
      xlab='Age of users', ylab='Number of users',
      binwidth=1,
      color=I('black'), fill=I('#5760AB'))+
  scale_x_continuous(breaks=seq(1,113,5))

What do you notice?

Response:


The Spread of Memes

Notes:


Lada’s Money Bag Meme

Notes:


Transforming Data

Notes:

library(gridExtra)
p1 <- qplot(x= friend_count,data=pf)
p2 <- qplot(x=log10(friend_count+1),data=pf)
p3 <- qplot(x=sqrt(friend_count+1),data=pf)
grid.arrange(p1, p2, p3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p1 <- ggplot(aes(x= friend_count),data=pf) + geom_histogram()
p2 <- p1 + scale_x_log10()
p3 <- p1 + scale_x_sqrt()
grid.arrange(p1, p2, p3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1962 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
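
The two versions differ in how they treat users with zero friends. `log10(0)` is `-Inf`, so `scale_x_log10()` warns about infinite values and drops those rows (presumably the 1962 removed rows are the users with no friends), while `log10(friend_count + 1)` keeps them at 0. A sketch with invented counts:

```r
# Invented friend counts, including two users with no friends.
friend_count <- c(0, 0, 9, 99, 999)

log10(friend_count)       # zeros map to -Inf
log10(friend_count + 1)   # shifted values stay finite

# Number of rows a log scale would have to drop.
sum(!is.finite(log10(friend_count)))
```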

Add a Scaling Layer

Notes:

qplot (x=friend_count,data=pf)+
  scale_x_log10()
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1962 rows containing non-finite values (stat_bin).

Frequency Polygons

# Helper: frequency polygon of friend_count by gender over the x range [lo, hi].
# The y value is each bin's count divided by the summed count, i.e. a proportion.
freq_poly <- function(lo, hi) {
  ggplot(aes(x = friend_count, y = ..count../sum(..count..)),
         data = subset(pf, !is.na(gender))) +
    geom_freqpoly(aes(color = gender), binwidth = 10) +
    scale_x_continuous(limits = c(lo, hi), breaks = seq(lo, hi, 50)) +
    xlab('Number of friends') +
    ylab('Proportion of users with that friend count')
}

q1 <- freq_poly(0, 1000)    # full range
q2 <- freq_poly(0, 250)     # low friend counts
q3 <- freq_poly(250, 500)   # middle range
q4 <- freq_poly(500, 1000)  # high friend counts

grid.arrange(q1,q2,q3,q4,ncol=2)
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
## Warning: Removed 19870 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
## Warning: Removed 87181 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
## Warning: Removed 93438 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
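
A note on the y aesthetic: `..count../sum(..count..)` divides each bin's count by a summed count. As far as I understand ggplot2's behaviour (this is my reading, not something the output above confirms), the sum runs over both genders, so the curves show shares of all plotted users rather than shares within each gender. In ggplot2 3.3.0 and later the same expression can be written as `after_stat(count / sum(count))`. The base R sketch below reproduces both kinds of proportion on invented data, with `cut()` standing in for the binning.

```r
# Invented data: friend counts for six users.
toy <- data.frame(
  gender       = c("female", "female", "female", "male", "male", "male"),
  friend_count = c(3, 7, 15, 4, 12, 18)
)

# Bins of width 10, analogous to binwidth = 10 in the plots above.
toy$bin <- cut(toy$friend_count, breaks = c(0, 10, 20))

counts <- table(toy$bin, toy$gender)

prop.table(counts)              # share of all users (count / total count)
prop.table(counts, margin = 2)  # share within each gender
```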

Likes on the Web

Notes:

by(pf$www_likes,pf$gender,sum)
## pf$gender: female
## [1] 3507665
## -------------------------------------------------------- 
## pf$gender: male
## [1] 1430175
by(pf$www_likes_received,pf$gender,sum)
## pf$gender: female
## [1] 4199879
## -------------------------------------------------------- 
## pf$gender: male
## [1] 1586098
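
`by()` splits its first argument by the grouping variable and applies the function to each piece. `tapply()` does the same but returns a plain named vector, which is easier to reuse in further calculations. A sketch on invented data:

```r
# Invented likes per user.
toy <- data.frame(
  gender    = c("female", "male", "female", "male", "female"),
  www_likes = c(10, 2, 30, 5, 0)
)

by(toy$www_likes, toy$gender, sum)   # printed group by group

totals <- tapply(toy$www_likes, toy$gender, sum)
totals                               # one total per gender
totals["female"] / totals["male"]    # ratio of the two totals
```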

Box Plots

Notes:

qplot(x=gender,y=friend_count,
      data=subset(pf,!is.na(gender)),
      geom='boxplot')+
  scale_y_log10()
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1962 rows containing non-finite values (stat_boxplot).

Adjust the code to focus on users who have friend counts between 0 and 1000.

qplot(x=gender,y=friend_count,
      data=subset(pf,!is.na(gender)),
      geom='boxplot')+
  coord_cartesian(ylim=c(0,1000))
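
`coord_cartesian()` only zooms the view, so the boxes are still computed from all users. Setting limits on the y scale instead would discard users above 1000 before the quartiles are calculated, which can shift the box. The base R sketch below shows the effect on invented data.

```r
# Invented friend counts with a long right tail.
friend_count <- c(10, 50, 100, 400, 900, 1500, 3000)

# Quartiles from all users: what the box reflects under coord_cartesian().
quantile(friend_count, c(0.25, 0.5, 0.75))

# Quartiles after dropping users above 1000: what limits on the
# y scale would feed into the box plot.
kept <- friend_count[friend_count <= 1000]
quantile(kept, c(0.25, 0.5, 0.75))
```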

Box Plots, Quartiles, and Friendships

Notes:

qplot(x=gender,y=friendships_initiated,
      data=subset(pf,!is.na(gender)),
      geom='boxplot')+
  coord_cartesian(ylim=c(0,500))

On average, who initiated more friendships in our sample: men or women?

Response:

Write about some ways that you can verify your answer.

Response:
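
One way to check what the box plot suggests is a numerical summary per gender, in the same style as the earlier `by()` call. The sketch below uses invented data; on the course data the columns would be `pf$friendships_initiated` and `pf$gender`.

```r
# Invented numbers of initiated friendships.
toy <- data.frame(
  gender                = c("female", "female", "female", "male", "male", "male"),
  friendships_initiated = c(20, 60, 100, 10, 40, 70)
)

# Full five-number summary plus mean for each gender.
by(toy$friendships_initiated, toy$gender, summary)

# Just the medians, returned as a small data frame.
medians <- aggregate(friendships_initiated ~ gender, data = toy, FUN = median)
medians
```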

Response:


Getting Logical

Notes:

Response:


Analyzing One Variable

Reflection:


Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!
