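Today I studied data analysis with R on Udacity. I had learned some of this before, but never this systematically, so I am posting today's notes and exercises here. If you are interested, you can click the link to take the course.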

Lesson 3

What to Do First?

Notes:

Pseudo-Facebook User Data

Notes:

getwd()

## [1] "C:/Users/HH/Desktop/R Data analyst"

list.files()

##  [1] "07-tidy-data.pdf"         "demystifying.R"          
##  [3] "demystifyingR2_v3.html"   "demystifyingR2_v3.Rmd"   
##  [5] "EDA_Course_Materials.zip" "lesson3_student.html"    
##  [7] "lesson3_student.rmd"      "pseudo_facebook.tsv"     
##  [9] "reddit.csv"               "stateData.csv"           
## [11] "tidy-data.pdf"

pf<-read.delim('pseudo_facebook.tsv')
names(pf)

##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"
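As a quick check on what read.delim() does, here is a minimal sketch. It assumes a tiny made-up TSV written to a temporary file (the path, columns and values are hypothetical stand-ins, not pseudo_facebook.tsv) and then inspects it the same way as pf above.

```r
# Sketch only: a tiny made-up TSV stands in for pseudo_facebook.tsv.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("userid\tage\tgender\tfriend_count",
             "1\t14\tmale\t0",
             "2\t21\tfemale\t96",
             "3\t35\tNA\t1200"),
           tsv_path)

# read.delim() is read.table() with sep = "\t" and header = TRUE by default.
toy <- read.delim(tsv_path)

names(toy)   # column names, as with names(pf)
dim(toy)     # number of rows and columns
str(toy)     # column types; the literal NA in gender is read as missing
```

Running str() right after loading is a cheap way to spot columns that were read with an unexpected type.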

Histogram of Users’ Birthdays

Notes:

library(ggplot2)
qplot(x=dob_day,data=pf)+
  scale_x_continuous(breaks=1:31)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

What are some things that you notice about this histogram?

Response: It is unusual that so many people were born on the 1st of the month.
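A spike like this can also be checked numerically instead of by eye. The sketch below uses made-up birth days (not the course data) with extra 1s added on purpose, and counts users per day with table().

```r
# Made-up birth days (NOT the course data): 1..31 once each, plus extra 1s
# to imitate a spike on the first day of the month.
dob_day <- c(1:31, rep(1, 20))

day_counts <- table(dob_day)               # number of users per day of month
day_counts[["1"]]                          # count on the 1st
names(day_counts)[which.max(day_counts)]   # the most common day
```

On the real data the equivalent call would presumably be table(pf$dob_day).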

Moira’s Investigation

Notes:

Estimating Your Audience Size

Notes:

Think about a time when you posted a specific message or shared a photo on Facebook. What was it?

Response:

How many of your friends do you think saw that post?

Response:

Think about what percent of your friends on Facebook see any posts or comments that you make in a month. What percent do you think that is?

Response:

Perceived Audience Size

Notes:

Faceting

Notes:

qplot(x=dob_day,data=pf)+
  scale_x_continuous(breaks=1:31)+
  facet_wrap(~dob_month,ncol=3)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let’s take another look at our plot. What stands out to you here?

Response:

Be Skeptical - Outliers and Anomalies

Notes:

Moira’s Outlier

Notes:

Which case do you think applies to Moira’s outlier?

Response:

Friend Count

Notes:

What code would you enter to create a histogram of friend counts?

qplot(friend_count,data=pf)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

How is this plot similar to Moira’s first plot?

Response:

Limiting the Axes

Notes:

qplot(friend_count,data=pf)+
  scale_x_continuous(limits=c(0,1000))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 2951 rows containing non-finite values (stat_bin).
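The warning above appears because values outside the scale limits are dropped before the histogram is built. Here is a minimal base R sketch of the same idea on made-up friend counts (not the course data); it only counts what falls outside 0 to 1000 and does not reproduce ggplot2's internals.

```r
# Made-up friend counts (NOT the course data).
friend_count <- c(0, 5, 40, 250, 999, 1000, 1001, 2500, 4923)

# Values outside the limits are the ones a warning like this reports as removed.
outside <- friend_count < 0 | friend_count > 1000
sum(outside)              # rows that would be dropped

# The histogram is then built from the remaining values only.
kept <- friend_count[!outside]
range(kept)
```

So the 2951 in the warning is presumably the number of users with more than 1000 friends.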

Exploring with Bin Width

Notes:

Adjusting the Bin Width

Notes:

Faceting Friend Count

# What code would you add to facet the histogram by gender?
# Add it to the code below.
qplot(x = friend_count, data = pf, binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000),
                     breaks = seq(0, 1000, 50))+
  facet_wrap(~gender)

## Warning: Removed 2951 rows containing non-finite values (stat_bin).

Omitting NA Values

Notes:

qplot(friend_count,data=subset(pf,!is.na(gender)),binwidth=25)+
  scale_x_continuous(limits=c(0,1000),breaks=seq(0,1000,50))+
  facet_wrap(~gender)

## Warning: Removed 2949 rows containing non-finite values (stat_bin).
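To see what subset(pf, !is.na(gender)) keeps and drops, here is a small sketch on a made-up data frame (not the course data). It also contrasts it with na.omit(), a base R function that is not used in the notes above and is stricter.

```r
# Made-up data frame (NOT the course data); one user has no recorded gender
# and another has no recorded friend count.
toy <- data.frame(
  gender = c("female", "male", NA, "male"),
  friend_count = c(96, 74, 10, NA)
)

# Keep rows where gender is known; other columns may still contain NA.
by_gender <- subset(toy, !is.na(gender))
nrow(by_gender)

# na.omit() drops a row if ANY column is NA.
complete <- na.omit(toy)
nrow(complete)
```

Filtering on the one column you facet by keeps more data than na.omit(), which is probably why the lesson filters on gender only.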

Statistics ‘by’ Gender

Notes:

table(pf$gender)

## 
## female   male 
##  40254  58574

by(pf$friend_count,pf$gender,summary)

## pf$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      37      96     242     244    4923 
## -------------------------------------------------------- 
## pf$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      27      74     165     182    4917

Who on average has more friends: men or women?

Response: women

What’s the difference between the median friend count for women and men?

Response: 22

Why would the median be a better measure than the mean?

Response: The median is more robust. It doesn’t change much when the data contain extreme values.
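The point about the median can be checked with a few numbers. The sketch below uses made-up friend counts (not the course data) and adds a single extreme user to see how each statistic reacts.

```r
# Made-up friend counts (NOT the course data).
typical <- c(20, 50, 74, 96, 150)
with_outlier <- c(typical, 4923)   # add one extreme user

mean(typical);   mean(with_outlier)     # the mean jumps
median(typical); median(with_outlier)   # the median barely moves
```

The same effect shows up in the by() output above, where each mean sits well above its median.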

Tenure

Notes:

qplot(x=tenure,data=pf, binwidth=30,
  color=I('black'), fill=I('#099DD9'))

## Warning: Removed 2 rows containing non-finite values (stat_bin).

How would you create a histogram of tenure by year?

qplot(x=tenure/365,data=pf, binwidth=.25,
  color=I('black'), fill=I('#F79420'))+
  scale_x_continuous(breaks=seq(1,7,1),limits=c(0,7))

## Warning: Removed 26 rows containing non-finite values (stat_bin).

Labeling Plots

Notes:

qplot(x=tenure/365,data=pf, 
      xlab='No. of years using FB',
      ylab='No. of users in sample',
      binwidth=.25,
  color=I('black'), fill=I('#F79420'))+
  scale_x_continuous(breaks=seq(1,7,1),limits=c(0,7))

## Warning: Removed 26 rows containing non-finite values (stat_bin).

User Ages

Notes:

qplot(x=age,data=pf,
      xlab='Age of users', ylab='Number of users',
      binwidth=1,
      color=I('black'), fill=I('#5760AB'))+
  scale_x_continuous(breaks=seq(1,113,5))

What do you notice?

Response:

The Spread of Memes

Notes:

Lada’s Money Bag Meme

Notes:

Transforming Data

Notes:

library(gridExtra)
p1 <- qplot(x= friend_count,data=pf)
p2 <- qplot(x=log10(friend_count+1),data=pf)
p3 <- qplot(x=sqrt(friend_count+1),data=pf)
grid.arrange(p1, p2, p3)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p1 <- ggplot(aes(x= friend_count),data=pf) + geom_histogram()
p2 <- p1 + scale_x_log10()
p3 <- p1 + scale_x_sqrt()
grid.arrange(p1, p2, p3)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Transformation introduced infinite values in continuous x-axis

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1962 rows containing non-finite values (stat_bin).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Add a Scaling Layer

Notes:

qplot(x=friend_count,data=pf)+
  scale_x_log10()

## Warning: Transformation introduced infinite values in continuous x-axis

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1962 rows containing non-finite values (stat_bin).
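The "infinite values" warning comes from taking the log of zero. Here is a minimal sketch on made-up friend counts (not the course data) showing why a log axis loses the zeros and why the earlier code used friend_count+1.

```r
# Made-up friend counts (NOT the course data); two users have zero friends.
friend_count <- c(0, 0, 9, 99, 999)

log10(friend_count)                      # zeros become -Inf
sum(!is.finite(log10(friend_count)))     # values a log axis cannot place

log10(friend_count + 1)                  # shifting by 1 keeps the zeros, at 0
```

By the same reasoning, the 1962 removed rows are presumably the users with a friend count of 0.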

Frequency Polygons

# y is each bin's count divided by the total count, i.e. a proportion (0 to 1)
q1 <- ggplot(aes(x=friend_count,y=..count../sum(..count..)),
       data=subset(pf,!is.na(gender)))+
  geom_freqpoly(aes(color=gender),binwidth=10)+
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50))+
  xlab('Number of friends')+
  ylab('Proportion of users with that friend count')

# Same plot, restricted to 0-250 friends
q2 <- ggplot(aes(x=friend_count,y=..count../sum(..count..)),
       data=subset(pf,!is.na(gender)))+
  geom_freqpoly(aes(color=gender),binwidth=10)+
  scale_x_continuous(limits = c(0, 250), breaks = seq(0, 250, 50))+
  xlab('Number of friends')+
  ylab('Proportion of users with that friend count')

# Same plot, restricted to 250-500 friends
q3 <- ggplot(aes(x=friend_count,y=..count../sum(..count..)),
       data=subset(pf,!is.na(gender)))+
  geom_freqpoly(aes(color=gender),binwidth=10)+
  scale_x_continuous(limits = c(250, 500), breaks = seq(250, 500, 50))+
  xlab('Number of friends')+
  ylab('Proportion of users with that friend count')

# Same plot, restricted to 500-1000 friends
q4 <- ggplot(aes(x=friend_count,y=..count../sum(..count..)),
       data=subset(pf,!is.na(gender)))+
  geom_freqpoly(aes(color=gender),binwidth=10)+
  scale_x_continuous(limits = c(500, 1000), breaks = seq(500, 1000, 50))+
  xlab('Number of friends')+
  ylab('Proportion of users with that friend count')

grid.arrange(q1,q2,q3,q4,ncol=2)

## Warning: Removed 2949 rows containing non-finite values (stat_bin).

## Warning: Removed 4 rows containing missing values (geom_path).

## Warning: Removed 19870 rows containing non-finite values (stat_bin).

## Warning: Removed 4 rows containing missing values (geom_path).

## Warning: Removed 87181 rows containing non-finite values (stat_bin).

## Warning: Removed 4 rows containing missing values (geom_path).

## Warning: Removed 93438 rows containing non-finite values (stat_bin).

## Warning: Removed 4 rows containing missing values (geom_path).
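The y aesthetic in these plots divides each bin's count by a total. As far as I understand it, that total runs over both genders, so each line shows a share of all users rather than a share within one gender. The sketch below shows the two kinds of proportion in base R on made-up data (not the course data); cut() only roughly imitates the binning, it is not what ggplot2 does internally.

```r
# Made-up data (NOT the course data).
friend_count <- c(5, 12, 18, 25, 7, 33, 14, 21)
gender <- c("female", "female", "female", "female",
            "male", "male", "male", "male")

# Bin the counts in steps of 10, roughly like binwidth = 10.
bins <- cut(friend_count, breaks = seq(0, 40, 10), right = FALSE)
counts <- table(gender, bins)

counts / sum(counts)             # share of ALL users in each gender/bin cell
prop.table(counts, margin = 1)   # share WITHIN each gender (rows sum to 1)
```

Which of the two is appropriate depends on the question. Comparing the shape of the two distributions is usually easier with the within-gender version.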

Likes on the Web

Notes:

by(pf$www_likes,pf$gender,sum)

## pf$gender: female
## [1] 3507665
## -------------------------------------------------------- 
## pf$gender: male
## [1] 1430175

by(pf$www_likes_received,pf$gender,sum)

## pf$gender: female
## [1] 4199879
## -------------------------------------------------------- 
## pf$gender: male
## [1] 1586098
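by() prints nicely but its result is awkward to compute with. As a hedged alternative, the sketch below uses tapply() on made-up data (not the course data) to get the same per-gender totals as a named vector.

```r
# Made-up data (NOT the course data).
toy <- data.frame(
  gender = c("female", "male", "female", NA, "male"),
  www_likes = c(10, 3, 25, 7, 4)
)

# by() splits www_likes by gender and applies sum() to each piece.
by(toy$www_likes, toy$gender, sum)

# tapply() returns the same totals as a named vector, which is easier to reuse.
totals <- tapply(toy$www_likes, toy$gender, sum)
totals[["female"]] / totals[["male"]]   # ratio of the two totals
```

Rows with a missing gender are left out of both results.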

Box Plots

Notes:

qplot(x=gender,y=friend_count,
      data=subset(pf,!is.na(gender)),
      geom='boxplot')+
  scale_y_log10()

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Removed 1962 rows containing non-finite values (stat_boxplot).

Adjust the code to focus on users who have friend counts between 0 and 1000.

qplot(x=gender,y=friend_count,
      data=subset(pf,!is.na(gender)),
      geom='boxplot')+
  coord_cartesian(ylim=c(0,1000))
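coord_cartesian() only zooms the view, so the boxes are still computed from all users. Setting limits on the scale, as in the earlier histograms, removes out-of-range values first, which would change the quartiles. The sketch below shows that difference with quantile() on made-up friend counts (not the course data).

```r
# Made-up friend counts (NOT the course data).
friend_count <- c(10, 40, 80, 150, 300, 700, 1500, 4000)

# Zooming leaves the data alone, so the box reflects quartiles of ALL values.
quantile(friend_count, c(0.25, 0.5, 0.75))

# Removing values above 1000 first changes the quartiles themselves.
trimmed <- friend_count[friend_count <= 1000]
quantile(trimmed, c(0.25, 0.5, 0.75))
```

For a boxplot this matters because the box itself is drawn from these quartiles.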

Box Plots, Quartiles, and Friendships

Notes:

qplot(x=gender,y=friendships_initiated,
      data=subset(pf,!is.na(gender)),
      geom='boxplot')+
  coord_cartesian(ylim=c(0,500))

On average, who initiated more friendships in our sample: men or women?

Response:

Write about some ways that you can verify your answer.

Response:

Getting Logical

Notes:

Response:

Analyzing One Variable

Reflection:

Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!

R Language Study Notes - Univariate Analysis
