Data Carpentry Workshop - Day 2 -Working with tabular data in R

IMG_7436.JPG

Main Contents

Describe what a data frame is.
Load external data from a .csv file into a data frame.
Summarize the contents of a data frame.
Describe what a factor is.
Convert between strings and factors.
Reorder and rename factors.
Change how character strings are handled in a data frame.
Format dates.

1. Preparation for a dataframe

1.1 Download file

download.file("https://ndownloader.figshare.com/files/2292169",
   "C:/Users/home/Desktop/Rcourse/DataCarpentry33/RawData/portal_data_joined1.csv")

1.2 Load file

survey <- read.csv("C:/Users/home/Desktop/Rcourse/DataCarpentry33/RawData/portal_data_joined1.csv")
surveys <- read.table(file="C:/Users/home/Desktop/Rcourse/DataCarpentry33/RawData/portal_data_joined1.csv", sep=",", header=TRUE)

2. Inspecting dataframe objects

2.1 Contents

head(): shows the first 6 rows
tail(): shows thelast 6 rows

2.2 Size

dim(): returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
nrow(): returns the number of rows
ncol(): returns the number of columns

2.3 Names

names(): returns the column names (synonym of colnames() for data.frame objects)
colnames()
rownames(): returns the row names

2.4 Summary

str(): structure of the object and information about the class, length and content of each column
summary(): summary statistics for each column

3. Indexing and subsetting data frames

3.1 Indexing for data

v1 <- survey[1,1] #第1行第1列,vector
v2 <- survey[1,] #第1行,dataframe
v3 <- survey[,1] #第1列,vector
v4 <- survey[1] #第1列，dataframe
v5 <- survey[1:3,6] #第6列的1-3个元素 (未得到预期？)，vector
v6 <- survey[1:6,] #前6行，同head(survey)，dataframe

3.2 Exclude certain indices of a data frame using the “-” sign

survey[,-1] #去除第一列的整个表格
survey[-c(7:34786),] #前6行，相当于head(survey)

3.3 Subsetting by calling indices or column names

a <- survey["species_id"] #dataframe,某列表
b <- survey[,"species_id"] #某列组成的vector
c <- survey[["species_id"]]#同上，某列组成的vector
d <- survey$species_id # 某列组成的vector

4. Factors

4.1 Basic factors

sex <- factor(c("male","female","female","male"))
sex
levels(sex) #male female
nlevels(sex) #2
sex <- factor(sex,levels = c("male","female"))##不太懂此处含义
sex#未发现新的变化

4.2 Converting factors

sex_t <- as.character(sex) # convert to character
sex_t
year_fct <- factor(c(1990, 1983, 1977, 1998, 1990))
as.numeric(year_fct)               # Wrong! And there is no warning...
as.numeric(as.character(year_fct)) # Works...
as.numeric(levels(year_fct))[year_fct]    # The recommended way.

Three steps to transfer into numberic:

We obtain all the factor levels using levels(year_fct).
We convert these levels to numeric values using as.numeric(levels(year_fct)).
We then access these numeric values using the underlying integers of the vector year_fct inside the square brackets.

4.3 Renaming factors

plot(survey$sex) #view 
sex <- survey$sex
sex
head(sex)
levels(sex)
nlevels(sex)
levels(sex)[1] <- "determined"
levels(sex)
head(sex)
levels(sex)[2:3] <- c("Female", "Male")
levels(sex)
sex <- factor(sex,levels = c("Female","Male","undetermined"))
levels(sex)
plot(sex)#名称已换，但数值对不上号

4.4 Using stringsAsFactors = FALSE

Compare the difference between read as "factors" vs "character"

surveys1 <- read.csv("C:/Users/home/Desktop/Rcourse/DataCarpentry33/RawData/portal_data_joined1.csv",stringsAsFactors = TRUE)
str(surveys1)
surveys2 <- read.csv("C:/Users/home/Desktop/Rcourse/DataCarpentry33/RawData/portal_data_joined1.csv",stringsAsFactors = FALSE)
str(surveys2)
surveys2$plot_type <- factor(surveys2$plot_type)
str(surveys2) #Convert the column "plot_type" into a factor

Practice

animal_data <- data.frame(animal = c("dog", "cat", "sea cucumber", "sea urchin"), feel = c("furry", "squishy", "spiny","sweety"),weight = c(45, 8,1.1, 0.8))
country_climate <- data.frame(country = c("Canada", "Panama", "South Africa", "Australia"), climate = c("cold", "hot", "temperate", "hot/temperate"), temperature = c(10, 30, 18, "15"), northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), has_kangaroo = c(FALSE, FALSE, FALSE, 1), stringsAsFactors = FALSE)
str(country_climate)
country_climate <- data.frame(country = c("Canada", "Panama", "South Africa", "Australia"),
                          climate = c("cold", "hot", "temperate", "hot/temperate"),
                          temperature = c(10, 30, 18, 15),
                          northern_hemisphere = c(TRUE, TRUE, FALSE, FALSE),
                          has_kangaroo = c(FALSE, FALSE, FALSE, TRUE), stringsAsFactors = FALSE)
str(country_climate)

4.5 Formatting Dates

library(tidyverse)
library(lubridate)
my_date <- ymd("2015-01-01")
str(my_date)
# sep indicates the character to use to separate each component
my_date <- ymd(paste("2015", "1", "1", sep = "-")) 
str(my_date)
paste(survey$year, survey$month, survey$day, sep = "-")
ymd(paste(survey$year, survey$month, survey$day, sep = "-"))
survey$date <- ymd(paste(survey$year, survey$month, survey$day, sep = "-"))
view(survey)
str(survey)
summary(survey)
summary(survey$date)
# considering the missing data (未找到正确解决方式）
is_missing_date <- is.na(survey$date)
date_columns <- c("year", "month", "day")
missing_dates <- survey[is_missing_date,  date_columns]
head(missing_dates)

Data Carpentry Workshop - Day 2 -Working with tabular data in R

Main Contents

1. Preparation for a dataframe

1.1 Download file

1.2 Load file

2. Inspecting dataframe objects

2.1 Contents

2.2 Size

2.3 Names

2.4 Summary

3. Indexing and subsetting data frames

3.1 Indexing for data

3.2 Exclude certain indices of a data frame using the “-” sign

3.3 Subsetting by calling indices or column names

4. Factors

4.1 Basic factors

4.2 Converting factors

Three steps to transfer into numberic:

4.3 Renaming factors

4.4 Using stringsAsFactors = FALSE

Compare the difference between read as "factors" vs "character"

Practice

4.5 Formatting Dates

下期预告

Data Manipulation using dplyr and tidyr

Data visualization with ggplot2

你可能感兴趣的:(Data Carpentry Workshop - Day 2 -Working with tabular data in R)