Data Manipulation in R with dplyr
-
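Introduction to dplyr and tbls
tbl_df() converts a data.frame into a tibble, a special kind of data.frame that is easier to inspect and work with. In current dplyr, tbl_df() is deprecated in favour of as_tibble().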
two <- c("AA", "AS")
lut <- c("AA" = "American",
"AS" = "Alaska",
"B6" = "JetBlue")
two <- lut[two]
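two now holds the carrier names: indexing the named vector lut by the character vector two looks up the entry associated with each code.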
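The lookup above relies on indexing a named vector by a character vector. A minimal sketch of the same idea, assuming a hypothetical query vector `codes` (not in the original notes) that repeats one code:

```r
# Hypothetical carrier codes; "B6" is in the table but is never asked for.
codes <- c("AA", "AS", "AA")
lut <- c("AA" = "American",
         "AS" = "Alaska",
         "B6" = "JetBlue")

# Indexing by name returns one value per requested code, in the same order.
carriers <- lut[codes]

# The codes travel along as element names; unname() drops them.
plain <- unname(carriers)
plain
```

Repeated codes are fine, and a code missing from lut would come back as NA rather than raising an error.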
-
Select and mutate
select(), which returns a subset of the columns,
filter(), that is able to return a subset of the rows,
arrange(), that reorders the rows according to single or multiple variables,
mutate(), used to add columns from existing data,
summarise(), which reduces each group to a single row by calculating aggregate measures.
starts_with("X"): every name that starts with "X",
ends_with("X"): every name that ends with "X",
contains("X"): every name that contains "X",
matches("X"): every name that matches "X", where "X" can be a regular expression,
num_range("x", 1:5): the variables named x1, x2, x3, x4 and x5 (add width = 2 to match x01, x02, x03, x04 and x05),
one_of(x): every name that appears in x, which should be a character vector.
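The helpers listed above can be tried on a toy table. This is only a sketch: the table `df` and its column names are made up for illustration and are not the hflights data.

```r
library(dplyr)

# Made-up one-row table; the column names exist only to exercise the helpers.
df <- tibble(x1 = 1, x2 = 2, x3 = 3, ArrDelay = 4, DepDelay = 5, TaxiIn = 6)

taxi_cols  <- names(select(df, starts_with("Taxi")))    # names beginning with "Taxi"
delay_cols <- names(select(df, ends_with("Delay")))     # names ending in "Delay"
ep_cols    <- names(select(df, contains("ep")))         # names containing "ep"
regex_cols <- names(select(df, matches("^x[12]$")))     # regular expression match
range_cols <- names(select(df, num_range("x", 2:3)))    # x2 and x3

delay_cols
```

Wrapping each call in names() makes it easy to see which columns a helper picked without printing the data.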
-
Filter and arrange
library(dplyr)
x %in% c(a, b, c), TRUE if x is in the vector c(a, b, c)
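c1 <- filter(hflights, Dest == "JFK")  # keep only the rows that meet the condition
c2 <- mutate(c1, Date = paste(Year, Month, DayofMonth, sep = "-"))  # add a new column
select(c2, Date, DepTime, ArrTime, TailNum)  # keep a subset of the columns
arrange(dtc, DepDelay)  # then sort the rows with arrange(); wrapping a column in desc() sorts it from largest to smallest
summarise(filter(hflights, Diverted == 1), max_div = max(Distance))  # summarise() collapses the data to one row; the first argument is the data to summarise, the remaining arguments are name = aggregate pairs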
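The four steps above can be run end to end on a small table. A sketch only: `flights` is a tiny invented stand-in for hflights, and the intermediate names are hypothetical.

```r
library(dplyr)

# Tiny stand-in for hflights; the values are invented for illustration.
flights <- tibble(
  Year = 2011,
  Month = c(1, 1, 2),
  DayofMonth = c(5, 9, 3),
  Dest = c("JFK", "LAX", "JFK"),
  DepDelay = c(12, -3, 4)
)

jfk    <- filter(flights, Dest == "JFK")                                # rows
dated  <- mutate(jfk, Date = paste(Year, Month, DayofMonth, sep = "-")) # new column
picked <- select(dated, Date, DepDelay)                                 # columns
sorted <- arrange(picked, desc(DepDelay))                               # largest delay first
sorted
```

Each verb takes a data frame as its first argument and returns a new one, which is why the steps chain so naturally.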
-
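Summarise and the pipe operator
min(x) - The minimum value of vector x.
max(x) - The maximum value of vector x.
mean(x) - The mean of vector x.
median(x) - The median of vector x.
quantile(x, p) - The pth quantile of vector x.
sd(x) - The standard deviation of vector x.
var(x) - The variance of vector x.
IQR(x) - Inter Quartile Range (IQR) of vector x.
diff(range(x)) - The total range of vector x (maximum minus minimum).
temp1 <- filter(hflights, !is.na(ArrDelay))  # keep only the rows where ArrDelay is not NA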
first(x) - The first element of vector x.
last(x) - The last element of vector x.
nth(x, n) - The nth element of vector x.
n() - The number of rows in the data.frame or group of observations that summarise() describes.
n_distinct(x) - The number of unique values in vector x.
hflights %>%
  summarise(n_non = n(),
            n_dest = n_distinct(Dest),
            min_dist = min(Distance),
            max_dist = max(Distance))  # general form of a summarise() call: name = aggregate expression
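The aggregate functions above combine naturally with the pipe. A minimal sketch, assuming an invented table `flights` in place of hflights and hypothetical result names:

```r
library(dplyr)

# Invented rows standing in for hflights; one ArrDelay is missing on purpose.
flights <- tibble(
  Dest = c("JFK", "LAX", "JFK", "SFO"),
  Distance = c(1428, 1379, 1428, 1635),
  ArrDelay = c(10, NA, -5, 20)
)

stats <- flights %>%
  filter(!is.na(ArrDelay)) %>%            # drop rows with a missing delay
  summarise(
    n_obs    = n(),                       # rows left after filtering
    n_dest   = n_distinct(Dest),          # distinct destinations
    min_dist = min(Distance),
    max_dist = max(Distance),
    spread   = diff(range(Distance))      # max minus min
  )
stats
```

Note that n() takes no arguments; counting distinct values of a column is the job of n_distinct().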
-
Group_by and working with databases
Joining Data in R with dplyr
-
Mutating joins
left_join(x, y) adds the matching columns of y to x and keeps every row of x; right_join() works the same way but keeps every row of y.
bands2 <- left_join(bands, artists, by = c("first", "last"))
setequal() checks whether two data frames contain the same rows, regardless of order.
semi_join()
songs %>%
  semi_join(labels, by = "album")  # keep the rows of songs that have a match in labels
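To see how the joins above differ, here is a small sketch. The people in `bands` and `artists` are invented; only the first two appear in both tables.

```r
library(dplyr)

# Invented tables keyed on first and last name.
bands <- tibble(
  first = c("Ann", "Bob", "Cy"),
  last  = c("Lee", "Ray", "Fox"),
  band  = c("Band1", "Band2", "Band3")
)
artists <- tibble(
  first = c("Ann", "Bob", "Dee"),
  last  = c("Lee", "Ray", "Orr"),
  instrument = c("Guitar", "Drums", "Bass")
)

lj <- left_join(bands, artists, by = c("first", "last"))   # every row of bands
rj <- right_join(bands, artists, by = c("first", "last"))  # every row of artists
sj <- semi_join(bands, artists, by = c("first", "last"))   # bands rows with a match
lj
```

The mutating joins add columns and fill unmatched cells with NA, while semi_join() only filters the rows of its first argument and adds no columns.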
-
Filtering joins and set operations
artists %>%
  anti_join(bands, by = c("first", "last"))  # anti_join(): the rows of artists with no match in bands, matching on first and last name
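aerosmith %>%
  union(greatest_hits)  # union() combines the rows of both tables and drops duplicates; it compares whole rows, so it takes no by argument
intersect() is similar to semi_join(): it keeps the rows that appear in both tables.
setdiff() is similar to anti_join(): it keeps the rows of the first table that do not appear in the second.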
all_songs <- live_songs %>% union(greatest_songs)
common_songs <- live_songs %>% intersect(greatest_songs)
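all_songs %>% setdiff(common_songs)  # how union, intersect and setdiff differ: taking all songs and removing the common songs leaves the songs that appear on only one side
identical(definitive, complete)  # checks whether the two contain the same songs in the same order
setequal(definitive, complete)  # checks whether the two contain the same songs in any order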
definitive %>%
  setdiff(complete)  # songs that are in definitive but not in complete
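The set operations above can be checked on two tiny tables. A sketch with invented song titles; the names `one_side`, `same_any_order` and `same_exact` are hypothetical.

```r
library(dplyr)

# Invented song lists that overlap in two titles.
live_songs     <- tibble(song = c("Song A", "Song B", "Song C"))
greatest_songs <- tibble(song = c("Song B", "Song C", "Song D"))

all_songs    <- live_songs %>% union(greatest_songs)      # rows in either table
common_songs <- live_songs %>% intersect(greatest_songs)  # rows in both tables
one_side     <- all_songs %>% setdiff(common_songs)       # rows in exactly one table

# Same rows in a different order: setequal() ignores order, identical() does not.
shuffled <- tibble(song = c("Song C", "Song A", "Song B"))
same_any_order <- setequal(live_songs, shuffled)
same_exact     <- identical(live_songs, shuffled)
one_side
```

Unlike the joins, these functions compare entire rows, so both tables need the same columns.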
-
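Assembling data
How dplyr's bind_rows() and bind_cols() differ from base R's rbind() and cbind():
the dplyr versions are faster, accept a list of data frames as input, and always return a tbl. When the column names of the data frames do not match, rbind() raises an error,
whereas bind_rows() creates a column for every unmatched name and fills the missing cells with NA.
bind_rows(.id = "album")  # .id adds a column named album that records which input each row came from (the list names, or the positions when the inputs are unnamed)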
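The bind_rows() behaviour described above can be seen in a small sketch. The two tables and the list names `album_a` and `album_b` are invented for illustration.

```r
library(dplyr)

# Two invented tables whose columns only partly overlap.
side_one <- tibble(song = c("Song A", "Song B"), length = c(200, 180))
side_two <- tibble(song = "Song C", writer = "Writer X")

# A named list goes in; .id stores each list name in a new "album" column.
combined <- bind_rows(list(album_a = side_one, album_b = side_two), .id = "album")
combined
```

The result has the columns album, song, length and writer; cells with no source value are NA. Calling rbind(side_one, side_two) on the same inputs would stop with an error because the column names differ.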
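To do: compare data.frame() with data_frame() (data_frame() is now deprecated in favour of tibble()).
as_data_frame(hank)  # convert to a data_frame, i.e. a tibble (as_tibble() is the current name)
When dplyr binds a factor to a character vector, it converts the factor to character.
A factor must be converted to character before it is converted to numeric; calling as.numeric() directly on a factor returns the internal level codes, not the values.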
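The factor-to-numeric rule is easy to verify in base R. A short sketch with an invented factor `years`:

```r
# Invented factor whose labels look like numbers.
years <- factor(c("1999", "2005", "1999"))

codes  <- as.numeric(years)                 # 1 2 1: the integer level codes
values <- as.numeric(as.character(years))   # 1999 2005 1999: the actual values
values
```

The first conversion silently returns the position of each label among the sorted levels, which is why the detour through as.character() is needed.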
-
Advanced joining
rownames_to_column()
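rownames_to_column(df, new_col): df is a data frame that has row names; the function moves the row names into a new column named new_col.
rename(data, new_name = old_name) renames a column of data from the old name to the new one. When two joined tables both had a column called name, the result contains name.x and name.y, and each can be renamed separately through those suffixes.
left_join(elvis_songs, by = c("name" = "movie"))  # use this form when the key columns have different names in the two tables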
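A sketch of joining on differently named keys and then renaming the key. The tables `movies` and `songs` and their contents are invented, and the non-key columns are deliberately given distinct names so no suffixes are involved.

```r
library(dplyr)

# Invented tables: the key is called "name" on the left and "movie" on the right.
movies <- tibble(name = c("Film A", "Film B"), year = c(1960, 1961))
songs  <- tibble(song = c("Song 1", "Song 2"), movie = c("Film A", "Film C"))

joined <- movies %>%
  left_join(songs, by = c("name" = "movie")) %>%  # left key = right key
  rename(movie = name)                            # new_name = old_name
joined
```

The key column keeps the left table's name after the join, so rename() is used afterwards to give it a clearer label.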
library(purrr)
reduce() can join many tables together by applying the same join to a list of tables, one after another.
reduce(left_join, by = c("first", "last"))
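A sketch of how reduce() chains a join across a list. The three tables and their names are invented; any number of tables sharing the key columns would work the same way.

```r
library(dplyr)
library(purrr)

# Three invented tables sharing the key columns first and last.
names_tbl <- tibble(first = c("Ann", "Bob"), last = c("Lee", "Ray"))
bands_tbl <- tibble(first = c("Ann", "Bob"), last = c("Lee", "Ray"),
                    band = c("Band1", "Band2"))
instr_tbl <- tibble(first = "Ann", last = "Lee", instrument = "Guitar")

# reduce() joins table 1 with table 2, then that result with table 3.
joined <- list(names_tbl, bands_tbl, instr_tbl) %>%
  reduce(left_join, by = c("first", "last"))
joined
```

This is equivalent to nesting two left_join() calls by hand, but it scales to a list of any length.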
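distinct() removes duplicate rows.
count() counts the rows in each group.
Intro to SQL for Data Science
SELECT COUNT(birthdate)
FROM people;  -- counts the rows whose birthdate is not missing
In a WHERE clause, = tests equality and <> tests inequality.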
SELECT title
FROM films
WHERE (release_year = 1994 OR release_year = 1995)
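AND (certification = 'PG' OR certification = 'R');  -- when combining OR with AND, wrap each OR group in parentheses
BETWEEN is inclusive: both endpoints are included in the range.
SELECT (4 / 3);  -- both operands are integers, so integer division is used and the result is 1
SELECT (4.0 / 3.0) AS result;  -- with decimal operands the result is 1.333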
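SELECT MAX(budget) AS max_budget,  -- AS renames the column in the query result
MAX(duration) AS max_duration
FROM films;  -- when the same function is applied to several columns, aliases keep the results easy to tell apart
ORDER BY name DESC  -- descending order
ORDER BY birthdate, name  -- sort by birthdate first, then by name
GROUP BY groups the rows.
In SQL, aggregate functions cannot be used in WHERE; to filter on grouped results, use HAVING instead.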
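For comparison with the dplyr notes earlier, the GROUP BY / HAVING / ORDER BY pattern has a close dplyr counterpart. This is a sketch, not part of the SQL course: the table `films` and its values are invented.

```r
library(dplyr)

# Invented release years; 1994 appears twice and 1996 three times.
films <- tibble(release_year = c(1994, 1994, 1995, 1996, 1996, 1996))

busy_years <- films %>%
  group_by(release_year) %>%        # GROUP BY release_year
  summarise(n_films = n()) %>%      # SELECT COUNT(*) AS n_films
  filter(n_films > 1) %>%           # HAVING COUNT(*) > 1
  arrange(desc(release_year))       # ORDER BY release_year DESC
busy_years
```

A filter() placed before summarise() plays the role of WHERE, while a filter() placed after it plays the role of HAVING.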