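① Ctrl+Shift+C comments out the current line, or the selected region if there is a selection; pressing it again removes the comment
② Alt + - inserts the assignment operator (<-)
③ Ctrl+L clears the console
④ R only supports single-line comments, introduced by #
1. Show the current working directory
> getwd()
[1] "C:/Users/sxl/Documents"
2. Change the current working directory
> setwd(dir = "c:/Users/sxl/Desktop/R")
3. List the files in a directory (the two functions below are equivalent)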
> list.files()
[1] "desktop.ini" "jongde" "KingsoftData" "MATLAB" "My Music"
[6] "My Pictures" "My Videos" "OneNote 笔记本" "Tencent Files" "WeChat Files"
[11] "手机模拟大师" "自定义 Office 模板"
> dir()
[1] "desktop.ini" "jongde" "KingsoftData" "MATLAB" "My Music"
[6] "My Pictures" "My Videos" "OneNote 笔记本" "Tencent Files" "WeChat Files"
[11] "手机模拟大师" "自定义 Office 模板"
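4. Assign a value to a variable (= also works, but it is easy to confuse with the equality test ==, so it is not recommended)
Right-hand assignment such as 6 -> x also works but is not recommended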
> x <- 3
> x
[1] 3
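Superassignment: <<- assigns to a variable in an enclosing environment (the global environment when used at top level), not to a local variable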
> x <<- 5
> x
[1] 5
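To see where <<- actually writes, the following minimal sketch compares it with ordinary assignment inside a function. The names counter, bump_local and bump_global are made up for illustration and do not appear in the notes.

```r
# Hypothetical names; a sketch of local vs. enclosing-environment assignment.
counter <- 0

bump_local <- function() {
  counter <- counter + 1   # creates a new local variable; the outer counter is untouched
  counter
}

bump_global <- function() {
  counter <<- counter + 1  # modifies counter in the enclosing environment
  invisible(counter)
}

local_result <- bump_local()  # the function returns 1
after_local  <- counter       # the outer counter is still 0
bump_global()
after_global <- counter       # the outer counter is now 1
```

At the console prompt there is no enclosing function, so x <<- 5 behaves like x <- 5; the difference only shows up inside functions.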
5. A function result can be printed directly or assigned to a variable
> sum (1,2,3,4,5)
[1] 15
> y <- sum (1,2,3,4,5)
> y
[1] 15
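6. Compute the arithmetic mean (the values must be passed as one vector)
> z <- mean(1,2,3,4,5) #only the first argument is treated as data, so the result is 1; use mean(c(1,2,3,4,5)) to get 3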
> z
[1] 1
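The result 1 above is a common pitfall rather than the mean of the five numbers. A short sketch of the difference between sum() and mean(), using a hypothetical variable name values:

```r
# mean() takes its data as a single vector in the first argument;
# further positional arguments are matched to trim, na.rm and so on.
values <- c(1, 2, 3, 4, 5)

mean_of_first_only <- mean(1, 2, 3, 4, 5)  # x = 1, the rest are not data
mean_of_vector     <- mean(values)         # the mean of all five values
sum_of_values      <- sum(1, 2, 3, 4, 5)   # sum() does add up all its arguments
```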
7. List the current variables and their details
str(x) on its own shows the details of that one variable
> ls()
[1] "x" "y" "z"
> ls.str()
x : num 3
y : num 15
z : num 1
ls() does not list objects whose names start with a dot (hidden objects)
Fix:
> ls(all.names = TRUE)
[1] ".Random.seed" "x" "y" "z"
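8. Remove variables or functions
> rm (x) #remove the variable x
> x
Error: object 'x' not found
> rm (y,z) #remove y and z
> rm (list = ls()) #remove all variables
9. Show the command history
> history(25) #the 25 most recent commands
10. Save the current workspace (plots are not saved)
> save.image()
11. Quit: q() or the menu bar
12. Install a package online (note the quotes); packages can also be installed from source, and their dependencies must be installed too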
> install.packages("vcd")
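13. Show where the library is located
> .libPaths()
[1] "D:/R/R-4.0.2/library"
> library() #list the packages in the library
14. View detailed information about a package
> help(package = "vcd")
15. List the basic contents of a package (e.g. data sets)
> library(help = "vcd")
16. List all objects exported by a package (it must be attached first)
> ls("package:vcd")
17. Find the data sets in a package
> data(package = "vcd")
18. require(): require(package) loads the namespace of the package and attaches it to the search list, like library(package). The search list is checked first; if the package is not available, require() returns FALSE with a warning instead of raising an error, otherwise it returns TRUE.
> require (vcd) #run this first, otherwise commands such as ls("package:vcd") above fail
19. Detach a package (remove it from the search path without uninstalling it)
> detach("package:vcd")
20. Remove a package from the hard disk completely
> remove.packages("vcd")
21. Migrating packages in bulk
> installed.packages() #list all currently installed packages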
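The notes only show the listing step of a bulk migration. One possible way to finish it, sketched under the assumption that the package names are carried over in a file (the file name installed_pkgs.rds and the variable names are hypothetical), is:

```r
# On the source machine: save the names of all installed packages.
pkg_names <- rownames(installed.packages())
list_file <- file.path(tempdir(), "installed_pkgs.rds")  # hypothetical location
saveRDS(pkg_names, list_file)

# On the target machine: load the list and work out what is missing.
wanted       <- readRDS(list_file)
missing_pkgs <- setdiff(wanted, rownames(installed.packages()))

# Uncomment on the target machine to install whatever is missing:
# install.packages(missing_pkgs)
```

Run on a single machine, missing_pkgs is empty, which makes the sketch safe to try before copying the file elsewhere.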
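22. Help documentation
> help.start()
If nothing happens, you should open ‘http://127.0.0.1:12755/doc/html/index.html’ yourself
Making 'packages.html' ... done
> help(sum)
> ?plot #also shows the help page for a function
> args(plot) #prints the function's arguments in the console
> example(mean) #run the examples of an ordinary function
> example("hist") #run the examples of a plotting function
> help(package = ggplot2) #view a package
> vignette() #list package vignettes (tutorials, introductions and so on)
23. Search the help system, for example for heat maps
A heat map marks regions of a plot or page according to how much attention they receive, usually through color intensity, point density, or proportion.
> help.search("heatmap")
> ??heatmap
24. List objects whose names contain a given string
> apropos("sum")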
[1] ".colSums" ".rowSums" ".rs.callSummary" ".rs.summarizeDir" ".rs.tutorial.onResume"
[6] ".tryResumeInterrupt" "colSums" "contr.sum" "cumsum" "format.summaryDefault"
[11] "marginSums" "print.summary.table" "print.summary.warnings" "print.summaryDefault" "rowsum"
[16] "rowsum.data.frame" "rowsum.default" "rowSums" "sum" "summary"
[21] "Summary" "summary.aov" "summary.connection" "summary.data.frame" "Summary.data.frame"
[26] "summary.Date" "Summary.Date" "summary.default" "Summary.difftime" "summary.factor"
[31] "Summary.factor" "summary.glm" "summary.lm" "summary.manova" "summary.matrix"
[36] "Summary.numeric_version" "Summary.ordered" "summary.POSIXct" "Summary.POSIXct" "summary.POSIXlt"
[41] "Summary.POSIXlt" "summary.proc_time" "summary.srcfile" "summary.srcref" "summary.stepfun"
[46] "summary.table" "summary.warnings" "summaryRprof"
> apropos ("sum",mode = "function") #search functions only
25. Search the R site online by keyword
> RSiteSearch("matlab")
26. View the data sets available in a package
> data(package = "car")
> data(package = .packages(all.available = TRUE))
> data(Chile,package = "car")
> Chile
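27. Vectors: similar to a set in mathematics, made up of one or more elements (created with c())
A one-dimensional array that stores numeric, character, or logical data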
> x <- c(1,2,3,4,5)
> x
[1] 1 2 3 4 5
> print(x) #equivalent to typing x
[1] 1 2 3 4 5
> y <- c("one","two","three") #note the quotes
> z <- c(TRUE,FALSE,T,F) #logical constants must be all uppercase; True or true raises an error
> c(1:100) #arithmetic sequence
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
[33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
[65] 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
[97] 97 98 99 100
> seq(from = 1, to = 100)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
[33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
[65] 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
[97] 97 98 99 100
> seq(from = 1, to = 100, by = 2)
[1] 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87
[45] 89 91 93 95 97 99
> seq(from = 1, to = 100, length.out = 10) #return 10 evenly spaced values
[1] 1 12 23 34 45 56 67 78 89 100
> rep(2,5) #repeat 2 five times
[1] 2 2 2 2 2
> rep(x,5) #repeat the vector x five times
[1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
> rep(x,each = 5) #repeat each element five times
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5
> rep(x,each = 5,times = 2) #then repeat the whole result twice
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5
> a <- c(1,2,"one")
> a
[1] "1" "2" "one"
> mode(a) #check the data type; the numbers were coerced to character
[1] "character"
> rep(x,c(2,4,6,1,3)) #repeat each element the given number of times
[1] 1 1 2 2 2 2 3 3 3 3 3 3 4 5 5 5
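28. Vector indexing
> length(x) #number of elements (x now holds 1:100)
[1] 100
> x[1] #get an element (indices start at 1)
[1] 1
> x[0] #index 0 returns an empty vector
integer(0)
> x[-19] #a negative index returns everything except that element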
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
[34] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
[67] 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
> x[c(4:18)] #elements 4 to 18
[1] 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
> x[c(1,23,45,67,89)] #elements at the given positions
[1] 1 23 45 67 89
> x[c(11,11,23,23,5,90,2)] #the same element can be accessed more than once
[1] 11 11 23 23 5 90 2
> y[y > 5] #elements greater than 5 (y now holds 1:10)
[1] 6 7 8 9 10
> y[y > 5 & y < 9] #elements within a range
[1] 6 7 8
29. Looking up strings
> z <- c("one","two","three","fore","five")
> z
[1] "one" "two" "three" "fore" "five"
> "one" %in% z #is the string in the vector?
[1] TRUE
> z[z %in% c("one","two")] #same as z[c(T,T,F,F,F)]; still logical indexing, so the first two values are returned
[1] "one" "two"
> z %in% c("one","two")
[1] TRUE TRUE FALSE FALSE FALSE
> "one" %in% z
[1] TRUE
> z["one" %in% z] #the index is a single TRUE, which is recycled over z, so every element is returned
[1] "one" "two" "three" "fore" "five"
> k <- z %in% c("one","two") #store the logical index
> z(k)
Error in z(k) : could not find function "z"
> z[k]
[1] "one" "two"
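The two indexing results above differ only in the length of the logical index. A small sketch of that recycling rule, with made-up variable names single_flag and per_element:

```r
z <- c("one", "two", "three", "fore", "five")

single_flag  <- "one" %in% z    # one TRUE: "is 'one' somewhere in z?"
all_elements <- z[single_flag]  # TRUE is recycled to every position

per_element  <- z %in% c("one", "two")  # one TRUE/FALSE per element of z
selected     <- z[per_element]          # keeps only the matching elements

positions    <- which(per_element)      # the numeric positions of the matches
```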
30. Use names() to name the elements of a vector
> y <- 1:10
> names(y) <- c("one","two","three","four","five","six","seven","eight","nine","ten")
> y
one two three four five six seven eight nine ten
1 2 3 4 5 6 7 8 9 10
> names(y)
[1] "one" "two" "three" "four" "five" "six" "seven" "eight" "nine" "ten"
> euro
ATS BEF DEM ESP FIM FRF IEP ITL LUF NLG PTE
13.760300 40.339900 1.955830 166.386000 5.945730 6.559570 0.787564 1936.270000 40.339900 2.203710 200.482000
> euro("ATS")
Error in euro("ATS") : could not find function "euro"
> y["one"]
one
1
> euro["ATS"]
ATS
13.7603
31. Modifying vectors
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
[33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
[65] 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
[97] 97 98 99 100
> x[101] <- 101 #append an element
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
[33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
[65] 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
[97] 97 98 99 100 101
> v <- 1:3 #assign several elements at once
> v[c(4,5,6)] <- c(4,5,6)
> v
[1] 1 2 3 4 5 6
> v[20] <- 4 #assigning to v[20] extends the vector to 20 elements; the unassigned ones are NA
> v
[1] 1 2 3 4 5 6 NA NA NA NA NA NA NA NA NA NA NA NA NA 4
> append(x = v,values = 99, after = 5) #insert 99 after the fifth element
[1] 1 2 3 4 5 99 6 NA NA NA NA NA NA NA NA NA NA NA NA NA 4
> append(x = v,values = 99, after = 0) #after = 0 inserts at the front
[1] 99 1 2 3 4 5 6 NA NA NA NA NA NA NA NA NA NA NA NA NA 4
> rm(v) #remove the whole vector
> v
Error: object 'v' not found
> y[-c(1:3)] #drop elements (this returns a copy without the first three)
four five six seven eight nine ten
4 5 6 7 8 9 10
> y <- y[-c(1:3)]
> y
four five six seven eight nine ten
4 5 6 7 8 9 10
> y["four"] <- 100 #change the value of an element
> y
four five six seven eight nine ten
100 5 6 7 8 9 10
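32. Vector arithmetic
Exponentiation: ** (the same as ^)
Modulo: %%
Integer division: %/%
Membership operator: %in%
Element-wise equality test of two vectors: == (a single = is assignment and would change the vector)
Vector functions:
abs(x): absolute value
sqrt(x): square root
log(16,base = 2): logarithm of 16 to base 2
log(16): natural logarithm (base e) by default
log10 (10): base-10 logarithm
exp(x): e raised to the power of each element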
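Because the default base of log() is easy to get wrong, here is a brief sketch of the logarithm functions listed above; the variable names are only illustrative.

```r
natural_log <- log(16)            # base e, the default
base2_log   <- log(16, base = 2)  # explicit base
same_base2  <- log2(16)           # shorthand for base 2
base10_log  <- log10(1000)        # shorthand for base 10
round_trip  <- exp(log(16))       # exp() undoes the natural logarithm
```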
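> ceiling(c(-2,3,3,3.1415)) #smallest integer not less than x
[1] -2 3 3 4
> floor(c(-2,3,3,3.1415)) #largest integer not greater than x
[1] -2 3 3 3
> trunc(c(-2,3,3.1415)) #drop the fractional part
[1] -2 3 3
> round(c(-2,3,3.1415)) #round to the nearest integer
[1] -2 3 3
> round(c(-2,3,3.1415),digits = 2) #round to the given number of decimal places
[1] -2.00 3.00 3.14
> signif(c(-2,3,3.1415),digits = 2) #keep the given number of significant digits
[1] -2.0 3.0 3.1
> sin(x) #trigonometric functions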
[1] 0.8414710 -0.5365729 -0.8462204 0.5290827 0.8509035 -0.5215510 -0.8555200 0.5139785 0.8600694 -0.5063656
> cos(x)
[1] 0.5403023 0.8438540 -0.5328330 -0.8485703 0.5253220 0.8532201 -0.5177698 -0.8578031 0.5101770 0.8623189
Statistical functions
> vec <- 1:100 #a numeric vector
> vec
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
[33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
[65] 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
[97] 97 98 99 100
> sum(vec) #total
[1] 5050
> max(vec) #maximum
[1] 100
> min(vec) #minimum
[1] 1
> range(vec) #minimum and maximum
[1] 1 100
> mean(vec) #mean
[1] 50.5
> var(vec) #variance
[1] 841.6667
> round(var(vec),digits = 2)
[1] 841.67
> round(sd(vec),digits = 2) #standard deviation
[1] 29.01
> prod(vec) #product of all elements
[1] 9.332622e+157
> median(vec) #median
[1] 50.5
> quantile(vec) #quantiles
0% 25% 50% 75% 100%
1.00 25.75 50.50 75.25 100.00
#find index positions
> t <- c(1,4,2,5,7,9,6)
> t
[1] 1 4 2 5 7 9 6
> which.max(t)
[1] 6
> which(t ==7)
[1] 5
> which(t > 5)
[1] 5 6 7
> t[which(t > 5)] #the matching element values
[1] 7 9 6
33. Matrices and arrays
(1) Building a matrix and choosing the fill order
> x <- 1:20
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> m <- matrix(x,mrow = 4,mcol = 5)
Error in matrix(x, mrow = 4, mcol = 5) : unused arguments (mrow = 4, mcol = 5)
> m <- matrix(x,nrow = 4,ncol = 5) #build the matrix
> m <- matrix(1:20,4,5) #filled column by column
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
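> m <- matrix(x,nrow = 4,ncol = 6) #the number of elements must fit the matrix dimensions
Warning message:
In matrix(x, nrow = 4, ncol = 6) : data length [20] is not a sub-multiple or multiple of the number of columns [6]
> m <- matrix(x,nrow = 4,ncol = 5,byrow = T) #filled row by row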
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 6 7 8 9 10
[3,] 11 12 13 14 15
[4,] 16 17 18 19 20
> m <- matrix(x,nrow = 4,ncol = 5,byrow = F) #filled column by column
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
(2) Row and column labels
> rnames <- c("R1","R2","R3","R4")
> rnames
[1] "R1" "R2" "R3" "R4"
> cnames <- c("C1","C2","C3","C4","C5")
> cnames
[1] "C1" "C2" "C3" "C4" "C5"
> dimnames(m) <- list(rnames,cnames)
> m
C1 C2 C3 C4 C5
R1 1 5 9 13 17
R2 2 6 10 14 18
R3 3 7 11 15 19
R4 4 8 12 16 20
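(3) Getting and setting the dimensions (dim() of a plain vector is NULL; assigning dimensions turns it into a matrix)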
> dim(x)
NULL
> dim(x) <- c(4,5)
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
(4) An array in R is a multi-dimensional matrix; creating an array
> x <- 1:20
> dim(x) <- c(2,2,5) #can be viewed as a block of data with length, width and height
> x
, , 1
[,1] [,2]
[1,] 1 3
[2,] 2 4
, , 2
[,1] [,2]
[1,] 5 7
[2,] 6 8
, , 3
[,1] [,2]
[1,] 9 11
[2,] 10 12
, , 4
[,1] [,2]
[1,] 13 15
[2,] 14 16
, , 5
[,1] [,2]
[1,] 17 19
[2,] 18 20
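(5) Create a labeled array with array()
> ?array
> dim1 <- c("A1","A2")
> DIM2 <- C("B1","B2","B3") #typo: uppercase C() sets contrasts on a factor; use lowercase c()
Error in C("B1", "B2", "B3") : object not interpretable as a factor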
> dim2 <- c("B1","B2","B3")
> dim3 <- c("C1","C2","C3","C4")
> z <- array(1:24,c(2,3,4,dimnames = list(dim1,dim2,dim3))
+ z
Error: unexpected symbol in:
"z <- array(1:24,c(2,3,4,dimnames = list(dim1,dim2,dim3))
z"
> z <- array(1:24,c(2,3,4),dimnames = list(dim1,dim2,dim3,)
+
+ z
Error: unexpected symbol in:
"
z"
> z <- array(1:24,c(2,3,4),dimnames = list(dim1,dim2,dim3))
> z
, , C1
B1 B2 B3
A1 1 3 5
A2 2 4 6
, , C2
B1 B2 B3
A1 7 9 11
A2 8 10 12
, , C3
B1 B2 B3
A1 13 15 17
A2 14 16 18
, , C4
B1 B2 B3
A1 19 21 23
A2 20 22 24
(6) Accessing matrix elements
> m <- matrix(1:20,4,5,byrow = T)
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 6 7 8 9 10
[3,] 11 12 13 14 15
[4,] 16 17 18 19 20
> m[1,2] #row 1, column 2
[1] 2
> m[1,c(2,3,4)] #row 1, columns 2, 3 and 4
[1] 2 3 4
> m[c(2:4),c(2,3)]
[,1] [,2]
[1,] 7 8
[2,] 12 13
[3,] 17 18
> m[2,] #row 2
[1] 6 7 8 9 10
> m[,2] #column 2
[1] 2 7 12 17
> m[2] #a single index counts down the columns, so this is the second element
[1] 6
> m[-1,2] #column 2 without row 1
[1] 7 12 17
#access elements by name
> dimnames(m) = list (rnames,cnames)
> m
C1 C2 C3 C4 C5
R1 1 2 3 4 5
R2 6 7 8 9 10
R3 11 12 13 14 15
R4 16 17 18 19 20
> m["R1","R2"]
Error in m["R1", "R2"] : subscript out of bounds
> m["R1","C2"]
[1] 2
(7) Matrix arithmetic
> m + 1 #add 1 to every element
C1 C2 C3 C4 C5
R1 2 3 4 5 6
R2 7 8 9 10 11
R3 12 13 14 15 16
R4 17 18 19 20 21
> n <- matrix(1:20,5,4)
> n
[,1] [,2] [,3] [,4]
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20
> m\
Error: unexpected input in "m\"
> m
C1 C2 C3 C4 C5
R1 1 2 3 4 5
R2 6 7 8 9 10
R3 11 12 13 14 15
R4 16 17 18 19 20
> m + n #matrix addition requires matching dimensions
Error in m + n : non-conformable arrays
> m[,1]
R1 R2 R3 R4
1 6 11 16
> t <- m[,1]
> sum(t) #sum of the first column
[1] 34
> colSums(m) #sum of each column
C1 C2 C3 C4 C5
34 38 42 46 50
> rowSum(m)
Error in rowSum(m) : could not find function "rowSum"
> rowSums(m) #sum of each row
R1 R2 R3 R4
15 40 65 90
#compute the means
> colMeans(m)
C1 C2 C3 C4 C5
8.5 9.5 10.5 11.5 12.5
> rowMeans(m)
R1 R2 R3 R4
3 8 13 18
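> n*t #element-wise product of two matrices (n and t were redefined as 3x3 matrices here)
[,1] [,2] [,3]
[1,] 2 20 56
[2,] 6 30 72
[3,] 12 42 90
> n %*% t #matrix product
[,1] [,2] [,3]
[1,] 42 78 114
[2,] 51 96 141
[3,] 60 114 168
> diag(n) #the diagonal elements of the matrix
[1] 1 5 9
> t(n) #t() transposes a matrix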
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
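The outputs above are consistent with two 3x3 matrices, so the sketch below assumes matrix(1:9, 3, 3) and matrix(2:10, 3, 3); those definitions are inferred from the printed results, not shown in the notes. It contrasts the element-wise product, the matrix product and the outer product.

```r
# Assumed inputs, reconstructed from the printed results above.
left  <- matrix(1:9, 3, 3)
right <- matrix(2:10, 3, 3)

elementwise <- left * right    # multiplies matching elements
matrix_prod <- left %*% right  # rows of left times columns of right
outer_prod  <- 1:3 %o% 1:2     # outer product of two vectors, a 3 x 2 matrix
transposed  <- t(left)         # rows and columns swapped
diagonal    <- diag(left)      # 1 5 9
```

Naming a variable t, as the transcript does, still lets t(n) work because R looks for a function when a name is called, but a different name avoids confusion.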
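34. Lists
Similarity: like a vector, a list is a one-dimensional collection of data
Difference: a vector can hold only one data type, while the elements of a list can be any R data structure, even another list
(1) Creating a list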
> a <- 1:20
> b <- matrix(1:24,4,6)
> c = mtcars
> d <- "This is a test list"
> mlist <- list(a,b,c,d)
> mlist
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
[[2]]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 5 9 13 17 21
[2,] 2 6 10 14 18 22
[3,] 3 7 11 15 19 23
[4,] 4 8 12 16 20 24
[[3]]
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
[[4]]
[1] "This is a test list"
(2) Naming list elements
> mlist <- list(first = a,second = b,third = c,forth = d)
> mlist
(3) Accessing list elements
> mlist[1] #access one element (returned as a sub-list)
$first
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> mlist[c(1,4)] #access several elements
$first
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
$forth
[1] "This is a test list"
> state.center[c("x","y")] #access list elements by name
$x
[1] -86.7509 -127.2500 -111.6250 -92.2992 -119.7730 -105.5130 -72.3573 -74.9841 -81.6850 -83.3736 -126.2500 -113.9300 -89.3776
[14] -86.0808 -93.3714 -98.1156 -84.7674 -92.2724 -68.9801 -76.6459 -71.5800 -84.6870 -94.6043 -89.8065 -92.5137 -109.3200
[27] -99.5898 -116.8510 -71.3924 -74.2336 -105.9420 -75.1449 -78.4686 -100.0990 -82.5963 -97.1239 -120.0680 -77.4500 -71.1244
[40] -80.5056 -99.7238 -86.4560 -98.7857 -111.3300 -72.5450 -78.2005 -119.7460 -80.6665 -89.9941 -107.2560
$y
[1] 32.5901 49.2500 34.2192 34.7336 36.5341 38.6777 41.5928 38.6777 27.8744 32.3329 31.7500 43.5648 40.0495 40.0495 41.9358 38.4204
[17] 37.3915 30.6181 45.6226 39.2778 42.3645 43.1361 46.3943 32.6758 38.3347 46.8230 41.3356 39.1063 43.3934 39.9637 34.4764 43.1361
[33] 35.4195 47.2517 40.2210 35.5053 43.9078 40.9069 41.5928 33.6190 44.3365 35.6767 31.3897 39.1063 44.2508 37.5630 47.4231 38.4204
[49] 44.5937 43.0504
> mlist$first #typing mlist$ brings up the element names in completion; pick the one you need
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
(4) Difference between [ ] and [[ ]]
> mlist[1]
$first
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> mlist[[1]] #[[ ]] extracts the element itself, with its own type, rather than a sub-list
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> class (mlist[1])
[1] "list"
> class (mlist[[1]])
[1] "integer"
(5) Assigning to a list
> mlist[[5]] <- iris
> mlist
$first
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
$second
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 5 9 13 17 21
[2,] 2 6 10 14 18 22
[3,] 3 7 11 15 19 23
[4,] 4 8 12 16 20 24
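#further output omitted; the new [[5]] value is appended at the end
(6) Removing list elements
> mlist[-5] #returns a copy without the fifth element; mlist itself is unchanged
> mlist[[5]] <- NULL #removes the fifth element in place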
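The two removal forms behave differently, which a small stand-alone sketch can make visible. The list demo and its element names are hypothetical and much smaller than mlist.

```r
demo <- list(first = 1:3, second = "text", third = TRUE)

without_third <- demo[-3]      # a copy without element 3; demo is unchanged
length_before <- length(demo)  # still 3

demo[[3]] <- NULL              # removes element 3 from demo itself
length_after  <- length(demo)  # now 2

kind_single <- class(demo[1])    # "list": single brackets keep the list wrapper
kind_double <- class(demo[[1]])  # "integer": double brackets return the element
```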
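35. Data frames: a tabular data structure designed to model a data set
Data set: a rectangular array of data in which rows are observations and columns are variables
Every column of a data frame must hold a single type; different columns may have different types, so a row can mix types
(1) Combine into one data frame (state.name and the others are built-in objects that list the states in the same order)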
> state <- data.frame(state.name,state.abb,state.region,state.x77)
> state
state.name state.abb state.region Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
Alabama Alabama AL South 3615 3624 2.1 69.05 15.1 41.3 20 50708
Alaska Alaska AK West 365 6315 1.5 69.31 11.3 66.7 152 566432
Arizona Arizona AZ West 2212 4530 1.8 70.55 7.8 58.1 15 113417
Arkansas Arkansas AR South 2110 3378 1.9 70.66 10.1 39.9 65 51945
California California CA West 21198 5114 1.1 71.71 10.3 62.6 20 156361
Colorado Colorado CO West 2541 4884 0.7 72.06 6.8 63.9 166 103766
Connecticut Connecticut CT Northeast 3100 5348 1.1 72.48 3.1 56.0 139 4862
Delaware Delaware DE South 579 4809 0.9 70.06 6.2 54.6 103 1982
F
(2) Accessing a data frame
> state[1] #first column of the data frame
state.name
Alabama Alabama
Alaska Alaska
Arizona Arizona
A
> state[c(2,4)] #columns 2 and 4 of the data frame
state.abb Population
Alabama AL 3615
Alaska AK 365
Arizona AZ 2212
> state[-c(2,4)] #everything except these columns
> state[,"state.abb"] #access a column by name
[1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA" "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT"
[27] "NE" "NV" "NH" "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT" "VA" "WA" "WV" "WI" "WY"
> state["Alabama",] #all details for that row
state.name state.abb state.region Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
Alabama Alabama AL South 3615 3624 2.1 69.05 15.1 41.3 20 50708
> state$state.region #using $
[1] South West West South West West Northeast South South
[10] South West West North Central North Central North Central North Central South South
[19] Northeast South Northeast North Central North Central South North Central West North Central
[28] West Northeast Northeast West Northeast South North Central North Central South
[37] West Northeast Northeast South North Central South South West Northeast
[46] South West South North Central West
Levels: Northeast South North Central West
> attach(mtcars) #attach() makes the columns directly accessible
> mpg #type the column name directly
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3
[27] 26.0 30.4 15.8 19.7 15.0 21.4
> hp
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52 65 97 150 150 245 175 66 91 113 264 175 335 109
> rownames(mtcars)
[1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
[7] "Duster 360" "Merc 240D" "Merc 230" "Merc 280" "Merc 280C" "Merc 450SE"
[13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood" "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
[19] "Honda Civic" "Toyota Corolla" "Toyota Corona" "Dodge Challenger" "AMC Javelin" "Camaro Z28"
[25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2" "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
[31] "Maserati Bora" "Volvo 142E"
> colnames(mtcars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
> cyl
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
> hp
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52 65 97 150 150 245 175 66 91 113 264 175 335 109
> detach(mtcars) #detach again
> hp
Error: object 'hp' not found
> with(mtcars,{mpg}) #with() also gives direct access; put the column names inside the braces
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3
[27] 26.0 30.4 15.8 19.7 15.0 21.4
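36. Kinds of variables: nominal, ordinal, and continuous
Factors: nominal and ordinal variables are called factors
Uses of factors: frequency counts, independence tests, correlation tests, analysis of variance, principal component analysis, factor analysis, and so on
(1) Frequency counts
> mtcars$cyl #treat the cyl column as a factor
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
> table(mtcars$cyl) #frequency count of the three levels 4, 6 and 8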
4 6 8
11 7 14
(2) Defining a factor
> f <- factor(c("red","red","green","red","blue","green","blue","blue"))
> f
[1] red red green red blue green blue blue
Levels: blue green red
(3) A sequence with a natural order as a factor (by default the levels are sorted alphabetically)
> week <- factor(c("Mon","Fri","Thu","Wed","Mon","Fri","Sun"))
> week
[1] Mon Fri Thu Wed Mon Fri Sun
Levels: Fri Mon Sun Thu Wed
(4) Specifying the levels
> week <- factor(c("Mon","Fri","Thu","Wed","Mon","Fri","Sun"),ordered = TRUE, levels = c("Mon","Tue","Wed","Thu","Fri","Sat","Sun"))
> week
[1] Mon Fri Thu Wed Mon Fri Sun
Levels: Mon < Tue < Wed < Thu < Fri < Sat < Sun
(5) Converting a vector to a factor
> fcyl = factor(mtcars$cyl)
> fcyl
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
Levels: 4 6 8
(6) Effect on plots
> plot(mtcars$cyl) #a numeric vector gives a scatter plot
> plot(factor(mtcars$cyl)) #a factor gives a bar chart
(7) Binning
> num <- 1:100
> num
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
[33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
[65] 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
[97] 97 98 99 100
> cut (num,c(seq(0,100,10))) #groups of 10; cut() handles regular binning
[1] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10] (10,20] (10,20] (10,20] (10,20]
[15] (10,20] (10,20] (10,20] (10,20] (10,20] (10,20] (20,30] (20,30] (20,30] (20,30] (20,30] (20,30] (20,30] (20,30]
[29] (20,30] (20,30] (30,40] (30,40] (30,40] (30,40] (30,40] (30,40] (30,40] (30,40] (30,40] (30,40] (40,50] (40,50]
[43] (40,50] (40,50] (40,50] (40,50] (40,50] (40,50] (40,50] (40,50] (50,60] (50,60] (50,60] (50,60] (50,60] (50,60]
[57] (50,60] (50,60] (50,60] (50,60] (60,70] (60,70] (60,70] (60,70] (60,70] (60,70] (60,70] (60,70] (60,70] (60,70]
[71] (70,80] (70,80] (70,80] (70,80] (70,80] (70,80] (70,80] (70,80] (70,80] (70,80] (80,90] (80,90] (80,90] (80,90]
[85] (80,90] (80,90] (80,90] (80,90] (80,90] (80,90] (90,100] (90,100] (90,100] (90,100] (90,100] (90,100] (90,100] (90,100]
[99] (90,100] (90,100]
Levels: (0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80] (80,90] (90,100]
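37. Missing data
How the special values differ:
NA: a value that exists but is unknown
NaN: not a number, the result of an undefined operation such as 0/0
Inf: positive or negative infinity (Inf, -Inf), for example 1/0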
> NA == 0
[1] NA
> a <- c(NA,1:49)
> sum(a) #the result is NA
[1] NA
> mean(a)
[1] NA
> sum(a,na,rm = TRUE) #typo: the argument is na.rm = TRUE, which makes the function skip NA values
Error: object 'na' not found
> sum(a,na.rm = TRUE)
[1] 1225
> mean(a,na.rm = TRUE)
[1] 25
> a
[1] NA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
[45] 44 45 46 47 48 49
When data are missing, na.omit() can remove the NA values
> c <- c(NA,1:20,NA,NA)
> c
[1] NA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 NA NA
> d <- na.omit(c)
> d
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
attr(,"na.action")
[1] 1 22 23
attr(,"class")
[1] "omit"
> is.na(d)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> sum(d)
[1] 210
> mean(d)
[1] 10.5
> na.omit(sleep) #drops the rows of the sleep data set that contain NA, but doing so may affect the results
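A compact sketch of the three special values and of row-wise NA removal on a data frame follows. The data frame scores and its columns are invented for the example.

```r
undefined <- 0 / 0   # NaN: the operation has no defined result
infinite  <- 1 / 0   # Inf: positive infinity

# Hypothetical data with two missing scores.
scores <- data.frame(id = 1:4, score = c(10, NA, 30, NA))

n_missing     <- sum(is.na(scores$score))          # count the NA values
complete_rows <- na.omit(scores)                   # drop every row containing NA
mean_score    <- mean(scores$score, na.rm = TRUE)  # skip NA without dropping rows
```

Note that is.na() is also TRUE for NaN, while is.nan() is TRUE only for NaN.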
38. Strings
(1) String length
> nchar("Hello World")
[1] 11
> month.name
[1] "January" "February" "March" "April" "May" "June" "July" "August" "September" "October" "November"
[12] "December"
> nchar(month.name) #number of characters in each string
[1] 7 8 5 5 3 4 4 6 9 7 8 8
> length(month.name) #number of strings in the vector
[1] 12
> nchar(c(12,3,345))
[1] 2 1 3
(2) paste(): concatenates several strings into one
> paste(c("Everybody","loves","states"))
[1] "Everybody" "loves" "states"
> paste("Everybody","loves","states")
[1] "Everybody loves states"
> paste("Everybody","loves","states",sep = "-") #set the separator
[1] "Everybody-loves-states"
(3) Appending a string to each element of a vector
> names <- c("Moe","Larry","Curly")
> paste(names,"love stats")
[1] "Moe love stats" "Larry love stats" "Curly love stats"
(4) Substrings and case conversion
> substr(month.name,1,3) #characters 1 to 3
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
> temp <- substr(x - mi=onth.name,start = 1,stop = 3)
Error: unexpected '=' in "temp <- substr(x - mi="
> temp <- substr(x = month.name,start = 1,stop = 3)
> toupper(temp) #convert to uppercase
[1] "JAN" "FEB" "MAR" "APR" "MAY" "JUN" "JUL" "AUG" "SEP" "OCT" "NOV" "DEC"
> tolower(temp) #convert to lowercase
[1] "jan" "feb" "mar" "apr" "may" "jun" "jul" "aug" "sep" "oct" "nov" "dec"
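#restore the capitalized first letter
> gsub("^(\\w)","\\U\\1",tolower(temp)) #gsub replaces every match, sub only the first; without perl = TRUE, \\U is not interpreted
[1] "Ujan" "Ufeb" "Umar" "Uapr" "Umay" "Ujun" "Ujul" "Uaug" "Usep" "Uoct" "Unov" "Udec"
#^: start of the string  \\w: a word character (letter, digit or underscore)
# \\U: convert what follows to uppercase  \\1: back-reference to the first capture group
> gsub("^(\\w)","\\U\\1",tolower(temp),perl = TURE) #\\U requires Perl-style regular expressions
Error in gsub("^(\\w)", "\\U\\1", tolower(temp), perl = TURE) :
object 'TURE' not found
> gsub("^(\\w)","\\U\\1",tolower(temp),perl = TRUE)
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
#convert the first letter to lowercase
> gsub("^(\\w)","\\L\\1",toupper(temp),perl = TRUE)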
[1] "jAN" "fEB" "mAR" "aPR" "mAY" "jUN" "jUL" "aUG" "sEP" "oCT" "nOV" "dEC"
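(5) Searching strings
> x <- c("b","A+","AC")
> x
[1] "b" "A+" "AC"
> grep("A+",x,fixed = T) #fixed = T matches the literal text A+, found in the second string
[1] 2
> grep("A+",x,fixed = F) #as a regular expression, + means one or more of the preceding character, so AC matches too
[1] 2 3
> match("AC",x) #match() also finds strings, but only exact matches, not regular expressions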
[1] 3
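The difference between literal and regular-expression matching can be checked with a few more calls. This sketch reuses the vector x from above; the result names are illustrative.

```r
x <- c("b", "A+", "AC")

literal_hits <- grep("A+", x, fixed = TRUE)  # literal "A+": element 2 only
regex_hits   <- grep("A+", x)                # regex "one or more A": elements 2 and 3
escaped_hits <- grep("A\\+", x)              # escaped +, literal again: element 2
exact_pos    <- match("AC", x)               # whole-string equality: position 3
has_a        <- grepl("A", x)                # logical vector instead of positions
```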
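(6) Splitting strings
> path <- "/usr/local/bin/R"
> strsplit(path,"/") #path: the string  /: the separator; the result is a list, not a vector
[[1]]
[1] "" "usr" "local" "bin" "R"
> strsplit(c(path,path),"/") #split two paths at once
[[1]]
[1] "" "usr" "local" "bin" "R"
[[2]]
[1] "" "usr" "local" "bin" "R"
(7) Generate all combinations of strings, i.e. the Cartesian product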
> face <- 1:13
> suit <- c("spades","clubs","hearts","diamonds")
> face
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13
> suit
[1] "spades" "clubs" "hearts" "diamonds"
> outer(suit,face,FUN = paste)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "spades 1" "spades 2" "spades 3" "spades 4" "spades 5" "spades 6" "spades 7" "spades 8" "spades 9" "spades 10"
[2,] "clubs 1" "clubs 2" "clubs 3" "clubs 4" "clubs 5" "clubs 6" "clubs 7" "clubs 8" "clubs 9" "clubs 10"
[3,] "hearts 1" "hearts 2" "hearts 3" "hearts 4" "hearts 5" "hearts 6" "hearts 7" "hearts 8" "hearts 9" "hearts 10"
[4,] "diamonds 1" "diamonds 2" "diamonds 3" "diamonds 4" "diamonds 5" "diamonds 6" "diamonds 7" "diamonds 8" "diamonds 9" "diamonds 10"
[,11] [,12] [,13]
[1,] "spades 11" "spades 12" "spades 13"
[2,] "clubs 11" "clubs 12" "clubs 13"
[3,] "hearts 11" "hearts 12" "hearts 13"
[4,] "diamonds 11" "diamonds 12" "diamonds 13"
> outer(suit,face,FUN = paste,sep = "-")
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "spades-1" "spades-2" "spades-3" "spades-4" "spades-5" "spades-6" "spades-7" "spades-8" "spades-9" "spades-10"
[2,] "clubs-1" "clubs-2" "clubs-3" "clubs-4" "clubs-5" "clubs-6" "clubs-7" "clubs-8" "clubs-9" "clubs-10"
[3,] "hearts-1" "hearts-2" "hearts-3" "hearts-4" "hearts-5" "hearts-6" "hearts-7" "hearts-8" "hearts-9" "hearts-10"
[4,] "diamonds-1" "diamonds-2" "diamonds-3" "diamonds-4" "diamonds-5" "diamonds-6" "diamonds-7" "diamonds-8" "diamonds-9" "diamonds-10"
[,11] [,12] [,13]
[1,] "spades-11" "spades-12" "spades-13"
[2,] "clubs-11" "clubs-12" "clubs-13"
[3,] "hearts-11" "hearts-12" "hearts-13"
[4,] "diamonds-11" "diamonds-12" "diamonds-13"
39. Dates and times
> Sys.Date() #show the current date
[1] "2020-11-01"
> class(Sys.date())
Error in Sys.date() : could not find function "Sys.date"
> class(Sys.Date())
[1] "Date"
> a = "2017-01-01"
> as.Date(a)
[1] "2017-01-01"
> as.Date(a,format = "%Y-%m-%d") #year-month-day format
[1] "2017-01-01"
> class(as.Date(a,format = "%Y-%m-%d")) #the result has class Date
[1] "Date"
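The format argument matters once the text is not in year-month-day order. A minimal sketch, using two date strings in month/day/year form; the variable names are hypothetical.

```r
# %m = month, %d = day, %Y = four-digit year
first_date  <- as.Date("10/15/2009", format = "%m/%d/%Y")
second_date <- as.Date("11/01/2009", format = "%m/%d/%Y")

days_between <- as.numeric(second_date - first_date)  # difference in days
iso_text     <- format(first_date, "%Y-%m-%d")        # back to text in ISO order
```

Dates stored as Date objects can be subtracted, compared and passed to seq(), which plain character strings cannot.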
> seq(as.Date("2017-01-01"),as.Date("2017-07-05"),by = 5) #a sequence of dates 5 days apart
[1] "2017-01-01" "2017-01-06" "2017-01-11" "2017-01-16" "2017-01-21" "2017-01-26" "2017-01-31" "2017-02-05" "2017-02-10" "2017-02-15"
[11] "2017-02-20" "2017-02-25" "2017-03-02" "2017-03-07" "2017-03-12" "2017-03-17" "2017-03-22" "2017-03-27" "2017-04-01" "2017-04-06"
[21] "2017-04-11" "2017-04-16" "2017-04-21" "2017-04-26" "2017-05-01" "2017-05-06" "2017-05-11" "2017-05-16" "2017-05-21" "2017-05-26"
[31] "2017-05-31" "2017-06-05" "2017-06-10" "2017-06-15" "2017-06-20" "2017-06-25" "2017-06-30" "2017-07-05"
> sales <- round(runif(48,min = 50,max = 100)) #48 random numbers between 50 and 100; round() rounds them to integers
> sales
[1] 97 76 57 76 82 64 92 71 64 96 70 57 97 58 56 67 95 60 85 70 84 55 95 66 95 63 78 54 81 60 65 80 82 55 92 83 75 58 86 55 56 73 87 94
[45] 86 79 66 73
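#frequency = 1: yearly  4: quarterly  12: monthly
> ts(sales,start = c(2010,5),end = c(2014,4),frequency = 1) #convert the vector to a time series; two-part start and end values need c(); with frequency = 1 the second number only shifts the year, hence Start = 2014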
Time Series:
Start = 2014
End = 2017
Frequency = 1
[1] 97 76 57 76
> ts(sales,start = c(2010,5),end = c(2014,4),frequency = 12)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2010 97 76 57 76 82 64 92 71
2011 64 96 70 57 97 58 56 67 95 60 85 70
2012 84 55 95 66 95 63 78 54 81 60 65 80
2013 82 55 92 83 75 58 86 55 56 73 87 94
2014 86 79 66 73
> ts(sales,start = c(2010,5),end = c(2014,4),frequency = 4)
Qtr1 Qtr2 Qtr3 Qtr4
2011 97 76 57 76
2012 82 64 92 71
2013 64 96 70 57
2014 97 58 56 67
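How start and frequency interact is easier to see with known values than with random ones. The sketch below, a simplified variant of the calls above, stores the numbers 1 to 48 as a monthly series so that positions can be read off directly; the names are illustrative.

```r
# 48 monthly observations beginning in May 2010 (year 2010, period 5).
monthly <- ts(1:48, start = c(2010, 5), frequency = 12)

series_start <- start(monthly)  # year and period of the first observation
series_end   <- end(monthly)    # year and period of the last observation

# Extract calendar year 2011 with window().
year_2011 <- window(monthly, start = c(2011, 1), end = c(2011, 12))
```

With monthly data the second element of start is the month, so c(2010, 5) means May 2010 and 48 observations end in April 2014.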
40. Three ways to get data into R:
(1) Typing the data at the keyboard
① Assign each variable by hand
#define 5 variables
> patientID <- c(1,2,3,4)
> admdate <- c("10/15/2009","11/01/2009","10/21/2009","10/28/2009")
> age <- c(25,34,28,52)
> diabetes <- c("Type1","Type2","Type1","Type1")
> status <- c("Poor","Improved","Excellent","Poor")
#combine the 5 variables into a data frame
> data <- data.frame(patientID,admidate,age,diabetes,status)
Error in data.frame(patientID, admidate, age, diabetes, status) :
object 'admidate' not found
> data <- data.frame(patientID,admdate,age,diabetes,status)
> data #show the data
patientID admdate age diabetes status
1 1 10/15/2009 25 Type1 Poor
2 2 11/01/2009 34 Type2 Improved
3 3 10/21/2009 28 Type1 Excellent
4 4 10/28/2009 52 Type1 Poor
② Define an empty data frame, then open the data editor and type or paste the data
> data2 <- data.frame(patientID = character(0),admdate = character(0),age = numeric(0),diabetes = character(),status = character())
> date2
Error: object 'date2' not found
> data2
[1] patientID admdate age diabetes status
<0 rows> (or 0-length row.names)
> data2 <- edit(data2) #open the data editor
> data2
patientID admdate age diabetes status
1 1 10/15/2009 NA <NA> <NA>
2 2 <NA> NA <NA> <NA>
3 3 <NA> NA <NA> <NA>
4 4 <NA> NA <NA> <NA>
> fix(data2) #edit the data and save the changes in place
> data2
patientID admdate age diabetes status
1 1 10/15/2009 NA Type1 <NA>
2 2 <NA> NA <NA> <NA>
3 3 <NA> NA <NA> <NA>
4 4 <NA> NA <NA> <NA>
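(2) Reading data stored in external files
(3) Retrieving data from a database system
Access databases through ODBC (Open Database Connectivity)
41. Reading files
(1) Reading a file, showing the first and last rows, and reading a subset of rows
> x <- read.table("input.txt") #read a file from the current directory; a file elsewhere needs its full path, otherwise an error is raised
> x <- read.table("D:/RWork/RData/input.txt") #a file packed in an archive must be extracted first, otherwise this also fails
> head(x) #first 6 rows by default
> tail(x) #last 6 rows by default
> head(x,n = 10) #first 10 rows
> x <- read.table("D:/RWork/RData/input.csv",sep = ",") #without the right separator the result is garbled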
> x
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 mpg cyl disp hp drat wt qsec vs am gear carb
2 Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4
3 Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
4 Datsun 710 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
5 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
6
> x <- read.table("D:/RWork/RData/input.csv",sep = ",",header = T) #use header = T when the first line holds the variable names; compare the column names with the output above
> x
X mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
> x <- read.table("D:/RWork/RData/input 1.txt",sep = ",",header = T,skip = 5) #skip the first 5 lines (comments) and start reading at line 6
> x
Ozone.Solar.R.Wind.Temp.Month.Day
1 1 41 190 7.4 67 5 1
2 2 36 118 8 72 5 2
3 3 12 149 12.6 74 5 3
4 4 18 313 11.5 62 5 4
5 5 NA NA 14.3 56 5 5
6 6 28 NA 14.9 66 5 6
> x <- read.table("D:/RWork/RData/input 1.txt",sep = ",",header = T,skip = 50,nrows = 200) # skip the first 50 lines, then read at most 200 rows; nrows is the number of rows to read
> x
X45.NA.332.13.8.80.6.14
1 46 NA 322 11.5 79 6 15
2 47 21 191 14.9 77 6 16
3 48 37 284 20.7 72 6 17
4 49 20 37 9.2 65 6 18
> read.fwf("D:/RWork/RData/fwf.txt",widths = c(3,3)) # fixed-width file; widths sets the width of each field
V1 V2
1 "st ate
2 "1" "A
3 "2" "A
> x <- read.table("D:/RWork/RData/input.csv",sep = ",",header = T,skip = 50,nrows = 100,stringsAsFactors = F) # keep strings as character vectors; blank fields in numeric columns become NA
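A minimal sketch of how header, skip, nrows and stringsAsFactors interact. The file contents and column names here are made up for illustration and written to a temporary path, so nothing on disk is assumed.

```r
# Hypothetical sample file: two free-text lines, a header line, four data rows.
path <- tempfile(fileext = ".csv")
writeLines(c("exported for a demo",
             "units: arbitrary",
             "name,score,grade",
             "Ann,90,A",
             "Bob,,B",
             "Cid,75,C",
             "Dee,60,D"), path)

# Skip the 2 leading lines, treat the next line as the header,
# and read at most 3 data rows.
demo <- read.table(path, sep = ",", header = TRUE, skip = 2,
                   nrows = 3, stringsAsFactors = FALSE)
print(demo)
str(demo)  # the blank score for Bob is read as NA
```

The header line is not counted in nrows, so the fourth data row (Dee) is the one left out.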
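(2) Files not on the local machine: read them over the network via a URL
Ⅰ. Reading the clipboard
> x <- read.table("clipboard",header = T, sep = ",") # reads whatever has been copied to the clipboard but not yet pasted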
> x
> readClipboard() # returns the clipboard contents as character strings
Ⅱ. Opening compressed files
> RSiteSearch("Matlab") # searches the R site in a web browser
> x <- read.table(gzfile("D:/RWork/RData/input.txt.gz"))
> x
X mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Ⅲ. Reading non-standard file formats
Using the scan function
> readLines("D:/RWork/RData/input.csv",n = 15) # reads lines and returns them as character strings, at most 15 lines here
[1] "\"\",\"mpg\",\"cyl\",\"disp\",\"hp\",\"drat\",\"wt\",\"qsec\",\"vs\",\"am\",\"gear\",\"carb\""
[2] "\"Mazda RX4\",21,6,160,110,3.9,2.62,16.46,0,1,4,4"
[3] "\"Mazda RX4 Wag\",21,6,160,110,3.9,2.875,17.02,0,1,4,4"
[4] "\"Datsun 710\",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1"
> world.series <- scan ("http://lib.stat.cmu.edu/datasets/wseries",skip=35,nlines = 23,
what = list(year=integer(0),pattern=character(0))) # scan can read character strings as well as numbers
> x <- scan("scan.txt",what=list (character(3),numeric(0),numeric(0)))
> x <- scan("scan.txt",what=list (X1=character(3),X2=numeric(0),X3=numeric(0)))
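A small sketch of how the what argument shapes the result of scan. The three input lines are invented for illustration and passed through the text argument instead of a file or URL.

```r
# Hypothetical input: a year and a win/loss pattern on each line.
txt <- c("1903 LWLlwwwW", "1905 wLwWW", "1906 wLwLwW")

# what is a list template: one component per field, with the type to read.
ws <- scan(text = txt,
           what = list(year = integer(0), pattern = character(0)),
           quiet = TRUE)
str(ws)  # a list with an integer vector and a character vector
```

Naming the components of what, as in the second scan.txt call above, gives the resulting list readable names.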
42. Writing files
> write.table(x,file = "C:/Users/Desktop/newfile.csv",sep = ",")
# avoid writing a column of row numbers on every write
> write.table(x,file = "C:/Users/Desktop/newfile.csv",sep = ",",row.names = FALSE)
# append: TRUE appends to the end of an existing file; strings are quoted by default, set quote = FALSE to drop the double quotes
> write.table(x,file="newfile.csv",sep="\t",quote=FALSE,append=FALSE,na="NA")
# write a compressed file
> write.table(mtcars,gzfile("newfile.txt.gz"))
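A round-trip sketch for the options above: write, read back, append, and write through gzfile. Temporary files are used so that no path from these notes is assumed to exist.

```r
small <- head(mtcars, 3)

# Plain CSV without row names or quotes, then read it back.
out <- tempfile(fileext = ".csv")
write.table(small, file = out, sep = ",", row.names = FALSE, quote = FALSE)
back <- read.table(out, sep = ",", header = TRUE)

# Append two more rows; col.names = FALSE avoids a second header line.
write.table(tail(mtcars, 2), file = out, sep = ",", row.names = FALSE,
            col.names = FALSE, quote = FALSE, append = TRUE)
both <- read.table(out, sep = ",", header = TRUE)

# Compressed output and input go through a gzfile connection.
gz <- tempfile(fileext = ".txt.gz")
write.table(small, gzfile(gz))
from_gz <- read.table(gzfile(gz))
print(from_gz)
```

When row names are written (the default), the header line has one field fewer than the data lines, and read.table uses the first column as row names.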
43. Reading and writing Excel files
(1) Save the Excel file in csv format
(2) Read it in R
Ⅰ. Copy the cells in the Excel sheet
Ⅱ. readClipboard() returns the clipboard contents
> readClipboard()
[1] "\tmpg\tcyl\tdisp\thp\tdrat\twt\tqsec\tvs\tam\tgear\tcarb"
[2] "Mazda RX4\t21\t6\t160\t110\t3.9\t2.62\t16.46\t0\t1\t4\t4"
[3] "Mazda RX4 Wag\t21\t6\t160\t110\t3.9\t2.875\t17.02\t0\t1\t4\t4"
[4] "Datsun 710\t22.8\t4\t108\t93\t3.85\t2.32\t18.61\t1\t1\t4\t1"
[5] "Hornet 4 Drive\t21.4\t6\t258\t110\t3.08\t3.215\t19.44\t1\t0\t3\t1"
Ⅲ. Parse the copied text into a table
> read.table("clipboard",sep = "\t",header = T )
X mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
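(3) Reading in two steps (functions from the XLConnect package)
> library(XLConnect)
> ex <- loadWorkbook ("data.xlsx")
> edata <- readWorksheet(ex,1) # ex: the workbook object; 1: the first worksheet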
> head(edata)
> edata <- readWorksheet(ex,1,startRow=0,startCol=0,endRow=50,endCol=3) # restrict the cell range
> readWorksheetFromFile ("data.xlsx",1,startRow=0,startCol=0,
endRow=50,endCol=3,header=TRUE) # read in one step
(4) Writing Excel in four steps
> wb <- loadWorkbook("file.xlsx",create=TRUE)
> createSheet(wb,"Sheet 1")
> writeWorksheet(wb,data=mtcars,sheet = "Sheet 1")
> saveWorkbook(wb)
(5) Writing in one step
> writeWorksheetToFile("file.xlsx",data = mtcars,sheet = "Sheet 1")
> vignette("XLConnect")
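44. Reading and saving R data files
> load(file = "C:/Users/wangtong/Desktop/RData/Ch02.R") # load objects from a file written by save()
> save(iris,iris3,file = "iris.Rdata") # save selected objects to a file
> save.image() # save the current workspace
45. Data conversion
(1) Converting a matrix to a data frame
> is.data.frame(cars32) # test whether the object is a data frame
[1] TRUE # it is a data frame
> is.data.frame(state.x77)
[1] FALSE # not a data frame; it is a matrix
> dstate.x77 <- as.data.frame(state.x77) # coerce to a data frame
> is.data.frame(dstate.x77)
[1] TRUE # conversion succeeded
(2) Converting a data frame to a matrix
> as.matrix(data.frame(state.region,dstate.x77)) # yields a character matrix
> methods(is) # list the available is.* functions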
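A short sketch of the two conversions on a small slice of state.x77, showing why the second as.matrix call above gives a character matrix: a matrix holds a single type, so one factor column turns every cell into a string.

```r
m  <- state.x77[1:3, 1:2]        # a numeric matrix
df <- as.data.frame(m)           # matrix -> data frame
back <- as.matrix(df)            # all-numeric data frame -> numeric matrix

# Adding a factor column forces every cell to character.
mixed <- as.matrix(data.frame(region = state.region[1:3], df))

print(class(m)); print(class(df))
print(typeof(back)); print(typeof(mixed))
```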
(3) Converting vectors
Ⅰ. Adding dimensions to a vector
> x <- state.abb
> x
[1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA" "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT"
[27] "NE" "NV" "NH" "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT" "VA" "WA" "WV" "WI" "WY"
> dim(x) <- c(5,10)
> x
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "AL" "CO" "HI" "KS" "MA" "MT" "NM" "OK" "SD" "VA"
[2,] "AK" "CT" "ID" "KY" "MI" "NE" "NY" "OR" "TN" "WA"
[3,] "AZ" "DE" "IL" "LA" "MN" "NV" "NC" "PA" "TX" "WV"
[4,] "AR" "FL" "IN" "ME" "MS" "NH" "ND" "RI" "UT" "WI"
[5,] "CA" "GA" "IA" "MD" "MO" "NJ" "OH" "SC" "VT" "WY"
Ⅱ. Converting a vector to a factor
> x <- state.abb
> as.factor(x)
[1] AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT
[45] VT VA WA WV WI WY
50 Levels: AK AL AR AZ CA CO CT DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA RI ... WY
Ⅲ. Converting a vector to a list
> as.list(x)
[[1]]
[1] "AL"
[[2]]
[1] "AK"
[[3]]
[1] "AZ"
Ⅳ. Building a data frame
> state <- data.frame(x,state.region,state.x77)
> state$Income # access a column
> state["Nevada",] # access a row; note the trailing comma
x state.region Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
Nevada NV West 590 5149 0.5 69.03 11.5 65.2 188 109889
> is.data.frame(state["Nevada",]) # check whether a single row is still a data frame
[1] TRUE
> y <- is.data.frame(state["Nevada",])
> y
[1] TRUE
> y <- state['Nevada',]
> y
x state.region Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
Nevada NV West 590 5149 0.5 69.03 11.5 65.2 188 109889
> unname(y) # drop the column names
Nevada NV West 590 5149 0.5 69.03 11.5 65.2 188 109889
> unlist(y) # convert to a vector
x state.region Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
"NV" "4" "590" "5149" "0.5" "69.03" "11.5" "65.2" "188" "109889"
46. Selecting specific rows and columns of a data frame
> who2 <- who[c(1,3,5,8),c(2,14,16,18)] # rows, columns
> View(who2) # display as a table
# take a subset
> who3 <- who[which(who$Continent == 7),]
> View(who3)
> who4 <- who[which(who$CountryID > 50 & who$CountryID <= 100),]
> View(who4)
# same result as above
> who4 <- subset(who,who$CountryID > 50 & who$CountryID <= 100)
> View(who4)
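The who data set is not bundled with R, so this sketch repeats the same two subsetting styles on the built-in mtcars. The filter condition and the chosen columns are arbitrary picks for illustration.

```r
# Style 1: logical test wrapped in which(), then [rows, columns].
# The comma matters: without it the index selects columns, not rows.
by_index <- mtcars[which(mtcars$mpg > 25 & mtcars$cyl == 4),
                   c("mpg", "cyl", "wt")]

# Style 2: subset() evaluates the condition inside the data frame,
# so column names can be written without the mtcars$ prefix.
by_subset <- subset(mtcars, mpg > 25 & cyl == 4, select = c(mpg, cyl, wt))

print(by_subset)
print(isTRUE(all.equal(by_index, by_subset)))
```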
47. Random sampling with sample
(1) Sampling without replacement
> x <- 1:100
> sample(x,30) # draw 30 values from x without replacement, so each value appears at most once
[1] 40 88 8 80 26 21 55 72 95 77 20 19 11 73 25 82 84 59 71 10 69 74 16 1 38 5 48 52 12 27
(2) Sampling with replacement: values may repeat
> sample(x,60,replace = T)
[1] 37 49 51 56 29 48 48 65 1 31 88 61 100 59 93 30 57 98 34 38 50 83 80 99 29 42 70 100 38 89 7 97 86
[34] 22 9 92 9 87 97 23 79 39 64 5 57 75 98 11 4 31 8 84 31 38 14 73 52 69 3 27
> sort (sample(x,60,replace = T))
[1] 5 8 9 10 13 14 15 15 15 17 18 23 25 26 29 29 30 32 39 41 41 42 44 45 45 48 49 49 54 56 60 60 61 62 63 65 66 67 67 68 68 69 70 73
[45] 73 77 77 81 82 82 82 83 85 90 91 91 94 96 99 99
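Because sample draws at random, the output above differs on every run. A sketch of making a draw repeatable with set.seed follows; the seed value 42 is an arbitrary choice.

```r
x <- 1:100

set.seed(42)                      # fix the random number generator state
no_rep <- sample(x, 30)           # without replacement: no repeated values
with_rep <- sample(x, 60, replace = TRUE)

set.seed(42)                      # same seed, same draw
again <- sample(x, 30)

print(anyDuplicated(no_rep))      # 0 means no duplicates
print(identical(no_rep, again))
```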
48. Deleting rows or columns
(1) Deleting columns by negative index
> mtcars[,-1:-5]
wt qsec vs am gear carb
Mazda RX4 2.620 16.46 0 1 4 4
Mazda RX4 Wag 2.875 17.02 0 1 4 4
Datsun 710 2.320 18.61 1 1 4 1
Hornet 4 Drive 3.215 19.44 1 0 3 1
Hornet Sportabout 3.440 17.02 0 0 3 2
(2) Deleting a column by assigning NULL
> mtcars$mpg <- NULL
> head(mtcars)
cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 6 160 110 3.90 2.875 17.02 0 1 4 4
49. Adding rows or columns to a data frame
(1) Adding columns
> state.division # US census divisions, i.e. regional groupings of states
[1] East South Central Pacific Mountain West South Central Pacific Mountain
[7] New England South Atlantic South Atlantic South Atlantic Pacific Mountain
> USArrests
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
> data.frame(state.division,USArrests) # build a new data frame that includes the extra column
state.division Murder Assault UrbanPop Rape
Alabama East South Central 13.2 236 58 21.2
Alaska Pacific 10.0 263 48 44.5
Arizona Mountain 8.1 294 80 31.0
> cbind(USArrests,state.division) # cbind also combines columns
Murder Assault UrbanPop Rape state.division
Alabama 13.2 236 58 21.2 East South Central
Alaska 10.0 263 48 44.5 Pacific
Arizona 8.1 294 80 31.0 Mountain
(2) Adding rows (the data frames must have the same columns)
> data1 <- head(USArrests,20) # first 20 rows
> data2 <- tail(USArrests,20) # last 20 rows
> rbind(data1,data2) # combine rows; both have the same columns
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
50. Handling duplicated rows
> data1 <- head(USArrests,30)
> data2 <- tail(USArrests,30)
> data4 <- rbind(data1,data2)
> data4
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
> rownames(data4) # inspect the row names
[1] "Alabama" "Alaska" "Arizona" "Arkansas" "California" "Colorado" "Connecticut"
[8] "Delaware" "Florida" "Georgia" "Hawaii" "Idaho" "Illinois" "Indiana"
[15] "Iowa" "Kansas" "Kentucky" "Louisiana" "Maine" "Maryland" "Massachusetts"
[22] "Michigan" "Minnesota" "Mississippi" "Missouri" "Montana" "Nebraska" "Nevada"
[29] "New Hampshire" "New Jersey" "Massachusetts1" "Michigan1" "Minnesota1" "Mississippi1" "Missouri1"
[36] "Montana1" "Nebraska1" "Nevada1" "New Hampshire1" "New Jersey1" "New Mexico" "New York"
[43] "North Carolina" "North Dakota" "Ohio" "Oklahoma" "Oregon" "Pennsylvania" "Rhode Island"
[50] "South Carolina" "South Dakota" "Tennessee" "Texas" "Utah" "Vermont" "Virginia"
[57] "Washington" "West Virginia" "Wisconsin" "Wyoming"
> length(rownames(data4)) # 60 row names: everything was combined and the duplicates were kept
[1] 60
> duplicated(data4) # TRUE marks a row that repeats an earlier row
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> data4[duplicated(data4),] # extract the duplicated rows
Murder Assault UrbanPop Rape
Massachusetts1 4.4 149 85 16.3
Michigan1 12.1 255 74 35.1
Minnesota1 2.7 72 66 14.9
Mississippi1 16.1 259 44 17.1
Missouri1 9.0 178 70 28.2
Montana1 6.0 109 53 16.4
Nebraska1 4.3 102 62 16.5
Nevada1 12.2 252 81 46.0
New Hampshire1 2.1 57 56 9.5
New Jersey1 7.4 159 89 18.8
> data4[!duplicated(data4),] # extract the non-duplicated rows
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
> length(rownames(data4[!duplicated(data4),])) # row count after dropping the duplicates
[1] 50
> unique(data4) # does all of the above in a single call
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
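duplicated and unique compare whole rows. A common variation is removing duplicates on a key column only; a sketch with a made-up orders table (the table and its column names are hypothetical):

```r
# Hypothetical data: id 2 appears twice with identical rows,
# id 3 appears twice with different items.
orders <- data.frame(id = c(1, 2, 2, 3, 3),
                     item = c("a", "b", "b", "c", "d"))

whole_rows <- unique(orders)                      # drops only the exact repeat
first_per_id <- orders[!duplicated(orders$id), ]  # keeps the first row per id

print(whole_rows)
print(first_per_id)
```

Passing fromLast = TRUE to duplicated would keep the last row per id instead of the first.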
51. Flipping a data frame
(1) Whole table
> sractm <- t(mtcars) # transpose: rows and columns are swapped
(2) Reversing the row order
> letters # letters is a vector
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
> rev(letters)
[1] "z" "y" "x" "w" "v" "u" "t" "s" "r" "q" "p" "o" "n" "m" "l" "k" "j" "i" "h" "g" "f" "e" "d" "c" "b" "a"
> women
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
7 64 132
8 65 135
9 66 139
10 67 142
11 68 146
12 69 150
13 70 154
14 71 159
15 72 164
> rownames(women)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
> rev(rownames(women))
[1] "15" "14" "13" "12" "11" "10" "9" "8" "7" "6" "5" "4" "3" "2" "1"
> women[rev(rownames(women)),]
height weight
15 72 164
14 71 159
13 70 154
12 69 150
11 68 146
10 67 142
9 66 139
8 65 135
7 64 132
6 63 129
5 62 126
4 61 123
3 60 120
2 59 117
1 58 115
52. Modifying data frame values
> women$height
[1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
> women$height*2.54
[1] 147.32 149.86 152.40 154.94 157.48 160.02 162.56 165.10 167.64 170.18 172.72 175.26 177.80 180.34 182.88
> data.frame(women$height^2.54,women$height)
women.height.2.54 women.height
1 30137.49 58
2 31474.88 59
3 32847.64 60
4 34256.09 61
> transform(women,height = height*2.54) # change the height column in one step
height weight
1 147.32 115
2 149.86 117
3 152.40 120
4 154.94 123
5 157.48 126
> transform(women,cm = height*2.54) # or define a new column
height weight cm
1 58 115 147.32
2 59 117 149.86
3 60 120 152.40
4 61 123 154.94
5 62 126 157.48
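transform returns a modified copy and leaves women unchanged, so the result has to be assigned to keep it. A sketch that adds two converted columns at once; the kg column and its conversion factor (1 lb = 0.45359237 kg) are additions for illustration.

```r
# women: height in inches, weight in pounds (built-in data set).
women_metric <- transform(women,
                          cm = height * 2.54,
                          kg = weight * 0.45359237)

print(head(women_metric, 3))
print(names(women))          # the original still has only height and weight
```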
53. Sorting data frames (sort works only on vectors)
> sort(rivers) # returns the sorted values
[1] 135 202 210 210 215 217 230 230 233 237 246 250 250 250 255 259 260 260 265 268 270 276 280 280 280 281
[27] 286 290 291 300 300 300 301 306 310 310 314 315 320 325 327 329 330 332 336 338 340 350 350 350 350 352
[53] 360 360 360 360 375 377 380 380 383 390 390 392 407 410 411 420 420 424 425 430 431 435 444 445 450 460
[79] 460 465 470 490 500 500 505 524 525 525 529 538 540 545 560 570 600 600 600 605 610 618 620 625 630 652
[105] 671 680 696 710 720 720 730 735 735 760 780 800 840 850 870 890 900 900 906 981 1000 1038 1054 1100 1171 1205
[131] 1243 1270 1306 1450 1459 1770 1885 2315 2348 2533 3710
> order(rivers) # returns the positions (indices) of the sorted values, which can be used to index a data frame
[1] 8 17 39 108 129 52 36 42 91 117 133 34 56 87 76 55 41 75 37 127 138 107 13 30 72 53 29 19 49 61 103 124
[33] 126 46 94 123 116 14 2 3 35 18 11 65 12 81 51 27 60 78 111 54 43 112 119 134 97 105 102 104 96 33 47 4
[65] 28 73 88 48 110 122 106 139 77 92 125 100 6 74 95 9 57 93 84 136 22 5 31 132 135 113 120 99 62 59 10 21
[97] 45 86 118 80 128 64 40 130 140 58 85 50 32 137 44 1 90 79 71 109 24 38 15 26 63 131 16 82 20 121 89 114
[129] 67 115 25 98 83 23 7 141 101 69 66 70 68
> order(mtcars$drat)
[1] 6 22 15 16 12 13 14 4 25 5 23 7 17 31 30 8 21 24 28 3 1 2 9 10 11 18 26 32 20 29 27 19
> mtcars[order(mtcars$drat),] # returns the sorted rows, which sort cannot do
cyl disp hp drat wt qsec vs am gear carb
Valiant 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Dodge Challenger 8 318.0 150 2.76 3.520 16.87 0 0 3 2
> order(mtcars$drat,mtcars$disp) # ties in drat are ordered by disp, smaller displacement first
[1] 6 22 15 16 12 13 14 4 25 23 5 7 17 31 30 8 21 24 28 3 1 2 9 10 11 18 26 32 20 29 27 19
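A sketch of two more order idioms: negating a numeric key for a descending sort, and the relation between sort and order. The choice of cyl and mpg as keys is arbitrary.

```r
# Ascending by cyl; within each cyl group, descending by mpg.
by_cyl_mpg <- mtcars[order(mtcars$cyl, -mtcars$mpg), c("cyl", "mpg")]
print(head(by_cyl_mpg))

# sort(x) gives the same values as indexing x by order(x).
print(identical(sort(rivers), rivers[order(rivers)]))
```

Negation only works for numeric keys; order(..., decreasing = TRUE) reverses all keys at once.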
54. Calculations on data frames
(1) Step by step
> WorldPhones
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
1951 45939 21574 2876 1815 1646 89 555
1956 60423 29990 4708 2568 2366 1411 733
1957 64721 32510 5230 2695 2526 1546 773
1958 68484 35218 6662 2845 2691 1663 836
1959 71799 37598 6856 3000 2868 1769 911
1960 76036 40341 8220 3145 3054 1905 1008
1961 79831 43173 9053 3338 3224 2005 1076
> worldphones <- as.data.frame(WorldPhones) # convert the matrix to a data frame
> rs <- rowSums(worldphones)
> rs
1951 1956 1957 1958 1959 1960 1961
74494 102199 110001 118399 124801 133709 141700
> cm <- colMeans(worldphones)
> cm
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
66747.5714 34343.4286 6229.2857 2772.2857 2625.0000 1484.0000 841.7143
> total <- cbind(worldphones,Total = rs)
> rbind (total,cm) # cm has 7 values for 8 columns, so its first value is recycled into Total
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer Total
1951 45939.00 21574.00 2876.000 1815.000 1646 89 555.0000 74494.00
1956 60423.00 29990.00 4708.000 2568.000 2366 1411 733.0000 102199.00
1957 64721.00 32510.00 5230.000 2695.000 2526 1546 773.0000 110001.00
8 66747.57 34343.43 6229.286 2772.286 2625 1484 841.7143 66747.57
(2) With apply
> apply(WorldPhones,MARGIN = 1,FUN = sum) # sum over rows; 1 means rows
1951 1956 1957 1958 1959 1960 1961
74494 102199 110001 118399 124801 133709 141700
> apply(WorldPhones,MARGIN = 2,FUN = mean) # mean over columns; 2 means columns
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
66747.5714 34343.4286 6229.2857 2772.2857 2625.0000 1484.0000 841.7143
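In the step-by-step version the means were computed before the Total column existed, so the appended row got a recycled value in Total. One possible way around that, sketched here, is to add Total first and compute the column means afterwards. The row label "Mean" is an arbitrary name.

```r
wp <- as.data.frame(WorldPhones)
wp$Total <- rowSums(wp)               # add the row totals first

# colMeans now covers all 8 columns, including Total.
with_mean <- rbind(wp, colMeans(wp))
rownames(with_mean)[nrow(with_mean)] <- "Mean"
print(with_mean)

# apply with MARGIN = 1 gives the same row sums as rowSums.
print(all.equal(apply(WorldPhones, 1, sum), rowSums(WorldPhones)))
```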
55. lapply and sapply
apply: works on matrices and data frames
lapply: returns a list
sapply: returns a vector or a matrix
tapply: works on groups defined by a factor
> state.center # a list
$x
[1] -86.7509 -127.2500 -111.6250 -92.2992 -119.7730 -105.5130 -72.3573 -74.9841 -81.6850 -83.3736 -126.2500 -113.9300 -89.3776
[14] -86.0808 -93.3714 -98.1156 -84.7674 -92.2724 -68.9801 -76.6459 -71.5800 -84.6870 -94.6043 -89.8065 -92.5137 -109.3200
[27] -99.5898 -116.8510 -71.3924 -74.2336 -105.9420 -75.1449 -78.4686 -100.0990 -82.5963 -97.1239 -120.0680 -77.4500 -71.1244
[40] -80.5056 -99.7238 -86.4560 -98.7857 -111.3300 -72.5450 -78.2005 -119.7460 -80.6665 -89.9941 -107.2560
$y
[1] 32.5901 49.2500 34.2192 34.7336 36.5341 38.6777 41.5928 38.6777 27.8744 32.3329 31.7500 43.5648 40.0495 40.0495 41.9358 38.4204
[17] 37.3915 30.6181 45.6226 39.2778 42.3645 43.1361 46.3943 32.6758 38.3347 46.8230 41.3356 39.1063 43.3934 39.9637 34.4764 43.1361
[33] 35.4195 47.2517 40.2210 35.5053 43.9078 40.9069 41.5928 33.6190 44.3365 35.6767 31.3897 39.1063 44.2508 37.5630 47.4231 38.4204
[49] 44.5937 43.0504
> lapply(state.center,FUN = length) # length of each element, returned as a list
$x
[1] 50
$y
[1] 50
> sapply(state.center,FUN = length) # same result, returned as a vector
x y
50 50
> class(sapply(state.center,FUN = length))
[1] "integer"
> tapply(state.name,state.division,FUN = length) # number of states in each US division
New England Middle Atlantic South Atlantic East South Central West South Central East North Central West North Central
6 3 8 4 4 5 7
Mountain Pacific
8 5
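A compact sketch of the return types listed above, using a small made-up list (scores is hypothetical) and mtcars for the grouped case.

```r
scores <- list(a = c(1, 2, 3), b = c(10, 20))   # hypothetical input

as_list   <- lapply(scores, mean)    # always a list
as_vector <- sapply(scores, mean)    # simplified to a named vector
as_matrix <- sapply(scores, range)   # two values per element -> matrix

# tapply: apply a function to values grouped by a factor-like key.
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)

print(as_list); print(as_vector); print(as_matrix); print(mpg_by_cyl)
```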
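56. Centering and standardizing data
Centering: subtract the mean of the data set from every value.
Standardizing: after centering, divide by the standard deviation of the data set, i.e. (value - mean) / standard deviation.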
> x <- c(1,2,3,6,3)
> mean(x)
[1] 3
> x - mean(x) # centered values; the spread is still fairly large
[1] -2 -1 0 3 0
> sd(x) # standard deviation
[1] 1.870829
> (x - mean(x))/sd(x) # standardized values
[1] -1.0690450 -0.5345225 0.0000000 1.6035675 0.0000000
> x <- scale(state.x77,scale = T,center = T) # scale centers and standardizes each column
> head(x)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Alabama -0.1414316 -1.3211387 1.525758 -1.3621937 2.0918101 -1.4619293 -1.6248292 -0.2347183
Alaska -0.8693980 3.0582456 0.541398 -1.1685098 1.0624293 1.6828035 0.9145676 5.8093497
Arizona -0.4556891 0.1533029 1.033578 -0.2447866 0.1143154 0.6180514 -1.7210185 0.5002047
Arkansas -0.4785360 -1.7214837 1.197638 -0.1628435 0.7373617 -1.6352611 -0.7591257 -0.2202212
California 3.7969790 1.1037155 -0.114842 0.6193415 0.7915396 1.1751891 -1.6248292 1.0034903
Colorado -0.3819965 0.7294092 -0.771082 0.8800698 -0.1565742 1.3361400 1.1838976 0.3870991
> head(state.x77)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
California 21198 5114 1.1 71.71 10.3 62.6 20 156361
Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
> heatmap(x)
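A quick check, as a sketch, that scale does the same arithmetic as the manual formula, and that every standardized column of state.x77 ends up with mean 0 and standard deviation 1.

```r
x <- c(1, 2, 3, 6, 3)
manual <- (x - mean(x)) / sd(x)
auto <- as.vector(scale(x))          # scale returns a one-column matrix
print(all.equal(manual, auto))

z <- scale(state.x77)                # center = TRUE, scale = TRUE by default
print(round(colMeans(z), 10))
print(apply(z, 2, sd))
```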
57. Using the merge function (and reshape2)
> x <- data.frame(k1 = c(NA,NA,3,4,5), k2 = c(1,NA,NA,4,5),data = 1:5)
> y <- data.frame(k1 = c(NA,2,NA,4,5), k2 = c(NA,NA,3,4,5),data = 1:5)
> x
k1 k2 data
1 NA 1 1
2 NA NA 2
3 3 NA 3
4 4 4 4
5 5 5 5
> y
k1 k2 data
1 NA NA 1
2 2 NA 2
3 NA 3 3
4 4 4 4
5 5 5 5
> merge(x,y,by = "k1") # join on k1 (intersection); NA keys match each other
k1 k2.x data.x k2.y data.y
1 4 4 4 4 4
2 5 5 5 5 5
3 NA 1 1 NA 1
4 NA 1 1 3 3
5 NA NA 2 NA 1
6 NA NA 2 3 3
> merge(x,y,by = "k2",incomparables = T) # join on k2; NA keys still match here, pass incomparables = NA to exclude them
k2 k1.x data.x k1.y data.y
1 4 4 4 4 4
2 5 5 5 5 5
3 NA NA 2 NA 1
4 NA NA 2 2 2
5 NA 3 3 NA 1
6 NA 3 3 2 2
> merge(x,y,by = c("k1","k2")) # join on both k1 and k2
k1 k2 data.x data.y
1 4 4 4 4
2 5 5 5 5
3 NA NA 2 1
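The incomparables argument takes the key values that must never match, so NA is excluded by passing NA itself. A sketch on the same x and y:

```r
x <- data.frame(k1 = c(NA, NA, 3, 4, 5), k2 = c(1, NA, NA, 4, 5), data = 1:5)
y <- data.frame(k1 = c(NA, 2, NA, 4, 5), k2 = c(NA, NA, 3, 4, 5), data = 1:5)

m_default <- merge(x, y, by = "k1")                       # NA keys match each other
m_no_na   <- merge(x, y, by = "k1", incomparables = NA)   # NA keys never match

print(nrow(m_default))
print(m_no_na)
```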
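58. gather function (tidyr package): reshapes columns, keeping chosen columns fixed and converting the others
(turns a table whose column names hold variable values into a tidy key-value table)
> tdata <- mtcars[1:10,1:3] # take part of the data set
> tdata <- data.frame(name = rownames(tdata),tdata)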
> tdata
name mpg cyl disp
Mazda RX4 Mazda RX4 21.0 6 160.0
Mazda RX4 Wag Mazda RX4 Wag 21.0 6 160.0
Datsun 710 Datsun 710 22.8 4 108.0
Hornet 4 Drive Hornet 4 Drive 21.4 6 258.0
Hornet Sportabout Hornet Sportabout 18.7 8 360.0
Valiant Valiant 18.1 6 225.0
Duster 360 Duster 360 14.3 8 360.0
Merc 240D Merc 240D 24.4 4 146.7
Merc 230 Merc 230 22.8 4 140.8
Merc 280 Merc 280 19.2 6 167.6
> gather(tdata,key = "Key",value = "Value",cyl,disp,mpg)
name Key Value
1 Mazda RX4 cyl 6.0
2 Mazda RX4 Wag cyl 6.0
3 Datsun 710 cyl 4.0
4 Hornet 4 Drive cyl 6.0
5 Hornet Sportabout cyl 8.0
6 Valiant cyl 6.0
7 Duster 360 cyl 8.0
8 Merc 240D cyl 4.0
9 Merc 230 cyl 4.0
10 Merc 280 cyl 6.0
11 Mazda RX4 disp 160.0
12 Mazda RX4 Wag disp 160.0
13 Datsun 710 disp 108.0
14 Hornet 4 Drive disp 258.0
15 Hornet Sportabout disp 360.0
16 Valiant disp 225.0
17 Duster 360 disp 360.0
18 Merc 240D disp 146.7
19 Merc 230 disp 140.8
20 Merc 280 disp 167.6
21 Mazda RX4 mpg 21.0
22 Mazda RX4 Wag mpg 21.0
23 Datsun 710 mpg 22.8
24 Hornet 4 Drive mpg 21.4
25 Hornet Sportabout mpg 18.7
26 Valiant mpg 18.1
27 Duster 360 mpg 14.3
28 Merc 240D mpg 24.4
29 Merc 230 mpg 22.8
30 Merc 280 mpg 19.2
> gather(tdata,key = "Key",value = "Value",cyl,-disp) # gather only cyl; disp is excluded
name mpg disp Key Value
1 Mazda RX4 21.0 160.0 cyl 6
2 Mazda RX4 Wag 21.0 160.0 cyl 6
3 Datsun 710 22.8 108.0 cyl 4
4 Hornet 4 Drive 21.4 258.0 cyl 6
5 Hornet Sportabout 18.7 360.0 cyl 8
6 Valiant 18.1 225.0 cyl 6
7 Duster 360 14.3 360.0 cyl 8
8 Merc 240D 24.4 146.7 cyl 4
9 Merc 230 22.8 140.8 cyl 4
10 Merc 280 19.2 167.6 cyl 6
> gather(tdata,key = "Key",value = "Value",2:4) # column positions can replace names to avoid typing mistakes
name Key Value
1 Mazda RX4 mpg 21.0
2 Mazda RX4 Wag mpg 21.0
3 Datsun 710 mpg 22.8
4 Hornet 4 Drive mpg 21.4
5 Hornet Sportabout mpg 18.7
6 Valiant mpg 18.1
7 Duster 360 mpg 14.3
8 Merc 240D mpg 24.4
9 Merc 230 mpg 22.8
10 Merc 280 mpg 19.2
11 Mazda RX4 cyl 6.0
12 Mazda RX4 Wag cyl 6.0
13 Datsun 710 cyl 4.0
14 Hornet 4 Drive cyl 6.0
15 Hornet Sportabout cyl 8.0
16 Valiant cyl 6.0
17 Duster 360 cyl 8.0
18 Merc 240D cyl 4.0
19 Merc 230 cyl 4.0
20 Merc 280 cyl 6.0
21 Mazda RX4 disp 160.0
22 Mazda RX4 Wag disp 160.0
23 Datsun 710 disp 108.0
24 Hornet 4 Drive disp 258.0
25 Hornet Sportabout disp 360.0
26 Valiant disp 225.0
27 Duster 360 disp 360.0
28 Merc 240D disp 146.7
29 Merc 230 disp 140.8
30 Merc 280 disp 167.6
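59. spread function: widens a table by splitting one key-value pair of columns into several columns.
key is the column whose values become the new column names; value is the column that supplies the cell values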
> tdata <- gather(tdata,key = "Key",value = "Value",2:4)
> tdata
name Key Value
1 Mazda RX4 mpg 21.0
2 Mazda RX4 Wag mpg 21.0
3 Datsun 710 mpg 22.8
4 Hornet 4 Drive mpg 21.4
5 Hornet Sportabout mpg 18.7
> spread(tdata,key = "Key",value = "Value")
name cyl disp mpg
1 Datsun 710 4 108.0 22.8
2 Duster 360 8 360.0 14.3
3 Hornet 4 Drive 6 258.0 21.4
4 Hornet Sportabout 8 360.0 18.7
5 Mazda RX4 6 160.0 21.0
6 Mazda RX4 Wag 6 160.0 21.0
7 Merc 230 4 140.8 22.8
8 Merc 240D 4 146.7 24.4
9 Merc 280 6 167.6 19.2
10 Valiant 6 225.0 18.1
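For comparison, the same long and wide round trip can be sketched in base R with stack and unstack. This is a swapped-in technique, not what gather and spread do internally, and it only covers this simple case. The column names Key and Value are copied from the example above.

```r
wide <- data.frame(name = rownames(mtcars)[1:3], mtcars[1:3, 1:3])
rownames(wide) <- NULL

# Wide -> long: stack() puts all values in one column and the
# originating column name in another.
long <- data.frame(name = wide$name, stack(wide[, c("mpg", "cyl", "disp")]))
names(long) <- c("name", "Value", "Key")
print(long)

# Long -> wide again: one column per level of Key.
back <- data.frame(name = wide$name, unstack(long, Value ~ Key))
print(back)
```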
60. separate function: splits a column that holds two variables into separate columns
> df <- data.frame(x = c(NA,"a.b","a.d","b.c"))
> df
x
1 <NA>
2 a.b
3 a.d
4 b.c
> separate(df,col = x,into = c("A","B")) # creates the new columns; one column becomes several
A B
1 <NA> <NA>
2 a b
3 a d
4 b c
> df <- data.frame(x = c(NA,"a.b-c","a-d","b-c"))
> separate(df,x,into = c("A","B"),sep = "-")
A B
1 <NA> <NA>
2 a.b c
3 a d
4 b c
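61. unite function: combines columns
> x <- separate(df,x,into = c("A","B"),sep = "-") # keep the split result from section 60
> unite(x,col = "AB",A,B,sep = "-")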
AB
1 NA-NA
2 a.b-c
3 a-d
4 b-c
62. dplyr package: transforming data
(1) filter() selects a subset of observations based on their values
> dplyr::filter(iris,Sepal.Length > 7) # keep rows with sepal length greater than 7
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 7.1 3.0 5.9 2.1 virginica
2 7.6 3.0 6.6 2.1 virginica
3 7.3 2.9 6.3 1.8 virginica
4 7.2 3.6 6.1 2.5 virginica
5 7.7 3.8 6.7 2.2 virginica
6 7.7 2.6 6.9 2.3 virginica
(2) distinct function: removes duplicates
> dplyr::distinct(rbind(iris[1:10,],iris[1:15,])) # removes the duplicated rows
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
14 4.3 3.0 1.1 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
(3) slice function: selects rows by position
> dplyr::slice(iris,10:15)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.9 3.1 1.5 0.1 setosa
2 5.4 3.7 1.5 0.2 setosa
3 4.8 3.4 1.6 0.2 setosa
4 4.8 3.0 1.4 0.1 setosa
5 4.3 3.0 1.1 0.1 setosa
6 5.8 4.0 1.2 0.2 setosa
(4) sample_n function: random sampling
> dplyr::sample_n(iris,10) # draw 10 random rows
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.0 2.0 3.5 1.0 versicolor
2 6.7 3.1 4.7 1.5 versicolor
3 5.0 3.4 1.5 0.2 setosa
4 7.1 3.0 5.9 2.1 virginica
5 6.8 3.2 5.9 2.3 virginica
6 5.1 3.8 1.5 0.3 setosa
7 4.9 3.6 1.4 0.1 setosa
8 6.5 3.0 5.2 2.0 virginica
9 5.0 3.3 1.4 0.2 setosa
10 5.0 2.3 3.3 1.0 versicolor
(5) sample_frac function: random sampling by proportion
> dplyr::sample_frac(iris,0.1)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.8 2.6 4.0 1.2 versicolor
2 5.1 3.5 1.4 0.3 setosa
3 5.1 3.8 1.9 0.4 setosa
4 6.7 3.3 5.7 2.5 virginica
5 6.8 2.8 4.8 1.4 versicolor
6 6.0 2.2 5.0 1.5 virginica
7 6.9 3.2 5.7 2.3 virginica
8 5.3 3.7 1.5 0.2 setosa
9 4.9 3.6 1.4 0.1 setosa
10 5.5 2.6 4.4 1.2 versicolor
11 6.9 3.1 5.1 2.3 virginica
12 5.7 3.0 4.2 1.2 versicolor
13 5.7 2.9 4.2 1.3 versicolor
14 6.7 3.1 4.4 1.4 versicolor
15 6.0 2.7 5.1 1.6 versicolor
(6) arrange function: sorting
> dplyr::arrange(iris,Sepal.Length) # sort by sepal length, ascending
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.3 3.0 1.1 0.1 setosa
2 4.4 2.9 1.4 0.2 setosa
3 4.4 3.0 1.3 0.2 setosa
> dplyr::arrange(iris,desc(Sepal.Length)) # descending order
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 7.9 3.8 6.4 2.0 virginica
2 7.7 3.8 6.7 2.2 virginica
3 7.7 2.6 6.9 2.3 virginica
(7) summarise: summary statistics
> summarise(iris,avg = mean(Sepal.Length)) # mean sepal length
avg
1 5.843333
> summarise(iris,sum = sum(Sepal.Length))
sum
1 876.5
(8) %>%: the pipe operator passes the output of one function to the next function as its input. Shortcut: Ctrl+Shift+M
> head(mtcars,20) %>% tail(10) # rows 11 to 20
mpg cyl disp hp drat wt qsec vs am gear carb
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
(9) group_by: grouping
> dplyr::group_by(iris,Species) # split into 3 groups
# A tibble: 150 x 5
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ... with 140 more rows
> iris %>% group_by(Species) # same result as the previous command
> iris %>% group_by(Species) %>% summarise(avg = mean(Sepal.Width)) # mean sepal width per species
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 2
Species avg
<fct> <dbl>
1 setosa 3.43
2 versicolor 2.77
3 virginica 2.97
> iris %>% group_by(Species) %>% summarise(avg = mean(Sepal.Width)) %>% arrange(avg)
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 2
Species avg
<fct> <dbl>
1 versicolor 2.77
2 virginica 2.97
3 setosa 3.43
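For comparison, the grouped mean can also be sketched in base R with aggregate. This is an alternative technique rather than what dplyr runs internally; it returns a plain data frame instead of a tibble.

```r
# Mean Sepal.Width for each Species, using the formula interface.
avg <- aggregate(Sepal.Width ~ Species, data = iris, FUN = mean)
print(avg)

# Sort the summary by the mean, like arrange(avg) in the pipeline above.
print(avg[order(avg$Sepal.Width), ])
```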
(10) mutate function: adds new columns
> dplyr::mutate(iris,new = Sepal.Length+Petal.Length) # sum of sepal length and petal length
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new
1 5.1 3.5 1.4 0.2 setosa 6.5
2 4.9 3.0 1.4 0.2 setosa 6.3
3 4.7 3.2 1.3 0.2 setosa 6.0
4 4.6 3.1 1.5 0.2 setosa 6.1
63. Two-table verbs in dplyr
> a=data.frame(x1=c("A","B","C"),x2=c(1,2,3))
> b=data.frame(x1=c("A","B","D"),x3=c(T,F,T))
> a
x1 x2
1 A 1
2 B 2
3 C 3
> b
x1 x3
1 A TRUE
2 B FALSE
3 D TRUE
> dplyr::left_join(a,b,by="x1") # left join: keeps every x1 in a; b has no row for C, so x3 is NA there
x1 x2 x3
1 A 1 TRUE
2 B 2 FALSE
3 C 3 NA
> dplyr::right_join(a,b,by="x1") # right join: keeps every x1 in b
x1 x2 x3
1 A 1 TRUE
2 B 2 FALSE
3 D NA TRUE
> dplyr::full_join(a,b,by="x1") # full join: the union of the x1 values
x1 x2 x3
1 A 1 TRUE
2 B 2 FALSE
3 C 3 NA
4 D NA TRUE
> dplyr::semi_join(a,b,by="x1") # semi join: rows of a whose x1 also appears in b
x1 x2
1 A 1
2 B 2
> dplyr::anti_join(a,b,by="x1") # anti join: rows of a whose x1 does not appear in b
x1 x2
1 C 3
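The same five joins can be sketched in base R, which may help connect them to merge from section 57. These are base equivalents for this small example, not dplyr's implementation. Note that merge sorts the result by the key.

```r
a <- data.frame(x1 = c("A", "B", "C"), x2 = c(1, 2, 3))
b <- data.frame(x1 = c("A", "B", "D"), x3 = c(TRUE, FALSE, TRUE))

left  <- merge(a, b, by = "x1", all.x = TRUE)   # like left_join
right <- merge(a, b, by = "x1", all.y = TRUE)   # like right_join
full  <- merge(a, b, by = "x1", all = TRUE)     # like full_join
semi  <- a[a$x1 %in% b$x1, ]                    # like semi_join
anti  <- a[!(a$x1 %in% b$x1), ]                 # like anti_join

print(left); print(right); print(full); print(semi); print(anti)
```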
64. Combining data sets
> mtcars <- mutate(mtcars,Model = rownames(mtcars)) # add a Model column
> mtcars
mpg cyl disp hp drat wt qsec vs am gear carb Model
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Datsun 710
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive
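> first <- slice(mtcars,1:20) # rows 1 to 20
> second <- slice(mtcars,10:30) # rows 10 to 30
> intersect(first,second) # rows present in both
> union_all(first,second) # all rows of both, duplicates kept (union() drops the duplicates)
> setdiff(first,second) # rows in first that are not in second
> setdiff(second,first) # rows in second that are not in first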