Table 2-1 Patient case data
Patient ID | Admission date (AdmDate) | Age | Diabetes type (Diabetes) | Status |
---|---|---|---|---|
1 | 10/15/2009 | 25 | Type1 | Poor |
2 | 11/01/2009 | 34 | Type2 | Improved |
3 | 10/21/2009 | 28 | Type1 | Excellent |
4 | 10/28/2009 | 52 | Type1 | Poor |

Dimensions | Homogeneous | Heterogeneous |
---|---|---|
One | vector | list |
Two | matrix | data frame |
N | array | |
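A vector is a one-dimensional array that stores numeric, character, or logical data; all elements of a vector must share the same mode.
# Create a vector: c()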
a <- c(1, 2, 5, 3, 6, -2, 4)
b <- c("one", "two", "three")
c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
# Access specific elements of a vector: [position(s)]
> a <- c(1, 2, 5, 3, 6, -2, 4)
> a[3]
[1] 5
> a[c(1, 3, 5)]
[1] 1 5 6
> a[2:6]
[1] 2 5 3 6 -2
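Because a vector is homogeneous, mixing modes inside c() silently coerces every element to the most general mode. The short sketch below illustrates this, along with two other subscripting styles (negative and logical positions); the variable names `mixed` and `nums` are made up for this example.

```r
# Mixing modes in c() coerces all elements to a single mode
mixed <- c(1, "two", TRUE)
class(mixed)        # "character"

nums <- c(1, 2, 5, 3, 6, -2, 4)
nums[-1]            # negative positions drop elements: 2 5 3 6 -2 4
nums[nums > 3]      # logical subscripts keep matching elements: 5 6 4
```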
A matrix is a two-dimensional array in which every element has the same mode.
# Create a matrix: matrix()
# Syntax: mymatrix <- matrix(vector, nrow=number_of_rows, ncol=number_of_columns,
#                            byrow=logical_value, dimnames=list(
#                            char_vector_rownames, char_vector_colnames))
Listing 2-1 Creating matrices
y <- matrix(1:20, nrow = 5, ncol = 4)
y
cells <- c(1, 26, 24, 68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE,
dimnames = list(rnames, cnames))
mymatrix
mymatrix <- matrix(cells, nrow = 2, ncol = 2, byrow = FALSE,
dimnames = list(rnames, cnames))
mymatrix
> y <- matrix(1:20, nrow = 5, ncol = 4) # create a 5×4 matrix
> y
[,1] [,2] [,3] [,4]
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20
> cells <- c(1, 26, 24, 68)
> rnames <- c("R1", "R2")
> cnames <- c("C1", "C2")
> mymatrix <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE,
+ dimnames = list(rnames, cnames))
# Create a 2×2 matrix filled by row, with row and column labels
> mymatrix
C1 C2
R1 1 26
R2 24 68
> mymatrix <- matrix(cells, nrow = 2, ncol = 2, byrow = FALSE,
+ dimnames = list(rnames, cnames))
# Create a 2×2 matrix filled by column, with row and column labels
> mymatrix
C1 C2
R1 1 24
R2 26 68
Listing 2-2 Using matrix subscripts
x <- matrix(1:10, nrow = 2)
x
x[2, ]
x[, 2]
x[1, 4]
x[1, c(4, 5)]
> x <- matrix(1:10, nrow = 2)
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
> x[2, ]
[1] 2 4 6 8 10
> x[, 2]
[1] 3 4
> x[1, 4]
[1] 7
> x[1, c(4, 5)]
[1] 7 9
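Two details of matrix subscripting are easy to miss: a matrix with dimnames can be indexed by label, and selecting a single row or column drops the result to a plain vector unless drop = FALSE is given. A minimal sketch, with hypothetical labels R1..R2 and C1..C5 added for illustration:

```r
# Same values as the listing above, plus illustrative row/column labels
m <- matrix(1:10, nrow = 2,
            dimnames = list(c("R1", "R2"), paste0("C", 1:5)))

m["R2", "C4"]                     # 8, selected by label
row_vec <- m[1, ]                 # a named vector; the dim attribute is dropped
row_mat <- m[1, , drop = FALSE]   # still a 1 x 5 matrix
dim(row_mat)                      # 1 5
```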
An array is similar to a matrix but can have more than two dimensions.
# Create an array: array()
# Syntax: myarray <- array(vector, dimensions, dimnames)
Listing 2-3 Creating an array
dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2", "C3", "C4")
z <- array(1:24, c(2, 3, 4), dimnames = list(dim1, dim2, dim3))
z
> z
, , C1
B1 B2 B3
A1 1 3 5
A2 2 4 6
, , C2
B1 B2 B3
A1 7 9 11
A2 8 10 12
, , C3
B1 B2 B3
A1 13 15 17
A2 14 16 18
, , C4
B1 B2 B3
A1 19 21 23
A2 20 22 24
> z[1,2,3]
[1] 15
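Array subscripts extend the matrix rules by one index per dimension, and leaving an index blank selects everything along that dimension. The sketch below rebuilds the same 2×3×4 array without dimnames (the name `slice` is only for illustration) and pulls out the third layer as an ordinary matrix.

```r
z <- array(1:24, c(2, 3, 4))

dim(z)              # 2 3 4
slice <- z[, , 3]   # the third layer: a 2 x 3 matrix holding 13..18
slice[1, 2]         # 15, the same element as z[1, 2, 3]
```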
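==Most commonly used!== A data frame is similar to a matrix, but its columns can hold different modes of data.
# Create a data frame: data.frame()
# Syntax: mydata <- data.frame(col1, col2, col3)
Listing 2-4 Creating a data frame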
patientID <- c(1, 2, 3, 4)
age <- c(25, 34, 28, 52)
diabetes <- c("Type1", "Type2", "Type1", "Type1")
status <- c("Poor", "Improved", "Excellent", "Poor")
patientdata <- data.frame(patientID, age, diabetes,
status)
patientdata
> patientdata
patientID age diabetes status
1 1 25 Type1 Poor
2 2 34 Type2 Improved
3 3 28 Type1 Excellent
4 4 52 Type1 Poor
Listing 2-5 Specifying elements of a data frame
patientdata[1:2]
patientdata[c("diabetes", "status")]
patientdata$age
> patientdata
patientID age diabetes status
1 1 25 Type1 Poor
2 2 34 Type2 Improved
3 3 28 Type1 Excellent
4 4 52 Type1 Poor
> patientdata[1:2]
patientID age
1 1 25
2 2 34
3 3 28
4 4 52
> patientdata[c("diabetes", "status")]
diabetes status
1 Type1 Poor
2 Type2 Improved
3 Type1 Excellent
4 Type1 Poor
> patientdata$age
[1] 25 34 28 52
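Rows can be selected alongside columns with the two-index form [rows, columns], and the $ notation combines naturally with table() for cross-tabulation. A self-contained sketch that rebuilds the same patient data; `type1` and `counts` are illustrative names:

```r
patientdata <- data.frame(
  patientID = c(1, 2, 3, 4),
  age       = c(25, 34, 28, 52),
  diabetes  = c("Type1", "Type2", "Type1", "Type1"),
  status    = c("Poor", "Improved", "Excellent", "Poor")
)

# Rows where diabetes is Type1, keeping only two columns
type1 <- patientdata[patientdata$diabetes == "Type1", c("patientID", "age")]
type1

# Cross-tabulate diabetes type by status
counts <- table(patientdata$diabetes, patientdata$status)
counts
```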
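$: selects a specific variable from a given data frame. The following functions shorten code that would otherwise repeat the data frame name:
# Add a data frame to R's search path: attach()
# Remove a data frame from the search path: detach()
If an object with the same name as one of the data frame's variables already exists when attach() is called, R prints a message (not an error) that the variable is masked, and the pre-existing object takes precedence:
# an object _is masked_
The following object is masked _by_ .GlobalEnv: mpg
# A more broadly applicable alternative: with()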
> with(mtcars,{
+ nokeepstats<-summary(mpg)
+ keepstats<<-summary(mpg)
+ })
> nokeepstats
Error: object 'nokeepstats' not found
> keepstats
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.4 15.4 19.2 20.1 22.8 33.9
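In the session above, nokeepstats disappears because ordinary assignments inside with() are local to the braces, while <<- writes to the enclosing (here global) environment. An alternative that avoids <<-, sketched here with the illustrative names `stats` and `mean_by_hand`, is to assign the value that with() returns:

```r
# with() returns the value of its expression, so assign that instead
stats <- with(mtcars, summary(mpg))
stats

# Any expression over the data frame's variables works the same way
mean_by_hand <- with(mtcars, sum(mpg) / length(mpg))
mean_by_hand
```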
Case identifier: distinguishes the individual observations in a dataset, for example the Patient ID in Table 2-1. It can be specified with the row.names option of data.frame():
patientdata <- data.frame(patientID, age, diabetes, status, row.names=patientID)
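Once row names are set, a row can be looked up by its label (a character string) as well as by its position. The sketch below uses hypothetical IDs 101 to 104, chosen only so that labels and positions differ visibly; `pd` is an illustrative name.

```r
patientID <- c(101, 102, 103, 104)   # hypothetical IDs, not from Table 2-1
age <- c(25, 34, 28, 52)
pd <- data.frame(patientID, age, row.names = patientID)

rownames(pd)       # "101" "102" "103" "104"
pd["103", "age"]   # 28, looked up by case identifier
pd[3, "age"]       # 28, looked up by position
```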
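Factors are R's collective term for categorical (nominal) and ordered categorical (ordinal) variables.
# Store the categories as an integer vector: factor()
# For ordinal variables, pass ordered=TRUE to factor() (order=TRUE, as used in the listing, works through partial argument matching)
Listing 2-6 Using factors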
patientID <- c(1, 2, 3, 4)
age <- c(25, 34, 28, 52)
diabetes <- c("Type1", "Type2", "Type1", "Type1")
status <- c("Poor", "Improved", "Excellent", "Poor")
diabetes <- factor(diabetes)
status <- factor(status, order = TRUE)
patientdata <- data.frame(patientID, age, diabetes, status)
str(patientdata) # display the object's structure
summary(patientdata) # display a statistical summary of the object
> str(patientdata)
'data.frame': 4 obs. of 4 variables:
$ patientID: num 1 2 3 4
$ age : num 25 34 28 52
$ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1
$ status : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3
> summary(patientdata)
patientID age diabetes status
Min. :1.00 Min. :25.0 Type1:3 Excellent:1
1st Qu.:1.75 1st Qu.:27.2 Type2:1 Improved :1
Median :2.50 Median :31.0 Poor :2
Mean :2.50 Mean :34.8
3rd Qu.:3.25 3rd Qu.:38.5
Max. :4.00 Max. :52.0
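Note that the str() output orders the status levels alphabetically ("Excellent" < "Improved" < "Poor"), which is probably not the intended clinical ordering. A minimal sketch of overriding the default with the levels argument, assuming the intended order is Poor < Improved < Excellent:

```r
status <- c("Poor", "Improved", "Excellent", "Poor")

# Default: levels are sorted alphabetically
status_default <- factor(status, ordered = TRUE)
levels(status_default)     # "Excellent" "Improved" "Poor"

# Explicit levels give the assumed ordering Poor < Improved < Excellent
status_fixed <- factor(status, ordered = TRUE,
                       levels = c("Poor", "Improved", "Excellent"))
as.integer(status_fixed)   # 1 2 3 1
```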
Listing 2-7 Creating a list
g <- "My First List"
h <- c(25, 26, 18, 39)
j <- matrix(1:10, nrow = 5)
k <- c("one", "two", "three")
mylist <- list(title = g, ages = h, j, k)
mylist
mylist[[2]]
mylist[["ages"]]
> mylist
$title
[1] "My First List"
$ages
[1] 25 26 18 39
[[3]]
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
[[4]]
[1] "one" "two" "three"
> mylist[[2]]
[1] 25 26 18 39
> mylist[["ages"]]
[1] 25 26 18 39
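Double brackets (or $ for named components) extract the component itself, whereas single brackets return a sub-list that still wraps it. A small sketch with a trimmed-down version of the list above:

```r
mylist <- list(title = "My First List",
               ages  = c(25, 26, 18, 39),
               c("one", "two", "three"))

mylist$ages               # same component as mylist[["ages"]]
class(mylist[["ages"]])   # "numeric": the component itself
class(mylist["ages"])     # "list": a sub-list of length 1
mylist[[3]][2]            # "two": subscript the extracted component
```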
The definitive guide to importing data into R is the R Data Import/Export manual, available at http://cran.r-project.org/doc/manuals/R-data.pdf
mydata <- data.frame(age=numeric(0), gender=character(0), weight=numeric(0))
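# age=numeric(0): creates a variable of the given mode that contains no data
mydata <- edit(mydata)
# The result of edit() must be assigned back to an object, otherwise all changes are lost
fix(mydata) # equivalent shortcut for mydata <- edit(mydata)
read.table()
# Syntax: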
mydataframe <- read.table(file, header=logical_value,
sep="delimiter", row.names="names")
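To try the read.table() options without creating a file, the text argument accepts the data as a string. The sketch below uses made-up data and the illustrative names `csv_text` and `patients`; row.names = "id" tells read.table() to use that column as the case identifier.

```r
# Made-up delimited data, supplied inline instead of from a file
csv_text <- "id,age,type
p1,25,Type1
p2,34,Type2"

patients <- read.table(text = csv_text, header = TRUE,
                       sep = ",", row.names = "id")
str(patients)
patients["p2", "age"]   # 34
```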
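A connection is a mechanism for accessing data; a connection created by a function such as file(), gzfile(), or url() can be supplied in place of the file name argument.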
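As a rough illustration of passing a connection where a file name is expected, the sketch below writes a small gzip-compressed file to a temporary path and reads it back through gzfile(). The data and the names `path`, `con`, and `scores` are invented for the example.

```r
# Write a tiny compressed CSV to a temporary location
path <- tempfile(fileext = ".csv.gz")
con <- gzfile(path, "w")
writeLines(c("id,score", "a,1.5", "b,2.5"), con)
close(con)

# Supply a connection in place of the file name
scores <- read.table(gzfile(path), header = TRUE, sep = ",")
scores$score   # 1.5 2.5

unlink(path)   # remove the temporary file
```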
library(xlsx)
workbook <- "/Users/Documents/myworkbook.xlsx" # file path
mydataframe <- read.xlsx(workbook, 1) # 1 is the index of the worksheet to import
# Install the RODBC package
install.packages("RODBC")
# Import the data
library(RODBC)
channel <- odbcConnectExcel("myfile.xls")
mydataframe <- sqlFetch(channel, "mysheet")
odbcClose(channel)
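XML package
install.packages("Hmisc")
library(Hmisc) # spss.get() is provided by Hmisc
mydataframe <- spss.get("mydata.sav", use.value.labels=TRUE)
use.value.labels=TRUE: tells the function to convert variables that carry value labels into R factors with those same levels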
# SAS program:
proc export data = mydata
outfile = "mydata.csv"
dbms = csv;
run;
# R:
mydata <- read.table("mydata.csv", header=TRUE, sep=",")
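For comma-separated files such as the one exported by SAS here, read.csv() is a convenience wrapper around read.table() with header = TRUE and sep = "," as defaults. A quick sketch with a made-up temporary file (standing in for mydata.csv) that reads it both ways and compares the results:

```r
# A made-up CSV written to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,weight", "1,60.5", "2,72.0"), csv_path)

via_table <- read.table(csv_path, header = TRUE, sep = ",")
via_csv   <- read.csv(csv_path)

all.equal(via_table, via_csv)   # compare the two data frames
unlink(csv_path)
```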
library(foreign)
mydataframe <- read.dta("mydata.dta")
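library(ncdf4) # nc_open() comes from ncdf4, the successor to the ncdf package
nc <- nc_open("mynetCDFfile")
myarray <- ncvar_get(nc, myvar) # myvar: name of the variable to read (placeholder)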
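library(RODBC) # load the RODBC package
myconn <- odbcConnect("mydsn", uid="Rob", pwd="aardvark") # open an ODBC connection using a registered data source name (mydsn), user name (Rob), and password (aardvark)
crimedat <- sqlFetch(myconn, "Crime") # pass the connection to sqlFetch(), which copies the Crime table into the R data frame crimedat
pundat <- sqlQuery(myconn, "select * from Punishment") # run the SQL select statement against the Punishment table and store the result in the data frame pundat
close(myconn) # close the connection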
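A variable label works like a note attached to a variable.
names()
names(patientdata)[2] <- "Age at hospitalization (in years)"
# rename age to "Age at hospitalization (in years)"
names(patientdata)[2] <- "admissionAge"
# or, preferably, a shorter and syntactically valid name such as "admissionAge"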
factor()
patientdata$gender <- factor(patientdata$gender,
levels = c(1,2),
labels = c("male", "female"))
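The patient data built earlier has no gender column, so the call above cannot be run as is. A self-contained sketch follows, assuming a hypothetical coding of 1 = male and 2 = female; the data frame `pd` and its values are invented for illustration.

```r
# Hypothetical data: gender coded as 1 (male) or 2 (female)
pd <- data.frame(patientID = 1:4, gender = c(1, 2, 2, 1))

# levels lists the stored codes; labels gives the text shown for each code
pd$gender <- factor(pd$gender,
                    levels = c(1, 2),
                    labels = c("male", "female"))

pd$gender          # male female female male
table(pd$gender)   # counts per label
```

Any code not listed in levels would become NA, which makes this a useful check for unexpected values.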