用R语言遇到的一些问题。
经常看到rcs()函数,比如拟合回归时:f <- cph(S ~ rcs(age,4) + sex, x=T, y=T)。
关于RCS的理解,可以参考:Restricted cubic splines
另外,丁香园里有人给出这样的看法:
rcs全称是restricted cubic spline 即限制立方样条函数。为什么用它呢?我们在做回归拟合数据时,经常对因变量和自变量的假定是:自变量和因变量呈线性关系,logistic回归是自变量和logitP,cox是指自变量和一个h(t)变换的一个东东(记不清了,抱歉),这些假定就叫线性假定。我们做回归的时候自变量包括连续变量和ordinal variable(等级有序变量)和因变量不成线性的时候,我可以转化为分类,也可以变量log变换,还有就是spline技术,也就是引入多项式,但rcs又不是简单的引入多项式,简单的说就是引入3次方的项,这和你选择的节点有关,这样处理的结果就是可以和好的拟合数据有限制了引入项的自由度,所以Harrell (rms包的作者)推崇这种建模方法。说的很概括,具体的内容搜索rcs就可以了解。Harrell的论文也可以参考,他认为节点的位置不那么重要,而节点数目很重要。
rms中的应用:
## Not run:
options(knots=4, poly.degree=2)
# To get the old behavior of rcspline.eval knot placement (which didnt' handle
# clumping at the lowest or highest value of the predictor very well):
# options(fractied = 1.0) # see rcspline.eval for details
country <- factor(country.codes)
blood.pressure <- cbind(sbp=systolic.bp, dbp=diastolic.bp)
fit <- lrm(Y ~ sqrt(x1)*rcs(x2) + rcs(x3,c(5,10,15)) +
lsp(x4,c(10,20)) + country + blood.pressure + poly(age,2))
# sqrt(x1) is an implicit asis variable, but limits of x1, not sqrt(x1)
# are used for later plotting and effect estimation
# x2 fitted with restricted cubic spline with 4 default knots
# x3 fitted with r.c.s. with 3 specified knots
# x4 fitted with linear spline with 2 specified knots
# country is an implied catg variable
# blood.pressure is an implied matrx variable
# since poly is not an rms function (pol is), it creates a
# matrx type variable with no automatic linearity testing
# or plotting
f1 <- lrm(y ~ rcs(x1) + rcs(x2) + rcs(x1) %ia% rcs(x2))
# %ia% restricts interactions. Here it removes terms nonlinear in
# both x1 and x2
f2 <- lrm(y ~ rcs(x1) + rcs(x2) + x1 %ia% rcs(x2))
# interaction linear in x1
f3 <- lrm(y ~ rcs(x1) + rcs(x2) + x1 %ia% x2)
# simple product interaction (doubly linear)
# Use x1 %ia% x2 instead of x1:x2 because x1 %ia% x2 triggers
# anova to pool x1*x2 term into x1 terms to test total effect
# of x1
## End(Not run)
SVM的调参,关于e1071包,好像如果把数据“尺度化”(scaling)后,使用默认的参数就能训练出较好的模型。
R语言里,回归的参数,如果传formula,比如Y~X,那么这里的X不应该是dataframe或matrix,而应该用向量比如Y~x1+x2。如果向量太多,那么可以这样传两个参数:formula和data,比如glm(Y~., data=X)。这是在Logistics回归中被坑过,在predict时解决了将newdata用dataframe形式传入的问题后,总是报错说变量数不对(一方面可能fit 时有问“glm传参数data可能没传好”,另一方面newdata的dataframe可能需要转置一下)。
R语言的向量、矩阵、数据框、数组、列表等等,有点烦人,和我之前的C语言系列思维差别太大。
R的e1701中都在svm中,仅当y变量是factor类型时构建SVC,否则构建SVR
ggplot做出来的图,如何去除默认的背景和网格?(摘自:移除ggplot2的网格和背景色)
# 方法1
myplot<-myplot+ theme(panel.grid.major =element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
# 方法2
myplot<-myplot+theme_bw()+
theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black"))
生存分析
cph与coxph是点不同的,虽然得到的系数一样,p值一样,但是wald统计量有些区别(进一步进行anova分析时可以看出区别)。cph来自rms包,coxph来自survival包。
cph的一些进一步的说明:
“Modification of Therneau’s coxph function to fit the Cox model and its extension, the AndersenGill model. The latter allows for interval time-dependent covariables, time-dependent strata, and repeated events. The Survival method for an object created by cph returns an S function for computing estimates of the survival function. The Quantile method for cph returns an S function. for computing quantiles of survival time (median, by default). The Mean method returns a function for computing the mean survival time. This function issues a warning if the last follow-up time is uncensored, unless a restricted mean is explicitly requested.”
绘制Nomogram举例1:
require(rms) ##调用rms包
# 建立数据集
y = c(0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1,
1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1,
1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0,
0, 0, 1, 0, 1, 0, 1, 0, 1)
age = c(28, 42, 46, 45, 34, 44, 48, 45, 38, 45, 49, 45, 41, 46, 49, 46, 44, 48,
52, 48, 45, 50, 53, 57, 46, 52, 54, 57, 47, 52, 55, 59, 50, 54, 57, 60,
51, 55, 46, 63, 51, 59, 48, 35, 53, 59, 57, 37, 55, 32, 60, 43, 59, 37,
30, 47, 60, 38, 34, 48, 32, 38, 36, 49, 33, 42, 38, 58, 35, 43, 39, 59,
39, 43, 42, 60, 40, 44)
sex = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1,
0, 1, 1, 1, 0, 1)
ECG = c(0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1,
0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 2, 1, 0, 0, 2, 2, 0, 0, 2, 2,
0, 1, 2, 2, 0, 1, 0, 2, 0, 1, 0, 2, 1, 1, 0, 2, 1, 1, 0, 2, 1, 1, 0, 2,
1, 1, 0, 2, 1, 1)
ECG<-as.factor(ECG)
# 设定nomogram的参数
ddist <- datadist(age, sex, ECG)
options(datadist='ddist')
# logistic回归
f <- lrm(y ~ age + sex + ECG)
# nomogram
nom <- nomogram(f, fun=plogis,
fun.at=c(.001, .01, .05, seq(.1,.9, by=.1), .95, .99, .999),
lp=F, funlabel="Risk")
plot(nom)
绘制nomagram举例2
library(foreign) # 读取spss格式数据
library(survival)
library(rms)
library(VIM) # 包中aggr()函数,判断数据缺失情况
par(family = "STXihei") # 指定绘制图片中的字体
# 数据来自张文彤《SPSS统计分析高级教程》,侵删!
url <- 'http://online.hyconresearch.com:4096/spss20/spss20advdata.zip'
# 书中数据下载地址
file <- basename(url)
# 获取文件名 spss20advdata.zip
if (!file.exists(file)) download.file(url, file)
# 判断工作目录下是否有spss20advdata.zip文件,如果不存在则执行下载数据命令
unzip(file) # 解压缩zip文件
pancer <- read.spss('part4/pancer.sav') # 读取文件
pancer <- as.data.frame(pancer)
# Cox回归所需数据类型为数据框格式,将其转换为数据框格式
aggr(pancer,prop=T,numbers=T)
# 判断pancer各个变量的数据缺失情况,出现红色即有缺失,绘制列线图不允许缺失值存在
pancer$censor <- ifelse(pancer$censor=='死亡',1,0)
# Cox回归结局变量需为数值变量
pancer$Gender <- as.factor(ifelse(pancer$sex=='男',"Male","Female"))
# 更改变量名称以及变量取值
pancer$Gendr <- relevel(pancer$Gender,ref='Female') # 设置参考组
dd<-datadist(pancer)
options(datadist='dd')
coxm <- cph(Surv(time,censor==1)~age+Gender+trt+bui+ch+p+stage,x=T,y=T,data=pancer,surv=T)
# 建立Cox回归方程
surv <- Survival(coxm) # 建立生存函数
surv1 <- function(x)surv(1*3,lp=x) # lp: linear predictor, 用这个函数去
surv2 <- function(x)surv(1*6,lp=x)
surv3 <- function(x)surv(1*12,lp=x)
# maxscale 参数指定最高分数,一般设置为100或者10分
# fun.at 设置生存率的刻度
# xfrac 设置数值轴与最左边标签的距离,可以调节下数值观察下图片变化情况
plot(nomogram(coxm,fun=list(surv1,surv2,surv3),lp= F,
funlabel=c('3-Month Survival','6-Month survival','12-Month survival'),maxscale=100,
fun.at=c('0.9','0.85','0.80','0.70','0.6','0.5','0.4','0.3','0.2','0.1')),xfrac=.45)
med <- Quantile(coxm)
surv4 <- function(x) med(x)
plot(nomogram(f, fun=surv4,
fun.at=c(12,6,3,1,0),lp=F, funlabel="Median Survival Time"))
绘制Nomogram举例3:
require(rms)
require(Hmisc) ##需要下载安装
require(survival) ##R默认自带的用于做生存分析的包
# 建立数据集(使用rms包example的代码,未改动)
n <- 1000
set.seed(731)
age <- 50 + 12*rnorm(n)
# label(age) <- "Age"
sex <- factor(sample(c('Male','Female'), n,
rep=TRUE, prob=c(.6, .4)))
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
dt <- -log(runif(n))/h
# label(dt) <- 'Follow-up Time'
e <- ifelse(dt <= cens,1,0)
dt <- pmin(dt, cens)
# units(dt) <- "Year"
# 设定nomogram的参数
ddist <- datadist(age, sex)
options(datadist='ddist')
# Cox回归
S <- Surv(dt,e)
f <- cph(S ~ rcs(age,4) + sex, x=T, y=T, surv=T)
med <- Quantile(f)
# nomogram
nom <- nomogram(f, fun=function(x) med(x),
fun.at=c(15,12,11,9,6,5),lp=F, funlabel="Median Survival Time")
plot(nom) ##绘制Nomgram图
先贴代码,以后整理下。再就是,可以研究下rms包文档里面的nomogram例子。
-----------------------------------Fisher exact test------------------------------------
Fisher exact test 不仅适用于2*2表,在R*C表中也能使用Fisher精确检验。
有文献对此进行过说明(Statistical notes for clinical researchers: Chi-squared test and Fisher's exact test)。
-----------------------------------非正态数据的正态化------------------------------------
http://blog.sina.com.cn/s/blog_13ec735f50102x59u.html
https://blog.csdn.net/qq_40587575/article/details/84349900
偏度和峰度的计算:
http://blog.sina.com.cn/s/blog_4d2fda500102wk0r.html