R语言--20/21/22

cor相关性：

cor():
cor(x, y , method = c("pearson", "kendall", "spearman"))
计算x,y的相关性，x,y需要为向量，若为矩阵则计算x列，y列的相关性。method指定相关性计算的方法,默认情况下是Pearson相关性系数。
Pearson，spearman计算方法见笔记本
cor.test():

cor.test()使用方法.png

cor()只给出相关性值，cor.test()同时给出检验的p值。

lm(）线性拟合

lm()函数返回拟合结果的对象，可以用summary()函数查看其内容

> test
      height weight gender      BMI
tom      180     75   male 23.14815
cindy    165     58 female 21.30395
jimmy    175     72   male 23.51020
sam      173     68   male 22.72044
lucy     160     60 female 23.43750
lily     165     55 female 20.20202> result=lm(height~weight,data=test)
> result

Call:
lm(formula = height ~ weight, data = test)

Coefficients:
(Intercept)       weight  
     115.34         0.84  
#拟合方程为height=115.34+0.84weight
> summary(result)

Call:
lm(formula = height ~ weight, data = test)

Residuals:
    tom   cindy   jimmy     sam    Lucy    lily 
 1.6529  0.9336 -0.8270  0.5332 -5.7465  3.4537 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)#Pr(>|t|)为显著性
(Intercept) 115.3441    12.5825   9.167 0.000786 ***
weight        0.8400     0.1933   4.346 0.012198 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.519 on 4 degrees of freedom
Multiple R-squared:  0.8252,    Adjusted R-squared:  0.7815 
#拟合优度和矫正后的拟合优度，此项数值越大越好
F-statistic: 18.89 on 1 and 4 DF,  p-value: 0.0122
#检验整个方程的p值，<0.05为宜

检验一套数据是否符合正态分布

符合正态分布的数据情况

法一

>data=rnorm(1000)
> hist(data,prob=T)
> lines(density(data))

输出结果.png

绘制数据的统计图，大致符合钟形分布即可大致判断符合正态分布
法二

qqnorm(data)
qqline(data)

截屏2021-06-28 下午4.21.52.png

直线大致满足y=x即可判断为符合正态分布
法三

> shapiro.test(data)

    Shapiro-Wilk normality test

data:  data
W = 0.99786, p-value = 0.229
#>0.05符合正态分布，<0，05不符合正态分布

不符合正态分布的数据情况

法一

> a=c(rep(1,10),rep(2,5),rep(3,4),6,8,10,12,20)
> hist(a,breaks=seq(0.5,21,by=1),prob=TRUE)
> lines(density(a),col="blue")#明显看出是偏态分布
> abline(v=median(a),col="red")#添加中值，红色线
> abline(v=mean(a),col="green")#添加平均数，绿色线
> #对于偏态分布，中值比平均数更有意义

法一输出结果

法二

> qqnorm(a)
> qqline(a)

法二输出结果

明显为偏态分布
法三

> shapiro.test(a)

    Shapiro-Wilk normality test

data:  a
W = 0.63575, p-value = 1.589e-06
#pvalue<0.05,偏态分布

不符合正态分布的数据，不能使用t检验，应该使用秩和检验。

> a=c(rep(1,10),rep(2,5),rep(3,4),6,8,10,12,20)
> b=c(rep(2,7),rep(3,5),rep(5,8),8,10,18,25)
#a,b不符合正态分布
> t.test(a,b)

    Welch Two Sample t-test

data:  a and b
t = -1.2025, df = 44.761, p-value = 0.2355
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.681665  1.181665
sample estimates:
mean of x mean of y 
 3.666667  5.416667 
#不符合正态分布的数据，不能用T检验，此时P value>0.05，无显著差异
> wilcox.test(a,b,exact = FALSE)#秩和检验，非参数检验

    Wilcoxon rank sum test with continuity correction

data:  a and b
W = 162.5, p-value = 0.008673
alternative hypothesis: true location shift is not equal to 0
#不符合正态分布的数据，用秩和检验，此时P value<0.05，有显著差异

百分比检验

假设我们知道全球流感对死亡率是10%，在美国对调查发现400名流感病人中有51人死亡了，请问美国流感对死亡率是否显著高于全球流感对死亡率。

> prop.test(51,400,p=0.1,alternative = "greater")

    1-sample proportions test with continuity correction

data:  51 out of 400, null probability 0.1
X-squared = 3.0625, df = 1, p-value = 0.04006
alternative hypothesis: true p is greater than 0.1
95 percent confidence interval:
 0.101422 1.000000
sample estimates:
     p 
0.1275

卡方检验和fisher精确检验

卡方检验例子——判断吸烟和不吸烟患气管炎是否有显著差异PNG

> data=rbind(c(50,250),c(8,10))
> rownames(data)=c("cig","non-cig")
> colnames(data)=c("qiguanyan","non-qiguanyan")
> chisq.test(data)#卡方检验，注意输入的值直接为数据框

    Pearson's Chi-squared test with Yates' continuity correction

data:  data
X-squared = 7.0225, df = 1, p-value = 0.008049

Warning message:
In chisq.test(data) : Chi-squared approximation may be incorrect
#此时会产生报错，产生此项错误的原因是吸烟患者样本数太少，出现此类报错，可用下文的fisher精确检验做统计检验
> fisher.test(data)#fisher精确检验

    Fisher's Exact Test for Count Data

data:  data
p-value = 0.007489
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.08452923 0.77224505
sample estimates:
odds ratio 
   0.25151

卡方检验例子——判断男性女性癌症的不同阶段的人数是否有差异.PNG

方差分析示例

step1：大致查看数据

> cholesterol
      trt response
1   1time   3.8612
2   1time  10.3868
3   1time   5.9059
4   1time   3.0609
5   1time   7.7204
6   1time   2.7139
7   1time   4.9243
8   1time   2.3039
9   1time   7.5301
10  1time   9.4123
11 2times  10.3993
12 2times   8.6027
13 2times  13.6320
14 2times   3.5054
15 2times   7.7703
16 2times   8.6266
17 2times   9.2274
18 2times   6.3159
19 2times  15.8258
20 2times   8.3443
21 4times  13.9621
22 4times  13.9606
23 4times  13.9176
24 4times   8.0534
25 4times  11.0432
26 4times  12.3692
27 4times  10.3921
28 4times   9.0286
29 4times  12.8416
30 4times  18.1794
31  drugD  16.9819
32  drugD  15.4576
33  drugD  19.9793
34  drugD  14.7389
35  drugD  13.5850
36  drugD  10.8648
37  drugD  17.5897
38  drugD   8.8194
39  drugD  17.9635
40  drugD  17.6316
41  drugE  21.5119
42  drugE  27.2445
43  drugE  20.5199
44  drugE  15.7707
45  drugE  22.8850
46  drugE  23.9527
47  drugE  21.5925
48  drugE  18.3058
49  drugE  20.3851
50  drugE  17.3071
> boxplot(response~trt,data=cholesterol)

输出结果.png

大致可以看出不同处理间y是有差异的。

step2:检查正态分布及方差齐性

> shapiro.test(cholesterol$response)
#检验是否为正态分布,>0.05没有显著差异，为正态分布

    Shapiro-Wilk normality test

data:  cholesterol$response
W = 0.97722, p-value = 0.4417

> bartlett.test(response~trt,data=cholesterol)
#检查方差齐性，>0.05方差没有显著差异。

    Bartlett test of homogeneity of variances

data:  response by trt
Bartlett's K-squared = 0.57975, df = 4, p-value = 0.9653

step3方差分析：

> fit<-aov(response~trt,data=cholesterol)
> summary(fit)
            Df Sum Sq Mean Sq F value   Pr(>F)    
trt          4 1351.4   337.8   32.43 9.82e-13 ***
Residuals   45  468.8    10.4                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Pr(>F)  即p值小于0.001，极显著，即处理组间有差异，可以用于事后多重比较

step4:事后多重比较

> TukeyHSD(fit)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = response ~ trt, data = cholesterol)

$trt
                  diff        lwr       upr     p adj
2times-1time   3.44300 -0.6582817  7.544282 0.1380949
4times-1time   6.59281  2.4915283 10.694092 0.0003542
drugD-1time    9.57920  5.4779183 13.680482 0.0000003
drugE-1time   15.16555 11.0642683 19.266832 0.0000000
4times-2times  3.14981 -0.9514717  7.251092 0.2050382
drugD-2times   6.13620  2.0349183 10.237482 0.0009611
drugE-2times  11.72255  7.6212683 15.823832 0.0000000
drugD-4times   2.98639 -1.1148917  7.087672 0.2512446
drugE-4times   8.57274  4.4714583 12.674022 0.0000037
drugE-drugD    5.58635  1.4850683  9.687632 0.0030633
#diff：两组平均值之间的差异
#lwr，upr：置信区间的上下端点为95％（默认值）
#p adj：调整后的多个比较的p值。
#如本例，除了2times-1time，2times-1time，drugD-4times没有显著差异，其余组别均有显著差异

step5 图例展示

> library(multcomp) 
> par(mar=c(5,4,6,2))
> tuk<-glht(fit,linfct=mcp(trt="Tukey"))
> plot(cld(tuk,level=0.05),col="lightgreen")

输出结果

有相同字母的组别表示差异不显著