TwoSampleMR：两样本孟德尔随机化

孟德尔随机化推断暴露因素与健康结局的因果关系

主要步骤：
1. 读取暴露因素GWAS数据
选取合适的工具变量（如有必要，需要进行clumping）
2. 读取结局GWAS数据，提取上述的工具变量的SNPs
3. 对暴露因素与结局的GWAS数据进行预处理，使其格式统一
4. 采用选中的MR方法进行分析
5. 结果可视化 (散点图、森林图、漏斗图）

1. 工具变量

这一步读取暴露因素和结局变量需要的数据格式是一样的。需要一个工具变量的data frame，每行对应一个SNP，至少需要4列，分别为：
1. SNP: rs ID
2. beta: The effect size. If the trait is binary then log(OR) should be used
3. se: The standard error of the effect size
4. effect_allele: The allele of the SNP which has the effect marked in beta

其他可能有助于MR预处理或分析的列包括
1. other_allele – The non-effect allele
2. eaf – The effect allele frequency
3. Phenotype – The name of the phenotype for which the SNP has an effect
4. pval – The P-value for the SNP’s association with the exposure

也可以提供额外的信息（可选，对于分析非必须）
1. chr – Physical position of variant (chromosome)
2. position – Physical position of variant (position)
3. samplesize – Sample size for estimating the effect size
4. ncase – Number of cases
5. ncontrol – Number of controls
6. units – The units in which the effects are presented
7. gene – The gene or other annotation for the the SNP

install.packages("devtools")
devtools::install_github("MRCIEU/TwoSampleMR") #这个命名是安装最新版本的TwosampleMR包
library(TwoSampleMR)
## TwoSampleMR自带的数据集
bmi2_file <- system.file("extdata/bmi.csv", package="TwoSampleMR")
bmi_exp_dat <- read_exposure_data(
  filename = bmi2_file,
  sep = ",",
  snp_col = "rsid",
  beta_col = "effect",
  se_col = "SE",
  effect_allele_col = "a1",
  other_allele_col = "a2",
  eaf_col = "a1_freq",
  pval_col = "p-value",
  units_col = "Units",
  gene_col = "Gene",
  samplesize_col = "n"
)
# No phenotype name specified, defaulting to 'exposure'.

改好自己的暴露名

bmi_exp_dat$exposure <- "BMI"

对于一个标准的两样本MR分析来说，我们需要确保工具变量之间是互相独立的（即不存在显著的连锁互换(LD)），在读取完数据后我们应当对其进行LD Clumping，这里可以使用clump_data函数，指定LD参考面板为EUR，r2的阈值为0.01，对数据进行Clumping。该函数与OpenGWAS API进行交互，其存储了千人基因组中5个群体（EUR, SAS, EAS, AFR, AMR）的LD数据。对东亚人进行分析时指定pop = "EAS"即可。

bmi_exp_dat <- clump_data(bmi_exp_dat, clump_r2=0.01, pop = "EUR")
# API: public: http://gwas-api.mrcieu.ac.uk/
# Please look at vignettes for options on running this locally if you need to run many instances of this command.
# Clumping 51As7f, 30 variants, using EUR population reference

2. 结局变量

选好工具变量之后，就要将工具变量从结局的GWAS中提取出来。
可以像上面一样从本地读取，也可以从R包里提。下面展示从R包提取的方法。

从数据库获取结局相关的GWAS (这里用的是ieu-a-7)

#> API: public: http://gwas-api.mrcieu.ac.uk/
# ao <- available_outcomes()
# head(ao)

chd_out_dat <- extract_outcome_data(
  snps = bmi_exp_dat$SNP,
  outcomes = 'ieu-a-7'
)

head(chd_out_dat )

网站查找：（https://gwas.mrcieu.ac.uk/）

3. 数据预处理

对数据进行预处理，使其效应等位与效应量保持统一，就是调整exp和out上位点的方向和beta，使其统一。如果一致无需调整，如果相反则调整beta值为(-beta)，eaf为(1-eaf)。

dat <- harmonise_data(
  exposure_dat = bmi_exp_dat,
  outcome_dat = chd_out_dat
)

4. 孟德尔随机化和敏感性分析

4.1 MR分析

## MR
res <- mr(dat)
View(res)

⚠️最重要的结果就是后面的p值，b是OR的log值，se是b的标准误

默认五种方法（五列）：MR Egger、Weighted median、Inverse variance weighted、Simple mode和Weighted mode。只要上表后面的P值有意义，就可以。

IVW是最早使用的方法，也是最常用的。该方法中文叫做逆方差加权法，它的特点是回归时不考虑截距项的存在并且用结局方差（se的二次方）的倒数作为权重来进行拟合。它需要SNP完全符合MR研究三原则才能得到正确的因果估计。
具体的R语言代码如下：
fit <- summary(lm(b_out ~ -1 +b_exp,  weights = 1/se_out^2)) 
代码里的b_out表示结局的beta值，b_exp表示暴露的beta值，se_out就是结局的标准误，se_out^2就代表结局beta值的方差，而模型中的-1表示的就是去除截距项。

MR-Egger多加了截距项，其主要目的是判断水平多效性的有无；另外它也使用结局方差（se的二次方）的倒数作为权重来进行拟合，具体的R语言代码如下：
fit <- summary(lm(b_out ~ b_exp,  weights = 1/se_out^2)) 
这个代码和IVW的非常相似，区别就在于少了-1，这是因为R函数lm()里默认回归模型保留截距项。同样地，这次回归得出来的beta，se和P值就是MR分析的结果。
因此我们不难看出，IVW和MR-Egger这两个的核心算法都是很简单的，两者最大的区别就是回归时是否考虑截距项的存在。

Weighted Median是利用大部分SNP（majority of genetic variants）来判断因果关系的有无。

参考：关于MR-Egger方法的注意事项

文献参考：

High Blood Pressure and Risk of Dementia: A Two-Sample Mendelian Randomization Study in the UK Biobank

用mr_method_list() 查看当前支持的其他统计方法。
在mr函数中添加想要使用的方法

mr(dat, method_list=c("mr_egger_regression", "mr_ivw"))

计算系数的置信区间及OR值 (⚠️在画森林图的时候需要有OR值和置信区间)

generate_odds_ratios(res)

附：beta和OR的转换

4.2 敏感性分析（看结果是否稳定，主要也是看p值）

广义的敏感性分析包括以下三项：
1）异质性检验：主要检验多个IV是否存在水平多效性，常用MR Egger法的截距项表示，如果该截距项与0差异很大，说明存在水平多效性。除此以外，MR-PRESSO包也是常用的检验水平多效性的R包。

## Heterogeneity statistics
mr_heterogeneity(dat) #异质性检验 ##异质性检验的结果看P值，P值大于0.05说明纳入的 IVs 不存在异质性，研究结果不需要考虑异质性造成的影响。
#   id.exposure id.outcome                              outcome
# 1      UeLrA3    ieu-a-7 Coronary heart disease || id:ieu-a-7
# 2      UeLrA3    ieu-a-7 Coronary heart disease || id:ieu-a-7
#   exposure                    method        Q Q_df     Q_pval
# 1 exposure                  MR Egger 42.81295   28 0.03628115
# 2 exposure Inverse variance weighted 42.81356   29 0.04736584

异质性检验显示这些IV之间存在很强的异质性（Q_pval小于0.05）。一般来说，异质性大(p<0.001)的时候我们使用随机效应模型，异质性小的时候使用固定效应模型。(如下)

mr(dat,method_list=c('mr_ivw_mre'))  #使用随机效应模型

2）基因多效性检验：孟德尔随机化分析有三个假设，其中的一个假设就是工具变量（一般是遗传位点）必须通过暴露因素(exposure)影响结果（outcome）。如果工具变量可以不通过暴露因素直接影响结果，那么就违反了孟德尔随机化的思想，即检验结果存在水平多效性。
用孟德尔随机化进行因果关系推断的大前提是没有水平多效性。

# pleiotropy test
mr_pleiotropy_test(dat) #水平多效性检验 
#   id.exposure id.outcome                              outcome
# 1      UeLrA3    ieu-a-7 Coronary heart disease || id:ieu-a-7
#   exposure egger_intercept          se      pval
# 1 exposure   -9.760649e-05 0.004877288 0.9841754

使用MR-PRESSO检验表露因素和结局变量的水平基因多效性

MR-PRESSO方法在2018年5月发表在《nature genetics》上（PMID: 29686387），是目前应用比较广泛的检验水平多效性的方法。
它的核心包含三部分：
（1）利用“MR-PRESSO global test”检测水平多效性是否存在
（2）利用“MR-PRESSO outlier test”来剔除异常SNP（outliers）并估计矫正后的结果（此结果去除了水平多效性）

（3）利用“MR-PRESSO distortion test”来检验矫正前和矫正后的结果是否存在差异

library(MRPRESSO)
data(SummaryStats)
# Run MR-PRESSO global method
> mr_presso(BetaOutcome = "beta.outcome", BetaExposure = "beta.exposure", SdOutcome = "se.outcome", SdExposure = "se.exposure", OUTLIERtest = TRUE, DISTORTIONtest = TRUE, data = dat, NbDistribution = 1000,  SignifThreshold = 0.05)
# $`Main MR results`
#        Exposure       MR Analysis Causal Estimate         Sd   T-stat      P-value
# 1 beta.exposure               Raw       0.1143116 0.01540012 7.422773 3.524779e-08
# 2 beta.exposure Outlier-corrected              NA         NA       NA           NA

# $`MR-PRESSO results`
# $`MR-PRESSO results`$`Global Test`
# $`MR-PRESSO results`$`Global Test`$RSSobs
# [1] 47.66689

# $`MR-PRESSO results`$`Global Test`$Pvalue
# [1] 0.067

MR-PRESSO results$Global Test$Pvalue即为我们检测的水平多效性，P值小于0.05，说明暴露因素和结局变量存在水平多效性。

3）“leave-one-out”法：主要是逐个剔除IV后计算剩下IV的MR结果，如果剔除某个IV后其它IV估计出来的MR结果和总结果差异很大，说明MR结果对该IV是敏感的。（侠义的敏感性分析指的是“leave-one-out”法。）

# leave-one-out analysis
res_loo <- mr_leaveoneout(dat)
View(res_loo)
#“leave-one-out”法指的是逐步剔除每个SNP，计算剩余SNP的meta效应，观察剔除每个SNP后结果是否发生变化
#如果剔除了某一个SNP后，结果改变很大，说明存在某一个SNP对结果影响很大，这是我们不愿意看到的情况。
#理想的情况应该是逐步剔除某个SNP后结果变化不大。

5. 可视化

# scatter plot
p1 <-mr_scatter_plot(res, dat)
length(p1) # to see how many plots are there
p1[[1]]

上图就是不同参数估计方法下面估计出来的暴露对结局的因果效应，这个图的纵轴是工具变量对结局的影响效应，横轴是工具变量对暴露的因果效应，两个效应做比值就是暴露对结局的效应，也就是图中线的斜率，也可以看出来这个线不同的算法总体是斜向上的，就意味着暴露对结局是有着正向的因果关系的。

# forest plot
res_single <- mr_singlesnp(dat)#,all_method=c("mr_ivw", "mr_two_sample_ml")) to specify method used
p2 <- mr_forest_plot(res_single) 
p2[[1]]

#漏斗图
res_single <- mr_singlesnp(dat)
p4 <- mr_funnel_plot(res_single)
p4[[1]]

以上这些函数都可以同通过method=增加特定的方法。
导出图片

### save images ##########
library(ggplot2)
ggsave(p1[[1]], file="mr_scatter_plot.png", 
       width = 7, height=7, dpi=300)

6. Variance explained

library(knitr)
out <- directionality_test(dat)
kable(out)

MR假定工具变量先影响暴露，然后通过暴露影响结局，但也有可能是一个cis-acting SNP先影响了基因表达或DNA甲基化。所以，为了确定这个假定的方向性，我们使用 the Steiger test分别计算工具变量对暴露和结局的variance explained，并判断结局的variance是否小于暴露；如果是，则方向正确。
这个测试有可能因为研究GWAS本身的质量不够好而导致结果不可靠，可以再做sensitivity analyses。方法如下：

Provide estimates of measurement error for the exposure and the outcome, and obtain an adjusted estimate of the causal direction.

For all possible values of measurement error, identify the proportion of the parameter space which supports the inferred causal direction

mr_steiger(
    p_exp = dat$pval.exposure, 
    p_out = dat$pval.outcome, 
    n_exp = dat$samplesize.exposure, 
    n_out = dat$samplesize.outcome, 
    r_xxo = 1, 
    r_yyo = 1,
    r_exp=0)