找到Matrix eQTL这个包,看下文章Matrix eQTL: ultra fast eQTL analysis via large matrix operations(https://doi.org/10.1093/bioinformatics/bts163
)
eQTL(表达数量性状位点)计算transcript-SNP 的关系,即分析SNP与基因的表达是否相关。由于计算数量巨大,很多人都用较小的数据来做。因此该作者开发了Matrix eQTL,用于处理大数据,支持additive linear and ANOVA models with covariates,并且可以将cis- and trans-eQTLs分开计算。
Matrix eQTL相较于其他软件如FastMap — 18.4 min, Merlin — 12.3 min, Plink — 9.0 min, Matrix eQTL — 5.7 min and snpMatrix — 3.3 min要快,它设置一个阈值,只有超过这个阈值的p值才会被计算。
采用的是线型回归模型,g为基因表达情况,s为SNP分型结果。
说明文档http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/runit.html
示例数据:http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/R.html
分析过程很简单,首先设置好要分析文件的路径和名称:
install.packages("MatrixEQTL")
# 设置数据目录,示例数据放在包的安装目录下了。
base.dir = find.package("MatrixEQTL")
#设置分析的模型
useModel = modelLINEAR; # modelANOVA or modelLINEAR or modelLINEAR_CROSS
#设置SNP文件的名称
SNP_file_name = paste(base.dir, "/data/SNP.txt", sep="");
# 设置表达数据文件的名称
expression_file_name = paste(base.dir, "/data/GE.txt", sep="");
# 设置协变量文件的名称
# 无协变量设置为character()
covariates_file_name = paste(base.dir, "/data/Covariates.txt", sep="");
output_file_name = tempfile();
- 提供了三种分析模型供选择
(1) modelLINEAR
Model: useModel = modelLINEAR
Equation: expression = α + ∑k βk⋅covariatek + γ⋅genotype_additive
Testing for significance of: γ
Test statistic: t-statistic
(2) modelANOVA
Model: useModel = modelANOVA
Equation: expression = α + ∑k βk⋅covariatek + γ1⋅genotype_additive + γ2⋅genotype_dominant
Testing for significance of: (γ1,γ2) pair
Test statistic: F-statistic
(3) modelLINEAR_CROSS
Model: useModel = modelLINEAR_CROSS
Equation:
expression = α + ∑k βk⋅covariatek + γ⋅genotype_additive + δ⋅genotype_additive⋅covariateK
Testing for significance of: δ
Test statistic: t-statistic
- 注意这里要设置一个p值的阈值,一般越大的数据量阈值设的越小,之前说过它会按这个阈值来计算结果,如果设的过大,分析耗时并且输出很多结果。输出的结果都储存在output_file_name里
pvOutputThreshold = 1e-2
# 设置协变量矩阵为 numeric(),很少用,默认
errorCovariance = numeric()
# 这里建立了一个SlicedData的新对象,用于存放martix的数据,并设置存放数据的格式
snps = SlicedData$new();
# 设置数据分隔符为tab
snps$fileDelimiter = "\t"; # the TAB character
# 设置缺失值为NA
snps$fileOmitCharacters = "NA"; # denote missing values;
snps$fileSkipRows = 1; # one row of column labels
snps$fileSkipColumns = 1; # one column of row labels
snps$fileSliceSize = 2000; # read file in pieces of 2,000 rows
snps$LoadFile( SNP_file_name );
## Load gene expression data
gene = SlicedData$new();
gene$fileDelimiter = "\t"; # the TAB character
gene$fileOmitCharacters = "NA"; # denote missing values;
gene$fileSkipRows = 1; # one row of column labels
gene$fileSkipColumns = 1; # one column of row labels
gene$fileSliceSize = 2000; # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name);
## Load covariates
cvrt = SlicedData$new();
cvrt$fileDelimiter = "\t"; # the TAB character
cvrt$fileOmitCharacters = "NA"; # denote missing values;
cvrt$fileSkipRows = 1; # one row of column labels
cvrt$fileSkipColumns = 1; # one column of row labels
看下文件格式,snp文件用0,1,2表示,基因文件是表达量,cvrt是covariates:
设置好文件后可以用 Matrix_eQTL_engine主函数进行eQTL分析了,参数snps设置SNP文件,gene设置表达量文件,cvrt设置协变量。然后将每行的SNP和gene放到一块进行线性回归的分析。
me = Matrix_eQTL_engine(
snps = snps,
gene = gene,
cvrt = cvrt,
output_file_name = output_file_name,
pvOutputThreshold = pvOutputThreshold,
useModel = useModel,
errorCovariance = errorCovariance,
verbose = TRUE,
pvalue.hist = TRUE,
min.pv.by.genesnp = FALSE,
noFDRsaveMemory = FALSE);
运行完后得到的me对象是一个list:
输出文件的每行eqtl为:SNP name, a transcript name, estimate of the effect size, t- or F-statistic, p-value, and FDR。
Matrix eQTL可以区分顺式(cis,local)和反式(trans,distant)eQTL,主要用Matrix_eQTL_main函数来分析。其包括以下几个参数:
* `pvOutputThreshold.cis` – p-value threshold for cis-eQTLs.
* `output_file_name.cis` – detected cis-eQTLs are saved in this file.
* `cisDist` – maximum distance at which gene-SNP pair is considered local.
* `snpspos` – data frame with information about SNP locations, must have 3 columns - SNP name, chromosome, and position. See [sample SNP location file](http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/Sample_Data/snpsloc.txt).
* `genepos` – data frame with information about gene locations, must have 4 columns - the name, chromosome, and positions of the left and right ends. See [sample gene location file](http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/Sample_Data/geneloc.txt).
下面来看具体代码:
# source("Matrix_eQTL_R/Matrix_eQTL_engine.r");
library(MatrixEQTL)
## Location of the package with the data files.
base.dir = find.package('MatrixEQTL');
# base.dir = '.';
## Settings
# Linear model to use, modelANOVA, modelLINEAR, or modelLINEAR_CROSS
useModel = modelLINEAR; # modelANOVA, modelLINEAR, or modelLINEAR_CROSS
# Genotype file name
SNP_file_name = paste(base.dir, "/data/SNP.txt", sep="");
snps_location_file_name = paste(base.dir, "/data/snpsloc.txt", sep="");
# Gene expression file name
expression_file_name = paste(base.dir, "/data/GE.txt", sep="");
gene_location_file_name = paste(base.dir, "/data/geneloc.txt", sep="");
# Covariates file name
# Set to character() for no covariates
covariates_file_name = paste(base.dir, "/data/Covariates.txt", sep="");
# Output file name
output_file_name_cis = tempfile();
output_file_name_tra = tempfile();
# Only associations significant at this level will be saved
pvOutputThreshold_cis = 2e-2;
pvOutputThreshold_tra = 1e-2;
# Error covariance matrix
# Set to numeric() for identity.
errorCovariance = numeric();
# errorCovariance = read.table("Sample_Data/errorCovariance.txt");
# Distance for local gene-SNP pairs
cisDist = 1e6;
## Load genotype data
snps = SlicedData$new();
snps$fileDelimiter = "\t"; # the TAB character
snps$fileOmitCharacters = "NA"; # denote missing values;
snps$fileSkipRows = 1; # one row of column labels
snps$fileSkipColumns = 1; # one column of row labels
snps$fileSliceSize = 2000; # read file in slices of 2,000 rows
snps$LoadFile(SNP_file_name);
## Load gene expression data
gene = SlicedData$new();
gene$fileDelimiter = "\t"; # the TAB character
gene$fileOmitCharacters = "NA"; # denote missing values;
gene$fileSkipRows = 1; # one row of column labels
gene$fileSkipColumns = 1; # one column of row labels
gene$fileSliceSize = 2000; # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name);
## Load covariates
cvrt = SlicedData$new();
cvrt$fileDelimiter = "\t"; # the TAB character
cvrt$fileOmitCharacters = "NA"; # denote missing values;
cvrt$fileSkipRows = 1; # one row of column labels
cvrt$fileSkipColumns = 1; # one column of row labels
if(length(covariates_file_name)>0) {
cvrt$LoadFile(covariates_file_name);
}
## Run the analysis
snpspos = read.table(snps_location_file_name, header = TRUE, stringsAsFactors = FALSE);
genepos = read.table(gene_location_file_name, header = TRUE, stringsAsFactors = FALSE);
me = Matrix_eQTL_main(
snps = snps,
gene = gene,
cvrt = cvrt,
output_file_name = output_file_name_tra,
pvOutputThreshold = pvOutputThreshold_tra,
useModel = useModel,
errorCovariance = errorCovariance,
verbose = TRUE,
output_file_name.cis = output_file_name_cis,
pvOutputThreshold.cis = pvOutputThreshold_cis,
snpspos = snpspos,
genepos = genepos,
cisDist = cisDist,
pvalue.hist = "qqplot",
min.pv.by.genesnp = FALSE,
noFDRsaveMemory = FALSE);
unlink(output_file_name_tra);
unlink(output_file_name_cis);
## Results:
cat('Analysis done in: ', me$time.in.sec, ' seconds', '\n');
cat('Detected local eQTLs:', '\n');
show(me$cis$eqtls)
cat('Detected distant eQTLs:', '\n');
show(me$trans$eqtls)
## Plot the Q-Q plot of local and distant p-values
plot(me)
因此,分析自己的数据需要准备
genotype
expression
covariates
gene location
SNP location
这五个文件,前三个需要每列的样本名对应且顺序一致。
作者也提供了生成模拟数据的代码:
# Create an artificial dataset and plot the histogram and Q-Q plot of all p-values
library('MatrixEQTL')
# Number of samples
n = 100;
# Number of variables
ngs = 2000;
# Common signal in all variables (population stratification)
pop = 0.2 * rnorm(n);
# data matrices
snps.mat = matrix(rnorm(n*ngs), ncol = ngs) + pop;
gene.mat = matrix(rnorm(n*ngs), ncol = ngs) + pop + snps.mat*((1:ngs)/ngs)^9/2;
# data objects for Matrix eQTL engine
snps1 = SlicedData$new( t( snps.mat ) );
gene1 = SlicedData$new( t( gene.mat ) );
cvrt1 = SlicedData$new( );
rm(snps.mat, gene.mat)
# Slice data in blocks of 500 variables
snps1$ResliceCombined(500);
gene1$ResliceCombined(500);
# name of temporary output file
filename = tempfile();
# Perform analysis recording information for
# a histogram
meh = Matrix_eQTL_engine(
snps = snps1,
gene = gene1,
cvrt = cvrt1,
output_file_name = filename,
pvOutputThreshold = 1e-100,
useModel = modelLINEAR,
errorCovariance = numeric(),
verbose = TRUE,
pvalue.hist = 100);
unlink( filename );
# png(filename = "histogram.png", width = 650, height = 650)
plot(meh, col="grey")
# dev.off();
# Perform the same analysis recording information for
# a Q-Q plot
meq = Matrix_eQTL_engine(
snps = snps1,
gene = gene1,
cvrt = cvrt1,
output_file_name = filename,
pvOutputThreshold = 1e-6,
useModel = modelLINEAR,
errorCovariance = numeric(),
verbose = TRUE,
pvalue.hist = "qqplot");
unlink( filename );
# png(filename = "QQplot.png", width = 650, height = 650)
plot(meq, pch = 16, cex = 0.7)
# dev.off();
阅读推荐:
生信技能树公益视频合辑:学习顺序是linux,r,软件安装,geo,小技巧,ngs组学!
B站链接:https://m.bilibili.com/space/338686099
YouTube链接:https://m.youtube.com/channel/UC67sImqK7V8tSWHMG8azIVA/playlists
生信工程师入门最佳指南:https://mp.weixin.qq.com/s/vaX4ttaLIa19MefD86WfUA
学徒培养:https://mp.weixin.qq.com/s/3jw3_PgZXYd7FomxEMxFmw