用R语言对vcf文件进行数据挖掘.10 利用vcf信息判断物种染色体倍数

目录

  1. 前言
  2. 方法简介
  3. 从vcf文件里提取有用信息
  4. tidy vcfR
  5. vcf可视化1
  6. vcf可视化2
  7. 测序深度覆盖度
  8. 窗口缩放
  9. 如何单独分离染色体
  10. 利用vcf信息判断物种染色体倍数
  11. CNV分析

vcf数据包含了所有的等位对立基因的信息,这样就可以帮助我们判断染色体的倍数。比方说有一个位点的碱基是A/T,测序覆盖率为20, 如果这个物种是二倍体,那么A,T的出现概率就是(50%),会各自出现10次,如果是3倍体,那么A会出现13次,T会出现7次,当然也有可能相反。当把所有的点位集合在一起的时候,我们就可以判断这个物种的倍数体了。

数据导入

用包里的自带数据,有疑问的小盆友可以查阅之前的文章,这里就不做赘述了。

# Load libraries
library(vcfR)
library(pinfsc50)

# Determine file locations
vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz",
                        package = "pinfsc50")

# Read data into memory
vcf <- read.vcfR(vcf_file, verbose = FALSE)
vcf
## ***** Object of Class vcfR *****
## 18 samples
## 1 CHROMs
## 22,031 variants
## Object size: 20.9 Mb
## 7.929 percent missing data
## *****        *****         *****

等位对立基因平衡

高通量数据测序可以保证每一个位点都经过很多次的读取,这样就相当于每一个等位基因都被测序过了差不多相等的次数。假设我们对一个二倍杂合体进行了覆盖率为30的测序,那么每一条染色体都被测了15次。当然真实情况不可能正好是这个数字,毕竟测序的时候会发生一定概率的错误。
假设我们用覆盖率为30给一个三倍杂合体进行测序,某基因位点为A/A/T,那么,A和T出现的期待值将是20和10。当某个基因位点的组合是A/G/C时,那么A,G,C就会各自出现10次。

knitr::kable(vcf@gt[c(1:2,11,30),1:4])
FORMAT  BL2009P4_us23   DDR7602 IN2009T1_us22
GT:AD:DP:GQ:PL  1|1:0,7:7:21:283,21,0   1|1:0,6:6:18:243,18,0   1|1:0,8:8:24:324,24,0
GT:AD:DP:GQ:PL  0|0:12,0:12:36:0,36,427 0|0:20,0:20:60:0,60,819 0|0:16,0:16:48:0,48,650
GT:AD:DP:GQ:PL  0|1:17,12:29:99:453,0,667   0|0:32,0:32:96:0,96,1433    0|0:40,0:40:99:0,120,1765
GT:AD:DP:GQ:PL  1|0:7,4:11:99:130,0,260 0|0:16,0:16:48:0,48,677 0|0:26,0:26:78:0,78,1073

FORAMT里的AD表示对立基因的各自出现的次数。所以我们可以提取AD数据。

ad <- extract.gt(vcf, element = 'AD')

一般的SNP Caller都会默认双倍体检验,也就是出现两种对立基因型。所以可以计算每种基因的出现概率。

allele1 <- masplit(ad, record = 1)
allele2 <- masplit(ad, record = 2)
ad1 <- allele1 / (allele1 + allele2)
ad2 <- allele2 / (allele1 + allele2)

然后用直方图可视化一下。

hist(ad2[,"P17777us22"], breaks = seq(0,1,by=0.02), col = "#1f78b4", xaxt="n")
hist(ad1[,"P17777us22"], breaks = seq(0,1,by=0.02), col = "#a6cee3", add = TRUE)
axis(side=1, at=c(0,0.25,0.333,0.5,0.666,0.75,1), labels=c(0,"1/4","1/3","1/2","1/3","3/4",1))

可以发现,大多数都是纯合,所以需要去掉纯合的部分。

gt <- extract.gt(vcf, element = 'GT')
hets <- is_het(gt)

is.na( ad[ !hets ] ) <- TRUE

allele1 <- masplit(ad, record = 1)
allele2 <- masplit(ad, record = 2)

ad1 <- allele1 / (allele1 + allele2)
ad2 <- allele2 / (allele1 + allele2)

hist(ad2[,"P17777us22"], breaks = seq(0,1,by=0.02), col = "#1f78b4", xaxt="n")
hist(ad1[,"P17777us22"], breaks = seq(0,1,by=0.02), col = "#a6cee3", add = TRUE)
axis(side=1, at=c(0,0.25,0.333,0.5,0.666,0.75,1), labels=c(0,"1/4","1/3","1/2","1/3","3/4",1))

我们发现峰值出现在了1/2,说明这个物种时二倍体,和预期的一样。
然而这里有一个小小的问题,Fequency几乎从0到1横跨整个横坐标,这个明显不合理,需要进行改善。

根据等位对立测序深度进行改善

我们可以通过等位对立深度(AD)的信息来改善刚才提到的问题。

ad <- extract.gt(vcf, element = 'AD')
#ad[1:3,1:4]

allele1 <- masplit(ad, record = 1)
allele2 <- masplit(ad, record = 2)

# Subset to a vector for manipulation.
tmp <- allele1[,"P17777us22"]
#sum(tmp == 0, na.rm = TRUE)
#tmp <- tmp[tmp > 0]
tmp <- tmp[tmp <= 100]

hist(tmp, breaks=seq(0,100,by=1), col="#808080", main = "P17777us22")

sums <- apply(allele1, MARGIN=2, quantile, probs=c(0.15, 0.95), na.rm=TRUE)
sums[,"P17777us22"]
## 15% 95% 
##  19  75
abline(v=sums[,"P17777us22"], col=2, lwd=2)

我们可以看到80%的数据分布在了19和75之间。然后再靠近40和60点的地方出现了两个峰,这分别代表杂合峰和纯合峰。然后整个数据还拖着一个尾巴,最长的地方超过了100,这表示部分区域包含了着非常高的拷贝数(CNVs)。此处的目的是为了可视化倍数体,所以选择100以下15%~95%的数据。

tmp <- allele2[,"P17777us22"]
tmp <- tmp[tmp>0]
tmp <- tmp[tmp<=100]

hist(tmp, breaks=seq(1,100,by=1), col="#808080", main="P17777us22")

sums[,"P17777us22"]
## 15% 95% 
##  19  75
abline(v=sums[,"P17777us22"], col=2, lwd=2)

回想一下之前文章里介绍过的用箱图做可视化的内容,我们也可以通过同样的方法来确认过滤数据的效果。

#vcf <- extract.indels(vcf)
#gq <- extract.gt(vcf, element = 'GQ', as.numeric = TRUE)
#vcf@gt[,-1][ gq < 99 ] <- NA

ad <- extract.gt(vcf, element = 'AD')
allele1 <- masplit(ad, record = 1)
allele2 <- masplit(ad, record = 2)
boxplot(allele1, las=3)
#hist(allele1[,"P17777us22"], ylim=c(0,2000),
#     breaks = seq(0,max(allele1[,"P17777us22"], na.rm = TRUE),by=5),
#     xlim=c(0,100))

# Subset to a vector for manipulation.
#tmp <- allele1[,"P17777us22"]
#tmp <- tmp[tmp <= 100]
#hist(tmp, breaks=seq(0,100,by=1), col="#808080")
sums <- apply(allele1, MARGIN=2, quantile, probs=c(0.15, 0.95), na.rm=TRUE)
# Allele 1
dp2 <- sweep(allele1, MARGIN=2, FUN = "-", sums[1,])
#allele1[dp2 < 0] <- NA
vcf@gt[,-1][ dp2 < 0 & !is.na(vcf@gt[,-1]) ] <- NA
dp2 <- sweep(allele1, MARGIN=2, FUN = "-", sums[2,])
#allele1[dp2 > 0] <- NA
vcf@gt[,-1][dp2 > 0] <- NA
# Allele 2
dp2 <- sweep(allele2, MARGIN=2, FUN = "-", sums[1,])
vcf@gt[,-1][ dp2 < 0 & !is.na(vcf@gt[,-1]) ] <- NA
dp2 <- sweep(allele2, MARGIN=2, FUN = "-", sums[2,])
vcf@gt[,-1][dp2 > 0] <- NA

# Hard filter
#dp[dp < 4] <- NA
#vcf@gt[,-1][allele1 < 8] <- NA
#

#adp <- adp[adp<=100]
#adp <- adp[adp>=sums[,"P17777us22"][1]]
#adp <- adp[adp<=sums[,"P17777us22"][2]]

#hist(adp, breaks=seq(0, max(adp, na.rm = TRUE )+1, by=1), col="#C0C0C0", xlim = c(0,100))
#axis(side=1,at=1:4*10)
#abline(v=sums[,"P17777us22"], col=2, lwd = 2)

#adp <- allele2[,"P17777us22"]
#adp <- adp[adp>0]
#adp <- adp[adp<=100]
#hist(adp, breaks=seq(0, max(adp, na.rm = TRUE), by=1), col="#C0C0C0")
#par(mfrow=c(1,1))
#abline(v=sums[,"P17777us22"], col=2, lwd = 2)

看一下过滤后的结果。

ad <- extract.gt(vcf, element = 'AD')
allele1 <- masplit(ad, record = 1)
allele2 <- masplit(ad, record = 2)
boxplot(allele1, las=3)

果然好看很多。
最后再回到一开始,看倍数体的可视化效果。

gt <- extract.gt(vcf, element = 'GT')
hets <- is_het(gt)
is.na( ad[ !hets ] ) <- TRUE

allele1 <- masplit(ad, record = 1)
allele2 <- masplit(ad, record = 2)

ad1 <- allele1 / (allele1 + allele2)
ad2 <- allele2 / (allele1 + allele2)

hist(ad2[,"P17777us22"], breaks = seq(0,1,by=0.02), col = "#1f78b4", xaxt="n", main="P17777us22")
hist(ad1[,"P17777us22"], breaks = seq(0,1,by=0.02), col = "#a6cee3", add = TRUE)
axis(side=1, at=c(0,0.25,0.333,0.5,0.666,0.75,1), labels=c(0,"1/4","1/3","1/2","2/3","3/4",1))

结果明显干净易懂好多。
有同学会问,那么不是二倍体的话会出现什么样的结果呢。数据包的样本里正好有一个三倍体。



可以看到两个峰出现在了1/3,2/3处。结果和实际完美匹配。

你可能感兴趣的:(用R语言对vcf文件进行数据挖掘.10 利用vcf信息判断物种染色体倍数)