3010份亚洲稻群体重测序项目是由中国农业科学院作物科学研究所牵头,联合国际水稻研究所、华大基因等16家单位共同完成。该研究对来自89个国家的3,010份水稻品种进行了重测序研究(resequence),参考的是日本晴(Nipponbare)基因组。所有3,010个基因组的平均测序深度(average sequencing depth)为14×,平均基因组覆盖率(average genome coverage)和绘图率(average mapping rate)分别为94.0%和92.5%。
该项目的意义自然不用多说,并且该项目的测序数据都可以在国际水稻所或CNGB上下载。我们可以从原始序列开始组装得到大的结构变异 或call到一些稀有SNP来做研究。当然以上想法不在本文讨论范围内,本文主要讨论只想知道几个基因区域上 SNP 及其分化系数等群体指标的计算。
2021.09.15 更新了单倍型分型、作图和方差分析。
放在前面
RFGB可以极大程度上满足查找某个基因上的SNP&indel和单倍型查找。并提供15个表型可以进行单倍型分析,也可以上传表型进行分析。接下来探索一下。
这个页面提供单倍型分析。分别可以用基因名,染色体位置和特定的SNP来进行分析。提供MAF是最小等位基因频率,建议选一下。可以用全部的3024个品种进行查找也可以分亚群进行查找,也提供了表型关联分析。
这里是用的LOC_Os08g33530 CDS 区域,选的籼稻群体,共找到20个SNP,和16个单倍型。其中hap11的谷粒长最高。
这个网站用来做单倍型都是SNP,而且是复等位的,就是没找到在哪下数据集。
可惜没有LD的图和pi值。
但我找到了水稻单倍型的数据,具体可看文章
The landscape of gene–CDS–haplotype diversity in rice: Properties, population organization, footprints of domestication and breeding, and implications for genetic improvement
数据地址
也可以用这个单倍型数据和表型做关联分析,用方差分析就可以。表型数据下载
以下主要是3k构建的SNP的探索以及简单利用。
工具准备
R
R package(LDheatmap,genetics)
plink1.9(3个操作系统版本都有)下载地址
vcftools(只有liunx版本,只是用来算群体遗传统计)下载地址
数据下载
可以在IRRI上下载到双等位SNP&indel数据(多等位的SNP貌似只能从头call,没找到下载的地方)。bed、bim、fam文件为一组,bed 是存放位点信息的二进制格式文件,用文本打开会发现是乱码;bim是存放位点信息的文件,类似于map格式;fam是存放样本信息的文件,第一列为FID,第二列为IID。
Nipponbare_indel 是双等位indel数据集,共有2.3m个,但是是没有经过筛选的,后续需要加工下。
双等位的SNP标记共5个。
NB_final_snp 包含了所有的SNP,共29m个,是没有经过筛选的。
Base 的SNP是经过基础筛选的,共18m个。
base_filtered_v0.7 是从Base的SNP进一步筛选的,共4.2m个,本文主要利用这个数据集。
pruned_v2.1 是2.1版本的筛选方案,包含1m个SNP。
base_filtered_v0.7_0.8_10kb_1_0.8_50_1 是在base_filtered_v0.7 的基础上进一步筛选 共 404k 个,在用的时候发现很多基因区域上都没有SNP,毕竟个数少,比较稀疏。
各个数据集的筛选方法可以看看里边的readme文件,不同的筛选方案主要目的就是提高精度,找到因果变异。
准备目标基因的位置
其实只需要染色体号和起始位置就好了,基因名字写啥都行,就是拿来做图片标题的。忘记提了,这个数据集中没有线粒体和叶绿体上的标记。建议起始位置可以把上游和下游都囊括进去,多个2kb,3kb 都是可以的,多找几个总是好的,万一和表型关联上了呢。合并想要的SNP数据集&indel数据集
其实本可以一个个数据集来,但我想把2.3m个indel和4.2m个SNP标记合在一起分析咋办呢。最简单的办法还是用vcftools或bcftools合并两个数据集。可以参考这篇。bash代码如下:
plink --bfile base_filtered_v0.7 --recode vcf-iid --out base_filtered_v0.7 #先转到vcf格式
plink --bfile Nipponbare_indel --recode vcf-iid --out Nipponbare_indel
###由于Nipponbare_indel 中少一个样本 IRIS_313-8921 并且这个样本在官网给的分类为NA,空值。所以就需要删掉它
vcftools --vcf base_filtered_v0.7.vcf --recode --recode-INFO-all --stdout --indv IRIS_313-8921 > rm_na_base_filtered_v0.7.vcf
bcftools view Nipponbare_indel.vcf -Oz -o Nipponbare_indel.vcf.gz #构建引索
bcftools index Nipponbare_indel.vcf.gz
bcftools view rm_na_base_filtered_v0.7.vcf -Oz -o rm_na_base_filtered_v0.7.vcf.gz
bcftools index rm_na_base_filtered_v0.7.vcf.gz
bcftools concat rm_na_base_filtered_v0.7.vcf.gz Nipponbare_indel.vcf.gz -a -O v -o combine_base_filtered_indel.vcf #合并两个vcf文件
plink --vcf combine_base_filtered_indel.vcf --make-bed --out combine_base_filtered_indel #再转回bed格式 主要是这样占内存小点,这个vcf文件有55G ,bed文件只有5.23G
那么就不想用vcftools合并两个数据集咋办呢?有些时候并不想得到所有位点的信息,只需要一部分就可以了。其实可以用R和plink来实现合并两个数据集,并只提取几个位点,这种方案内存代价小点,且看后续讲解。
准备分类文件
群体分类文件如下这个分类太细了,给它分成五类就可以了,GJ、XI、admix、aus、bas,用excel的分列就能完成,需要把IRIS_313-8921去掉,这个分类是NA。当然也可以用自定义的分类。
突然想起踩到的一个坑,plink1.9和2在筛选的时候无法识别样本名称带有"-"字符,所以需要把 分类文件里 "-"换成"_" ,"IRIS_313-8921"换成"IRIS_313_8921"如用 excel 就可以完成,对了fam文件里也要换。
目标区域SNP位点提取
首先需要把plink.exe和分类文件(plink_cluster.txt)放到工作目录下,方便调用,也可以把它放到环境目录下。
R代码如下:
#基本参数
gene_list <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_len.txt" #目标区间
out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #输出文件前缀
data_name <- c(".\\data\\base_filtered_v0.7",".\\data\\Nipponbare_indel") #SNP数据集,输入多个就是合并
pop_name<-c("GJ","XI","admix","aus","bas","GJ XI","admix aus bas") #需要生成的群体
fst_pop <- c("GJ XI","admix aus bas") #计算fst需要的群体 需在pop_name出现过
#质控参数
maf <- 0.05
miss <- 0.4
color.rgb <- colorRampPalette(rev(c("white","red")),space="rgb") #color for LD
target_gene <- read.table(gene_list,header = T)
extract <- cbind(target_gene $CHROM,target_gene $start,target_gene $end,target_gene $GENE)
write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
#提取目标区域位点 并输出vcf文件 可以用Tassel 看LD
#查看样本是否缺少
tem <- c()
for (i in 1:length(data_name)) {
temp <- read.table(paste0(data_name[i],".fam"),header = F)
temp <- nrow(temp)
tem <- c(tem,temp)
}
#用少了样本的数据集来keep
keep_file <- data_name[which(tem %in% min(tem))]
for (i in 1:length(data_name)) {
total_cmd <- paste0(".//plink --bfile ",data_name[i] ," --extract temp_extract.txt --range --keep ",paste0(keep_file,".fam")," --recode vcf-iid --out ",paste0(out_name,"_",i))
system(total_cmd,intern = FALSE)
}
temp_all <- read.table(paste0(out_name,"_",1,".vcf"),header = F,sep = '\t')
if (length(data_name)>=2){ #合并数据集
for (i in 2:length(data_name)) {
temp <- read.table(paste0(out_name,"_",i,".vcf"),header = F,sep = '\t')
temp_all<-rbind(temp_all ,temp )
}
temp_all<-temp_all[order(temp_all[,1],temp_all[,2]),] #按照染色体和POS排序
}
### 列名
fam_name<-read.table(paste0(keep_file,".fam") ,header = F,sep = ' ')[,1]
vcf_colnames<-c("#CHROM", "POS","ID","REF","ALT","QUAL","FILTER","INFO","FORMAT")
vcf_colnames<-c(vcf_colnames,fam_name)
###输出结果,vcf
names(temp_all)<-vcf_colnames
write.table(temp_all,file=paste0(out_name,"_","target_all.vcf"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE)
###筛选
total_cmd <- paste0("plink --vcf ",paste0(out_name,"_","target_all.vcf")," --const-fid --maf ",maf," --geno ",miss," --make-bed --out ",paste0(out_name,"_","target_all")) #直接用vcf筛选会有点问题,还是再转下bed格式
system(total_cmd,intern = FALSE)
temp <- read.table(paste0(out_name,"_","target_all",".fam"),header = F)
temp[,1] <- temp[,2]
write.table(temp,file=paste0(out_name,"_","target_all",".fam"),sep = " ",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE)
total_cmd <- paste0(".//plink --bfile ",paste0(out_name,"_","target_all") ," --recode vcf-iid --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)
#不同群体提取 按基因提取
for(g in target_gene$GENE){
extract <-dplyr::filter(target_gene ,GENE==g)
extract <- cbind(extract$CHROM,extract$start,extract$end,extract$GENE)
write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g))
system(total_cmd,intern = FALSE)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range "," --make-bed --out ",paste0(out_name,"_","target_all_",g))
system(total_cmd,intern = FALSE)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g))
system(total_cmd,intern = FALSE)
for (p in pop_name) {
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt --keep-cluster-names ",p," --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
system(total_cmd,intern = FALSE)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt --keep-cluster-names ",p," --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
system(total_cmd,intern = FALSE)
### total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt --keep-cluster-names ",p," --make-bed --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
### system(total_cmd,intern = FALSE)
}
}
for (p in pop_name) {
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --within plink_cluster.txt --keep-cluster-names ",p," --make-bed --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
system(total_cmd,intern = FALSE)
}
画LD heatmap
画LD用Tassel就可以话,上述代码也生成了vcf格式文件可以直接在Tassel中打开。这里用下LDheatmap。LDheatmap具体的教程比较多,自由度也比较高,可自行找些教程。当然也有在线能画LD图的,就是有点麻烦。这里具体就是如何将vcf或bed格式文件转到LDheatmap可用的格式。R代码如下:
library(LDheatmap)
library(genetics)
plot_LD <- function(vcf_file_name,title){
temp_vcf <- as.data.frame(data.table::fread(vcf_file_name,header = T,sep = '\t'))
if(nrow(temp_vcf )<=1) {print(paste0(vcf_file_name," have no snp&indel"));return()}###没有就结束
temp_snp <- temp_vcf[0,]
for (i in 1:nrow(temp_vcf)) {
temp <- temp_vcf[i,]
for(j in 10:ncol(temp_vcf)){
if(temp_vcf[i,j]=="0/0"){
temp[1,j] <- paste0(temp[1,4],"/",temp[1,4])
next
}
if(temp_vcf[i,j]=="0/1"){
temp[1,j] <- paste0(temp[1,4],"/",temp[1,5])
next
}
if(temp_vcf[i,j]=="1/1"){
temp[1,j] <- paste0(temp[1,5],"/",temp[1,5])
next
}
if(temp_vcf[i,j]=="./."){
temp[1,j] <- NA
next
}
}
temp_snp <- rbind(temp_snp,temp)
}
SNPpos <- temp_snp$POS
SNP <- as.data.frame(t(temp_snp[,10:ncol(temp_snp)]))
### 转换成LDheatmap能用的格式
for(i in 1:ncol(SNP)){
SNP[,i]<-as.genotype(SNP[,i])
}
###画图
LD <- LDheatmap(SNP, SNPpos,flip=TRUE,color=color.rgb(20),title = title,SNP.name=NULL)
}
for(g in target_gene$GENE){
plot_LD(paste0(out_name,"_","target_all_",g,".vcf"),g)
for (p in pop_name) {
plot_LD(paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p),".vcf"),paste0(g,"_",gsub(" ","_",p)))
}
}
计算群体遗传性指标
首先推荐用vcftools来计算,因为方便。但理论上知道公式用excel也可以算。
计算Fst,Pi,Tajimi'D based on SNP vcf file
Fst详解(具体计算步骤)
【群体遗传学】 π (pi)的计算
Fst
需要分成至少两个群体 每个群体一个文件,可以放所有的5个分类群体,也可以放几个感兴趣的群体。bash代码
vcftools --vcf combine_base_filtered_indel.vcf --weir-fst-pop pop_GJ.txt --weir-fst-pop pop_XI.txt --weir-fst-pop pop_admix.txt --weir-fst-pop pop_aus.txt --weir-fst-pop pop_bas.txt --out combine_base_filtered_indel.fst ###单位点计算
vcftools --vcf combine_base_filtered_indel.vcf --weir-fst-pop pop_GJ.txt --weir-fst-pop pop_XI.txt --weir-fst-pop pop_admix.txt --weir-fst-pop pop_aus.txt --weir-fst-pop pop_bas.txt --fst-window-size 5000--out combine_base_filtered_indel.fst ###按windows计算 单位是bp
也可以利用plink 计算单位点的Fst。
total_pop<-length(fst_pop)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --fst --within plink_cluster.txt"," --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)
###
for(p in fst_pop){
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --fst --within plink_cluster.txt --keep-cluster-names ",p," --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
system(total_cmd,intern = FALSE)
}
temp_table<-read.table(paste0(out_name,"_","target_all",".fst"),header=T)
fst <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
fst[,1] <- temp_table$FST
for (i in 1:total_pop) {
temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",fst_pop[i]),".fst"),header=T)
fst[,i+1] <- temp$FST
}
temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,fst)
names(temp) <- c("CHR","POS","fst",paste("fst",gsub(" ","_",fst_pop),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".fst.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存结果
结果和vcftools是一样的
pi
用vcftools计算 可以单位点计算 也可以按windows 计算
一般认为按windows更合理,如果研究目标是SNP,还是单点算合理。
在某个或多个群体中计算就加 --keep pop_GJ.txt --keep pop_XI.txt
vcftools --vcf combine_base_filtered_indel.vcf --site-pi --out combine_base_filtered_indel.pi ###单位点
vcftools --vcf combine_base_filtered_indel.vcf --window-pi-step 3000 --out combine_base_filtered_indel ###3000个snp为一个windows
双等位的基因 pi值计算较简单,参考下上面计算fst的代码就可以了
total_pop<-length(pop_name)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --freq --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)
###
for(p in pop_name){
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --freq --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
system(total_cmd,intern = FALSE)
}
temp_table<-read.table(paste0(out_name,"_","target_all",".frq"),header=T)
pi <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
pi[,1] <- 2*temp_table$MAF*(1-temp_table$MAF)*temp_table$NCHROBS/(temp_table$NCHROBS-1)
for (j in 1:total_pop) {
temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",pop_name[j]),".frq"),header=T)
pi[,j+1] <- 2*temp$MAF*(1-temp$MAF)*temp$NCHROBS/(temp$NCHROBS-1)
}
temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,pi)
names(temp) <- c("CHR","POS","pi",paste("pi",gsub(" ","_",pop_name),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".pi"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存结果
结果基本相同。
Tajima's D 中性检测的指标
在某个或多个群体中计算就加 --keep pop_GJ.txt --keep pop_XI.txt
vcftools --vcf combine_base_filtered_indel.vcf --TajimaD 5000 --out combine_base_filtered_indel.TajimaD ###按5000bp计算
这个用R计算有点麻烦,但是用Tassel计算很方便,把vcf输入就可以了。
单倍型分析
这里都拿到SNP数据集了,也就可以进行单倍型分析了,这里用的是Haploview,要先装java环境(但不支持1.8以上版本),有GUI比较友好。输入格式可以用plink格式(ped/map)。
但Haploview只能做SNP的分析,数量也不能太多。用前面的步骤生成了ped/info 文件,按linkage format 输入就可以了。
这里是利用了admix群体的48个SNP生成的结果。
但是我想要画单倍型的地图,这个软件不太方便,最后我选择了用beagle+dnasp+popart来做单倍型分型和画图。
beagle:用做单倍型推断和填充,需要java环境。
dnasp:用作单倍型分型,生成nex文件。
popart:用作单倍型画图,如果有地理信息,也可以加上去。
由于单倍型分析不允许有缺失,所以需要基因型填充,可以利用上面的vcf文件作为输入文件,输出为vcf.gz,解压后得到的vcf文件可以到后续利用。定相之后vcf里的"/"会变为"|"。另一个原因是3K水稻数据中并没有说明bed中的SNP是定相的,这里为了保险一点。
在powershell中输入命令
# gt 是需要的vcf文件,按基因位置分割 out 是输出文件前缀
java -Xss5m -jar .\beagle.28Jun21.220.jar gt="C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_target_all.vcf" out=phased
得到定相后的vcf文件然后生成fasta文件,由于dnasp的输入fasta文件需要等长,这里对有indel的地方加了"-"。每个样本能生成两条单链。
这里用R解决
phase_file <- "D:\\beagle\\phased.vcf" #vcf文件位置
out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #输出文件前缀
g <- "g" #输出前缀
vcf <- data.table::fread(phase_file,header = T)
temp <- c()
for (i in 1:nrow(vcf)) {
temp_value <- vcf[i,10:ncol(vcf)]
ref <- vcf[i,4]
alt <- vcf[i,5]
if(nchar(alt)>nchar(ref)){
alt <- substr(alt,2,nchar(alt))
ref <- paste(rep("-",nchar(alt)),collapse = "")
}
if(nchar(ref)>nchar(alt)){
ref <- substr(ref,2,nchar(ref))
alt <- paste(rep("-",nchar(ref)),collapse = "")
}
temp_value[,which(temp_value=="0|0")] <- paste(ref,ref)
temp_value[,which(temp_value=="0|1")] <- paste(ref,alt)
temp_value[,which(temp_value=="1|0")] <- paste(alt,ref)
temp_value[,which(temp_value=="1|1")] <- paste(alt,alt)
temp <- rbind(temp,temp_value)
}
temp <- as.data.frame(temp)
cat("",file=paste0(out_name,g,".fasta"),append = FALSE)
for (i in 1:ncol(temp)) {
c1 <- c()
c2 <- c()
for (j in 1:nrow(temp)) {
c1 <- c(c1,strsplit(temp[j,i]," ")[[1]][1])
c2 <- c(c2,strsplit(temp[j,i]," ")[[1]][2])
}
cat(paste(">",paste0(names(temp)[i],".1"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
cat(paste0(paste(c1,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
cat(paste(">",paste0(names(temp)[i],".2"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
cat(paste0(paste(c2,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
}
输入fasta后,Generate 选Haplotype 生成单倍型,得到单倍型文件(nex)
这里一共是得到315个单倍型,但大部分都只有1个样本。
由于需要统计样本的信息,这里把nex的freq字段另存为test.txt(),用来做统计,并生成TraitLabels字段,这里需要样本的分类信息plink_cluster.txt。
clust <- read.table("plink_cluster.txt",header = T,sep = "\t")
hap <- read.table("test.txt",header = F,sep = ":")
hap[,1] <- gsub("\\[","",hap[,1])
hap[,2] <- gsub("\\]","",hap[,2])
clut_hap <- clust[,2:3]
tem <- as.data.frame(strsplit(hap[,2]," "))[2,]
hap[,3] <-t(tem)
hap[,3] <- gsub("\\.1","",hap[,3])
hap[,3] <- gsub("\\.2","",hap[,3])
for (i in 1:nrow(clust)) {
flag <- 0
v1 <- 0
v2 <- 0
for (j in 1:nrow(hap)) {
if (flag==2) break
if(clust[i,1] %in% strsplit(hap[j,3]," ")[[1]]){
idx <- which(strsplit(hap[j,3]," ")[[1]] %in% clust[i,1])
if(length(idx)==1&flag==0) {v1 <- hap[j,1];flag <- flag+1;next}
if(length(idx)==1&flag==1) {v2 <- hap[j,1];flag <- flag+1;next}
if(length(idx)==2&flag==0) {v1 <- hap[j,1];v2 <- hap[j,1];next}
}
}
clut_hap[i,3] <- v1
clut_hap[i,4] <- v2
}
write.table(clut_hap,file=paste0(out_name,g,"_clut_hap.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE) #保存结果
names(clut_hap) <- c("ID","group"," "," ")
tem <- rbind(clut_hap[,c(3,2)],clut_hap[,c(4,2)])
tem <- tem[!is.na(tem[,2]),]
temp <- table(tem, exclude = 0)
out_matrix <- data.frame(matrix(temp,nrow = length(row.names(temp))))
row.names(out_matrix) <- row.names(temp)
names(out_matrix) <- gsub(" ","_",colnames(temp))
out_matrix$idx <- as.numeric(gsub("Hap_","",row.names(out_matrix)))
out_matrix <- out_matrix[order(out_matrix$idx),]
out_matrix <- out_matrix[,-ncol(out_matrix)]
cat("\n",file=paste0(out_name,g,".prenex"),append = FALSE)
cat("BEGIN TRAITS;\n",file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Dimensions NTRAITS=",ncol(out_matrix),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Format labels=yes missing=? separator=Tab",";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("TraitLabels\t",paste(names(out_matrix),collapse = "\t"),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Matrix","\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
write.table(out_matrix,file=paste0(out_name,g,".prenex"),sep = "\t",append = TRUE,row.names = TRUE, col.names = F, quote = FALSE)
把生成的文件全部内容贴到dnasp生成nex文件底部。
但我在用popart的时后,文件还是导不进去。发现需要在nex文件删掉一段。
然后就用popart可视化就好了。
这个太多了,手动去了下,只保留17个单倍型(修改TAXA,CHARACTERS,TRAITS三个字段)。
也可以把分类信息改成地理信息
我发现popart计算出的中性突变系数和Tassel是一样的。
都有单倍型信息了也顺便做下方差分析好了。
这里用RFGB下载到的GL表型试一下。
正常的关联分析应该用的是混合线性模型。
但这里只有一个基因上的单倍型信息,用整体的协方差或亲缘关系矩阵会把模型变复杂,就用简单点的方差分析好了。(这方法有待商榷)
pheno <- read.table("grain_length.txt",header = F,sep = "\t") #表型文件
genotype <- read.table(paste0(out_name,g,"_clut_hap.txt"),header = F,sep = "\t") #单倍型文件
thr <- 50 #去掉样本少的单倍型
tem <- merge(genotype[,c(1,4)],pheno,by=c("V1"))
colnames(tem) <- c("V1","V3","V2")
tem <- rbind(tem, merge(genotype[,c(1,3)],pheno,by=c("V1")))
tem <- merge(tem,genotype[,c(1,2)],by= c("V1"))
colnames(tem) <- c("ID","hap","pheno","cluster")
freq <- as.data.frame(table(tem$hap))
pass_filt <- as.character(freq[freq[,2]>=thr,1])
data <- tem[which(tem$hap %in% pass_filt),]
library(car)
library(agricolae)
mod <- aov(pheno~factor(hap),data) #方差分析
Anova(mod,type=3)
oneway.test(pheno~factor(hap),data,var.equal = F)#方差不齐的方差分析
out <- scheffe.test(mod,"factor(hap)") #可以用于非平衡数据的多重比较
out$groups
plot(out)
简单的方差分析可得hap_101的粒长较长。
做个总结
上述主要是提供了一个在所有7.1m的SNP数据集中,提取部分位点的标记,并计算LD和一些群体遗传指标,基本上是R+plink就解决了。
当然如果手头上有一些品种的重测序数据,也可以和3k水稻的SNP数据集放到一起进行比对,相信这个数据集还有很多信息没有被挖掘,共勉。
R代码汇总
提取位点,群体指标计算
#基本参数
gene_list <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_len.txt" #目标区间
out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #输出文件前缀
data_name <- c(".\\data\\base_filtered_v0.7",".\\data\\Nipponbare_indel") #SNP数据集,输入多个就是合并
pop_name<-c("GJ","XI","admix","aus","bas","GJ XI","admix aus bas") #需要生成的群体
fst_pop <- c("GJ XI","admix aus bas") #计算fst需要的群体 需在pop_name出现过
#质控参数
maf <- 0.05
miss <- 0.4
color.rgb <- colorRampPalette(rev(c("white","red")),space="rgb") #color for LD
target_gene <- read.table(gene_list,header = T)
extract <- cbind(target_gene $CHROM,target_gene $start,target_gene $end,target_gene $GENE)
write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
###
#提取目标区域位点 并输出vcf文件 可以用Tassel 看LD
#查看样本是否缺少
tem <- c()
for (i in 1:length(data_name)) {
temp <- read.table(paste0(data_name[i],".fam"),header = F)
temp <- nrow(temp)
tem <- c(tem,temp)
}
#用少了样本的数据集来keep
keep_file <- data_name[which(tem %in% min(tem))]
for (i in 1:length(data_name)) {
total_cmd <- paste0(".//plink --bfile ",data_name[i] ," --extract temp_extract.txt --range --keep ",paste0(keep_file,".fam")," --recode vcf-iid --out ",paste0(out_name,"_",i))
system(total_cmd,intern = FALSE)
}
temp_all <- read.table(paste0(out_name,"_",1,".vcf"),header = F,sep = '\t')
if (length(data_name)>=2){ #合并数据集
for (i in 2:length(data_name)) {
temp <- read.table(paste0(out_name,"_",i,".vcf"),header = F,sep = '\t')
temp_all<-rbind(temp_all ,temp )
}
temp_all<-temp_all[order(temp_all[,1],temp_all[,2]),] #按照染色体和POS排序
}
### 列名
fam_name<-read.table(paste0(keep_file,".fam") ,header = F,sep = ' ')[,1]
vcf_colnames<-c("#CHROM", "POS","ID","REF","ALT","QUAL","FILTER","INFO","FORMAT")
vcf_colnames<-c(vcf_colnames,fam_name)
###输出结果,vcf
names(temp_all)<-vcf_colnames
write.table(temp_all,file=paste0(out_name,"_","target_all.vcf"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE)
###筛选
total_cmd <- paste0("plink --vcf ",paste0(out_name,"_","target_all.vcf")," --const-fid --maf ",maf," --geno ",miss," --make-bed --out ",paste0(out_name,"_","target_all")) #直接用vcf筛选会有点问题,还是再转下bed格式
system(total_cmd,intern = FALSE)
temp <- read.table(paste0(out_name,"_","target_all",".fam"),header = F)
temp[,1] <- temp[,2]
write.table(temp,file=paste0(out_name,"_","target_all",".fam"),sep = " ",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE)
total_cmd <- paste0(".//plink --bfile ",paste0(out_name,"_","target_all") ," --recode vcf-iid --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)
#不同群体提取 按基因提取
for(g in target_gene$GENE){
extract <-dplyr::filter(target_gene ,GENE==g)
extract <- cbind(extract$CHROM,extract$start,extract$end,extract$GENE)
write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g))
system(total_cmd,intern = FALSE)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range "," --recode --make-bed --out ",paste0(out_name,"_","target_all_",g))
system(total_cmd,intern = FALSE)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g))
system(total_cmd,intern = FALSE)
for (p in pop_name) {
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt --keep-cluster-names ",p," --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
system(total_cmd,intern = FALSE)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt --keep-cluster-names ",p," --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
system(total_cmd,intern = FALSE)
### total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt --keep-cluster-names ",p," --recode --make-bed --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
### system(total_cmd,intern = FALSE)
}
}
for (p in pop_name) {
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --within plink_cluster.txt --keep-cluster-names ",p," --make-bed --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
system(total_cmd,intern = FALSE)
}
###
library(LDheatmap)
library(genetics)
plot_LD <- function(vcf_file_name,title){
temp_vcf <- as.data.frame(data.table::fread(vcf_file_name,header = T,sep = '\t'))
if(nrow(temp_vcf )<=1) {print(paste0(vcf_file_name," have no snp&indel"));return()}###没有就结束
temp_snp <- temp_vcf[0,]
for (i in 1:nrow(temp_vcf)) {
temp <- temp_vcf[i,]
for(j in 10:ncol(temp_vcf)){
if(temp_vcf[i,j]=="0/0"){
temp[1,j] <- paste0(temp[1,4],"/",temp[1,4])
next
}
if(temp_vcf[i,j]=="0/1"){
temp[1,j] <- paste0(temp[1,4],"/",temp[1,5])
next
}
if(temp_vcf[i,j]=="1/1"){
temp[1,j] <- paste0(temp[1,5],"/",temp[1,5])
next
}
if(temp_vcf[i,j]=="./."){
temp[1,j] <- NA
next
}
}
temp_snp <- rbind(temp_snp,temp)
}
SNPpos <- temp_snp$POS
SNP <- as.data.frame(t(temp_snp[,10:ncol(temp_snp)]))
### 转换成LDheatmap能用的格式
for(i in 1:ncol(SNP)){
SNP[,i]<-as.genotype(SNP[,i])
}
###画图
LD <- LDheatmap(SNP, SNPpos,flip=TRUE,color=color.rgb(20),title = title,SNP.name=NULL)
}
for(g in target_gene$GENE){
plot_LD(paste0(out_name,"_","target_all_",g,".vcf"),g)
for (p in pop_name) {
plot_LD(paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p),".vcf"),paste0(g,"_",gsub(" ","_",p)))
}
}
###
total_pop<-length(fst_pop)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --fst --within plink_cluster.txt"," --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)
###
for(p in fst_pop){
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --fst --within plink_cluster.txt --keep-cluster-names ",p," --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
system(total_cmd,intern = FALSE)
}
temp_table<-read.table(paste0(out_name,"_","target_all",".fst"),header=T)
fst <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
fst[,1] <- temp_table$FST
for (i in 1:total_pop) {
temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",fst_pop[i]),".fst"),header=T)
fst[,i+1] <- temp$FST
}
temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,fst)
names(temp) <- c("CHR","POS","fst",paste("fst",gsub(" ","_",fst_pop),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".fst.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存结果
###
total_pop<-length(pop_name)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --freq --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)
###
for(p in pop_name){
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --freq --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
system(total_cmd,intern = FALSE)
}
temp_table<-read.table(paste0(out_name,"_","target_all",".frq"),header=T)
pi <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
pi[,1] <- 2*temp_table$MAF*(1-temp_table$MAF)*temp_table$NCHROBS/(temp_table$NCHROBS-1)
for (j in 1:total_pop) {
temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",pop_name[j]),".frq"),header=T)
pi[,j+1] <- 2*temp$MAF*(1-temp$MAF)*temp$NCHROBS/(temp$NCHROBS-1)
}
temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,pi)
names(temp) <- c("CHR","POS","pi",paste("pi",gsub(" ","_",pop_name),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".pi"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存结果
###
vcf to fasta
phase_file <- "D:\\beagle\\phased.vcf.vcf"
g <- g
vcf <- data.table::fread(phase_file,header = T)
temp <- c()
for (i in 1:nrow(vcf)) {
temp_value <- vcf[i,10:ncol(vcf)]
ref <- vcf[i,4]
alt <- vcf[i,5]
if(nchar(alt)>nchar(ref)){
alt <- substr(alt,2,nchar(alt))
ref <- paste(rep("-",nchar(alt)),collapse = "")
}
if(nchar(ref)>nchar(alt)){
ref <- substr(ref,2,nchar(ref))
alt <- paste(rep("-",nchar(ref)),collapse = "")
}
temp_value[,which(temp_value=="0|0")] <- paste(ref,ref)
temp_value[,which(temp_value=="0|1")] <- paste(ref,alt)
temp_value[,which(temp_value=="1|0")] <- paste(alt,ref)
temp_value[,which(temp_value=="1|1")] <- paste(alt,alt)
temp <- rbind(temp,temp_value)
}
temp <- as.data.frame(temp)
cat("",file=paste0(out_name,g,".fasta"),append = FALSE)
for (i in 1:ncol(temp)) {
c1 <- c()
c2 <- c()
for (j in 1:nrow(temp)) {
c1 <- c(c1,strsplit(temp[j,i]," ")[[1]][1])
c2 <- c(c2,strsplit(temp[j,i]," ")[[1]][2])
}
cat(paste(">",paste0(names(temp)[i],".1"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
cat(paste0(paste(c1,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
cat(paste(">",paste0(names(temp)[i],".2"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
cat(paste0(paste(c2,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
}
nex traits 文件生成
clust <- read.table("plink_cluster.txt",header = T,sep = "\t")
hap <- read.table("test.txt",header = F,sep = ":")
hap[,1] <- gsub("\\[","",hap[,1])
hap[,2] <- gsub("\\]","",hap[,2])
clut_hap <- clust[,2:3]
tem <- as.data.frame(strsplit(hap[,2]," "))[2,]
hap[,3] <-t(tem)
hap[,3] <- gsub("\\.1","",hap[,3])
hap[,3] <- gsub("\\.2","",hap[,3])
for (i in 1:nrow(clust)) {
flag <- 0
v1 <- 0
v2 <- 0
for (j in 1:nrow(hap)) {
if (flag==2) break
if(clust[i,1] %in% strsplit(hap[j,3]," ")[[1]]){
idx <- which(strsplit(hap[j,3]," ")[[1]] %in% clust[i,1])
if(length(idx)==1&flag==0) {v1 <- hap[j,1];flag <- flag+1;next}
if(length(idx)==1&flag==1) {v2 <- hap[j,1];flag <- flag+1;next}
if(length(idx)==2&flag==0) {v1 <- hap[j,1];v2 <- hap[j,1];next}
}
}
clut_hap[i,3] <- v1
clut_hap[i,4] <- v2
}
write.table(clut_hap,file=paste0(out_name,g,"_clut_hap.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE) #保存结果
names(clut_hap) <- c("ID","group"," "," ")
tem <- rbind(clut_hap[,c(3,2)],clut_hap[,c(4,2)])
tem <- tem[!is.na(tem[,2]),]
temp <- table(tem, exclude = 0)
out_matrix <- data.frame(matrix(temp,nrow = length(row.names(temp))))
row.names(out_matrix) <- row.names(temp)
names(out_matrix) <- gsub(" ","_",colnames(temp))
out_matrix$idx <- as.numeric(gsub("Hap_","",row.names(out_matrix)))
out_matrix <- out_matrix[order(out_matrix$idx),]
out_matrix <- out_matrix[,-ncol(out_matrix)]
cat("\n",file=paste0(out_name,g,".prenex"),append = FALSE)
cat("BEGIN TRAITS;\n",file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Dimensions NTRAITS=",ncol(out_matrix),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Format labels=yes missing=? separator=Tab",";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("TraitLabels\t",paste(names(out_matrix),collapse = "\t"),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Matrix","\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
write.table(out_matrix,file=paste0(out_name,g,".prenex"),sep = "\t",append = TRUE,row.names = TRUE, col.names = F, quote = FALSE)
方差分析
pheno <- read.table("grain_length.txt",header = F,sep = "\t")
genotype <- read.table(paste0(out_name,g,"_clut_hap.txt"),header = F,sep = "\t")
thr <- 50
tem <- merge(genotype[,c(1,4)],pheno,by=c("V1"))
colnames(tem) <- c("V1","V3","V2")
tem <- rbind(tem, merge(genotype[,c(1,3)],pheno,by=c("V1")))
tem <- merge(tem,genotype[,c(1,2)],by= c("V1"))
colnames(tem) <- c("ID","hap","pheno","cluster")
freq <- as.data.frame(table(tem$hap))
pass_filt <- as.character(freq[freq[,2]>=thr,1])
data <- tem[which(tem$hap %in% pass_filt),]
library(car)
library(agricolae)
mod <- aov(pheno~factor(hap),data)
Anova(mod,type=3)
oneway.test(pheno~factor(hap),data,var.equal = F)
out <- scheffe.test(mod,"factor(hap)") #可以用于非平衡数据的多重比较
out$groups
plot(out)
吐槽一下,的这个界面太不好写了,也不好看。代码中难免会有多个空格,逗号不对啥的,请见谅。