本文方法是根据"Ageing LTR insertions"(https://github.com/SIWLab/Lab_Info/wiki/Ageing-LTR-insertions)这篇文章提及的方法进行的. 但该文很多部分都是简略介绍,提供的script也已经失效,因而只能自己重新写脚本.指导思想看该文,本文只补充实操部分.
用LTRharvest寻找LTR后,获得注释文件EGL.gff
其中有long_terminal_repeat的行为LTR所在注释行, 用此行信息提取出LTR对,每对生成一个文件
LTRharvest的结果又一个坑爹的地方,它会不管原来的序列名称是什么,统一替换为seq0, seq1, seq2......等
提取前,首先要将gff的seq还原为原来的染色体名字,将前面有#的注释行删除. 在第一行加上"###"号,以供awk识别.
mkdir ltr
cd ltr
cp EGL.gff.rename
ulimit -n 100000 #解除文件数限制,否则报错
awk '/###/{filename=NR".txt"}; /long_terminal_repeat/{gsub(/Parent=/,"");print $1,$4,$5,$9 >filename}' OFS="\t" EGL.gff.rename #生成每对ltr的bed
ls | grep txt | sort -n > file.list #生成bed的列表
cp file.list ..
cd ..
目标是用bedtools提取每对LTR,再用muscle比对,这里写python生成bash:
filelistname = 'F:/sequence/EGL_hub/file.list'
fileresult = filelistname+".sh"
f = open(fileresult , "w")
with open(filelistname,'r') as r:
lines=r.read().splitlines()
i = 0
for l in lines:
i= i+1
f.write("bedtools getfasta -fi ../ensete_glaucum.assembly.fna -bed " + l + " -fo retroltr" + str(i) +".fa -name"+ '\n')
f.write("muscle -in retroltr" + str(i) +".fa -out retroltr" + str(i) +".fa.afa" + '\n')
将所有比对结果(.afa结尾)放到一个新的文件夹.这里使用R包ape进行计算:
mkdir ../EGL.ltr
cp *afa ../EGL.ltr
R包代码:
library(ape)
library(xlsx)
setwd('F:/sequence/ltr_time/ltr.alignment') ###The path should be changed
list <- list.files()
fas.F1 = read.FASTA(list[1])
mat1 = dist.dna(fas.F1,as.matrix = T, model = "K80")
merge.data = mat1[1,2]
list <- list.files()
mutate_rate <- 1.3e-8 #根据水稻的突变率,(Ma, 2004),可以根据物种改变
time1 = merge.data/(2*mutate_rate)
v1.merge = c(list[1],merge.data, time)
dir = paste("./",list,sep="")
n = length(dir)
for (i in 2:n){
fas.F.new = read.FASTA(list[i])
mat.new = dist.dna(fas.F.new, as.matrix = T, model = "K80")
time = mat.new[1,2]/(2*mutate_rate)
v.new = c(list[i], mat.new[1,2], time)
v1.merge = rbind(v1.merge, v.new)
}
write.xlsx(v1.merge,file = 'F:/sequence/ltr_time/EGL.xls')
更详细的描述可见:https://github.com/wangziwei08/LTR-insertion-time-estimation
如果使用了这个插入时间估算方法,希望能引用以下文章:
A chromosome-level reference genome of Ensete glaucum gives insight into diversity and chromosomal and repetitive sequence evolution in the Musaceae