prokka-原核及病毒基因组高效便捷注释

文章目录

  • 简介
  • 安装
  • 使用
  • 下游分析
    • 提取16S序列物种注释
  • 统计5S/16S/23S/tRNA数量
  • 统计基因的数量、位置等信息
  • 注意
  • 参考

简介

Prokka: rapid prokaryotic genome annotation
全基因组注释是在一组基因组DNA序列中识别感兴趣的特征,并用有用的信息标记它们的过程。Prokka是一款软件工具,可以快速注释细菌、古菌和病毒基因组,并生成符合标准的输出文件。

安装

conda create prokka -c conda-forge -c bioconda -c defaults prokka=1.14
# 1.13版本会报blastp <2.2,实际上已经安装blastp 2.10
  • Test
    Type prokka and it should output its help screen.
    Type prokka --version and you should see an output like prokka 1.x
    Type prokka --listdb and it will show you what databases it has installed to use.

  • 运行时出现如下报错则重新按照上述命令安装。

[20:43:45] Prokka needs blastp 2.2 or higher. Please upgrade and try again.

使用

 prokka contigs.fa
# Look for a folder called PROKKA_yyyymmdd (today's date) and look at stats
prokka --force --outdir mydir --prefix mygenome contigs.fa

time prokka  --force --cpu 100 --outdir ecoli_prokka --prefix ecoli  ../Ecoli_k12/Ecoli_k12.fasta
#  大肠杆菌8核1min30s,100核50s
  • 预测病毒基因
nohup time prokka --force  --centre X --compliant --cpus 80 --kingdom  Viruses --outdir Viruses_kingdom  --prefix 
../Viral_prediction/Virsorter_Virfinder_Deepvirfinder_share_at_least_two_method.fa &>prokka.log&

nohup time prokka  --cpus 80 --kingdom  Viruses --outdir Viruses_kingdom  ../Viral_prediction/Virsorter_Virfinder_Deepvirfinder_share_at_least_two_method.fa &>prokka.log&

Contig ID must <= 37 chars long: k141_4519235_length_122628_cov_55.0330

  • 默认使用Barrnap 预测rRNA

–rnammer Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)

下游分析

提取16S序列物种注释

# *ffn文件中保存着所有预测基因的核酸序列,可以通过匹配comment中的关键词来提取相应序列
# 5S ribosomal RNA
# 16S ribosomal RNA
# 23S ribosomal RNA
# tRNA
# 提取16S
for i in *ffn;do bioawk -c fastx -v sample=${i%.ffn} '$comment~/16S ribosomal RNA/{print ">"sample"_"$name"_len"length($seq)"\t"$comment"\n"$seq}' $i >../Prokka_16S/${i%.ffn}_16S.fna;done &
# 构建PM seq.list 和metadata.tsv(第一行ID"\t"Group)
 mkdir -p  ../PM_out; grep -c ">" *|awk -F: '$2!=0{print $1}'|while read i ; do printf "${i%.fna}\t$PWD/$i\n";done >../PM_out/16S_seq.list
# 构建metadata.tsv
awk 'BEGIN{print "ID\tGroup"}{print $1"\t"$1}' 16S_seq.list  >16S_metadata.tsv
# 提取23S
for i in *ffn;do res=$(bioawk -c fastx -v sample=${i%.ffn} '$comment~/23S ribosomal RNA/{print ">"sample"_"$name"_len"length($seq)"\t"$comment"\n"$seq}' $i);if [ -n "$res" ];then echo "$res" >../Prokka_23S/${i%.ffn}_23S.fna;fi;done

#  PM-pipline
nohup time PM-pipeline -D S -d 0.97 -m 16S_metadata.tsv -i 16S_seq.list -o 16S_Silva_out -f F  -t  100 -L 123456 &>16S.log
# -d 比对的相似性

# 统计
[u@h@Single_Sample]$
for i in *16S;do awk -v id=${i%_16S} 'NR==2{$1=$2=$3="";gsub(/; /,";",$0);print id,$0}' ${i}/classification.txt;done > /mnt/nfs/yutao/972Isolates/829_Comp95_Cont5_Isolates/829_Comp95_Cont5_Isolates_16S_Taxonomy.tsv

统计5S/16S/23S/tRNA数量

*.txt文件保存着各个基因种类数量,可以通过合并成表格来统计

[u@h@101raw_genomes_processed_prokka]$ cat Y322-2.txt
organism: Genus species strain
contigs: 2122
bases: 13615552
CDS: 11428
rRNA: 5
tRNA: 90
tmRNA: 2

统计基因的数量、位置等信息

  • 需要注意prokka注释的基因数量比较少

(base) [yutao@myosin Genome_integration]$ (printf "genome\ttotal_num\tnames\tlocus\tlength\n"; head 826.id |while read i;do num=$(grep -c "16S ribosomal RNA" Prokka_out/${i%.fna}.tsv);locus=($(grep  "16S ribosomal RNA" Prokka_out/${i%.fna}.tsv|awk '{print $1}'));name=($(grep  "16S ribosomal RNA" Prokka_out/${i%.fna}.tsv|awk '{print $4}'));len=($(grep  "16S ribosomal RNA" Prokka_out/${i%.fna}.tsv|awk '{print $3}'));if [ $num -eq 0 ];then locus=0;name=0;len=0;fi;printf "${i}\t$num\t${name[*]}\t${locus[*]}\t${len[*]}\n";done) 
  • 选择最大的16S序列求均值
(base) [yutao@myosin ColdSeepDB_ANI99_to_3179MAGs_Prokka]$ awk -F"\t" '$2!=0 &&  NR>1{print $NF}' tmp/3179MAGs_prokka_16S.tsv |awk '{m=$1;for(i=1;i<=NF;i++)if($i>m)m=$i;print "max of line",NR": ",m}' |awk '{s+=$NF}END{print s/NR}'

注意

  • 如果contig ID中包含"|“,将会被替换成”_"
[15:52:55] This is prokka 1.14.6
[15:54:16] Changing illegal '|' to '_' in sequence name: HTR7|k141_1252805
  • 序列id名称过长
    序列名称过程使用--centre X --compliant参数
Contig ID must <= 37 chars long: k127_1068279_length_11625_cov_149.1380
[22:08:01] Please rename your contigs OR try '--centre X --compliant' to generate clean contig names.

参考

prokka github

你可能感兴趣的:(prokka)