Prokka: rapid prokaryotic genome annotation
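Whole-genome annotation is the process of identifying features of interest in a set of genomic DNA sequences and labelling them with useful information. Prokka is a software tool that rapidly annotates bacterial, archaeal and viral genomes and produces standards-compliant output files.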
conda create -n prokka -c conda-forge -c bioconda -c defaults prokka=1.14
# Version 1.13 complains that blastp is < 2.2 even though blastp 2.10 is actually installed
Test
Type prokka and it should output its help screen.
Type prokka --version and you should see an output like prokka 1.x
Type prokka --listdb and it will show you what databases it has installed to use.
If the following error appears at run time, reinstall with the command above.
[20:43:45] Prokka needs blastp 2.2 or higher. Please upgrade and try again.
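To confirm which blastp version is really on the PATH, the comparison can be done outside Prokka. This is only a sketch, not Prokka's own check: `version_ge` is a hypothetical helper, and it assumes GNU `sort` (for `-V`). Comparing component by component treats 2.10 as newer than 2.2, whereas reading both as decimals would not.

```shell
# Hypothetical helper (not part of Prokka): succeed if version $1 >= version $2.
# Assumes GNU sort, whose -V option orders dotted version strings.
version_ge() {
    # The pair "minimum, actual" is already in version order exactly when actual >= minimum.
    printf '%s\n%s\n' "$2" "$1" | sort -V -C
}

# Example: check the installed blastp, if there is one.
if command -v blastp >/dev/null 2>&1; then
    # "blastp -version" prints a first line such as "blastp: 2.10.0+".
    ver=$(blastp -version | awk 'NR==1{sub(/\+$/, "", $2); print $2}')
    if version_ge "$ver" 2.2; then
        echo "blastp $ver is new enough"
    else
        echo "blastp $ver is too old"
    fi
fi
```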
prokka contigs.fa
# Look for a folder called PROKKA_yyyymmdd (today's date) and look at stats
prokka --force --outdir mydir --prefix mygenome contigs.fa
time prokka --force --cpus 100 --outdir ecoli_prokka --prefix ecoli ../Ecoli_k12/Ecoli_k12.fasta
# E. coli: 1 min 30 s with 8 cores, 50 s with 100 cores
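When many genomes need the same treatment, the per-genome command can be generated rather than typed. The sketch below only prints commands (a dry run). `prokka_batch_cmds` is a hypothetical helper, and the `.fa`/`.fasta` naming and the default of 8 CPUs are assumptions, not anything Prokka requires.

```shell
# Hypothetical helper: print (do not run) one prokka command per FASTA file in a directory.
# Usage: prokka_batch_cmds INDIR OUTROOT [CPUS]
# Assumes genomes are named <sample>.fa or <sample>.fasta; the default of 8 CPUs is arbitrary.
prokka_batch_cmds() (
    indir=$1
    outroot=$2
    cpus=${3:-8}
    for fa in "$indir"/*.fa "$indir"/*.fasta; do
        [ -e "$fa" ] || continue          # skip glob patterns that matched nothing
        sample=$(basename "$fa")
        sample=${sample%.*}               # strip the extension
        printf 'prokka --force --cpus %s --outdir %s --prefix %s %s\n' \
            "$cpus" "$outroot/$sample" "$sample" "$fa"
    done
)
```

Review the printed lines first, then pipe them to `sh` to run them one after another.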
nohup time prokka --force --centre X --compliant --cpus 80 --kingdom Viruses --outdir Viruses_kingdom --prefix
../Viral_prediction/Virsorter_Virfinder_Deepvirfinder_share_at_least_two_method.fa &>prokka.log&
nohup time prokka --cpus 80 --kingdom Viruses --outdir Viruses_kingdom ../Viral_prediction/Virsorter_Virfinder_Deepvirfinder_share_at_least_two_method.fa &>prokka.log&
Contig ID must <= 37 chars long: k141_4519235_length_122628_cov_55.0330
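Before launching a long run it can help to list the contig IDs that would trigger this error. A minimal sketch, assuming the ID is the header text up to the first whitespace and taking the 37-character limit from the message above; `long_contig_ids` is a hypothetical helper.

```shell
# Hypothetical helper: list contig IDs longer than a limit (default 37 characters).
# Usage: long_contig_ids FILE.fa [MAX]
# The ID is taken to be the header text up to the first whitespace.
long_contig_ids() {
    awk -v max="${2:-37}" '
        /^>/ {
            id = substr($1, 2)                    # drop the leading ">"
            if (length(id) > max) print length(id) "\t" id
        }' "$1"
}
```

Each output line gives the ID length and the ID; empty output means no contig exceeds the limit.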
--rnammer Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)
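# The *.ffn file holds the nucleotide sequences of all predicted genes; matching a keyword in the header comment pulls out the corresponding sequences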
# 5S ribosomal RNA
# 16S ribosomal RNA
# 23S ribosomal RNA
# tRNA
# Extract 16S
for i in *ffn;do bioawk -c fastx -v sample=${i%.ffn} '$comment~/16S ribosomal RNA/{print ">"sample"_"$name"_len"length($seq)"\t"$comment"\n"$seq}' $i >../Prokka_16S/${i%.ffn}_16S.fna;done &
# Build the PM seq.list and metadata.tsv (header line: ID"\t"Group)
mkdir -p ../PM_out; grep -c ">" *|awk -F: '$2!=0{print $1}'|while read i ; do printf "${i%.fna}\t$PWD/$i\n";done >../PM_out/16S_seq.list
# Build metadata.tsv
awk 'BEGIN{print "ID\tGroup"}{print $1"\t"$1}' 16S_seq.list >16S_metadata.tsv
# Extract 23S
for i in *ffn;do res=$(bioawk -c fastx -v sample=${i%.ffn} '$comment~/23S ribosomal RNA/{print ">"sample"_"$name"_len"length($seq)"\t"$comment"\n"$seq}' $i);if [ -n "$res" ];then echo "$res" >../Prokka_23S/${i%.ffn}_23S.fna;fi;done
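Where bioawk is not installed, the same keyword extraction can be approximated with plain awk. This is a sketch rather than a drop-in replacement: `extract_by_header` is a hypothetical helper that mimics the header format used in the bioawk commands above and joins line-wrapped sequences.

```shell
# Hypothetical helper: print FASTA records whose header comment matches a pattern.
# Plain awk stand-in for the bioawk commands; handles line-wrapped sequences.
# Usage: extract_by_header FILE.ffn 'PATTERN' SAMPLE
extract_by_header() {
    awk -v pat="$2" -v sample="$3" '
        function emit() {
            if (keep) printf ">%s_%s_len%d\t%s\n%s\n", sample, name, length(seq), comment, seq
        }
        /^>/ {
            emit()                                # write the previous record if it matched
            name = substr($1, 2)                  # ID up to the first whitespace
            comment = $0
            sub(/^[^ \t]+[ \t]*/, "", comment)    # the rest of the header line
            keep = (comment ~ pat)
            seq = ""
            next
        }
        { seq = seq $0 }                          # join wrapped sequence lines
        END { emit() }' "$1"
}
```

For example, `extract_by_header sample.ffn '23S ribosomal RNA' sample` prints nothing when the genome has no 23S hit, so the output can be tested with `[ -n ... ]` in the same way as above.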
# PM-pipeline
nohup time PM-pipeline -D S -d 0.97 -m 16S_metadata.tsv -i 16S_seq.list -o 16S_Silva_out -f F -t 100 -L 123456 &>16S.log
# -d: sequence similarity threshold for the alignment
# Statistics
[u@h@Single_Sample]$ for i in *16S;do awk -v id=${i%_16S} 'NR==2{$1=$2=$3="";gsub(/; /,";",$0);print id,$0}' ${i}/classification.txt;done > /mnt/nfs/yutao/972Isolates/829_Comp95_Cont5_Isolates/829_Comp95_Cont5_Isolates_16S_Taxonomy.tsv
The *.txt files record how many features of each type were found; they can be merged into one table for statistics.
[u@h@101raw_genomes_processed_prokka]$ cat Y322-2.txt
organism: Genus species strain
contigs: 2122
bases: 13615552
CDS: 11428
rRNA: 5
tRNA: 90
tmRNA: 2
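Merging these summaries into a table can be sketched with awk. The column list below is taken from the example file above and is an assumption about which feature types matter; `merge_prokka_stats` is a hypothetical helper, and counts missing from a file are written as 0.

```shell
# Hypothetical helper: merge Prokka "key: value" summary files into one TSV.
# Usage: merge_prokka_stats FILE.txt [FILE.txt ...]
# Columns are fixed to the feature types shown in the example; missing counts become 0.
merge_prokka_stats() {
    awk -F': ' '
        function clear_vals(   k) { for (k in val) delete val[k] }
        function row(   i, g) {
            g = file
            sub(/.*\//, "", g)        # drop the directory
            sub(/\.txt$/, "", g)      # drop the extension
            printf "%s", g
            for (i = 1; i <= n; i++) printf "\t%s", ((cols[i] in val) ? val[cols[i]] : 0)
            printf "\n"
        }
        BEGIN {
            n = split("contigs bases CDS rRNA tRNA tmRNA", cols, " ")
            printf "genome"
            for (i = 1; i <= n; i++) printf "\t%s", cols[i]
            printf "\n"
        }
        FNR == 1 { if (file != "") row(); file = FILENAME; clear_vals() }
        { val[$1] = $2 }
        END { if (file != "") row() }' "$@"
}
```

Running `merge_prokka_stats *.txt > prokka_stats.tsv` in the output directory gives one row per genome.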
(base) [yutao@myosin Genome_integration]$ (printf "genome\ttotal_num\tnames\tlocus\tlength\n"; head 826.id |while read i;do num=$(grep -c "16S ribosomal RNA" Prokka_out/${i%.fna}.tsv);locus=($(grep "16S ribosomal RNA" Prokka_out/${i%.fna}.tsv|awk '{print $1}'));name=($(grep "16S ribosomal RNA" Prokka_out/${i%.fna}.tsv|awk '{print $4}'));len=($(grep "16S ribosomal RNA" Prokka_out/${i%.fna}.tsv|awk '{print $3}'));if [ $num -eq 0 ];then locus=0;name=0;len=0;fi;printf "${i}\t$num\t${name[*]}\t${locus[*]}\t${len[*]}\n";done)
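The per-genome part of that one-liner can also be done in a single awk pass over the .tsv file. A hedged sketch: `summarize_16s` is a hypothetical helper, and it assumes a tab-separated file whose header names the `locus_tag`, `length_bp` and `product` columns, which is the layout Prokka 1.14 appears to write. It reports the hit count, locus tags and lengths, but not the name column.

```shell
# Hypothetical helper: summarise 16S rRNA hits in one Prokka .tsv file.
# Usage: summarize_16s FILE.tsv
# Assumes a tab-separated file with a header naming locus_tag, length_bp and product.
# Prints: count <TAB> locus tags <TAB> lengths (zeros when there is no hit).
summarize_16s() {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header names to positions
        $(col["product"]) ~ /16S ribosomal RNA/ {
            n++
            loci = loci (n > 1 ? " " : "") $(col["locus_tag"])
            lens = lens (n > 1 ? " " : "") $(col["length_bp"])
        }
        END {
            if (n == 0) { loci = 0; lens = 0 }
            printf "%d\t%s\t%s\n", n, loci, lens
        }' "$1"
}
```

Splitting on tabs keeps empty gene, EC_number and COG fields in place, so columns do not shift the way they do with whitespace splitting.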
(base) [yutao@myosin ColdSeepDB_ANI99_to_3179MAGs_Prokka]$ awk -F"\t" '$2!=0 && NR>1{print $NF}' tmp/3179MAGs_prokka_16S.tsv |awk '{m=$1;for(i=1;i<=NF;i++)if($i>m)m=$i;print "max of line",NR": ",m}' |awk '{s+=$NF}END{print s/NR}'
[15:52:55] This is prokka 1.14.6
[15:54:16] Changing illegal '|' to '_' in sequence name: HTR7|k141_1252805
The --centre X --compliant options
Contig ID must <= 37 chars long: k127_1068279_length_11625_cov_149.1380
[22:08:01] Please rename your contigs OR try '--centre X --compliant' to generate clean contig names.
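The other route the message suggests, renaming the contigs beforehand, can be sketched as follows. `rename_contigs` is a hypothetical helper: the sequential naming scheme and the mapping file are assumptions made for illustration, chosen so that the original names stay recoverable.

```shell
# Hypothetical helper: give every contig a short sequential ID and record the mapping.
# Usage: rename_contigs IN.fa OUT.fa MAP.tsv [PREFIX]
# MAP.tsv gets one line per contig: new ID <TAB> original ID.
rename_contigs() (
    awk -v prefix="${4:-contig}" -v map="$3" '
        /^>/ {
            old = substr($1, 2)                         # original ID up to the first whitespace
            short = sprintf("%s_%06d", prefix, ++n)     # e.g. contig_000001
            print short "\t" old > map
            print ">" short
            next
        }
        { print }' "$1" > "$2"
)
```

The renamed FASTA can then be passed to prokka, and MAP.tsv used afterwards to translate the annotations back to the original contig names.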
prokka github