导读
第一次做宏基因组还是在2018年,现在2021很多软件数据库大更新,这里搭建目前比较受认可的流程,即用Kneaddata(trimmomatic bowtie2) kraken2 humann3进行宏基因组数据基础分析。
一、质控 - KneadData
kneaddata
GITHUB: https://github.com/biobakery/kneaddata
kneaddata软件获取
conda install kneaddata
参考:Kneaddata, Fastqc, Trimmomatic, Bowtie2下载,安装,使用
kneaddata数据处理 - 脚本
#!/usr/bin/env bash
# 帮助文档
helpdoc(){
cat < -o
kneaddata数据处理 - 运行
mkdir rawdata
cd rawdata
# 软连接原始数据
ln -s /[route]/E100011007_L01_501_1.fq.gz 501_1.fq.gz
ln -s /[route]/E100011007_L01_501_2.fq.gz 501_2.fq.gz
cd ../
mkdir Result
mkdir Result/kneaddata
# qsub提交
qsub -cwd -l vf=100G,p=64 -q st.q -P PID -binding linear:8 q_kneaddata.sh -i 502
qsub -cwd -l vf=100G,p=64 -q st.q -P PID -binding linear:8 q_kneaddata.sh -i 502
qsub -cwd -l vf=100G,p=64 -q st.q -P PID -binding linear:8 q_kneaddata.sh -i 503
kneaddata数据处理 - 过程
Running Trimmomatic ...
Total reads after trimming ( /route/Result/kneaddata/502/502_1_kneaddata.trimmed.1.fastq ): 184963001
Total reads after trimming ( /route/Result/kneaddata/502/502_1_kneaddata.trimmed.2.fastq ): 184963001
Total reads after trimming ( /route/Result/kneaddata/502/502_1_kneaddata.trimmed.single.1.fastq ): 9683036
Total reads after trimming ( /route/Result/kneaddata/502/502_1_kneaddata.trimmed.single.2.fastq ): 15706257
Decontaminating ...
kneaddata数据处理 - 结果
运行中已舍弃trim文件、比对文件,再舍弃污染序列,舍弃unpaired序列。
参考:
KneadData质控、SortMeRNA去rRNA
Quality Control
二、物种分类 - Kraken2
kraken2
官网:https://ccb.jhu.edu/software/kraken2/
GITHUB:https://github.com/DerrickWood/kraken2/wiki
MANUAL:https://github.com/DerrickWood/kraken2/wiki/Manual
DATABASE: https://benlangmead.github.io/aws-indexes/k2
OLD_DB:http://ccb.jhu.edu/software/kraken2/downloads.shtml
KRAKEN2 2.1.1 (newest version):https://github.com/DerrickWood/kraken2
KRAKEN1/2文章:
标题:Improved metagenomic analysis with Kraken 2
译题:kraken2优化宏基因组分析
期刊:Genome Biology
时间:2019
标题:Kraken: ultrafast metagenomic sequence classification using exact alignments
译题:利用精确比对进行超快的宏基因组序列鉴定
期刊:Genome Biology
时间:2014
kraken2软件获取
# 激活
source /hwfssz1/ST_META/PN/hutongyuan/software/miniconda3/etc/profile.d/conda.sh
# 新建
conda create -n kraken2
# 启动
conda activate kraken2
# 安装
conda install kraken2 -c bioconda
kraken2 --help
kraken2 --version # version 2.0.7-beta
# 升级
conda update kraken2 -c bioconda
kraken2 --version # All requested packages already installed.
kraken2是一个perl程序
kraken2数据库获取
DATABASE: https://benlangmead.github.io/aws-indexes/k2
网页数据链接:
kraken2-build下载选项:
试图下载标准库 + PFP:
archaea, bacteria, viral, plasmid, human, UniVec_Core, protozoa, fungi, plant
古菌,细菌,病毒,质粒,人,载体,原生生物,真菌,植物的基因组
# 下载PLUSPFP
nohup wget -c https://genome-idx.s3.amazonaws.com/kraken/k2_pluspfp_20210127.tar.gz &
成功
kraken2数据处理 - 脚本
#!/usr/bin/env bash
# 帮助文档
helpdoc(){
cat < -o
kraken2数据处理 - 运行
qsub -cwd -l vf=100G,p=64 -q st.q -P P18 -binding linear:8 script/q_kraken2.sh -i 501
qsub -cwd -l vf=100G,p=64 -q st.q -P P18 -binding linear:8 script/q_kraken2.sh -i 502
qsub -cwd -l vf=100G,p=64 -q st.q -P P18 -binding linear:8 script/q_kraken2.sh -i 503
kraken2数据处理 - 过程
Loading database information... done.
59865565 sequences (11707.39 Mbp) processed in 697.071s (5152.9 Kseq/m, 1007.71 Mbp/m).
53282884 sequences classified (89.00%)
6582681 sequences unclassified (11.00%)
kraken2数据处理 -结果
out file
C E100011007L1C001R00300000012 Prevotella sp. oral taxon 299 str. F0039 (taxid 575
C E100011007L1C001R00300000018 Streptococcus sanguinis (taxid 1305) 100|100 0:6
C E100011007L1C001R00300000528 Streptococcus pneumoniae (taxid 1313) 100|100 0:1
C E100011007L1C001R00300000539 Streptococcus pneumoniae (taxid 1313) 100|100 0:2
C E100011007L1C001R00300002413 Streptococcus (taxid 1301) 63|100 0:15 1301:1
report file
11.00 6582681 6582681 U 0 unclassified
89.00 53282884 19056 R 1 root
88.91 53228791 72548 R1 131567 cellular organisms
87.30 52262033 397920 D 2 Bacteria
73.21 43825487 145121 D1 1783272 Terrabacteria group
68.81 41192272 92634 P 1239 Firmicutes
65.68 39320728 138891 C 91061 Bacilli
62.23 37256687 72766 O 186826 Lactobacillales
60.86 36431407 18819 F 1300 Streptococcaceae
60.81 36401562 11156413 G 1301 Streptococcus
report file explain
1 Percentage of fragments covered by the clade rooted at this taxon
2 Number of fragments covered by the clade rooted at this taxon
3 Number of fragments assigned directly to this taxon
4 A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. Taxa that are not at any of these 10 ranks have a rank code that is formed by using the rank code of the closest ancestor rank with a number indicating the distance from that rank. E.g., "G2" is a rank code indicating a taxon is between genus and species and the grandparent taxon is at the genus rank.
5 NCBI taxonomic ID number
6 Indented scientific name
举例说明:这里的A = B,第二列是cover,第三列是direct
多样本结果数据合并-依赖和脚本
脚本:Kraken2-output-manipulation
依赖:
conda create -n python3.7 python=3.7 -c bioconda
conda activate python3.7
python3
import numpy
import scipy
import argparse
import pandas
import collection
python3 -m pip install numpy
python3 -m pip install collection
vi kraken_multiple.py # copy code
vi kraken_multiple_taxa.py # copy code
python /hutongyuan/analysis/oral/script/kraken_multiple.py --help
python /hutongyuan/analysis/oral/script/kraken_multiple_taxa.py --help
参数:
optional arguments:
-h, --help show this help message and exit
-d DIRECTORY Enter a directory with kraken summary reports
-r {U,R,D,K,P,C,O,F,G,S} Enter a rank code
-c {1,2,3,4,5,6} Enter the column number in the report you would like to include in the output
-o OUTPUT Enter the output file name
col1: Percentage of fragments covered by the clade rooted at this taxon
col2: Number of fragments covered by the clade rooted at this taxon
col3: Number of fragments assigned directly to this taxon
col4: A rank code
col5: NCBI taxonomic ID number (this doesnt make sense to report for every column but its possible)
col6: Indented scientific name
合并数据
# 种水平
python /script/kraken_multiple_taxa.py \
-d kraken2_report -r S -c 2 -o kraken2_species_tax.tsv &
# 属水平
python /script/kraken_multiple_taxa.py \
-d kraken2_report -r G -c 2 -o kraken2_genus_tax.tsv &
# 门水平
python /script/kraken_multiple_taxa.py \
-d kraken2_report -r P -c 2 -o kraken2_phylum_tax.tsv &
## 修改格式
cat kraken2_report_merge_raw/kraken2_species_tax.tsv | sed -e "s/\[//g;s/\]//g;s/'//g;s|\t|,|g" | sed -e "s/kraken2_report\///g;s/_report.tsv//g" > kraken2_report_merge_clean/kraken2_species_tax.tsv
cat kraken2_report_merge_raw/kraken2_genus_tax.tsv | sed -e "s/\[//g;s/\]//g;s/'//g;s|\t|,|g" | sed -e "s/kraken2_report\///g;s/_report.tsv//g" > kraken2_report_merge_clean/kraken2_genus_tax.tsv
cat kraken2_report_merge_raw/kraken2_phylum_tax.tsv | sed -e "s/\[//g;s/\]//g;s/'//g;s|\t|,|g" | sed -e "s/kraken2_report\///g;s/_report.tsv//g" > kraken2_report_merge_clean/kraken2_phylum_tax.tsv
结果
cat kraken2_phylum_tax.tsv | wc -l
74
cat kraken2_genus_tax.tsv | wc -l
3055
cat kraken2_species_tax.tsv | wc -l
17092
注释结果包含所有分类,所以该kraken数据库种有:74个门,3055个属,17092个种。举例,门水平注释,合并,格式转换结果如下:
更多KRAKEN2:
Kraken2:宏基因组快速物种注释神器
Kraken2+Bracken(Kraken2联合Bracken宏基因组物种注释)
Kraken拓展工具KrakenTools
bracken
官网:https://ccb.jhu.edu/software/bracken/index.shtml
文章:Bracken: estimating species abundance in metagenomics data. peerJ Computer Science. 2017
bracken软件获取
conda create -n bracken
conda activate bracken
conda install bracken -c bioconda
bracken使用
source /your_route/miniconda3/etc/profile.d/conda.sh
conda activate bracken
mkdir ./kraken_quant/bracken
for i in `ls ./kraken_quant/$infile/`; do
base=${i%_report.txt}
bracken \
-d /public/home/zzumgg03/huty/projects/oral/kraken_database \
-i ./kraken_quant/$infile/$i \
-t 30 \
-o ./kraken_quant/bracken/${base}_report_bracken.txt \
-w ./kraken_quant/bracken/${base}_report_bracken.report
done
参数:
-d kraken2数据库地址
-i kraken2 report文件
-t 线程数
-o 输出结果
-w 输出报告
bracken输出文件
1 sample_report_bracken.txt
name taxonomy_id taxonomy_lvl kraken_assigned_reads added_reads new_est_reads fraction_total_reads
s__4 217 S 489584 44999 534583 0.02665
s__159 154 S 392118 123430 515548 0.02570
s__6 239 S 339437 143082 482519 0.02405
2 sample_report_bracken.report
100.00 20062259 0 R 1 root
100.00 20062259 0 R1 2 d__Bacteria
36.07 7237346 0 P 5 p__Firmicutes
36.07 7237346 0 C 11 c__Bacilli
32.20 6459679 0 O 27 o__Lactobacillales
29.05 5828230 0 F 55 f__Streptococcaceae
29.05 5827410 0 G 86 g__Streptococcus
2.66 534583 534583 S 217 s__4
2.57 515548 515548 S 154 s__159
3 过程文件
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i ./kraken_quant/merge/ERR589349.1_report.txt -o ./kraken_quant/bracken/ERR589349.1_report_bracken.txt -k /public/home/zzumgg03/huty/projects/oral/kraken_database_195/database100mers.kmer_distrib -l S -t 30
>> Checking report file: ./kraken_quant/merge/ERR589349.1_report.txt
PROGRAM START TIME: 04-26-2022 07:14:14
BRACKEN SUMMARY (Kraken report: ./kraken_quant/merge/ERR589349.1_report.txt)
>>> Threshold: 30
>>> Number of species in sample: 195
>> Number of species with reads > threshold: 192
>> Number of species with reads < threshold: 3
>>> Total reads in sample: 7570278
>> Total reads kept at species level (reads > threshold): 4931353
>> Total reads discarded (species reads < threshold): 46
>> Reads distributed: 5533145
>> Reads not distributed (eg. no species above threshold): 68
>> Unclassified reads: 2037065
BRACKEN OUTPUT PRODUCED: ./kraken_quant/bracken/ERR589349.1_report_bracken.txt
PROGRAM END TIME: 04-26-2022 07:14:14
Bracken complete.
三、功能注释 - HUMAnN3
HUMAnN3
官网:https://huttenhower.sph.harvard.edu/humann
GITHUB MANUAL: https://github.com/biobakery/humann
humann3软件获取
conda create -n humann3
conda activate humann3
conda install python=3.7
失败
conda create -n humann3 python=3.7
conda activate humann3
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --add channels biobakery
conda config --show
# 安装humann
conda install humann -c biobakery
humann --version # humann v3.0.0.alpha.3
humann3 --version # humann v3.0.0.alpha.3
metaphlan --version # MetaPhlAn version 3.0.13 (27 Jul 2021)
成功
bowtie2 --help报错,直接影响humann能否正常工作
miniconda3/envs/humann3/bin/bowtie2-align-s:
symbol lookup error: miniconda3/envs/humann3/bin/bowtie2-align-s:
undefined symbol: _ZN3tbb10interface78internal15task_arena_base19internal_initializeEv
(ERR): Description of arguments failed!
Exiting now ...
bowtie2-build-s: symbol lookup error, undefined symbol
conda安装bowtie2的报错:undefined symbol
一个成功的偏方
# bowtie2 error
# https://blog.csdn.net/u012110870/article/details/115166861
# 先启动metawrap, 再启动r403搞定
在另一台服务器中,metawrap安装不了,利用metawrap空环境python2.7仅安装bowtie2也报错。bowtie2默认会同时安装perl tbb=2021,下面试试安装tbb=2020,然后我成功了,嘎嘎嘎嘎。
conda create -y -n metawrap-env python=2.7
conda activate metawrap-env
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --add channels ursky
# conda install biopython blas=2.5 blast=2.6.0 bmtagger bowtie2 bwa checkm-genome fastqc kraken=1.1 kraken=2.0 krona=2.7 matplotlib maxbin2 megahit metabat2 pandas prokka quast r-ggplot2 r-recommended salmon samtools=1.9 seaborn spades trim-galore
# 失败
conda install bowtie2
bowtie2 --help # 失败
conda install tbb=2020.2
bowtie2 --help # 成功
humann3数据库获取
http://huttenhower.sph.harvard.edu/humann_data/
物种库:
旧http://huttenhower.sph.harvard.edu/humann2_data/chocophlan/
新http://huttenhower.sph.harvard.edu/humann_data/chocophlan/
功能库:
旧http://huttenhower.sph.harvard.edu/humann2_data/uniprot/uniref_annotated/
新http://huttenhower.sph.harvard.edu/humann_data/
注释库:
旧http://huttenhower.sph.harvard.edu/humann2_data/
新http://huttenhower.sph.harvard.edu/humann_data/
wget -c http://huttenhower.sph.harvard.edu/humann2_data/chocophlan/full_chocophlan.v296_201901.tar.gz
wget -c http://huttenhower.sph.harvard.edu/humann2_data/uniprot/uniref_annotated/uniref90_annotated_v201901.tar.gz
wget -c http://huttenhower.sph.harvard.edu/humann2_data/full_mapping_v201901.tar.gz
成功
解压,打开chocophlan看看
tar -zxvf chocophlan/full_chocophlan.v296_201901.tar.gz
ls chocophlan | wc -l
# 10670 个基因组序列文件
ls chocophlan | head
rm chocophlan/full_chocophlan.v296_201901.tar.gz
du -h chocophlan # 16G
g__Abditibacterium.s__Abditibacterium_utsteinense.centroids.v296_201901.ffn.gz
g__Abiotrophia.s__Abiotrophia_defectiva.centroids.v296_201901.ffn.gz
g__Abiotrophia.s__Abiotrophia_sp_HMSC24B09.centroids.v296_201901.ffn.gz
g__Absiella.s__Absiella_dolichum.centroids.v296_201901.ffn.gz
g__Absiella.s__Absiella_sp_AM09_45.centroids.v296_201901.ffn.gz
g__Abyssibacter.s__Abyssibacter_profundi.centroids.v296_201901.ffn.gz
g__Acaryochloris.s__Acaryochloris_marina.centroids.v296_201901.ffn.gz
g__Acaryochloris.s__Acaryochloris_sp_RCC1774.centroids.v296_201901.ffn.gz
g__Acetanaerobacterium.s__Acetanaerobacterium_elongatum.centroids.v296_201901.ffn.gz
解压,打开uniref_annotated看看
tar -zxvf uniref90_annotated_v201901.tar.gz
ll -alh
rm uniref90_annotated_v201901.tar.gz
-rw-r--r-- 1 hutongyuan ST_META 34G Jun 12 2019 uniref90_201901.dmnd
-rw-r--r-- 1 hutongyuan ST_META 20G Apr 22 10:21 uniref90_annotated_v201901..gz
解压,打开full_mapping文件看看
tar -zxvf full_mapping_v201901.tar.gz
rm full_mapping_v201901.tar.gz
ls -alh
配置数据库路径
humann_config # 更新前
humann_config --update database_folders nucleotide /hwfssz5/ST_META/P18Z10200N0423_ZYQ/MiceGutProject/hutongyuan/database/humann3/chocophlan/
humann_config --update database_folders protein /hwfssz5/ST_META/P18Z10200N0423_ZYQ/MiceGutProject/hutongyuan/database/humann3/uniref_annotated/
humann_config --update database_folders utility_mapping /hwfssz5/ST_META/P18Z10200N0423_ZYQ/MiceGutProject/hutongyuan/database/humann3/mapping/
humann_config # 更新后
humann3数据处理 - 脚本
#!/usr/bin/env bash
# 帮助文档
helpdoc(){
cat < -o
关于paired-end处理,manaual给出意见:
HUMAnN 3.0 and paired-end sequencing data
End-pairing relationships are currently not taken into account during HUMAnN 3.0's alignment steps. This is due to the fact that HUMAnN 3.0 strictly aligns reads to isolated coding sequences: either at the nucleotide level or through translated search. As such, it will frequently be the case that one read (READ1) will map inside a given coding sequence while its mate-pair (READ2) will not.
运行bug
更多:
HUMAnN 3.0 (alpha)安装及使用
Humann3 Paired end reads
Build an appropriate kraken2 database