How do you calculate human genome contamination in metagenomic sequencing data?

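KneadData is a quality-control tool for metagenomic sequencing data. Its two main functions are filtering reads with Trimmomatic and removing host sequences by aligning reads to a host genome with bowtie2. Here we use it to measure how much of a metagenomic sequencing dataset comes from the human genome.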

Install kneaddata with conda
conda install kneaddata

List the databases that can be downloaded and used directly
kneaddata_database --available

KneadData Databases ( database : build = location )
human_genome : bmtagger = http://huttenhower.sph.harvard.edu/kneadData_databases/Homo_sapiens_BMTagger_v0.1.tar.gz
human_genome : bowtie2 = http://huttenhower.sph.harvard.edu/kneadData_databases/Homo_sapiens_Bowtie2_v0.1.tar.gz
mouse_C57BL : bowtie2 = http://huttenhower.sph.harvard.edu/kneadData_databases/mouse_C57BL_6NJ_Bowtie2_v0.1.tar.gz
human_transcriptome : bowtie2 = http://huttenhower.sph.harvard.edu/kneadData_databases/Homo_sapiens_hg38_transcriptome_Bowtie2_v0.1.tar.gz
ribosomal_RNA : bowtie2 = http://huttenhower.sph.harvard.edu/kneadData_databases/SILVA_128_LSUParc_SSUParc_ribosomal_RNA_v0.1.tar.gz

Download the human genome database
kneaddata_database --download human_genome bowtie2 .

Download URL: http://huttenhower.sph.harvard.edu/kneadData_databases/Homo_sapiens_Bowtie2_v0.1.tar.gz
Downloading file of size: 3.44 GB

Show the help message
kneaddata -h

usage: kneaddata [-h] [--version] [-v] -i INPUT -o OUTPUT_DIR
[-db REFERENCE_DB] [--bypass-trim]
[--output-prefix OUTPUT_PREFIX] [-t <1>] [-p <1>]
[-q {phred33,phred64}] [--run-bmtagger] [--run-trf]
[--run-fastqc-start] [--run-fastqc-end] [--store-temp-output]
[--remove-intermediate-output] [--cat-final-output]
[--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--log LOG]
[--trimmomatic TRIMMOMATIC_PATH] [--max-memory MAX_MEMORY]
[--trimmomatic-options TRIMMOMATIC_OPTIONS]
[--bowtie2 BOWTIE2_PATH] [--bowtie2-options BOWTIE2_OPTIONS]
[--no-discordant] [--cat-pairs] [--reorder] [--serial]
[--bmtagger BMTAGGER_PATH] [--trf TRF_PATH] [--match MATCH]
[--mismatch MISMATCH] [--delta DELTA] [--pm PM] [--pi PI]
[--minscore MINSCORE] [--maxperiod MAXPERIOD]
[--fastqc FASTQC_PATH]

KneadData

optional arguments:
-h, --help show this help message and exit
-v, --verbose additional output is printed

global options:
--version show program's version number and exit
-i INPUT, --input INPUT
input FASTQ file (add a second argument instance to run with paired input files)
-o OUTPUT_DIR, --output OUTPUT_DIR
directory to write output files
-db REFERENCE_DB, --reference-db REFERENCE_DB
location of reference database (additional arguments add databases)
--bypass-trim bypass the trim step
--output-prefix OUTPUT_PREFIX
prefix for all output files
[ DEFAULT : SAMPLE_kneaddata ]
-t <1>, --threads <1>
number of threads
[ Default : 1 ]
-p <1>, --processes <1>
number of processes
[ Default : 1 ]
-q {phred33,phred64}, --quality-scores {phred33,phred64}
quality scores
[ DEFAULT : phred33 ]
--run-bmtagger run BMTagger instead of Bowtie2 to identify contaminant reads
--run-trf run TRF to remove tandem repeats
--run-fastqc-start run fastqc at the beginning of the workflow
--run-fastqc-end run fastqc at the end of the workflow
--store-temp-output store temp output files
[ DEFAULT : temp output files are removed ]
--remove-intermediate-output
remove intermediate output files
[ DEFAULT : intermediate output files are stored ]
--cat-final-output concatenate all final output files
[ DEFAULT : final output is not concatenated ]
--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
level of log messages
[ DEFAULT : DEBUG ]
--log LOG log file
[ DEFAULT : OUTPUT_DIR/$SAMPLE_kneaddata.log ]

trimmomatic arguments:
--trimmomatic TRIMMOMATIC_PATH
path to trimmomatic
[ DEFAULT : $PATH ]
--max-memory MAX_MEMORY
max amount of memory
[ DEFAULT : 500m ]
--trimmomatic-options TRIMMOMATIC_OPTIONS
options for trimmomatic
[ DEFAULT : SLIDINGWINDOW:4:20 MINLEN:70 ]
MINLEN is set to 70 percent of total input read length

bowtie2 arguments:
--bowtie2 BOWTIE2_PATH
path to bowtie2
[ DEFAULT : $PATH ]
--bowtie2-options BOWTIE2_OPTIONS
options for bowtie2
[ DEFAULT : --very-sensitive ]
--no-discordant do not include discordant alignments for pairs (ie one of the two pairs aligns)
[ DEFAULT : Discordant alignments are included ]
--cat-pairs concatenate pair files before aligning so reads are aligned as single end
[ DEFAULT : paired reads are aligned as pairs ]
--reorder order the sequences in the same order as the input
[ DEFAULT : With discordant paired alignments sequences are not ordered ]
--serial filter the input in serial for multiple databases so a subset of reads are processed in each database search

bmtagger arguments:
--bmtagger BMTAGGER_PATH
path to BMTagger
[ DEFAULT : $PATH ]

trf arguments:
--trf TRF_PATH path to TRF
[ DEFAULT : $PATH ]
--match MATCH matching weight
[ DEFAULT : 2 ]
--mismatch MISMATCH mismatching penalty
[ DEFAULT : 7 ]
--delta DELTA indel penalty
[ DEFAULT : 7 ]
--pm PM match probability
[ DEFAULT : 80 ]
--pi PI indel probability
[ DEFAULT : 10 ]
--minscore MINSCORE minimum alignment score to report
[ DEFAULT : 50 ]
--maxperiod MAXPERIOD
maximum period size to report
[ DEFAULT : 500 ]

fastqc arguments:
--fastqc FASTQC_PATH path to fastqc
[ DEFAULT : $PATH ]

Run kneaddata on a pair of FASTQ files (with the sample prefix in $name)
kneaddata -i ${name}_1.fastq.gz -i ${name}_2.fastq.gz -o kneaddata_out --trimmomatic Trimmomatic-0.36/ --remove-intermediate-output -db Homo_sapiens_Bowtie2

--remove-intermediate-output removes the intermediate files
-db the bowtie2 index of the human genome
--trimmomatic location of the Trimmomatic quality-control program
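
When several samples need the same treatment, the invocation can be generated per sample. The following is only a sketch: kneaddata_cmd is a hypothetical helper, the sample prefixes sampleA and sampleB are made up, and the paired files are assumed to be named <prefix>_1.fastq.gz and <prefix>_2.fastq.gz. It prints the commands instead of running them, so they can be checked first.

```shell
#!/usr/bin/env bash
# Sketch: print (not run) one kneaddata command per sample prefix.
# Assumes paired files named <prefix>_1.fastq.gz and <prefix>_2.fastq.gz.
# kneaddata_cmd is a hypothetical helper, not part of KneadData.
kneaddata_cmd() {
    local name="$1"
    printf 'kneaddata -i %s_1.fastq.gz -i %s_2.fastq.gz -o kneaddata_out --trimmomatic Trimmomatic-0.36/ --remove-intermediate-output -db Homo_sapiens_Bowtie2\n' "$name" "$name"
}

# Dry run over hypothetical sample prefixes.
for name in sampleA sampleB; do
    kneaddata_cmd "$name"
done
```

Once the printed commands look right, piping the output of the loop to bash would execute them one after another.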

Summarize the read counts after filtering
kneaddata_read_count_table --input kneaddata_out --output kneaddata_read_counts.out

cat kneaddata_read_counts.out

Sample	raw pair1	raw pair2	trimmed pair1	trimmed pair2	trimmed orphan1	trimmed orphan2	decontaminated Homo_sapiens pair1	decontaminated Homo_sapiens pair2	decontaminated Homo_sapiens orphan1	decontaminated Homo_sapiens orphan2	final pair1	final pair2	final orphan1	final orphan2
kneaddata	72577172.0	72577172.0	49961458.0	49961458.0	20031875.0	955031.0	48388320.0	48388320.0	21348792.0	901878.0	48388320.0	48388320.0	21348792.0	901878.0
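In this example, the raw data contain 72,577,172 paired-end reads, 49,961,458 read pairs remain after low-quality filtering, and 48,388,320 read pairs remain after removing human sequences. The human-removal step therefore takes out 1,573,138 intact pairs, about 3.1% of the quality-filtered pairs.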
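
The pair counts alone ignore the orphan reads, so a read-level estimate can be useful as well. The sketch below sums all trimmed reads (both mates plus orphans) and all final reads, and reports the difference as reads removed in the human-filtering step. It assumes the count table is tab-separated with the column names shown above and that every read is counted exactly once; human_fraction is a hypothetical helper, not a KneadData command.

```shell
#!/usr/bin/env bash
# Sketch only: summarise host removal from a kneaddata_read_count_table file.
# Assumes a tab-separated file whose first line holds the column names;
# human_fraction is a hypothetical helper, not part of KneadData.
human_fraction() {
    awk -F'\t' '
        NR == 1 {
            # Map each column name to its position.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        {
            # Reads that survived trimming: both mates plus orphans.
            trimmed  = $(col["trimmed pair1"]) + $(col["trimmed pair2"])
            trimmed += $(col["trimmed orphan1"]) + $(col["trimmed orphan2"])
            # Reads that survived host removal as well.
            final  = $(col["final pair1"]) + $(col["final pair2"])
            final += $(col["final orphan1"]) + $(col["final orphan2"])
            removed = trimmed - final
            # Output: sample, reads removed, percent of trimmed reads.
            if (trimmed > 0)
                printf "%s\t%.0f\t%.2f\n", $1, removed, 100 * removed / trimmed
        }
    ' "$1"
}
```

Called as human_fraction kneaddata_read_counts.out on the table from this example, it reports 1882512 removed reads, which is 1.56 percent of the reads that passed quality filtering.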

Source:
https://www.it610.com/article/1212381366600175616.htm
