How do you estimate human-genome contamination in metagenomic sequencing data?

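KneadData is a quality-control tool for metagenomic sequencing data. Its main functions are filtering reads with Trimmomatic and removing host sequences by aligning reads to the host genome with Bowtie2. In this post we use it to estimate how much of a metagenomic dataset comes from the human genome.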

Install KneadData with conda

conda install -c bioconda kneaddata

List the ready-to-use databases that can be downloaded directly

kneaddata_database --available

KneadData Databases ( database : build = location )
human_genome : bmtagger = http://huttenhower.sph.harvard.edu/kneadData_databases/Homo_sapiens_BMTagger_v0.1.tar.gz
human_genome : bowtie2 = http://huttenhower.sph.harvard.edu/kneadData_databases/Homo_sapiens_Bowtie2_v0.1.tar.gz
mouse_C57BL : bowtie2 = http://huttenhower.sph.harvard.edu/kneadData_databases/mouse_C57BL_6NJ_Bowtie2_v0.1.tar.gz
human_transcriptome : bowtie2 = http://huttenhower.sph.harvard.edu/kneadData_databases/Homo_sapiens_hg38_transcriptome_Bowtie2_v0.1.tar.gz
ribosomal_RNA : bowtie2 = http://huttenhower.sph.harvard.edu/kneadData_databases/SILVA_128_LSUParc_SSUParc_ribosomal_RNA_v0.1.tar.gz

Download the human genome database

kneaddata_database --download human_genome bowtie2 .

Download URL: http://huttenhower.sph.harvard.edu/kneadData_databases/Homo_sapiens_Bowtie2_v0.1.tar.gz
Downloading file of size: 3.44 GB

kneaddata -h shows the help message

usage: kneaddata [-h] [--version] [-v] -i INPUT -o OUTPUT_DIR
                 [-db REFERENCE_DB] [--bypass-trim]
                 [--output-prefix OUTPUT_PREFIX] [-t <1>] [-p <1>]
                 [-q {phred33,phred64}] [--run-bmtagger] [--run-trf]
                 [--run-fastqc-start] [--run-fastqc-end] [--store-temp-output]
                 [--remove-intermediate-output] [--cat-final-output]
                 [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--log LOG]
                 [--trimmomatic TRIMMOMATIC_PATH] [--max-memory MAX_MEMORY]
                 [--trimmomatic-options TRIMMOMATIC_OPTIONS]
                 [--bowtie2 BOWTIE2_PATH] [--bowtie2-options BOWTIE2_OPTIONS]
                 [--no-discordant] [--cat-pairs] [--reorder] [--serial]
                 [--bmtagger BMTAGGER_PATH] [--trf TRF_PATH] [--match MATCH]
                 [--mismatch MISMATCH] [--delta DELTA] [--pm PM] [--pi PI]
                 [--minscore MINSCORE] [--maxperiod MAXPERIOD]
                 [--fastqc FASTQC_PATH]

KneadData

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         additional output is printed

global options:
  --version             show program's version number and exit
  -i INPUT, --input INPUT
                        input FASTQ file (add a second argument instance to run with paired input files)
  -o OUTPUT_DIR, --output OUTPUT_DIR
                        directory to write output files
  -db REFERENCE_DB, --reference-db REFERENCE_DB
                        location of reference database (additional arguments add databases)
  --bypass-trim         bypass the trim step
  --output-prefix OUTPUT_PREFIX
                        prefix for all output files
                        [ DEFAULT : $SAMPLE_kneaddata ]
  -t <1>, --threads <1>
                        number of threads
                        [ Default : 1 ]
  -p <1>, --processes <1>
                        number of processes
                        [ Default : 1 ]
  -q {phred33,phred64}, --quality-scores {phred33,phred64}
                        quality scores
                        [ DEFAULT : phred33 ]
  --run-bmtagger        run BMTagger instead of Bowtie2 to identify contaminant reads
  --run-trf             run TRF to remove tandem repeats
  --run-fastqc-start    run fastqc at the beginning of the workflow
  --run-fastqc-end      run fastqc at the end of the workflow
  --store-temp-output   store temp output files
                        [ DEFAULT : temp output files are removed ]
  --remove-intermediate-output
                        remove intermediate output files
                        [ DEFAULT : intermediate output files are stored ]
  --cat-final-output    concatenate all final output files
                        [ DEFAULT : final output is not concatenated ]
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        level of log messages
                        [ DEFAULT : DEBUG ]
  --log LOG             log file
                        [ DEFAULT : $OUTPUT_DIR/$SAMPLE_kneaddata.log ]

trimmomatic arguments:
  --trimmomatic TRIMMOMATIC_PATH
                        path to trimmomatic
                        [ DEFAULT : $PATH ]
  --max-memory MAX_MEMORY
                        max amount of memory
                        [ DEFAULT : 500m ]
  --trimmomatic-options TRIMMOMATIC_OPTIONS
                        options for trimmomatic
                        [ DEFAULT : SLIDINGWINDOW:4:20 MINLEN:70 ]
                        MINLEN is set to 70 percent of total input read length

bowtie2 arguments:
  --bowtie2 BOWTIE2_PATH
                        path to bowtie2
                        [ DEFAULT : $PATH ]
  --bowtie2-options BOWTIE2_OPTIONS
                        options for bowtie2
                        [ DEFAULT : --very-sensitive ]
  --no-discordant       do not include discordant alignments for pairs (ie one of the two pairs aligns)
                        [ DEFAULT : Discordant alignments are included ]
  --cat-pairs           concatenate pair files before aligning so reads are aligned as single end
                        [ DEFAULT : paired reads are aligned as pairs ]
  --reorder             order the sequences in the same order as the input
                        [ DEFAULT : With discordant paired alignments sequences are not ordered ]
  --serial              filter the input in serial for multiple databases so a subset of reads are processed in each database search

bmtagger arguments:
  --bmtagger BMTAGGER_PATH
                        path to BMTagger
                        [ DEFAULT : $PATH ]

trf arguments:
  --trf TRF_PATH        path to TRF
                        [ DEFAULT : $PATH ]
  --match MATCH         matching weight
                        [ DEFAULT : 2 ]
  --mismatch MISMATCH   mismatching penalty
                        [ DEFAULT : 7 ]
  --delta DELTA         indel penalty
                        [ DEFAULT : 7 ]
  --pm PM               match probability
                        [ DEFAULT : 80 ]
  --pi PI               indel probability
                        [ DEFAULT : 10 ]
  --minscore MINSCORE   minimum alignment score to report
                        [ DEFAULT : 50 ]
  --maxperiod MAXPERIOD
                        maximum period size to report
                        [ DEFAULT : 500 ]

fastqc arguments:
  --fastqc FASTQC_PATH  path to fastqc
                        [ DEFAULT : $PATH ]

Quality filtering and host-sequence removal

kneaddata -i ${name}_1.fastq.gz -i ${name}_2.fastq.gz -o kneaddata_out --trimmomatic Trimmomatic-0.36/ --remove-intermediate-output -db Homo_sapiens_Bowtie2

--remove-intermediate-output  delete the intermediate files
-db  Bowtie2 index of the human genome
--trimmomatic  location of the Trimmomatic program
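
In the command above, the shell variable name holds the sample prefix. With several samples it can be derived from the file names. The following is a minimal sketch, assuming the files are named <sample>_1.fastq.gz and <sample>_2.fastq.gz; the helpers kneaddata_cmd and list_cmds are hypothetical names introduced here, and the script only prints the commands (a dry run) instead of running KneadData.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the kneaddata command for one sample prefix.
# The paths mirror the command shown above; adjust them to your own layout.
kneaddata_cmd() {
  local name=$1
  printf 'kneaddata -i %s_1.fastq.gz -i %s_2.fastq.gz -o kneaddata_out --trimmomatic Trimmomatic-0.36/ --remove-intermediate-output -db Homo_sapiens_Bowtie2\n' "$name" "$name"
}

# Hypothetical helper: print one command per forward-read file in a directory.
list_cmds() {
  local dir=$1 fq
  for fq in "$dir"/*_1.fastq.gz; do
    [ -e "$fq" ] || continue            # skip the literal pattern when nothing matches
    kneaddata_cmd "${fq%_1.fastq.gz}"   # strip the suffix to get the sample prefix
  done
}

# Demo on empty placeholder files (not real sequencing data).
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
touch "$demo/sampleA_1.fastq.gz" "$demo/sampleA_2.fastq.gz" \
      "$demo/sampleB_1.fastq.gz" "$demo/sampleB_2.fastq.gz"
list_cmds "$demo"
```

Once the printed commands look right, they can be piped to a shell, for example list_cmds /path/to/fastq | bash.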

Summarize the read counts after filtering

kneaddata_read_count_table --input kneaddata_out --output kneaddata_read_counts.out

cat kneaddata_read_counts.out

Sample	raw pair1	raw pair2	trimmed pair1	trimmed pair2	trimmed orphan1	trimmed orphan2	decontaminated Homo_sapiens pair1	decontaminated Homo_sapiens pair2	decontaminated Homo_sapiens orphan1	decontaminated Homo_sapiens orphan2	final pair1	final pair2	final orphan1	final orphan2
kneaddata	72577172.0	72577172.0	49961458.0	49961458.0	20031875.0	955031.0	48388320.0	48388320.0	21348792.0	901878.0	48388320.0	48388320.0	21348792.0	901878.0

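In this example, the raw metagenomic data contain 72,577,172 read pairs; 49,961,458 pairs remain after low-quality filtering, and 48,388,320 pairs remain after filtering against the human genome. The difference, 49,961,458 − 48,388,320 = 1,573,138 pairs (about 3.1% of the quality-filtered pairs), is the number of pairs lost at the human-filtering step.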
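
The pair counts alone do not tell the whole story, because a pair in which only one mate aligns to the human genome leaves the other mate behind as an orphan. A read-level estimate sums pairs and orphans before and after decontamination. The following is a minimal sketch of that arithmetic, assuming a tab-separated table with the column names shown above and a single reference database; human_fraction is a hypothetical helper name, and the demo table simply repeats the numbers from this post.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: for each sample, print the number of reads removed
# between the "trimmed" and "decontaminated" columns and that number as a
# percentage of the trimmed reads.
# Assumption: only one reference database was used, so every column whose
# name starts with "decontaminated " belongs to the same filtering step.
human_fraction() {
  awk -F'\t' '
    NR == 1 {                                   # header: remember column roles
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^trimmed /) is_trimmed[i] = 1
        else if ($i ~ /^decontaminated /) is_clean[i] = 1
      }
      next
    }
    {
      trimmed = 0; clean = 0
      for (i in is_trimmed) trimmed += $i       # pairs (both mates) + orphans
      for (i in is_clean) clean += $i
      removed = trimmed - clean
      printf "%s\t%d\t%.2f\n", $1, removed, 100 * removed / trimmed
    }
  ' "$1"
}

# Demo table with the values reported above.
table=$(mktemp)
trap 'rm -f "$table"' EXIT
printf 'Sample\traw pair1\traw pair2\ttrimmed pair1\ttrimmed pair2\ttrimmed orphan1\ttrimmed orphan2\tdecontaminated Homo_sapiens pair1\tdecontaminated Homo_sapiens pair2\tdecontaminated Homo_sapiens orphan1\tdecontaminated Homo_sapiens orphan2\tfinal pair1\tfinal pair2\tfinal orphan1\tfinal orphan2\n' > "$table"
printf 'kneaddata\t72577172.0\t72577172.0\t49961458.0\t49961458.0\t20031875.0\t955031.0\t48388320.0\t48388320.0\t21348792.0\t901878.0\t48388320.0\t48388320.0\t21348792.0\t901878.0\n' >> "$table"

human_fraction "$table"
```

For the numbers in this post the script reports 1,882,512 removed reads, about 1.56% of the quality-filtered reads. On a real run the function would be called as human_fraction kneaddata_read_counts.out. Note that the percentage is relative to the reads that passed quality filtering, not to the raw reads.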

