Example: De Novo Assembly of SARS-CoV-2 from Second-Generation Sequencing (NGS) Data

  • Data locations
    • Directory
      • path: /opt/data/assembly
    • Raw data
      • /opt/data/assembly/test_1.fastq
      • /opt/data/assembly/test_2.fastq
    • SARS-CoV-2 Reference - NC_045512.2
      • /opt/data/assembly/refseq/2019_nCoV.fasta
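    • SARS-CoV-2 blastdb-nucl
      • /opt/data/assembly/refseq/2019_nCoV
        • Note: the blastn and VAPiD commands in this document pass /opt/data/assembly/refseq/SARS-CoV-2 as the database; point -db / --db at whichever prefix actually exists on your system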
  • Software paths
    • Installed with conda and symlinked into /usr/bin, so each tool can be called directly by name
    • python3
      • /opt/software/miniconda3/bin/python3.8
    • fastqc
      • /opt/software/miniconda3/bin/fastqc
    • trimmomatic
      • /opt/software/miniconda3/bin/trimmomatic
    • MultiQC
      • /opt/software/miniconda3/bin/multiqc
    • megahit
      • /opt/software/miniconda3/bin/megahit
    • QUAST
      • /opt/software/miniconda3/envs/quast/bin/quast
    • blastn
      • /opt/software/ncbi-blast-2.10.0+/bin/blastn
    • Sequence extraction script
      • /opt/software/extract_seq_f_fa.py
    • bowtie2
      • /opt/software/miniconda3/bin/bowtie2
      • /opt/software/miniconda3/bin/bowtie2-build
    • samtools
      • /opt/software/miniconda3/bin/samtools
    • weeSAM
      • /opt/software/weeSAM.py
    • VAPiD
      • /opt/software/VAPiD-master/vapid3.py
      • /opt/data/assembly/annotation/test.sbt
      • /opt/data/assembly/annotation/test_metadata.csv
        • Edit this file yourself to match your assembly results
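Before starting, it can help to confirm that the tools listed above are actually callable by name. The following is a minimal sketch for a POSIX shell; `missing_tools` is a hypothetical helper name introduced here, not part of any of the packages.

```shell
#!/bin/sh
# Print the name of every requested tool that cannot be found on PATH.
# (missing_tools is an illustrative helper, not an existing command.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example call for the tools this walkthrough relies on:
# missing_tools fastqc trimmomatic multiqc megahit quast blastn bowtie2 bowtie2-build samtools
```

An empty result means every tool was found; any name that is printed still needs to be installed or linked.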
  • Workflow
    • raw data
      • Second-generation sequencing data, VLP
      • Raw FASTQ files
    • Raw Data QC - FastQC
      • Quality check of the raw data
      • software: FastQC
      • version: 0.11.9
      • command:
        • mkdir 1_raw_fastqc
        • fastqc -o /home/test01/test/1_raw_fastqc/ /opt/data/assembly/*.fastq > /home/test01/test/0_logs/1_raw_qc.log 2>&1
    • Quality Control - Trimmomatic
      • Read cleaning: remove low-quality bases and adapter sequences
      • software: Trimmomatic
      • version: 0.39
      • adapter: /opt/software/miniconda3/share/trimmomatic-0.39-1/adapters/TruSeq3-PE-2.fa
      • Main parameters
        • TruSeq3-PE-2.fa: the most commonly used adapter file
        • https://www.jianshu.com/p/a8935adebaae
      • command:
        • mkdir 2_trim
        • trimmomatic PE -summary /home/test01/test/0_logs/2_trim_summary.txt /opt/data/assembly/test_1.fastq /opt/data/assembly/test_2.fastq -baseout /home/test01/test/2_trim/test.fastq.gz ILLUMINACLIP:/opt/software/miniconda3/share/trimmomatic-0.39-1/adapters/TruSeq3-PE-2.fa:2:30:10 SLIDINGWINDOW:4:15 LEADING:3 TRAILING:3 MINLEN:36 >/home/test01/test/0_logs/2_trim.log 2>&1
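The later steps read files named test_1P.fastq.gz and test_2P.fastq.gz, which come from the -baseout value: Trimmomatic writes paired (P) and unpaired (U) output for each mate. The sketch below derives the four names, assuming the base name is split at the first dot of the file name (which matches the names used in the later commands); `baseout_names` is a hypothetical helper.

```shell
#!/bin/sh
# Derive the four output files expected for a given Trimmomatic -baseout
# value: paired (P) and unpaired (U) reads for mate 1 and mate 2.
# Assumption: the file name is split at its first dot.
baseout_names() {
    dir=$(dirname "$1")
    file=$(basename "$1")
    core=${file%%.*}          # name up to the first dot, e.g. "test"
    ext=${file#"$core"}       # the rest, e.g. ".fastq.gz"
    for tag in 1P 1U 2P 2U; do
        printf '%s/%s_%s%s\n' "$dir" "$core" "$tag" "$ext"
    done
}

baseout_names /home/test01/test/2_trim/test.fastq.gz
```

Only the two *P files (reads whose mate also survived trimming) are passed on to FastQC, MEGAHIT and Bowtie2 in this workflow.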
    • Clean Data QC - FastQC
      • Quality check of the cleaned (high-quality) data
      • software: FastQC
      • version: 0.11.9
      • command:
        • mkdir 3_clean_fastqc
        • fastqc -o /home/test01/test/3_clean_fastqc/ /home/test01/test/2_trim/*P.fastq.gz > /home/test01/test/0_logs/3_clean_qc.log 2>&1
    • de novo Assembly - MEGAHIT
      • Reference-free (de novo) assembly
      • software: MEGAHIT
      • version: 1.2.9
      • command:
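        • Do not create the output directory beforehand: MEGAHIT creates it itself and stops if it already exists
        • megahit -1 /home/test01/test/2_trim/test_1P.fastq.gz -2 /home/test01/test/2_trim/test_2P.fastq.gz --min-contig-len 500 -o /home/test01/test/4_assembly/ > /home/test01/test/0_logs/4_assembly.log 2>&1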
    • Assembly Statistics - QUAST
      • Assembly statistics
      • software: QUAST
      • version: 5.0.2
      • command:
        • mkdir 5_quast
        • quast /home/test01/test/4_assembly/final.contigs.fa -o /home/test01/test/5_quast/ > /home/test01/test/0_logs/5_quast.log 2>&1
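Among the statistics QUAST reports is N50. To make that number concrete, the sketch below recomputes it directly from a FASTA file with awk and sort. It is an illustration of the definition only, not QUAST's implementation, and QUAST's value may differ if its minimum contig length threshold excludes some contigs; `n50` is a hypothetical helper name.

```shell
#!/bin/sh
# Compute N50 from a FASTA file: the contig length L such that contigs of
# length >= L together hold at least half of all assembled bases.
n50() {
    # First awk: one sequence length per line (handles multi-line records).
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    # Second awk: walk lengths from longest to shortest until half is reached.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run * 2 >= total) { print lens[i]; exit }
             }
         }'
}

# Example: n50 /home/test01/test/4_assembly/final.contigs.fa
```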
    • Blast to Reference Genome (NC_045512.2) - BLASTN
      • Align the assembled contigs against the SARS-CoV-2 reference genome and pick out the contigs that match
      • software: blastn
      • version: 2.10.0+
      • refseq: NC_045512.2
      • Column headers of the outfmt 6 tabular output
        • query acc.ver, query length, subject acc.ver, subject length, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score, subject title
      • command:
        • mkdir 6_viral_contigs
        • blastn -query /home/test01/test/4_assembly/final.contigs.fa -db /opt/data/assembly/refseq/SARS-CoV-2 -outfmt '6 qaccver qlen saccver slen pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle' -qcov_hsp_perc 50 -out /home/test01/test/6_viral_contigs/nCoV_blastn_6.tsv 
        • -qcov_hsp_perc 50: keep only HSPs that cover at least 50% of the query
      • Pick out the matching contigs
        • Select them from final.contigs.fa by contig ID
        • cat /home/test01/test/6_viral_contigs/nCoV_blastn_6.tsv | cut -f 1 | sort -u > /home/test01/test/6_viral_contigs/viral_list.txt
        • /opt/software/extract_seq_f_fa.py /home/test01/test/4_assembly/final.contigs.fa /home/test01/test/6_viral_contigs/viral_list.txt /home/test01/test/6_viral_contigs/viral_contigs.fa
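The ID list above keeps every contig with at least one reported hit. If a stricter selection is wanted, the table can be filtered on its columns first. The sketch below is one possible way, assuming the column order of the -outfmt string used above (percent identity in column 5, alignment length in column 6); `filter_hits` and the thresholds in the example call are hypothetical.

```shell
#!/bin/sh
# Print the unique query IDs (column 1) of hits whose percent identity
# (column 5) and alignment length (column 6) reach the given minimums.
filter_hits() {
    # $1 = BLAST table, $2 = min % identity, $3 = min alignment length
    awk -F '\t' -v id="$2" -v len="$3" \
        '$5 >= id && $6 >= len { print $1 }' "$1" | sort -u
}

# Example (thresholds are illustrative only):
# filter_hits /home/test01/test/6_viral_contigs/nCoV_blastn_6.tsv 90 500 > viral_list.txt
```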
    • Read distribution
      • Map the reads back to the viral contigs to check how many reads were used and how well the contigs are covered
      • software: Bowtie2
      • version: 2.4.1
      • software: samtools
      • version: 1.11
      • weeSAM
        • /opt/software/weeSAM.py
      • command:
        • mkdir -p 7_bowtie2/index
        • cp /home/test01/test/6_viral_contigs/viral_contigs.fa /home/test01/test/7_bowtie2/index/
        • Build the index
          • bowtie2-build /home/test01/test/7_bowtie2/index/viral_contigs.fa /home/test01/test/7_bowtie2/index/viral_contigs > /home/test01/test/0_logs/7_bowtie2_build.log 2>&1
        • reads mapping
          • bowtie2 -x /home/test01/test/7_bowtie2/index/viral_contigs -1 /home/test01/test/2_trim/test_1P.fastq.gz -2 /home/test01/test/2_trim/test_2P.fastq.gz | samtools sort -O bam -o - > /home/test01/test/7_bowtie2/viral_contigs_sorted.bam
        • Mapping rate statistics
          • samtools flagstat /home/test01/test/7_bowtie2/viral_contigs_sorted.bam > /home/test01/test/0_logs/7_bowtie2_stats.txt
        • Per-contig coverage and related statistics
          • samtools coverage /home/test01/test/7_bowtie2/viral_contigs_sorted.bam > /home/test01/test/0_logs/7_contigs_coverage.txt
            • OR
          • /opt/software/weeSAM.py --bam /home/test01/test/7_bowtie2/viral_contigs_sorted.bam --out /home/test01/test/7_bowtie2/contigs_coverage_ws.tsv --html /home/test01/test/7_bowtie2/viral_contigs > /home/test01/test/0_logs/7_weesam.log 2>&1
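The read usage mentioned above is the overall mapped percentage in the flagstat report. As a small sketch, it can be pulled out of the saved text file with awk, assuming the usual flagstat line of the form "N + M mapped (P% : ...)"; `mapped_percent` is a hypothetical helper.

```shell
#!/bin/sh
# Read `samtools flagstat` text and print the overall mapped percentage,
# taken from the first line whose fourth field is "mapped".
mapped_percent() {
    awk '$4 == "mapped" { gsub(/[(%]/, "", $5); print $5; exit }' "$1"
}

# Example: mapped_percent /home/test01/test/0_logs/7_bowtie2_stats.txt
```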
    • SARS-CoV-2 Genome Annotation - VAPiD
      • SARS-CoV-2 genome annotation with VAPiD; pick out the complete viral contigs and use them for annotation
      • software: VAPiD
      • version: 1.6.6
      • /opt/data/assembly/annotation/test.sbt
      • /opt/data/assembly/annotation/test_metadata.csv
      • command:
        • mkdir 8_annotation && cd 8_annotation
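        • Edit test_metadata.csv to match your own sequences
          • Template: /opt/data/assembly/annotation/test_metadata.csv
          • Fields to change: strain (the name of your contig), collection-date (your choice), country (your choice), coverage (Avg_Depth from the weeSAM output), full_name (your choice)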
          • /home/test01/test/8_annotation/test_metadata.csv
        • /opt/software/VAPiD-master/vapid3.py /home/test01/test/6_viral_contigs/viral_contigs.fa /opt/data/assembly/annotation/test.sbt --metadata_loc /home/test01/test/8_annotation/test_metadata.csv --db /opt/data/assembly/refseq/SARS-CoV-2 > /home/test01/test/0_logs/8_annotation.log 2>&1
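The coverage field in the metadata is taken from weeSAM's Avg_Depth. If only the `samtools coverage` table from the mapping step is at hand, its meandepth column (column 7) should give a comparable per-contig value. The sketch below lists contig name, covered percentage (column 6) and mean depth; `contig_depth` is a hypothetical helper, and treating meandepth as a stand-in for Avg_Depth is an assumption.

```shell
#!/bin/sh
# From a `samtools coverage` table, print contig name, covered percentage
# and mean depth for every contig, skipping the "#rname" header line.
contig_depth() {
    awk -F '\t' '!/^#/ { printf "%s\t%s\t%s\n", $1, $6, $7 }' "$1"
}

# Example: contig_depth /home/test01/test/0_logs/7_contigs_coverage.txt
```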
    • Summary report
      • Use MultiQC to collect some of the logs and reports produced by the workflow
      • software: MultiQC
      • version: 1.9
      • command:
        • mkdir /home/test01/test/9_all_report
        • multiqc /home/test01/test/ -o /home/test01/test/9_all_report/ -n all_report > /home/test01/test/0_logs/9_multiqc.log 2>&1

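  • Step-by-step commands

(In the commands below, replace the path "/home/test01/test" with a directory you created yourself, e.g. "/home/usr01/zhangsan/Assembly". Alternatively, if you follow the directory layout of this document exactly, you can replace every "/home/test01/test" with ".". Note that step 8 changes into 8_annotation, so from that point on "." no longer refers to the working directory.)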
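Rather than editing each command by hand, the placeholder path can be rewritten in one pass with sed. This is a minimal sketch; `retarget` is a hypothetical helper, and the new path is assumed not to contain "|", "&" or backslashes, which sed would treat specially.

```shell
#!/bin/sh
# Rewrite the placeholder working directory in command text read from stdin.
# "|" is used as the sed delimiter, so paths with "/" need no escaping.
retarget() {
    sed "s|/home/test01/test|$1|g"
}

# Example: print one command with the placeholder replaced.
printf '%s\n' 'mkdir /home/test01/test/9_all_report' |
    retarget /home/usr01/zhangsan/Assembly
```

The same function can be applied to a whole file of saved commands by redirecting that file to its standard input.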

    • Create the directories
cd /home/usrxx/xxxxx/Assembly

mkdir test && cd test

mkdir 0_logs
    • 1. raw data qc
mkdir 1_raw_fastqc

fastqc -o /home/test01/test/1_raw_fastqc/ /opt/data/assembly/*.fastq > /home/test01/test/0_logs/1_raw_qc.log 2>&1
    • 2. trim
mkdir 2_trim

trimmomatic PE -summary /home/test01/test/0_logs/2_trim_summary.txt /opt/data/assembly/test_1.fastq /opt/data/assembly/test_2.fastq -baseout /home/test01/test/2_trim/test.fastq.gz ILLUMINACLIP:/opt/software/miniconda3/share/trimmomatic-0.39-1/adapters/TruSeq3-PE-2.fa:2:30:10 SLIDINGWINDOW:4:15 LEADING:3 TRAILING:3 MINLEN:36 >/home/test01/test/0_logs/2_trim.log 2>&1
    • 3. clean data qc
mkdir 3_clean_fastqc

fastqc -o /home/test01/test/3_clean_fastqc/ /home/test01/test/2_trim/*P.fastq.gz > /home/test01/test/0_logs/3_clean_qc.log 2>&1
    • 4. de novo assembly
      • Do not create the output directory beforehand
megahit -1 /home/test01/test/2_trim/test_1P.fastq.gz -2 /home/test01/test/2_trim/test_2P.fastq.gz --min-contig-len 500 -o /home/test01/test/4_assembly/ >/home/test01/test/0_logs/4_assembly.log 2>&1
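MEGAHIT refuses to write into an output directory that already exists, so a rerun after a failed attempt stops immediately. A small guard can make that failure explicit before the assembler is started. This is only a sketch; `require_absent` is a hypothetical helper.

```shell
#!/bin/sh
# Return non-zero (with a message on stderr) when the given path already
# exists, so that it can be chained in front of the assembly command.
require_absent() {
    if [ -e "$1" ]; then
        echo "error: $1 already exists; remove it or choose another -o" >&2
        return 1
    fi
}

# Example:
# require_absent /home/test01/test/4_assembly && megahit ... -o /home/test01/test/4_assembly/
```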
    • 5. QUAST
mkdir 5_quast

quast /home/test01/test/4_assembly/final.contigs.fa -o /home/test01/test/5_quast/ > /home/test01/test/0_logs/5_quast.log 2>&1
    • 6. blastn
mkdir 6_viral_contigs

blastn -query /home/test01/test/4_assembly/final.contigs.fa -db /opt/data/assembly/refseq/SARS-CoV-2 -outfmt '6 qaccver qlen saccver slen pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle' -qcov_hsp_perc 50 -out /home/test01/test/6_viral_contigs/nCoV_blastn_6.tsv

cat /home/test01/test/6_viral_contigs/nCoV_blastn_6.tsv | cut -f 1 | sort -u > /home/test01/test/6_viral_contigs/viral_list.txt

/opt/software/extract_seq_f_fa.py /home/test01/test/4_assembly/final.contigs.fa /home/test01/test/6_viral_contigs/viral_list.txt /home/test01/test/6_viral_contigs/viral_contigs.fa
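extract_seq_f_fa.py is a local script whose source is not shown here. Where it is not available, the same selection can be sketched with awk. The version below is not the author's script: it assumes the record ID is the header text up to the first space (as in MEGAHIT headers such as ">k141_2 flag=1 ...") and takes the ID list first, then the FASTA file; `extract_fasta` is a hypothetical name.

```shell
#!/bin/sh
# Print the FASTA records whose ID (header text up to the first space,
# without ">") appears in the ID list, one ID per line.
extract_fasta() {
    # $1 = ID list, $2 = FASTA file
    awk 'NR == FNR { want[$1] = 1; next }                      # load IDs
         /^>/      { id = substr($1, 2); keep = (id in want) } # new record
         keep' "$1" "$2"
}

# Example:
# extract_fasta viral_list.txt final.contigs.fa > viral_contigs.fa
```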
    • 7. reads mapping
mkdir -p 7_bowtie2/index

cp /home/test01/test/6_viral_contigs/viral_contigs.fa /home/test01/test/7_bowtie2/index/

bowtie2-build /home/test01/test/7_bowtie2/index/viral_contigs.fa /home/test01/test/7_bowtie2/index/viral_contigs >/home/test01/test/0_logs/7_bowtie2_build.log 2>&1

bowtie2 -x /home/test01/test/7_bowtie2/index/viral_contigs -1 /home/test01/test/2_trim/test_1P.fastq.gz -2 /home/test01/test/2_trim/test_2P.fastq.gz | samtools sort -O bam -o - > /home/test01/test/7_bowtie2/viral_contigs_sorted.bam
      • Mapping rate statistics
samtools flagstat /home/test01/test/7_bowtie2/viral_contigs_sorted.bam > /home/test01/test/0_logs/7_bowtie2_stats.txt
      • Per-contig coverage and related statistics
/opt/software/weeSAM.py --bam /home/test01/test/7_bowtie2/viral_contigs_sorted.bam --out /home/test01/test/7_bowtie2/contigs_coverage_ws.tsv --html /home/test01/test/7_bowtie2/viral_contigs > /home/test01/test/0_logs/7_weesam.log 2>&1
    • 8. SARS-CoV-2 Genome Annotation
mkdir 8_annotation && cd 8_annotation

cp /opt/data/assembly/annotation/test_metadata.csv ./
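      • Edit test_metadata.csv to match your own sequences
        • Template: /opt/data/assembly/annotation/test_metadata.csv
        • Fields to change: strain (the name of your contig), collection-date (your choice), country (your choice), coverage (Avg_Depth from the weeSAM output), full_name (your choice)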
/opt/software/VAPiD-master/vapid3.py /home/test01/test/6_viral_contigs/viral_contigs.fa /opt/data/assembly/annotation/test.sbt --metadata_loc /home/test01/test/8_annotation/test_metadata.csv --db /opt/data/assembly/refseq/SARS-CoV-2 > /home/test01/test/0_logs/8_annotation.log 2>&1
    • 9. Summary report (FastQC, Trimmomatic, QUAST, samtools flagstat)
mkdir /home/test01/test/9_all_report

multiqc /home/test01/test/ -o /home/test01/test/9_all_report/ -n all_report > /home/test01/test/0_logs/9_multiqc.log 2>&1
