miRNA生物信息数据分析流程初探(一)

snakemake-qiaseq-mirna流程
在github的snakemake-qiaseq-mirna[1]流程中,MiRNA的分析步骤如下:

  • Step 1: Read trimming and UMI detection
    • cutadapt[2]
    • umi_tools[3]
  • Step 2: Read alignment
  • Step 3: Deduplication
  • Step 4: Read counting

第一步: clone Github的仓库

git clone https://github.com/jounikuj/snakemake-qiaseq-mirna.git

技巧提示

要测试snakemake流程,可以先使用其提供的测试数据。在该仓库中的其测试数据在.test目录,可以使用snakemake -np --directory .test命令直接运行,测试流程是否正确,并且查看流程图。

.
├── config.yaml
├── fastq
├── LICENSE.md
├── README.md
├── samples.tsv
├── Snakefile
├── .test
│   ├── config.yaml
│   ├── fastq
│   │   ├── sampleA_S1_L001_R1_001.fastq.gz
│   │   ├── sampleA_S1_L002_R1_001.fastq.gz
│   │   ├── sampleA_S1_L003_R1_001.fastq.gz
│   │   ├── sampleA_S1_L004_R1_001.fastq.gz
│   │   ├── sampleB_S2_L001_R1_001.fastq.gz
│   │   ├── sampleB_S2_L002_R1_001.fastq.gz
│   │   ├── sampleB_S2_L003_R1_001.fastq.gz
│   │   ├── sampleB_S2_L004_R1_001.fastq.gz
│   │   ├── sampleC_S3_L001_R1_001.fastq.gz
│   │   ├── sampleC_S3_L002_R1_001.fastq.gz
│   │   ├── sampleC_S3_L003_R1_001.fastq.gz
│   │   └── sampleC_S3_L004_R1_001.fastq.gz
│   ├── samples.tsv
│   └── .snakemake
│       ├── auxiliary
│       ├── conda
│       ├── conda-archive
│       ├── incomplete
│       ├── locks
│       ├── log
│       ├── metadata
│       ├── shadow
│       └── singularity
└── workflow
├── envs
│   └── env.yaml
├── rules
│   ├── alignment.smk
│   ├── common.smk
│   ├── counting.smk
│   ├── deduplication.smk
│   ├── resources.smk
│   ├── targets.smk
│   └── trimming.smk
├── schemas
│   ├── config.schema.yaml
│   └── samples.schema.yaml
└── scripts
└── merge_read_counts.py

第二步: 查看流程图

在生成流程图的时候,可能需要安装以下软件

sudo apt install graphviz

生成流程图的具体命令是:

snakemake --dag --directory .test | dot -Tpng > dag.png
miRNA生物信息数据分析流程初探(一)_第1张图片

第三步:用一个样本解释miRNA的流程

使用下面命令生成sampleA的流程图

snakemake --dag --directory .test  results/sampleA/sampleA_counts_dedup.tsv | dot -Tpng  > dag_single.png
miRNA生物信息数据分析流程初探(一)_第2张图片

3.1 数据库下载及建立索引

get_pirna_sequences : piRNA数据下载

curl -s http://www.regulatoryrna.org/database/piRNA/download/archive/v2.0/fasta/hsa.fa.gz | zcat - > resources/pirna.fasta 2> resources/logs/get_pirna_index.log 

get_pirna_index:使用bowtie2建立piRNA数据的index

bowtie2-build --threads 1 --quiet resources/pirna.fasta resources/pirna > resources/logs/get_pirna_index.log

get_mirna_sequences : miRNA数据下载

curl -s ftp://mirbase.org/pub/mirbase/CURRENT/mature.fa.gz | zcat - | seqkit grep -r -p ^hsa | seqkit seq --rna2dna > resources/mirna_mature.fasta 2> resources/logs/get_mirna_sequences.log 
curl -s ftp://mirbase.org/pub/mirbase/CURRENT/hairpin.fa.gz | zcat - | seqkit grep -r -p ^hsa | seqkit seq --rna2dna > resources/mirna_hairpin.fasta 2> resources/logs/get_mirna_sequences.log 
cat resources/mirna_mature.fasta resources/mirna_hairpin.fasta  > resources/mirna.fasta

get_mirna_index : 使用bowtie2建立miRNA数据的index

bowtie2-build --threads 1 --quiet resources/mirna.fasta resources/mirna > resources/logs/get_mirna_index.log

3.2 下机fastq数据质控

merge_fastq_files : 合并单个样本所有的fastq文件到一个文件中

cat fastq/sampleA_S1_L001_R1_001.fastq.gz \
    fastq/sampleA_S1_L002_R1_001.fastq.gz \
    fastq/sampleA_S1_L003_R1_001.fastq.gz \
    fastq/sampleA_S1_L004_R1_001.fastq.gz > results/sampleA/fastq/sampleA_untrimmed.fastq.gz

trim_illumina_adapters : 从reads的3'端去除Illumina的通用接头序列(Adapter sequences)

cutadapt -a AGATCGGAAGAGCACACGTCTGAAC \
    --discard-untrimmed \
    -m 20 -q 10 -j 1 \
    -o results/sampleA/fastq/sampleA_illumina_trimmed.fastq.gz \
    --quiet results/sampleA/fastq/sampleA_untrimmed.fastq.gz > results/sampleA/logs/trim_illumina_adapters.log

trim_umi_sequences : 从reads的3'端去除UMI(Unique Molecular Identifier)序列,并将其添加到reads的头部

umi_tools extract --3prime \
  --stdin=results/sampleA/fastq/sampleA_illumina_trimmed.fastq.gz \
  --bc-pattern=NNNNNNNNNNNN \
  --stdout=results/sampleA/fastq/sampleA_umi_trimmed.fastq.gz > results/sampleA/logs/trim_umi_sequences.log

拓展阅读:单细胞数据分析中的UMI序列

trim_5_adapter : 从reads 5'端去除接头

cutadapt -g GTTCAGAGTTCTACAGTCCGACGATC \
    -m 20 -j 1 \
    -o results/sampleA/fastq/sampleA_5adapter_trimmed.fastq.gz \
    --quiet results/sampleA/fastq/sampleA_umi_trimmed.fastq.gz > results/sampleA/logs/trim_5_adapter.log

trim_3_adapter: 从reads 3'端去除接头,丢弃未修剪的reads并执行一些质量调整

cutadapt -a AACTGTAGGCACCATCAAT \
    --discard-untrimmed \
    -m 20 -q 10 -j 1 \
    -o results/sampleA/fastq/sampleA_trimmed.fastq.gz \
    --quiet results/sampleA/fastq/sampleA_5adapter_trimmed.fastq.gz > results/sampleA/logs/trim_3_adapter.lo

3.3 比对reads到数据库

map_mirna_sequences: 将reads比对到mirBase数据库序列,直接转换为排序的.bam文件,并生成.bai的索引文件。没有比对上的reads将保存为新的fastq 文件,随后与piRNA的数据库比对。

bowtie2 --very-sensitive-local \
    -p 1 -x resources/mirna \
    --un-gz results/sampleA/fastq/sampleA_mirna_unmapped.fastq.gz \
    results/sampleA/fastq/sampleA_trimmed.fastq.gz | samtools sort - > results/sampleA/bam/sampleA_mirna.bam 2> results/sampleA/logs/map_mirna_sequences.log 

samtools index results/sampleA/bam/sampleA_mirna.bam results/sampleA/bam/sampleA_mirna.bai

map_pirna_sequences: 将reads比对到piRNAdb数据库序列,直接转换为排序的.bam文件,并生成.bai的索引文件。有比对上的reads将保存为新的fastq 文件,随后与参考基因组比对(本文没有执行该步骤)。

bowtie2 --very-sensitive-local -p 1 \
    -x resources/pirna \
    --un-gz results/sampleA/fastq/sampleA_unmapped.fastq.gz \
    results/sampleA/fastq/sampleA_mirna_unmapped.fastq.gz | samtools sort - > results/sampleA/bam/sampleA_pirna.bam 2> results/sampleA/logs/map_pirna_sequences.log 

samtools index results/sampleA/bam/sampleA_pirna.bam results/sampleA/bam/sampleA_pirna.bai

3.4 合并bam文件

merge_bam_files: 将单个 样本BAM 文件合并为一个BAM文件,以减少处理文件的数量并简化后续工作流。

samtools merge --threads 1 \
    results/sampleA/bam/sampleA_merged.bam \
    results/sampleA/bam/sampleA_mirna.bam \
    results/sampleA/bam/sampleA_pirna.bam > results/sampleA/logs/merge_bam_files.log 

samtools index results/sampleA/bam/sampleA_merged.bam results/sampleA/bam/sampleA_merged.bai 

3.5 去除PCR重复

deduplicate_reads: 使用UMItools dedup命令根据先前识别的UMI序列,去除重复的reads

umi_tools dedup \
        -I results/sampleA/bam/sampleA_merged.bam \
        -S results/sampleA/bam/sampleA_merged_dedup.bam > results/sampleA/logs/deduplicate_reads.log 

samtools index results/sampleA/bam/sampleA_merged_dedup.bam results/sampleA/bam/sampleA_merged_dedup.bai

3.6 获得count表

count_deduplicated_read_counts: 统计BAM文件中去除重复reads的计数,并从输出中提取相关列。

samtools idxstats results/sampleA/bam/sampleA_merged_dedup.bam | \
        awk 'BEGIN{OFS='\t'} $3 != 0' - | \
        cut -f1,3 -| sed '1 i\Molecule       Count' - > results/sampleA/sampleA_counts_dedup.tsv

miRNA生物信息数据分析流程初探(一) - 生信小木屋


  1. snakemake-qiaseq-mirna

     ↩
  2. Cutadapt Removes Adapter Sequences From High-Throughput Sequencing Reads

     ↩
  3. UMI-tools: Modelling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy

     ↩

你可能感兴趣的:(生物信息学,生物学,python,数据分析)