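Today is my first attempt at processing ATAC-seq data; hopefully it goes quickly.
First, the reference article I found: ATAC-seq/ChIP-seq分析方法 (ATAC-seq/ChIP-seq analysis methods)
For the new data, create a directory named after the experimenter (zhaoyingying), the sequencing type (ATAC_seq), and the date (2021_05_03).
# The resulting directory: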
(base) zexing@DNA:~/projects/zhaoyingying/ATAC_seq/2021_05_03$
# Create the working subdirectories
mkdir raw_data clean_data bam bam_bw bam_sort sam macs2_bdgdiff macs2_callpeak matrix_reference_point matrix_scale_regions fastqc_report MD5_txt scripts_log
# Verify the MD5 checksums of the delivered files in each sample directory
(base) zexing@DNA:~/projects/zhaoyingying/ATAC_seq/AJV5-ATAC_FKDL210049869-1a$ cat MD5_AJV5_FKDL210049869-1a.txt > check_md5sum_AJV5_FKDL210049869-1a.txt && md5sum -c check_md5sum_AJV5_FKDL210049869-1a.txt
AJV5_FKDL210049869-1a_1.clean.fq.gz: OK
AJV5_FKDL210049869-1a_2.clean.fq.gz: OK
(base) zexing@DNA:~/projects/zhaoyingying/ATAC_seq/AJV93-ATAC_FKDL210049870-1a$ cat MD5_AJV93_FKDL210049870-1a.txt > check_MD5_AJV93_FKDL210049870-1a.txt && md5sum -c check_MD5_AJV93_FKDL210049870-1a.txt
AJV93_FKDL210049870-1a_1.clean.fq.gz: OK
AJV93_FKDL210049870-1a_2.clean.fq.gz: OK
(base) zexing@DNA:~/projects/zhaoyingying/ATAC_seq/JV84-ATAC_FKDL210049867-1a$ cat MD5_JV84_FKDL210049867-1a.txt > check_MD5_JV84_FKDL210049867-1a.txt && md5sum -c check_MD5_JV84_FKDL210049867-1a.txt
JV84_FKDL210049867-1a_1.clean.fq.gz: OK
JV84_FKDL210049867-1a_2.clean.fq.gz: OK
(base) zexing@DNA:~/projects/zhaoyingying/ATAC_seq/JV85-ATAC_FKDL210049868-1a$ cat MD5_JV85_FKDL210049868-1a.txt > check_MD5_JV85_FKDL210049868-1a.txt && md5sum -c check_MD5_JV85_FKDL210049868-1a.txt
JV85_FKDL210049868-1a_1.clean.fq.gz: OK
JV85_FKDL210049868-1a_2.clean.fq.gz: OK
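The four per-sample checks above could also be run as one loop. This is only a minimal sketch: it assumes each sample directory holds a manifest matching MD5_*.txt in `md5sum` format next to the files it lists, and the function name `check_md5_dirs` is my own invention, not something from the sequencing provider.

```shell
# check_md5_dirs DIR...: hypothetical helper that runs `md5sum -c` in each directory.
# Assumption: every directory contains a manifest named MD5_*.txt in md5sum format.
check_md5_dirs() {
    local d manifest status=0
    for d in "$@"; do
        for manifest in "$d"/MD5_*.txt; do
            # If the glob matched nothing, report it and move on.
            [ -e "$manifest" ] || { echo "no manifest in $d" >&2; status=1; continue; }
            # Run in a subshell so the cd does not leak into the caller.
            ( cd "$d" && md5sum -c "$(basename "$manifest")" ) || status=1
        done
    done
    return "$status"
}
```

With the directory layout shown above it would be called as `check_md5_dirs ~/projects/zhaoyingying/ATAC_seq/*-ATAC_*`; it returns non-zero if any file fails or a manifest is missing.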
(base) zexing@DNA:~/projects/zhaoyingying/ATAC_seq/2021_05_03$ cd scripts_log
(base) zexing@DNA:~/projects/zhaoyingying/ATAC_seq/2021_05_03/scripts_log$ vim ATAC_seq_script_1
Use vim to create ATAC_seq_script_1, which combines quality control, alignment, format conversion, sorting, indexing, bigWig conversion with bamCoverage, and macs2 callpeak:
#!/bin/bash
# The line above declares that this script is written in bash, so it is run by the bash interpreter.
# Program
# This program is used for ATAC-seq data analysis.
# History
# 2021/05/09 zexing First release
# Set ${dir} to the project directory
# Change the user name and date as needed
dir=/f/xudonglab/zexing/projects/zhaoyingying/ATAC_seq/2021_05_03
# Run quality control on the reads
fastqc -t 16 -o ${dir}/fastqc_report/ ${dir}/clean_data/*.fq.gz
# Use a for loop for the remaining steps
# Change the sample names as needed
for i in AJV5_FKDL210049869-1a AJV93_FKDL210049870-1a JV84_FKDL210049867-1a JV85_FKDL210049868-1a
do
# Align the paired-end reads to mm10 with bowtie2
bowtie2 -t -p 16 -x /f/xudonglab/zexing/reference/UCSC_mm10/bowtie2_index/mm10 -1 ${dir}/clean_data/${i}_1.clean.fq.gz -2 ${dir}/clean_data/${i}_2.clean.fq.gz -S ${dir}/sam/${i}.sam
# Convert SAM to BAM
samtools view -@ 16 -S ${dir}/sam/${i}.sam -1b -o ${dir}/bam/${i}.bam
# Sort the BAM by coordinate
samtools sort -@ 16 -l 5 -o ${dir}/bam_sort/${i}.bam.sort ${dir}/bam/${i}.bam
# Index the sorted BAM
samtools index -@ 16 ${dir}/bam_sort/${i}.bam.sort
# Convert the sorted BAM to bigWig with bamCoverage
bamCoverage -p 16 -v -b ${dir}/bam_sort/${i}.bam.sort -o ${dir}/bam_bw/${i}.bam.sort.bw
# Call peaks on the ATAC-seq data with macs2; this needs the shift parameters (--nomodel --shift --extsize)
macs2 callpeak -t ${dir}/bam_sort/${i}.bam.sort -f BAM -g mm -B --nomodel --shift -100 --extsize 200 --outdir ${dir}/macs2_callpeak/ -n ${i}
done
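A note on what `--nomodel --shift -100 --extsize 200` in the callpeak step does, written as a small arithmetic sketch. It assumes the behavior described in the MACS2 documentation: the 5' end of each read is moved by `--shift` bp along the read direction (a negative value moves it backwards), then extended by `--extsize` bp in the 5'→3' direction. The helper `tag_window` and the position 1000 are made up for illustration, and off-by-one coordinate conventions are ignored.

```shell
# tag_window POS STRAND SHIFT EXTSIZE: print the start and end of the window
# that one read contributes to the pileup (hypothetical helper, simplified).
tag_window() {
    local pos=$1 strand=$2 shift_bp=$3 extsize=$4 a b
    if [ "$strand" = "+" ]; then
        # Forward read: 5'->3' runs towards larger coordinates.
        a=$(( pos + shift_bp ))
        b=$(( a + extsize ))
    else
        # Reverse read: 5'->3' runs towards smaller coordinates.
        b=$(( pos - shift_bp ))
        a=$(( b - extsize ))
    fi
    echo "$a $b"
}

tag_window 1000 + -100 200   # prints "900 1100"
tag_window 1000 - -100 200   # prints "900 1100"
```

Under these assumptions both strands give a 200 bp window centered on the read's 5' end, which for ATAC-seq is the Tn5 cut site. That is why the shift is set to minus half of the extension size.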
Run ATAC_seq_script_1 in the background:
nohup bash ATAC_seq_script_1 > ATAC_seq_script_1_log &
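Once the job has finished, a quick sanity check is to pull the bowtie2 alignment rates out of the log. This is a sketch under one assumption: that bowtie2's summary, which it writes to stderr and which ends with an "overall alignment rate" line, actually landed in ATAC_seq_script_1_log (if it did not, adding `2>&1` to the redirect would capture it). The function name `alignment_rates` is hypothetical.

```shell
# alignment_rates LOGFILE: print the percentage from every
# "overall alignment rate" line, in the order the samples were aligned.
alignment_rates() {
    awk '/overall alignment rate/ { print $1 }' "$1"
}
```

Calling `alignment_rates ATAC_seq_script_1_log` should print one percentage per sample, in the same order as the sample list in the for loop.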
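This experiment includes biological replicates, so before going further the peaks shared between replicates need to be identified; I used the IDR software to select them. See the following article for details:
CHIP-seq流程学习笔记(9)-使用IDR 软件对生物学重复样本间的差异peak进行提取 (ChIP-seq workflow study notes (9): extracting peaks across biological replicates with IDR)
Use vim to create ATAC_seq_script_2 with the following content: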
#! /bin/bash
# The line above declares that this script is written in bash, so it is run by the bash interpreter.
# Program:
# This program is used for calling peaks from different samples in the same condition.
#History:
# 2021/05/10 zexing First release
#
# --samples: files containing peaks and scores (one per replicate).
# --peak-list: if provided, all peaks will be taken from this file.
# --output-file: file to write output to. Default: idrValues.txt
# --output-file-type {narrowPeak,broadPeak,bed}: output file type. Defaults to input file type when available, otherwise bed.
# --idr-threshold: only return peaks with a global IDR below this value. Default: report all peaks.
dir=/f/xudonglab/zexing/projects/zhaoyingying/ATAC_seq/2021_05_03
peak=${dir}/macs2_callpeak
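results=${dir}/idr
# Create the output directory if it is missing (idr was not in the mkdir list above)
mkdir -p "${results}"
# Control group: take the peaks shared by both replicates and write them as a narrowPeak file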
idr --samples ${peak}/JV85_FKDL210049868-1a_peaks.narrowPeak ${peak}/JV84_FKDL210049867-1a_peaks.narrowPeak --output-file-type narrowPeak --output-file ${results}/JV85_JV84_peak.narrowPeak --idr-threshold 1
# Treatment group: take the peaks shared by both replicates and write them as a narrowPeak file
idr --samples ${peak}/AJV5_FKDL210049869-1a_peaks.narrowPeak ${peak}/AJV93_FKDL210049870-1a_peaks.narrowPeak --output-file-type narrowPeak --output-file ${results}/AJV5_AJV93_peak.narrowPeak --idr-threshold 1
Run ATAC_seq_script_2 in the background:
nohup bash ATAC_seq_script_2 > ATAC_seq_script_2_log &
The run log is as follows:
Initial parameter values: [0.10 1.00 0.20 0.50]
Final parameter values: [0.29 1.88 0.94 0.33]
Number of reported peaks - 937/937 (100.0%)
Number of peaks passing IDR cutoff of 1.0 - 937/937 (100.0%)
Initial parameter values: [0.10 1.00 0.20 0.50]
Final parameter values: [0.23 1.28 0.83 0.52]
Number of reported peaks - 2769/2769 (100.0%)
Number of peaks passing IDR cutoff of 1.0 - 2769/2769 (100.0%)
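Because the script uses `--idr-threshold 1`, every peak passes, which is why the log reports 100.0% for both groups. To keep only reproducible peaks afterwards, the output can be filtered on column 5. According to the IDR README, that column holds the scaled score min(int(-125·log2(IDR)), 1000), so an IDR of 0.05 corresponds to a score of 540. A minimal sketch follows; the function name `filter_idr` and the output file name in the usage note are hypothetical.

```shell
# filter_idr MIN_SCORE FILE: keep rows of an IDR narrowPeak output whose
# scaled IDR score (column 5) is at least MIN_SCORE. 540 corresponds to IDR <= 0.05.
filter_idr() {
    local min_score=$1 infile=$2
    awk -v s="$min_score" 'BEGIN { FS = OFS = "\t" } $5 >= s' "$infile"
}
```

For example, `filter_idr 540 ${results}/JV85_JV84_peak.narrowPeak > ${results}/JV85_JV84_peak.idr0.05.narrowPeak` would write the filtered control peaks to a new file and leave the original untouched.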
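1. Compute signal with computeMatrix in scale-regions mode and plot it with plotHeatmap/plotProfile
scale-regions mode works on whole regions, so the BED or GTF file that defines the plotting positions (-R) has to be a region file. Here I use the IDR output narrowPeak files, which are derived from the _peaks.narrowPeak files produced by macs2 callpeak.
Use vim to create ATAC_seq_script_3 with the following content: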
#! /bin/bash
# The line above declares that this script is written in bash, so it is run by the bash interpreter.
# Program:
# This program is used for computeMatrix scale-regions.
#History:
# 2021/05/07 zexing First release
# In the scale-regions mode, all regions in the BED file are stretched or shrunken to the length (in bases) indicated by the user.
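# -R: BED or GTF file(s) defining the regions to plot; a line starting with # can be used to mark groups of regions. No default.
# -S: input bigWig file(s).
# -o: output matrix file, used by plotHeatmap and plotProfile.
# -b upstream (default 0 bp), -a downstream (default 0 bp): flanking distance around each region; if the regions are genes, this is upstream of the TSS and downstream of the TES.
# --skipZeros: skip regions whose signal is zero everywhere (not used below).
# -p: number of threads.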
dir=/f/xudonglab/zexing/projects/zhaoyingying/ATAC_seq/2021_05_03
bed=${dir}/idr
bw=${dir}/bam_bw
results=${dir}/matrix_scale_regions
computeMatrix scale-regions \
-R ${bed}/JV85_JV84_peak.narrowPeak ${bed}/AJV5_AJV93_peak.narrowPeak \
-S ${bw}/JV85_FKDL210049868-1a.bam.sort.bw ${bw}/JV84_FKDL210049867-1a.bam.sort.bw ${bw}/AJV5_FKDL210049869-1a.bam.sort.bw ${bw}/AJV93_FKDL210049870-1a.bam.sort.bw \
-o ${results}/matrix_scale_ATAC.gz \
-b 3000 -a 3000 \
-p 16
# Draw a heatmap from the matrix with plotHeatmap, grouped by peak set
plotHeatmap -m ${results}/matrix_scale_ATAC.gz \
-o ${results}/scale_ATAC_heatmap.png \
--dpi 750 \
--whatToShow 'heatmap and colorbar' \
--startLabel "Start" \
--endLabel "End" \
--regionsLabel JV85_JV84_peak AJV5_AJV93_peak \
--samplesLabel JV85 JV84 AJV5 AJV93
# Draw profile plots from the matrix with plotProfile, one panel per peak set (--perGroup)
plotProfile -m ${results}/matrix_scale_ATAC.gz \
-o ${results}/scale_ATAC_profile.png \
--dpi 750 \
--legendLocation upper-right \
--startLabel "Start" \
--endLabel "End" \
--regionsLabel JV85_JV84_peak AJV5_AJV93_peak \
--samplesLabel JV85 JV84 AJV5 AJV93 \
--perGroup
Run ATAC_seq_script_3 in the background:
nohup bash ATAC_seq_script_3 > ATAC_seq_script_3_log &
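One thing to watch in script 3: the names given to `--samplesLabel` have to be in the same order as the bigWig files given to `-S`. Deriving both from a single sample list avoids mismatches. The sketch below assumes the file naming used in this post (`<sample>_<library>.bam.sort.bw`, with no spaces in names); the helpers `bw_paths` and `bw_labels` are hypothetical.

```shell
# Sample names as used in this post, in the order wanted on the plots.
samples="JV85_FKDL210049868-1a JV84_FKDL210049867-1a AJV5_FKDL210049869-1a AJV93_FKDL210049870-1a"

# bw_paths DIR NAME...: print the bigWig path for each sample name.
bw_paths() {
    local bw_dir=$1 s
    shift
    for s in "$@"; do printf '%s/%s.bam.sort.bw\n' "$bw_dir" "$s"; done
}

# bw_labels NAME...: print the part of each name before the first underscore.
bw_labels() {
    local s
    for s in "$@"; do printf '%s\n' "${s%%_*}"; done
}

bw_labels $samples   # prints JV85, JV84, AJV5 and AJV93, one per line
```

In the script the two lists could then be passed as `-S $(bw_paths ${bw} $samples)` and `--samplesLabel $(bw_labels $samples)`, relying on word splitting of the unquoted output.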