ChIP-seq pipeline study notes (9): using IDR to extract reproducible peaks between biological replicates

References:

Using IDR to process peak calling results from biological replicates

Irreproducible Discovery Rate (IDR)

1. Installing IDR with Conda

(base) zexing@DNA:~/projects/zhaoyingying/ChIP_seq/2021_05_01/scripts_log$ conda install idr
Collecting package metadata (current_repodata.json): done
Solving environment: done
## Package Plan ##
  environment location: /f/xudonglab/zexing/miniconda3
  added / updated specs:
    - idr
The following packages will be downloaded:
    package                    |            build
    ---------------------------|-----------------
    conda-4.10.1               |   py37h89c1867_0         3.1 MB  https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    idr-2.0.4.2                |   py37h77a2a36_5          77 KB  https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda
    matplotlib-3.2.2           |                1           6 KB  https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    matplotlib-base-3.2.2      |   py37h1d35a4c_1         7.0 MB  https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    openssl-1.1.1k             |       h7f98852_0         2.1 MB  https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
    ------------------------------------------------------------
                                           Total:        12.3 MB
The following NEW packages will be INSTALLED:
  idr                anaconda/cloud/bioconda/linux-64::idr-2.0.4.2-py37h77a2a36_5
  matplotlib         anaconda/cloud/conda-forge/linux-64::matplotlib-3.2.2-1
The following packages will be UPDATED:
  conda                                4.9.2-py37h89c1867_0 --> 4.10.1-py37h89c1867_0
  openssl                                 1.1.1i-h7f98852_0 --> 1.1.1k-h7f98852_0
The following packages will be DOWNGRADED:
  matplotlib-base                      3.3.3-py37h4f6019d_0 --> 3.2.2-py37h1d35a4c_1

Proceed ([y]/n)? y

Downloading and Extracting Packages
conda-4.10.1         | 3.1 MB    | ####################################################################################################### | 100%
matplotlib-base-3.2. | 7.0 MB    | ####################################################################################################### | 100%
idr-2.0.4.2          | 77 KB     | ####################################################################################################### | 100%
openssl-1.1.1k       | 2.1 MB    | ####################################################################################################### | 100%
matplotlib-3.2.2     | 6 KB      | ####################################################################################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
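
After the transaction finishes, it can be worth confirming that the idr executable is actually on the PATH before building scripts around it. The following is a minimal sketch: require_tool is a hypothetical helper name (not part of conda or IDR), and the only IDR option used is the --version flag listed in the help output.

```shell
#!/bin/bash
# Sketch: confirm a command is available on PATH before using it.
# require_tool is a hypothetical helper name, not part of conda or IDR.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Only ask IDR for its version if the executable was found.
if require_tool idr; then
    idr --version
fi
```

If the check reports the tool as missing, the usual cause is that the conda environment where IDR was installed is not the active one.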

2. Meaning of the IDR command-line options

usage: idr [-h] --samples SAMPLES SAMPLES [--peak-list PEAK_LIST]
           [--input-file-type {narrowPeak,broadPeak,bed,gff}] [--rank RANK]
           [--output-file OUTPUT_FILE]
           [--output-file-type {narrowPeak,broadPeak,bed}]
           [--log-output-file LOG_OUTPUT_FILE] [--idr-threshold IDR_THRESHOLD]
           [--soft-idr-threshold SOFT_IDR_THRESHOLD] [--use-old-output-format]
           [--plot] [--use-nonoverlapping-peaks]
           [--peak-merge-method {sum,avg,min,max}] [--initial-mu INITIAL_MU]
           [--initial-sigma INITIAL_SIGMA] [--initial-rho INITIAL_RHO]
           [--initial-mix-param INITIAL_MIX_PARAM] [--fix-mu] [--fix-sigma]
           [--dont-filter-peaks-below-noise-mean] [--use-best-multisummit-IDR]
           [--allow-negative-scores] [--random-seed RANDOM_SEED]
           [--max-iter MAX_ITER] [--convergence-eps CONVERGENCE_EPS]
           [--only-merge-peaks] [--verbose] [--quiet] [--version]

optional arguments:
  -h, --help            show this help message and exit
  --samples SAMPLES SAMPLES, -s SAMPLES SAMPLES
                        Files containing peaks and scores.
  --peak-list PEAK_LIST, -p PEAK_LIST
                        If provided, all peaks will be taken from this file.
  --input-file-type {narrowPeak,broadPeak,bed,gff}
                        File type of --samples and --peak-list.
  --rank RANK           Which column to use to rank peaks.
                        Options: signal.value p.value q.value columnIndex
                        Defaults:
                                narrowPeak/broadPeak: signal.value
                                bed: score
  --output-file OUTPUT_FILE, -o OUTPUT_FILE
                        File to write output to.
                        Default: idrValues.txt
  --output-file-type {narrowPeak,broadPeak,bed}
                        Output file type. Defaults to input file type when available, otherwise bed.
  --log-output-file LOG_OUTPUT_FILE, -l LOG_OUTPUT_FILE
                        File to write output to. Default: stderr
  --idr-threshold IDR_THRESHOLD, -i IDR_THRESHOLD
                        Only return peaks with a global idr threshold below this value.
                        Default: report all peaks
  --soft-idr-threshold SOFT_IDR_THRESHOLD
                        Report statistics for peaks with a global idr below this value but return all peaks with an idr below --idr.
                        Default: 0.05
  --use-old-output-format
                        Use old output format.
  --plot                Plot the results to [OFNAME].png
  --use-nonoverlapping-peaks
                        Use peaks without an overlapping match and set the value to 0.
  --peak-merge-method {sum,avg,min,max}
                        Which method to use for merging peaks.
                                Default: 'sum' for signal/score/column indexes, 'min' for p/q-value.
  --initial-mu INITIAL_MU
                        Initial value of mu. Default: 0.10
  --initial-sigma INITIAL_SIGMA
                        Initial value of sigma. Default: 1.00
  --initial-rho INITIAL_RHO
                        Initial value of rho. Default: 0.20
  --initial-mix-param INITIAL_MIX_PARAM
                        Initial value of the mixture params. Default: 0.50
  --fix-mu              Fix mu to the starting point and do not let it vary.
  --fix-sigma           Fix sigma to the starting point and do not let it vary.
  --dont-filter-peaks-below-noise-mean
                        Allow signal points that are below the noise mean (should only be used if you know what you are doing).
  --use-best-multisummit-IDR
                        Set the IDR value for a group of multi summit peaks (a group of peaks with the same chr/start/stop but different summits) to the best value across all of these peaks. This is a work around for peak callers that don't do a good job splitting scores across multi summit peaks (e.g. MACS). If set in conjunction with --plot two plots will be created - one with alternate summits and one without.  Use this option with care.
  --allow-negative-scores
                        Allow negative values for scores. (should only be used if you know what you are doing)
  --random-seed RANDOM_SEED
                        The random seed value (sor braking ties). Default: 0
  --max-iter MAX_ITER   The maximum number of optimization iterations. Default: 3000
  --convergence-eps CONVERGENCE_EPS
                        The maximum change in parameter value changes for convergence. Default: 1.00e-06
  --only-merge-peaks    Only return the merged peak list.
  --verbose             Print out additional debug information
  --quiet               Don't print any status messages
  --version             show program's version number and exit

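3. Identifying peaks shared between biological replicates on a Linux server

This experiment includes biological replicates. Before downstream processing, the peaks shared by the replicates need to be identified, and IDR is used for this filtering.

Create ChIP_seq_script_1 with vim. The script is as follows: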

#!/bin/bash
# The line above declares that this script uses bash syntax, so the bash environment configuration is loaded when the script runs.
# Program:
#       This program is used for calling peaks from different samples in the same condition.
# History:
# 2021/05/08         zexing              First release
#
# --samples: Files containing peaks and scores.
# --peak-list: If provided, all peaks will be taken from this file.
# --output-file: File to write output to. Default: idrValues.txt
# --plot: Plot the results to [OFNAME].png. This is a switch and takes no value.
# --output-file-type {narrowPeak,broadPeak,bed}: Output file type. Defaults to input file type when available, otherwise bed.
# --idr-threshold: Only return peaks with a global idr threshold below this value. Default: report all peaks
# --soft-idr-threshold: Report statistics for peaks with a global idr below this value but return all peaks with an idr below --idr. Default: 0.05

dir=/f/xudonglab/zexing/projects/zhaoyingying/ChIP_seq/2021_05_01
peak=${dir}/macs2_callpeak
results=${dir}/idr

# Create the output directory if it does not exist yet.
mkdir -p "${results}"

# --plot takes no path; the figure is written next to the output file as <output-file>.png.
idr \
--samples "${peak}/JV21_H3K27me3_peaks.narrowPeak" "${peak}/JV22_H3K27me3_peaks.narrowPeak" \
--output-file "${results}/JV21_JV22_H3K27me3_narrowPeak.txt" \
--plot \
--idr-threshold 0.05

Run the ChIP_seq_script_1 script in the background as follows:

nohup bash ChIP_seq_script_1 > ChIP_seq_script_1_log &

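There are two ways to handle the output file:

  • Leave --idr-threshold unset so that every peak is written to the output. In this case --soft-idr-threshold (default 0.05) only controls which peaks are counted in the reported statistics. The file given by --output-file (a .txt file here) records the global IDR of each peak in column 12, stored as -log10(IDR) in the default output format, so peaks can be filtered on that column afterwards;
  • Set --idr-threshold to define the IDR cutoff directly. This adjusts the number of peaks reported, and the output file contains only the peaks that pass the cutoff.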
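
The first approach can be sketched with awk. This is a minimal sketch, assuming the default (non-legacy) IDR output format, in which column 12 holds the global IDR as -log10(IDR), so that IDR <= 0.05 corresponds to a column-12 value of roughly 1.30 or higher. The function name filter_by_global_idr is hypothetical and not part of IDR.

```shell
#!/bin/bash
# Sketch: keep only peaks whose global IDR is at or below a threshold.
# Assumption: column 12 of the IDR output is -log10(global IDR).
# filter_by_global_idr is a hypothetical helper, not part of IDR.
filter_by_global_idr() {
    local infile=$1
    local threshold=${2:-0.05}   # IDR cutoff on the ordinary 0..1 scale
    awk -v t="$threshold" '
        # Convert the cutoff to the -log10 scale used in column 12.
        BEGIN { cutoff = -log(t) / log(10) }
        # A larger -log10 value means a smaller (better) IDR.
        $12 >= cutoff
    ' "$infile"
}
```

For example, calling filter_by_global_idr on the .txt file written by --output-file with a threshold of 0.05 and redirecting the result to a new file would keep only the peaks passing that cutoff. Running wc -l on the result then gives the number of reproducible peaks.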
