Gene set enrichment analysis (GSEA) was performed using the GSEA standalone desktop programme. An expression matrix was created containing expression values at zero and 6 h (upon 50 nM THZ1 treatment). All SE-associated genes were used as a ‘gene set database’. GSEA was run with the parameter ‘Metric for ranking genes’ set to ‘log2_Ratio_of_classes’ to calculate the enrichment score for SE-associated genes.
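To make the ranking metric concrete, here is a minimal sketch of a log2 ratio of classes: for each gene, the log2 of the mean expression in one class divided by the mean in the other. Which class is the numerator (6 h or 0 h) is an assumption here, and the gene names and values are hypothetical. GSEA computes this internally, so this is only an illustration.

```python
import math


def log2_ratio_of_classes(mean_a, mean_b):
    """Return log2(mean_a / mean_b) for one gene; both means must be positive."""
    if mean_a <= 0 or mean_b <= 0:
        raise ValueError("class means must be positive")
    return math.log2(mean_a / mean_b)


def rank_genes(expression):
    """expression maps gene -> (mean at 6 h, mean at 0 h).

    Returns (gene, metric) pairs sorted from highest to lowest metric.
    Using 6 h as the numerator is an assumption of this sketch.
    """
    scored = {g: log2_ratio_of_classes(a, b) for g, (a, b) in expression.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical expression values, for illustration only.
toy = {"GENE_A": (8.0, 2.0), "GENE_B": (1.0, 4.0), "GENE_C": (3.0, 3.0)}
ranking = rank_genes(toy)
for gene, metric in ranking:
    print(gene, metric)
```

Genes at the top of the list are relatively higher in the numerator class; the enrichment score then asks whether SE-associated genes cluster at one end of this ranking.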
2. Download the data described in the paper
The dataset accession is GSE76861. Search for it in the GEO DataSets database and download it.
GSM2039110 | TE7_H3K27Ac
GSM2039111 | TE7_Input
GSM2039112 | KYSE510_H3K27Ac
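The samples above are raw ChIP-seq data; the four FASTQ files are downloaded with Aspera (ascp) using the commands below.
Wrapping a command as nohup COMMAND & runs it in the background and keeps it running after the session disconnects. In ascp, -T disables encryption for maximum throughput, -Q selects the fair transfer policy, and -l 100M caps the transfer rate at 100 Mbit/s.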
nohup ascp -QT -l 100M -i ~/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR310/004/SRR3101254/SRR3101254.fastq.gz . &
nohup ascp -QT -l 100M -i ~/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR310/001/SRR3101251/SRR3101251.fastq.gz . &
nohup ascp -QT -l 100M -i ~/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR310/002/SRR3101252/SRR3101252.fastq.gz . &
nohup ascp -QT -l 100M -i ~/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR310/003/SRR3101253/SRR3101253.fastq.gz . &
Then decompress the files:
gunzip SRR3101251.fastq.gz
gunzip SRR3101252.fastq.gz
gunzip SRR3101253.fastq.gz
gunzip SRR3101254.fastq.gz
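The remote paths in the download commands follow a regular pattern: the first six characters of the run accession, then a directory made of "00" plus the last digit, then the accession itself. A minimal sketch that rebuilds the path from an accession follows. The function name is hypothetical, and the pattern is inferred only from the four paths above, so it is assumed to hold for 10-character, single-end accessions.

```python
def ena_fastq_path(run):
    """Build the remote FASTQ path for a 10-character run accession.

    The layout is inferred from the four paths used in this walkthrough
    and is an assumption for any other accession.
    """
    if len(run) != 10:
        raise ValueError("this sketch only handles 10-character accessions")
    return "/vol1/fastq/{}/00{}/{}/{}.fastq.gz".format(run[:6], run[-1], run, run)


for run in ["SRR3101251", "SRR3101252", "SRR3101253", "SRR3101254"]:
    print(ena_fastq_path(run))
```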
Processed data provided by the authors:
GSE76861_RAW.tar | 431.0 Mb | (http)(custom) | TAR (of BW, TXT)
Link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE76861
Extract it with: tar -xvf GSE76861_RAW.tar
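3. Download the human reference genome
The paper states: reads were aligned to human reference genome (build GRCh37/hg19) using Bowtie Aligner.
The Bowtie website provides a pre-built hg19 index.
Extract it with: unzip hg19.ebwt.zip
4. Quality control with FastQC
FastQC command: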
fastqc -o . -t 5 -f fastq SRR3101251.fastq &
-t 5: run with 5 threads
The trailing &: run the command in the background
(Run this once for each of the four FASTQ files.)
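Rather than typing the command four times, the four invocations can be generated in a loop. The sketch below only builds and prints the commands, using the same flags as the FastQC command above. The helper name is hypothetical.

```python
import shlex

RUNS = ["SRR3101251", "SRR3101252", "SRR3101253", "SRR3101254"]


def fastqc_command(run, outdir=".", threads=5):
    """Return the FastQC argument list for one run, mirroring the command above."""
    return ["fastqc", "-o", outdir, "-t", str(threads), "-f", "fastq", run + ".fastq"]


commands = [fastqc_command(run) for run in RUNS]
for cmd in commands:
    # Print a shell-safe version of each command for inspection.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually run them, each list could be passed to subprocess.run(cmd, check=True), assuming fastqc is on the PATH.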
4. Align the reads with Bowtie
Bowtie command:
bowtie genome/hg19 -q reads/SRR3101251.fastq -m 1 -p 4 -S 2> SRR3101251.out > SRR3101251.sam
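-q: the input is a FASTQ file
-m 1: keep only reads that align to exactly one location
-5 1: (optional, not used in the command above) trim 1 base from the 5' (left) end of each read, because FastQC showed lower quality there; trimming is not required
-p 4: number of threads
-S: write the output in SAM format
2> SRR3101251.out: redirect the messages normally printed to the screen (standard error) to SRR3101251.out
> SRR3101251.sam: redirect standard output to SRR3101251.sam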
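The RPM normalisation later in this walkthrough needs a total read count. One way to get it, sketched below, is to count the SAM records whose FLAG field does not have the 0x4 (unmapped) bit set. Whether the count should include only mapped reads is an assumption, and the toy records are hypothetical.

```python
def count_mapped_reads(sam_lines):
    """Count alignment records whose FLAG lacks the 0x4 (unmapped) bit."""
    mapped = 0
    for line in sam_lines:
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        if not flag & 0x4:
            mapped += 1
    return mapped


# Hypothetical SAM content: two mapped reads and one unmapped read.
toy_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "@SQ\tSN:chr1\tLN:249250621",
    "r1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t200\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
print(count_mapped_reads(toy_sam))
```

For a real file, the function can be given an open file handle, for example with open("SRR3101251.sam") as fh: count_mapped_reads(fh).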
5. Call enriched regions (peaks) with MACS
nohup macs14 -t SRR3101251.sam -c SRR3101252.sam --format SAM --name "TE7" --keep-dup 1 --wig --single-profile --space=50 --diag &
nohup macs14 -t SRR3101253.sam -c SRR3101254.sam --format SAM --name "KYSE510" --keep-dup 1 --wig --single-profile --space=50 --diag &
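-t: the ChIP-seq treatment file
-c: the control file
--format SAM: the input files are in SAM format
--name: prefix added to the output file names
--keep-dup 1: how MACS treats duplicate reads at the same chromosomal position; the manual says the default value of 1 performs best
--wig and --space=50: write the wiggle file at the resolution described in the paper:
Wiggle files were generated using read pileups for every 50 base pair bins.
The relevant options are described in the manual as follows: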
--keep-dup=KEEPDUPLICATES
It controls the MACS behavior towards duplicate tags
at the exact same location -- the same coordination
and the same strand. The 'auto' option makes MACS
calculate the maximum tags at the exact same location
based on binomal distribution using 1e-5 as pvalue
cutoff; and the 'all' option keeps every tags. If an
integer is given, at most this number of tags will be
kept at the same location. Default: 1. To only keep
one performs the best in terms of detecting enriched
regions, from our internal study.
--bw=BW Band width. This value is only used while building the
shifting model. DEFAULT: 300
-g GSIZE, --gsize=GSIZE (defaults to human, so it does not need to be set here)
Effective genome size. It can be 1.0e+9 or 1000000000,
or shortcuts:'hs' for human (2.7e9), 'mm' for mouse
(1.87e9), 'ce' for C. elegans (9e7) and 'dm' for
fruitfly (1.2e8), Default:hs
-w, --wig Whether or not to save extended fragment pileup at
every WIGEXTEND bps into a wiggle file. When --single-
profile is on, only one file for the whole genome is
saved. WARNING: this process is time/space consuming!!
-B, --bdg Whether or not to save extended fragment pileup at
every bp into a bedGraph file. When it's on, -w,
--space and --call-subpeaks will be ignored. When
--single-profile is on, only one file for the whole
genome is saved. WARNING: this process is time/space
consuming!!
-S, --single-profile When set, a single wiggle file will be saved for
treatment and input. Default: False
--space=SPACE The resoluation for saving wiggle files, by default,
MACS will save the raw tag count every 10 bps. Usable
only with '--wig' option.
The MACS output files are described in the manual as follows:
Output files
- NAME_peaks.xls is a tabular file which contains information about called peaks. You can open it in excel and sort/filter using excel functions. Information include: chromosome name, start position of peak, end position of peak, length of peak region, peak summit position related to the start position of peak region, number of tags in peak region, -10*log10(pvalue) for the peak region (e.g. pvalue =1e-10, then this value should be 100), fold enrichment for this region against random Poisson distribution with local lambda, FDR in percentage. Coordinates in XLS is 1-based which is different with BED format.
- NAME_peaks.bed is BED format file which contains the peak locations. You can load it to UCSC genome browser or Affymetrix IGB software.
- NAME_summits.bed is in BED format, which contains the peak summits locations for every peaks. The 5th column in this file is the summit height of fragment pileup. If you want to find the motifs at the binding sites, this file is recommended.
- NAME_negative_peaks.xls is a tabular file which contains information about negative peaks. Negative peaks are called by swapping the ChIP-seq and control channel.
- NAME_model.r is an R script which you can use to produce a PDF image about the model based on your data. Load it to R by the following command. Then a pdf file NAME_model.pdf will be generated in your current directory. Note, R is required to draw this figure:
$ R --vanilla < NAME_model.r
- NAME_treat/control_afterfiting.wig.gz files in NAME_MACS_wiggle directory are wiggle format files which can be imported to UCSC genome browser/GMOD/Affy IGB. The .bdg.gz files are in bedGraph format which can also be imported to UCSC genome browser or be converted into even smaller bigWig files.
- NAME_diag.xls is the diagnosis report. First column is for various fold_enrichment ranges; the second column is number of peaks for that fc range; after 3rd columns are the percentage of peaks covered after sampling 90%, 80%, 70% ... and 20% of the total tags.
- NAME_peaks.subpeaks.bed is a text file which IS NOT in BED format. This file is generated by PeakSplitter when the --call-subpeaks option is set.
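6. Write a program to normalise the wig files
A Python program is used for this step.
It computes RPM (reads per million) for the TE7_H3K27Ac and KYSE510_H3K27Ac wig files, i.e. the wig files in the treat folder generated by MACS.
RPM formula: (number of reads at a position ÷ total number of reads across all chromosomes) × 1,000,000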
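The original program is not reproduced here, so the following is only a minimal sketch of how the RPM formula could be applied to a wiggle file. It assumes the standard wiggle layout (track and variableStep/fixedStep declaration lines, followed by data lines) and that the total read count is supplied by the caller. The function name and the four-decimal output format are choices made for this sketch.

```python
def normalise_wig_to_rpm(lines, total_reads):
    """Yield wig lines with every data value rescaled to reads per million."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    scale = 1000000.0 / total_reads
    declarations = ("track", "browser", "variableStep", "fixedStep")
    for line in lines:
        stripped = line.strip()
        fields = stripped.split()
        # Blank, comment and declaration lines pass through unchanged.
        if not fields or stripped.startswith("#") or fields[0] in declarations:
            yield stripped
        elif len(fields) == 2:
            # variableStep data line: position and value.
            yield "{}\t{:.4f}".format(fields[0], float(fields[1]) * scale)
        else:
            # fixedStep data line: value only.
            yield "{:.4f}".format(float(fields[0]) * scale)


# Hypothetical wiggle content and read total, for illustration only.
toy_wig = [
    "track type=wiggle_0 name=toy",
    "variableStep chrom=chr1 span=50",
    "1\t4",
    "51\t10",
]
normalised = list(normalise_wig_to_rpm(toy_wig, total_reads=2000000))
print("\n".join(normalised))
```

With 2,000,000 total reads the scale factor is 0.5, so a bin value of 4 becomes 2 RPM. For the real files the lines would be read from, and written back to, the wig files in the MACS treat folder.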
7. Convert the format with wigToBigWig
Download the fetchChromSizes program:
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64.v287/fetchChromSizes
chmod +x fetchChromSizes
Fetch the chromosome sizes for the hg19 genome, which wigToBigWig needs:
fetchChromSizes hg19 > hg19.chrom.sizes
Then change the chrM value in hg19.chrom.sizes to 16750.
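The chrM edit can be done by hand in a text editor. As a scripted alternative, here is a small sketch that rewrites one chromosome's size in the two-column chrom.sizes format. The function name is hypothetical, and the example input sizes are only illustrative.

```python
def set_chrom_size(lines, chrom, new_size):
    """Return chrom.sizes lines with the size of `chrom` replaced by `new_size`."""
    out = []
    for line in lines:
        fields = line.split()
        if len(fields) == 2 and fields[0] == chrom:
            out.append("{}\t{}".format(chrom, new_size))
        else:
            out.append(line.rstrip("\n"))
    return out


# Illustrative input; the real file comes from fetchChromSizes.
toy_sizes = ["chr1\t249250621\n", "chrM\t16571\n"]
updated = set_chrom_size(toy_sizes, "chrM", 16750)
print("\n".join(updated))
```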
Download the wigToBigWig program:
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64.v287/wigToBigWig
chmod +x wigToBigWig
Convert the wiggle file to bigWig:
wigToBigWig KYSE510_control_afterfiting_all.wig hg19.chrom.sizes KYSE510_control_afterfiting_all.bw
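8. Install IGV (Integrative Genomics Viewer) to visualise the results
Download the Windows version from the IGV website: http://software.broadinstitute.org/software/igv/download
Follow the prompts to install it.
Start it by opening igv.jar directly, or by running the .bat file as administrator.
First load the hg19 genome, then load the two normalised bw files.
9. Visualise with deepTools
Installation. Requirements: Python 2.7, numpy, scipy installed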
Commands:
$ cd ~
$ export PYTHONPATH=$PYTHONPATH:~/lib/python2.7/site-packages
$ export PATH=$PATH:~/bin:~/.local/bin
On Ubuntu, pip can be installed with apt instead.
If pip is not already available, install it with:
$ easy_install --prefix=~ pip
Many problems can come up at this step; searching for the error message usually resolves them.
Install deepTools and dependencies with pip:
$ pip install --user deeptools
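10. Install ROSE to identify super-enhancers
ROSE can be downloaded from http://younglab.wi.mit.edu/super_enhancer_code.html, where 2.7 GB of example data is also available.
Data preprocessing
(1) Install samtools and convert the SAM files to BAM. This step is needed for all four files: SRR3101251.sam, SRR3101252.sam, SRR3101253.sam and SRR3101254.sam.
Convert SAM to BAM and sort (this is the older samtools syntax; samtools 1.3 and later expects samtools sort -o SRR3101251_sorted.bam - instead):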
samtools view -bS SRR3101251.sam | samtools sort - SRR3101251_sorted
Index the BAM file:
samtools index SRR3101251_sorted.bam SRR3101251_sorted.bai
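(2) Prepare the gff file that gives the peak positions (note: this gff file is not a gene annotation file).
NAME_peaks.bed and NAME_summits.bed are the MACS output files that store peak positions. NAME_summits.bed holds only the summit positions, so the required information is extracted from NAME_peaks.bed: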
awk '{print $1"\t"$4"\t"".""\t"$2"\t"$3"\t"".""\t"".""\t"".""\t"$4}' KYSE510_peaks.bed>KYSE510_peaks.gff
awk '{print $1"\t"$4"\t"".""\t"$2"\t"$3"\t"".""\t"".""\t"".""\t"$4}' TE7_peaks.bed>TE7_peaks.gff
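Alternatively, the MACS14 output files TE7_peaks.bed and KYSE510_peaks.bed can be passed directly as the gff input; ROSE converts them automatically.
Note: the ROSE manual describes the gff file as follows: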
.gff file of constituent enhancers previously identified (gff format ref: https://genome.ucsc.edu/FAQ/FAQformat.html#format3).
.gff must have the following columns:
1: chromosome (chr#)
2: unique ID for each constituent enhancer region
4: start of constituent
5: end of constituent
7: strand (+,-,.)
9: unique ID for each constituent enhancer region
NOTE: if value for column 2 and 9 differ, value in column 2 will be used
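The awk commands above fill exactly the columns this format requires. For readers who prefer Python, a minimal sketch of the same column mapping follows. The function name is hypothetical. Like the awk version, it copies the BED coordinates unchanged and puts "." in the strand column.

```python
def bed_to_rose_gff(bed_line):
    """Map one BED line (chrom, start, end, name, ...) to a ROSE-style gff line."""
    chrom, start, end, name = bed_line.rstrip("\n").split("\t")[:4]
    # gff columns: 1 chrom, 2 ID, 3 '.', 4 start, 5 end, 6 '.', 7 strand, 8 '.', 9 ID
    return "\t".join([chrom, name, ".", start, end, ".", ".", ".", name])


# Hypothetical BED record, for illustration only.
example = bed_to_rose_gff("chr1\t100\t500\tMACS_peak_1\t50.0\n")
print(example)
```

Applied line by line to TE7_peaks.bed or KYSE510_peaks.bed, this should give the same output as the awk one-liners.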
Running ROSE
The paper states: SEs were identified using ROSE (https://bitbucket.org/youngcomputation/rose). Closely spaced peaks (except those within 2 kb of TSS) within a range of 12.5 kb were merged together, followed by the measurement of input and H3K27Ac signals. These merged peaks were ranked by H3K27Ac signal and then classified into SEs or TEs. Both SEs and TEs were assigned to the nearest Ensembl genes.
nohup python ROSE_main.py -g HG19 -i TE7_peaks.gff -r SRR3101251_sorted.bam -c SRR3101252_sorted.bam -o 5-ROSE-result/TE7/ -s 12500 -t 2000 2>5-ROSE-result/TE7/log.txt &
nohup python ROSE_main.py -g HG19 -i KYSE510_peaks.gff -r SRR3101253_sorted.bam -c SRR3101254_sorted.bam -o 5-ROSE-result/KYSE510/ -s 12500 -t 2000 2>5-ROSE-result/KYSE510/log.txt &
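-g HG19: the genome build; HG19 is the one to use here
-i: the gff file
-r: the BAM file of the experimental (ChIP) sample
-c: the BAM file of the control sample
-o: the output directory
-s 12500: stitch together peaks that lie within 12,500 bp of each other
-t 2000: exclude regions within 2,000 bp of a TSS (transcription start site), to account for promoters
From the manual: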
From within root directory:
python ROSE_main.py -g GENOME_BUILD -i INPUT_CONSTITUENT_GFF -r RANKING_BAM -o OUTPUT_DIRECTORY [optional: -s STITCHING_DISTANCE -t TSS_EXCLUSION_ZONE_SIZE -c CONTROL_BAM]
Required parameters:
GENOME_BUILD: one of hg18, hg19, mm8, mm9, or mm10 referring to the UCSC genome build used for read mapping
INPUT_CONSTITUENT_GFF: .gff file (described above) of regions that were previously calculated to be enhancers. I.e. Med1-enriched regions identified using MACS.
RANKING_BAM: .bam file to be used for ranking enhancers by density of this factor. I.e. Med1 ChIP-Seq reads.
OUTPUT_DIRECTORY: directory to be used for storing output.
Optional parameters:
STITCHING_DISTANCE: maximum distance between two regions that will be stitched together (Default: 12.5kb)
TSS_EXCLUSION_ZONE_SIZE: exclude regions contained within +/- this distance from TSS in order to account for promoter biases (Default: 0; recommended if used: 2500). If this value is 0, will not look for a gene file.
CONTROL_BAM: .bam file to be used as a control. Subtracted from the density of the RANKING_BAM. I.e. Whole cell extract reads.
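To illustrate what the stitching distance means, here is a simplified sketch that merges regions on the same chromosome when the gap between them is at most the given distance. It is not ROSE's implementation: it ignores the TSS exclusion zone, and the function name and coordinates are hypothetical.

```python
def stitch_regions(regions, max_gap=12500):
    """Merge (chrom, start, end) regions whose gap is at most max_gap bp."""
    stitched = []
    for chrom, start, end in sorted(regions):
        last = stitched[-1] if stitched else None
        if last and last[0] == chrom and start - last[2] <= max_gap:
            # Close enough to the previous region: extend it.
            stitched[-1] = (chrom, last[1], max(last[2], end))
        else:
            stitched.append((chrom, start, end))
    return stitched


# Hypothetical peak coordinates, for illustration only.
peaks = [
    ("chr1", 1000, 2000),
    ("chr1", 10000, 11000),   # 8,000 bp after the first peak: stitched
    ("chr1", 30000, 31000),   # 19,000 bp after the stitched region: kept separate
    ("chr2", 500, 900),
]
stitched = stitch_regions(peaks)
print(stitched)
```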
Gene annotation
ROSE includes a program for annotating enhancers:
-i: the file produced by ROSE_main.py that stores the enhancers and super-enhancers, AllEnhancers.table.txt
-g: the genome name
-o: the output directory
python ROSE_geneMapper.py -i KYSE510_peaks_AllEnhancers.table.txt -g HG19 -o 5-ROSE-result/annotation_KYSE510/
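Interpreting the ROSE results
(1) TE7_peaks_AllEnhancers.table.txt and KYSE510_peaks_AllEnhancers.table.txt
(2) TE7_peaks_Plot_points.png and KYSE510_peaks_Plot_points.png
The x-axis is the enhancer rank, i.e. the enhancerRank column of AllEnhancers.table.txt; the higher an enhancer ranks, the more likely it is a super-enhancer.
The y-axis is the signal; the higher the signal, the more likely the region is a super-enhancer. It is computed directly from AllEnhancers.table.txt as the SRR3101251_sorted.bam column minus the SRR3101252_sorted.bam column.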
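The relationship between the two axes can be sketched in a few lines: subtract the control column from the ranking column, then rank the regions by the result. Flooring negative values at zero is an assumption of this sketch, and the region IDs and densities are hypothetical rather than taken from the actual tables.

```python
def rank_by_signal(rows):
    """rows: (region_id, ranking_density, control_density) tuples.

    Returns (rank, region_id, signal) tuples, rank 1 being the strongest signal.
    Negative control-subtracted values are floored at zero (an assumption).
    """
    scored = [(rid, max(ranking - control, 0.0)) for rid, ranking, control in rows]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(rank, rid, signal) for rank, (rid, signal) in enumerate(scored, start=1)]


# Hypothetical densities, for illustration only.
toy_rows = [("e1", 100.0, 20.0), ("e2", 500.0, 50.0), ("e3", 10.0, 30.0)]
ranked = rank_by_signal(toy_rows)
for rank, region_id, signal in ranked:
    print(rank, region_id, signal)
```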
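Download and install GSEA
GSEA is available as a desktop application that runs on Java; it can be downloaded after registering.
Comparison with the paper's results
1. Statistical comparison of the wig files after normalisation
Roughly, my values are 7 to 8 times smaller overall than those in the paper.
2. Visual comparison in IGV
The peaks broadly match those in the paper.
PS: Questions and explanations
1.
Question: as shown in the figure below, clicking iGenomes in the sequence-download section of the Bowtie website shows both GRCh37 and hg19. Is there any difference between the two?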