GEM peak caller

GEM(Genome wide Event finding and Motif discovery)

2.GPS和GEM两种算法

GPS:

input:ChIP-Seq read data

output:binding event calling

GEM:

input:ChIP-Seq read data、基因组序列

output:de novo motif discovery、binding event calling

加入以下参数可调用GEM算法:

--genome/(--k) or (--k_min and --k_max) or (--seed)


3.Read distributions

GEM/GPS 需要输入read distribution文件,用户可把默认的read distribution(https://groups.csail.mit.edu/cgs/gem/download/Read_Distribution_default.txt)文件作为起始文件。经过一轮预测后,GEM会重新计算read distribution。

--smooth

--mrc

4.Input and output

(1)input

GEM以ChIP-Seq的比对结果和基因组序列作为input,返回可能的结合位点以及motif。比对结果可以是SAM或者BED格式。

如果要进行de novo motif discovery,就要提供基因组序列。详见我踩过的坑1

(2)output

GEM会返回结合位点及motif信息。因为read distribution re-estimation,GEM会返回多轮的event prediction和read distribution文件。

GEM peak caller_第1张图片
output files

软件默认的read distribution文件可能会因为噪点太大而导致发现的结合位点太少。继续运行GEM可能会使结果更糟糕。因此,GEM会保存每一轮的结果文件,这样可以通过检查spatial distributions去选择最好的那一轮的结果。GEM会将各轮结果输出到一张图上(xxx_All_Read_Distributions.png),方便比较。理想结果是靠后的曲线应该更平滑且类似于round 0的结果。如果是这样,那么用户可以使用这些轮的结果。如果不是这种情况,那么我们建议使用round 0的预测结果。曲线不平缓可能是因为结合位点太少导致的。

(1)GEM event text file(GEM_event.txt)

包括以下内容:

* Location: 结合位点的基因坐标

* IP binding strength:结合到该位点的IP reads数目

* Control binding strength: 相应区域结合的control reads数目

Fold: fold enrichment(IP/control)

Expected binding strength:the number of IP read counts expected in the binding region given its local context(defined by parameter W2 or W3),this is used as the Lambda parameter for the Poission test.通过泊松分布检验的最低IP reads数目

Q_-lg10:-log10(q-value),取二项检验和泊松检验中较大的p_value,进行多次检验校正后得到的q-value。

P_lg10:-log10(p-value),(有control存在时)二项检验假设下IP reads和ctrl reads算得的p-value

P_poiss:-log10(p-value), (不考虑control data),泊松检验假设下IP reads和 expected reads算得的p-value

IPvsEMP:Shape deviation, the KL divergence of IP reads from the empirical read distribution (log10(KL)),this is used to filter predicted events given the --sd cutoff(default=-0.4)

Noise: the fraction of the event read count estimated to be noise.(划成noise的标准是什么)

KmerGroup: 该结合位点的k-mers,只显示最显著的一个k-mer,n/n values are the total number of sequence hits of the k-mer group in the positive and negative training sequences(by default total 5000 of each),respectively

KG_hgp:log10(hypergeometric p-value) log10(hypergeometric p-value), the significance of enrichment of this k-mer group in the positive vs negative training sequences (by default total 5000 of each), it is the hypergeometric p-value computed using the pos/neg hit counts and total counts

Strand: the sequence strand that contains the k-mer group match, the orientation of the motif is determined during the GEM motif discovery, '*' represents that no k-mer is found to associated with this event

(2)K-mer set memory motifs(KSM.txt)

Header line1,dataset name

Header line2,KSM version number   

Header line3,eg."#5000/5000",number of positive/negative sequences that were used for learing the motif

Header line4,eg."#3.01", the KSM motif score cutoff, optimized to give best motif enrichment in the training sequences.

我的文件从Headerline3开始

KSM文件各列含义:

k-mer/r.c. k-mer sequence及其reverse complement

cluster:cluster ID of k-mer set(Cluster 0 是primary motif也就是最显著的motif)

Offset:该k-mer相对于seed-kmer的offset

PosCt: 包含该k-mer的postive sequence的数量

wPosCt:weighted PosCt, 用relative sequence weighting来计算。在GEM中,是log(read count)

NegCt:包含该k-mer的negative sequences数目

HgpLg10:HyperGeometric P-value(log10)

PosHits:The IDs of positive sequences that contain this k-mer in bits (base85 encoding)

NegHits:The IDs of negative sequences that contain this k-mer in bits (base85 encoding)

3.PFM file of PWM motifs(PFM.txt)

4.HTML file summarizing the GEM event and motif results

网页解析

(1)motifs in PWM format

每一个motif都对应着一个k-mer set motif(KSM)

Optimal PWM score/Maximum PWM :the score that gives most significant p-value(hgp) when scanning the positive and negative sequences.

Hit:Number of positive sequences containing a match to this PWM/Number of negative sequences containing a match to this PWM

hpg: Hypergeometric p-value of this PWM computed from the pos/neg hit counts given the number of total positive/negative sequences.

the motif spatial distribution plots show the relative positions of each motif PWM match relative to the primary motif PWM match if they are both present win the same positive sequences.

the primary PWM (anchoring motif) is at position 0. The relative position of PWM of interest is plotted.

The color is blue when the two motifs are in the same orientation(as display in the motif logos).The color is red when the motifs are in the opposite orientation.

The number pair represents the position and the counts of secondary motif at that position.For example, in the second plot,(-7,150) means when the primary motif(Sox2) is at position 0, then there are 150 instances of secondary motif(Oct4,reverse compliment orientation) at the -7 position.

Total number of coocurrence of both motif instances.

Thus the second plot says, there are 722 cases of cooccurrence of Sox2 and Oct4 motifs, and 150 of them are detected to have a motif offset of -7.This is consistent with the fact that Sox2 and Oct4 usually bind as heterodimer.

Note: For the motif spatial distribution plot,GEM reports all the motif instances. There could be multiple motif instance on both strands. And the PWM hit is counting the sequences(at most one hit per sequence).Thus the total count in the motif spatial distribution plot may be larger than PWM hit.


GEM peak caller_第2张图片
motif PWM and Motif spatial distribution

5.GEM event text files(significant,insignificant and filtered)

insignificant events: those do not pass the statistical test

filtered events: those would pass the statistical test using the read count, but have a low IP/Ctrl ratio, or the distribution of reads are quite different from the empirical distribution.

三、Command-line options

!!!All the parameters in GEM are specified in double dashes(--)

1. required parameters

--d [path] : the path to the read distribution model file 用软件提供的read distribution文件

--exptX [path]: the path to the aligned reads file for experiment(IP).X is condition name.In multi-condition alignment mode,X is used to specify different conditions.

2.Optional parameters

--ctrlX [path]: the path to the aligned reads file for control.X should match the condition name in --exptX

--g [path] : the path to a genome information file(genome.chrom.size.file)

两个点:a. tab-delimited b.染色体名需与aligned reads file 以及 基因组序列中的保持一致

--f[BED|SAM|BOWTIE|ELAND|NOVO]:SAM options allows SAM or BAM file input(default = BED)

--s [n]:The size of uniquely mappable genome.It depends on the genome and the read length. A good estimate is genome size*0.8.If it is not supplied, it will be estimated using the genome information

--genome [path_to_folder]: the path to genome sequence folder, which contains Fasta files of individual chromosomes.

--k [n]: the width of the k-mers

--k_min [n] --k_max [n]: minimum and maximum value of k

--seed [k-mer]:the seed k-mer to jump start k-mer set discovery. Exact k-mer sequence only.The width of the seed k-mer will be used to set k.

--k_seqs [n]:the number of top ranking events to get sequences for motif discovery(default=5000)

--pp_nmotifs [n]:the max number of top ranking motifs to set the motif-based positional prior(default=1)

--k_win [n]:the sequence window size around the binding event for motif discovery(default=61bp)

--strand_type [n]:Double-strand or single-strand binding event calling and motif discovery.0 for double-strand,1 for single-strand(default =0)

--nd [n]:noise distribution model,0 for no noise model,1 for uniform noise model,2 for smoothed control data specified by --ctrlX(default=1)

--min [n]: minimum number of events called by GPS/GEM to learn a read distribution and   continue the next round(default=500).Note that with less number of binding events, the read distribution might be more noisy.You can try increase --smooth parameter.

--fold [value]:Fold(IP/Control) cutoff to filter predicted events(default=3)

--icr [value]:IP/Control Ratio.By default, this ratio is estimated from the data using non-specific binding regions.It is important to set this value explicitly for synthetic dataset.

--out [name]:Output folder and file name prefix

--q [value]:significance level for q-value,specified as -log10(q-value).For example, to enforce a q-value threshold of 0.001,set this value to 3(default =2.i.e.q-value=0.01)

--a [value]:minimum alpha value for sparse prior(default is estimated by mean whole genome read coverage)

--af [value]:a constant to scale alpha value with read count(default=3).A smaller af value will give a larger alpha value.

--sd [value]:Shape deviation cutoff to filter predicted events(default = -0.4)

--smooth [n]:the width(bp) to smooth the read distribution.It it is set to -1,there will be no smoothing(default=30)

--subs [region strings]: subset of genome regions to be analyzed.The string can be in the format of "chr:start-end" or "chr", or both.For example,"1:003-1004 2 X".

--ex [region file]:file that contains subset of genome regions to be excluded.Each line contains a region in "chr:start-end" format

--t [n]:Number of threads to run GEM in parallel.It is suggested to be equal to or less than the physical CPU number on the computer

--top [n]: Number of ranked GEM events to be used for re-estimating the read distribution(default=2000).note that GEM only re-estimate when there are more than 500 significant events called.

--w2 [n]:Size of sliding window to estimate lambda parameter for Possion distribution when there is no ctrl data(default=5000, must be larger than 1000)

--w3 [n]:size of sliding window to estimate lambda parameter for Possion distribution when there is no ctrl data(default=10000,must be larger than w2)

这俩参数啥区别啊

3.Optional flags:

--poisson_control :compute Possion p-value of binding event using control data (default is to use ChIP data)

--k_neg_dinu_shuffle: Use di-nucleotide shuffled sequences as negative sequences for motif finding (default is to use flanking sequences)

--bp:use Branch-seq data specific settings

--pp_pwm: Use PWM motif to set the motif-based positional prior (default is to use the KSM motif model)

--bf: Depreciated .Base filtering is done by Poisson filtering by default.

--fa: GEM will use a fixed user-specified alpha value for all the regions

--help: Print help information and exit

--multi:Depreciated. To run GEM in multi-condition mode, you only need to specify data for conditions X and Y using --exptX and --exptY, etc.

--nf: Do not filter predicted events (default is to perform event filtering by shape and fold)

--nrf: Do not filter duplicate reads (i.e. potential PCR duplicates) (default is to apply filtering considering the read counts at its neighboring bases)

--relax: Use a more relaxed cutoff to include more weak binding calls. This is designed to provide more weak sites for analysis such as IDR analysis (default is more stringent cutoff)

--outNP: Output binding events in ENCODE NarrowPeak format (default is NO narrowPeak file output)

--outBED: Output binding events in BED format for UCSC Genome Browser (default is NO BED file output)

--outJASPAR: Output motif PFM in JASPAR format

--outMEME: Output motif PFM in MEME format

--outHOMER: Output motif PFM in HOMER format

--sl: Sort GEM output by location (default is sorted by P-value)

--print_bound_seqs: Output the "bound sequences" column in the GEM_event.txt file

我踩过的坑:


1. --genome and --g

--genome 指定基因组文件们所在的目录。所有的基因组文件必须以 chr*.fasta(.fa)的方式来命名。并且染色体名字要与genome.chrom.size中的名字相匹配。!!同时要跟bam文件中的染色体名字相匹配。

2.--exptX --ctrlX

X 一定要一致,否则输出结果中ctrl一栏是NA

你可能感兴趣的:(GEM peak caller)