Position-correlation scoring feature (PCSF)

Contents

  • PCSF
    • Source 1
    • Source 2

PCSF

Position-correlation scoring feature (PCSF)

Source 1

2010-11_Theory in Biosciences_Eukaryotic and prokaryotic promoter prediction using hybrid approach

https://link.springer.com/article/10.1007/s12064-010-0114-8

Original text:

The PWM can be constructed by counting the frequencies of oligonucleotides in conserved sites of training sequences. The probability $p_{xi}$ of an oligonucleotide $x$ at the $i$th site can be formulated as (Li and Lin 2006; Wasserman and Sandelin 2004; Kielbasa et al. 2005):
$$p_{xi}=\frac{n_{xi}+b_{xi}}{N_i+B_i} \tag{2}$$

where $n_{xi}$ and $b_{xi}$ are real counts and pseudocounts of k-mer oligonucleotide $x$ at the $i$th site, respectively. $N_i$ and $B_i$ are total number of real counts and pseudocounts at the $i$th site, respectively. If there are relatively few real counts, many k-mer variations may not be presented because of the small sample of sequences. The goal of adding pseudocounts is to obtain an improved estimate of the probability $p_{xi}$ of k-mer oligonucleotide $x$ at the $i$th site. A relatively few pseudocounts should be added when there is a good sampling of sequences, and more pseudocounts should be added when the data is sparser. One simple formula that has worked well in some studies is to make $B_i$ equal to $\sqrt{N_i}$ and $b_{xi}$ equal to $p_0\sqrt{N_i}$ ($p_0$ is the average background frequency) in Eq. 2 (Wasserman and Sandelin 2004; Kielbasa et al. 2005), respectively. As $N_i$ increase, the influence of pseudocounts decrease because $\sqrt{N_i}$ increase more slowly. Due to the existence of pseudocounts, the estimated probabilities are strictly positive (Kielbasa et al. 2005). Based on the probabilities $p_{xi}$, the PCSF of an arbitrary sequence can be defined as (Li and Lin 2006):
$$F=\sum_i \ln(p_{xi}/p_0) \tag{3}$$

where $p_0$ is average background probability of k-mer. The score F shows the degree of sequence closed to matrix resource.
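
Eq. 2 and Eq. 3 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names `build_pwm` and `score_F`, the toy training sequences, and the choice of a uniform background $p_0 = 1/4^k$ are assumptions made here for the example. It also assumes the training sequences are already aligned and of equal length.

```python
import math
from collections import Counter
from itertools import product

def build_pwm(seqs, k=1, alphabet="ACGT"):
    """Estimate p_xi (Eq. 2) for every k-mer x at every start position i.

    Pseudocounts use the rule quoted above:
    B_i = sqrt(N_i) and b_xi = p_0 * sqrt(N_i).
    """
    p0 = 1.0 / len(alphabet) ** k            # uniform background (assumption)
    kmers = ["".join(t) for t in product(alphabet, repeat=k)]
    pwm = []
    for i in range(len(seqs[0]) - k + 1):
        counts = Counter(s[i:i + k] for s in seqs)   # real counts n_xi
        N_i = len(seqs)                              # total real counts at site i
        B_i = math.sqrt(N_i)                         # total pseudocounts at site i
        pwm.append({x: (counts[x] + p0 * B_i) / (N_i + B_i) for x in kmers})
    return pwm, p0

def score_F(seq, pwm, p0, k=1):
    """Eq. 3: sum over sites of ln(p_xi / p_0)."""
    return sum(math.log(pwm[i][seq[i:i + k]] / p0) for i in range(len(pwm)))

# Toy, hypothetical training set (not taken from either paper).
training = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
pwm, p0 = build_pwm(training)
print(score_F("TATAAT", pwm, p0))   # close to the training set: positive score
print(score_F("GCGCGC", pwm, p0))   # far from the training set: negative score
```

Because every k-mer receives the pseudocount $p_0\sqrt{N_i}$, each column of the matrix still sums to 1 and no probability is zero, so the logarithm in Eq. 3 is always defined.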

Source 2

2019-09_Mol Ther-Nucleic Acids_iProEP: A Computational Predictor for Predicting Promoter

https://www.sciencedirect.com/science/article/pii/S2162253119301611

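By aligning the promoter sequences of each species, a position-correlation scoring matrix (PCSM) can be constructed. Each row of the PCSM consists of the factors $p_{xi}$, where $p_{xi}$ is the probability of k-mer $x$ at the $i$th position of the promoter samples. $p_{xi}$ can be calculated as:
$$p_{xi}=\frac{n_{xi}+b_{xi}}{N_i+B_i}$$
where $n_{xi}$ is the real count of $x$ at the $i$th position and $b_{xi}$ is the corresponding pseudocount. $N_i$ is the sum of the real counts of all k-mers at the $i$th position (that is, the number of positive samples), and $B_i$ is the corresponding sum of pseudocounts. If the sample size is not large enough, some k-mers will be absent as $k$ increases, so pseudocounts improve the estimate of the probability $p_{xi}$ of k-mer $x$ at the $i$th position. $B_i$ and $b_{xi}$ can be given by:
$$\begin{aligned} B_i &= \sqrt{N_i},\\ b_{xi} &= p_o\sqrt{N_i} \end{aligned}$$
where $p_o$ is the background frequency of a k-mer, equal to $1/4^k$. As the number of samples $N_i$ increases, the influence of the pseudocounts weakens because $\sqrt{N_i}$ grows more slowly than $N_i$.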
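
To make the shrinking influence of the pseudocounts concrete, the short Python sketch below computes the share of the denominator $N_i+B_i$ that comes from $B_i=\sqrt{N_i}$, along with the probability assigned to a k-mer that was never observed ($n_{xi}=0$). The helper names `pseudocount_weight` and `unseen_kmer_prob` and the sample sizes are illustrative assumptions, not part of either paper.

```python
import math

def pseudocount_weight(N_i):
    """Fraction of the denominator N_i + B_i contributed by B_i = sqrt(N_i)."""
    B_i = math.sqrt(N_i)
    return B_i / (N_i + B_i)

def unseen_kmer_prob(N_i, k=3):
    """p_xi for a k-mer with n_xi = 0, using p_o = 1/4**k and b_xi = p_o * sqrt(N_i)."""
    p_o = 1.0 / 4 ** k
    B_i = math.sqrt(N_i)
    return (0 + p_o * B_i) / (N_i + B_i)

# Hypothetical sample sizes, chosen only to show the trend.
for N in (10, 100, 1000, 10000):
    print(N, round(pseudocount_weight(N), 4), f"{unseen_kmer_prob(N):.2e}")
```

The weight simplifies to $1/(1+\sqrt{N_i})$, so with 100 samples the pseudocounts make up $10/110$ of the denominator, and the share keeps falling as more sequences are added.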
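Following the extensive conservation analysis and ACC evaluation of Lin and Li, a number of conserved trimer sites were selected for the five species. Based on these sites and the PCSM, the PCSF of the positive and negative samples of the five species can be expressed as:
$$PCSF=[f_1\ f_2\ \cdots\ f_i\ \cdots\ f_n]$$
where $n$ is the number of selected conserved sites and each element is defined as:
$$f_i=\ln(p_{xi}/p_o)$$
In this equation, $p_o$ is the background probability of each trimer ($p_o=1/4^3$), and $p_{xi}$ is obtained from the PCSM.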
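
The PCSF vector described above could be computed along the following lines. This is a rough sketch under stated assumptions: the function names (`build_pcsm`, `p_xi`, `pcsf`), the toy promoter sequences, and the site indices in `sites` are all hypothetical, and the conserved-site selection step (conservation analysis and ACC evaluation) is not reproduced here.

```python
import math
from collections import Counter

K = 3
P_O = 1.0 / 4 ** K      # background probability of a trimer, 1/4^3

def build_pcsm(promoters, k=K):
    """Count k-mers at each start position of aligned, equal-length promoters."""
    n_pos = len(promoters[0]) - k + 1
    counts = [Counter(s[i:i + k] for s in promoters) for i in range(n_pos)]
    return counts, len(promoters)          # per-site counts n_xi, and N_i

def p_xi(pcsm, N, i, x):
    """Probability of k-mer x at site i with B_i = sqrt(N_i), b_xi = p_o * sqrt(N_i)."""
    B = math.sqrt(N)
    return (pcsm[i][x] + P_O * B) / (N + B)

def pcsf(seq, pcsm, N, sites, k=K):
    """Feature vector [f_1 ... f_n] with f_i = ln(p_xi / p_o) at the selected sites."""
    return [math.log(p_xi(pcsm, N, i, seq[i:i + k]) / P_O) for i in sites]

# Toy positive samples and hypothetical conserved sites (0-based start positions).
promoters = ["TTGACATATAAT", "TTGACTTATAAT", "TTGACATATGAT"]
sites = [0, 7]
pcsm, N = build_pcsm(promoters)
features = pcsf("TTGACATATAAT", pcsm, N, sites)
print(features)
```

The same `pcsf` call is applied to positive and negative samples alike. A trimer never seen at a site gets $f_i=-\ln(1+\sqrt{N_i})$, so unfamiliar sequences produce negative entries.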
