F-Score for Feature Selection, with a Python Implementation

The F-score (a feature-scoring criterion, not to be confused with the F1 score used for model evaluation) measures how well a feature discriminates between two classes, which makes it a simple but effective tool for feature selection. It was originally proposed by Yi-Wei Chen of National Taiwan University (see "Combining SVMs with Various Feature Selection Strategies"); the formula is as follows:

$$F(i) = \frac{(\bar{x}_i^{(+)} - \bar{x}_i)^2 + (\bar{x}_i^{(-)} - \bar{x}_i)^2}{\dfrac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \dfrac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$

Here i indexes the features, so every feature gets its own F-score. The plain x-bar is the mean of feature i over all samples, while the (+) and (-) superscripts denote the same quantities computed over only the positive or only the negative samples. The index k runs over the individual instances of feature i within each class, so the two sums in the denominator are essentially the within-class variances of the feature on the positive and negative samples. The larger the F-score, the more discriminative the feature.
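To make the formula concrete, here is a hand computation for a single feature with two positive and two negative samples (the values are made up purely for illustration):

```python
# One feature: values on positive samples vs. values on negative samples.
pos = [1.0, 2.0]          # x_k,i(+), so n_+ = 2
neg = [3.0, 4.0]          # x_k,i(-), so n_- = 2

mean_all = sum(pos + neg) / (len(pos) + len(neg))   # x-bar     = 2.5
mean_pos = sum(pos) / len(pos)                      # x-bar(+)  = 1.5
mean_neg = sum(neg) / len(neg)                      # x-bar(-)  = 3.5

# Between-class spread of the class means around the overall mean.
numerator = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2

# Within-class spread: sample variance of the feature inside each class.
denominator = (sum((x - mean_pos) ** 2 for x in pos) / (len(pos) - 1)
               + sum((x - mean_neg) ** 2 for x in neg) / (len(neg) - 1))

print(numerator / denominator)  # 2.0
```

Here the class means (1.5 and 3.5) sit far from the overall mean relative to the within-class variance, so the feature scores well.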

Below is a Python script that computes F-scores. In practice the input is usually a matrix holding the feature values of every sample plus a list of class labels, and the desired output is one F-score per feature, so the script takes the feature matrix and the label list as its two arguments and returns a list of F-scores.

Xie Juanying et al. later improved the F-score and generalized it to the multiclass case (see their paper); the formula is:

$$F(i) = \frac{\sum_{l=1}^{L}\left(\bar{x}_i^{(l)} - \bar{x}_i\right)^2}{\sum_{l=1}^{L}\dfrac{1}{n_l - 1}\sum_{k=1}^{n_l}\left(x_{k,i}^{(l)} - \bar{x}_i^{(l)}\right)^2}$$

It is easy to see that the original F-score is the special case of the improved version with l = 2, the two classes being the positive and negative ones (usually encoded as 0 and 1).
Since the original F-score is probably the more commonly used of the two, only the script for the original version is shared here. I will likely implement the improved version as well for my own research; leave a comment if you need it~
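Based on my reading of the generalized formula above, a multiclass version could be sketched roughly as follows (the function name, the NumPy dependency, and the vectorization are my own choices, not from the paper):

```python
import numpy as np

def fscore_multiclass(X, y):
    """Sketch of the generalized (multiclass) F-score: sum the between-class
    and within-class terms over every class label l instead of just +/-.
    X: samples x features matrix; y: class labels (any hashable values)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    grand_mean = X.mean(axis=0)                  # x-bar, per feature
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for label in np.unique(y):
        Xl = X[y == label]                       # samples of class l
        num += (Xl.mean(axis=0) - grand_mean) ** 2
        # var(ddof=1) is exactly (1/(n_l - 1)) * sum of squared deviations
        den += Xl.var(axis=0, ddof=1)
    return num / den
```

With exactly two classes this reduces to the original binary F-score.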

def fscore_core(np,nn,xb,xbp,xbn,xkp,xkn):
    '''
    np: number of positive instances
    nn: number of negative instances
    xb: list of the average of each feature over all instances
    xbp: list of the average of each feature over the positive instances
    xbn: list of the average of each feature over the negative instances
    xkp: per-feature lists of the feature's values on the positive instances
    xkn: per-feature lists of the feature's values on the negative instances
    reference: http://link.springer.com/chapter/10.1007/978-3-540-35488-8_13
    '''

    def sigmap(i,np,xbp,xkp):
        # sum of squared deviations within the positive class
        return sum([(xkp[i][k]-xbp[i])**2 for k in range(np)])

    def sigman(i,nn,xbn,xkn):
        # sum of squared deviations within the negative class
        return sum([(xkn[i][k]-xbn[i])**2 for k in range(nn)])

    n_feature = len(xb)
    fscores = []
    for i in range(n_feature):
        fscore_numerator = (xbp[i]-xb[i])**2 + (xbn[i]-xb[i])**2
        fscore_denominator = (1/float(np-1))*(sigmap(i,np,xbp,xkp))+ \
                             (1/float(nn-1))*(sigman(i,nn,xbn,xkn))
        fscores.append(fscore_numerator/fscore_denominator)

    return fscores

def fscore(feature,classindex):
    '''
    feature: a matrix whose row indicates instances, col indicates features
    classindex: 1 indicates positive and 0 indicates negative
    '''
    n_instance = len(feature)
    n_feature  = len(feature[0])
    np = sum(classindex)
    nn = n_instance - np
    xkp =[];xkn =[];xbp =[];xbn =[];xb=[]
    for i in range(n_feature):
        xkp_i = [];xkn_i = []
        for k in range(n_instance):
            if classindex[k] == 1:
                xkp_i.append(feature[k][i])
            else:
                xkn_i.append(feature[k][i])
        xkp.append(xkp_i)
        xkn.append(xkn_i)
        sum_xkp_i = sum(xkp_i)
        sum_xkn_i = sum(xkn_i)
        xbp.append(sum_xkp_i/float(np))
        xbn.append(sum_xkn_i/float(nn))
        xb.append((sum_xkp_i+sum_xkn_i)/float(n_instance))
    return fscore_core(np,nn,xb,xbp,xbn,xkp,xkn)
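As a cross-check, the same computation can be condensed into a few lines of NumPy (a sketch; `fscore_numpy` is my own name, and NumPy is assumed to be available):

```python
import numpy as np

def fscore_numpy(X, y):
    """Vectorized equivalent of the loop-based script above.
    X: samples x features matrix; y: labels, 1 = positive, 0 = negative."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    Xp, Xn = X[y == 1], X[y == 0]        # split rows by class
    xb = X.mean(axis=0)                  # overall per-feature means
    num = (Xp.mean(axis=0) - xb) ** 2 + (Xn.mean(axis=0) - xb) ** 2
    # var(ddof=1) is exactly (1/(n - 1)) * sum of squared deviations
    den = Xp.var(axis=0, ddof=1) + Xn.var(axis=0, ddof=1)
    return num / den

print(fscore_numpy([[1, 5], [2, 6], [3, 6], [4, 5]], [1, 1, 0, 0]))  # [2. 0.]
```

The second feature scores 0 because its class means coincide with the overall mean, so it carries no discriminative information.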

**************************************************************************Update 2020-05-03

A MATLAB version of the same computation: it reads the sample matrix from Excel, treats the first half of the rows as one class and the second half as the other, ranks the features by F-score, and writes the result back to Excel.

clear
filepath=('C:\Users\Administrator\Desktop\��������.xlsx');  % (file name garbled in the original post)
file=xlsread(filepath);              % read the sample matrix from Excel
x=file;
[r,c]=size(x);
x1=x(1:r/2,1:end);                   % first half of the rows: class 1
x2=x(r/2+1:r,1:end);                 % second half of the rows: class 2
[n1,m1] = size(x1);
[n2,m2] = size(x2);
aver1 = mean(x1);                    % per-feature means of class 1
aver2 = mean(x2);                    % per-feature means of class 2
aver3 = mean(x);                     % per-feature means of all samples
numerator = (aver1-aver3).^2+(aver2-aver3).^2;
sum_1 = zeros(1,m1);                 % sum of squared deviations, class 1
for k=1:n1
   chazhi_1 = x1(k,:)-aver1;         % deviation from the class-1 mean
   added_1 = chazhi_1 .^2;
   sum_1 = sum_1 + added_1;
end
deno_1 = sum_1/(n1 - 1);
sum_2 = zeros(1,m2);                 % sum of squared deviations, class 2
for k = 1:n2
   chazhi_2 = x2(k,:)-aver2;         % deviation from the class-2 mean
   added_2 = chazhi_2 .^2;
   sum_2 = sum_2 +added_2;
end
deno_2 = sum_2/(n2 - 1);
deno = deno_1 + deno_2;
F_1 = numerator ./ deno;             % per-feature F-scores
len = length(F_1);
for k = 1:len
   if isnan(F_1(k))                  % 0/0 for a feature that is constant
       F_1(k) = -1;                  % within both classes: mark with -1
   end
end
[F_2,ind] = sort(F_1,'descend');     % rank features by F-score
F = [F_2',ind']';                    % row 1: scores, row 2: feature indices
filename=('C:\Users\Administrator\Desktop\��������f_score.xlsx');  % (garbled in the original post)
xlswrite(filename,F);                % write the ranking back to Excel
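For readers without MATLAB, the core of the script above (excluding the Excel I/O) can be sketched in Python like this; `fscore_rank` is a hypothetical name of mine, and NumPy is assumed:

```python
import numpy as np

def fscore_rank(X):
    """Python sketch of the MATLAB script: the first half of the rows is
    class 1, the second half class 2; NaN scores are replaced with -1,
    then features are ranked by F-score in descending order."""
    X = np.asarray(X, dtype=float)
    half = X.shape[0] // 2
    x1, x2 = X[:half], X[half:]
    xb = X.mean(axis=0)
    num = (x1.mean(axis=0) - xb) ** 2 + (x2.mean(axis=0) - xb) ** 2
    den = x1.var(axis=0, ddof=1) + x2.var(axis=0, ddof=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        f = num / den
    f[np.isnan(f)] = -1              # mirror the MATLAB NaN -> -1 replacement
    order = np.argsort(-f)           # descending, like sort(F_1,'descend')
    return f[order], order

# A constant feature yields 0/0, is marked -1, and sinks to the bottom:
scores, order = fscore_rank([[1, 7], [2, 7], [3, 7], [4, 7]])
print(scores, order)  # [ 2. -1.] [0 1]
```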
