This post implements a simple maximum entropy model by following nltk's MaxentClassifier, mainly as a way to understand what the mathematical formulas in the maxent model actually mean in practice.
The maximum entropy model:
How do we compute the probability that a test sample x is assigned class y?
In one sentence: we pass x and y to every feature function to get the feature values (0 or 1), multiply each value by its corresponding weight, and run the results through a softmax.
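Written out, this is the standard softmax form of the maxent model, which is exactly what the one-sentence description above says:

$$P(y \mid x) = \frac{\exp\left(\sum_i w_i f_i(x, y)\right)}{\sum_{y'} \exp\left(\sum_i w_i f_i(x, y')\right)}$$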
This leaves two questions.
1. What exactly is this $f_i(x,y)$, and how do we obtain it?
2. How do we solve for $w_i$?
Let's start with the first question.
$f_i(x,y)$ captures a joint property of x and y; that is, it extracts features from the input and the output together.
It is defined as an indicator function:

$$f_i(x,y) = \begin{cases} 1 & \text{if } (x,y) \text{ satisfies fact } i \\ 0 & \text{otherwise} \end{cases}$$
Take the training sample

x = dict(a=1, b=1, c=1)
y = '1'

When we extract features from it, each feature of x is extracted as a triple: (a feature name of x, that feature's value, y).
Extracting each of x's features this way yields three feature functions:
('a', 1, '1')
('b', 1, '1')
('c', 1, '1')
Code that collects all the feature functions occurring in the training samples:
def maxent_train(train_toks):
    ...
    mapping = {}  # maps (fname, fval, label) -> fid
    for (tok, label) in train_toks:
        for (fname, fval) in tok.items():
            if (fname, fval, label) not in mapping:
                mapping[(fname, fval, label)] = len(mapping)
    ...
In this code, mapping stores all the feature functions, so deciding whether x and y satisfy a given fact reduces to checking whether the triple (a feature name of x, that feature's value, y) can be found in mapping.
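For example, with the three training samples used in __main__ at the end of this post, mapping ends up holding seven feature functions, with ids assigned in insertion order:

mapping = {
    ('a', 1, '1'): 0, ('b', 1, '1'): 1, ('c', 1, '1'): 2,
    ('a', 1, '0'): 3, ('b', 1, '0'): 4, ('c', 0, '0'): 5,
    ('a', 0, '1'): 6,
}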
Now for the second question: how do we find $w_i$?
We solve for it with the GIS (Generalized Iterative Scaling) algorithm; the mathematical derivation is omitted here, and we go straight to the result.
The algorithm runs as follows:
1. Initialize each $w_i$ arbitrarily, usually to 0.
2. Repeat until convergence (or for a fixed number of iterations), updating each weight by

$$w_i \leftarrow w_i + \frac{1}{C} \log \frac{E_{\hat P}[f_i]}{E_P[f_i]}$$

where $E_{\hat P}[f_i] = \sum_{x,y} \hat P(x,y) f_i(x,y)$ is the empirical count of feature $i$ and $E_P[f_i] = \sum_{x,y} \hat P(x) P(y \mid x) f_i(x,y)$ is the count estimated under the current model.
First, let's look at $\hat P(x,y)$ and $\hat P(x)$, the empirical distributions of $(x,y)$ and of $x$ in the training data. The empirical feature counts are computed as follows:
def calculate_empirical_fcount(train_toks, mapping):
    # Sum each feature's value over all (tok, label) pairs in the training data.
    fcount = np.zeros(len(mapping))
    for tok, label in train_toks:
        for (index, val) in encode(tok, label, mapping):
            fcount[index] += val
    return fcount
Here encode(tok, label, mapping), defined in the full listing below, returns the (feature id, feature value) pairs of the features that fire on (tok, label); fcount is an array indexed by feature id whose entries are the summed feature values.
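With the example training data and the mapping shown above, this gives:

empirical_fcount = [1., 2., 2., 1., 1., 1., 1.]
# ('b', 1, '1') and ('c', 1, '1') each fire in two samples; every other feature fires once.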
The implementation of $P(y \mid x)$ is as follows:
def prob(tok, labels, mapping, weights):
    # P(y|x) for every label y: a softmax over the summed weights of the firing features.
    prob_dict = {}
    for label in labels:
        total = 0.0
        for (index, val) in encode(tok, label, mapping):
            total += weights[index] * val
        prob_dict[label] = np.exp(total)
    value_sum = sum(prob_dict.values())
    for label in prob_dict:
        prob_dict[label] /= value_sum
    return prob_dict
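Note that on the first GIS iteration, when all weights are still 0, every label scores $e^0 = 1$ and prob returns the uniform distribution over labels, which is the natural maximum-entropy starting point.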
The implementation of the estimated counts $E_P[f_i] = \sum_{x,y} \hat P(x)\, P(y \mid x)\, f_i(x,y)$, computed here as an unnormalized sum over the training samples:
def calculate_estimated_fcount(train_toks, mapping, labels, weights):
    # Expected feature counts under the current model: for each training x,
    # add P(y|x) * f_i(x, y) for every possible label y.
    fcount = np.zeros(len(mapping))
    for tok, _ in train_toks:  # the gold label is ignored; we sum over all labels
        prob_dict = prob(tok, labels, mapping, weights)
        for label, p in prob_dict.items():
            for (index, val) in encode(tok, label, mapping):
                fcount[index] += p * val
    return fcount
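GIS drives these estimated counts toward the empirical counts computed earlier; at convergence the model's expectation of every feature matches its empirical expectation, which is exactly the constraint the maximum entropy model imposes.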
The complete code is as follows:
import numpy as np


def encode(featureset, label, mapping):
    # Return (feature id, feature value) pairs for the features firing on (featureset, label).
    encoding = []
    for (fname, fval) in featureset.items():
        if (fname, fval, label) in mapping:
            encoding.append((mapping[(fname, fval, label)], 1))
    return encoding


def calculate_empirical_fcount(train_toks, mapping):
    # Empirical feature counts: sum each feature's value over the training data.
    fcount = np.zeros(len(mapping))
    for tok, label in train_toks:
        for (index, val) in encode(tok, label, mapping):
            fcount[index] += val
    return fcount


def prob(tok, labels, mapping, weights):
    # P(y|x): softmax over the summed weights of the firing features.
    prob_dict = {}
    for label in labels:
        total = 0.0
        for (index, val) in encode(tok, label, mapping):
            total += weights[index] * val
        prob_dict[label] = np.exp(total)
    value_sum = sum(prob_dict.values())
    for label in prob_dict:
        prob_dict[label] /= value_sum
    return prob_dict


def calculate_estimated_fcount(train_toks, mapping, labels, weights):
    # Feature counts expected under the current model.
    fcount = np.zeros(len(mapping))
    for tok, _ in train_toks:
        prob_dict = prob(tok, labels, mapping, weights)
        for label, p in prob_dict.items():
            for (index, val) in encode(tok, label, mapping):
                fcount[index] += p * val
    return fcount


def maxent_train(train_toks):
    mapping = {}  # maps (fname, fval, label) -> fid
    labels = set()
    feature_name = set()
    for (tok, label) in train_toks:
        for (fname, fval) in tok.items():
            if (fname, fval, label) not in mapping:
                mapping[(fname, fval, label)] = len(mapping)
            feature_name.add(fname)
        labels.add(label)
    # GIS constant C: an upper bound on how many features can fire for any (x, y).
    C = len(feature_name) + 1
    Cinv = 1 / C
    empirical_fcount = calculate_empirical_fcount(train_toks, mapping)
    weights = np.zeros(len(empirical_fcount))
    for _ in range(100):  # a fixed number of GIS iterations
        estimated_fcount = calculate_estimated_fcount(train_toks, mapping, labels, weights)
        # GIS update: w_i += (1/C) * log(empirical_i / estimated_i).
        # The log is required by GIS and matches nltk's implementation.
        weights += np.log(empirical_fcount / estimated_fcount) * Cinv
    return weights, labels, mapping


if __name__ == '__main__':
    train_data = [
        (dict(a=1, b=1, c=1), '1'),
        (dict(a=1, b=1, c=0), '0'),
        (dict(a=0, b=1, c=1), '1')]
    maxent_train(train_data)
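To actually classify a new sample with the trained model, one can reuse prob() with the values maxent_train returns. A minimal sketch (test_tok is a made-up test sample):

weights, labels, mapping = maxent_train(train_data)
test_tok = dict(a=1, b=1, c=1)  # hypothetical test sample
print(prob(test_tok, labels, mapping, weights))
# On this toy data, label '1' should receive the higher probability,
# since the identical sample appears in train_data with label '1'.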