Preface: the tree kernel is a powerful tool for computing similarity. In NLP, you can analyze a sentence's semantics by parsing it into a syntactic tree and then comparing tree structures to measure how similar different sentences are, which has many applications. Tree kernels have long been implemented in the powerful svm-light tool: just feed it the parse trees of the sentences you need as input and it will classify them, so it can be applied to many kinds of NLP tasks.
# Please credit when reposting: 无限大地nlp_空木 -- Using tree kernels in svm-light
The original page, "Tree Kernels in SVM-light", is itself fairly detailed (just search for "svm tk"); what follows are a few notes based on my own understanding.
#===============================================================#
by Alessandro Moschitti
Syntactic parse trees are among the most useful tools in natural language processing. However, how to use parse trees in NLP tasks remains an open question worth thinking about. For example, when learning models for automatic syntactic disambiguation or for coreference resolution, parse-tree features can be very useful, but designing and selecting those tree features is not easy.
Convolution kernel features (see the literature on kernels for NLP) can stand in for the usual explicit features. A convolution kernel computes the similarity between two parse trees in terms of their shared substructures (see the paper by Collins and Duffy, 2002). These methods achieved state-of-the-art results when exploiting syntactic information for predicate-argument classification.
Suppose we want to compute the similarity between the parse trees of two noun phrases, such as "a dog" and "a cat". Comparing the two trees and the substructures they share reveals the meaning encoded in the syntax, as sketched below.
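In Penn Treebank-style bracket notation (the same notation the input format below uses), the two phrases parse roughly as follows; a tree kernel scores their similarity by the substructures they have in common, here the determiner subtree (DT a) and the shared NP skeleton:

(NP (DT a) (NN dog))        (NP (DT a) (NN cat))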
The kernel over two such instances (each a forest of trees plus a set of vectors) can be combined in two ways:
1. Pairwise summation: the kernel is applied to the i-th tree (or i-th vector) of entity 1 and the i-th tree (or vector) of entity 2, and the results are summed (the "S" mode of the -W/-V options below).
2. Summation over all pairs: the kernel is applied to the cross product, i.e. to every tree and every vector of entity 1 against every tree and every vector of entity 2 (the "A" mode):
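Concretely, in the weighted form implemented by the sample kernel.h listed at the end of this post, the combined kernel between two objects o and o' is

K(o,o') = wt[1]*wt'[1]*TK_s1(t1,t'1) + .. + wt[n]*wt'[n]*TK_sn(tn,t'n)
        + wv[1]*wv'[1]*K_r1(v1,v'1) + .. + wv[m]*wv'[m]*K_rm(vm,v'm)

where wt[i] and wt'[i] are the weights of the i-th trees, TK_si is the tree kernel (ST or SST) applied to them, wv[j] and wv'[j] are the weights of the j-th feature vectors, and K_rj is the standard kernel (-t 0 to 3) applied to them.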
Input format specification:
<line> ::= <target><blank><set-of-vectors> | <target><blank><set-of-trees> | <target><blank><trees-and-vectors>
<set-of-vectors> ::= <vector> | <vector><blank><begin-vector><blank><vector><blank>..<end-vector>
<set-of-trees> ::= <begin-tree><blank><tree><blank>..<begin-tree><blank><tree><blank><end-tree>
<trees-and-vectors> ::= <set-of-trees><blank><set-of-vectors>
<vector> ::= <feature>:<value><blank><feature>:<value><blank>...<blank><feature>:<value> | <blank>
<target> ::= +1 | -1 | 0 | <float>
<feature> ::= <integer> | "qid"
<value> ::= <float>
<begin-tree> ::= "|BT|"
<end-tree> ::= "|ET|"
<begin-vector> ::= "|BV|"
<end-vector> ::= "|EV|"
<tree> ::= <full-tree> | <blank>
<full-tree> ::= (<root><blank><full-tree>..<full-tree>) | (<root><blank><leaf>)
<leaf> ::= <string>
<root> ::= <string>
<blank> ::= " " (i.e. one space)
Explanation:
Each line of the file can take one of three forms:
- <target> set-of-vectors
- <target> set-of-trees
- <target> set-of-trees set-of-vectors
<target> can only take four forms: +1, -1, 0, or a float. A set of trees looks like: |BT| tree1 |BT| tree2 |BT| tree3 ... |BT| treeN |ET|. A tree is either a full tree or a blank. A full tree can be written as:
- <full-tree> = (<root> <full-tree> .. <full-tree>) | (<root> <leaf>) (note: a recursive definition, like right recursion in automata)
- (root (left_node1 (left_node2 (left_leaf leaf) (right_leaf leaf))) (right_node1 (left_leaf leaf) (right_leaf leaf))) (note: a fully expanded example, easier to grasp)
<root> and <leaf> are both <string>, where <string> is any string containing no spaces or parentheses. A set of vectors can take one of two forms:
- a single vector
- vector1 |BV| vector2 |BV| vector3 |BV| ... vectorN |EV|
A vector is written as one of:
- feature1:value1 feature2:value2 feature3:value3 ... featureN:valueN
- a blank (i.e. an empty vector)
<tree> normally uses the Penn Treebank format (http://www.cis.upenn.edu/~treebank/). Note the special marker sequences:
- "..|BT|" or "|BT|.." marks the start of a tree,
- "..|BV|" or "|BV|.." marks the start of a vector,
- a sequence of the form "..|ET||BV|.." indicates that the first vector is empty.
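Putting the pieces together, here is a made-up training line (the label, trees, and feature values are invented purely for illustration) containing two trees followed by two sparse vectors:

+1 |BT| (NP (DT a) (NN dog)) |BT| (NP (DT a) (NN cat)) |ET| 1:0.5 2:1 |BV| 3:0.2 4:0.7 |EV|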
usage: svm_learn [options] example_file model_file
Arguments:
  example_file -> file with training data
  model_file   -> file to store the learned decision rules in

usage: svm_classify [options] example_file model_file
Arguments:
  example_file -> file with testing data
  model_file   -> file to retrieve the learned decision rules from

# svm_learn [options] training_data_file.txt model_file.txt
# svm_classify [options] testing_data_file.txt model_file.txt
-t int      -> type of kernel function:
               0: linear (default)
               1: polynomial (s a*b+c)^d
               2: radial basis function exp(-gamma ||a-b||^2)
               3: sigmoid tanh(s a*b + c)
               4: user defined kernel from kernel.h
               5: combination of forest and vector sets according to W, V, S, C options
               11: re-ranking based on trees (each instance must have two trees)
               12: re-ranking based on vectors (each instance must have two vectors)
               13: re-ranking based on both trees and vectors (each instance must have two trees and two vectors)
-W [S,A]    -> with "S", a tree kernel is applied to the sequence of trees of the two input forests and the results are summed;
               with "A", a tree kernel is applied to all tree pairs from the two forests (default "S")
-V [S,A]    -> same as before but sequences of vectors are used (default "S"; the type of vector-based kernel is specified by the option -S)
-S [0,4]    -> kernel to be used with vectors (default: polynomial of degree 3, i.e. -S = 1 and -d = 3)
-C [*,+,T,V] -> combination operator between forests and vectors (default "T")
               "T": only the contribution from trees is used
               "V": only the contribution from feature vectors is used
               "+" or "*": sum or multiplication of the contributions from feature vectors and trees
-T float    -> multiplicative constant for the contribution of tree kernels when -C = "+", i.e. K = tree-forest-kernel*r + vector-kernel (default 1)
-D [0,1]    -> 0: SubTree kernel; 1: SubSet Tree kernel (default 1)
-L float    -> decay factor in tree kernels (default 0.4)
-N [0,3]    -> 0 = no normalization, 1 = tree normalization, 2 = vector normalization, 3 = normalization of both trees and vectors; the normalization is applied to each individual tree or vector (default 3)
-u string   -> parameter of user defined kernel
-d int      -> parameter d in polynomial kernel
-g float    -> parameter gamma in rbf kernel
-s float    -> parameter s in sigmoid/poly kernel
-r float    -> parameter c in sigmoid/poly kernel

Explanation:
The -t option selects the kernel type: the five standard kernels (0 to 4) plus four newly added ones (5, 11, 12, 13).
- -t 0: the linear kernel, which is also the default; if -t is omitted, the linear kernel is used.
- -t 1: the polynomial kernel (s a*b+c)^d. The degree is set with -d and the bias c with -r, so -t 1 must be accompanied by -d and -r; e.g. -t 1 -d 2 -r 3 means (a*b+3)^2. (I'm not sure what the s parameter does here.)
- -t 2: the RBF (Radial Basis Function) kernel, exp(-gamma ||a-b||^2). gamma is set with -g, so -t 2 -g 2 means exp(-2||a-b||^2), where ||a-b|| is the L2 norm; if you don't know what that is, I won't explain further.
- -t 3: the sigmoid kernel tanh(s a*b+c); it needs -s and -r (-r sets the bias c).
- -t 4: use the user-defined kernel implemented in kernel.h; the -u option passes a string parameter to it. You can modify kernel.h to define your own kernel computation; see the papers on kernels for details.
- -t 11: re-ranking based on trees (each instance must have two trees to compare, as in the re-ranking example above).
- -t 12: re-ranking based on vectors (each instance must have two vectors).
- -t 13: re-ranking based on both trees and vectors (each instance must have two trees and two vectors).
- -t 5: combine the forest and the vector set according to the -W, -V, -S, -C options; that is, if you want to use trees and vectors together, you must set -t 5.
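For example, a hypothetical invocation (file names are placeholders) that trains with a degree-2 polynomial kernel with bias 3:

svm_learn -t 1 -d 2 -r 3 train.dat model.dat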
-W [S,A]: with "S", the tree kernel is applied position by position to the sequences of trees of the two input forests (the i-th tree of one forest against the i-th tree of the other) and the results are summed; with "A", the tree kernel is applied to all tree pairs from the two forests. The default is "S".
-V [S,A]: the same as above, but applied to the sequences of vectors; the default is "S", and the type of vector-based kernel is specified by the -S option.
-S [0,4]: the kernel function used for vectors (the default is a polynomial of degree 3, i.e. -S 1 -d 3).
-C [*,+,T,V]: the operator combining forests and vectors (default "T"):
- "T": use only the contribution from the trees
- "V": use only the contribution from the feature vectors
- "+": sum of the contributions from feature vectors and trees
- "*": product of the contributions from feature vectors and trees
-T float: a multiplicative constant for the tree kernel contribution; it is used with -C +, i.e. K = tree-forest-kernel*r + vector-kernel (default 1, meaning the tree and vector kernels get equal weight).
-D [0,1]: 0 selects the SubTree (ST) kernel, 1 the SubSet Tree (SST) kernel (default 1).
-L float: the decay factor in tree kernels (default 0.4).
-N [0,3]: 0 means no normalization; 1 normalizes trees; 2 normalizes vectors; 3 normalizes both trees and vectors. Normalization is applied to each individual tree or vector (default 3, i.e. normalize both).
-u string: a string parameter passed to the user-defined kernel.
-d int: the degree d of the polynomial kernel; needed when using a polynomial kernel.
-g float: the gamma parameter of the RBF kernel.
-s float: the parameter s of the sigmoid/polynomial kernel.
-r float: the bias c of the sigmoid/polynomial kernel.
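Putting several of these together, here is a hypothetical training command (again with placeholder file names) that combines trees and vectors (-t 5), sums the two contributions with the tree part weighted by 2, uses the SST kernel with decay 0.4, and normalizes both trees and vectors:

svm_learn -t 5 -C + -T 2 -D 1 -L 0.4 -N 3 train.dat model.dat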
To apply all this to a task, the rough workflow is:
- Define your NLP task and decide whether syntactic parse trees can be used.
- Parse your sentences into Penn Treebank-style trees, e.g. with the Stanford parser, which handles both Chinese and English.
- Environment: download svm-light, compile it, check that the svm_learn and svm_classify commands work, and use the bundled examples to test whether svm-light predicts correctly.
- Convert the parse trees and feature vectors into the format svm-light accepts, as in the examples in the example file.
- Split the data into a training set and a test set; if necessary, carve a validation set out of the training set, and use the remaining training data plus the validation set for parameter tuning.
- Train a model on the training set with svm_learn, apply it to the validation set, and analyze the results (see the command sketch after this list).
- Run new models with different parameters, compare the new validation results against the old ones, keep tuning, and adopt the better parameters until the results stop improving.
- Use the final model to predict on the test set, which gives the final performance of the method.
- One drawback: there is no built-in cross-validation; I seem to recall the libsvm tool has it, but I forget.
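As a minimal sketch of the training and evaluation steps above (assuming the compiled binaries are in the current directory, and train.dat, dev.dat, and test.dat are placeholder files in the input format described earlier; note that standard svm_classify also takes a third argument naming the file its predictions are written to):

# train with the tree contribution only (SST kernel, default decay and normalization)
./svm_learn -t 5 -C T -D 1 train.dat model.dat
# apply the model to the validation set and check the accuracy it reports
./svm_classify dev.dat model.dat dev.pred
# after parameter tuning, measure final performance on the held-out test set
./svm_classify test.dat model.dat test.pred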
/************************************************************************/
/*                                                                      */
/*   kernel.h                                                           */
/*                                                                      */
/*   User defined kernel function. Feel free to plug in your own.      */
/*                                                                      */
/*   Copyright: Alessandro Moschitti                                    */
/*   Date: 20.11.06                                                     */
/*                                                                      */
/************************************************************************/

/* KERNEL_PARM is defined in svm_common.h. The field 'custom' is reserved */
/* for parameters of the user defined kernel.                            */
/* Here is an example of a custom kernel on a forest and vectors.        */

// INPUT DESCRIPTION
// The basic input is a set of trees and a set of vectors.
// The semantics of the vectors is the following:
// - The first vector contains the parameter weights of each tree, so its length is num_of_trees.
// - The second vector tells which kind of kernel should be used for each tree (i.e. SST or ST), so its size is also num_of_trees.
// - The third vector contains the parameter weights of each feature vector. Its size is num_of_vectors - 4.
// - The fourth vector tells which kind of kernel should be used for each feature vector (i.e. -t from 0 to 3). Its size is num_of_vectors - 4.
// - From the fifth vector onward there are (num_of_vectors - 4) feature vectors that describe the target object.
//
// The final kernel is:
//   wt[1]*wt'[1]*TK_s1(t1,t'1) + .. + wt[n]*wt'[n]*TK_sn(tn,t'n) +
//   wv[1]*wv'[1]*K_r1(v1,v'1) + .. + wv[m]*wv'[m]*K_rm(vm,v'm)
// where:
// - wt[i] and wt'[i] are the weights associated with the i-th trees of the two objects,
// - si is the type of tree kernel applied to the i-th trees (SST if si=1, ST if si=0),
// - wv[i] and wv'[i] are the weights associated with the i-th feature vectors of the two objects,
// - ri is the type of kernel applied to the i-th feature vectors (ri = 0,1,2,3).
//
// Example: to evaluate
//   K(o,o) = 1*ST(t1,t1) + .5*.5*SST(t2,t2) + .1*.1*ST(t3,t3) + .125*.125*poly(v1,v1) + .670*.670*linear(v2,v2),
// the following data is required (to simplify, we have only one object o):
//   +1 |BT| (NN Paul) |BT| (JJ good) |BT| (VB give) |ET|            \\ forest
//   1:1 2:.5 3:.1 |BV| 1:0 2:1 3:0 |BV|                             \\ tree parameters
//   1:.125 2:.670 |BV| 1:1 2:0 |BV|                                 \\ feature vector parameters
//   1132:.2 1300:.01 12234:.23 30000:.23 30001:.001 30023:.034 |BV| \\ feature vectors
//   4050:.3 5030:.1 11114:.7 |EV|
//
// To test the kernel, use the following line as input_file:
//   +1 |BT|(NN Paul) |BT| (JJ good) |BT| (VB give) |ET| 1:1 2:.5 3:.1 |BV| 1:0 2:1 3:0 |BV| 1:.125 2:.670 |BV| 1:1 2:0 |BV| 1132:.2 1300:.01 12234:.23 30000:.23 30001:.001 30023:.034 |BV| 4050:.3 5030:.1 11114:.7 |EV|
// and execute the command: svm_learn -t 4 input_file

// implementation

double custom_kernel(KERNEL_PARM *kernel_parm, DOC *a, DOC *b)
{
  int i;
  double k = 0;

  // a and b are structures containing a forest of trees and a set of vectors:
  // - forest_vec[i] is the i-th tree
  // - vectors[i] is the i-th feature vector
  // - num_of_trees
  // - num_of_vectors

  // Summation of tree kernels.
  for (i = 0; i < a->num_of_trees && i < b->num_of_trees; i++) {
    // a->num_of_trees should be equal to b->num_of_trees.
    if (a->forest_vec[i] != NULL && b->forest_vec[i] != NULL) { // skip if either i-th tree is empty
      SIGMA = a->vectors[1]->words[i].weight; // The type of tree kernel for the i-th tree is given by vector 1;
                                              // according to the input data, "weight" is 0 (ST) or 1 (SST).
      LAMBDA = 0.4; // An additional vector could carry per-tree lambda parameters instead of .4 for all trees.
                    // Other vectors may contain other specific parameters; see "struct kernel_parm" in "svm_common.h".

      k += a->vectors[0]->words[i].weight *          // weight of tree i (vector 0 assigns weights to trees)
           b->vectors[0]->words[i].weight *          // weight of tree i for instance b
           tree_kernel(kernel_parm, a, b, i, i) /    // evaluate the tree kernel between the two i-th trees
           sqrt(tree_kernel(kernel_parm, a, a, i, i) *
                tree_kernel(kernel_parm, b, b, i, i)); // normalize with respect to both i-th trees

      /* TEST - print the i-th trees (of instances a and b)
      printf("\ntree 1: <"); writeTreeString(a->forest_vec[i]->root);
      printf(">\ntree 2: <"); writeTreeString(b->forest_vec[i]->root); printf(">\n");
      printf("\n\n(i,i)=(%d,%d)= Kernel-Sequence :%f \n", i, i, k);
      fflush(stdout);
      */
    }
  }

  // Summation of vector kernels.
  for (i = 0; i < a->num_of_vectors - 4 && i < b->num_of_vectors - 4; i++)
    if (a->vectors[i] != NULL && b->vectors[i] != NULL) { // check whether the i-th vectors are empty
      kernel_parm->second_kernel =
          (long) a->vectors[3]->words[i].weight; // type of standard feature-vector kernel (from 0 to 3)
      kernel_parm->poly_degree = (long) 2; // degree 2 for the polynomial kernel (does not apply to the linear kernel);
                                           // an additional vector could select different degrees for different vectors

      k += a->vectors[2]->words[i].weight *          // weight of feature vector i (vector 2 assigns weights to vectors)
           b->vectors[2]->words[i].weight *          // weight of feature vector i for instance b
           basic_kernel(kernel_parm, a, b, i, i) /   // standard kernel, selected by the "second_kernel" parameter
           sqrt(basic_kernel(kernel_parm, a, a, i, i) *
                basic_kernel(kernel_parm, b, b, i, i)); // normalize vectors

      // TEST: printf("\n\n(i,i)=(%d,%d)= Kernel-Sequence :%f \n", i, i, k);
    }

  return k;
}
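To try the custom kernel, use the input line and command given in the header comment above; since kernel.h is compiled into the binaries, remember to rebuild (e.g. re-run make) after editing it before invoking svm_learn -t 4.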