Dictionary-Based Word Segmentation: The Shortest Path Method

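The goal of the shortest path method is to segment a sentence into as few words as possible. It first finds every substring of the sentence that appears in the dictionary. It then builds a directed acyclic graph (DAG) in which each word is an edge of weight 1 and each gap between words (a possible cut point) is a node. The graph has a single start node (the beginning of the sentence) and a single end node (the end of the sentence). The shortest path from start to end is the segmentation with the fewest cuts.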
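The construction described above can be sketched in a few lines. This is a minimal illustration rather than the implementation given later in this post: the names `build_dag` and `fewest_words` and the tiny vocabulary are assumptions made for the example. Because the nodes of the DAG are already in left-to-right order, a simple backward pass is enough to find the shortest path.

```python
def build_dag(sentence, vocab):
    """Map each cut point i to the cut points reachable from it by one word."""
    n = len(sentence)
    dag = {}
    for i in range(n):
        # A single character is always an edge, so the graph stays connected.
        ends = [i + 1]
        for j in range(i + 2, n + 1):
            if sentence[i:j] in vocab:
                ends.append(j)
        dag[i] = ends
    return dag


def fewest_words(sentence, vocab):
    """Return one segmentation with the minimum number of tokens."""
    n = len(sentence)
    dag = build_dag(sentence, vocab)
    # best[i] = (fewest edges from node i to the end node, next node on that path)
    best = {n: (0, n)}
    for i in range(n - 1, -1, -1):
        best[i] = min((best[j][0] + 1, j) for j in dag[i])
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words


# Hypothetical toy vocabulary, for illustration only.
vocab = {"研究", "研究生", "生命", "起源"}
print(fewest_words("研究生命起源", vocab))  # ['研究', '生命', '起源']
```

In this example two paths have three edges (the other one is 研究生 / 命 / 起源), and the sketch simply keeps the one that `min` happens to pick first.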
 
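Like forward maximum matching, the shortest path method needs nothing more than a dictionary to segment text, but it gives better results. The main reason is that it takes into account how neighboring words fit together, in effect extending the segmentation process from one dimension (a linear scan) to two (a graph). The price is higher time complexity. In addition, to keep the graph connected, word matching has to go all the way down to single characters.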
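To make the difference concrete, here is a small side-by-side sketch. The function names and the Latin-letter toy vocabulary are assumptions chosen only to keep the example short. The greedy matcher commits to the longest word at each position, while the shortest path version compares complete paths.

```python
def forward_max_match(sentence, vocab, max_len):
    """Greedy: at each position take the longest dictionary word, else one char."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def shortest_path_segment(sentence, vocab, max_len):
    """Minimize the token count over all paths (every edge has weight 1)."""
    n = len(sentence)
    cost = [0] + [n + 1] * n   # cost[j] = fewest words covering sentence[:j]
    prev = [0] * (n + 1)       # prev[j] = start of the last word on that path
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            known = j - i == 1 or sentence[i:j] in vocab
            if known and cost[i] + 1 < cost[j]:
                cost[j], prev[j] = cost[i] + 1, i
    words, j = [], n
    while j > 0:
        words.append(sentence[prev[j]:j])
        j = prev[j]
    return words[::-1]


# Hypothetical toy vocabulary, for illustration only.
vocab = {"ab", "abc", "cde"}
print(forward_max_match("abcde", vocab, 3))      # ['abc', 'd', 'e']
print(shortest_path_segment("abcde", vocab, 3))  # ['ab', 'cde']
```

The greedy choice of "abc" leaves two stray characters behind, whereas the shortest path version finds the two-word segmentation.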
 
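The shortest path method is easy to extend with word frequencies. Using word frequency as the edge weight (in practice a cost that decreases as frequency increases, such as a negative log probability) and 2-gram frequency as the node weight readily yields a shortest path segmentation that incorporates frequency information. Where possible, N-gram frequencies can be applied to path selection, and lexical information can even be brought in, extracting the best path by matching against a lexical graph.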
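As a rough sketch of the frequency extension, the example below covers only the unigram case: each edge costs the negative log of the word's relative frequency, so frequent words are cheaper. The function name `weighted_segment`, the counts, and the cost charged to unseen single characters are all assumptions made for illustration. The 2-gram node weights mentioned above are not modeled here.

```python
import math


def weighted_segment(sentence, freq, max_len=4):
    """Segment by minimizing the sum of -log(relative frequency) over the words."""
    total = sum(freq.values())
    # Assumption: an unseen single character costs as much as a word seen 0.5 times.
    unknown_cost = -math.log(0.5 / total)
    n = len(sentence)
    cost = [0.0] + [math.inf] * n   # cost[j] = cheapest way to cover sentence[:j]
    prev = [0] * (n + 1)            # prev[j] = start of the last word on that path
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            word = sentence[i:j]
            if word in freq:
                w = -math.log(freq[word] / total)
            elif j - i == 1:
                w = unknown_cost
            else:
                continue            # unknown multi-character string: not an edge
            if cost[i] + w < cost[j]:
                cost[j], prev[j] = cost[i] + w, i
    words, j = [], n
    while j > 0:
        words.append(sentence[prev[j]:j])
        j = prev[j]
    return words[::-1]


# Made-up counts, for illustration only.
freq = {"研究": 50, "研究生": 10, "生命": 40, "命": 5, "起源": 20}
print(weighted_segment("研究生命起源", freq))  # ['研究', '生命', '起源']
```

With unit weights, 研究 / 生命 / 起源 and 研究生 / 命 / 起源 tie at three tokens. With these counts the first one is clearly cheaper, so the frequencies break the tie.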
 
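The shortest path method also helps with segmenting proper nouns. Segmenting a proper noun often produces short words or single characters, so these edges can be identified before path selection and given lower weights, which makes path selection favor them.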
 
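A shortest path algorithm returns only one path. Other paths with the same weight as the result are discarded, and so are the k shortest paths whose weights are close to the minimum, so the correct segmentation is often lost. These problems can be addressed by using a k-shortest-paths algorithm instead.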
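One simple way to keep several candidates, sketched here under the assumption of unit edge weights, is to store the k best partial paths at every node instead of only the best one. The function name `k_best_segmentations` and the toy vocabulary are hypothetical. This is one possible approach, not a specific published k-shortest-paths algorithm.

```python
import heapq


def k_best_segmentations(sentence, vocab, k=3, max_len=4):
    """Return up to k segmentations with the fewest tokens."""
    n = len(sentence)
    # paths[j] holds up to k (token_count, tokens) pairs covering sentence[:j]
    paths = [[] for _ in range(n + 1)]
    paths[0] = [(0, [])]
    for j in range(1, n + 1):
        candidates = []
        for i in range(max(0, j - max_len), j):
            word = sentence[i:j]
            if j - i == 1 or word in vocab:
                # Extend every kept path ending at node i by this word.
                for count, tokens in paths[i]:
                    candidates.append((count + 1, tokens + [word]))
        paths[j] = heapq.nsmallest(k, candidates, key=lambda c: c[0])
    return [tokens for _, tokens in paths[n]]


# Hypothetical toy vocabulary, for illustration only.
vocab = {"研究", "研究生", "生命", "起源"}
for tokens in k_best_segmentations("研究生命起源", vocab, k=2):
    print(" / ".join(tokens))
```

With k=2 both three-token segmentations of this sentence survive, so a later step (for example one that uses word frequencies) can still choose between them.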
 
Below is a simple Python implementation of the shortest path method:

'''Implements the SPM (Shortest Path Matching) method.'''

import re

# {entry1: category1, entry2: category2, ..., entryN: categoryM}
word_dict = {}
# a string containing the delimiting punctuation characters
punc = ''
# length of the longest entry in word_dict
max_len = 0


def segment(text):
    '''Segment the given string so that the number of tokens after
    segmentation is minimal. The algorithm can be summarized as follows:
    1. use delimiting punctuation to split the given string into short sentences.
    2. pick the first sentence.
    3. find all known words in the picked sentence.
    4. organize the words into a DAG.
    5. find the shortest path from the start to the end, which is the segmentation we want.
    6. pick the next sentence and repeat from 3 until all sentences have been processed.
    '''
    ret = []
    if punc:
        p = re.escape(punc)
        # a sentence, followed by an optional run of punctuation
        re_sent = re.compile('([^%s]+)([%s]*)' % (p, p))
    else:
        # no delimiters configured: treat the whole text as one sentence
        re_sent = re.compile('(.+)()', re.DOTALL)
    for match in re_sent.finditer(text):
        sen = match.group(1)
        dag = organize(sen)
        path = find_path(dag)
        for i, l in path:
            ret.append(sen[i:i + l])
        # append the punctuation that follows the sentence, if any
        # NOTICE: a run of several punctuation marks is kept as a single token
        if match.group(2):
            ret.append(match.group(2))
    return ret


def organize(sentence):
    '''Find all known words in the given sentence and organize them into a DAG.
    Each node of the DAG is represented as a list of hops:
    [hop1, hop2]
    A hop is the distance from this node to the next one. A character position
    may have several hops, one for each way of segmenting the characters that
    follow it. There is an ending node, [0], which makes the DAG easy to traverse.
    The DAG itself is a list of such nodes:
    [[2, 5], ..., [0]]
    '''
    dag = []
    n = len(sentence)
    for c in range(n):
        hops = []
        # try the longest candidate first, then truncate one char and retry
        for tl in range(min(max_len, n - c), 1, -1):
            if sentence[c:c + tl] in word_dict:
                hops.append(tl)
        # a single char is always an edge, which keeps the graph connected
        hops.append(1)
        dag.append(hops)
    dag.append([0])
    return dag


def find_path(dag):
    '''Use the standard Dijkstra algorithm to find the shortest path
    in the given DAG. Returns the path in the following format:
    [(0, 2), (2, 3), (5, 1), (6, 4)]
    Each tuple is (n, l), where n is the start index of a token
    and l is the length of that token.
    '''
    n = len(dag)
    wt = [n + 1] * n                  # tentative distance from the start node
    rc = [False] * n                  # True once a node's distance is final
    pre = [i - 1 for i in range(n)]   # predecessor of each node on the best path
    es = [0]                          # nodes reached but not yet finalized
    wt[0] = 0
    while es:
        # pick the reached node with the smallest distance
        c = min(es, key=lambda e: wt[e])
        es.remove(c)
        rc[c] = True
        for e in dag[c]:
            t = e + c
            if not rc[t]:
                d = wt[c] + 1
                if d < wt[t]:
                    wt[t] = d
                    pre[t] = c
                    if t not in es:
                        es.append(t)
    # walk back from the end node to the start node
    c = n - 1
    path = []
    while pre[c] != -1:
        path.append((pre[c], c - pre[c]))
        c = pre[c]
    path.reverse()
    return path
MSN Space Link: http://spaces.msn.com/vanzolo/blog/cns!4A43F3D396FBF12F!1198.entry?_c11_blogpart_blogpart=blogview&_c=blogpart#permalink
