English Stemming Algorithms - Lovins, Porter

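English stemming can be done in many ways; in practice it may draw on machine learning, data mining, and related fields.
This article covers a few of the original algorithms that are easy to implement: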

  • Lovins (1968)

  • Porter (1980)

  • Porter2 (2000)

1. Lovins

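Lovins is the earliest published stemming algorithm.

1.1. Introduction

The algorithm uses the following components:

  • ending: a word suffix; there are 294 of them, listed in Appendix A

  • condition: the condition under which an ending may be removed; every ending is tied to one condition, and there are 29 conditions, listed in Appendix B

  • transformation: a rule that rewrites the end of the stem once the ending is removed; there are 35 of them, listed in Appendix C

The algorithm has two steps:

  1. Scan the ending list from the longest ending to the shortest and pick the first ending that matches the word and satisfies its condition

  2. Apply the transformations to the remaining stem so that it ends in the proper form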
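The first step can be sketched in Python as below. This is a minimal illustration rather than a full implementation: `ENDINGS` is a hand-picked excerpt of the ending table, only conditions A and B are modelled, and the names `ENDINGS`, `CONDITIONS` and `strip_ending` are made up for this sketch.

```python
# Hypothetical excerpt of the ending table: ending -> condition code.
ENDINGS = {
    "ationally": "B",
    "ionally": "A",
    "ational": "B",
    "ally": "B",
    "ly": "B",
    "y": "B",
}

# Only two of the 29 conditions are modelled in this sketch.
CONDITIONS = {
    "A": lambda stem: True,            # no restrictions on stem
    "B": lambda stem: len(stem) >= 3,  # minimum stem length = 3
}


def strip_ending(word):
    """Return (stem, ending) for the longest ending whose condition holds."""
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if not word.endswith(ending):
            continue
        stem = word[: -len(ending)]
        if CONDITIONS[ENDINGS[ending]](stem):
            return stem, ending
    return word, ""  # no ending qualified, keep the word as it is


print(strip_ending("nationally"))  # ('nat', 'ionally')
```

Sorting by length on every call keeps the sketch short; a real implementation would more likely store the endings already grouped by length, as the appendix does.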

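1.2. Example

Step 1

Take the word nationally. Scanning the ending list from longest to shortest, the first match is ationally (.09., condition B).
Condition B is "Minimum stem length = 3": at least 3 letters must remain once the ending is removed.
Removing ationally from nationally leaves only n, so the condition fails.

Scanning continues and reaches ionally (.07., condition A). Condition A is "No restrictions on stem".
So ionally is chosen as the ending.

Step 2

The stem of nationally is nat. No transformation matches it, so it is output unchanged.
As another example, sitting yields the stem sitt in step 1; step 2 applies transformation 1 (remove one of a doubled letter) and outputs sit.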
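A rough Python sketch of the second step follows. It assumes that transformation 1 (undoubling) is tried first and that at most one of the rewrite rules is applied afterwards; `REWRITES` holds only three unconditional rules from the transformation list, and the function names are invented for illustration.

```python
DOUBLES = "bdglmnprst"  # letters covered by transformation 1

# Hypothetical excerpt: transformations 2, 3 and 8 (no "except" clauses).
REWRITES = [("iev", "ief"), ("uct", "uc"), ("olv", "olut")]


def undouble(stem):
    """Transformation 1: drop one letter of a doubled final consonant."""
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] in DOUBLES:
        return stem[:-1]
    return stem


def transform(stem):
    """Undouble, then apply the first rewrite rule that matches, if any."""
    stem = undouble(stem)
    for old, new in REWRITES:
        if stem.endswith(old):
            return stem[: -len(old)] + new
    return stem


print(transform("sitt"))  # sit
print(transform("nat"))   # nat
```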

1.Appendix.A List of endings

.11.
alistically B   arizability A   izationally B

.10.
antialness A    arisations A    arizations A    entialness A

.09.
allically C     antaneous A     antiality A     arisation A
arization A     ationally B     ativeness A     eableness E
entations A     entiality A     entialize A     entiation A
ionalness A     istically A     itousness A     izability A
izational A

.08.
ableness A      arizable A      entation A      entially A
eousness A      ibleness A      icalness A      ionalism A
ionality A      ionalize A      iousness A      izations A
lessness A

.07.
ability A       aically A       alistic B       alities A
ariness E       aristic A       arizing A       ateness A
atingly A       ational B       atively A       ativism A
elihood E       encible A       entally A       entials A
entiate A       entness A       fulness A       ibility A
icalism A       icalist A       icality A       icalize A
ication G       icianry A       ination A       ingness A
ionally A       isation A       ishness A       istical A
iteness A       iveness A       ivistic A       ivities A
ization F       izement A       oidally A       ousness A

.06.
aceous A        acious B        action G        alness A
ancial A        ancies A        ancing B        ariser A
arized A        arizer A        atable A        ations B
atives A        eature Z        efully A        encies A
encing A        ential A        enting C        entist A
eously A        ialist A        iality A        ialize A
ically A        icance A        icians A        icists A
ifully A        ionals A        ionate D        ioning A
ionist A        iously A        istics A        izable E
lessly A        nesses A        oidism A

.05.
acies A         acity A         aging B         aical A
alist A         alism B         ality A         alize A
allic BB        anced B         ances B         antic C
arial A         aries A         arily A         arity B
arize A         aroid A         ately A         ating I
ation B         ative A         ators A         atory A
ature E         early Y         ehood A         eless A
elity A         ement A         enced A         ences A
eness E         ening E         ental A         ented C
ently A         fully A         ially A         icant A
ician A         icide A         icism A         icist A
icity A         idine I         iedly A         ihood A
inate A         iness A         ingly B         inism J
inity CC        ional A         ioned A         ished A
istic A         ities A         itous A         ively A
ivity A         izers F         izing F         oidal A
oides A         otide A         ously A

.04.
able A          ably A          ages B          ally B
ance B          ancy B          ants B          aric A
arly K          ated I          ates A          atic B
ator A          ealy Y          edly E          eful A
eity A          ence A          ency A          ened E
enly E          eous A          hood A          ials A
ians A          ible A          ibly A          ical A
ides L          iers A          iful A          ines M
ings N          ions B          ious A          isms B
ists A          itic H          ized F          izer F
less A          lily A          ness A          ogen A
ward A          wise A          ying B          yish A

.03.
acy A           age B           aic A           als BB
ant B           ars O           ary F           ata A
ate A           eal Y           ear Y           ely E
ene E           ent C           ery E           ese A
ful A           ial A           ian A           ics A
ide L           ied A           ier A           ies P
ily A           ine M           ing N           ion Q
ish C           ism B           ist A           ite AA
ity A           ium A           ive A           ize F
oid A           one R           ous A

.02.
ae A            al BB           ar X            as B
ed E            en F            es E            ia A
ic A            is A            ly B            on S
or T            um U            us V            yl R
s' A            's A

.01.
a A             e A             i A             o A
s W             y B 

1.Appendix.B List of conditions

A   No restrictions on stem
B   Minimum stem length = 3
C   Minimum stem length = 4
D   Minimum stem length = 5
E   Do not remove ending after e
F   Minimum stem length = 3 and do not remove ending after e
G   Minimum stem length = 3 and remove ending only after f
H   Remove ending only after t or ll
I   Do not remove ending after o or e
J   Do not remove ending after a or e
K   Minimum stem length = 3 and remove ending only after l, i or u*e
L   Do not remove ending after u, x or s, unless s follows o
M   Do not remove ending after a, c, e or m
N   Minimum stem length = 4 after s**, elsewhere = 3
O   Remove ending only after l or i
P   Do not remove ending after c
Q   Minimum stem length = 3 and do not remove ending after l or n
R   Remove ending only after n or r
S   Remove ending only after dr or t, unless t follows t
T   Remove ending only after s or t, unless t follows o
U   Remove ending only after l, m, n or r
V   Remove ending only after c
W   Do not remove ending after s or u
X   Remove ending only after l, i or u*e
Y   Remove ending only after in
Z   Do not remove ending after f
AA  Remove ending only after d, f, ph, th, l, er, or, es or t
BB  Minimum stem length = 3 and do not remove ending after met or ryst
CC  Remove ending only after l

1.Appendix.C List of transformations

1   remove one of double b, d, g, l, m, n, p, r, s, t
2   iev   ->   ief
3   uct   ->   uc
4   umpt  ->   um
5   rpt   ->   rb
6   urs   ->   ur
7   istr  ->   ister
7a  metr  ->   meter
8   olv   ->   olut
9   ul    ->   l except following a, o, i
10  bex   ->   bic
11  dex   ->   dic
12  pex   ->   pic
13  tex   ->   tic
14  ax    ->   ac
15  ex    ->   ec
16  ix    ->   ic
17  lux   ->   luc
18  uad   ->   uas
19  vad   ->   vas
20  cid   ->   cis
21  lid   ->   lis
22  erid  ->   eris
23  pand  ->   pans
24  end   ->   ens except following s
25  ond   ->   ons
26  lud   ->   lus
27  rud   ->   rus
28  her   ->   hes except following p, t
29  mit   ->   mis
30  ent   ->   ens except following m
31  ert   ->   ers
32  et    ->   es except following n
33  yt    ->   ys
34  yz    ->   ys 

2. Porter

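2.1. Introduction

Vowels and consonants

Vowels and consonants are defined slightly differently from the usual convention:

  • Vowel - A, E, I, O, U, plus a Y that follows a consonant

  • Consonant - every letter other than A, E, I, O, U and other than a Y that follows a consonant (so a Y at the start of a word or after a vowel counts as a consonant)

Grouping a word

A run of consecutive vowels is treated as one vowel group V, and a run of consecutive consonants as one consonant group C, so any word can be written as an alternating sequence of V and C, for example: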

segmentfault -> s/e/gm/e/ntf/au/lt -> CVCVCVC
porter -> p/o/rt/e/r -> CVCVC
application -> a/ppl/i/c/a/t/io/n -> VCVCVCVC
apple -> a/ppl/e -> VCV

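Putting this together, every word can be written in terms of VC pairs, with an optional leading C and an optional trailing V: $$ [C](VC)^m[V] $$
The parameter m, the number of VC pairs, plays a role similar to the stem length in the Lovins conditions and is used in the later checks. For the four words above, m is 3, 2, 4 and 1 respectively.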
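The grouping can be sketched in Python as follows. The helper names `is_consonant`, `vc_form` and `measure` are chosen for this illustration, and the sketch assumes plain ASCII words.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under the definition used here."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True


def vc_form(word):
    """Collapse runs of vowels/consonants into an alternating V/C string."""
    word = word.lower()
    form = ""
    for i in range(len(word)):
        kind = "C" if is_consonant(word, i) else "V"
        if not form or form[-1] != kind:
            form += kind
    return form


def measure(word):
    """m: the number of VC pairs in the collapsed form."""
    return vc_form(word).count("VC")


for w in ("segmentfault", "porter", "application", "apple"):
    print(w, vc_form(w), measure(w))
```

Because the collapsed form strictly alternates, counting the substring "VC" gives m directly.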

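Rules

The Porter algorithm is built around rules of the form:

(condition) S1 -> S2

The condition is tested on the stem that remains after S1 is removed. Besides m, it can use other features:

  • m - the number of VC pairs in the stem

  • * - a placeholder for the rest of the stem; it is combined with a letter, v, d or o

  • uppercase letter - a literal substring, e.g. *S means the stem ends with S

  • v - a vowel; *v* means the stem contains a vowel

  • d - a double consonant (the same consonant twice); *d means the stem ends with one

  • o - the sequence cvc, where the second c is not W, X or Y; *o means the stem ends with it

S1 is the suffix of the word, and S2 is the suffix that replaces it.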
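These condition features map naturally onto small predicates. The Python sketch below is one possible encoding; the function names are invented here, and stems are assumed to be lowercase.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant (y counts only at the start or after a vowel)."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def contains_vowel(stem):
    """*v* : the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """*d : the stem ends with the same consonant twice."""
    n = len(stem)
    return n >= 2 and stem[-1] == stem[-2] and is_consonant(stem, n - 1)


def ends_cvc(stem):
    """*o : the stem ends consonant-vowel-consonant, last one not w, x or y."""
    n = len(stem)
    return (
        n >= 3
        and is_consonant(stem, n - 3)
        and not is_consonant(stem, n - 2)
        and is_consonant(stem, n - 1)
        and stem[-1] not in "wxy"
    )


def ends_with(stem, letters):
    """*S style test: the stem ends with the given literal substring."""
    return stem.endswith(letters)


print(contains_vowel("hopp"), ends_double_consonant("hopp"), ends_cvc("hop"))
```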

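Unlike Lovins, which produces its output in a single pass, Porter sends a word through a chain of rules before emitting the result.
Take hopping: the rule (*v*) ING -> first turns it into hopp,
then the rule (*d and not (*L or *S or *Z)) -> single letter turns hopp into hop.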

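Flow

The algorithm applies the rules from top to bottom. Some rules are special: when they fire, additional rules have to be processed.
Because there are many rules, they are grouped into steps. The grouping is mainly a logical division (an implementation can also use it for optimization), and the algorithm simply runs the steps from first to last: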

  • do Step_1a

  • do Step_1b (if the second or third rule of Step 1b fires, some extra cleanup rules are applied)

  • do Step_1c

  • do Step_2

  • do Step_3

  • do Step_4

  • do Step_5a

  • do Step_5b

The details of each step are listed in the appendix.

2.2. Example

2.Appendix Step 1a

      SSES  ->   SS
      IES   ->   I
      SS    ->   SS
      S     ->
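Step 1a has no conditions, so it reduces to a chain of suffix checks. A minimal Python sketch, assuming lowercase input and trying the longest suffix first (`step_1a` is a name chosen for this example):

```python
def step_1a(word):
    """Apply the first Step 1a rule whose suffix matches, longest suffix first."""
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]  # IES -> I
    if word.endswith("ss"):
        return word       # SS -> SS (blocks the plain S rule)
    if word.endswith("s"):
        return word[:-1]  # S ->
    return word


for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", step_1a(w))
```

The SS -> SS rule changes nothing by itself; it exists so that words such as caress do not reach the S rule.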

2.Appendix Step 1b

(m>0) EED     ->   EE
(*v*) ED      ->
(*v*) ING     ->

If the second or third of the rules in Step 1b is successful, the following is done:

      AT      ->   ATE
      BL      ->   BLE
      IZ      ->   IZE
      (*d and not (*L or *S or *Z)) -> single letter
      (m=1 and *o)  ->   E
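Step 1b, including the follow-up rules, might be sketched in Python like this. It is an illustration under a few assumptions: input is lowercase, within a group only the rule with the longest matching suffix is considered, and all function names are invented for the sketch.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """m: number of VC pairs after collapsing vowel and consonant runs."""
    form = ""
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        if not form or form[-1] != kind:
            form += kind
    return form.count("VC")


def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_cvc(stem):
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")


def cleanup(stem):
    """Follow-up rules, run only after ED or ING was removed."""
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"                      # AT -> ATE, BL -> BLE, IZ -> IZE
    if (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1) and stem[-1] not in "lsz"):
        return stem[:-1]                       # double consonant -> single letter
    if measure(stem) == 1 and ends_cvc(stem):
        return stem + "e"                      # (m=1 and *o) -> E
    return stem


def step_1b(word):
    if word.endswith("eed"):
        # EED is the longest match, so ED is not tried even when m == 0.
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return cleanup(stem) if contains_vowel(stem) else word
    return word


for w in ("hopping", "agreed", "feed", "filing", "sing"):
    print(w, "->", step_1b(w))
```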

2.Appendix Step 1c

(*v*) Y       ->   I

2.Appendix Step 2

(m>0) ATIONAL ->   ATE
(m>0) TIONAL  ->   TION
(m>0) ENCI    ->   ENCE
(m>0) ANCI    ->   ANCE
(m>0) IZER    ->   IZE
(m>0) ABLI    ->   ABLE
(m>0) ALLI    ->   AL
(m>0) ENTLI   ->   ENT
(m>0) ELI     ->   E
(m>0) OUSLI   ->   OUS
(m>0) IZATION ->   IZE
(m>0) ATION   ->   ATE
(m>0) ATOR    ->   ATE
(m>0) ALISM   ->   AL
(m>0) IVENESS ->   IVE
(m>0) FULNESS ->   FUL
(m>0) OUSNESS ->   OUS
(m>0) ALITI   ->   AL
(m>0) IVITI   ->   IVE
(m>0) BILITI  ->   BLE
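Steps 2 and 3 share one shape, (m>0) S1 -> S2, so they lend themselves to a table-driven loop. The Python sketch below uses a four-rule excerpt of Step 2 and assumes lowercase input and longest-suffix-first matching; `STEP_2_EXCERPT` and `apply_m_gt_0_rules` are names made up for this example.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """m: number of VC pairs after collapsing vowel and consonant runs."""
    form = ""
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        if not form or form[-1] != kind:
            form += kind
    return form.count("VC")


# Hypothetical excerpt of the Step 2 table: (S1, S2).
STEP_2_EXCERPT = [
    ("ational", "ate"),
    ("tional", "tion"),
    ("izer", "ize"),
    ("fulness", "ful"),
]


def apply_m_gt_0_rules(word, rules):
    """Take the longest matching S1; replace it with S2 only if m(stem) > 0."""
    for suffix, replacement in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


for w in ("relational", "conditional", "rational", "digitizer"):
    print(w, "->", apply_m_gt_0_rules(w, STEP_2_EXCERPT))
```

The same function would work for Step 3 with a different table; Step 4 needs m>1 and an extra check for the ION rule, so it would need a slightly richer rule format.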

2.Appendix Step 3

(m>0) ICATE   ->   IC
(m>0) ATIVE   ->
(m>0) ALIZE   ->   AL
(m>0) ICITI   ->   IC
(m>0) ICAL    ->   IC
(m>0) FUL     ->
(m>0) NESS    ->

2.Appendix Step 4

(m>1) AL      ->
(m>1) ANCE    ->
(m>1) ENCE    ->
(m>1) ER      ->
(m>1) IC      ->
(m>1) ABLE    ->
(m>1) IBLE    ->
(m>1) ANT     ->
(m>1) EMENT   ->
(m>1) MENT    ->
(m>1) ENT     ->
(m>1 and (*S or *T)) ION   ->
(m>1) OU      ->
(m>1) ISM     ->
(m>1) ATE     ->
(m>1) ITI     ->
(m>1) OUS     ->
(m>1) IVE     ->
(m>1) IZE     ->

2.Appendix Step 5a

(m>1) E   ->
(m=1 and not *o) E   ->

2.Appendix Step 5b

(m > 1 and *d and *L)   ->   single letter
