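There are many ways to stem English words; in practice the task can draw on machine learning, data mining and other fields.
This article covers a few classic algorithms that are easy to implement: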
Lovins (1968)
Porter (1980)
Porter2 (2000)
1. Lovins
Lovins is the earliest of these implementations.
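1.1. Introduction
The algorithm uses the following components:
ending: a word suffix; there are 294 of them (full list in 1.Appendix.A)
condition: the condition under which an ending may be removed; each ending has exactly one condition, and there are 29 conditions in total (full list in 1.Appendix.B)
transformation: a rule that rewrites the end of the stem once the ending is removed; there are 35 of them (full list in 1.Appendix.C)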
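The algorithm has two steps:
For an English word, scan the ending list from the longest ending to the shortest and pick the first ending whose condition is satisfied
Apply the transformations to the remaining stem so that the end of the stem takes its proper form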
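1.2. Example
Step 1
The word is nationally. Scanning the ending list from longest to shortest, the first match is ".09. ationally B", whose condition is "B Minimum stem length = 3": the part left after removing the ending must be at least 3 letters long.
Removing ationally from nationally leaves only n, which does not satisfy the condition.
Scanning continues and finds ".07. ionally A", whose condition is "A No restrictions on stem", so there is nothing to check.
The ending finally selected is therefore ionally.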
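Step 2
The stem of nationally is nat. No transformation matches it, so the stem is output unchanged.
As another example, take sitting: step 1 yields the stem sitt, and step 2 applies the first transformation (remove one of a double t), so the final output is sit.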
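The two steps can be sketched in Python as follows. This is only a minimal illustration, not a full Lovins stemmer: the tables hold just the three endings, three conditions and one transformation needed for the two examples above, and the names ENDINGS, CONDITIONS, undouble and lovins_stem are made up for this sketch.

```python
# Tiny illustrative subset of the Lovins tables (the full lists are in the appendices).
ENDINGS = {
    "ationally": "B",
    "ionally": "A",
    "ing": "N",
}

CONDITIONS = {
    # A: no restrictions on stem
    "A": lambda stem: True,
    # B: minimum stem length = 3
    "B": lambda stem: len(stem) >= 3,
    # N: minimum stem length = 4 after s**, elsewhere = 3
    # (s** is read here as "the stem ends in s followed by two letters")
    "N": lambda stem: len(stem) >= (4 if stem[-3:-2] == "s" else 3),
}


def undouble(stem):
    """Transformation 1: remove one of double b, d, g, l, m, n, p, r, s, t."""
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] in "bdglmnprst":
        return stem[:-1]
    return stem


def lovins_stem(word):
    # Step 1: scan endings from longest to shortest and keep the first
    # one whose condition accepts the remaining stem.
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending):
            stem = word[: -len(ending)]
            if CONDITIONS[ENDINGS[ending]](stem):
                break
    else:
        stem = word  # no ending could be removed
    # Step 2: apply the transformations (only transformation 1 in this sketch).
    return undouble(stem)


print(lovins_stem("nationally"))  # nat
print(lovins_stem("sitting"))     # sit
```

A real implementation would load all 294 endings, 29 conditions and 35 transformations into the same kind of tables.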
1.Appendix.A List of endings
.11.
alistically B arizability A izationally B
.10.
antialness A arisations A arizations A entialness A
.09.
allically C antaneous A antiality A arisation A
arization A ationally B ativeness A eableness E
entations A entiality A entialize A entiation A
ionalness A istically A itousness A izability A
izational A
.08.
ableness A arizable A entation A entially A
eousness A ibleness A icalness A ionalism A
ionality A ionalize A iousness A izations A
lessness A
.07.
ability A aically A alistic B alities A
ariness E aristic A arizing A ateness A
atingly A ational B atively A ativism A
elihood E encible A entally A entials A
entiate A entness A fulness A ibility A
icalism A icalist A icality A icalize A
ication G icianry A ination A ingness A
ionally A isation A ishness A istical A
iteness A iveness A ivistic A ivities A
ization F izement A oidally A ousness A
.06.
aceous A acious B action G alness A
ancial A ancies A ancing B ariser A
arized A arizer A atable A ations B
atives A eature Z efully A encies A
encing A ential A enting C entist A
eously A ialist A iality A ialize A
ically A icance A icians A icists A
ifully A ionals A ionate D ioning A
ionist A iously A istics A izable E
lessly A nesses A oidism A
.05.
acies A acity A aging B aical A
alist A alism B ality A alize A
allic BB anced B ances B antic C
arial A aries A arily A arity B
arize A aroid A ately A ating I
ation B ative A ators A atory A
ature E early Y ehood A eless A
elity A ement A enced A ences A
eness E ening E ental A ented C
ently A fully A ially A icant A
ician A icide A icism A icist A
icity A idine I iedly A ihood A
inate A iness A ingly B inism J
inity CC ional A ioned A ished A
istic A ities A itous A ively A
ivity A izers F izing F oidal A
oides A otide A ously A
.04.
able A ably A ages B ally B
ance B ancy B ants B aric A
arly K ated I ates A atic B
ator A ealy Y edly E eful A
eity A ence A ency A ened E
enly E eous A hood A ials A
ians A ible A ibly A ical A
ides L iers A iful A ines M
ings N ions B ious A isms B
ists A itic H ized F izer F
less A lily A ness A ogen A
ward A wise A ying B yish A
.03.
acy A age B aic A als BB
ant B ars O ary F ata A
ate A eal Y ear Y ely E
ene E ent C ery E ese A
ful A ial A ian A ics A
ide L ied A ier A ies P
ily A ine M ing N ion Q
ish C ism B ist A ite AA
ity A ium A ive A ize F
oid A one R ous A
.02.
ae A al BB ar X as B
ed E en F es E ia A
ic A is A ly B on S
or T um U us V yl R
s' A 's A
.01.
a A e A i A o A
s W y B
1.Appendix.B List of conditions
A No restrictions on stem
B Minimum stem length = 3
C Minimum stem length = 4
D Minimum stem length = 5
E Do not remove ending after e
F Minimum stem length = 3 and do not remove ending after e
G Minimum stem length = 3 and remove ending only after f
H Remove ending only after t or ll
I Do not remove ending after o or e
J Do not remove ending after a or e
K Minimum stem length = 3 and remove ending only after l, i or u*e
L Do not remove ending after u, x or s, unless s follows o
M Do not remove ending after a, c, e or m
N Minimum stem length = 4 after s**, elsewhere = 3
O Remove ending only after l or i
P Do not remove ending after c
Q Minimum stem length = 3 and do not remove ending after l or n
R Remove ending only after n or r
S Remove ending only after dr or t, unless t follows t
T Remove ending only after s or t, unless t follows o
U Remove ending only after l, m, n or r
V Remove ending only after c
W Do not remove ending after s or u
X Remove ending only after l, i or u*e
Y Remove ending only after in
Z Do not remove ending after f
AA Remove ending only after d, f, ph, th, l, er, or, es or t
BB Minimum stem length = 3 and do not remove ending after met or ryst
CC Remove ending only after l
1.Appendix.C List of transformations
1 remove one of double b, d, g, l, m, n, p, r, s, t
2 iev -> ief
3 uct -> uc
4 umpt -> um
5 rpt -> rb
6 urs -> ur
7 istr -> ister
7a metr -> meter
8 olv -> olut
9 ul -> l except following a, o, i
10 bex -> bic
11 dex -> dic
12 pex -> pic
13 tex -> tic
14 ax -> ac
15 ex -> ec
16 ix -> ic
17 lux -> luc
18 uad -> uas
19 vad -> vas
20 cid -> cis
21 lid -> lis
22 erid -> eris
23 pand -> pans
24 end -> ens except following s
25 ond -> ons
26 lud -> lus
27 rud -> rus
28 her -> hes except following p, t
29 mit -> mis
30 ent -> ens except following m
31 ert -> ers
32 et -> es except following n
33 yt -> ys
34 yz -> ys
2. Porter
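2.1. Introduction
Vowels and consonants
Vowels and consonants are defined slightly differently from the usual convention:
Vowel - A, E, I, O, U, and Y when it follows a consonant
Consonant - any letter other than A, E, I, O, U, and other than Y following a consonant (so a Y at the start of a word or after a vowel is a consonant)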
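The definition can be sketched as a small Python helper. The function name is_consonant and its (word, index) signature are assumptions made for this illustration.

```python
def is_consonant(word, i):
    """Return True if word[i] is a consonant under Porter's definition."""
    ch = word[i].lower()
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a vowel only when it follows a consonant; at the start of
        # the word or after a vowel it counts as a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


# In "toy" the Y follows a vowel, so it is a consonant.
print([is_consonant("toy", i) for i in range(3)])     # [True, False, True]
# In "syzygy" every Y follows a consonant, so every Y is a vowel.
print([is_consonant("syzygy", i) for i in range(6)])  # [True, False, True, False, True, False]
```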
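Grouping a word
A run of consecutive vowels is treated as one vowel group V, and a run of consecutive consonants as one consonant group C, so any word can be written as alternating V and C groups, for example: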
segmentfault -> s/e/gm/e/ntf/au/lt -> CVCVCVC
porter -> p/o/rt/e/r -> CVCVC
application -> a/ppl/i/c/a/t/io/n -> VCVCVCVC
apple -> a/ppl/e -> VCV
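Putting this together, every word can be written in terms of VC groups: $$ [C](VC)^m[V] $$
Square brackets mark an optional group and m is the number of times VC repeats. The parameter m (the measure) plays a role similar to the stem length in the Lovins conditions and is used in later tests.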
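As a rough sketch, the grouping and the measure m can be computed in one pass over the word. The function names vc_form and measure are chosen for this illustration only.

```python
def vc_form(word):
    """Collapse a word into its alternating C/V group pattern, e.g. 'CVCVC'."""
    pattern = []
    prev_is_consonant = False
    for i, ch in enumerate(word.lower()):
        if ch in "aeiou":
            consonant = False
        elif ch == "y":
            # Y is a consonant at the start of the word or after a vowel.
            consonant = i == 0 or not prev_is_consonant
        else:
            consonant = True
        symbol = "C" if consonant else "V"
        # Only start a new group when the letter class changes.
        if not pattern or pattern[-1] != symbol:
            pattern.append(symbol)
        prev_is_consonant = consonant
    return "".join(pattern)


def measure(word):
    """m in [C](VC)^m[V]: the number of VC pairs in the pattern."""
    return vc_form(word).count("VC")


for w in ("segmentfault", "porter", "application", "apple"):
    print(w, vc_form(w), measure(w))
# segmentfault CVCVCVC 3
# porter CVCVC 2
# application VCVCVCVC 4
# apple VCV 1
```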
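Rules
The Porter algorithm is built around rules of the form:
(condition) S1 -> S2
The condition is tested on the stem that remains after removing S1. Besides m, it can use the following features:
m - the number of VC groups in the stem
* - a wildcard for any sequence of characters, used together with a letter, v, d or o
capital letter - a literal letter, e.g. *S means the stem ends with S
v - a vowel; *v* means the stem contains a vowel
d - a double consonant; *d means the stem ends with two identical consonants
o - a cvc sequence; *o means the stem ends in consonant-vowel-consonant where the second consonant is not W, X or Y
S1 is the suffix of the word and S2 is the suffix that replaces it
Unlike Lovins, which produces its output in a single pass, Porter sends a word through a chain of rules before the result is output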
For example, hopping first matches the rule (*v*) ING -> and becomes hopp.
The rule (*d and not (*L or *S or *Z)) -> single letter is then applied, turning hopp into hop.
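The hopping example can be traced with a short Python sketch of just these two rules. The helper names (contains_vowel, ends_double_consonant, strip_ing, undouble, stem_ing) are assumptions for illustration, and only the ING rule and the double-consonant follow-up rule are implemented here.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def contains_vowel(stem):
    """Condition *v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """Condition *d: the stem ends with two identical consonants."""
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))


def strip_ing(word):
    """(*v*) ING -> : drop ING only if the remaining stem contains a vowel."""
    if word.endswith("ing") and contains_vowel(word[:-3]):
        return word[:-3]
    return word


def undouble(word):
    """(*d and not (*L or *S or *Z)) -> single letter."""
    if ends_double_consonant(word) and word[-1] not in "lsz":
        return word[:-1]
    return word


def stem_ing(word):
    stem = strip_ing(word)
    if stem != word:  # the follow-up rule only runs when ING was removed
        stem = undouble(stem)
    return stem


print(stem_ing("hopping"))  # hop
print(stem_ing("falling"))  # fall (LL is kept because of "not *L")
print(stem_ing("sing"))     # sing (the stem "s" contains no vowel)
```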
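Procedure
The algorithm applies the rules from top to bottom. Some rules are special: when they fire, additional rules have to be processed.
Because there are many rules, they are grouped into steps. The grouping is mainly a logical division (although an implementation can also use it for optimization); the algorithm simply runs the steps from first to last: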
do Step_1a
do Step_1b (if the second or third rule of Step 1b succeeds, some extra rules are applied)
do Step_1c
do Step_2
do Step_3
do Step_4
do Step_5a
do Step_5b
The details of each step are listed in the appendix.
2.2. Example
2.Appendix Step 1a
SSES -> SS
IES -> I
SS -> SS
S ->
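Step 1a has no conditions, so it can be sketched as a plain suffix table. The sketch below assumes the usual Porter convention that, within one step, only the rule with the longest matching S1 is applied; the names STEP_1A_RULES and step_1a are made up for this example.

```python
# Rules are listed longest suffix first, so the first match is the longest one.
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def step_1a(word):
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", step_1a(w))
# caresses -> caress
# ponies -> poni
# caress -> caress
# cats -> cat
```

The SS -> SS rule looks like a no-op, but it stops the shorter S -> rule from turning caress into cares.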
2.Appendix Step 1b
(m>0) EED -> EE
(*v*) ED ->
(*v*) ING ->
If the second or third of the rules in Step 1b is successful, the following is done:
AT -> ATE
BL -> BLE
IZ -> IZE
(*d and not (*L or *S or *Z)) -> single letter
(m=1 and *o) -> E
2.Appendix Step 1c
(*v*) Y -> I
2.Appendix Step 2
(m>0) ATIONAL -> ATE
(m>0) TIONAL -> TION
(m>0) ENCI -> ENCE
(m>0) ANCI -> ANCE
(m>0) IZER -> IZE
(m>0) ABLI -> ABLE
(m>0) ALLI -> AL
(m>0) ENTLI -> ENT
(m>0) ELI -> E
(m>0) OUSLI -> OUS
(m>0) IZATION -> IZE
(m>0) ATION -> ATE
(m>0) ATOR -> ATE
(m>0) ALISM -> AL
(m>0) IVENESS -> IVE
(m>0) FULNESS -> FUL
(m>0) OUSNESS -> OUS
(m>0) ALITI -> AL
(m>0) IVITI -> IVE
(m>0) BILITI -> BLE
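Step 2 can be sketched as a table lookup guarded by the measure. This is a minimal illustration under two assumptions: the longest matching suffix is selected first, and if its (m>0) condition fails no other rule of the step is tried. The names STEP_2_RULES, measure and step_2 are hypothetical.

```python
STEP_2_RULES = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("anci", "ance"),
    ("izer", "ize"), ("abli", "able"), ("alli", "al"), ("entli", "ent"),
    ("eli", "e"), ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"),
    ("ator", "ate"), ("alism", "al"), ("iveness", "ive"), ("fulness", "ful"),
    ("ousness", "ous"), ("aliti", "al"), ("iviti", "ive"), ("biliti", "ble"),
]


def measure(stem):
    """Number of VC pairs in the stem's C/V group pattern."""
    form = ""
    prev_is_consonant = False
    for i, ch in enumerate(stem):
        # Y counts as a consonant only at the start or after a vowel.
        consonant = ch not in "aeiou" and (ch != "y" or i == 0 or not prev_is_consonant)
        symbol = "C" if consonant else "V"
        if not form.endswith(symbol):
            form += symbol
        prev_is_consonant = consonant
    return form.count("VC")


def step_2(word):
    # Try the longest suffixes first; the first suffix that matches decides.
    for suffix, replacement in sorted(STEP_2_RULES, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


print(step_2("relational"))   # relate
print(step_2("conditional"))  # condition
print(step_2("rational"))     # rational (stem "r" has m = 0)
```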
2.Appendix Step 3
(m>0) ICATE -> IC
(m>0) ATIVE ->
(m>0) ALIZE -> AL
(m>0) ICITI -> IC
(m>0) ICAL -> IC
(m>0) FUL ->
(m>0) NESS ->
2.Appendix Step 4
(m>1) AL ->
(m>1) ANCE ->
(m>1) ENCE ->
(m>1) ER ->
(m>1) IC ->
(m>1) ABLE ->
(m>1) IBLE ->
(m>1) ANT ->
(m>1) EMENT ->
(m>1) MENT ->
(m>1) ENT ->
(m>1 and (*S or *T)) ION ->
(m>1) OU ->
(m>1) ISM ->
(m>1) ATE ->
(m>1) ITI ->
(m>1) OUS ->
(m>1) IVE ->
(m>1) IZE ->
2.Appendix Step 5a
(m>1) E ->
(m=1 and not *o) E ->
2.Appendix Step 5b
(m > 1 and *d and *L) -> single letter