Preface

Before reading this article, please make sure you have read the paper 张华平, 刘群《基于角色标注的中国人名自动识别研究》(Zhang Huaping, Liu Qun: "Automatic Recognition of Chinese Person Names Based on Role Tagging").

The paper divides the words related to person names into 15 roles. By dictionary lookup we can determine which roles a given character or word can play, and then find names by matching patterns over the role sequence.

When I analyzed nr.dct, however, I found that its role partition does not entirely follow the paper. Below are the per-tag statistics I collected from nr.dct; wherever a tag's meaning can be found in the paper, I have annotated it.
Tag counts:

Tag = B(1),   Count = 513,   surname
Tag = C(2),   Count = 955,   first character of a two-character given name
Tag = D(3),   Count = 1,043, last character of a two-character given name
Tag = E(4),   Count = 574,   single-character given name
Tag = F(5),   Count = 3,     prefix
Tag = G(6),   Count = 9,     suffix
*Tag = K(10), Count = 0,     context preceding a name
Tag = L(11),  Count = 1,198, context following a name
Tag = M(12),  Count = 1,684, component between two Chinese names
Tag = N(13),  Count = 67,    <none>
*Tag = U(20), Count = 0,     preceding context fused with the surname
*Tag = V(21), Count = 0,     last character of the name fused with the following context
Tag = X(23),  Count = 84,    surname fused with the first character of the given name
Tag = Y(24),  Count = 47,    surname fused with a single-character given name
Tag = Z(25),  Count = 388,   two-character given name as a single word
Tag = m(44),  Count = 58,    <none>
Tag = *(100), Count = 1,     始##始 (sentence start)
Tag = *(101), Count = 1,     末##末 (sentence end)
1. Splitting the coarse segmentation result

These statistics show that nr.dct contains no U or V tags at all. How, then, does ICTCLAS repair words that were wrongly fused during coarse segmentation? Two examples:

1. "邓/颖/超生/前/使用": here "超生" must be split into "超/生" before further tagging can proceed.
2. "叶/莲/美的/一位/亲戚": here "美的" must be split into "美/的" before further tagging can proceed.

Unfortunately, FreeICTCLAS is helpless on the first example: the current nr.dct simply does not contain the word "超生". That is, whenever the coarse segmentation produces "超生", a word that wrongly absorbs part of a person name, ICTCLAS has no way to take it apart and recognize the name correctly.

What about the second example? Analyzing the current ICTCLAS source, I noticed the following code at the splitting step:
if (m_tagType == TT_NORMAL || !dictUnknown.IsExist(pWordItems[nWordsIndex].sWord, 44))
{
    // TT_NORMAL, or the NE dictionary has no entry for this word with tag 44:
    // store the word in m_sWords[i] and advance m_nWordPosition[i+1].
    strcpy(m_sWords[i], pWordItems[nWordsIndex].sWord); // store current word
    m_nWordPosition[i+1] = m_nWordPosition[i] + strlen(m_sWords[i]);
}
else
{
    if (!bSplit)
    {
        // First half of the split: keep only the first character (2 bytes).
        strncpy(m_sWords[i], pWordItems[nWordsIndex].sWord, 2); // store current word
        m_sWords[i][2] = 0;
        bSplit = true;
    }
    else
    {
        // Second half: keep everything after the first character.
        unsigned int nLen = strlen(pWordItems[nWordsIndex].sWord + 2);
        strncpy(m_sWords[i], pWordItems[nWordsIndex].sWord + 2, nLen); // store current word
        m_sWords[i][nLen] = 0;
        bSplit = false;
    }
    m_nWordPosition[i+1] = m_nWordPosition[i] + strlen(m_sWords[i]);
}
Here,

dictUnknown.IsExist(pWordItems[nWordsIndex].sWord, 44)

looks the current word up in the unknown-word dictionary under tag 44, and that lookup decides whether the split is carried out. So what is 44? The statistics above already contain it:

Tag = m(44), Count = 58, <none>

Tag 44 (m) has no explanation anywhere in the paper. On splitting, the paper speaks only of splitting U and V. So is m the paper's U, or its V? Since there are only 58 entries in total, I list every Tag=44 entry for inspection:
Key: 三和   ID = 2,564 (Tag=44, Frequency=1)
Key: 东家   ID = 744   (Tag=44, Frequency=1)
Key: 之和   ID = 4,052 (Tag=44, Frequency=1)
Key: 健在   ID = 1,490 (Tag=44, Frequency=7)
Key: 初等   ID = 482   (Tag=44, Frequency=2)
Key: 到时   ID = 672   (Tag=44, Frequency=1)
Key: 前程   ID = 2,379 (Tag=44, Frequency=1)
Key: 华为   ID = 1,306 (Tag=44, Frequency=3)
Key: 华以   ID = 1,307 (Tag=44, Frequency=1)
Key: 同江   ID = 3,024 (Tag=44, Frequency=1)
Key: 和田   ID = 1,229 (Tag=44, Frequency=2)
Key: 国是   ID = 1,172 (Tag=44, Frequency=1)
Key: 国都   ID = 1,164 (Tag=44, Frequency=1)
Key: 图说   ID = 3,057 (Tag=44, Frequency=1)
Key: 在理   ID = 3,889 (Tag=44, Frequency=1)
Key: 天王   ID = 2,989 (Tag=44, Frequency=1)
Key: 子书   ID = 4,247 (Tag=44, Frequency=1)
Key: 子孙   ID = 4,248 (Tag=44, Frequency=1)
Key: 学说   ID = 3,506 (Tag=44, Frequency=1)
Key: 对白   ID = 780   (Tag=44, Frequency=1)
Key: 帅才   ID = 2,828 (Tag=44, Frequency=1)
Key: 平和   ID = 2,305 (Tag=44, Frequency=2)
Key: 怡和   ID = 4,448 (Tag=44, Frequency=1)
Key: 慈和   ID = 538   (Tag=44, Frequency=1)
Key: 成说   ID = 444   (Tag=44, Frequency=1)
Key: 文说   ID = 3,186 (Tag=44, Frequency=3)
Key: 新说   ID = 3,416 (Tag=44, Frequency=5)
Key: 明说   ID = 2,130 (Tag=44, Frequency=4)
Key: 有请   ID = 3,772 (Tag=44, Frequency=1)
Key: 来时   ID = 1,817 (Tag=44, Frequency=1)
Key: 来由   ID = 1,820 (Tag=44, Frequency=1)
Key: 永不   ID = 3,746 (Tag=44, Frequency=1)
Key: 清谈   ID = 2,434 (Tag=44, Frequency=1)
Key: 清还   ID = 2,429 (Tag=44, Frequency=6)
Key: 特等   ID = 2,957 (Tag=44, Frequency=1)
Key: 王开   ID = 3,115 (Tag=44, Frequency=1)
Key: 生就   ID = 2,674 (Tag=44, Frequency=1)
Key: 石向   ID = 2,720 (Tag=44, Frequency=4)
Key: 维和   ID = 3,152 (Tag=44, Frequency=1)
Key: 美的   ID = 2,075 (Tag=44, Frequency=3)
Key: 老是   ID = 1,852 (Tag=44, Frequency=1)
Key: 良将   ID = 1,938 (Tag=44, Frequency=1)
Key: 若是   ID = 2,556 (Tag=44, Frequency=1)
Key: 行将   ID = 3,450 (Tag=44, Frequency=1)
Key: 远在   ID = 3,847 (Tag=44, Frequency=3)
Key: 长发   ID = 388   (Tag=44, Frequency=1)
Key: 鲁迅文学奖 ID = 2,005 (Tag=44, Frequency=1)
Key: 茅盾文学奖 ID = 2,059 (Tag=44, Frequency=3)
Among them is the "美的" we just discussed. In other words, the second example "叶/莲/美的/一位/亲戚" is successfully split in two precisely because of this Tag=m entry for "美的".

From this example, Tag=m looks like the paper's V, "the name's last character fused with the following context". But is that really so?

When I went on to search for the tag-44 entry "天王", I noticed that the 199801 People's Daily corpus contains only one sentence where it is involved in a split:

"前几天王老头刚收到小孩寄来的照片"

This sentence is a case of "the preceding context fused with the surname", i.e. the paper's U.

Now things are muddled: m corresponds to both U and V. And according to the splitting code above, the first character of an m word is always split off, regardless of the case.

For V that is fine, since the first character is the last character of the name. For U it is not: for U, the part to split off is everything except the last character. The two operations coincide in exactly one special case, namely when the m word is two characters long; then splitting off the first character and splitting off the last character come to the same thing. Looking over the Tag=m entries above, we find that apart from the two baffling entries "茅盾文学奖" and "鲁迅文学奖", every one of them is exactly two characters long, satisfying this special case.

Are there really no three-character U or V words? I am confident that three- and four-character words satisfying U or V must exist; that is the more general situation. Yet FreeICTCLAS specializes U and V down to two-character words only. Presumably that is also why the tag was not named U or V but given a different letter, the lowercase m.

To sum up: FreeICTCLAS does not actually implement the paper's U and V tags, the two tags that require splitting. In their place it uses m, which is equivalent to them only in the special case of two-character fused words, and handles only that case.
2. Thoughts on prefixes and suffixes

FreeICTCLAS matches names against the following patterns:

// BBCD: surname + surname + given1 + given2
// BBE:  surname + surname + single-character given name
// BBZ:  surname + surname + two-character given name as one word
// BCD:  surname + given1 + given2
// BE:   surname + single-character given name
// BEE:  surname + single + single (e.g. 韩磊磊)
// BG:   surname + suffix
// BXD:  surname + (surname fused with given1) + given2
// BZ:   surname + two-character given name as one word
// B:    surname
// CD:   given1 + given2
// EE:   single + single
// FB:   prefix + surname
// XD:   (surname fused with given1) + given2
// Y:    surname fused with a single-character given name
Below are the nr.dct entries for prefixes:

Tag = F, prefix
Key: 大 ID = 588   (Tag=5, Frequency=3)
Key: 老 ID = 1,834 (Tag=5, Frequency=56)
Key: 小 ID = 3,359 (Tag=5, Frequency=68)
Below are the nr.dct entries for suffixes:

Tag = G, suffix
Key: 哥 ID = 1,014 (Tag=6, Frequency=2)
Key: 公 ID = 1,071 (Tag=6, Frequency=13)
Key: 姐 ID = 1,579 (Tag=6, Frequency=4)
Key: 老 ID = 1,834 (Tag=6, Frequency=32)
Key: 某 ID = 2,157 (Tag=6, Frequency=40)
Key: 嫂 ID = 2,573 (Tag=6, Frequency=14)
Key: 氏 ID = 2,758 (Tag=6, Frequency=14)
Key: 帅 ID = 2,827 (Tag=6, Frequency=18)
Key: 总 ID = 4,269 (Tag=6, Frequency=2)
The prefixes and suffixes raise some puzzles as well. Given that names can be formed as

// FB: prefix + surname
// BG: surname + suffix

why are the two-character suffixes of "张老师" and "周总理" not included? Prefixes can also be two characters, as in "馄饨侯", "泥人张", "年糕陈", yet none of these were collected among the prefixes either.

So what do such common "suffixes" as 老师 and 总理 appear as in the dictionary?
Key: 总理 ID = 4,281 (Tag=11, Frequency=105) (Tag=12, Frequency=110)
Key: 老师 ID = 1,851 (Tag=12, Frequency=27)
Tag 11 is L (actually the paper's K), the context preceding a name; tag 12 is M (actually the paper's L), the context following a name.

After a search through the corpus, I noticed that "周总理" is annotated as "周/nr 总理/n": the two are not merged into a single person name but treated as an apposition, the latter being a common noun. Presumably all two-character prefixes and suffixes are annotated this way, and the "prefixes" and "suffixes" here cover only the single-character case.
3. Erroneous entries among the surnames (B)

Some of the entries under B are unreasonable. For instance, "建军" is tagged B, although it should obviously be split as 建(C) 军(D); and

Key: 孔子 ID = 1,779 (Tag=1, Frequency=5)

should have "子" as suffix G and "孔" as surname B, forming a BG pair. Tag B contains a great many examples of this kind. My impression is that the preprocessing step of learning from the corpus was not done carefully enough, which produced this phenomenon. The paper mentions that the PKU-annotated corpus does not distinguish the surname from the given name; that is probably the main cause of the mislabeled name entries in nr.dct. I also suspect that the prefixes and suffixes above were compiled by hand rather than extracted automatically.
4. On the paper's (K) preceding context, (L) following context, and (M) component between two Chinese names

Looking these three roles up by the paper's letters is certain to go wrong. Note that the dictionary has no entries with tag K, the context preceding a name, but it does contain an extra tag N with no counterpart in the paper. Here are the N entries:
[13] Key: 帮助 ID = 181   (Tag=13, Frequency=1)
[13] Key: 保   ID = 189   (Tag=13, Frequency=1)
[13] Key: 保山 ID = 192   (Tag=13, Frequency=1)
[13] Key: 背着 ID = 212   (Tag=13, Frequency=1)
[13] Key: 并   ID = 280   (Tag=13, Frequency=1)
[13] Key: 部署 ID = 326   (Tag=13, Frequency=1)
[13] Key: 称   ID = 430   (Tag=13, Frequency=1)
[13] Key: 称赞 ID = 431   (Tag=13, Frequency=1)
[13] Key: 出局 ID = 489   (Tag=13, Frequency=1)
[13] Key: 代表 ID = 630   (Tag=13, Frequency=1)
[13] Key: 的   ID = 685   (Tag=13, Frequency=2)
[13] Key: 对   ID = 779   (Tag=13, Frequency=19)
[13] Key: 分析 ID = 874   (Tag=13, Frequency=1)
[13] Key: 夫人 ID = 905   (Tag=13, Frequency=26)
[13] Key: 赶到 ID = 959   (Tag=13, Frequency=1)
[13] Key: 告诉 ID = 1,012 (Tag=13, Frequency=1)
[13] Key: 给   ID = 1,036 (Tag=13, Frequency=2)
[13] Key: 共诛 ID = 1,085 (Tag=13, Frequency=1)
[13] Key: 和   ID = 1,227 (Tag=13, Frequency=76)
[13] Key: 欢迎 ID = 1,324 (Tag=13, Frequency=1)
[13] Key: 会见 ID = 1,365 (Tag=13, Frequency=3)
[13] Key: 及   ID = 1,410 (Tag=13, Frequency=2)
[13] Key: 将   ID = 1,509 (Tag=13, Frequency=1)
[13] Key: 讲话 ID = 1,524 (Tag=13, Frequency=1)
[13] Key: 交代 ID = 1,530 (Tag=13, Frequency=1)
[13] Key: 接到 ID = 1,555 (Tag=13, Frequency=1)
[13] Key: 来到 ID = 1,813 (Tag=13, Frequency=1)
[13] Key: 老伴 ID = 1,836 (Tag=13, Frequency=1)
[13] Key: 女儿 ID = 2,235 (Tag=13, Frequency=2)
[13] Key: 陪   ID = 2,274 (Tag=13, Frequency=1)
[13] Key: 陪同 ID = 2,276 (Tag=13, Frequency=1)
[13] Key: 妻子 ID = 2,332 (Tag=13, Frequency=3)
[13] Key: 请   ID = 2,439 (Tag=13, Frequency=3)
[13] Key: 饰   ID = 2,756 (Tag=13, Frequency=2)
[13] Key: 受   ID = 2,788 (Tag=13, Frequency=1)
[13] Key: 送行 ID = 2,877 (Tag=13, Frequency=1)
[13] Key: 题词 ID = 2,973 (Tag=13, Frequency=1)
[13] Key: 同   ID = 3,021 (Tag=13, Frequency=5)
[13] Key: 托   ID = 3,078 (Tag=13, Frequency=1)
[13] Key: 文   ID = 3,179 (Tag=13, Frequency=1)
[13] Key: 先锋 ID = 3,294 (Tag=13, Frequency=1)
[13] Key: 向   ID = 3,348 (Tag=13, Frequency=9)
[13] Key: 研究 ID = 3,540 (Tag=13, Frequency=1)
[13] Key: 演   ID = 3,560 (Tag=13, Frequency=1)
[13] Key: 邀请 ID = 3,587 (Tag=13, Frequency=1)
[13] Key: 以   ID = 3,659 (Tag=13, Frequency=1)
[13] Key: 以及 ID = 3,660 (Tag=13, Frequency=4)
[13] Key: 应   ID = 3,720 (Tag=13, Frequency=1)
[13] Key: 由   ID = 3,762 (Tag=13, Frequency=1)
[13] Key: 与   ID = 3,800 (Tag=13, Frequency=19)
[13] Key: 原名 ID = 3,837 (Tag=13, Frequency=1)
[13] Key: 在   ID = 3,886 (Tag=13, Frequency=1)
[13] Key: 赞助 ID = 3,897 (Tag=13, Frequency=1)
[13] Key: 找   ID = 3,964 (Tag=13, Frequency=1)
[13] Key: 争取 ID = 4,011 (Tag=13, Frequency=1)
[13] Key: 直到 ID = 4,059 (Tag=13, Frequency=1)
[13] Key: 侄女 ID = 4,068 (Tag=13, Frequency=1)
[13] Key: 致   ID = 4,083 (Tag=13, Frequency=1)
[13] Key: 主持 ID = 4,176 (Tag=13, Frequency=1)
[13] Key: 祝   ID = 4,201 (Tag=13, Frequency=1)
[13] Key: 总书记 ID = 4,285 (Tag=13, Frequency=1)
[13] Key: 、  ID = 4,336 (Tag=13, Frequency=3,404)
[13] Key: "   ID = 4,347 (Tag=13, Frequency=2)
[13] Key: ( ID = 4,354 (Tag=13, Frequency=18)
[13] Key: ) ID = 4,355 (Tag=13, Frequency=6)
[13] Key: , ID = 4,356 (Tag=13, Frequency=11)
[13] Key: /  ID = 4,357 (Tag=13, Frequency=15)
My reading is that this N is actually the paper's M, "component between two Chinese names"; the current M is actually the paper's L, "context following a name"; and the current L is actually the paper's K, "context preceding a name". That is, all three have slipped back by one letter.
5. Summary

Let us now redo the initial per-tag count table and make sense of the dictionary's contents:

Tag counts:

Tag = B(1),   Count = 513,   (B) surname
Tag = C(2),   Count = 955,   (C) first character of a two-character given name
Tag = D(3),   Count = 1,043, (D) last character of a two-character given name
Tag = E(4),   Count = 574,   (E) single-character given name
Tag = F(5),   Count = 3,     (F) prefix
Tag = G(6),   Count = 9,     (G) suffix
Tag = L(11),  Count = 1,198, (K) context preceding a name
Tag = M(12),  Count = 1,684, (L) context following a name
Tag = N(13),  Count = 67,    (M) component between two Chinese names
Tag = X(23),  Count = 84,    (X) surname fused with the first character of the given name
Tag = Y(24),  Count = 47,    (Y) surname fused with a single-character given name
Tag = Z(25),  Count = 388,   (Z) two-character given name as a single word
Tag = m(44),  Count = 58,    (U) preceding context fused with the surname & (V) last character of the name fused with the following context
Tag = *(100), Count = 1,     始##始 (sentence start)
Tag = *(101), Count = 1,     末##末 (sentence end)
At this point the contents of nr.dct are essentially clear. But this also raises harder requirements. The original ICTCLAS never implemented splitting for words of three or more characters, which we may need to add. And how to generate our own nr.dct effectively is a topic in itself: unlike core's unigram frequencies or bigram transition frequencies, it cannot be obtained with a simple single pass over the corpus. As the existing nr.dct shows, an imperfect preprocessing step leaves quite a few wrong entries in the dictionary, so our preprocessing will need to add more analysis and rule-based checks to make the corpus-learned name-recognition dictionary more accurate.