I never managed to find a simple, intuitive Java implementation of CRFs, so in the end I followed this blog post instead:

中文分词入门之字标注法4 (Introduction to Chinese Word Segmentation: The Character Tagging Method, Part 4)

That post explains the details very clearly. To use CRF++ on Windows, you can download the precompiled binaries from:

https://crfpp.googlecode.com/files/CRF%2B%2B-0.58.zip

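If that address is blocked for you, the same file can be downloaded from: http://pan.baidu.com/s/1jNfum

It is still best to fetch it from the original address yourself. One tool I can recommend for getting past the firewall is http://honx.in/i/VKEHUuz5NEy3Un5i ; I have been using it all along and it works reasonably well.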
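Before CRF++ can be trained, the segmented corpus has to be turned into one character per line with a tag column, with a blank line between sentences. Below is a minimal sketch of that conversion, assuming the common 4-tag B/M/E/S scheme (begin, middle, end, single); the function names `tag_word`, `to_crf_lines` and `tags_to_words` are my own illustrative choices, not taken from the referenced post or from CRF++.

```python
def tag_word(word):
    """Return (char, tag) pairs for one word, assuming the 4-tag B/M/E/S scheme."""
    if len(word) == 1:
        return [(word, "S")]
    middle = [(ch, "M") for ch in word[1:-1]]
    return [(word[0], "B")] + middle + [(word[-1], "E")]


def to_crf_lines(segmented_sentence):
    """Turn a whitespace-segmented sentence into 'char<TAB>tag' lines.

    A trailing empty string marks the sentence boundary, since CRF++
    separates sentences with a blank line.
    """
    lines = []
    for word in segmented_sentence.split():
        for ch, tag in tag_word(word):
            lines.append(f"{ch}\t{tag}")
    lines.append("")
    return lines


def tags_to_words(chars, tags):
    """Rebuild words from characters and their predicted B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends here
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


if __name__ == "__main__":
    print("\n".join(to_crf_lines("共同 创造 美好 的 新世纪")))
```

The resulting file is what gets passed to `crf_learn` together with a feature template, and `tags_to_words` is the reverse step applied to the tag column that `crf_test` appends to each line.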


Finally, here are the results on the PKU dataset from the bakeoff:

Closed test (evaluated on the training set):

=== SUMMARY:

=== TOTAL INSERTIONS: 987

=== TOTAL DELETIONS: 1312

=== TOTAL SUBSTITUTIONS: 2277

=== TOTAL NCHANGE: 4576

=== TOTAL TRUE WORD COUNT: 1109947

=== TOTAL TEST WORD COUNT: 1109622

=== TOTAL TRUE WORDS RECALL: 0.997

=== TOTAL TEST WORDS PRECISION: 0.997

=== F MEASURE: 0.997

=== OOV Rate: 0.000

=== OOV Recall Rate: --

=== IV Recall Rate: 0.997

### pku_crf_training.word.utf8 987 1312 2277 4576 1109947 1109622 0.997 0.997 0.997 0.000 -- 0.997

Open test (evaluated on gold/test):

=== SUMMARY:

=== TOTAL INSERTIONS: 1492

=== TOTAL DELETIONS: 3150

=== TOTAL SUBSTITUTIONS: 4966

=== TOTAL NCHANGE: 9608

=== TOTAL TRUE WORD COUNT: 104372

=== TOTAL TEST WORD COUNT: 102714

=== TOTAL TRUE WORDS RECALL: 0.922

=== TOTAL TEST WORDS PRECISION: 0.937

=== F MEASURE: 0.930

=== OOV Rate: 0.058

=== OOV Recall Rate: 0.562

=== IV Recall Rate: 0.944

### pku_crf_test.word.utf8 1492 3150 4966 9608 104372 102714 0.922 0.937 0.930 0.058 0.562 0.944


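Even on the open test, precision reaches 93.7% (recall 92.2%, F-measure 0.930), and this is without any tuning: the features all come from CRF++'s default feature template.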
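The summary figures in the two reports can be reproduced from the raw counts. The sketch below assumes that recall and precision are the number of correctly segmented words divided by the true and test word counts respectively, where the correct count is the word count minus the relevant errors. I have not checked this against the scoring script's source, but the arithmetic does match the numbers reported above. The function name `score` is hypothetical.

```python
def score(insertions, deletions, substitutions, true_words, test_words):
    """Recompute recall, precision and F-measure from the raw error counts.

    Assumption: a gold word is correct unless it was deleted or substituted,
    and a test word is correct unless it was inserted or substituted.
    """
    recall = (true_words - deletions - substitutions) / true_words
    precision = (test_words - insertions - substitutions) / test_words
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


if __name__ == "__main__":
    # Counts copied from the closed-test and open-test summaries.
    closed = score(987, 1312, 2277, 1109947, 1109622)
    opened = score(1492, 3150, 4966, 104372, 102714)
    print("closed: R=%.3f P=%.3f F=%.3f" % closed)
    print("open:   R=%.3f P=%.3f F=%.3f" % opened)
```

In both reports TOTAL NCHANGE is simply insertions + deletions + substitutions (987 + 1312 + 2277 = 4576 and 1492 + 3150 + 4966 = 9608).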