Lucene 3下最快的中文分词器

lucene 3下最快的中文分词器 - zzzhc's Blog

 


Lucene 3下最快的中文分词器

 

Jun 19th, 2011

 

 

包包分词器 - 一个基于字典的快速中文分词器

source code

features

  • 简单 1000LOC
  • 高效 7M+ chars/second
  • 支持中文,英语,数字
  • 自动识别未登录词
  • 支持OffsetAttribute
  • 支持TypeAttribute
  • 支持PositionIncrementAttribute

usage

 

 
1
2
3
4
5
6
7
8
Dict dict = new Dict(); dict.addAllSpecialTypes(); BufferedReader dictReader = new BufferedReader(new InputStreamReader(  new FileInputStream("dict.txt"), "UTF-8")); dict.load(dictReader); dictReader.close(); dict.optimize(); DictAnalyzer dictAnalyzer = new DictAnalyzer(dict); 

 

 

benchmark

ant benchmark

 

 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
supported features:  CharTerm Offset PositionIncrement Term Type  IKAnalyzer Y Y N Y N  MMSegAnalyzer Y Y N Y Y  PaodingAnalyzer Y Y N Y Y StandardAnalyzer Y Y Y Y Y  BaoBaoAnalyzer Y Y Y Y Y  test 1, sample length=26265  name chars time tokens speed(chars/second)  PaodingAnalyzer 26265 0.610 12542 43036.87  MMSegAnalyzer 26265 0.314 14007 83566.52  IKAnalyzer 26265 0.262 16016 100177.91 StandardAnalyzer 26265 0.141 22366 185727.87  BaoBaoAnalyzer 26265 0.038 18185 695682.16  test 2, sample length=262650  name chars time tokens speed(chars/second)  PaodingAnalyzer 262650 0.187 125420 1402139.61  IKAnalyzer 262650 0.163 160160 1613693.16  MMSegAnalyzer 262650 0.158 140070 1664009.53  BaoBaoAnalyzer 262650 0.041 181850 6362134.44 StandardAnalyzer 262650 0.020 223660 12905789.80  test 3, sample length=2626500  name chars time tokens speed(chars/second)  IKAnalyzer 2626500 2.251 1601600 1166564.72  PaodingAnalyzer 2626500 1.462 1254200 1796381.55  MMSegAnalyzer 2626500 1.043 1400700 2519010.94  BaoBaoAnalyzer 2626500 0.352 1818500 7458959.20 StandardAnalyzer 2626500 0.202 2236600 13015280.16 

 

 

 

 

 

Jun 19th, 2011

 

lucene

 

 

你可能感兴趣的:(Lucene)