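word is a Chinese word segmentation component implemented in Java. It provides several dictionary-based segmentation algorithms and uses an ngram model to resolve ambiguity. It accurately recognizes English words, numbers, and quantity expressions such as dates and times, and it can recognize out-of-vocabulary words such as person names, place names, and organization names. Plugins for Lucene, Solr, and ElasticSearch are also provided.
The evaluation of the word segmenter covers the following 7 segmentation algorithms: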
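Forward maximum matching: MaximumMatching
Reverse maximum matching: ReverseMaximumMatching
Forward minimum matching: MinimumMatching
Reverse minimum matching: ReverseMinimumMatching
Bidirectional maximum matching: BidirectionalMaximumMatching
Bidirectional minimum matching: BidirectionalMinimumMatching
Bidirectional maximum-minimum matching: BidirectionalMaximumMinimumMatching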
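To make the names above concrete, here is a minimal sketch of forward and reverse maximum matching over a toy dictionary. The class name `MaxMatchingSketch`, the dictionary contents, and the `maxLen` parameter are assumptions made for illustration; this is not the word library's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy illustration of dictionary-based maximum matching (hypothetical class, not part of word). */
class MaxMatchingSketch {

    /** Forward maximum matching: at each position, take the longest dictionary word starting there. */
    static List<String> forward(String text, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(text.length(), start + maxLen);
            // Shrink the window from the right until it is a dictionary word or a single character.
            while (end > start + 1 && !dict.contains(text.substring(start, end))) {
                end--;
            }
            words.add(text.substring(start, end));
            start = end;
        }
        return words;
    }

    /** Reverse maximum matching: the same idea, scanning from the end of the text. */
    static List<String> reverse(String text, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int end = text.length();
        while (end > 0) {
            int start = Math.max(0, end - maxLen);
            // Shrink the window from the left until it is a dictionary word or a single character.
            while (start < end - 1 && !dict.contains(text.substring(start, end))) {
                start++;
            }
            words.add(text.substring(start, end));
            end = start;
        }
        Collections.reverse(words); // words were collected back to front
        return words;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("研究", "研究生", "生命", "起源"));
        System.out.println(forward("研究生命起源", dict, 3)); // [研究生, 命, 起源]
        System.out.println(reverse("研究生命起源", dict, 3)); // [研究, 生命, 起源]
    }
}
```

The two directions disagree on this input, which is the kind of ambiguity a bidirectional algorithm has to resolve. A minimum-matching variant would grow the window from one character instead of shrinking it from `maxLen`.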
All bidirectional algorithms use ngram to resolve ambiguity, so the evaluation is run separately with bigram and with trigram.
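The article does not show how the ngram model picks between candidates, so the following is only a sketch of one plausible scheme: sum a bigram score over adjacent word pairs and keep the candidate with the highest total. The class name, the `"left:right"` key format, and the score values are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy bigram-based disambiguation (hypothetical; word's actual scoring may differ). */
class BigramDisambiguationSketch {

    /** Sums the bigram score of every adjacent word pair; unseen pairs score 0. */
    static float score(List<String> words, Map<String, Float> bigram) {
        float total = 0;
        for (int i = 0; i + 1 < words.size(); i++) {
            total += bigram.getOrDefault(words.get(i) + ":" + words.get(i + 1), 0f);
        }
        return total;
    }

    /** Returns the candidate with the highest score; ties keep the earlier candidate. */
    static List<String> choose(List<List<String>> candidates, Map<String, Float> bigram) {
        List<String> best = candidates.get(0);
        for (List<String> candidate : candidates) {
            if (score(candidate, bigram) > score(best, bigram)) {
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up scores for illustration only.
        Map<String, Float> bigram = new HashMap<>();
        bigram.put("研究:生命", 2f);
        bigram.put("生命:起源", 3f);

        List<String> forward = Arrays.asList("研究生", "命", "起源");
        List<String> reverse = Arrays.asList("研究", "生命", "起源");
        System.out.println(choose(Arrays.asList(forward, reverse), bigram)); // [研究, 生命, 起源]
    }
}
```

A trigram variant would score word triples instead of pairs; the selection step stays the same.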
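The test text has 2,533,709 lines and 28,374,490 characters in total. The standard text and the test text correspond line by line, and words in the standard text are separated by spaces. The evaluation criterion is exact match: a line counts as perfect only if the segmentation output is identical to the standard line. The core evaluation code is as follows: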
/**
 * Segmentation quality evaluation
 * @param resultText path of the file with the actual segmentation results
 * @param standardText path of the file with the standard segmentation results
 * @return evaluation result
 */
public static EvaluationResult evaluation(String resultText, String standardText) {
    int perfectLineCount=0;
    int wrongLineCount=0;
    int perfectCharCount=0;
    int wrongCharCount=0;
    try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));
        BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){
        String result;
        while( (result = resultReader.readLine()) != null ){
            result = result.trim();
            // The standard file is read in lockstep so that both files stay aligned line by line
            String standard = standardReader.readLine().trim();
            if(result.equals("")){
                // Empty result lines are skipped and count toward neither total
                continue;
            }
            if(result.equals(standard)){
                // The segmentation result is identical to the standard
                perfectLineCount++;
                perfectCharCount+=standard.replaceAll("\\s+", "").length();
            }else{
                // The segmentation result differs from the standard
                wrongLineCount++;
                wrongCharCount+=standard.replaceAll("\\s+", "").length();
            }
        }
    } catch (IOException ex) {
        LOGGER.error("Segmentation quality evaluation failed:", ex);
    }
    int totalLineCount = perfectLineCount+wrongLineCount;
    int totalCharCount = perfectCharCount+wrongCharCount;
    EvaluationResult er = new EvaluationResult();
    er.setPerfectCharCount(perfectCharCount);
    er.setPerfectLineCount(perfectLineCount);
    er.setTotalCharCount(totalCharCount);
    er.setTotalLineCount(totalLineCount);
    er.setWrongCharCount(wrongCharCount);
    er.setWrongLineCount(wrongLineCount);
    return er;
}
/**
 * Chinese word segmentation evaluation result
 * @author 杨尚川
 */
public class EvaluationResult implements Comparable{
    // toString() below uses these two fields. Their declarations are missing from the
    // original excerpt and are added here, with setters, so that the class compiles.
    private SegmentationAlgorithm segmentationAlgorithm;
    private float segSpeed;
    private int totalLineCount;
    private int perfectLineCount;
    private int wrongLineCount;
    private int totalCharCount;
    private int perfectCharCount;
    private int wrongCharCount;

    public void setSegmentationAlgorithm(SegmentationAlgorithm segmentationAlgorithm) { this.segmentationAlgorithm = segmentationAlgorithm; }
    public void setSegSpeed(float segSpeed) { this.segSpeed = segSpeed; }

    // Rates are percentages in the range 0 to 100
    public float getLinePerfectRate(){ return perfectLineCount/(float)totalLineCount*100; }
    public float getLineWrongRate(){ return wrongLineCount/(float)totalLineCount*100; }
    public float getCharPerfectRate(){ return perfectCharCount/(float)totalCharCount*100; }
    public float getCharWrongRate(){ return wrongCharCount/(float)totalCharCount*100; }

    public int getTotalLineCount() { return totalLineCount; }
    public void setTotalLineCount(int totalLineCount) { this.totalLineCount = totalLineCount; }
    public int getPerfectLineCount() { return perfectLineCount; }
    public void setPerfectLineCount(int perfectLineCount) { this.perfectLineCount = perfectLineCount; }
    public int getWrongLineCount() { return wrongLineCount; }
    public void setWrongLineCount(int wrongLineCount) { this.wrongLineCount = wrongLineCount; }
    public int getTotalCharCount() { return totalCharCount; }
    public void setTotalCharCount(int totalCharCount) { this.totalCharCount = totalCharCount; }
    public int getPerfectCharCount() { return perfectCharCount; }
    public void setPerfectCharCount(int perfectCharCount) { this.perfectCharCount = perfectCharCount; }
    public int getWrongCharCount() { return wrongCharCount; }
    public void setWrongCharCount(int wrongCharCount) { this.wrongCharCount = wrongCharCount; }

    @Override
    public String toString(){
        return segmentationAlgorithm.name()+"("+segmentationAlgorithm.getDes()+"):"
            +"\n"
            +"Segmentation speed: "+segSpeed+" chars/ms"
            +"\n"
            +"Line perfect rate: "+getLinePerfectRate()+"%"
            +"  Line error rate: "+getLineWrongRate()+"%"
            +"  Total lines: "+totalLineCount
            +"  Perfect lines: "+perfectLineCount
            +"  Wrong lines: "+wrongLineCount
            +"\n"
            +"Char perfect rate: "+getCharPerfectRate()+"%"
            +"  Char error rate: "+getCharWrongRate()+"%"
            +"  Total chars: "+totalCharCount
            +"  Perfect chars: "+perfectCharCount
            +"  Wrong chars: "+wrongCharCount;
    }

    // Sorts in descending order of line perfect rate
    @Override
    public int compareTo(Object o) {
        EvaluationResult other = (EvaluationResult)o;
        if(other.getLinePerfectRate() - getLinePerfectRate() > 0){
            return 1;
        }
        if(other.getLinePerfectRate() - getLinePerfectRate() < 0){
            return -1;
        }
        return 0;
    }
}
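The exact-match criterion is easy to see on a tiny in-memory example. The sketch below applies the same rules as the file-based `evaluation` method above (trim, skip empty result lines, compare whole lines, count characters of the standard line without whitespace); the class name and the sample lines are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** In-memory version of the strict line-by-line evaluation (hypothetical helper). */
class StrictEvaluationSketch {

    /** Returns {perfectLines, wrongLines, perfectChars, wrongChars}. */
    static int[] evaluate(List<String> results, List<String> standards) {
        int[] counts = new int[4];
        for (int i = 0; i < results.size(); i++) {
            String result = results.get(i).trim();
            String standard = standards.get(i).trim();
            if (result.isEmpty()) {
                continue; // empty result lines count toward neither total
            }
            int chars = standard.replaceAll("\\s+", "").length();
            if (result.equals(standard)) {
                counts[0]++;
                counts[2] += chars;
            } else {
                counts[1]++;
                counts[3] += chars;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> results = Arrays.asList("研究 生命 起源", "研究生 命 起源", "");
        List<String> standards = Arrays.asList("研究 生命 起源", "研究 生命 起源", "你好");
        int[] c = evaluate(results, standards);
        System.out.println(Arrays.toString(c)); // [1, 1, 6, 6]
        System.out.println(c[0] / (float) (c[0] + c[1]) * 100 + "%"); // 50.0%
    }
}
```

Note that a single wrong boundary marks the whole line, and all of its characters, as wrong, so this metric is much stricter than word-level precision or recall.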
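Evaluation results for word using trigram:

BidirectionalMaximumMinimumMatching (bidirectional maximum-minimum matching):
Segmentation speed: 265.62566 chars/ms
Line perfect rate: 55.352688%  Line error rate: 44.647312%  Total lines: 2533709  Perfect lines: 1402476  Wrong lines: 1131233
Char perfect rate: 46.23227%  Char error rate: 53.76773%  Total chars: 28374490  Perfect chars: 13118171  Wrong chars: 15256319

BidirectionalMaximumMatching (bidirectional maximum matching):
Segmentation speed: 335.62155 chars/ms
Line perfect rate: 50.16934%  Line error rate: 49.83066%  Total lines: 2533709  Perfect lines: 1271145  Wrong lines: 1262564
Char perfect rate: 40.692997%  Char error rate: 59.307003%  Total chars: 28374490  Perfect chars: 11546430  Wrong chars: 16828060

ReverseMaximumMatching (reverse maximum matching):
Segmentation speed: 686.71045 chars/ms
Line perfect rate: 46.723125%  Line error rate: 53.27688%  Total lines: 2533709  Perfect lines: 1183828  Wrong lines: 1349881
Char perfect rate: 36.67598%  Char error rate: 63.32402%  Total chars: 28374490  Perfect chars: 10406622  Wrong chars: 17967868

MaximumMatching (forward maximum matching):
Segmentation speed: 733.9535 chars/ms
Line perfect rate: 46.661713%  Line error rate: 53.338287%  Total lines: 2533709  Perfect lines: 1182272  Wrong lines: 1351437
Char perfect rate: 36.72861%  Char error rate: 63.271393%  Total chars: 28374490  Perfect chars: 10421556  Wrong chars: 17952934

BidirectionalMinimumMatching (bidirectional minimum matching):
Segmentation speed: 432.87375 chars/ms
Line perfect rate: 45.863907%  Line error rate: 54.136093%  Total lines: 2533709  Perfect lines: 1162058  Wrong lines: 1371651
Char perfect rate: 35.942123%  Char error rate: 64.05788%  Total chars: 28374490  Perfect chars: 10198395  Wrong chars: 18176095

ReverseMinimumMatching (reverse minimum matching):
Segmentation speed: 1033.58636 chars/ms
Line perfect rate: 41.776066%  Line error rate: 58.223934%  Total lines: 2533709  Perfect lines: 1058484  Wrong lines: 1475225
Char perfect rate: 31.678978%  Char error rate: 68.32102%  Total chars: 28374490  Perfect chars: 8988748  Wrong chars: 19385742

MinimumMatching (forward minimum matching):
Segmentation speed: 1175.4431 chars/ms
Line perfect rate: 36.853836%  Line error rate: 63.146164%  Total lines: 2533709  Perfect lines: 933769  Wrong lines: 1599940
Char perfect rate: 26.859812%  Char error rate: 73.14019%  Total chars: 28374490  Perfect chars: 7621334  Wrong chars: 20753156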
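The percentages in these listings follow directly from the counts. As a quick check, the sketch below recomputes the first entry's rates with the same float formula the evaluation result class uses; the class name `RateCheckSketch` is hypothetical.

```java
/** Recomputes reported rates from the raw counts (hypothetical helper). */
class RateCheckSketch {

    /** Same formula as getLinePerfectRate() and getCharPerfectRate(): a percentage in float precision. */
    static float rate(int part, int total) {
        return part / (float) total * 100;
    }

    public static void main(String[] args) {
        // Counts from the BidirectionalMaximumMinimumMatching trigram entry
        System.out.println("Line perfect rate: " + rate(1402476, 2533709) + "%");
        System.out.println("Char perfect rate: " + rate(13118171, 28374490) + "%");
    }
}
```

The results agree with the reported 55.352688% and 46.23227% up to float rounding.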
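Evaluation results for word using bigram:

BidirectionalMaximumMinimumMatching (bidirectional maximum-minimum matching):
Segmentation speed: 233.49121 chars/ms
Line perfect rate: 55.31531%  Line error rate: 44.68469%  Total lines: 2533709  Perfect lines: 1401529  Wrong lines: 1132180
Char perfect rate: 45.834396%  Char error rate: 54.165604%  Total chars: 28374490  Perfect chars: 13005277  Wrong chars: 15369213

BidirectionalMaximumMatching (bidirectional maximum matching):
Segmentation speed: 303.59401 chars/ms
Line perfect rate: 52.007233%  Line error rate: 47.992767%  Total lines: 2533709  Perfect lines: 1317712  Wrong lines: 1215997
Char perfect rate: 42.424194%  Char error rate: 57.575806%  Total chars: 28374490  Perfect chars: 12037649  Wrong chars: 16336841

BidirectionalMinimumMatching (bidirectional minimum matching):
Segmentation speed: 349.67215 chars/ms
Line perfect rate: 46.766422%  Line error rate: 53.23358%  Total lines: 2533709  Perfect lines: 1184925  Wrong lines: 1348784
Char perfect rate: 36.52718%  Char error rate: 63.47282%  Total chars: 28374490  Perfect chars: 10364401  Wrong chars: 18010089

ReverseMaximumMatching (reverse maximum matching):
Segmentation speed: 598.04272 chars/ms
Line perfect rate: 46.723125%  Line error rate: 53.27688%  Total lines: 2533709  Perfect lines: 1183828  Wrong lines: 1349881
Char perfect rate: 36.67598%  Char error rate: 63.32402%  Total chars: 28374490  Perfect chars: 10406622  Wrong chars: 17967868

MaximumMatching (forward maximum matching):
Segmentation speed: 676.7993 chars/ms
Line perfect rate: 46.661713%  Line error rate: 53.338287%  Total lines: 2533709  Perfect lines: 1182272  Wrong lines: 1351437
Char perfect rate: 36.72861%  Char error rate: 63.271393%  Total chars: 28374490  Perfect chars: 10421556  Wrong chars: 17952934

ReverseMinimumMatching (reverse minimum matching):
Segmentation speed: 806.9586 chars/ms
Line perfect rate: 41.776066%  Line error rate: 58.223934%  Total lines: 2533709  Perfect lines: 1058484  Wrong lines: 1475225
Char perfect rate: 31.678978%  Char error rate: 68.32102%  Total chars: 28374490  Perfect chars: 8988748  Wrong chars: 19385742

MinimumMatching (forward minimum matching):
Segmentation speed: 1020.9208 chars/ms
Line perfect rate: 36.853836%  Line error rate: 63.146164%  Total lines: 2533709  Perfect lines: 933769  Wrong lines: 1599940
Char perfect rate: 26.859812%  Char error rate: 73.14019%  Total chars: 28374490  Perfect chars: 7621334  Wrong chars: 20753156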
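The evaluation results for Ansj 0.9 are as follows:

Ansj ToAnalysis (precise segmentation):
Segmentation speed: 495.9188 chars/ms
Line perfect rate: 58.609295%  Line error rate: 41.390705%  Total lines: 2533709  Perfect lines: 1484989  Wrong lines: 1048720
Char perfect rate: 50.97614%  Char error rate: 49.023857%  Total chars: 28374490  Perfect chars: 14464220  Wrong chars: 13910270

Ansj NlpAnalysis (NLP segmentation):
Segmentation speed: 350.7527 chars/ms
Line perfect rate: 58.60353%  Line error rate: 41.396465%  Total lines: 2533709  Perfect lines: 1484843  Wrong lines: 1048866
Char perfect rate: 50.75546%  Char error rate: 49.244545%  Total chars: 28374490  Perfect chars: 14401602  Wrong chars: 13972888

Ansj BaseAnalysis (basic segmentation):
Segmentation speed: 532.65424 chars/ms
Line perfect rate: 54.028584%  Line error rate: 45.97142%  Total lines: 2533709  Perfect lines: 1368927  Wrong lines: 1164782
Char perfect rate: 46.84512%  Char error rate: 53.15488%  Total chars: 28374490  Perfect chars: 13292064  Wrong chars: 15082426

Ansj IndexAnalysis (index-oriented segmentation):
Segmentation speed: 564.6103 chars/ms
Line perfect rate: 53.510803%  Line error rate: 46.489197%  Total lines: 2533709  Perfect lines: 1355808  Wrong lines: 1177901
Char perfect rate: 46.355087%  Char error rate: 53.644913%  Total chars: 28374490  Perfect chars: 13153019  Wrong chars: 15221471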
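The evaluation results for Ansj 1.4 are as follows:

Ansj ToAnalysis (precise segmentation):
Segmentation speed: 581.7306 chars/ms
Line perfect rate: 58.60302%  Line error rate: 41.39698%  Total lines: 2533709  Perfect lines: 1484830  Wrong lines: 1048879
Char perfect rate: 50.968987%  Char error rate: 49.031013%  Total chars: 28374490  Perfect chars: 14462190  Wrong chars: 13912300

Ansj NlpAnalysis (NLP segmentation):
Segmentation speed: 138.81165 chars/ms
Line perfect rate: 58.1515%  Line error rate: 41.8485%  Total lines: 2533687  Perfect lines: 1473377  Wrong lines: 1060310
Char perfect rate: 49.806484%  Char error rate: 50.19352%  Total chars: 28374398  Perfect chars: 14132290  Wrong chars: 14242108

Ansj BaseAnalysis (basic segmentation):
Segmentation speed: 627.68475 chars/ms
Line perfect rate: 55.3174%  Line error rate: 44.6826%  Total lines: 2533709  Perfect lines: 1401582  Wrong lines: 1132127
Char perfect rate: 48.177986%  Char error rate: 51.822014%  Total chars: 28374490  Perfect chars: 13670258  Wrong chars: 14704232

Ansj IndexAnalysis (index-oriented segmentation):
Segmentation speed: 715.55176 chars/ms
Line perfect rate: 50.89444%  Line error rate: 49.10556%  Total lines: 2533709  Perfect lines: 1289517  Wrong lines: 1244192
Char perfect rate: 42.965115%  Char error rate: 57.034885%  Total chars: 28374490  Perfect chars: 12191132  Wrong chars: 16183358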
The Ansj evaluation program is as follows:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.BaseAnalysis;
import org.ansj.splitWord.analysis.IndexAnalysis;
import org.ansj.splitWord.analysis.NlpAnalysis;
import org.ansj.splitWord.analysis.ToAnalysis;

/**
 * Segmentation quality evaluation for the Ansj segmenter
 * @author 杨尚川
 */
public class AnsjEvaluation {
    public static void main(String[] args) throws Exception{
        // Download location of the test file d:/test-text.txt and the standard segmentation file d:/standard-text.txt:
        // http://pan.baidu.com/s/1hqihzjY
        List<EvaluationResult> list = new ArrayList<>();

        // Segment the text
        float rate = seg("d:/test-text.txt", "d:/result-text-BaseAnalysis.txt", "BaseAnalysis");
        // Evaluate the segmentation result
        EvaluationResult result = evaluation("d:/result-text-BaseAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj BaseAnalysis (basic segmentation)");
        result.setSegSpeed(rate);
        list.add(result);

        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-ToAnalysis.txt", "ToAnalysis");
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-ToAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj ToAnalysis (precise segmentation)");
        result.setSegSpeed(rate);
        list.add(result);

        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-NlpAnalysis.txt", "NlpAnalysis");
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-NlpAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj NlpAnalysis (NLP segmentation)");
        result.setSegSpeed(rate);
        list.add(result);

        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-IndexAnalysis.txt", "IndexAnalysis");
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-IndexAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj IndexAnalysis (index-oriented segmentation)");
        result.setSegSpeed(rate);
        list.add(result);

        // Print the evaluation results, best line perfect rate first
        Collections.sort(list);
        System.out.println("");
        for(EvaluationResult r : list){
            System.out.println(r+"\n");
        }
    }

    private static float seg(final String input, final String output, final String type) throws Exception{
        float rate = 0;
        try(BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(input),"utf-8"));
            BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output),"utf-8"))){
            long size = Files.size(Paths.get(input));
            System.out.println("size:"+size);
            System.out.println("File size: "+(float)size/1024/1024+" MB");
            int textLength=0;
            int progress=0;
            long start = System.currentTimeMillis();
            String line = null;
            while((line = reader.readLine()) != null){
                if("".equals(line.trim())){
                    // Keep empty lines so that the output stays aligned with the standard file
                    writer.write("\n");
                    continue;
                }
                textLength += line.length();
                switch(type){
                    case "BaseAnalysis":
                        for(Term term : BaseAnalysis.parse(line)){
                            writer.write(term.getName()+" ");
                        }
                        break;
                    case "ToAnalysis":
                        for(Term term : ToAnalysis.parse(line)){
                            writer.write(term.getName()+" ");
                        }
                        break;
                    case "NlpAnalysis":
                        try{
                            for(Term term : NlpAnalysis.parse(line)){
                                writer.write(term.getName()+" ");
                            }
                        }catch(Exception e){
                            // Exceptions are swallowed here; a line left empty as a result
                            // is later skipped by evaluation() and not counted at all
                        }
                        break;
                    case "IndexAnalysis":
                        for(Term term : IndexAnalysis.parse(line)){
                            writer.write(term.getName()+" ");
                        }
                        break;
                }
                writer.write("\n");
                progress += line.length();
                if( progress > 500000){
                    progress = 0;
                    // size is in bytes and textLength in characters; 2.99 is presumably
                    // the approximate number of UTF-8 bytes per character in the test file
                    System.out.println("Segmentation progress: "+(int)(textLength*2.99/size*100)+"%");
                }
            }
            long cost = System.currentTimeMillis() - start;
            rate = textLength/(float)cost;
            System.out.println("Character count: "+textLength);
            System.out.println("Segmentation time: "+cost+" ms");
            System.out.println("Segmentation speed: "+rate+" chars/ms");
        }
        return rate;
    }

    /**
     * Segmentation quality evaluation
     * @param resultText path of the file with the actual segmentation results
     * @param standardText path of the file with the standard segmentation results
     * @return evaluation result
     */
    private static EvaluationResult evaluation(String resultText, String standardText) {
        int perfectLineCount=0;
        int wrongLineCount=0;
        int perfectCharCount=0;
        int wrongCharCount=0;
        try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));
            BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){
            String result;
            while( (result = resultReader.readLine()) != null ){
                result = result.trim();
                String standard = standardReader.readLine().trim();
                if(result.equals("")){
                    continue;
                }
                if(result.equals(standard)){
                    // The segmentation result is identical to the standard
                    perfectLineCount++;
                    perfectCharCount+=standard.replaceAll("\\s+", "").length();
                }else{
                    // The segmentation result differs from the standard
                    wrongLineCount++;
                    wrongCharCount+=standard.replaceAll("\\s+", "").length();
                }
            }
        } catch (IOException ex) {
            System.err.println("Segmentation quality evaluation failed: " + ex.getMessage());
        }
        int totalLineCount = perfectLineCount+wrongLineCount;
        int totalCharCount = perfectCharCount+wrongCharCount;
        EvaluationResult er = new EvaluationResult();
        er.setPerfectCharCount(perfectCharCount);
        er.setPerfectLineCount(perfectLineCount);
        er.setTotalCharCount(totalCharCount);
        er.setTotalLineCount(totalLineCount);
        er.setWrongCharCount(wrongCharCount);
        er.setWrongLineCount(wrongLineCount);
        return er;
    }

    /**
     * Segmentation result
     */
    private static class EvaluationResult implements Comparable{
        private String analyzer;
        private float segSpeed;
        private int totalLineCount;
        private int perfectLineCount;
        private int wrongLineCount;
        private int totalCharCount;
        private int perfectCharCount;
        private int wrongCharCount;

        public String getAnalyzer() { return analyzer; }
        public void setAnalyzer(String analyzer) { this.analyzer = analyzer; }
        public float getSegSpeed() { return segSpeed; }
        public void setSegSpeed(float segSpeed) { this.segSpeed = segSpeed; }

        public float getLinePerfectRate(){ return perfectLineCount/(float)totalLineCount*100; }
        public float getLineWrongRate(){ return wrongLineCount/(float)totalLineCount*100; }
        public float getCharPerfectRate(){ return perfectCharCount/(float)totalCharCount*100; }
        public float getCharWrongRate(){ return wrongCharCount/(float)totalCharCount*100; }

        public int getTotalLineCount() { return totalLineCount; }
        public void setTotalLineCount(int totalLineCount) { this.totalLineCount = totalLineCount; }
        public int getPerfectLineCount() { return perfectLineCount; }
        public void setPerfectLineCount(int perfectLineCount) { this.perfectLineCount = perfectLineCount; }
        public int getWrongLineCount() { return wrongLineCount; }
        public void setWrongLineCount(int wrongLineCount) { this.wrongLineCount = wrongLineCount; }
        public int getTotalCharCount() { return totalCharCount; }
        public void setTotalCharCount(int totalCharCount) { this.totalCharCount = totalCharCount; }
        public int getPerfectCharCount() { return perfectCharCount; }
        public void setPerfectCharCount(int perfectCharCount) { this.perfectCharCount = perfectCharCount; }
        public int getWrongCharCount() { return wrongCharCount; }
        public void setWrongCharCount(int wrongCharCount) { this.wrongCharCount = wrongCharCount; }

        @Override
        public String toString(){
            return analyzer+":"
                +"\n"
                +"Segmentation speed: "+segSpeed+" chars/ms"
                +"\n"
                +"Line perfect rate: "+getLinePerfectRate()+"%"
                +"  Line error rate: "+getLineWrongRate()+"%"
                +"  Total lines: "+totalLineCount
                +"  Perfect lines: "+perfectLineCount
                +"  Wrong lines: "+wrongLineCount
                +"\n"
                +"Char perfect rate: "+getCharPerfectRate()+"%"
                +"  Char error rate: "+getCharWrongRate()+"%"
                +"  Total chars: "+totalCharCount
                +"  Perfect chars: "+perfectCharCount
                +"  Wrong chars: "+wrongCharCount;
        }

        // Sorts in descending order of line perfect rate
        @Override
        public int compareTo(Object o) {
            EvaluationResult other = (EvaluationResult)o;
            if(other.getLinePerfectRate() - getLinePerfectRate() > 0){
                return 1;
            }
            if(other.getLinePerfectRate() - getLinePerfectRate() < 0){
                return -1;
            }
            return 0;
        }
    }
}
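The evaluation results for MMSeg4j 1.9.1 are as follows:

MMSeg4j ComplexSeg:
Segmentation speed: 794.24805 chars/ms
Line perfect rate: 38.817604%  Line error rate: 61.182396%  Total lines: 2533688  Perfect lines: 983517  Wrong lines: 1550171
Char perfect rate: 29.604435%  Char error rate: 70.39557%  Total chars: 28374428  Perfect chars: 8400089  Wrong chars: 19974339

MMSeg4j SimpleSeg:
Segmentation speed: 1026.1058 chars/ms
Line perfect rate: 37.570095%  Line error rate: 62.429905%  Total lines: 2533688  Perfect lines: 951909  Wrong lines: 1581779
Char perfect rate: 28.455273%  Char error rate: 71.54473%  Total chars: 28374428  Perfect chars: 8074021  Wrong chars: 20300407

MMSeg4j MaxWordSeg:
Segmentation speed: 813.0676 chars/ms
Line perfect rate: 34.27573%  Line error rate: 65.72427%  Total lines: 2533688  Perfect lines: 868440  Wrong lines: 1665248
Char perfect rate: 25.20896%  Char error rate: 74.79104%  Total chars: 28374428  Perfect chars: 7152898  Wrong chars: 21221530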
The MMSeg4j 1.9.1 evaluation program is as follows:

import com.chenlb.mmseg4j.ComplexSeg;
import com.chenlb.mmseg4j.Dictionary;
import com.chenlb.mmseg4j.MMSeg;
import com.chenlb.mmseg4j.MaxWordSeg;
import com.chenlb.mmseg4j.Seg;
import com.chenlb.mmseg4j.SimpleSeg;
import com.chenlb.mmseg4j.Word;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Segmentation quality evaluation for the MMSeg4j segmenter
 * @author 杨尚川
 */
public class MMSeg4jEvaluation {
    public static void main(String[] args) throws Exception{
        // Download location of the test file d:/test-text.txt and the standard segmentation file d:/standard-text.txt:
        // http://pan.baidu.com/s/1hqihzjY
        List<EvaluationResult> list = new ArrayList<>();
        Dictionary dic = Dictionary.getInstance();

        // Segment the text
        float rate = seg("d:/test-text.txt", "d:/result-text-ComplexSeg.txt", new ComplexSeg(dic));
        // Evaluate the segmentation result
        EvaluationResult result = evaluation("d:/result-text-ComplexSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("MMSeg4j ComplexSeg");
        result.setSegSpeed(rate);
        list.add(result);

        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-SimpleSeg.txt", new SimpleSeg(dic));
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-SimpleSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("MMSeg4j SimpleSeg");
        result.setSegSpeed(rate);
        list.add(result);

        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-MaxWordSeg.txt", new MaxWordSeg(dic));
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-MaxWordSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("MMSeg4j MaxWordSeg");
        result.setSegSpeed(rate);
        list.add(result);

        // Print the evaluation results, best line perfect rate first
        Collections.sort(list);
        System.out.println("");
        for(EvaluationResult r : list){
            System.out.println(r+"\n");
        }
    }

    private static float seg(final String input, final String output, final Seg seg) throws Exception{
        float rate = 0;
        try(BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(input),"utf-8"));
            BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output),"utf-8"))){
            long size = Files.size(Paths.get(input));
            System.out.println("size:"+size);
            System.out.println("File size: "+(float)size/1024/1024+" MB");
            int textLength=0;
            int progress=0;
            long start = System.currentTimeMillis();
            String line = null;
            while((line = reader.readLine()) != null){
                if("".equals(line.trim())){
                    // Keep empty lines so that the output stays aligned with the standard file
                    writer.write("\n");
                    continue;
                }
                textLength += line.length();
                writer.write(seg(line, seg));
                writer.write("\n");
                progress += line.length();
                if( progress > 500000){
                    progress = 0;
                    // size is in bytes and textLength in characters; 2.99 is presumably
                    // the approximate number of UTF-8 bytes per character in the test file
                    System.out.println("Segmentation progress: "+(int)(textLength*2.99/size*100)+"%");
                }
            }
            long cost = System.currentTimeMillis() - start;
            rate = textLength/(float)cost;
            System.out.println("Character count: "+textLength);
            System.out.println("Segmentation time: "+cost+" ms");
            System.out.println("Segmentation speed: "+rate+" chars/ms");
        }
        return rate;
    }

    // Segments one line and joins the words with single spaces
    private static String seg(String text, Seg seg) throws IOException {
        StringBuilder result = new StringBuilder();
        MMSeg mmSeg = new MMSeg(new StringReader(text), seg);
        Word word = null;
        while((word=mmSeg.next())!=null) {
            result.append(word.getString()).append(" ");
        }
        return result.toString().trim();
    }

    /**
     * Segmentation quality evaluation
     * @param resultText path of the file with the actual segmentation results
     * @param standardText path of the file with the standard segmentation results
     * @return evaluation result
     */
    private static EvaluationResult evaluation(String resultText, String standardText) {
        int perfectLineCount=0;
        int wrongLineCount=0;
        int perfectCharCount=0;
        int wrongCharCount=0;
        try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));
            BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){
            String result;
            while( (result = resultReader.readLine()) != null ){
                result = result.trim();
                String standard = standardReader.readLine().trim();
                if(result.equals("")){
                    continue;
                }
                if(result.equals(standard)){
                    // The segmentation result is identical to the standard
                    perfectLineCount++;
                    perfectCharCount+=standard.replaceAll("\\s+", "").length();
                }else{
                    // The segmentation result differs from the standard
                    wrongLineCount++;
                    wrongCharCount+=standard.replaceAll("\\s+", "").length();
                }
            }
        } catch (IOException ex) {
            System.err.println("Segmentation quality evaluation failed: " + ex.getMessage());
        }
        int totalLineCount = perfectLineCount+wrongLineCount;
        int totalCharCount = perfectCharCount+wrongCharCount;
        EvaluationResult er = new EvaluationResult();
        er.setPerfectCharCount(perfectCharCount);
        er.setPerfectLineCount(perfectLineCount);
        er.setTotalCharCount(totalCharCount);
        er.setTotalLineCount(totalLineCount);
        er.setWrongCharCount(wrongCharCount);
        er.setWrongLineCount(wrongLineCount);
        return er;
    }

    /**
     * Segmentation result
     */
    private static class EvaluationResult implements Comparable{
        private String analyzer;
        private float segSpeed;
        private int totalLineCount;
        private int perfectLineCount;
        private int wrongLineCount;
        private int totalCharCount;
        private int perfectCharCount;
        private int wrongCharCount;

        public String getAnalyzer() { return analyzer; }
        public void setAnalyzer(String analyzer) { this.analyzer = analyzer; }
        public float getSegSpeed() { return segSpeed; }
        public void setSegSpeed(float segSpeed) { this.segSpeed = segSpeed; }

        public float getLinePerfectRate(){ return perfectLineCount/(float)totalLineCount*100; }
        public float getLineWrongRate(){ return wrongLineCount/(float)totalLineCount*100; }
        public float getCharPerfectRate(){ return perfectCharCount/(float)totalCharCount*100; }
        public float getCharWrongRate(){ return wrongCharCount/(float)totalCharCount*100; }

        public int getTotalLineCount() { return totalLineCount; }
        public void setTotalLineCount(int totalLineCount) { this.totalLineCount = totalLineCount; }
        public int getPerfectLineCount() { return perfectLineCount; }
        public void setPerfectLineCount(int perfectLineCount) { this.perfectLineCount = perfectLineCount; }
        public int getWrongLineCount() { return wrongLineCount; }
        public void setWrongLineCount(int wrongLineCount) { this.wrongLineCount = wrongLineCount; }
        public int getTotalCharCount() { return totalCharCount; }
        public void setTotalCharCount(int totalCharCount) { this.totalCharCount = totalCharCount; }
        public int getPerfectCharCount() { return perfectCharCount; }
        public void setPerfectCharCount(int perfectCharCount) { this.perfectCharCount = perfectCharCount; }
        public int getWrongCharCount() { return wrongCharCount; }
        public void setWrongCharCount(int wrongCharCount) { this.wrongCharCount = wrongCharCount; }

        @Override
        public String toString(){
            return analyzer+":"
                +"\n"
                +"Segmentation speed: "+segSpeed+" chars/ms"
                +"\n"
                +"Line perfect rate: "+getLinePerfectRate()+"%"
                +"  Line error rate: "+getLineWrongRate()+"%"
                +"  Total lines: "+totalLineCount
                +"  Perfect lines: "+perfectLineCount
                +"  Wrong lines: "+wrongLineCount
                +"\n"
                +"Char perfect rate: "+getCharPerfectRate()+"%"
                +"  Char error rate: "+getCharWrongRate()+"%"
                +"  Total chars: "+totalCharCount
                +"  Perfect chars: "+perfectCharCount
                +"  Wrong chars: "+wrongCharCount;
        }

        // Sorts in descending order of line perfect rate
        @Override
        public int compareTo(Object o) {
            EvaluationResult other = (EvaluationResult)o;
            if(other.getLinePerfectRate() - getLinePerfectRate() > 0){
                return 1;
            }
            if(other.getLinePerfectRate() - getLinePerfectRate() < 0){
                return -1;
            }
            return 0;
        }
    }
}
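The evaluation results for ik-analyzer 2012_u6 are as follows:

IKAnalyzer (smart segmentation):
Segmentation speed: 178.3516 chars/ms
Line perfect rate: 37.55943%  Line error rate: 62.440567%  Total lines: 2533686  Perfect lines: 951638  Wrong lines: 1582048
Char perfect rate: 27.978464%  Char error rate: 72.02154%  Total chars: 28374416  Perfect chars: 7938726  Wrong chars: 20435690

IKAnalyzer (fine-grained segmentation):
Segmentation speed: 182.97859 chars/ms
Line perfect rate: 18.872742%  Line error rate: 81.12726%  Total lines: 2533686  Perfect lines: 478176  Wrong lines: 2055510
Char perfect rate: 10.936535%  Char error rate: 89.06347%  Total chars: 28374416  Perfect chars: 3103178  Wrong chars: 25271238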
The ik-analyzer 2012_u6 evaluation program is as follows:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

/**
 * Segmentation quality evaluation for the IKAnalyzer segmenter
 * @author 杨尚川
 */
public class IKAnalyzerEvaluation {
    public static void main(String[] args) throws Exception{
        // Download location of the test file d:/test-text.txt and the standard segmentation file d:/standard-text.txt:
        // http://pan.baidu.com/s/1hqihzjY
        List<EvaluationResult> list = new ArrayList<>();

        // Segment the text (the output file names are carried over from the MMSeg4j program)
        float rate = seg("d:/test-text.txt", "d:/result-text-ComplexSeg.txt", true);
        // Evaluate the segmentation result
        EvaluationResult result = evaluation("d:/result-text-ComplexSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("IKAnalyzer (smart segmentation)");
        result.setSegSpeed(rate);
        list.add(result);

        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-SimpleSeg.txt", false);
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-SimpleSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("IKAnalyzer (fine-grained segmentation)");
        result.setSegSpeed(rate);
        list.add(result);

        // Print the evaluation results, best line perfect rate first
        Collections.sort(list);
        System.out.println("");
        for(EvaluationResult r : list){
            System.out.println(r+"\n");
        }
    }

    private static float seg(final String input, final String output, final boolean useSmart) throws Exception{
        float rate = 0;
        try(BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(input),"utf-8"));
            BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output),"utf-8"))){
            long size = Files.size(Paths.get(input));
            System.out.println("size:"+size);
            System.out.println("File size: "+(float)size/1024/1024+" MB");
            int textLength=0;
            int progress=0;
            long start = System.currentTimeMillis();
            String line = null;
            while((line = reader.readLine()) != null){
                if("".equals(line.trim())){
                    // Keep empty lines so that the output stays aligned with the standard file
                    writer.write("\n");
                    continue;
                }
                textLength += line.length();
                writer.write(seg(line, useSmart));
                writer.write("\n");
                progress += line.length();
                if( progress > 500000){
                    progress = 0;
                    // size is in bytes and textLength in characters; 2.99 is presumably
                    // the approximate number of UTF-8 bytes per character in the test file
                    System.out.println("Segmentation progress: "+(int)(textLength*2.99/size*100)+"%");
                }
            }
            long cost = System.currentTimeMillis() - start;
            rate = textLength/(float)cost;
            System.out.println("Character count: "+textLength);
            System.out.println("Segmentation time: "+cost+" ms");
            System.out.println("Segmentation speed: "+rate+" chars/ms");
        }
        return rate;
    }

    // Segments one line and joins the words with single spaces
    private static String seg(String text, boolean useSmart) throws IOException {
        StringBuilder result = new StringBuilder();
        IKSegmenter ik = new IKSegmenter(new StringReader(text), useSmart);
        Lexeme word = null;
        while((word=ik.next())!=null) {
            result.append(word.getLexemeText()).append(" ");
        }
        return result.toString().trim();
    }

    /**
     * Segmentation quality evaluation
     * @param resultText path of the file with the actual segmentation results
     * @param standardText path of the file with the standard segmentation results
     * @return evaluation result
     */
    private static EvaluationResult evaluation(String resultText, String standardText) {
        int perfectLineCount=0;
        int wrongLineCount=0;
        int perfectCharCount=0;
        int wrongCharCount=0;
        try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));
            BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){
            String result;
            while( (result = resultReader.readLine()) != null ){
                result = result.trim();
                String standard = standardReader.readLine().trim();
                if(result.equals("")){
                    continue;
                }
                if(result.equals(standard)){
                    // The segmentation result is identical to the standard
                    perfectLineCount++;
                    perfectCharCount+=standard.replaceAll("\\s+", "").length();
                }else{
                    // The segmentation result differs from the standard
                    wrongLineCount++;
                    wrongCharCount+=standard.replaceAll("\\s+", "").length();
                }
            }
        } catch (IOException ex) {
            System.err.println("Segmentation quality evaluation failed: " + ex.getMessage());
        }
        int totalLineCount = perfectLineCount+wrongLineCount;
        int totalCharCount = perfectCharCount+wrongCharCount;
        EvaluationResult er = new EvaluationResult();
        er.setPerfectCharCount(perfectCharCount);
        er.setPerfectLineCount(perfectLineCount);
        er.setTotalCharCount(totalCharCount);
        er.setTotalLineCount(totalLineCount);
        er.setWrongCharCount(wrongCharCount);
        er.setWrongLineCount(wrongLineCount);
        return er;
    }

    /**
     * Segmentation result
     */
    private static class EvaluationResult implements Comparable{
        private String analyzer;
        private float segSpeed;
        private int totalLineCount;
        private int perfectLineCount;
        private int wrongLineCount;
        private int totalCharCount;
        private int perfectCharCount;
        private int wrongCharCount;

        public String getAnalyzer() { return analyzer; }
        public void setAnalyzer(String analyzer) { this.analyzer = analyzer; }
        public float getSegSpeed() { return segSpeed; }
        public void setSegSpeed(float segSpeed) { this.segSpeed = segSpeed; }

        public float getLinePerfectRate(){ return perfectLineCount/(float)totalLineCount*100; }
        public float getLineWrongRate(){ return wrongLineCount/(float)totalLineCount*100; }
        public float getCharPerfectRate(){ return perfectCharCount/(float)totalCharCount*100; }
        public float getCharWrongRate(){ return wrongCharCount/(float)totalCharCount*100; }

        public int getTotalLineCount() { return totalLineCount; }
        public void setTotalLineCount(int totalLineCount) { this.totalLineCount = totalLineCount; }
        public int getPerfectLineCount() { return perfectLineCount; }
        public void setPerfectLineCount(int perfectLineCount) { this.perfectLineCount = perfectLineCount; }
        public int getWrongLineCount() { return wrongLineCount; }
        public void setWrongLineCount(int wrongLineCount) { this.wrongLineCount = wrongLineCount; }
        public int getTotalCharCount() { return totalCharCount; }
        public void setTotalCharCount(int totalCharCount) { this.totalCharCount = totalCharCount; }
        public int getPerfectCharCount() { return perfectCharCount; }
        public void setPerfectCharCount(int perfectCharCount) { this.perfectCharCount = perfectCharCount; }
        public int getWrongCharCount() { return wrongCharCount; }
        public void setWrongCharCount(int wrongCharCount) { this.wrongCharCount = wrongCharCount; }

        @Override
        public String toString(){
            return analyzer+":"
                +"\n"
                +"Segmentation speed: "+segSpeed+" chars/ms"
                +"\n"
                +"Line perfect rate: "+getLinePerfectRate()+"%"
                +"  Line error rate: "+getLineWrongRate()+"%"
                +"  Total lines: "+totalLineCount
                +"  Perfect lines: "+perfectLineCount
                +"  Wrong lines: "+wrongLineCount
                +"\n"
                +"Char perfect rate: "+getCharPerfectRate()+"%"
                +"  Char error rate: "+getCharWrongRate()+"%"
                +"  Total chars: "+totalCharCount
                +"  Perfect chars: "+perfectCharCount
                +"  Wrong chars: "+wrongCharCount;
        }

        // Sorts in descending order of line perfect rate
        @Override
        public int compareTo(Object o) {
            EvaluationResult other = (EvaluationResult)o;
            if(other.getLinePerfectRate() - getLinePerfectRate() > 0){
                return 1;
            }
            if(other.getLinePerfectRate() - getLinePerfectRate() < 0){
                return -1;
            }
            return 0;
        }
    }
}
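The evaluation programs for ansj, mmseg4j, and ik-analyzer can be downloaded from the attachments. To evaluate word, simply run the evaluation.bat script in the project root directory.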