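LingPipe is a natural language processing toolkit developed by Alias-i. As of this writing (2008-04-21) the latest version is 3.5 ([url] /04/28856.html[/url]). It covers a wide range of NLP tasks, and above all its documentation is extremely detailed: each model even lists the reference papers behind it. That makes it both easy to use and well suited to studying the models themselves.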
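SIGHAN06 includes a paper, the evaluation report submitted by Bob Carpenter of Alias-i, "Character Language Models for Chinese Word Segmentation and Named Entity Recognition". That is where I came across the LingPipe NLP Toolkit they developed, an open-source Java toolkit for natural language processing. It is free to download, open source, and supports Chinese. Its documentation goes beyond describing the code structure: it also explains the ideas behind the algorithms and provides related resources such as test data sets and papers. A good toolkit.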
Its features include topic classification, named entity recognition, part-of-speech tagging, sentence detection, query spell checking, interesting phrase detection, clustering, character language modeling, MEDLINE download/parsing/indexing, database text mining, Chinese word segmentation, sentiment analysis, language identification, and more.
Feature Overview
LingPipe’s information extraction and data mining tools:
* track mentions of entities (e.g. people or proteins);
* link entity mentions to database entries;
* uncover relations between entities and actions;
* classify text passages by language, character encoding, genre, topic, or sentiment;
* correct spelling with respect to a text collection;
* cluster documents by implicit topic and discover significant trends over time; and
* provide part-of-speech tagging and phrase chunking.
    import com.aliasi.matrix.SparseFloatVector;
    import com.aliasi.matrix.Vector;
    import com.aliasi.symbol.MapSymbolTable;
    import com.aliasi.symbol.SymbolTable;
    import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
    import com.aliasi.tokenizer.TokenizerFactory;
    import com.aliasi.tokenizer.TokenFeatureExtractor;

    import java.util.HashMap;
    import java.util.Map;

    public class ExtractFeatures {

        // Turns each text into a sparse vector of token counts,
        // registering every new token in the shared symbol table.
        public static Vector[] featureVectors(String[] texts,
                                              SymbolTable symbolTable) {
            Vector[] vectors = new Vector[texts.length];
            TokenizerFactory tokenizerFactory = new IndoEuropeanTokenizerFactory();
            TokenFeatureExtractor featureExtractor
                = new TokenFeatureExtractor(tokenizerFactory);
            for (int i = 0; i < texts.length; ++i) {
                // features() is assumed to return Map<String,? extends Number>
                // (LingPipe 3.x API).
                Map<String,? extends Number> featureMap
                    = featureExtractor.features(texts[i]);
                vectors[i] = toVectorAddSymbols(featureMap, symbolTable,
                                                Integer.MAX_VALUE);
            }
            return vectors;
        }

        // Replaces each feature name with its integer ID from the symbol table
        // (adding the name if it is new) and builds the sparse vector.
        public static SparseFloatVector toVectorAddSymbols(
                Map<String,? extends Number> featureVector,
                SymbolTable table,
                int numDimensions) {
            int size = (featureVector.size() * 3) / 2;
            Map<Integer,Number> vectorMap = new HashMap<Integer,Number>(size);
            for (Map.Entry<String,? extends Number> entry
                     : featureVector.entrySet()) {
                String feature = entry.getKey();
                Number val = entry.getValue();
                int id = table.getOrAddSymbol(feature);
                vectorMap.put(Integer.valueOf(id), val);
            }
            return new SparseFloatVector(vectorMap, numDimensions);
        }

        public static void main(String[] args) {
            // Sample input; overrides any command-line arguments.
            args = new String[] { "this is a book", "go to school" };
            SymbolTable symbolTable = new MapSymbolTable();
            Vector[] vectors = featureVectors(args, symbolTable);
            System.out.println("VECTORS");
            for (int i = 0; i < vectors.length; ++i)
                System.out.println(i + ") " + vectors[i]);
            System.out.println(" SYMBOL TABLE");
            System.out.println(symbolTable);
        }
    }
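For readers without the LingPipe jar at hand, the idea behind the listing above (map each token to an integer ID, then store counts keyed by ID) can be sketched in plain Java. This is only an illustration under stated assumptions: `SimpleSymbolTable` and `toSparseVector` are hypothetical names invented here, tokenization is a naive whitespace split rather than LingPipe's IndoEuropeanTokenizerFactory, and nothing below is LingPipe's actual implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of symbol-table-backed sparse vectors; not LingPipe code. */
final class SymbolTableSketch {

    /** Hypothetical minimal symbol table: IDs are 0, 1, 2, ... in first-seen order. */
    static final class SimpleSymbolTable {
        private final Map<String, Integer> ids = new HashMap<>();
        private final List<String> symbols = new ArrayList<>();

        /** Returns the ID of the symbol, assigning the next free ID if it is new. */
        int getOrAddSymbol(String symbol) {
            Integer id = ids.get(symbol);
            if (id == null) {
                id = symbols.size();
                ids.put(symbol, id);
                symbols.add(symbol);
            }
            return id;
        }

        String idToSymbol(int id) {
            return symbols.get(id);
        }

        int numSymbols() {
            return symbols.size();
        }
    }

    /** Counts whitespace-separated tokens and keys the counts by symbol ID. */
    static Map<Integer, Integer> toSparseVector(String text, SimpleSymbolTable table) {
        Map<Integer, Integer> vector = new TreeMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields a single empty token
            }
            vector.merge(table.getOrAddSymbol(token), 1, Integer::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        SimpleSymbolTable table = new SimpleSymbolTable();
        String[] texts = { "this is a book", "go to school" };
        for (int i = 0; i < texts.length; ++i) {
            System.out.println(i + ") " + toSparseVector(texts[i], table));
        }
        System.out.println("symbols: " + table.numSymbols());
    }
}
```

Because the table is shared across texts, a token seen in two documents gets the same dimension in both vectors, which is what makes the vectors comparable.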
How to compute TF-IDF with LingPipe
By jeffye | May 25, 2008
Hopefully the following Java code helps:
    import com.aliasi.spell.TfIdfDistance;
    import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
    import com.aliasi.tokenizer.TokenizerFactory;

    public class TfIdfDistanceDemo {

        public static void main(String[] args) {
            TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.FACTORY;
            TfIdfDistance tfIdf = new TfIdfDistance(tokenizerFactory);

            // Each command-line argument is treated as one document
            // when collecting document frequencies.
            for (String s : args)
                tfIdf.trainIdf(s);

            // Print document frequency and IDF for every term seen.
            System.out.printf("\n %18s %8s %8s\n", "Term", "Doc Freq", "IDF");
            for (String term : tfIdf.termSet())
                System.out.printf(" %18s %8d %8.2f\n",
                                  term, tfIdf.docFrequency(term), tfIdf.idf(term));

            // Compare every pair of input strings.
            for (String s1 : args) {
                for (String s2 : args) {
                    System.out.println("\nString1=" + s1);
                    System.out.println("String2=" + s2);
                    System.out.printf("distance=%4.2f proximity=%4.2f\n",
                                      tfIdf.distance(s1, s2),
                                      tfIdf.proximity(s1, s2));
                }
            }
        }
    }
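To make the numbers printed by the demo easier to interpret, here is a plain-Java sketch of one common TF-IDF formulation: IDF as log(N / docFrequency), term weight as sqrt(count) times IDF, proximity as the cosine of the two weight vectors, and distance as 1 minus proximity. These formulas are an assumption chosen for illustration; LingPipe's exact weighting is defined in the TfIdfDistance Javadoc and may differ. The class name `TfIdfSketch` and its methods are hypothetical, and tokenization is a lowercase whitespace split.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Plain-Java TF-IDF cosine sketch; an illustration, not LingPipe's implementation. */
final class TfIdfSketch {
    private final Map<String, Integer> docFrequency = new HashMap<>();
    private int numDocs = 0;

    /** Naive tokenizer (assumption): lowercase, split on whitespace. */
    static String[] tokenize(String text) {
        String trimmed = text.trim().toLowerCase();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Adds one document to the document-frequency statistics. */
    void train(String doc) {
        ++numDocs;
        Set<String> distinct = new HashSet<>(Arrays.asList(tokenize(doc)));
        for (String term : distinct) {
            docFrequency.merge(term, 1, Integer::sum);
        }
    }

    /** idf = log(N / df); terms never seen in training get weight 0. */
    double idf(String term) {
        Integer df = docFrequency.get(term);
        return df == null ? 0.0 : Math.log((double) numDocs / df);
    }

    /** Term weights for one string: sqrt(count) * idf. */
    Map<String, Double> weights(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokenize(text)) {
            counts.merge(token, 1, Integer::sum);
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            result.put(e.getKey(), Math.sqrt(e.getValue()) * idf(e.getKey()));
        }
        return result;
    }

    /** Cosine of the two weight vectors; 0 if either vector has zero length. */
    double proximity(String s1, String s2) {
        Map<String, Double> w1 = weights(s1);
        Map<String, Double> w2 = weights(s2);
        double dot = 0.0;
        double norm1 = 0.0;
        double norm2 = 0.0;
        for (Map.Entry<String, Double> e : w1.entrySet()) {
            norm1 += e.getValue() * e.getValue();
            Double other = w2.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : w2.values()) {
            norm2 += v * v;
        }
        if (norm1 == 0.0 || norm2 == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(norm1 * norm2);
    }

    double distance(String s1, String s2) {
        return 1.0 - proximity(s1, s2);
    }

    public static void main(String[] args) {
        String[] docs = { "this is a book", "this is a school", "go to school" };
        TfIdfSketch model = new TfIdfSketch();
        for (String d : docs) {
            model.train(d);
        }
        for (String s1 : docs) {
            for (String s2 : docs) {
                System.out.printf("%-18s | %-18s | distance=%4.2f proximity=%4.2f%n",
                                  s1, s2, model.distance(s1, s2), model.proximity(s1, s2));
            }
        }
    }
}
```

One consequence worth noticing in this formulation: a term that occurs in every training document has IDF log(1) = 0, so it contributes nothing to proximity, and two strings that share only such terms come out at distance 1.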