

Support Vector Machine

  • SVMlight
An implementation of Vapnik's Support Vector Machine
  • LIBSVM
A Library for Support Vector Machines

Decision Tree

  • C4.5
The "classic" decision-tree tool, developed by J. R. Quinlan. (Tutorial)

Maximum Entropy

  • Yet Another Small MaxEnt Toolkit (YASMET)

Conditional Random Field

  • CRF++
A simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data



  • OpenNLP
An organizational center for open source projects related to natural language processing
  • CMU Statistical Language Modeling Toolkit
A suite of UNIX software tools to facilitate the construction and testing of statistical language models
  • The Dragon ToolKit
A Java-based development package for academic use in information retrieval (IR) and text mining. Includes many NLP tools.
  • LingPipe
A suite of Java libraries for the linguistic analysis of human language, including tools to
      • track mentions of entities (e.g. people or proteins);
      • link entity mentions to database entries;
      • uncover relations between entities and actions;
      • classify text passages by language, character encoding, genre, topic, or sentiment;
      • correct spelling with respect to a text collection;
      • cluster documents by implicit topic and discover significant trends over time; and
      • provide part-of-speech tagging and phrase chunking.
  • Natural Language Toolkit
Open source Python modules, linguistic data and documentation for research and development in natural language processing and text analytics, with distributions for Windows, Mac OSX and Linux.
  • Antelope
Advanced Natural Language Object-oriented Processing Environment. Includes a range of tools (notably a C# version of the Stanford Parser).


  • Stanford Chinese Word Segmenter
A Java implementation of a CRF-based Chinese Word Segmenter


  • Brill tagger
An error-driven, transformation-based tagger implemented by Eric Brill
  • Stanford POS Tagger
A Java implementation of the log-linear part-of-speech taggers described by Kristina Toutanova et al.
  • MBT: Memory-based Tagger
  • TreeTagger
A decision tree based tagger from the University of Stuttgart.
  • SVMTool, a POS tagger based on SVMs
  • QTAG Part of speech tagger
An HMM-based Java POS tagger from Birmingham U.


  • Stanford Named Entity Recognizer
A Java implementation of a Conditional Random Field sequence model, together with well-engineered features for Named Entity Recognition
  • LingPipe
Tools include statistical named-entity recognition, a heuristic sentence boundary detector, and a heuristic within-document coreference resolution engine. Java. GPL. By Bob Carpenter, Breck Baldwin and co.
  • YamCha
SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)


  • Porter Stemming
An algorithm by Martin Porter for removing the commoner morphological and inflexional endings from English words
  • Snowball
A small string processing language designed for creating stemming algorithms for use in Information Retrieval.
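
To give a feel for the kind of ending removal described above, here is a minimal Python sketch of step 1a only of Porter's published algorithm (the plural rules). The function name `porter_step_1a` is made up for this note, the input is assumed to be a lowercase word, and the full stemmer has several further steps that are not shown.

```python
def porter_step_1a(word):
    """Apply step 1a of the Porter algorithm (plural endings) to a lowercase word."""
    if word.endswith("sses"):
        return word[:-2]      # sses -> ss
    if word.endswith("ies"):
        return word[:-2]      # ies -> i
    if word.endswith("ss"):
        return word           # ss -> ss (left unchanged)
    if word.endswith("s"):
        return word[:-1]      # s -> (dropped)
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
# caresses -> caress
# ponies -> poni
# caress -> caress
# cats -> cat
```

The rules are tried longest suffix first, so "caress" is protected by the "ss" rule before the bare "s" rule can fire.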


  • Stanford Parser
Java implementations of probabilistic natural language parsers, both highly optimized PCFG and dependency parsers, and a lexicalized PCFG parser.
  • Berkeley Parser



  • Rouge  Configuring Rouge on Windows



  • OpenSSL
Includes many cryptographic algorithms, such as RSA, DES, MD5, and SHA.  Win32 installer


  • zlib
A Massively Spiffy Yet Delicately Unobtrusive Compression Library
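
For a quick look at what the library does without writing C, Python's standard `zlib` module wraps it. The following is only a round-trip sketch; the sample text and the compression level 9 are arbitrary choices made for this note.

```python
import zlib

# Repetitive sample data compresses well; the content itself is arbitrary.
data = b"natural language processing " * 100

compressed = zlib.compress(data, 9)     # 9 = best compression, slowest
restored = zlib.decompress(compressed)

print(len(data), "bytes ->", len(compressed), "bytes")
print("round trip ok:", restored == data)
```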


  • Apache Logging Services
Creates and maintains open-source software for logging application behavior, released at no charge to the public, including
      • log4j for Java,
      • log4cxx for C++, and
      • log4net for the MS .NET framework.
Note: the official log4cxx release has memory-leak problems.


  • ICU
A mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications


  • Xerces
A validating XML parser, available in C++ and Java editions


  • AC in C#: Aho-Corasick string matching in C#
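
For readers who want to see the idea behind Aho-Corasick matching before opening the C# code, here is a small Python sketch. It is not a port of the linked implementation; the names `build_automaton` and `find_all` are invented for this note, and patterns are assumed to be non-empty strings.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie, failure links and output lists for the patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns recognised on reaching each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    # Breadth-first pass: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_index, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches


print(find_all("ushers", ["he", "she", "his", "hers"]))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```

The text is scanned once regardless of how many patterns there are, which is the reason to prefer this over searching for each keyword separately.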

HTML Parser

  • Html Agility Pack, an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
  • Majestic-12, an open source high-performance .NET C# module that was created to parse HTML for links, indexing and other purposes. Fast, but does not build a DOM tree.


  • An annotated list of resources by Stanford NLP Group
  • KDnuggets: lists software and other resources related to KDD
