Open-Source Machine Learning Libraries

Original post: "Open-Source Machine Learning Libraries for C++". Author: webbery508

Part 1. Open-Source Machine Learning Libraries for C++

1) mlpack is a C++ machine learning library.

2) PLearn is a C++ library aimed at research and development in the field of statistical machine learning algorithms. Its distinguishing feature is that complex nonlinear functions to be optimized can be expressed easily and directly in C++.


3) Waffles: C++ machine learning.
4) Torch7 provides a Matlab-like environment for state-of-the-art machine learning algorithms. It is easy to use and provides a very efficient implementation.

5) SHARK is a modular C++ library for the design and optimization of adaptive systems. It provides methods for linear and nonlinear optimization, in particular evolutionary and gradient-based algorithms, kernel-based learning algorithms and neural networks, and various other machine learning techniques. SHARK serves as a toolbox to support real-world applications as well as research in different domains of computational intelligence and machine learning. The sources are compatible with the following platforms: Windows, Solaris, MacOS X, and Linux.

6) Dlib-ml is an open source library, targeted at both engineers and research scientists, which aims to provide for C++ an environment for developing machine learning software as rich as those available for Python, R, and Matlab.

7) Eblearn is an object-oriented C++ library that implements various machine learning models, including energy-based learning and gradient-based learning for machines composed of multiple heterogeneous modules. In particular, the library provides a complete set of tools for building, training, and running convolutional networks.

8) Machine Learning Open Source Software (Journal of Machine Learning Research): http://jmlr.csail.mit.edu/mloss/

9) Search Google for: c++ site:jmlr.csail.mit.edu filetype:pdf, Machine Learning Toolkit

10) SIGMA: Large-Scale and Parallel Machine-Learning Tool Kit

11) http://sourceforge.net/directory/science-engineering/ai/machinelearning/os:windows/freshness:recently-updated/


-------------   2012.9.12   ---------
12) ELF: ensemble learning framework. Features: C++, supervised learning, uses Intel's IPP and MKL; training speed and accuracy are the main goals. http://elf-project.sourceforge.net/
------------- 2012.11.03  ---------
13) http://mloss.org/software/ Machine learning open source software; essentially an index site.
14) http://drwn.anu.edu.au/index.html
Source: http://blog.csdn.net/genliu777/article/details/7396760
 
 
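Part 2. Open-Source Tools for Machine Learning
The vast majority of the tools below are open source, released under licenses such as the GPL and Apache. Read each tool's license statement carefully before use.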

I. Information Retrieval
1. Lemur/Indri
The Lemur Toolkit for Language Modeling and Information Retrieval
http://www.lemurproject.org/
Indri:
Lemur's latest search engine

2. Lucene/Nutch
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
Lucene is a top-level Apache open-source project, licensed under Apache 2.0 and written entirely in Java, with ports to Perl, C/C++, .NET, and other languages.
http://lucene.apache.org/
http://www.nutch.org/

3. WGet
GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive command-line tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.
http://www.gnu.org/software/wget/wget.html

II. Natural Language Processing
1. EGYPT: A Statistical Machine Translation Toolkit
http://www.clsp.jhu.edu/ws99/projects/mt/
Includes four tools, GIZA among them.

2. GIZA++ (Statistical Machine Translation)
http://www.fjoch.com/GIZA++.html
GIZA++ is an extension of the program GIZA (part of the SMT toolkit EGYPT) which was developed by the Statistical Machine Translation team during the summer workshop in 1999 at the Center for Language and Speech Processing at Johns-Hopkins University (CLSP/JHU). GIZA++ includes a lot of additional features. The extensions of GIZA++ were designed and written by Franz Josef Och.
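Franz Josef Och has worked at Aachen University in Germany, at ISI (the Information Sciences Institute of the University of Southern California), and at Google. A Windows port of GIZA++ is now available, and it has good support for IBM Models 1-5.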

3. PHARAOH (Statistical Machine Translation)
http://www.isi.edu/licensed-sw/pharaoh/
a beam search decoder for phrase-based statistical machine translation models

4. OpenNLP:
http://opennlp.sourceforge.net/
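Includes more than 20 tools, Maxent among them.

btw: These SMT tools all seem to like Egypt-related names, such as GIZA, PHARAOH, and Cairo. Och developed GIZA++ while at ISI, and PHARAOH was also developed by someone from ISI, Philipp Koehn. The connections really are tangled.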

5. MINIPAR by Dekang Lin (Univ. of Alberta, Canada)
MINIPAR is a broad-coverage parser for the English language. An evaluation with the SUSANNE corpus shows that MINIPAR achieves about 88% precision and 80% recall with respect to dependency relationships. MINIPAR is very efficient: on a Pentium II 300 with 128MB memory, it parses about 300 words per second.
The binary can be downloaded for free after filling out a form.
http://www.cs.ualberta.ca/~lindek/minipar.htm

6. WordNet
http://wordnet.princeton.edu/
WordNet is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets.
WordNet was developed by the Cognitive Science Laboratory at Princeton University under the direction of Professor George A. Miller (Principal Investigator).
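The latest WordNet version at the time of writing is 2.1 (for Windows and Unix-like OSes), available as binaries, source, and documentation.
The online version of WordNet is at http://wordnet.princeton.edu/perl/webwn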

7. HowNet
http://www.keenage.com/
HowNet is an on-line common-sense knowledge base unveiling inter-conceptual relations and inter-attribute relations of concepts as connoting in lexicons of the Chinese and their English equivalents.
Developed by Zhendong Dong and Qiang Dong of CAS; it is something along the lines of WordNet.

8. Statistical Language Modeling Toolkit
http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html
The CMU-Cambridge Statistical Language Modeling toolkit is a suite of UNIX software tools to facilitate the construction and testing of statistical language models.

9. SRI Language Modeling Toolkit
www.speech.sri.com/projects/srilm/
SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation. It has been under development in the SRI Speech Technology and Research Laboratory since 1995.

10. ReWrite Decoder
http://www.isi.edu/licensed-sw/rewrite-decoder/
The ISI ReWrite Decoder Release 1.0.0a by Daniel Marcu and Ulrich Germann. It is a program that translates from one natural language into another using statistical machine translation.

11. GATE (General Architecture for Text Engineering)
http://gate.ac.uk/
A Java Library for Text Engineering

III. Machine Learning
1. YASMET: Yet Another Small MaxEnt Toolkit (Statistical Machine Learning)
http://www.fjoch.com/YASMET.html
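Written by Franz Josef Och. In addition, the OpenNLP project contains a Java MaxEnt tool that estimates parameters with GIS; it was ported to C++ by Zhang Le of Northeastern University (currently studying in the UK).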

2. LibSVM
Developed by Chih-Jen Lin of National Taiwan University (NTU); available in C++, Java, Perl, C#, and other languages.
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
LIBSVM is integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR), and distribution estimation (one-class SVM). It supports multi-class classification.

3. SVM Light
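Developed by Thorsten Joachims of Cornell while he was at the University of Dortmund; it is the best-known SVM package after LibSVM. Open source, written in C; its uses include ranking problems.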
http://svmlight.joachims.org/

4. CLUTO
http://www-users.cs.umn.edu/~karypis/cluto/
a software package for clustering low- and high-dimensional datasets
This package is distributed only as an executable or a library; the source code is not available for download.

5. CRF++
http://chasen.org/~taku/software/CRF++/
Yet Another CRF toolkit for segmenting/labelling sequential data
CRFs (Conditional Random Fields) evolved from HMMs/MEMMs and are widely used in IE, IR, and NLP.

6. SVM Struct
http://www.cs.cornell.edu/People/tj/svm_light/svm_struct.html
Like SVM Light, developed by Thorsten Joachims of Cornell.
SVMstruct is a Support Vector Machine (SVM) algorithm for predicting multivariate outputs. It performs supervised learning by approximating a mapping
h: X --> Y
using labeled training examples (x1,y1), ..., (xn,yn).
Unlike regular SVMs, however, which consider only univariate predictions like in classification and regression, SVMstruct can predict complex objects y like trees, sequences, or sets. Examples of problems with complex outputs are natural language parsing, sequence alignment in protein homology detection, and Markov models for part-of-speech tagging.
SVMstruct can be thought of as an API for implementing different kinds of complex prediction algorithms. Currently, we have implemented the following learning tasks:
SVMmulticlass: Multi-class classification. Learns to predict one of k mutually exclusive classes. This is probably the simplest possible instance of SVMstruct and serves as a tutorial example of how to use the programming interface.
SVMcfg: Learns a weighted context free grammar from examples. Training examples (e.g. for natural language parsing) specify the sentence along with the correct parse tree. The goal is to predict the parse tree of new sentences.
SVMalign: Learning to align sequences. Given examples of how sequence pairs align, the goal is to learn the substitution matrix as well as the insertion and deletion costs of operations so that one can predict alignments of new sequences.
SVMhmm: Learns a Markov model from examples. Training examples (e.g. for part-of-speech tagging) specify the sequence of words along with the correct assignment of tags (i.e. states). The goal is to predict the tag sequences for new sentences.

 

IV. Misc:
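1. Notepad++: an open-source editor that recognizes the keywords of dozens of languages such as C#, Perl, and CSS; its features rival those of recent versions of UltraEdit and Visual Studio .NET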
http://notepad-plus.sourceforge.net

2. WinMerge: compares text content and finds the differences between two versions of a program
winmerge.sourceforge.net/

3. OpenPerlIDE: an open-source Perl editor with built-in compilation and line-by-line debugging
open-perl-ide.sourceforge.net/
ps: As editors go, the best I have seen is still VS .NET: a +/- sign in front of each function for expand/collapse, block copy/cut/paste, ctrl+c/ctrl+x/ctrl+v to grab a whole line at once, ctrl+k+c/ctrl+k+u to comment/uncomment multiple lines, and more...... Visual Studio .NET is really kool:D

4. Berkeley DB
http://www.sleepycat.com/
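Berkeley DB is not a relational database. It is described as an embedded database: in client/server terms, its client and server share a single address space. Because the database originally grew out of the file system, it is closer to a dictionary-style database of key-value pairs. The database file can be serialized to disk, so it is not limited by the size of memory. BDB has a sub-edition, Berkeley DB XML, which is an XML database (does it store its data as XML files?). BDB has been embedded in products by companies including Microsoft, Google, HP, Ford, and Motorola.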
Berkeley DB (libdb) is a programmatic toolkit that provides embedded database support for both traditional and client/server applications. It includes b+tree, queue, extended linear hashing, fixed, and variable-length record access methods, transactions, locking, logging, shared memory caching, database recovery, and replication for highly available systems. DB supports C, C++, Java, PHP, and Perl APIs.
It turns out that at a basic level Berkeley DB is just a very high performance, reliable way of persisting dictionary style data structures - anything where a piece of data can be stored and looked up using a unique key. The key and the value can each be up to 4 gigabytes in length and can consist of anything that can be crammed in to a string of bytes, so what you do with it is completely up to you. The only operations available are "store this value under this key", "check if this key exists" and "retrieve the value for this key" so conceptually it's pretty simple - the complicated stuff all happens under the hood.
Case studies:
Ask Jeeves uses Berkeley DB to provide an easy-to-use tool for searching the Internet.
Microsoft uses Berkeley DB for the Groove collaboration software
AOL uses Berkeley DB for search tool meta-data and other services.
Hitachi uses Berkeley DB in its directory services server product.
Ford uses Berkeley DB to authenticate partners who access Ford's Web applications.
Hewlett Packard uses Berkeley DB in several products, including storage, security and wireless software.
Google uses Berkeley DB High Availability for Google Accounts.
Motorola uses Berkeley DB to track mobile units in its wireless radio network products.
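
The dictionary-style operations described for Berkeley DB above (store a value under a key, check whether a key exists, retrieve the value for a key) can be pictured with a small in-memory stand-in. This is only a sketch built on std::map from the C++ standard library; it is not the Berkeley DB API. The class name KeyValueStore, its methods, and the helper round_trip_demo are hypothetical, and the sketch has none of Berkeley DB's persistence, transactions, or locking.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical in-memory stand-in for a dictionary-style store.
// This is NOT the Berkeley DB API; it only mirrors the three basic operations.
class KeyValueStore {
public:
    // "Store this value under this key" (overwrites any previous value).
    void put(const std::string& key, const std::string& value) {
        data_[key] = value;
    }

    // "Check if this key exists".
    bool exists(const std::string& key) const {
        return data_.find(key) != data_.end();
    }

    // "Retrieve the value for this key"; returns false if the key is absent.
    bool get(const std::string& key, std::string& value_out) const {
        std::map<std::string, std::string>::const_iterator it = data_.find(key);
        if (it == data_.end()) return false;
        value_out = it->second;
        return true;
    }

private:
    std::map<std::string, std::string> data_;  // keys and values are byte strings
};

// Small usage example: returns true if a stored value can be read back.
inline bool round_trip_demo() {
    KeyValueStore store;
    store.put("user:42", "alice");
    assert(store.exists("user:42"));
    std::string value;
    return store.get("user:42", value) && value == "alice";
}
```

Keys and values are plain std::string objects here, which matches the idea that anything that fits into a string of bytes can be stored.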

5. R
http://www.r-project.org/
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.
One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.
R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
The R statistical software is similar to MatLab; both are used in scientific computing.

Reposted from: http://kapoc.blogdriver.com/kapoc/1268927.html
 
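Part 3. Machine Learning Functions in OpenCV (Partial)
The Machine Learning Library (MLL) is a set of classes and functions for statistical classification, regression, and clustering of data.

Most of the classification and regression algorithms are implemented as C++ classes. Because each algorithm has a different feature set (such as the ability to handle missing data or categorical input variables), the classes have only a little in common. That common ground is defined by the class CvStatModel, from which all other ML classes are derived.

CvStatModel
Base class for statistical models in ML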
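class CvStatModel
{
public:
    /* CvStatModel(); */
    /* CvStatModel( const CvMat* train_data ... ); */

    virtual ~CvStatModel();

    virtual void clear()=0;

    /* virtual bool train( const CvMat* train_data, [int tflag,] ..., const CvMat* responses, ...,
        [const CvMat* var_idx,] ..., [const CvMat* sample_idx,] ...
        [const CvMat* var_type,] ..., [const CvMat* missing_mask,] ... )=0; */

    /* virtual float predict( const CvMat* sample ... ) const=0; */

    virtual void save( const char* filename, const char* name=0 )=0;
    virtual void load( const char* filename, const char* name=0 )=0;

    virtual void write( CvFileStorage* storage, const char* name )=0;
    virtual void read( CvFileStorage* storage, CvFileNode* node )=0;
};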

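In this declaration, some methods are commented out. These are methods for which there is no unified API (the default constructor excepted) but whose syntax and semantics have much in common, so they are described here as if they were part of the base class. They are introduced below.

CvStatModel::CvStatModel
Default constructor
CvStatModel::CvStatModel();

Every statistical model class in ML has a constructor that takes no parameters. It is useful for two-stage model construction, in which the default constructor is followed by train() or load().

CvStatModel::CvStatModel(...)
Training constructor
CvStatModel::CvStatModel( const CvMat* train_data ... );

Most ML classes provide a constructor that constructs and trains the model in a single step. It is equivalent to calling the default constructor and then calling train() with the parameters that were passed to the constructor.

CvStatModel::~CvStatModel
Virtual destructor
CvStatModel::~CvStatModel();

The destructor is declared virtual, so it is safe to write code like the following: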
CvStatModel* model;
if( use_svm )
    model = new CvSVM(... );
else
    model = new CvDTree(... );
...
delete model;

Normally, the destructor of each derived class does nothing itself except call the overridden method clear(), which releases all memory.
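
The pattern above (a virtual base destructor, with each derived destructor delegating to clear()) can be tried without OpenCV. The following is a minimal sketch using made-up classes: ToyStatModel, ToyTreeModel, the counter g_clear_calls, and the helper destroy_through_base_pointer are all hypothetical and exist only to make the behaviour observable.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for CvStatModel and one derived model; not OpenCV code.
static int g_clear_calls = 0;  // counts clear() calls so the behaviour is observable

class ToyStatModel {
public:
    virtual ~ToyStatModel() {}    // virtual: deleting through a base pointer is safe
    virtual void clear() = 0;
};

class ToyTreeModel : public ToyStatModel {
public:
    explicit ToyTreeModel(std::size_t n_nodes) : nodes_(n_nodes, 0.0f) {}
    ~ToyTreeModel() { clear(); }  // the destructor only delegates to clear()
    void clear() {
        nodes_.clear();           // drop the learned state; the object stays usable
        ++g_clear_calls;
    }
    std::size_t size() const { return nodes_.size(); }
private:
    std::vector<float> nodes_;
};

// Creates a model, destroys it through the base-class pointer, and reports
// how many times clear() ran during destruction.
inline int destroy_through_base_pointer() {
    const int before = g_clear_calls;
    ToyStatModel* model = new ToyTreeModel(8);
    assert(model != 0);
    delete model;                 // runs ~ToyTreeModel() first, then ~ToyStatModel()
    return g_clear_calls - before;
}
```

If the base destructor were not virtual, deleting through the base pointer would be undefined behaviour and the derived clear() would not be guaranteed to run.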

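CvStatModel::clear
Releases memory and resets the model state
void CvStatModel::clear();

This function does the same work as the destructor, that is, it releases all memory held by the class members. The object itself is not destroyed, however, and can be reused. The method is called from the derived classes' destructors, from their training methods, and from load() and read(), or it can be called explicitly by the user.

CvStatModel::save
Saves the model to a file
void CvStatModel::save( const char* filename, const char* name=0 );

The function save stores the complete model state in the specified XML or YAML file, under the specified name or a default name (which depends on the particular class). It relies on the data persistence functionality of cxcore.

CvStatModel::load
Loads the model from a file
void CvStatModel::load( const char* filename, const char* name=0 );

The function load reads the complete model state from the specified XML or YAML file. The previous model state is cleared by clear() in the process. Because the function is virtual, any model can be loaded through it. However, unlike the C types of OpenCV, which can be loaded with cvLoad(), the model type must be known here, because an instance of the appropriate class has to be constructed beforehand. This limitation is to be removed in later versions of ML.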
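
To make the constraint concrete: because load() is a member function, the caller has to construct an empty object of the right class before loading into it. The sketch below imitates that with a hypothetical ToyLinearModel that serializes to a text stream. The class, its text format, the helper save_then_load, and the use of standard streams are assumptions for illustration; they do not reflect OpenCV's XML/YAML storage.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical model; not an OpenCV class.
class ToyLinearModel {
public:
    ToyLinearModel() {}
    explicit ToyLinearModel(const std::vector<float>& w) : weights_(w) {}

    void clear() { weights_.clear(); }

    // Writes the weight count followed by the weights.
    void save(std::ostream& out) const {
        out << weights_.size();
        for (std::size_t i = 0; i < weights_.size(); ++i) out << ' ' << weights_[i];
        out << '\n';
    }

    // Discards the previous state, then reads the format written by save().
    // Returns false if the stream does not contain a complete model.
    bool load(std::istream& in) {
        clear();
        std::size_t n = 0;
        if (!(in >> n)) return false;
        std::vector<float> w(n);
        for (std::size_t i = 0; i < n; ++i) {
            if (!(in >> w[i])) return false;
        }
        weights_.swap(w);
        return true;
    }

    const std::vector<float>& weights() const { return weights_; }

private:
    std::vector<float> weights_;
};

// Round trip: the caller must already know the type to construct the target.
inline bool save_then_load(const ToyLinearModel& source, ToyLinearModel& target) {
    std::stringstream buffer;
    source.save(buffer);
    const bool ok = target.load(buffer);
    assert(!ok || target.weights().size() == source.weights().size());
    return ok;
}
```

As in the description above, load() clears whatever the target held before, so a failed load leaves an empty model rather than a stale one.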

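CvStatModel::write
Writes the model to file storage
void CvStatModel::write( CvFileStorage* storage, const char* name );
The function write stores the model state in the given file storage under the specified name. It is called by save().

CvStatModel::read
Reads the model from file storage
void CvStatModel::read( CvFileStorage* storage, CvFileNode* node );

The function read restores the complete model state from the specified node of the file storage. The node must be located by the user, for example with the function cvGetFileNodeByName(). This method is called by load(). The previous model state is cleared by clear().

CvStatModel::train
Trains the model
bool CvStatModel::train( const CvMat* train_data, [int tflag,] ..., const CvMat* responses, ...,
    [const CvMat* var_idx,] ..., [const CvMat* sample_idx,] ...
    [const CvMat* var_type,] ..., [const CvMat* missing_mask,] ... );

The function train trains the statistical model using a set of input feature vectors and the corresponding output values (responses). Both the input and the output vectors or values are passed as matrices. By default, the input feature vectors are stored as rows of train_data, that is, all components (features) of one training vector are stored contiguously. Some algorithms can also handle the transposed representation, in which all values of each particular feature over the whole input set are stored contiguously. If both layouts are supported, the method has a tflag parameter that specifies the orientation: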
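tflag=CV_ROW_SAMPLE means that the feature vectors are stored as rows, one sample per row;
tflag=CV_COL_SAMPLE means that the feature vectors are stored as columns, one sample per column.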
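
The difference between the two layouts comes down to where element (sample, feature) sits in the matrix. A minimal sketch follows, using a flat std::vector in place of CvMat. The enum SampleLayout and the helpers feature_at and to_row_samples are hypothetical names that only mirror the meaning of CV_ROW_SAMPLE and CV_COL_SAMPLE.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical mirror of the two tflag values; not the OpenCV constants.
enum SampleLayout { ROW_SAMPLE, COL_SAMPLE };

// Returns feature `f` of sample `s` from a row-major buffer.
// ROW_SAMPLE: the matrix is n_samples x n_features (one sample per row).
// COL_SAMPLE: the matrix is n_features x n_samples (one sample per column).
inline float feature_at(const std::vector<float>& data,
                        std::size_t n_samples, std::size_t n_features,
                        SampleLayout layout, std::size_t s, std::size_t f) {
    assert(data.size() == n_samples * n_features);
    assert(s < n_samples && f < n_features);
    return layout == ROW_SAMPLE ? data[s * n_features + f]
                                : data[f * n_samples + s];
}

// Converts a column-sample matrix into the default row-sample layout.
inline std::vector<float> to_row_samples(const std::vector<float>& data,
                                         std::size_t n_samples,
                                         std::size_t n_features) {
    std::vector<float> out(data.size());
    for (std::size_t s = 0; s < n_samples; ++s)
        for (std::size_t f = 0; f < n_features; ++f)
            out[s * n_features + f] =
                feature_at(data, n_samples, n_features, COL_SAMPLE, s, f);
    return out;
}
```

With two samples of three features each, the row layout stores 1 2 3 4 5 6 and the column layout stores 1 4 2 5 3 6; both describe the same data.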

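train_data must be in 32fC1 (32-bit floating-point, single-channel) format. Responses are usually stored in a one-dimensional row or column vector of type 32sC1 (classification problems only) or 32fC1, with one value per input vector (although some algorithms, such as the various kinds of neural networks, take a vector of responses per input).

For classification problems the responses are discrete class labels; for regression problems they are values of the function to be approximated. Some algorithms can handle only one kind of problem, and some can handle both. In the latter case, the type of the output variable is specified through var_type:
CV_VAR_CATEGORICAL means that the output values are discrete class labels;
CV_VAR_ORDERED(=CV_VAR_NUMERICAL) means that the output values are ordered, that is, two different values can be compared as numbers, which makes this a regression problem.
The types of the input variables can also be specified with var_type. Most algorithms, however, can handle only continuous input variables.

Many models in ML can be trained on a selected subset of features and/or on a selected subset of training samples. To make this easy for the user, train takes a var_idx parameter that identifies the features of interest and a sample_idx parameter that identifies the samples of interest. Both are either integer vectors (32sC1), that is, lists of 0-based indices, or 8-bit masks of the active variables or samples. Passing a NULL pointer for either parameter means that all variables or samples are used for training.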
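
The two accepted forms of var_idx and sample_idx (a list of 0-based indices, or an 8-bit mask) carry the same information. The sketch below shows one way to convert between them and to treat "nothing passed" as "select everything". The function names mask_to_indices and resolve_selection are hypothetical, and an empty vector stands in for the NULL pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Converts an 8-bit mask (nonzero = active) into a list of 0-based indices.
inline std::vector<int> mask_to_indices(const std::vector<unsigned char>& mask) {
    std::vector<int> idx;
    for (std::size_t i = 0; i < mask.size(); ++i)
        if (mask[i] != 0) idx.push_back(static_cast<int>(i));
    return idx;
}

// Resolves a selection over `count` items. An empty selection plays the role
// of a NULL pointer in this sketch and selects every item.
inline std::vector<int> resolve_selection(const std::vector<int>& idx,
                                          std::size_t count) {
    if (!idx.empty()) {
        for (std::size_t i = 0; i < idx.size(); ++i)
            assert(idx[i] >= 0 && static_cast<std::size_t>(idx[i]) < count);
        return idx;
    }
    std::vector<int> all(count);
    for (std::size_t i = 0; i < count; ++i) all[i] = static_cast<int>(i);
    return all;
}
```

For example, the mask 1 0 0 255 1 selects the same items as the index list 0 3 4.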

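In addition, some algorithms can handle missing data, that is, cases where a particular feature of a particular training sample has no value (for example, someone forgot to measure patient A's temperature on Monday). The parameter missing_mask, an 8-bit matrix of the same size as train_data, marks the missing values (nonzero mask elements indicate missing data).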
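
As an illustration of how such a mask can be consumed, the sketch below computes per-feature means while skipping masked entries. It assumes row-major data with one sample per row and uses std::vector instead of CvMat. The function name feature_means_ignoring_missing is hypothetical, and this is not how any particular OpenCV algorithm treats missing values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-feature mean over row-major samples, skipping entries whose mask byte
// is nonzero. Features with no observed value get a mean of 0 in this sketch.
inline std::vector<float> feature_means_ignoring_missing(
        const std::vector<float>& train_data,
        const std::vector<unsigned char>& missing_mask,
        std::size_t n_samples, std::size_t n_features) {
    assert(train_data.size() == n_samples * n_features);
    assert(missing_mask.size() == train_data.size());  // same size as train_data
    std::vector<float> sum(n_features, 0.0f);
    std::vector<std::size_t> seen(n_features, 0);
    for (std::size_t s = 0; s < n_samples; ++s) {
        for (std::size_t f = 0; f < n_features; ++f) {
            const std::size_t k = s * n_features + f;
            if (missing_mask[k] != 0) continue;  // nonzero means "value is missing"
            sum[f] += train_data[k];
            ++seen[f];
        }
    }
    for (std::size_t f = 0; f < n_features; ++f)
        sum[f] = seen[f] ? sum[f] / static_cast<float>(seen[f]) : 0.0f;
    return sum;
}
```

Whatever placeholder sits in a masked cell of train_data is never read, so its value does not matter.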

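Usually, the previous model state is cleared by clear() before the training procedure runs. Some algorithms, however, may update the model with the new data instead of resetting it.

CvStatModel::predict
Predicts the response for a sample
float CvStatModel::predict( const CvMat* sample[, ...] ) const;

This function predicts the response for a new sample. For classification it returns the class label; for regression it returns the output function value. The input sample must have the same number of components as the train_data passed to train. If a var_idx parameter was passed to train, it is remembered, and predict then uses exactly those components of the input sample.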
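
The remembered-var_idx behaviour can be sketched with a toy model. The class below is a hypothetical 1-nearest-neighbour classifier, not an OpenCV class: it stores the feature selection at training time and applies it again in a const predict(). It returns an int label instead of a float, and an empty var_idx stands in for a NULL pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1-nearest-neighbour classifier; not an OpenCV class.
class ToyNearestNeighbour {
public:
    ToyNearestNeighbour() : n_features_(0) {}

    void clear() {
        n_features_ = 0;
        samples_.clear();
        responses_.clear();
        var_idx_.clear();
    }

    // train_data holds one sample per row (row-major, n_features per row).
    // var_idx lists the features to use; an empty list means "all features".
    bool train(const std::vector<float>& train_data, std::size_t n_features,
               const std::vector<int>& responses,
               const std::vector<std::size_t>& var_idx) {
        clear();  // the previous model state is discarded first
        if (n_features == 0 || responses.empty() ||
            train_data.size() != responses.size() * n_features) return false;
        for (std::size_t i = 0; i < var_idx.size(); ++i)
            if (var_idx[i] >= n_features) return false;
        var_idx_ = var_idx;  // remembered for predict()
        if (var_idx_.empty())
            for (std::size_t f = 0; f < n_features; ++f) var_idx_.push_back(f);
        n_features_ = n_features;
        samples_ = train_data;
        responses_ = responses;
        return true;
    }

    // The sample has as many components as a training row; only the
    // components selected at training time enter the distance.
    int predict(const std::vector<float>& sample) const {
        assert(!responses_.empty() && sample.size() == n_features_);
        std::size_t best = 0;
        float best_dist = 0.0f;
        for (std::size_t s = 0; s < responses_.size(); ++s) {
            float dist = 0.0f;
            for (std::size_t i = 0; i < var_idx_.size(); ++i) {
                const std::size_t f = var_idx_[i];
                const float diff = samples_[s * n_features_ + f] - sample[f];
                dist += diff * diff;
            }
            if (s == 0 || dist < best_dist) { best_dist = dist; best = s; }
        }
        return responses_[best];
    }

private:
    std::size_t n_features_;
    std::vector<float> samples_;
    std::vector<int> responses_;
    std::vector<std::size_t> var_idx_;
};
```

The caller always passes a full-length sample; the model, not the caller, decides which components matter.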

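The const qualifier means that prediction does not affect the internal state of the model, so the function can safely be called from different threads.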
 
Source: http://www.aiseminar.cn/bbs/forum.php?mod=viewthread&tid=798
