Some Third-Party Libraries


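At work and in study, relying on third-party open-source libraries is commonplace. It is "standing on the shoulders of giants", after all. I am sure this is familiar to everyone, as is admiration for the greatness of open source and sharing.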

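Partly as a summary, and partly because good things should be shared, I maintain a page on GitHub, https://github.com/fandywang/thirdparty_intro, that collects the third-party libraries I follow most closely. They are listed below (continuously updated).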

Google open-source libraries

  • zh-google-styleguide - Google open-source project style guides.
  • protobuf - Protocol Buffers - Google’s data interchange format.
  • gflags - Commandline flags module for C++.
  • glog - Logging library for C++.
  • gtest - Google C++ Testing Framework.
  • googlemock - Google C++ Mocking Framework.
  • leveldb - A fast and lightweight key/value database library by Google.
    cpy-leveldb - Python bindings for LevelDB using leveldb c api.
  • The Chromium Projects - The Chromium projects include Chromium and Chromium OS, the open-source projects behind the Google Chrome browser and Google Chrome OS, respectively.

C++ base libraries

  • toft - C++ Base Library for Linux server side development.
  • thirdparty - Put thirdparty library here for toft and foxy.
    chen3feng
  • folly - Folly is an open-source C++ library developed and used at Facebook.

Algorithms and data structures

  • darts-clone - A clone of the Darts (Double-ARray Trie System).
  • Darts - Double-ARray Trie System. Chinese translation of the documentation
  • sparsehash - An extremely memory-efficient hash_map implementation.
  • cityhash - The CityHash family of hash functions.
  • stringencoders - A collection of high performance c-string transformations, frequently 2x faster than standard implementations (if they exist at all).
  • Numpy - NumPy is the fundamental package for scientific computing with Python.

Natural language processing libraries

  • NLTK - NLTK – the Natural Language Toolkit – is a suite of open source Python modules, data sets and tutorials supporting research and development in Natural Language Processing.
    NLTK Book
  • jieba - "Jieba" Chinese word segmentation.
  • gensim - Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.
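  • LTP - The Language Technology Platform (LTP) is a complete open Chinese natural language processing system developed over ten years by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology.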
  • Stanford CoreNLP - Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, and indicate which noun phrases refer to the same entities.
  • openNLP - The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.
  • SRILM - SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation.
  • IRSTLM - The IRST Language Modeling Toolkit features algorithms and data structures suitable to estimate, store, and access very large LMs.
  • KenLM - KenLM estimates unpruned language models with modified Kneser-Ney smoothing.
  • Moses - Moses is a statistical machine translation system that allows you to automatically train translation models for any language pair.
  • GIZA++ - GIZA++ is a statistical machine translation toolkit that is used to train IBM Models 1-5 and an HMM word alignment model.
  • genius - Genius Chinese word segmentation, a segmentation component based on conditional random fields (CRF).
  • pinyin - A tool written in Go for converting Chinese characters to pinyin.
  • ReVerb - ReVerb is a program that automatically identifies and extracts binary relationships from English sentences. ReVerb is designed for Web-scale information extraction, where the target relations cannot be specified in advance and speed is important.
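  • Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources - A compilation of NLP and computational linguistics resources from the Stanford NLP Group, including links to and brief descriptions of tools, code, corpora, dictionaries, and courses. http://t.cn/zOfVAzs
  • webdict - The WEBDICT vocabulary project aims to use machine learning algorithms and manual annotation to build a copyright-free Chinese lexicon containing a large number of Internet words, thereby improving natural language analysis of Chinese web text and open-source Chinese input methods. http://webdict.info/
  • sego - Chinese word segmentation in Go. The dictionary is implemented as a prefix tree, and the segmentation algorithm is a word-frequency-based shortest path with dynamic programming. It supports normal and search-engine segmentation modes, user dictionaries, and part-of-speech tagging, and can run as a JSON RPC service.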
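
The sego entry above describes a prefix-tree dictionary combined with a word-frequency-based shortest-path search solved by dynamic programming. The following is a minimal Python sketch of that general idea, not sego's actual code or API: the toy dictionary `WORD_FREQ`, the helper names `build_trie` and `segment`, and the smoothing constant for unknown characters are all assumptions made for illustration.

```python
import math

# Hypothetical toy dictionary (word -> frequency); not sego's real data.
WORD_FREQ = {"中国": 50, "人民": 40, "中国人": 20, "人": 30, "民": 5, "中": 10, "国": 10}


def build_trie(word_freq):
    """Build a prefix tree as nested dicts; the key None stores a word's frequency."""
    root = {}
    for word, freq in word_freq.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = freq
    return root


def segment(text, trie, total_freq):
    """Return the segmentation with the lowest total cost (shortest path).

    Each dictionary word text[i:j] is an edge i -> j with cost -log(freq / total).
    An unknown single character gets an assumed smoothed frequency of 0.5.
    """
    n = len(text)
    cost = [math.inf] * (n + 1)  # cost[i]: cheapest way to segment text[:i]
    prev = [-1] * (n + 1)        # prev[i]: start index of the last word in that path
    cost[0] = 0.0
    unknown_cost = -math.log(0.5 / total_freq)

    for i in range(n):
        # Fallback edge: treat text[i] as a one-character unknown word.
        candidates = [(i + 1, unknown_cost)]
        # Walk the trie to collect every dictionary word starting at position i.
        node = trie
        for j in range(i, n):
            node = node.get(text[j])
            if node is None:
                break
            if None in node:
                candidates.append((j + 1, -math.log(node[None] / total_freq)))
        # Relax the edges (dynamic programming over positions, left to right).
        for end, edge_cost in candidates:
            if cost[i] + edge_cost < cost[end]:
                cost[end] = cost[i] + edge_cost
                prev[end] = i

    # Backtrack from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        words.append(text[prev[end]:end])
        end = prev[end]
    return words[::-1]


TRIE = build_trie(WORD_FREQ)
TOTAL = sum(WORD_FREQ.values())
print(segment("中国人民", TRIE, TOTAL))  # ['中国', '人民']
```

Each candidate word is an edge whose cost is its negative log relative frequency, so the cheapest path corresponds to the most probable segmentation under a simple unigram model.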

Information retrieval libraries

  • Lemur - The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software.
  • Lucene - The Apache Lucene project develops open-source search software.
  • Solr - Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world’s largest internet sites.
  • gensim - Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.
  • wukong - The Wukong full-text search engine.
  • Scrapy - a fast high-level screen scraping and web crawling framework for Python.
  • distribute_crawler - A distributed web crawler built with Scrapy, Redis, MongoDB, and Graphite: a MongoDB cluster provides the underlying storage, Redis handles distribution, and Graphite displays crawler status.

Machine learning libraries

  • LASSO - LASSO is a parallel machine learning system that learns a regression model from large data. It works in either of two modes: IPM-mode and MPI-mode.
  • libsvm - A Library for Support Vector Machines.
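    Support Vector Machines: A Plain-Language Introduction (Three Levels of Understanding SVM), by the researcher July. The article divides understanding SVM into three levels:
    Level 1: getting to know SVM (a rough idea of what it is suffices);
    Level 2: going deep into SVM (working through its internal principles and how they connect, so that you can apply it with ease later);
    Level 3: proving SVM (once you understand all the principles, you will feel the urge to pick up a pen and try to prove it).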
  • liblinear - A Library for Large Linear Classification.
  • RankLib - RankLib is a library of learning to rank algorithms.
  • svmlight - SVMlight is an implementation of Support Vector Machines (SVMs) in C.
  • plda - A parallel C++ implementation of fast Gibbs sampling of Latent Dirichlet Allocation
  • GibbsLDA++ - A C/C++ implementation of Latent Dirichlet Allocation (LDA) using Gibbs Sampling technique for parameter estimation and inference.
  • Yahoo_LDA - Yahoo!’s topic modelling framework using Latent Dirichlet Allocation
  • word2vec - Tool for computing continuous distributed representations of words.
    Parallelizing word2vec in Python
  • Maximum Entropy Modeling Toolkit for Python and C++ - This package provides a (Conditional) Maximum Entropy Modeling Toolkit for Python and C++.
  • maxent - A simple C++ library for maximum entropy classification.
  • easyME - This is a simple implementation of Maximum Entropy model. Algorithms implemented include: GIS, SCGIS, LBFGS, Gaussian smoothing and Exponential smoothing.
  • libLBFGS - This library is a C port of the implementation of Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method written by Jorge Nocedal.
  • OWL-QN - The Orthant-Wise Limited-memory Quasi-Newton algorithm (OWL-QN) is a numerical optimization procedure for finding the optimum of an objective of the form {smooth function} plus {L1-norm of the parameters}. It has been used for training log-linear models (such as logistic regression) with L1-regularization.
  • CRF++ - CRF++ is a simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data. CRF++ is designed for generic purpose and will be applied to a variety of NLP tasks, such as Named Entity Recognition, Information Extraction and Text Chunking.
  • CRFsuite - A fast implementation of Conditional Random Fields (CRFs).
  • Wapiti - Wapiti is a very fast toolkit for segmenting and labeling sequences with discriminative models. It is based on maxent models, maximum entropy Markov models and linear-chain CRF and proposes various optimization and regularization methods to improve both the computational complexity and the prediction performance of standard models.
  • sofia-ml - Suite of Fast Incremental Algorithms for Machine Learning. Includes methods for learning classification and ranking models, using Pegasos SVM, SGD-SVM, ROMMA, Passive-Aggressive Perceptron, Perceptron with Margins, and Logistic Regression.
  • mahout - The Apache Mahout machine learning library’s goal is to build scalable machine learning libraries.
  • MLTK - MLTK – the Machine Learning Toolkit – is a suite of C++ open source modules of Machine Learning.
  • FP-growth - An implementation of the FP-growth algorithm in pure Python.
  • MLcomp - MLcomp is a free website for objectively comparing machine learning programs across various datasets for multiple problem domains.
  • PyBrain - PyBrain is a modular Machine Learning Library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for Machine Learning Tasks and a variety of predefined environments to test and compare your algorithms. PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library. In fact, we came up with the name first and later reverse-engineered this quite descriptive “Backronym”.
  • parameter_server - A distributed machine learning framework.
  • vowpal_wabbit - John Langford’s original release of Vowpal Wabbit – a fast online learning algorithm.
  • Theano - Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.
  • Caffe - Caffe is a deep learning framework developed with cleanliness, readability, and speed in mind. It was created by Yangqing Jia during his PhD at UC Berkeley, and is in active development by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Caffe is released under the BSD 2-Clause license.

Data interchange protocols

  • protobuf - Protocol Buffers - Google’s data interchange format.
  • jsoncpp - JSON data format manipulation library.
  • tinyxml2 - TinyXML-2 is a simple, small, efficient, C++ XML parser that can be easily integrated into other programs.
  • thrift - The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.

Databases

  • MySQL++ - MySQL++ is a C++ wrapper for MySQL’s C API.
  • MongoDB - MongoDB (from “humongous”) is an open-source document database, and the leading NoSQL database. Written in C++.
  • memcached - Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
  • leveldb - A fast and lightweight key/value database library by Google.
  • SSDB - A fast NoSQL database server with zset data type, an alternative to Redis.
    SSDB is a high performance key-value (key-string, key-zset, key-hashmap) NoSQL persistent storage server, using Google LevelDB as storage engine. SSDB is stable, production-ready and is widely used by many Internet companies such as QIHU 360.
  • RocksDB - RocksDB is an embeddable persistent key-value store for fast storage. RocksDB can also be the foundation for a client-server database but our current focus is on embedded workloads.
    RocksDB builds on LevelDB to be scalable to run on servers with many CPU cores, to efficiently use fast storage, to support IO-bound, in-memory and write-once workloads, and to be flexible to allow for innovation.
  • fatcache - Memcache on SSD. Think of fatcache as a cache for your big data.
  • THUIRDB - THUIRDB is a base library implemented in C++ for high-performance persistent key-value storage and fast queries on a single machine. THUIRDB Paper

Network programming

  • thrift - The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.
  • server1 - a c++ network server/client framework.
  • muduo-protorpc - Google Protobuf RPC based on Muduo.

Web development

  • Flask - Flask is a microframework for Python based on Werkzeug and Jinja2. It’s intended for getting started very quickly and was developed with best intentions in mind.
    Chinese docs
  • Bootstrap - Sleek, intuitive, and powerful front-end framework for faster and easier web development.
  • Django - Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design.

Distributed computing

  • Hadoop - The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
  • ZooKeeper - ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
  • Storm - Distributed and fault-tolerant realtime computation.
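    Storm wiki - Offers a wealth of excellent documentation on Storm and its theoretical foundations, along with tutorials on getting Storm and setting up a new project. You will also find practical documentation on many aspects of Storm, including its use in local mode, in cluster mode, and on Amazon.
    A thorough class tree for Storm is available on GitHub, detailing Storm's classes and interfaces.
    Processing real-time big data with Twitter Storm - An introduction to stream processing of big data. Summary: Storm is an open-source big-data processing system that, unlike other systems, is intended for distributed real-time processing and is language independent. Learn about Twitter Storm, its architecture, and the landscape of batch and stream processing solutions.
    Storm getting-started tutorial - From the official 量子恒道 blog.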
    storm-starter - Learn to use Storm!
    StreamCpp - A small C++ wrapper for Storm. Some documentation can be found at http://demeter.inf.ed.ac.uk/cross/stormcpp.html
    storm-kafka - storm-kafka provides a regular spout implementation and a TransactionalSpout implementation for Apache Kafka 0.7.
  • Spark - Lightning-Fast Cluster Computing.
  • Puppet - Puppet is IT automation software that helps system administrators manage infrastructure throughout its lifecycle, from provisioning and configuration to orchestration and reporting. Using Puppet, you can easily automate repetitive tasks, quickly deploy critical applications, and proactively manage change, scaling from 10s of servers to 1000s, on-premise or in the cloud.
  • Skynet - Skynet is a framework for distributed services in Go.
  • Kafka - A high-throughput distributed messaging system (a distributed message queue). Kafka paper: Building LinkedIn’s Real-time Activity Data Pipeline
    Kafka Clients
    librdkafka
    kafka-python
    Kafka papers and presentations
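  • METAQ - METAQ is message-oriented middleware with a pure queue model, developed by Alibaba. The server is written in Java and can be deployed on a variety of hardware and software platforms; clients are available for Java and C++. A single server can support more than 10,000 message queues, and by adding servers the number of queues can scale out almost arbitrarily. Every queue is persistent, unbounded in length (limited only by disk space), and can be consumed starting from any position.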
  • mapreduce-lite - A C++ implementation of MapReduce without distributed filesystem.
  • GraphChi - GraphChi[huahua] is a spin-off of the GraphLab[rador’s retriever] project.
    GraphChi can run very large graph computations on just a single machine, by using a novel algorithm for processing the graph from disk (SSD or hard drive). Programs for GraphChi are written in similar vertex-centric model as GraphLab. GraphChi runs vertex-centric programs asynchronously (i.e changes written to edges are immediately visible to subsequent computation), and in parallel. GraphChi also supports streaming graph updates and changing the graph structure while computing.
    GraphChi ppt.
    GraphChi Paper.
    GraphChi Video.
    GraphChi’s C++ version - disk-based large-scale graph computation. Big Data - small machine.
  • Giraph - Large-scale graph processing on Hadoop.
  • Celery — Distributed Task Queue - Celery is a simple, flexible and reliable distributed system to process vast amounts of messages, while providing operations with the tools required to maintain such a system.
    It’s a task queue with focus on real-time processing, while also supporting task scheduling.
    This framework is close to the ultimate solution for asynchronous messaging architectures in Python.

Regular expressions

  • re2 - an efficient, principled regular expression library.

Build tools

  • SCons - SCons is an Open Source software construction tool—that is, a next-generation build tool. Think of SCons as an improved, cross-platform substitute for the classic Make utility with integrated functionality similar to autoconf/automake and compiler caches such as ccache. In short, SCons is an easier, more reliable and faster way to build software.
  • CMake - the cross-platform, open-source build system.
  • blade - Blade is designed to be a modern build system.
    Mac OS X port of Typhoon Blade
  • bobo - Bobo is an easy to use building tool inspired by blade.

Code Review

  • rietveld - Code Review, hosted on Google App Engine.
  • Review Board - Take the pain out of code review.

vim

  • spf13-vim - spf13-vim is a distribution of vim plugins and resources for Vim, GVim and MacVim. It is a completely cross platform distribution that stays true to the feel of vim while providing modern features like a plugin management system, autocomplete, tags and tons more.
  • Maximum Awesome - Config files for vim and tmux, lovingly tended by a small subculture of peace-loving hippies. Built for Mac OS X.
  • VimClojure - A filetype, syntax and indent plugin for Clojure.

Learning Go

  • glog - Leveled execution logs for Go.
  • groupcache - groupcache is a caching and cache-filling library, intended as a replacement for memcached in many cases.
  • go-slab - A slab allocator library in the Go Programming Language.
  • A collection of Go language resources

Learning Python

  • pycrumbs - Bits and Bytes of Python from the Internet.

Automated deployment engines

  • Docker - Docker is an open-source project to easily create lightweight, portable, self-sufficient containers from any application. The same container that a developer builds and tests on a laptop can run at scale, in production, on VMs, bare metal, OpenStack clusters, public clouds and more.
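    Docker is an open-source automated deployment engine that can package any application into a simple, portable, self-sufficient container, making it easy to deploy in all kinds of virtual environments for debugging. It keeps each application private while shortening the debug-and-deploy cycle, which makes testing, packaging, and deploying easier and more convenient. Docker is still under heavy development, though; once it is finished, I believe it will bring unprecedented convenience to development.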

Python machine learning libraries

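Python has two important extension modules for scientific computing: NumPy and SciPy. NumPy is a scientific computing package for Python. It includes:

  • a powerful N-dimensional array object;
  • mature (broadcasting) function libraries;
  • tools for integrating C/C++ and Fortran code;
  • useful linear algebra, Fourier transform, and random number generation functions.

SciPy is an open-source Python library of algorithms and mathematical tools. Its modules cover optimization, linear algebra, integration, interpolation, special functions, fast Fourier transforms, signal and image processing, ordinary differential equation solving, and other computations common in science and engineering. Its functionality is similar to that of MATLAB, Scilab, and GNU Octave.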
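
The NumPy features listed above can be illustrated with a short sketch. It assumes NumPy 1.17 or later is installed (for `np.random.default_rng`); the array values and variable names are made up for illustration, and SciPy is left out to keep the example small.

```python
import numpy as np

# N-dimensional array object: a 2x3 array of floats.
a = np.arange(6, dtype=float).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

# Broadcasting: the (3,) row vector is added to every row of the (2, 3) array.
row = np.array([10.0, 20.0, 30.0])
b = a + row

# Linear algebra: solve the system A @ x = y.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
y = np.array([9.0, 8.0])
x = np.linalg.solve(A, y)

# Fourier transform: a cosine with 4 cycles over 32 samples peaks at bin 4.
t = np.arange(32)
signal = np.cos(2 * np.pi * 4 * t / 32)
spectrum = np.abs(np.fft.rfft(signal))
peak_bin = int(np.argmax(spectrum))

# Random number generation with a seeded generator for reproducibility.
rng = np.random.default_rng(0)
samples = rng.normal(size=1000)

print(b)
print(x)
print(peak_bin)
print(samples.shape)
```

Broadcasting is what lets `a + row` work without an explicit loop or a copy of `row` for each row of `a`.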

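NumPy and SciPy are often used together, and most Python machine learning libraries depend on both. Plotting and visualization rely on matplotlib, whose style resembles MATLAB's. There are many Python machine learning libraries, most of them open source. The main ones are: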

  1. scikit-learn

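    scikit-learn is an open-source machine learning module built on SciPy and NumPy. It covers classification, regression, and clustering, with algorithms including SVM, logistic regression, naive Bayes, k-means, and DBSCAN. It is currently funded by INRIA, with occasional funding from Google.

    Project homepages: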

    https://pypi.python.org/pypi/scikit-learn/

    http://scikit-learn.org/

    https://github.com/scikit-learn/scikit-learn

  2. NLTK

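    NLTK (Natural Language Toolkit) is Python's natural language processing module, comprising a range of text-processing and statistical language models. NLTK is commonly used in academic research and teaching, in fields such as linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning. NLTK provides more than 50 corpora and lexical resources, and its text-processing libraries cover classification, tokenization, stemming, parsing, and semantic reasoning. It runs stably on Windows, Mac OS X, and Linux.

    Project homepages: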

    http://sourceforge.net/projects/nltk/

    https://pypi.python.org/pypi/nltk/

    http://nltk.org/

  3. Mlpy

    Mlpy is a Python machine learning module built on NumPy/SciPy and implemented with Cython extensions. Its machine learning algorithms include:

    • Regression

      least squares, ridge regression, least angle regression, elastic net, kernel ridge regression, support vector machines (SVM), partial least squares (PLS)

    • Classification

      linear discriminant analysis (LDA), Basic perceptron, Elastic Net, logistic regression, (Kernel) Support Vector Machines (SVM), Diagonal Linear Discriminant Analysis (DLDA), Golub Classifier, Parzen-based, (kernel) Fisher Discriminant Classifier, k-nearest neighbor, Iterative RELIEF, Classification Tree, Maximum Likelihood Classifier

    • Clustering

      hierarchical clustering, Memory-saving Hierarchical Clustering, k-means

    • Dimensionality reduction

      (Kernel) Fisher discriminant analysis (FDA), Spectral Regression Discriminant Analysis (SRDA), (kernel) Principal component analysis (PCA)

    Project homepages:

    http://sourceforge.net/projects/mlpy

    https://mlpy.fbk.eu/

  4. Shogun

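    Shogun is an open-source large-scale machine learning toolbox. Its machine learning functionality is currently divided into several parts: feature representations, feature preprocessing, kernel representations, kernel normalization, distance representations, classifier representations, clustering methods, distributions, performance evaluation methods, regression methods, and structured output learners.

    The core of Shogun is implemented in C++, with interfaces for MATLAB, R, Octave, and Python. It is used mainly on Linux.

    Project homepage: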

    http://www.shogun-toolbox.org/

  5. MDP

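    The Modular toolkit for Data Processing (MDP) is a Python data processing framework.

    From the user's perspective, MDP is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures. Computations are performed efficiently in terms of speed and memory requirements. From the scientific developer's perspective, MDP is a modular framework that can easily be extended. Implementing new algorithms is easy and intuitive, and newly implemented units are then automatically integrated with the rest of the library. MDP was written in the context of theoretical research in neuroscience, but it has been designed to be helpful in any context where trainable data processing algorithms are used. Its simplicity on the user's side, the variety of readily available algorithms, and the reusability of the implemented units also make it a useful educational tool.

    Project homepages: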

    http://mdp-toolkit.sourceforge.net/

    https://pypi.python.org/pypi/MDP/

  6. PyBrain

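    PyBrain (Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library) is a machine learning module for Python. Its goal is to offer flexible, easy-to-use, yet powerful algorithms for machine learning tasks. (An impressive name.)

    As its name suggests, PyBrain covers neural networks, reinforcement learning (and combinations of the two), unsupervised learning, and evolutionary algorithms. Because many current problems involve continuous state and action spaces, function approximators (such as neural networks) are needed to cope with high-dimensional data. PyBrain is built around neural networks, and all of its training methods accept a neural network as the instance to be trained.

    Project homepages: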

    http://www.pybrain.org/

    https://github.com/pybrain/pybrain/

  7. BigML

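    BigML makes machine learning easy for data-driven decision making and prediction. It creates elegant predictive models through easy-to-understand interactive operations. BigML uses BigML.io and provides Python bindings.

    Project homepages: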

    https://bigml.com/

    https://pypi.python.org/pypi/bigml

    http://bigml.readthedocs.org/

  8. PyML

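    PyML is a Python machine learning toolkit that provides a flexible framework for classification and regression methods. Its main features are feature selection, model selection, classifier combination, and classification evaluation.

    Project homepages: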

    http://cmgm.stanford.edu/~asab/pyml/tutorial/

    http://pyml.sourceforge.net/

  9. Milk

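    Milk is a machine learning toolbox for Python. Its focus is on supervised classification, with several classifiers available: SVMs (based on libsvm), k-NN, random forests, and decision trees. It also performs feature selection. These classifiers can be combined in many ways to form different classification systems.

    For unsupervised learning, it provides k-means and affinity propagation clustering.

    Project homepages: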

    https://pypi.python.org/pypi/milk/

    http://luispedro.org/software/milk

  10. PyMVPA

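    PyMVPA (Multivariate Pattern Analysis in Python) is a Python toolkit for statistical learning analyses of large datasets, offering a flexible and extensible framework. Its features include classification, regression, feature selection, data import and export, and visualization.

    Project homepages: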

    http://www.pymvpa.org/

    https://github.com/PyMVPA/PyMVPA

  11. Pattern

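    Pattern is a web mining module for Python. It bundles bindings for the Google, Twitter, and Wikipedia APIs and provides a web crawler and an HTML parser. Its text analysis includes shallow rule-based parsing, a WordNet interface, syntactic and semantic analysis, TF-IDF, and LSA. It also offers clustering, classification, and graph network visualization.

    Project homepages: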

    http://www.clips.ua.ac.be/pages/pattern

    https://pypi.python.org/pypi/Pattern

  12. pyrallel

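    Pyrallel (Parallel Data Analytics in Python) is an experimental project for machine learning and semi-interactive computing based on a distributed computing model. It can run on small clusters. Its scope: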

    • focus on small to medium dataset that fits in memory on a small (10+ nodes) to medium cluster (100+ nodes).

    • focus on small to medium data (with data locality when possible).

    • focus on CPU bound tasks (e.g. training Random Forests) while trying to limit disk / network access to a minimum.

    • do not focus on HA / Fault Tolerance (yet).

    • do not try to invent new set of high level programming abstractions (yet): use a low level programming model (IPython.parallel) to finely control the cluster elements and messages transferred and help identify what are the practical underlying constraints in distributed machine learning setting.

    Project homepages:

    https://pypi.python.org/pypi/pyrallel

    http://github.com/pydata/pyrallel

  13. Monte

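    Monte (machine learning in pure Python) is a pure-Python machine learning library. It can quickly build models such as neural networks, conditional random fields, and logistic regression, uses inline C for optimization, and is very easy to use and extend.

    Project homepages: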

    https://pypi.python.org/pypi/Monte

    http://montepython.sourceforge.net

  14. Orange

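    Orange is a component-based data mining and machine learning software suite. It features a friendly yet powerful, fast and versatile visual programming front end for exploratory data analysis and visualization, plus Python bindings for scripting. It contains a complete set of components for data preprocessing, feature scoring and filtering, modeling, model evaluation, and exploration techniques. It is written in C++ and Python, and its graphical user interface is built on the cross-platform Qt framework.

    Project homepages: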

    https://pypi.python.org/pypi/Orange/

    http://orange.biolab.si/

  15. Theano

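    Theano is a Python library for defining, optimizing, and evaluating mathematical expressions, designed for efficient computation on multi-dimensional arrays. Its features:

    • tight integration with NumPy

    • efficient, data-intensive GPU computation

    • efficient symbolic differentiation

    • speed and stability optimizations

    • dynamic C code generation

    • extensive unit testing and self-verification

    Theano has been widely used in scientific computing since 2007. It makes building deep learning models easier, and the following models can be implemented quickly: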

    • Logistic Regression

    • Multilayer perceptron

    • Deep Convolutional Network

    • Auto Encoders, Denoising Autoencoders

    • Stacked Denoising Auto-Encoders

    • Restricted Boltzmann Machines

    • Deep Belief Networks

    • HMC Sampling

    • Contractive auto-encoders

    Theano was a Greek beauty, the daughter of Milo, the most powerful man in Croton, and later became the wife of Pythagoras.

    Project homepages:

    http://deeplearning.net/tutorial/

    https://pypi.python.org/pypi/Theano

  16. Pylearn2

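    Pylearn2 is built on Theano and depends in part on scikit-learn. Pylearn2 is currently under development; it will handle vectors, images, video, and other data, and provide deep learning models such as MLP, RBM, and SDA. Its goals are: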

    • Researchers add features as they need them. We avoid getting bogged down by too much top-down planning in advance.
    • A machine learning toolbox for easy scientific experimentation.
    • All models/algorithms published by the LISA lab should have reference implementations in Pylearn2.
    • Pylearn2 may wrap other libraries such as scikits.learn when this is practical
    • Pylearn2 differs from scikits.learn in that Pylearn2 aims to provide great flexibility and make it possible for a researcher to do almost anything, while scikits.learn aims to work as a “black box” that can produce good results even if the user does not understand the implementation
    • Dataset interface for vector, images, video, …
    • Small framework for everything needed for a normal MLP/RBM/SDA/Convolution experiment.
    • Easy reuse of sub-component of Pylearn2.
    • Using one sub-component of the library does not force you to use / learn to use all of the other sub-components if you choose not to.
    • Support cross-platform serialization of learned models.
    • Remain approachable enough to be used in the classroom (IFT6266 at the University of Montreal).

    Project homepages:

    http://deeplearning.net/software/pylearn2/

    https://github.com/lisa-lab/pylearn2

    There are also other Python machine learning libraries, such as:

    pmll (https://github.com/pavlov99/pmll)

    pymining (https://github.com/bartdag/pymining)

    ease (https://github.com/edx/ease)

    textmining (http://www.christianpeccei.com/textmining/)

Other

  • Valgrind - Valgrind is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile your programs in detail. You can also use Valgrind to build new tools.
