Commonly Used Third-Party Open-Source Libraries (thirdparty/common)

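In both work and study, relying on third-party open-source libraries is routine. "Standing on the shoulders of giants" is a phrase everyone knows, and it is hard not to admire what open source and sharing have made possible.

Partly to summarize for myself, and partly because good things should be shared, I maintain a page on GitHub, https://github.com/fandywang/thirdparty_intro, that lists the third-party libraries I follow most closely. The list is below (continuously updated):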

Google open-source libraries

  • zh-google-styleguide - Style guides for Google open-source projects (Chinese translation).
  • protobuf - Protocol Buffers - Google's data interchange format.
  • gflags - Commandline flags module for C++.
  • glog - Logging library for C++.
  • gtest - Google C++ Testing Framework.
  • googlemock - Google C++ Mocking Framework.
  • leveldb - A fast and lightweight key/value database library by Google. See also cpy-leveldb: Python bindings for LevelDB using the LevelDB C API.
  • The Chromium Projects - The Chromium projects include Chromium and Chromium OS, the open-source projects behind the Google Chrome browser and Google Chrome OS, respectively.

C++ base libraries

  • toft - C++ base library for Linux server-side development. See also thirdparty: third-party libraries for toft and foxy. By chen3feng.
  • folly - Folly is an open-source C++ library developed and used at Facebook.

Algorithms and data structures

  • darts-clone - A clone of the Darts (Double-ARray Trie System).
  • Darts - Double-ARray Trie System. A Chinese translation of the documentation is available.
  • sparsehash - An extremely memory-efficient hash_map implementation.
  • cityhash - The CityHash family of hash functions.
  • stringencoders - A collection of high performance c-string transformations, frequently 2x faster than standard implementations (if they exist at all).
  • NumPy - NumPy is the fundamental package for scientific computing with Python.

Natural language processing libraries

  • NLTK - NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets and tutorials supporting research and development in Natural Language Processing. NLTK Book
  • jieba - "Jieba" Chinese word segmentation.
  • gensim - Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.
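  • LTP - The Language Technology Platform (LTP) is a complete, open Chinese natural language processing system developed over ten years by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology.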
  • Stanford CoreNLP - Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, and indicate which noun phrases refer to the same entities.
  • openNLP - The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.
  • SRILM - SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation.
  • IRSTLM - The IRST Language Modeling Toolkit features algorithms and data structures suitable to estimate, store, and access very large LMs.
  • KenLM - KenLM estimates unpruned language models with modified Kneser-Ney smoothing.
  • Moses - Moses is a statistical machine translation system that allows you to automatically train translation models for any language pair.
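  • GIZA++ - GIZA++ is a statistical machine translation toolkit that is used to train IBM Models 1-5 and an HMM word alignment model.
  • genius - Genius Chinese word segmentation, a segmentation component based on conditional random fields (CRF).
  • sego - Chinese word segmentation in Go.
  • pinyin - A Go tool for converting Chinese characters to pinyin.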
  • ReVerb - ReVerb is a program that automatically identifies and extracts binary relationships from English sentences. ReVerb is designed for Web-scale information extraction, where the target relations cannot be specified in advance and speed is important.
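  • Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources - A collection of NLP and computational linguistics resources from the Stanford NLP group, with links to and brief introductions of tools, code, corpora, dictionaries, and courses. http://t.cn/zOfVAzs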

Information retrieval libraries

  • Lemur - The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software.
  • Lucene - The Apache Lucene project develops open-source search software.
  • gensim - Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.
  • wukong - Wukong full-text search engine.
  • Scrapy - a fast high-level screen scraping and web crawling framework for Python.
  • distribute_crawler - A distributed web crawler built with Scrapy, Redis, MongoDB, and Graphite: a MongoDB cluster provides the underlying storage, Redis handles distribution, and Graphite displays crawler status.

Machine learning libraries

  • LASSO - LASSO is a parallel machine learning system that learns a regression model from large data. It works in either of two modes: IPM-mode and MPI-mode.
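  • libsvm - A Library for Support Vector Machines. See also "A Plain-Language Introduction to Support Vector Machines (Three Levels of Understanding SVM)" by the researcher July. The article presents SVM at three levels. Level one: getting to know SVM (a rough idea of what it is suffices). Level two: going deep into SVM (working through its internal principles and how they connect, so that you can apply it with ease later). Level three: proving SVM (once you understand all the principles, you will feel the urge to pick up a pen and try the proofs yourself).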
  • liblinear - A Library for Large Linear Classification.
  • RankLib - RankLib is a library of learning to rank algorithms.
  • svmlight - SVMlight is an implementation of Support Vector Machines (SVMs) in C.
  • plda - A parallel C++ implementation of fast Gibbs sampling of Latent Dirichlet Allocation
  • GibbsLDA++ - A C/C++ implementation of Latent Dirichlet Allocation (LDA) using Gibbs Sampling technique for parameter estimation and inference.
  • Yahoo_LDA - Yahoo!'s topic modelling framework using Latent Dirichlet Allocation
  • word2vec - Tool for computing continuous distributed representations of words. Parallelizing word2vec in Python
  • Maximum Entropy Modeling Toolkit for Python and C++ - This package provides a (Conditional) Maximum Entropy Modeling Toolkit for Python and C++.
  • maxent - A simple C++ library for maximum entropy classification.
  • easyME - This is a simple implementation of Maximum Entropy model. Algorithms implemented include: GIS, SCGIS, LBFGS, Gaussian smoothing and Exponential smoothing.
  • libLBFGS - This library is a C port of the implementation of Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method written by Jorge Nocedal.
  • OWL-QN - The Orthant-Wise Limited-memory Quasi-Newton algorithm (OWL-QN) is a numerical optimization procedure for finding the optimum of an objective of the form {smooth function} plus {L1-norm of the parameters}. It has been used for training log-linear models (such as logistic regression) with L1-regularization.
  • CRF++ - CRF++ is a simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data. CRF++ is designed for generic purpose and will be applied to a variety of NLP tasks, such as Named Entity Recognition, Information Extraction and Text Chunking.
  • CRFsuite - A fast implementation of Conditional Random Fields (CRFs).
  • Wapiti - Wapiti is a very fast toolkit for segmenting and labeling sequences with discriminative models. It is based on maxent models, maximum entropy Markov models and linear-chain CRF and proposes various optimization and regularization methods to improve both the computational complexity and the prediction performance of standard models.
  • sofia-ml - Suite of Fast Incremental Algorithms for Machine Learning. Includes methods for learning classification and ranking models, using Pegasos SVM, SGD-SVM, ROMMA, Passive-Aggressive Perceptron, Perceptron with Margins, and Logistic Regression.
  • mahout - The Apache Mahout machine learning library's goal is to build scalable machine learning libraries.
  • MLTK - MLTK -- the Machine Learning Toolkit -- is a suite of C++ open source modules of Machine Learning.
  • FP-growth - An implementation of the FP-growth algorithm in pure Python.
  • MLcomp - MLcomp is a free website for objectively comparing machine learning programs across various datasets for multiple problem domains.

Data interchange protocols

  • protobuf - Protocol Buffers - Google's data interchange format.
  • jsoncpp - JSON data format manipulation library.
  • tinyxml2 - TinyXML-2 is a simple, small, efficient, C++ XML parser that can be easily integrated into other programs.
  • thrift - The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.

Databases

  • MySQL++ - MySQL++ is a C++ wrapper for MySQL’s C API.
  • MongoDB - MongoDB (from "humongous") is an open-source document database, and the leading NoSQL database. Written in C++.
  • memcached - Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
  • leveldb - A fast and lightweight key/value database library by Google.
  • SSDB - A fast NoSQL database server with zset data type, an alternative to Redis. SSDB is a high performance key-value (key-string, key-zset, key-hashmap) NoSQL persistent storage server, using Google LevelDB as storage engine. SSDB is stable, production-ready and is widely used by many Internet companies such as QIHU 360.
  • RocksDB - RocksDB is an embeddable persistent key-value store for fast storage. RocksDB can also be the foundation for a client-server database but our current focus is on embedded workloads. RocksDB builds on LevelDB to be scalable to run on servers with many CPU cores, to efficiently use fast storage, to support IO-bound, in-memory and write-once workloads, and to be flexible to allow for innovation.
  • fatcache - Memcache on SSD. Think of fatcache as a cache for your big data.
  • THUIRDB - THUIRDB is a base library implemented in C++ that provides high-performance persistent key-value storage and fast lookups on a single machine. THUIRDB Paper

Network programming

  • thrift - The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.
  • server1 - A C++ network server/client framework.
  • muduo-protorpc - Google Protobuf RPC based on Muduo.

Web development

  • Flask - Flask is a microframework for Python based on Werkzeug and Jinja2. It's intended for getting started very quickly and was developed with best intentions in mind. Chinese docs
  • Bootstrap - Sleek, intuitive, and powerful front-end framework for faster and easier web development.
  • Django - Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design.

Distributed computing

  • Hadoop - The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
  • ZooKeeper - ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
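  • Storm - Distributed and fault-tolerant realtime computation. Related resources:
    Storm wiki - Extensive documentation on Storm and its theoretical foundations, plus tutorials on getting Storm and setting up new projects. It also has practical documentation on many aspects of Storm, including its use in local mode, in cluster mode, and on Amazon. A thorough class tree on GitHub details Storm's classes and interfaces.
    Process real-time big data with Twitter Storm - An introduction to stream processing of big data. Storm is an open-source big data processing system that, unlike other systems, is designed for distributed real-time processing and is language independent. The article covers Twitter Storm, its architecture, and the landscape of batch and stream processing solutions.
    Storm getting-started tutorial - From the official Liangzi Hengdao (量子恒道) blog.
    storm-starter - Learn to use Storm!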
  • Spark - Lightning-Fast Cluster Computing.
  • Puppet - Puppet is IT automation software that helps system administrators manage infrastructure throughout its lifecycle, from provisioning and configuration to orchestration and reporting. Using Puppet, you can easily automate repetitive tasks, quickly deploy critical applications, and proactively manage change, scaling from 10s of servers to 1000s, on-premise or in the cloud.
  • Skynet - Skynet is a framework for distributed services in Go.
  • Kafka - A high-throughput distributed messaging system (a distributed message queue). Kafka paper: Building LinkedIn’s Real-time Activity Data Pipeline. Kafka clients: librdkafka, kafka-python.
  • mapreduce-lite - A C++ implementation of MapReduce without distributed filesystem.
  • GraphChi - GraphChi[huahua] is a spin-off of the GraphLab[rador's retriever] project. GraphChi can run very large graph computations on just a single machine, by using a novel algorithm for processing the graph from disk (SSD or hard drive). Programs for GraphChi are written in a similar vertex-centric model as GraphLab. GraphChi runs vertex-centric programs asynchronously (i.e., changes written to edges are immediately visible to subsequent computation), and in parallel. GraphChi also supports streaming graph updates and changing the graph structure while computing. GraphChi ppt. GraphChi Paper. GraphChi Video. GraphChi's C++ version - disk-based large-scale graph computation. Big Data - small machine.
  • Giraph - Large-scale graph processing on Hadoop.
  • Celery --- Distributed Task Queue - Celery is a simple, flexible and reliable distributed system to process vast amounts of messages, while providing operations with the tools required to maintain such a system. It’s a task queue with focus on real-time processing, while also supporting task scheduling. This framework is close to the ultimate solution for asynchronous messaging architectures in Python.

Regular expressions

  • re2 - an efficient, principled regular expression library.

Build tools

  • SCons - SCons is an Open Source software construction tool—that is, a next-generation build tool. Think of SCons as an improved, cross-platform substitute for the classic Make utility with integrated functionality similar to autoconf/automake and compiler caches such as ccache. In short, SCons is an easier, more reliable and faster way to build software.
  • CMake - the cross-platform, open-source build system.
  • blade - Blade is designed to be a modern build system. See also the Mac OS X port of Typhoon Blade.
  • bobo - Bobo is an easy-to-use build tool inspired by blade.

Code Review

  • rietveld - Code Review, hosted on Google App Engine.
  • Review Board - Take the pain out of code review.

vim

  • spf13-vim - spf13-vim is a distribution of vim plugins and resources for Vim, GVim and MacVim. It is a completely cross platform distribution that stays true to the feel of vim while providing modern features like a plugin management system, autocomplete, tags and tons more.
  • Maximum Awesome - Config files for vim and tmux, lovingly tended by a small subculture of peace-loving hippies. Built for Mac OS X.
  • VimClojure - A filetype, syntax and indent plugin for Clojure.

Learning Go

  • glog - Leveled execution logs for Go.
  • groupcache - groupcache is a caching and cache-filling library, intended as a replacement for memcached in many cases.
  • go-slab - A slab allocator library in the Go Programming Language.
  • Go language resource collection.

Learning Python

  • pycrumbs - Bits and Bytes of Python from the Internet.

Automated deployment engines

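  • Docker - Docker is an open-source project to easily create lightweight, portable, self-sufficient containers from any application. The same container that a developer builds and tests on a laptop can run at scale, in production, on VMs, bare metal, OpenStack clusters, public clouds and more. In other words, Docker is an open-source automated deployment engine: it packages any application into a simple, portable container that does not depend on other components, so the application can be deployed easily to all kinds of virtual environments for debugging. It keeps the application isolated while shortening the debug and deployment cycle, which makes the test, package, deploy workflow easier and more convenient. Docker is still under heavy development, though; once it matures, it should make development far more convenient.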

Miscellaneous

  • Valgrind - Valgrind is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile your programs in detail. You can also use Valgrind to build new tools.

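P.S.: As the saying goes, "know not only that it works, but why it works." So when you use an open-source library, try to understand how it works as well, by reading its source code and documentation. Better still, try implementing it yourself.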