python记得还是刚开始工作时接触的呢,现在随着工作的变动,有的项目也不用python,也就渐渐有点忘记了。但是python真的很强大,各种爬虫、网页处理、机器学习、科学计算等的工具包应有尽有。鉴于本人也是搞机器学学方向的,还有很多志同道合的朋友,整理一下方面自己和各位朋友翻阅。
一、Python网页爬虫工具集
Python提供了如下一些很不错的网页爬虫工具框架,既能爬取数据,也能获取和清洗数据:
1. Scrapy
Scrapy, a fast high-level screen scraping and web crawling framework for Python.
推荐大牛pluskid早年的一篇文章:《Scrapy 轻松定制网络爬虫》
官方主页:http://scrapy.org/Github代码页:https://github.com/scrapy/scrapy
2. Beautiful Soup
推荐书籍:《集体智慧编程》,这本书有讲解Beautiful Soup
客观的说,Beautifu Soup不完全是一套爬虫工具,需要配合urllib使用,而是一套HTML/XML数据分析,清洗和获取工具。
官方主页:http://www.crummy.com/software/BeautifulSoup/
3. Python-Goose:
Html Content / Article Extractor, web scrapping lib in Python
Goose最早是用Java写得,后来用Scala重写,是一个Scala项目。Python-Goose用Python重写,依赖了Beautiful Soup。给定一个文章的URL, 获取文章的标题和内容很方便。
Github主页:https://github.com/grangier/python-goose
二、Python文本处理工具集
文本处理一般包括词性标注,句法分析,关键词提取,文本分类,情感分析等等,这是针对中文的,如果是对于英文来说,只需要基本的tokenize。这方面python提供了一下一些工具 包。
1. NLTK (Natural Language Toolkit)
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and an active discussion forum.
推荐:官方的《Natural Language Processing with Python》,以介绍NLTK里的功能用法为主,同时附带一些Python知识,《用Python进行自然语言处理》是中文翻译-NLTK配套书;
《Python Text Processing with NLTK 2.0 Cookbook》,这本书要深入一些,会涉及到NLTK的代码结构,同时会介绍如何定制自己的语料和模型等,相当不错。
官方主页:http://www.nltk.org/
Github代码页:https://github.com/nltk/nltk
2. Pattern
Pattern is a web mining module for the Python programming language.It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, a HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and canvas visualization.
官方主页:http://www.clips.ua.ac.be/pattern
3. TextBlob
Simplified Text Processing,TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
官方主页:http://textblob.readthedocs.org/en/dev/
Github代码页:https://github.com/sloria/textblob
4. MBSP for Python
MBSP is a text analysis system based on the TiMBL and MBT memory based learning applications developed at CLiPS and ILK. It provides tools for Tokenization and Sentence Splitting, Part of Speech Tagging, Chunking, Lemmatization, Relation Finding and Prepositional Phrase Attachment.
官方主页:http://www.clips.ua.ac.be/pages/MBSP
5. Gensim:
Topic modeling for humans
官方主页:http://radimrehurek.com/gensim/index.html
github代码页:https://github.com/piskvorky/gensim
6. langid.py:
Stand-alone language identification system
Github主页:https://github.com/saffsd/langid.py
7. Jieba
“结巴”中文分词:做最好的Python中文分词组件 “Jieba” (Chinese for “to stutter”) Chinese text segmentation: built to be the best Python Chinese word segmentation module.
一个国内的Python文本处理工具包了:结巴分词,其功能包括支持三种分词模式(精确模式、全模式、搜索引擎模式),支持繁体分词,支持自定义词典等,是目前一个非常不错的Python中文分词解决方案。
Github主页:https://github.com/fxsjy/jieba
8. xTAS
xtas, the eXtensible Text Analysis Suite, a distributed text analysis package based on Celery and Elasticsearch.
Github代码页:https://github.com/NLeSC/xtas
三、Python科学计算工具包
推荐一个系列《用Python做科学计算》,将会涉及到NumPy, SciPy, Matplotlib,可以做参考。
1. NumPy
NumPy is the fundamental package for scientific computing with Python. It contains among other things:
1)a powerful N-dimensional array object
2)sophisticated (broadcasting) functions
3)tools for integrating C/C++ and Fortran code
4)useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
官方主页:http://www.numpy.org/
2. SciPy
Scientific Computing Tools for Python,SciPy refers to several related but distinct entities:
1)The SciPy Stack, a collection of open source software for scientific computing in Python, and particularly a specified set of core packages.
2)The community of people who use and develop this stack.
3)Several conferences dedicated to scientific computing in Python – SciPy, EuroSciPy and SciPy.in.
4)The SciPy library, one component of the SciPy stack, providing many numerical routines.
官方主页:http://www.scipy.org/
3. Matplotlib
matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.
matplotlib can be used in python scripts, the python and ipython shell (ala MATLAB®* or Mathematica®†), web application servers, and six graphical user interface toolkits.
官方主页:http://matplotlib.org/
4. iPython
IPython provides a rich architecture for interactive computing with:
1)Powerful interactive shells (terminal and Qt-based).
2)A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media.
3)Support for interactive data visualization and use of GUI toolkits.
4)Flexible, embeddable interpreters to load into your own projects.
5)Easy to use, high performance tools for parallel computing
官方主页:http://ipython.org/
四、Python 机器学习 & 数据挖掘 工具包
1. scikit-learn: Machine Learning in Python
scikit-learn (formerly scikits.learn) is an open source machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, logistic regression, naive Bayes, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
官方主页:http://scikit-learn.org/
2. Pandas: Python Data Analysis Library
Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
官方主页:http://pandas.pydata.org/
3. mlpy – Machine Learning Python
mlpy is a Python module for Machine Learning built on top of NumPy/SciPy and the GNU Scientific Libraries.
mlpy provides a wide range of state-of-the-art machine learning methods for supervised and unsupervised problems and it is aimed at finding a reasonable compromise among modularity, maintainability, reproducibility, usability and efficiency. mlpy is multiplatform, it works with Python 2 and 3 and it is Open Source, distributed under the GNU General Public License version 3.
官方主页:http://mlpy.sourceforge.net/
4. MDP:The Modular toolkit for Data Processing
Modular toolkit for Data Processing (MDP) is a Python data processing framework.
From the user’s perspective, MDP is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures.
From the scientific developer’s perspective, MDP is a modular framework, which can easily be expanded. The implementation of new algorithms is easy and intuitive. The new implemented units are then automatically integrated with the rest of the library.
The base of available algorithms is steadily increasing and includes signal processing methods (Principal Component Analysis, Independent Component Analysis, Slow Feature Analysis), manifold learning methods ([Hessian] Locally Linear Embedding), several classifiers, probabilistic methods (Factor Analysis, RBM), data pre-processing methods, and many others.
官方主页:http://mdp-toolkit.sourceforge.net/
5. PyBrain
PyBrain is a modular Machine Learning Library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for Machine Learning Tasks and a variety of predefined environments to test and compare your algorithms.
PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library. In fact, we came up with the name first and later reverse-engineered this quite descriptive “Backronym”.
官方主页:http://www.pybrain.org/
6. PyML – machine learning in Python
PyML is an interactive object oriented framework for machine learning written in Python. PyML focuses on SVMs and other kernel methods. It is supported on Linux and Mac OS X.
项目主页:http://pyml.sourceforge.net/
7. Milk
Its focus is on supervised classification with several classifiers available: SVMs (based on libsvm), k-NN, random forests, decision trees. It also performs feature selection. These
classifiers can be combined in many ways to form different classification systems.
官方主页:http://luispedro.org/software/milk
8. PyMVPA: MultiVariate Pattern Analysis (MVPA) in Python
PyMVPA is a Python package intended to ease statistical learning analyses of large datasets. It offers an extensible framework with a high-level interface to a broad range of algorithms for classification, regression, feature selection, data import and export. It is designed to integrate well with related software packages, such as scikit-learn, and MDP. While it is not limited to the neuroimaging domain, it is eminently suited for such datasets. PyMVPA is free software and requires nothing but free-software to run.
官方主页:http://www.pymvpa.org/
9. Pyrallel – Parallel Data Analytics in Python
Experimental project to investigate distributed computation patterns for machine learning and other semi-interactive data analytics tasks.
Github代码页:http://github.com/pydata/pyrallel
10. Monte – gradient based learning in Python
Monte (python) is a Python framework for building gradient based learning machines, like neural networks, conditional random fields, logistic regression, etc. Monte contains modules (that hold parameters, a cost-function and a gradient-function) and trainers (that can adapt a module’s parameters by minimizing its cost-function on training data).
Modules are usually composed of other modules, which can in turn contain other modules, etc. Gradients of decomposable systems like these can be computed with back-propagation.
官方主页:http://montepython.sourceforge.net
11. Theano
Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Theano features:
1)tight integration with NumPy – Use numpy.ndarray in Theano-compiled functions.
2)transparent use of a GPU – Perform data-intensive calculations up to 140x faster than with CPU.(float32 only)
3)efficient symbolic differentiation – Theano does your derivatives for function with one or many inputs.
4)speed and stability optimizations – Get the right answer for log(1+x) even when x is really tiny.
5)dynamic C code generation – Evaluate expressions faster.
6) extensive unit-testing and self-verification – Detect and diagnose many types of mistake.
Theano has been powering large-scale computationally intensive scientific investigations since 2007. But it is also approachable enough to be used in the classroom (IFT6266 at the University of Montreal).
12. Pylearn2
Pylearn2 is a machine learning library. Most of its functionality is built on top of Theano. This means you can write Pylearn2 plugins (new models, algorithms, etc) using mathematical expressions, and theano will optimize and stabilize those expressions for you, and compile them to a backend of your choice (CPU or GPU).
参考:
我爱自然语言处理:http://www.52nlp.cn/python-网页爬虫-文本处理-科学计算-机器学习-数据挖掘