Remember this figure from our first Spark post? I actually think Spark can be understood at two conceptual levels:
A big reason I was optimistic about Spark from the start is that second level, which is also what the figure below expresses: Spark, more than a framework.
I plan to write some posts about these libraries later. I have only used SQL, DataFrame, Streaming, and MLlib, so I probably won't cover SparkR, Bagel, or GraphX, but there are plenty of good articles on those online!
Today, let's take another look at machine learning in Spark and do a small recap; I'll gradually write up some practical examples afterwards.
As of this writing (Spark 1.6.0), machine learning in Spark lives in two libraries:

- spark.mllib, the original machine learning library, designed and implemented on top of the RDD API
- spark.ml, the newer machine learning library, designed and implemented on top of the DataFrame API
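To make the distinction concrete, here is a minimal sketch of my own (not from the official guide) training a logistic regression with each library; it assumes a Spark 1.6 shell where `sc` and `sqlContext` are already in scope:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// spark.mllib: the RDD-based API trains on an RDD[LabeledPoint].
val rddData = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0))
))
val mllibModel = new LogisticRegressionWithLBFGS().run(rddData)

// spark.ml: the DataFrame-based API trains on a DataFrame
// with "label" and "features" columns.
import org.apache.spark.ml.classification.LogisticRegression
val dfData = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0))
)).toDF("label", "features")
val mlModel = new LogisticRegression().fit(dfData)
```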
The official recommendation is to implement your machine learning algorithms with spark.ml, and it has already been stated that the DataFrame/Dataset APIs will replace RDDs, so naturally ml will replace mllib in the near future. In fact, Spark 2.0 has already begun unifying DataFrame and Dataset into a single Dataset API. Reference: the "Spark 2.0" slides.
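One concrete sign of that unification: in Spark 2.0, DataFrame is no longer a separate class but simply a type alias for Dataset[Row], so the same data can be viewed typed or untyped. A small sketch of my own:

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

val spark = SparkSession.builder().appName("unified-api").master("local").getOrCreate()
import spark.implicits._

// A typed Dataset built from case-class instances.
case class Person(name: String, age: Long)
val ds: Dataset[Person] = Seq(Person("alice", 30), Person("bob", 25)).toDS()

// The untyped view of the same data: in Spark 2.0,
// DataFrame is defined as `type DataFrame = Dataset[Row]`.
val df: Dataset[Row] = ds.toDF()
```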
Given the official roadmap, I strongly recommend the spark.ml library: the Spark team has stated that from Spark 2.x onward, mllib will essentially be in maintenance mode, with new features developed on ml. Roadmap aside, spark.ml offers a DataFrame/Dataset-based Pipeline workflow plus Spark SQL's efficient code generation, so from every angle, using DataFrame/Dataset with spark.ml is the wiser choice.
That said, ml currently has one shortcoming: it does not yet support as many algorithms as mllib.
Pipeline Components: a Transformer is an algorithm that can transform one DataFrame into another (e.g., an ML model that reads a DataFrame with features and outputs one with predictions appended), while an Estimator is an algorithm that can be fit() on a DataFrame to produce a Transformer (e.g., a learning algorithm that trains on a DataFrame and produces a model).
Transformer.transform()s and Estimator.fit()s are both stateless. In the future, stateful algorithms may be supported via alternative concepts. Each instance of a Transformer or Estimator has a unique ID, which is useful in specifying parameters (discussed below).
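To see those unique IDs and parameters in practice, here is a small sketch of my own using the spark.ml LogisticRegression stage:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

val lr = new LogisticRegression()
println(lr.uid)             // e.g. "logreg_f3a1...": unique per instance
println(lr.explainParams()) // lists every parameter this stage supports

// Parameters can be set with setters...
lr.setMaxIter(10).setRegParam(0.01)
// ...or passed as a ParamMap keyed by this specific instance's params.
val paramMap = ParamMap(lr.maxIter -> 20, lr.regParam -> 0.1)
```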
Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow, that is, running a sequence of algorithms to process and learn from data. E.g., a simple text document processing workflow might include several stages:

- Split each document's text into words.
- Convert each document's words into a numerical feature vector.
- Learn a prediction model using the feature vectors and labels.
Spark ML represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order.
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer’s transform() method is called on the DataFrame.
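As a sketch in Scala (modeled on the official Pipeline example), constructing those stages and chaining them looks like this:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Transformer: splits raw text into words.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
// Transformer: hashes words into a fixed-length feature vector.
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
// Estimator: fit() will produce a LogisticRegressionModel.
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)

// The Pipeline runs the three stages in order.
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))
```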
We illustrate this for the simple text document workflow. The figure below shows the training-time usage of a Pipeline.
Above, the top row represents a Pipeline with three stages. The first two (Tokenizer and HashingTF) are Transformers (blue), and the third (LogisticRegression) is an Estimator (red).
The bottom row represents data flowing through the pipeline, where cylinders indicate DataFrames. The Pipeline.fit() method is called on the original DataFrame, which has raw text documents and labels. The Tokenizer.transform() method splits the raw text documents into words, adding a new column with words to the DataFrame. The HashingTF.transform() method converts the words column into feature vectors, adding a new column with those vectors to the DataFrame. Now, since LogisticRegression is an Estimator, the Pipeline first calls LogisticRegression.fit() to produce a LogisticRegressionModel. If the Pipeline had more stages, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage.
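Continuing the sketch above, fitting the Pipeline on a labeled training DataFrame runs exactly that sequence; the toy data and column names here are illustrative:

```scala
// Toy training data: (id, text, label).
val training = sqlContext.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// fit() calls tokenizer.transform() and hashingTF.transform(),
// then lr.fit() on the resulting features, producing a PipelineModel.
val model = pipeline.fit(training)
```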
A Pipeline is an Estimator. Thus, after a Pipeline’s fit() method runs, it produces a PipelineModel, which is a Transformer. This PipelineModel is used at test time; the figure below illustrates this usage.
In the figure above, the PipelineModel has the same number of stages as the original Pipeline, but all Estimators in the original Pipeline have become Transformers. When the PipelineModel’s transform() method is called on a test dataset, the data are passed through the fitted pipeline in order. Each stage’s transform() method updates the dataset and passes it to the next stage.
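At test time, the sketch continues: calling transform() on the fitted PipelineModel pushes unlabeled documents through the same stages, now all Transformers:

```scala
// Unlabeled test documents: (id, text).
val test = sqlContext.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "mapreduce spark")
)).toDF("id", "text")

// Each stage's transform() runs in order; the fitted
// LogisticRegressionModel appends "probability" and "prediction" columns.
model.transform(test)
  .select("id", "text", "probability", "prediction")
  .show()
```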
Pipelines and PipelineModels help to ensure that training and test data go through identical feature processing steps.