spark.mllib contains the lower-level Spark machine learning API, built on Resilient Distributed Datasets (RDDs). The machine learning techniques it provides include correlation, classification and regression, collaborative filtering, clustering, and dimensionality reduction.
spark.ml provides a machine learning API built on DataFrames, which are a core part of Spark SQL. This package supports developing and managing machine learning pipelines, and it provides feature extractors, transformers, selectors, and machine learning techniques such as classification, regression, and clustering.
Machine Learning is about learning from existing data to make predictions about the future. It’s based on creating models from input data sets for data-driven decision making.
Recommendation engines make predictions from the attributes of an item or a user, or from the behavior of a user or that user's peers.
Collaborative Filtering
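As a rough illustration of predicting from the behavior of a user's peers, here is a minimal sketch of user-based collaborative filtering in plain Python. It is not Spark's API: the ratings data and the names `cosine_similarity` and `predict_rating` are assumptions made up for this example.

```python
from math import sqrt

# Hypothetical toy ratings (user -> {item: rating}); not taken from any real data set.
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob": {"A": 4.0, "B": 3.0, "C": 5.0, "D": 4.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 2.0},
}


def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict_rating(user, item, ratings):
    """Similarity-weighted average of the ratings peers gave to `item`."""
    weighted_sum = 0.0
    weight_total = 0.0
    for peer, peer_ratings in ratings.items():
        if peer == user or item not in peer_ratings:
            continue
        similarity = cosine_similarity(ratings[user], peer_ratings)
        weighted_sum += similarity * peer_ratings[item]
        weight_total += similarity
    # None signals that no peer has rated the item.
    return weighted_sum / weight_total if weight_total else None


# Alice has not rated item "D"; estimate it from Bob and Carol.
prediction = predict_rating("alice", "D", ratings)
print(round(prediction, 2))
```

Because Alice's ratings resemble Bob's more than Carol's, the estimate lands closer to Bob's rating of item D than to Carol's.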
Model based methods
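The Spark MLlib library provides implementations of several algorithms, such as linear SVM, logistic regression, decision trees, and naive Bayes. It also provides ensemble models such as random forests and gradient-boosted trees.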
Steps to implement machine learning solutions
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
The library also provides data processing functions and data analytics utilities and tools.
Machine learning pipelines are used for the creation, tuning, and inspection of machine learning workflow programs.
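Machine learning (ML) workflows typically involve a series of data processing and learning stages.
A machine learning data pipeline is designed as a sequence of stages, each of which is either a Transformer or an Estimator. The configured stages run in order, and the input data is transformed as it passes through each stage of the pipeline.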
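The stage contract can be sketched in plain Python. This is only an illustration of the idea, not Spark's implementation: the classes below imitate the fit/transform roles, and `Lowercase`, `Tokenize`, `KeywordLearner`, `KeywordModel`, and the toy rows are all hypothetical.

```python
class Transformer:
    """A stage that maps a list of rows to a new list of rows."""

    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    """A stage that learns from rows and returns a fitted Transformer."""

    def fit(self, rows):
        raise NotImplementedError


class Lowercase(Transformer):
    def transform(self, rows):
        return [{**r, "text": r["text"].lower()} for r in rows]


class Tokenize(Transformer):
    def transform(self, rows):
        return [{**r, "tokens": r["text"].split()} for r in rows]


class KeywordModel(Transformer):
    """Fitted model: predicts 1 when a row contains any learned keyword."""

    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [
            {**r, "prediction": int(any(t in self.keywords for t in r["tokens"]))}
            for r in rows
        ]


class KeywordLearner(Estimator):
    """Learns the tokens that occur only in rows labelled 1."""

    def fit(self, rows):
        positive = {t for r in rows if r["label"] == 1 for t in r["tokens"]}
        negative = {t for r in rows if r["label"] == 0 for t in r["tokens"]}
        return KeywordModel(positive - negative)


class PipelineModel(Transformer):
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data as transformed so far.
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows)
        return PipelineModel(fitted)


training_rows = [
    {"text": "Win a FREE prize", "label": 1},
    {"text": "Meeting at noon", "label": 0},
    {"text": "FREE offer ends at noon", "label": 1},
]
model = Pipeline([Lowercase(), Tokenize(), KeywordLearner()]).fit(training_rows)
predictions = model.transform(
    [{"text": "Claim your free gift"}, {"text": "Lunch at noon"}]
)
print([row["prediction"] for row in predictions])
```

Fitting the pipeline yields a model made only of Transformers, so the same sequence of stages can be replayed on new data that has no labels.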
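ML development frameworks need to support distributed computation, along with tools for assembling pipeline components. Other requirements include fault tolerance, resource management, scalability, and maintainability.
In production, they also need to support model import/export, cross-validation to choose parameters, and aggregating data from different data sources. They should offer feature extraction, feature selection, and feature data, plus model persistence so that models can be reused later.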
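Model persistence can be pictured as writing the fitted parameters to storage and reading them back later. The sketch below uses JSON and a temporary directory; it does not show how Spark persists models, and the model dictionary, file name, and helper names are hypothetical.

```python
import json
import os
import tempfile


def save_model(model, path):
    """Export the fitted parameters as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(model, handle)


def load_model(path):
    """Import previously exported parameters."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def predict(model, x):
    """Apply a linear model stored as a plain dictionary."""
    return model["weight"] * x + model["bias"]


# Hypothetical fitted parameters, standing in for a trained model.
model = {"kind": "linear", "weight": 2.5, "bias": 1.0}

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "model.json")
    save_model(model, path)
    restored = load_model(path)

print(predict(restored, 2.0))
```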
Machine learning workflows and the composition of dataflow operators are also common in other systems.
A typical data value chain includes the following steps:
Step | Name | Description |
ML1 | Data Ingestion | Loading the data from different data sources. |
ML2 | Data Cleaning | The data is pre-processed to get it ready for machine learning analysis. |
ML3 | Feature Extraction | Also known as Feature Engineering, this step extracts the features from the data sets. |
ML4 | Model Training | The machine learning model is trained using the training data sets. |
ML5 | Model Validation | The model is evaluated for effectiveness against different prediction parameters. The model is also tuned during validation, and this step is used to pick the best model. |
ML6 | Model Testing | The model is tested before it is deployed. |
ML7 | Model Deployment | The final step is to deploy the selected model to run in the production environment. |
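Steps ML1 through ML6 can be walked through on a toy problem. The following is a sketch under stated assumptions: the CSV data, the feature scaling, the fixed train/validation/test split, and the one-parameter model are all invented for illustration, and deployment (ML7) is left out.

```python
import csv
import io

# Hypothetical raw data (size, price); two of the rows are deliberately dirty.
RAW_CSV = """size,price
50,150
80,240
,100
120,360
65,195
100,300
abc,10
90,270
70,210
110,330
"""


def ingest(text):
    """ML1: load records from a data source (an in-memory CSV here)."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(records):
    """ML2: drop records whose fields are missing or not numeric."""
    cleaned = []
    for record in records:
        try:
            cleaned.append((float(record["size"]), float(record["price"])))
        except ValueError:
            continue
    return cleaned


def extract_features(pairs):
    """ML3: rescale size to units of 100 so the feature is close to 1."""
    return [(size / 100.0, price) for size, price in pairs]


def train(points):
    """ML4: least-squares slope of a line through the origin."""
    return sum(x * y for x, y in points) / sum(x * x for x, _ in points)


def mean_abs_error(weight, points):
    """ML5 and ML6: average absolute prediction error on held-out points."""
    return sum(abs(y - weight * x) for x, y in points) / len(points)


points = extract_features(clean(ingest(RAW_CSV)))
train_set, validation_set, test_set = points[:5], points[5:7], points[7:]

weight = train(train_set)
validation_error = mean_abs_error(weight, validation_set)
test_error = mean_abs_error(weight, test_set)
print(len(points), round(weight, 3))
```

The validation set is where tuning decisions would be made, while the test set is touched only once, after the model has been chosen.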
Data Cleaning
Data wrangling tools like Trifacta, OpenRefine, or ActiveClean can be used for data cleaning.
Model Selection
Model selection is done by using data to choose parameters for Transformers and Estimators. This is a critical step in the machine learning pipeline process. Classes like ParamGridBuilder and CrossValidator provide APIs for selecting the ML model.
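The idea behind grid search with cross-validation can be sketched without Spark. In this plain-Python illustration, `param_grid` plays the role of the parameter grid and `cross_validate` plays the role of the cross-validator. These are hypothetical stand-ins rather than the ParamGridBuilder or CrossValidator APIs, and the one-parameter ridge model and data are assumptions.

```python
def fit_ridge(points, reg_param):
    """Slope of a line through the origin with L2 regularization."""
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    return sum_xy / (sum_xx + reg_param)


def mean_squared_error(weight, points):
    return sum((y - weight * x) ** 2 for x, y in points) / len(points)


def k_folds(points, k):
    """Yield (training, held_out) splits; fold i holds out every k-th point."""
    for i in range(k):
        held_out = points[i::k]
        training = [p for j, p in enumerate(points) if j % k != i]
        yield training, held_out


def cross_validate(points, reg_param, k=3):
    """Average held-out error of one parameter setting across k folds."""
    errors = [
        mean_squared_error(fit_ridge(training, reg_param), held_out)
        for training, held_out in k_folds(points, k)
    ]
    return sum(errors) / len(errors)


# Hypothetical noise-free data following y = 2x.
data = [(float(x), 2.0 * x) for x in range(1, 10)]

# Candidate values for the regularization parameter.
param_grid = [0.0, 1.0, 10.0]

scores = {param: cross_validate(data, param) for param in param_grid}
best_param = min(scores, key=scores.get)

# Refit on all of the data with the winning parameter.
final_model = fit_ridge(data, best_param)
print(best_param, final_model)
```

Every candidate is scored only on data it was not trained on, which is what lets the held-out error guide the choice of parameters.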