The Spark machine learning API consists of two packages:
(The official website explains this more clearly.)
spark.mllib contains the low-level Spark machine learning API, built on Resilient Distributed Datasets (RDDs). The machine learning techniques it provides include correlation, classification and regression, collaborative filtering, clustering, and dimensionality reduction.
spark.ml provides a machine learning API built on DataFrames, which are a core part of Spark SQL. This package provides facilities for developing and managing machine learning pipelines, covering feature extractors, transformers, selectors, and machine learning techniques such as classification, regression, and clustering.
Machine Learning is about learning from existing data to make predictions about the future. It’s based on creating models from input data sets for data-driven decision making.
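Below is a brief look at the various machine learning models, with a comparison:
Data preprocessing, cleaning, and analysis are very important, just as important as the actual learning models and algorithms that solve the business problem.
General steps of a typical machine learning solution: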
Recommendation engines use the attributes of an item or a user or the behavior of a user or their peers, to make the predictions.
Factors such as
Two main categories plus one model
Collaborative Filtering
Model based methods
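To make the collaborative filtering idea concrete (predicting from the behavior of a user's peers), here is a minimal plain-Python sketch of a user-based approach. The ratings, user names, and helper functions are all made up for illustration. This is not how Spark implements collaborative filtering; it only shows the "similar users predict for each other" intuition.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}
ratings = {
    "alice": {"A": 5.0, "B": 4.0, "C": 1.0},
    "bob": {"A": 5.0, "B": 5.0, "C": 1.0, "D": 5.0},
    "carol": {"A": 1.0, "B": 1.0, "C": 5.0, "D": 1.0},
}


def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict_rating(user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        weight_total += sim
    return weighted_sum / weight_total if weight_total else None


# alice has not rated item "D"; estimate it from bob and carol
prediction = predict_rating("alice", "D")
```

Because alice's ratings resemble bob's far more than carol's, bob's high rating for "D" dominates the estimate.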
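Anomaly detection is another widely used machine learning technique, because it can solve hard problems in the financial industry quickly and accurately.
Financial services companies need to decide within a few hundred milliseconds whether an online transaction is fraudulent.
Neural network techniques are used for point-of-sale anomaly detection. Companies such as PayPal use various machine learning algorithms (for example, linear regression, neural networks, and deep learning) for risk management.
The Spark MLlib library provides several implemented algorithms, such as linear SVM, logistic regression, decision trees, and naive Bayes, as well as ensemble models such as random forests and gradient-boosted trees.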
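As a rough illustration of one of the algorithms named above, the following plain-Python sketch trains a one-feature logistic regression to flag transactions. The amounts and labels are invented toy data, and the function names are hypothetical. It is not the MLlib implementation, only a small batch gradient descent version of the same model family.

```python
from math import exp

# Made-up training data: transaction amount -> label (1 = flagged, 0 = normal)
amounts = [1.0, 2.0, 3.0, 4.0, 8.0, 9.0, 10.0, 11.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def train_logistic_regression(xs, ys, learning_rate=0.1, epochs=500):
    """Fit a weight and bias by batch gradient descent on the log loss.

    The feature is centered on the training mean to keep the steps stable.
    """
    mean_x = sum(xs) / len(xs)
    centered = [x - mean_x for x in xs]
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(weight * x + bias) - y for x, y in zip(centered, ys)]
        weight -= learning_rate * sum(e * x for e, x in zip(errors, centered)) / n
        bias -= learning_rate * sum(errors) / n
    return mean_x, weight, bias


def predict(model, amount):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    mean_x, weight, bias = model
    return 1 if sigmoid(weight * (amount - mean_x) + bias) >= 0.5 else 0


model = train_logistic_regression(amounts, labels)
```

Scoring a new transaction with `predict(model, amount)` is a single multiply-add and a sigmoid, which is one reason linear models suit tight latency budgets like the one described above.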
Steps to implement machine learning solutions
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
data processing functions and data analytics utilities and tools available in this library
Machine learning pipelines are used for the creation, tuning, and inspection of machine learning workflow programs.
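Machine learning (ML) workflows typically involve a series of data processing and learning stages.
A machine learning data pipeline is designed as a sequence of stages, each of which is either a Transformer or an Estimator. The stages run in the configured order, and the input data is transformed as it passes through each stage of the pipeline.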
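The stage mechanics can be sketched in plain Python. This is a toy model of the idea, not the spark.ml API: the class names `Tokenizer`, `VocabularyEstimator`, and `VocabularyModel`, and the list-of-dicts "rows", are assumptions made for illustration. A Transformer only has `transform`; an Estimator has `fit`, which returns a fitted Transformer.

```python
class Tokenizer:
    """Transformer: adds a `words` column with the lowercase tokens of `text`."""

    def transform(self, rows):
        return [dict(row, words=row["text"].lower().split()) for row in rows]


class VocabularyModel:
    """Fitted Transformer: counts how many words of a row are in the vocabulary."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def transform(self, rows):
        return [
            dict(row, known=sum(w in self.vocabulary for w in row["words"]))
            for row in rows
        ]


class VocabularyEstimator:
    """Estimator: learns a vocabulary from the data and returns a Transformer."""

    def fit(self, rows):
        return VocabularyModel({w for row in rows for w in row["words"]})


class PipelineModel:
    """Fitted pipeline: applies each fitted stage's transform in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Runs stages in order; Estimators are fitted on the data seen so far."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # Estimator -> fitted Transformer
            rows = stage.transform(rows)  # data changes as it moves along
            fitted.append(stage)
        return PipelineModel(fitted)


training = [{"text": "Spark ML pipeline"}, {"text": "Spark SQL DataFrame"}]
model = Pipeline([Tokenizer(), VocabularyEstimator()]).fit(training)
result = model.transform([{"text": "spark streaming pipeline"}])
```

Fitting the pipeline yields a separate fitted object, so the same learned stages can be reapplied to new data without refitting.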
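ML development frameworks need to support distributed computation, along with tools for assembling pipeline components. Other requirements include fault tolerance, resource management, scalability, and maintainability.
In production, a framework also needs to support model import/export, cross-validation to choose parameters, and aggregating data from different data sources. It should provide feature extraction, feature selection, and feature data support, as well as model persistence so that models can be reused later.
Machine learning workflows and the composition of dataflow operators are also common in other systems.
scikit-learn and GraphLab both use pipelines to build machine learning systems.
A typical data value chain includes
A machine learning pipeline is similar: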
Step | Name | Description |
---|---|---|
ML1 | Data Ingestion | Loading the data from different data sources. |
ML2 | Data Cleaning | Data is pre-processed to get it ready for the machine learning data analysis |
ML3 | Feature Extraction | Also known as Feature Engineering, this step is about extracting the features from the data sets |
ML4 | Model Training | The machine learning model is trained in the next step using the training data sets |
ML5 | Model Validation | Next, the machine learning model is evaluated based on different prediction parameters, for its effectiveness. We also tune the model during the validation step. This step is used to pick the best model |
ML6 | Model Testing | The next step is to test the model before it is deployed. |
ML7 | Model Deployment | The final step is to deploy the selected model to run in the production environment. |
Data Cleaning
Give the data a structure so that machine learning models can be applied to it.
Data wrangling tools like Trifacta, OpenRefine or ActiveClean are used for data cleaning needs.
Model Selection
Model selection is done by using data to choose parameters for Transformers and Estimators. This is a critical step in the machine learning pipeline process. Classes like ParamGridBuilder and CrossValidator provide APIs for selecting the ML model.
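The idea behind grid-based model selection can be sketched without Spark. The plain-Python example below builds every parameter combination and scores each with k-fold cross-validation. The helper names (`build_param_grid`, `cross_validate`), the one-dimensional ridge model, and the data are all assumptions for illustration; they are not the signatures of ParamGridBuilder or CrossValidator.

```python
from itertools import product


def build_param_grid(**param_values):
    """Return every combination of the given parameter values as dicts."""
    names = list(param_values)
    return [dict(zip(names, combo)) for combo in product(*param_values.values())]


def fit_ridge(xs, ys, reg_param):
    """1-D ridge regression through the origin: w = sum(xy) / (sum(x^2) + reg)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + reg_param)


def mean_squared_error(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_validate(xs, ys, params, num_folds=3):
    """Average held-out error over `num_folds` deterministic folds."""
    scores = []
    for fold in range(num_folds):
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % num_folds != fold]
        test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % num_folds == fold]
        w = fit_ridge([x for x, _ in train], [y for _, y in train], **params)
        scores.append(mean_squared_error(w, [x for x, _ in test], [y for _, y in test]))
    return sum(scores) / len(scores)


# Made-up data following y = 2x exactly, so no regularization should win
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]

grid = build_param_grid(reg_param=[0.0, 1.0, 10.0])
best_params = min(grid, key=lambda p: cross_validate(xs, ys, p))
```

Each candidate is judged only on data it was not fitted on, which is what makes the comparison between parameter settings fair.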
7 components