Introduction to Spark MLlib

  • Spark MLlib & ML
  • Machine Learning & Data Science
  • Steps in a Machine Learning Program
  • Recommendation Engines
  • Fraud Detection
  • Spark MLlib
  • Spark ML Data Pipelines
    • ML Pipeline Components

Spark MLlib & ML

The Spark machine learning API consists of two packages:

  • spark.mllib
  • spark.ml

The official documentation explains the difference between the two more clearly.

spark.mllib contains the original, lower-level Spark machine learning API built on resilient distributed datasets (RDDs). It provides machine learning techniques such as correlation, classification and regression, collaborative filtering, clustering, and dimensionality reduction.

spark.ml provides a higher-level machine learning API built on DataFrames, which are a core part of Spark SQL. This package provides functionality for building and managing machine learning pipelines, including feature extractors, transformers, selectors, and machine learning techniques such as classification, regression, and clustering.
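A minimal sketch of the difference in Scala: the spark.mllib call works on plain RDDs, while the spark.ml estimator expects a DataFrame with label and features columns. The app name and the tiny datasets are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.stat.Statistics                 // RDD-based API (spark.mllib)
import org.apache.spark.ml.classification.LogisticRegression  // DataFrame-based API (spark.ml)
import org.apache.spark.ml.linalg.Vectors

val spark = SparkSession.builder.appName("mllib-vs-ml").master("local[*]").getOrCreate()
import spark.implicits._

// spark.mllib: Pearson correlation computed directly on RDDs of doubles
val x = spark.sparkContext.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
val y = spark.sparkContext.parallelize(Seq(2.0, 4.1, 6.2, 7.9))
println(Statistics.corr(x, y, "pearson"))

// spark.ml: an estimator trained on a DataFrame of (label, features) columns
val training = Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (1.0, Vectors.dense(0.0, 1.3, 1.0)),
  (0.0, Vectors.dense(2.2, 0.9, -0.8))
).toDF("label", "features")
val model = new LogisticRegression().setMaxIter(10).fit(training)
```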

Machine Learning & Data Science

Machine Learning is about learning from existing data to make predictions about the future. It’s based on creating models from input data sets for data-driven decision making.

A brief overview and comparison of the main categories of machine learning models:

  • Supervised learning: a supervised model is trained on a labeled training data set and then makes predictions for unlabeled data; supervised learning has two sub-types, regression models and classification models
  • Unsupervised learning: an unsupervised model finds hidden patterns or relationships in the raw data without any training labels, so it works on unlabeled data sets; typical examples are k-means and PCA (see the k-means sketch after this list)
  • Semi-supervised learning: a semi-supervised model uses both labeled and unlabeled data for predictive analysis; the typical scenario mixes a small amount of labeled data with a large amount of unlabeled data, and classification and regression methods are generally used
  • Reinforcement learning: a reinforcement model tries different actions in order to maximize a target reward function
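As a quick illustration of the unsupervised case, a hedged k-means sketch in Scala; the 2-D points, the app name, and the choice of k are invented, and no labels are involved.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("kmeans-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// four unlabeled 2-D points forming two obvious groups
val points = Seq(
  Vectors.dense(0.0, 0.1), Vectors.dense(0.2, 0.0),
  Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 8.9)
).map(Tuple1.apply).toDF("features")

val model = new KMeans().setK(2).setSeed(1L).fit(points)  // no labels are used
model.clusterCenters.foreach(println)
```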

Steps in a Machine Learning Program

Data pre-processing, cleaning, and analysis are just as important as the learning model and algorithm that solve the actual business problem.

A typical machine learning solution goes through the following steps, sketched in code after the list:

  • Featurization (feature engineering)
  • Model training
  • Model evaluation
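A minimal Spark ML sketch of these three steps, assuming an invented dataset with made-up column names (amount, age, label):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ml-steps-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// invented raw records: transaction amount, customer age, 0/1 label
val raw = Seq(
  (120.0, 25.0, 0.0), (5000.0, 41.0, 1.0),
  (60.0, 33.0, 0.0), (4300.0, 29.0, 1.0)
).toDF("amount", "age", "label")

// 1. Featurization: assemble the raw columns into a feature vector
val featurized = new VectorAssembler()
  .setInputCols(Array("amount", "age"))
  .setOutputCol("features")
  .transform(raw)

// 2. Training: fit a classifier on the featurized data
val model = new LogisticRegression().setLabelCol("label").fit(featurized)

// 3. Evaluation: score the model (area under ROC by default)
val auc = new BinaryClassificationEvaluator().setLabelCol("label")
  .evaluate(model.transform(featurized))
println(s"AUC = $auc")
```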

Recommendation Engines

Recommendation engines use the attributes of an item or a user, or the behavior of a user or their peers, to make predictions.

Factors that feed into recommendations include:

  • Peer based
  • Customer behavior
  • Corporate deals or offers
  • Item clustering
  • Market/Store factors

There are two main categories of filtering, plus model-based methods:

  • Content-based filtering
    • How similar an item is to other items, based on usage and ratings
  • Collaborative filtering
    • Makes predictions for a specific item or user based on similarity with other items or users (see the ALS sketch after this list)
    • The assumption is that users who display similar profiles or behavior have similar preferences for items
  • Model-based methods
    • Often incorporate methods from collaborative and content-based filtering
    • Deep learning
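A minimal collaborative-filtering sketch using MLlib's ALS implementation; the (userId, itemId, rating) triples, the app name, and the parameter values are invented:

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("als-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// made-up explicit ratings: (userId, itemId, rating)
val ratings = Seq(
  (0, 10, 4.0), (0, 11, 1.0),
  (1, 10, 5.0), (1, 12, 2.0),
  (2, 11, 3.0), (2, 12, 5.0)
).toDF("userId", "itemId", "rating")

val als = new ALS()
  .setUserCol("userId").setItemCol("itemId").setRatingCol("rating")
  .setRank(5).setMaxIter(5).setRegParam(0.1)

val model = als.fit(ratings)
model.setColdStartStrategy("drop")         // avoid NaN predictions for unseen users/items
model.recommendForAllUsers(2).show(false)  // top-2 item recommendations per user
```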

Fraud Detection

Anomaly detection is another very widely used machine learning technique, because it can solve hard problems in the financial industry quickly and accurately.

Financial services companies need to decide within a few hundred milliseconds whether an online transaction is fraudulent.

Neural network techniques are used for point-of-sale anomaly detection. Companies such as PayPal use a range of machine learning algorithms (for example, linear regression, neural networks, and deep learning) for risk management.

The Spark MLlib library provides several ready-made algorithms for this kind of classification, such as linear SVM, logistic regression, decision trees, and naive Bayes. In addition, ensemble models such as random forests and gradient-boosted trees are available.
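For illustration, a hedged sketch of training one of the listed ensemble models (a random forest) on invented transaction features, where label 1.0 stands for a fraudulent transaction:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("fraud-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// label 1.0 = fraudulent, 0.0 = legitimate; the feature values are invented
val transactions = Seq(
  (0.0, Vectors.dense(12.5, 1.0, 0.0)),
  (1.0, Vectors.dense(9800.0, 0.0, 1.0)),
  (0.0, Vectors.dense(43.0, 1.0, 0.0)),
  (1.0, Vectors.dense(7200.0, 0.0, 1.0))
).toDF("label", "features")

val rf = new RandomForestClassifier().setNumTrees(20).setLabelCol("label")
val model = rf.fit(transactions)
model.transform(transactions).select("label", "prediction", "probability").show(false)
```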

Steps to Implement Machine Learning Solutions

  • Select the programming language
  • Select the appropriate algorithm or algorithms
  • Select the problem to solve
  • Research the algorithms
  • Unit test all the functions in the ML solution

Spark MLlib

MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.

Data processing functions, analytics utilities, and tools available in this library include the following (an FPGrowth example follows the list):

  • Frequent itemset mining via FP-growth and association rules
  • Sequential pattern mining via PrefixSpan
  • Summary statistics and hypothesis testing
  • Feature transformations
  • Model evaluation and hyper-parameter tuning
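A small example of the first item, frequent itemset mining with the FPGrowth API; the shopping baskets, thresholds, and app name are invented:

```scala
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("fpgrowth-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// each row is one made-up shopping basket
val baskets = Seq(
  Array("bread", "milk"),
  Array("bread", "butter", "milk"),
  Array("beer", "bread"),
  Array("milk", "butter")
).map(Tuple1.apply).toDF("items")

val model = new FPGrowth()
  .setItemsCol("items").setMinSupport(0.4).setMinConfidence(0.5)
  .fit(baskets)

model.freqItemsets.show(false)       // frequent itemsets above the support threshold
model.associationRules.show(false)   // association rules derived from them
```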

Spark ML Data Pipelines

Machine learning pipelines are used for the creation, tuning, and inspection of machine learning workflow programs.

  • ML pipelines help us focus more on the big data requirements and machine learning tasks in our projects instead of spending time and effort on the infrastructure and distributed computing areas.
  • They also help with the exploratory stages of machine learning problems, where we need to iterate over combinations of features and models.

Machine learning (ML) workflows usually involve a sequence of data processing and learning stages.

A machine learning data pipeline is designed as a sequence of stages, each of which is either a Transformer or an Estimator. The configured stages run in order, and the input data is transformed as it flows through the pipeline.
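A minimal pipeline sketch: two Transformers (Tokenizer, HashingTF) followed by an Estimator (LogisticRegression), run in order on a tiny made-up text dataset; column names and the app name are invented:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("pipeline-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val training = Seq(
  (0L, "spark is great", 1.0),
  (1L, "hadoop mapreduce", 0.0),
  (2L, "spark streaming jobs", 1.0)
).toDF("id", "text", "label")

// Transformers: text -> words -> hashed term-frequency features
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1000)
// Estimator: learns a model from the features
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)          // runs the stages in order
model.transform(training).select("text", "prediction").show(false)
```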

ML development frameworks need to support distributed computing as well as tools for assembling pipeline components. Other requirements include fault tolerance, resource management, scalability, and maintainability.

In production, they also need to support model import/export, cross-validation to choose parameters, and aggregating data from different data sources. They should provide feature extraction, feature selection, and feature statistics, along with model persistence so that trained models can be reused later.

Machine learning workflows and the composition of dataflow operators are also common in other systems.

scikit-learn and GraphLab both use pipelines to build machine learning systems.

A typical data value chain includes the following steps:

  • Discover
  • Ingest
  • Process
  • Persist
  • Integrate
  • Analyze
  • Expose

A machine learning pipeline goes through a similar sequence of steps:

  • ML1 Data Ingestion: Load the data from different data sources.
  • ML2 Data Cleaning: Pre-process the data to get it ready for the machine learning analysis.
  • ML3 Feature Extraction: Also known as feature engineering, this step extracts the features from the data sets.
  • ML4 Model Training: Train the machine learning model using the training data sets.
  • ML5 Model Validation: Evaluate the model for effectiveness against different prediction parameters and tune it; this step is used to pick the best model.
  • ML6 Model Testing: Test the model before it is deployed.
  • ML7 Model Deployment: Deploy the selected model to run in the production environment.

Data Cleaning

Data cleaning gives the data the structure it needs so that a machine learning model can be applied to it. Typical data quality problems include:

  • Missing data
  • Incorrect data
  • Irrelevant data
  • Skewed data
  • Sparse or coarse-grained data

Data wrangling tools like Trifacta, OpenRefine or ActiveClean are used for data cleaning needs.
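The same kinds of basic fixes can also be sketched directly with Spark DataFrame operations; the column names, default values, and validity rules below are invented, not a general recipe:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("cleaning-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// invented records with a missing age, a missing amount, and an invalid age
val raw = Seq(
  (Some(25), Some(120.0)),
  (None,     Some(5000.0)),
  (Some(-1), None),
  (Some(41), Some(60.0))
).toDF("age", "amount")

val cleaned = raw
  .na.drop(Seq("amount"))          // remove rows where the amount is missing
  .na.fill(Map("age" -> 0))        // substitute a default for missing ages
  .filter($"age" >= 0)             // discard rows with invalid ages

cleaned.show()
```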

Model Selection

Model selection is done by using data to choose parameters for Transformers and Estimators. This is a critical step in the machine learning pipeline process. Classes like ParamGridBuilder and CrossValidator provide APIs for selecting the ML model.
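A hedged sketch of that API: a ParamGridBuilder enumerates candidate parameter values and a CrossValidator picks the best LogisticRegression model; the small (label, features) dataset and the fold count are invented:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("cv-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// a small invented (label, features) dataset
val data = Seq(
  (1.0, Vectors.dense(0.0, 1.1)), (0.0, Vectors.dense(2.0, 1.0)),
  (1.0, Vectors.dense(0.1, 1.3)), (0.0, Vectors.dense(1.9, 0.8)),
  (1.0, Vectors.dense(0.2, 1.2)), (0.0, Vectors.dense(2.1, 0.9)),
  (1.0, Vectors.dense(0.0, 1.4)), (0.0, Vectors.dense(1.8, 1.0))
).toDF("label", "features")

val lr = new LogisticRegression()

// candidate parameter combinations to try
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.maxIter, Array(10, 50))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(2)                  // kept small only because the dataset is tiny

val bestModel = cv.fit(data).bestModel
```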

ML Pipeline Components

There are seven main components:

  • 1. Datasets
    • Represented by DataFrames: collections of data organized into named columns, which can hold text, feature vectors, true labels, and predictions
  • 2. Pipelines
    • ML workflows are abstracted as pipelines, each made up of a sequence of stages
  • 2.1 Pipeline Stages
    • Stages come in two types: Transformer and Estimator
  • 2.1.1 Transformers
    • An algorithm that transforms one DataFrame into another DataFrame
    • Every Transformer has a transform() method that performs the conversion
  • 2.1.2 Estimators
    • A machine learning algorithm that learns from the data it is given
    • Takes a DataFrame as input and produces a Transformer
    • Responsible for model training
    • For example, LogisticRegression is an estimator that produces a LogisticRegressionModel, which is a transformer (see the sketch after this list)
  • 3. Evaluators
  • 4. Params (and ParamMaps)
    • Machine learning components use a common API for specifying parameters. An example of a parameter is the maximum number of iterations that the model should use.
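A short sketch tying the components together: an Estimator (LogisticRegression) is fit to a DataFrame and yields a Transformer (LogisticRegressionModel), while a ParamMap supplies parameters at fit time; the training data is invented:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("components-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val training = Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (1.0, Vectors.dense(0.0, 1.3, 1.0)),
  (0.0, Vectors.dense(2.1, 1.2, -0.5))
).toDF("label", "features")

val lr = new LogisticRegression()                       // Estimator
val paramMap = ParamMap(lr.maxIter -> 20, lr.regParam -> 0.1)

val model = lr.fit(training, paramMap)                  // returns a LogisticRegressionModel (a Transformer)
model.transform(training).select("features", "prediction").show(false)
```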
