Spark Machine Learning (Part 1): Machine Learning Library (MLlib)

MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.

It is divided into two packages:

  • spark.mllib contains the original API built on top of RDDs.

  • spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.

Using spark.ml is recommended because the DataFrame-based API is more versatile and flexible. However, spark.mllib will continue to be supported alongside the development of spark.ml. Users can comfortably rely on spark.mllib features and can expect more features to come. Developers should contribute new algorithms to spark.ml if they fit the ML pipeline concept well, e.g., feature extractors and transformers.
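To make the "ML pipeline concept" a little more concrete, here is a minimal plain-Python sketch of the idea: each stage takes rows in and returns rows with a new column, and a pipeline simply chains the stages. The names `SimpleTokenizer`, `TokenCounter`, and `SimplePipeline` are hypothetical stand-ins invented for this illustration. They are not Spark classes, and the lists of dicts only imitate the role a DataFrame plays in spark.ml.

```python
from dataclasses import dataclass
from typing import Dict, List

Row = Dict[str, object]  # stand-in for one DataFrame row (assumption, not a Spark type)


@dataclass
class SimpleTokenizer:
    """Hypothetical feature extractor: splits a text column into lowercase tokens."""
    input_col: str
    output_col: str

    def transform(self, rows: List[Row]) -> List[Row]:
        return [
            {**row, self.output_col: str(row[self.input_col]).lower().split()}
            for row in rows
        ]


@dataclass
class TokenCounter:
    """Hypothetical transformer: adds a column holding the number of tokens."""
    input_col: str
    output_col: str

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**row, self.output_col: len(row[self.input_col])} for row in rows]


@dataclass
class SimplePipeline:
    """Chains stages: the output of one stage is the input of the next."""
    stages: list

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


pipeline = SimplePipeline(
    stages=[
        SimpleTokenizer(input_col="text", output_col="tokens"),
        TokenCounter(input_col="tokens", output_col="n_tokens"),
    ]
)
result = pipeline.transform([{"text": "Spark makes ML easy"}])
print(result)
```

An algorithm "fits the pipeline concept" in this sense when it can be expressed as such a stage: it reads named input columns and appends named output columns, so it can be composed with other stages without special glue code.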

The major functionality of both packages is listed below, with links to the detailed guides.

spark.mllib: data types, algorithms, utilities

  • Data types

  • Basic statistics

    • summary statistics

    • correlations

    • stratified sampling

    • hypothesis testing

    • streaming significance testing

    • random data generation

  • Classification and regression

    • linear models (SVMs, logistic regression, linear regression)

    • naive Bayes

    • decision trees

    • ensembles of trees (Random Forests and Gradient-Boosted Trees)

    • isotonic regression

  • Collaborative filtering

    • alternating least squares (ALS)

  • Clustering

    • k-means

    • Gaussian mixture

    • power iteration clustering (PIC)

    • latent Dirichlet allocation (LDA)

    • bisecting k-means

    • streaming k-means

  • Dimensionality reduction

    • singular value decomposition (SVD)

    • principal component analysis (PCA)

  • Feature extraction and transformation

  • Frequent pattern mining

    • FP-growth

    • association rules

    • PrefixSpan

  • Evaluation metrics

  • PMML model export

  • Optimization (developer)

    • stochastic gradient descent

    • limited-memory BFGS (L-BFGS)
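As a rough illustration of what an optimization primitive such as stochastic gradient descent does, the sketch below fits a straight line to a few points in plain Python. It does not use Spark or reproduce MLlib's implementation. The function name `sgd_linear_regression`, the learning rate, the epoch count, and the toy data are all assumptions chosen for this example.

```python
import random


def sgd_linear_regression(points, learning_rate=0.1, epochs=500, seed=0):
    """Fit y ~ w * x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    data = list(points)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a fresh random order
        for x, y in data:
            error = (w * x + b) - y
            # Gradient of 0.5 * error**2 with respect to w and b,
            # computed from a single sample (the "stochastic" part)
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b


# Noise-free toy samples from the line y = 2x + 1 (illustrative data only)
samples = [(i / 10, 2 * (i / 10) + 1) for i in range(10)]
w, b = sgd_linear_regression(samples)
print(f"w={w:.3f}, b={b:.3f}")
```

Each update looks at a single sample instead of the whole dataset, which is what makes this family of methods attractive for large, distributed data.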


