Spark MLlib (1): MLlib is Apache Spark's scalable machine learning library.

Ease of Use

Usable in Java, Scala, Python, and R.

MLlib fits into Spark's APIs and interoperates with NumPy in Python (as of Spark 0.9) and R libraries (as of Spark 1.5). You can use any Hadoop data source (e.g. HDFS, HBase, or local files), making it easy to plug into Hadoop workflows.

Performance

High-quality algorithms, up to 100x faster than MapReduce.

Spark excels at iterative computation, enabling MLlib to run fast. At the same time, we care about algorithmic performance: MLlib contains high-quality algorithms that leverage iteration, and can yield better results than the one-pass approximations sometimes used on MapReduce.
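
To make the point about iteration concrete, here is a minimal sketch in plain Python (not Spark code, and not MLlib's implementation). It fits a one-feature logistic regression by gradient descent and compares the loss after a single pass over the data with the loss after many passes. The data set, the learning rate, and the function names are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(data, w, b):
    """Average logistic loss of the model (w, b) over the data set."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def train(data, passes, lr=0.5):
    """Batch gradient descent: every pass re-reads the whole data set."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(passes):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in data) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical toy data: (feature, label) pairs.
data = [(-2.0, 0), (-1.5, 0), (-1.0, 0), (-0.5, 0),
        (0.5, 1), (1.0, 1), (1.5, 1), (2.0, 1)]

loss_one = log_loss(data, *train(data, passes=1))
loss_many = log_loss(data, *train(data, passes=100))
print("loss after 1 pass:", loss_one)
print("loss after 100 passes:", loss_many)
```

Each pass reads the same data again, so an engine that can keep the data set in memory between passes avoids reloading it on every iteration.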

Runs Everywhere

Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, against diverse data sources.

You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Access data in HDFS, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.

Algorithms

MLlib contains many algorithms and utilities.

ML algorithms include:

  • Classification: logistic regression, naive Bayes,...
  • Regression: generalized linear regression, survival regression,...
  • Decision trees, random forests, and gradient-boosted trees
  • Recommendation: alternating least squares (ALS)
  • Clustering: K-means, Gaussian mixtures (GMMs),...
  • Topic modeling: latent Dirichlet allocation (LDA)
  • Frequent itemsets, association rules, and sequential pattern mining

ML workflow utilities include:

  • Feature transformations: standardization, normalization, hashing,...
  • ML Pipeline construction
  • Model evaluation and hyper-parameter tuning
  • ML persistence: saving and loading models and Pipelines

Other utilities include:

  • Distributed linear algebra: SVD, PCA,...
  • Statistics: summary statistics, hypothesis testing,...

Community

MLlib is developed as part of the Apache Spark project. It thus gets tested and updated with each Spark release.

If you have questions about the library, ask on the Spark mailing lists.

MLlib is still a rapidly growing project and welcomes contributions. If you'd like to submit an algorithm to MLlib, read how to contribute to Spark and send us a patch!

Getting Started

To get started with MLlib:

  • Download Spark. MLlib is included as a module.
  • Read the MLlib guide, which includes various usage examples.
  • Learn how to deploy Spark on a cluster if you'd like to run in distributed mode. You can also run locally on a multicore machine without any setup.