Spark Machine Learning (3): Labeled Point (Data Types)

Labeled point

A labeled point is a local vector, either dense or sparse, associated with a label (also called a response). In MLlib, labeled points are used in supervised learning algorithms. The label is stored as a double, so labeled points can be used in both regression and classification. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, ....
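
As a small illustration of the multiclass rule above, the sketch below maps class names to zero-based indices stored as doubles. The object name `LabelEncoding`, the method `encode`, and the class names are hypothetical and chosen only for this example. The sketch uses plain Scala and is not part of the MLlib API.

```scala
object LabelEncoding {
  // Assign each distinct class name a zero-based index, stored as a Double
  // because labeled points keep their label as a double.
  // Sorting makes the assignment deterministic across runs.
  def encode(classes: Seq[String]): Map[String, Double] =
    classes.distinct.sorted.zipWithIndex.map { case (name, i) =>
      name -> i.toDouble
    }.toMap

  def main(args: Array[String]): Unit = {
    // Hypothetical class names, used for illustration only.
    val labelOf = encode(Seq("setosa", "versicolor", "virginica", "setosa"))
    labelOf.toSeq.sortBy(_._2).foreach { case (name, label) =>
      println(s"$name -> $label")
    }
  }
}
```

The resulting doubles can then be used as the label argument when building labeled points for a multiclass problem.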

Scala

A labeled point is represented by the case class LabeledPoint.

Refer to the LabeledPoint Scala docs for details on the API.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

// Create a labeled point with a negative label and a sparse feature vector.
// The sparse vector has size 3, with values 1.0 and 3.0 at indices 0 and 2.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
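
Since LabeledPoint is a case class, its `label` and `features` fields can be read directly. The following sketch rebuilds the two points from the example above and compares their feature values after expanding both vectors to arrays. The object name `LabeledPointFields` is hypothetical, and the code assumes the Spark MLlib library is on the classpath.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object LabeledPointFields {
  val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
  val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

  // toArray expands a vector (dense or sparse) into a plain Array[Double],
  // filling in zeros for entries a sparse vector does not store.
  val posValues: Array[Double] = pos.features.toArray
  val negValues: Array[Double] = neg.features.toArray

  def main(args: Array[String]): Unit = {
    println(s"pos: label=${pos.label}, size=${pos.features.size}")
    println(s"neg: label=${neg.label}, size=${neg.features.size}")
    // The dense and sparse vectors here encode the same three values.
    println(s"same feature values: ${posValues.sameElements(negValues)}")
  }
}
```

The two points differ only in label and storage layout. The sparse form saves space when most entries are zero.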

Sparse data

It is very common in practice to have sparse training data. MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector using the following format:

label index1:value1 index2:value2 ...

where the indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based.
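
To make the index conversion concrete, here is a minimal plain-Scala sketch that parses a single line in this format and shifts the one-based indices to zero-based. It is not MLlib's own parser. The names `LibSvmLine`, `Parsed`, and `parse` are hypothetical, and blank lines or malformed tokens are not handled.

```scala
object LibSvmLine {
  // Parsed form of one line: the label, zero-based indices, and their values.
  final case class Parsed(label: Double, indices: Array[Int], values: Array[Double])

  def parse(line: String): Parsed = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val pairs = tokens.tail.map { token =>
      val Array(index, value) = token.split(":")
      // Indices in the file are one-based; shift them to zero-based.
      (index.toInt - 1, value.toDouble)
    }
    val indices = pairs.map(_._1)
    // The format expects ascending indices; reject lines that violate this.
    require(
      indices.sliding(2).forall(w => w.length < 2 || w(0) < w(1)),
      "indices must be in ascending order")
    Parsed(label, indices, pairs.map(_._2))
  }

  def main(args: Array[String]): Unit = {
    val parsed = parse("1 1:1.0 3:3.0")
    println(s"label=${parsed.label}")
    println(s"indices=${parsed.indices.mkString(",")}")
    println(s"values=${parsed.values.mkString(",")}")
  }
}
```

For the line `1 1:1.0 3:3.0`, the file indices 1 and 3 become 0 and 2, which matches the sparse vector used in the labeled point example earlier.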

Scala

MLUtils.loadLibSVMFile reads training examples stored in LIBSVM format.

Refer to the MLUtils Scala docs for details on the API.

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

// `sc` is an existing SparkContext, such as the one provided by spark-shell.
val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
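
Once loaded, the examples form an ordinary RDD of labeled points, so the usual RDD operations apply. The sketch below is one possible way to inspect such a dataset. The object name `InspectLibSvm`, the helper `labelCounts`, the application name, and the local master setting are assumptions made for illustration. The file path is the one from the example above and is assumed to be resolvable from the working directory.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

object InspectLibSvm {
  // Count how many examples carry each label.
  def labelCounts(examples: RDD[LabeledPoint]): scala.collection.Map[Double, Long] =
    examples.map(_.label).countByValue()

  def main(args: Array[String]): Unit = {
    // A local master and this app name are illustrative choices only.
    val conf = new SparkConf().setAppName("InspectLibSvm").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val examples: RDD[LabeledPoint] =
        MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

      val first = examples.first()
      println(s"number of examples: ${examples.count()}")
      println(s"first label: ${first.label}, feature dimension: ${first.features.size}")

      labelCounts(examples).foreach { case (label, n) =>
        println(s"label $label: $n examples")
      }
    } finally {
      sc.stop()
    }
  }
}
```

In spark-shell, the SparkContext setup can be skipped, because `sc` is already defined there.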

