A labeled point is a local vector, either dense or sparse, associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms. A label is stored as a double, so labeled points can be used in both regression and classification. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, ....
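As an illustration of the multiclass convention, the following minimal sketch in plain Scala assigns zero-based class indices, stored as doubles, to a set of class names. The `LabelEncoding` object, its `encode` method, and the class names are hypothetical and are not part of MLlib; sorting the names is just one way to make the assignment deterministic.

```scala
// Hypothetical helper for illustration only; not part of MLlib.
object LabelEncoding {
  /** Assigns each distinct class name a zero-based index stored as a Double. */
  def encode(classNames: Seq[String]): Map[String, Double] =
    classNames.distinct.sorted.zipWithIndex.map {
      case (name, index) => name -> index.toDouble
    }.toMap
}
```

The resulting map can then be used to turn each raw class name into the double label that a labeled point expects.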
Scala
A labeled point is represented by the case class LabeledPoint.
Refer to the LabeledPoint Scala docs for details on the API.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

// Create a labeled point with a negative label and a sparse feature vector.
// Arguments to Vectors.sparse: vector size, indices of nonzero entries, their values.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
Sparse data
It is very common in practice to have sparse training data. MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector using the following format:
label index1:value1 index2:value2 ...
where the indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based.
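To make the index conversion concrete, here is a minimal sketch in plain Scala that parses a single line in this format and shifts the one-based indices to zero-based. `ParsedExample` and `LibSvmLine.parse` are hypothetical names introduced for illustration; this is not how `MLUtils.loadLibSVMFile` is implemented, and it assumes a well-formed, non-empty line.

```scala
// Hypothetical types for illustration only; not part of MLlib.
final case class ParsedExample(label: Double, indices: Vector[Int], values: Vector[Double])

object LibSvmLine {
  /** Parses one line of the form "label index1:value1 index2:value2 ...". */
  def parse(line: String): ParsedExample = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val (indices, values) = tokens.tail.toVector.map { item =>
      val parts = item.split(":", 2)
      // Indices in the file are one-based; subtract 1 to make them zero-based.
      (parts(0).toInt - 1, parts(1).toDouble)
    }.unzip
    // The format requires ascending indices, so reject out-of-order input.
    require(
      indices.zip(indices.drop(1)).forall { case (a, b) => a < b },
      s"indices must be in ascending order: $line"
    )
    ParsedExample(label, indices, values)
  }
}
```

For example, `LibSvmLine.parse("1 1:1.0 3:3.0")` yields the label 1.0, the zero-based indices 0 and 2, and the values 1.0 and 3.0. The line itself does not carry the vector size, so that has to be supplied or inferred separately.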
Scala
MLUtils.loadLibSVMFile reads training examples stored in LIBSVM format.
Refer to the MLUtils Scala docs for details on the API.
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

// sc is an existing SparkContext (for example, the one provided by spark-shell).
val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")