This section covers algorithms for working with features, roughly divided into these groups:
Extraction: extracting features from raw data
Transformation: scaling, converting, or modifying features
Selection: selecting a subset from a larger set of features
Term Frequency-Inverse Document Frequency (TF-IDF) is a common text pre-processing step. In Spark ML, TF-IDF is separated into two parts: TF (+hashing) and IDF.
TF: HashingTF is a Transformer which takes sets of terms and converts those sets into fixed-length feature vectors. In text processing, a “set of terms” might be a bag of words. The algorithm combines Term Frequency (TF) counts with the hashing trick for dimensionality reduction.
IDF: IDF is an Estimator which fits on a dataset and produces an IDFModel. The IDFModel takes feature vectors (generally created from HashingTF) and scales each column. Intuitively, it down-weights columns which appear frequently in a corpus.
Please refer to the MLlib user guide on TF-IDF for more details on Term Frequency and Inverse Document Frequency.
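The hashing step can be illustrated without Spark. The following is a minimal sketch in plain Scala, not the actual HashingTF implementation: the object HashingTfSketch and its methods are hypothetical names, and String.hashCode stands in for whatever hash function the library uses. It shows how terms are mapped to a fixed number of buckets, and why two different terms can end up in the same bucket.

```scala
object HashingTfSketch {
  // Map a term to a bucket in [0, numFeatures). The hash may be negative,
  // so the remainder is shifted into the non-negative range.
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Count how many terms fall into each bucket (sparse form: bucket -> count).
  // Distinct terms that share a bucket are counted together (a collision).
  def termFrequencies(terms: Seq[String], numFeatures: Int): Map[Int, Double] =
    terms.groupBy(t => bucket(t, numFeatures)).map { case (i, ts) => i -> ts.size.toDouble }
}
```

A smaller numFeatures gives shorter vectors but more collisions, which is the trade-off behind the hashing trick.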
In the following code segment, we start with a set of sentences. We split each sentence into words using Tokenizer. For each sentence (bag of words), we use HashingTF to hash the sentence into a feature vector. We use IDF to rescale the feature vectors; this generally improves performance when using text as features. Our feature vectors could then be passed to a learning algorithm.
Note: the same approach could be applied to URLs: split each URL into terms, hash the terms into a feature vector with HashingTF, rescale the vectors with IDF, and pass the result to a learning algorithm.
Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
Refer to the HashingTF Scala docs and the IDF Scala docs for more details on the API.
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
val sentenceData = sqlContext.createDataFrame(Seq(
(0, "Hi I heard about Spark"),
(0, "I wish Java could use case classes"),
(1, "Logistic regression models are neat")
)).toDF("label", "sentence")
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
//[0,Hi I heard about Spark,WrappedArray(hi, i, heard, about, spark)]
//[0,I wish Java could use case classes,WrappedArray(i, wish, java, could, use, case, classes)]
//[1,Logistic regression models are neat,WrappedArray(logistic, regression, models, are, neat)]
val wordsData = tokenizer.transform(sentenceData)
val hashingTF = new HashingTF()
.setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
//.setNumFeatures(20)
//[0,Hi I heard about Spark,WrappedArray(hi, i, heard, about, spark),(20,[0,5,9,17],[1.0,1.0,1.0,2.0])] // only four indices for five words because of a hash collision
//[0,I wish Java could use case classes,WrappedArray(i, wish, java, could, use, case, classes),(20,[2,7,9,13,15],[1.0,1.0,3.0,1.0,1.0])] // the count 3.0 is also caused by hash collisions
//[1,Logistic regression models are neat,WrappedArray(logistic, regression, models, are, neat),(20,[4,6,13,15,18],[1.0,1.0,1.0,1.0,1.0])]
// Output with the default numFeatures (262144):
//[0,Hi I heard about Spark,WrappedArray(hi, i, heard, about, spark),(262144,[24417,49304,73197,91137,234657],[1.0,1.0,1.0,1.0,1.0])]
//[0,I wish Java could use case classes,WrappedArray(i, wish, java, could, use, case, classes),(262144,[20719,24417,55551,116873,147765,162369,192310],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])]
//[1,Logistic regression models are neat,WrappedArray(logistic, regression, models, are, neat),(262144,[13671,91006,132713,167122,190884],[1.0,1.0,1.0,1.0,1.0])]
val featurizedData = hashingTF.transform(wordsData)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
//[(262144,[24417,49304,73197,91137,234657],[0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453]),0]
//[(262144,[20719,24417,55551,116873,147765,162369,192310],[0.6931471805599453,0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453]),0]
//[(262144,[13671,91006,132713,167122,190884],[0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453]),1]
rescaledData.select("features", "label").take(3).foreach(println)
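The weights in the output comments above agree with the smoothed IDF formula given in the MLlib user guide, log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term. A small plain-Scala check reproduces the two distinct values; the object name IdfSketch is hypothetical and this is only a sketch of the formula, not the IDFModel code.

```scala
object IdfSketch {
  // Smoothed inverse document frequency: log((m + 1) / (df + 1)).
  def idf(numDocs: Long, docFreq: Long): Double =
    math.log((numDocs + 1.0) / (docFreq + 1.0))
}

// With three documents in total:
// IdfSketch.idf(3, 2) is the weight of a term found in two documents (here "i"),
// IdfSketch.idf(3, 1) is the weight of a term found in a single document.
```

Since every term frequency in the default-numFeatures output is 1.0, the rescaled value of each column equals its IDF weight: about 0.2877 for "i" and about 0.6931 for the other terms.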
Word2Vec computes distributed vector representations of words. The main advantage of the distributed representations is that similar words are close in the vector space, which makes generalization to novel patterns easier and model estimation more robust. Distributed vector representation has been shown to be useful in many natural language processing applications such as named entity recognition, disambiguation, parsing, tagging and machine translation.
Note: named entity recognition (NER) identifies named entities in a corpus, such as the names of people, places, and organizations. Tagging, also called part-of-speech (POS) tagging, is the process of classifying words into their parts of speech and labeling them accordingly; parts of speech are also known as word classes or lexical categories, and the collection of tags used for a particular task is known as a tagset.
Word2Vec is an Estimator which takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel transforms each document into a vector using the average of all words in the document; this vector can then be used as features for prediction, document similarity calculations, etc. Please refer to the MLlib user guide on Word2Vec for more details.
In the following code segment, we start with a set of documents, each of which is represented as a sequence of words. For each document, we transform it into a feature vector. This feature vector could then be passed to a learning algorithm.
import org.apache.spark.ml.feature.Word2Vec
// Input data: Each row is a bag of words from a sentence or document.
//[WrappedArray(Hi, I, heard, about, Spark)]
//[WrappedArray(I, wish, Java, could, use, case, classes)]
//[WrappedArray(Logistic, regression, models, are, neat)]
val documentDF = sqlContext.createDataFrame(Seq(
"Hi I heard about Spark".split(" "),
"I wish Java could use case classes".split(" "),
"Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")
// Learn a mapping from words to Vectors.
val word2Vec = new Word2Vec()
.setInputCol("text")
.setOutputCol("result")
.setVectorSize(3)
.setMinCount(0)
val model = word2Vec.fit(documentDF)
//[WrappedArray(Hi, I, heard, about, Spark),[-0.028139343485236168,0.04554025698453188,-0.013317196490243079]]
//[WrappedArray(I, wish, Java, could, use, case, classes),[0.06872416580361979,-0.02604914902310286,0.02165239889706884]]
//[WrappedArray(Logistic, regression, models, are, neat),[0.023467857390642166,0.027799883112311366,0.0331136979162693]]
val result = model.transform(documentDF)
result.select("result").take(3).foreach(println)
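The averaging step that turns word vectors into a document vector can be sketched in plain Scala. This is an illustration under stated assumptions, not the Word2VecModel code: DocVectorSketch and the embeddings map are hypothetical, unknown words are simply skipped, and an empty document maps to a zero vector.

```scala
object DocVectorSketch {
  // Average the vectors of the words that have an embedding.
  // Assumption: words without an embedding are ignored, and a document
  // with no known words yields a vector of zeros.
  def average(words: Seq[String], embeddings: Map[String, Array[Double]], size: Int): Array[Double] = {
    val known = words.flatMap(w => embeddings.get(w))
    val sum = Array.fill(size)(0.0)
    for (v <- known; i <- 0 until size) sum(i) += v(i)
    if (known.isEmpty) sum else sum.map(_ / known.size)
  }
}
```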
CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents to vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be used as an Estimator to extract the vocabulary and generate a CountVectorizerModel. The model produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA.
During the fitting process, CountVectorizer will select the top vocabSize words ordered by term frequency across the corpus. An optional parameter “minDF” also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary.
Note: an a-priori dictionary is one that is known in advance rather than extracted from the corpus.
Examples
Assume that we have the following DataFrame with columns id and texts:
id | texts
----|----------
0 | Array("a", "b", "c")
1 | Array("a", "b", "b", "c", "a")
Each row in texts is a document of type Array[String]. Invoking fit of CountVectorizer produces a CountVectorizerModel with vocabulary (a, b, c), where a, b, and c are assigned indices 0, 1, and 2. The output column “vector” after transformation then contains:
id | texts | vector
----|---------------------------------|---------------
0 | Array("a", "b", "c") | (3,[0,1,2],[1.0,1.0,1.0])
1 | Array("a", "b", "b", "c", "a") | (3,[0,1,2],[2.0,2.0,1.0])
Each vector represents the token counts of the document over the vocabulary.
Refer to the CountVectorizer Scala docs and the CountVectorizerModel Scala docs for more details on the API.
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
val df = sqlContext.createDataFrame(Seq(
(0, Array("a", "b", "c")),
(1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")
// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
.setInputCol("words")
.setOutputCol("features")
.setVocabSize(3)
.setMinDF(2)
.fit(df)
// alternatively, define CountVectorizerModel with a-priori vocabulary
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
.setInputCol("words")
.setOutputCol("features")
cvModel.transform(df).select("features").show()
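The fitting and counting steps can be sketched in plain Scala for the two example documents. CountVectorizerSketch is a hypothetical helper, not the library implementation; in particular, breaking frequency ties alphabetically is an assumption made here so that the result is deterministic.

```scala
object CountVectorizerSketch {
  // Build the vocabulary: keep terms that occur in at least minDF documents,
  // order them by corpus-wide term frequency, and keep the top vocabSize.
  // Assumption: ties in frequency are broken alphabetically.
  def fitVocabulary(docs: Seq[Seq[String]], vocabSize: Int, minDF: Int): Vector[String] = {
    val termFreq = docs.flatten.groupBy(identity).map { case (t, ts) => t -> ts.size }
    val docFreq = docs.flatMap(_.distinct).groupBy(identity).map { case (t, ts) => t -> ts.size }
    termFreq.toSeq
      .filter { case (t, _) => docFreq(t) >= minDF }
      .sortBy { case (t, n) => (-n, t) }
      .take(vocabSize)
      .map(_._1)
      .toVector
  }

  // Count the occurrences of every vocabulary term in one document.
  def transform(doc: Seq[String], vocabulary: Vector[String]): Vector[Double] =
    vocabulary.map(t => doc.count(_ == t).toDouble)
}
```

For the documents (a, b, c) and (a, b, b, c, a) this gives the vocabulary (a, b, c) and the count vectors (1, 1, 1) and (2, 2, 1), matching the table in the example.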
Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality. The example below shows how to split sentences into sequences of words.
RegexTokenizer allows more advanced tokenization based on regular expression (regex) matching. By default, the parameter “pattern” (regex, default: \s+) is used as delimiters to split the input text. Alternatively, users can set parameter “gaps” to false indicating the regex “pattern” denotes “tokens” rather than splitting gaps, and find all matching occurrences as the tokenization result.
Note: the gaps parameter controls whether the regex splits on gaps (true) or matches tokens (false).
import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}
//[0,Hi I heard about Spark]
//[1,I wish Java could use case classes]
//[2,Logistic,regression,models,are,neat]
val sentenceDataFrame = sqlContext.createDataFrame(Seq(
(0, "Hi I heard about Spark"),
(1, "I wish Java could use case classes"),
(2, "Logistic,regression,models,are,neat")
)).toDF("label", "sentence")
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val regexTokenizer = new RegexTokenizer()
.setInputCol("sentence")
.setOutputCol("words")
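.setPattern(" ").setGaps(false) // gaps = false: the pattern matches tokens, so each token here is a single space; use .setPattern("\\w+") to match words instead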
val tokenized = tokenizer.transform(sentenceDataFrame)
//[WrappedArray(hi, i, heard, about, spark),0]
//[WrappedArray(i, wish, java, could, use, case, classes),1]
//[WrappedArray(logistic,regression,models,are,neat),2]
tokenized.select("words", "label").take(3).foreach(println)
val regexTokenized = regexTokenizer.transform(sentenceDataFrame)
//[WrappedArray( , , , ),0]
//[WrappedArray( , , , , , ),1]
//[WrappedArray(),2]
regexTokenized.select("words", "label").take(3).foreach(println)
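The two meanings of the pattern can be compared in plain Scala. This is a sketch of the idea only, with a hypothetical helper RegexTokenizeSketch; lowercasing the tokens and dropping empty ones are assumptions made to mirror the outputs shown above.

```scala
object RegexTokenizeSketch {
  // gaps = true: the pattern describes the separators between tokens.
  // gaps = false: the pattern describes the tokens themselves.
  def tokenize(text: String, pattern: String, gaps: Boolean): List[String] = {
    val re = pattern.r
    val tokens = if (gaps) re.split(text).toList else re.findAllIn(text).toList
    tokens.filter(_.nonEmpty).map(_.toLowerCase)
  }
}
```

With the pattern " " and gaps = false, every match is a single space, which explains why the regexTokenized output above contains only blanks.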
Stop words are words which should be excluded from the input, typically because the words appear frequently and don’t carry as much meaning.
StopWordsRemover takes as input a sequence of strings (e.g. the output of a Tokenizer) and drops all the stop words from the input sequences. The list of stopwords is specified by the stopWords parameter. We provide a list of stop words by default, accessible by calling getStopWords on a newly instantiated StopWordsRemover instance. A boolean parameter caseSensitive indicates if the matches should be case sensitive (false by default).
Examples
Assume that we have the following DataFrame with columns id and raw:
id | raw
----|----------
0 | [I, saw, the, red, baloon]
1 | [Mary, had, a, little, lamb]
Applying StopWordsRemover with raw as the input column and filtered as the output column, we should get the following:
id | raw | filtered
----|-----------------------------|--------------------
0 | [I, saw, the, red, baloon] | [saw, red, baloon]
1 | [Mary, had, a, little, lamb] | [Mary, little, lamb]
In filtered, the stop words “I”, “the”, “had”, and “a” have been filtered out.
import org.apache.spark.ml.feature.StopWordsRemover
val remover = new StopWordsRemover()
.setInputCol("raw")
.setOutputCol("filtered")
val dataSet = sqlContext.createDataFrame(Seq(
(0, Seq("I", "saw", "the", "red", "baloon")),
(1, Seq("Mary", "had", "a", "little", "lamb"))
)).toDF("id", "raw")
remover.transform(dataSet).show()
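The filtering itself is simple enough to sketch in plain Scala. StopWordsSketch is a hypothetical helper and the four-word stop list is only the subset needed for this example, not the default list shipped with Spark.

```scala
object StopWordsSketch {
  // Drop every token that appears in the stop list.
  // With caseSensitive = false, both sides are compared in lower case.
  def remove(tokens: Seq[String], stopWords: Set[String], caseSensitive: Boolean = false): Seq[String] =
    if (caseSensitive) tokens.filterNot(stopWords.contains)
    else {
      val lowered = stopWords.map(_.toLowerCase)
      tokens.filterNot(t => lowered.contains(t.toLowerCase))
    }
}
```

With the stop list (i, the, had, a), the token "I" is removed in the default case-insensitive mode but kept when caseSensitive is true.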
An n-gram is a sequence of n tokens (typically words) for some integer n. The NGram class can be used to transform input features into n-grams.
NGram takes as input a sequence of strings (e.g. the output of a Tokenizer). The parameter n is used to determine the number of terms in each n-gram. The output will consist of a sequence of n-grams where each n-gram is represented by a space-delimited string of n consecutive words. If the input sequence contains fewer than n strings, no output is produced.
Refer to the NGram Scala docs for more details on the API.
import org.apache.spark.ml.feature.NGram
//[0,WrappedArray(Hi, I, heard, about, Spark)]
//[1,WrappedArray(I, wish, Java, could, use, case, classes)]
//[2,WrappedArray(Logistic, regression, models, are, neat)]
val wordDataFrame = sqlContext.createDataFrame(Seq(
(0, Array("Hi", "I", "heard", "about", "Spark")),
(1, Array("I", "wish", "Java", "could", "use", "case", "classes")),
(2, Array("Logistic", "regression", "models", "are", "neat"))
)).toDF("label", "words")
val ngram = new NGram().setInputCol("words").setOutputCol("ngrams")
//[0,WrappedArray(Hi, I, heard, about, Spark),WrappedArray(Hi I, I heard, heard about, about Spark)]
//[1,WrappedArray(I, wish, Java, could, use, case, classes),WrappedArray(I wish, wish Java, Java could, could use, use case, case classes)]
//[2,WrappedArray(Logistic, regression, models, are, neat),WrappedArray(Logistic regression, regression models, models are, are neat)]
val ngramDataFrame = ngram.transform(wordDataFrame)
//List(Hi I, I heard, heard about, about Spark)
//List(I wish, wish Java, Java could, could use, use case, case classes)
//List(Logistic regression, regression models, models are, are neat)
ngramDataFrame.take(3).map(_.getAs[Seq[String]]("ngrams").toList).foreach(println)
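The same windows can be produced in plain Scala with sliding, which may help when checking the expected output by hand. NGramSketch is a hypothetical helper, not the NGram transformer itself.

```scala
object NGramSketch {
  // Slide a window of n words over the input and join each full window
  // with spaces. Windows shorter than n (input too short) are discarded.
  def ngrams(words: Seq[String], n: Int): List[String] =
    words.sliding(n).filter(_.size == n).map(_.mkString(" ")).toList
}
```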
Binarization is the process of thresholding numerical features to binary (0/1) features.
Binarizer takes the common parameters inputCol and outputCol, as well as the threshold for binarization. Feature values greater than the threshold are binarized to 1.0; values equal to or less than the threshold are binarized to 0.0. For example, with a threshold of 0.5, 0.8 becomes 1.0 while 0.1 and 0.2 become 0.0.
import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.sql.DataFrame
val data = Array((0, 0.1), (1, 0.8), (2, 0.2))
val dataFrame: DataFrame = sqlContext.createDataFrame(data).toDF("label", "feature")
val binarizer: Binarizer = new Binarizer()
.setInputCol("feature")
.setOutputCol("binarized_feature")
.setThreshold(0.5)
val binarizedDataFrame = binarizer.transform(dataFrame)
val binarizedFeatures = binarizedDataFrame.select("binarized_feature")
binarizedFeatures.collect().foreach(println)
PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. A PCA class trains a model to project vectors to a low-dimensional space using PCA. The example below shows how to project 5-dimensional feature vectors into 3-dimensional principal components.
import org.apache.spark.ml.feature.PCA
import org.apache.spark.mllib.linalg.Vectors
val data = Array(
Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),//(5, [1, 3], [1.0, 7.0])
Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
val pca = new PCA()
.setInputCol("features")
.setOutputCol("pcaFeatures")
.setK(3)//The number of principal components
.fit(df)
val pcaDF = pca.transform(df)
val result = pcaDF.select("pcaFeatures")
//[(5,[1,3],[1.0,7.0]),[1.6485728230883807,-4.013282700516296,-5.524543751369388]]
//[[2.0,0.0,3.0,4.0,5.0],[-4.645104331781534,-1.1167972663619026,-5.524543751369387]]
//[[4.0,0.0,0.0,6.0,7.0],[-6.428880535676489,-5.337951427775355,-5.524543751369389]]
result.show()
Polynomial expansion is the process of expanding your features into a polynomial space, which is formulated by an n-degree combination of original dimensions. A PolynomialExpansion class provides this functionality. The example below shows how to expand your features into a 3-degree polynomial space.
import org.apache.spark.ml.feature.PolynomialExpansion
import org.apache.spark.mllib.linalg.Vectors
val data = Array(
Vectors.dense(-2.0, 2.3),
Vectors.dense(0.0, 0.0),
Vectors.dense(0.6, -1.1)
)
val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
val polynomialExpansion = new PolynomialExpansion()
.setInputCol("features")
.setOutputCol("polyFeatures")
.setDegree(3)
val polyDF = polynomialExpansion.transform(df)
polyDF.select("polyFeatures").take(3).foreach(println)
The Discrete Cosine Transform transforms a length N real-valued sequence in the time domain into another length N real-valued sequence in the frequency domain. A DCT class provides this functionality, implementing the DCT-II and scaling the result by $1/\sqrt{2}$ such that the representing matrix for the transform is unitary. No shift is applied to the transformed sequence (e.g. the 0th element of the transformed sequence is the 0th DCT coefficient and not the N/2th).
import org.apache.spark.ml.feature.DCT
import org.apache.spark.mllib.linalg.Vectors
val data = Seq(
Vectors.dense(0.0, 1.0, -2.0, 3.0),
Vectors.dense(-1.0, 2.0, 4.0, -7.0),
Vectors.dense(14.0, -2.0, -5.0, 1.0))
val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
val dct = new DCT()
.setInputCol("features")
.setOutputCol("featuresDCT")
.setInverse(false)
val dctDf = dct.transform(df)
dctDf.select("featuresDCT").show(3)
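For reference, a unitary DCT-II can be written directly from its definition. The sketch below uses one common orthonormal form, in which coefficient k is scaled by sqrt(2/N) and the 0th coefficient additionally by 1/sqrt(2); that this matches the DCT class term for term is an assumption, and DctSketch is a hypothetical name.

```scala
object DctSketch {
  // Orthonormal DCT-II:
  //   X_k = s_k * sum_i x_i * cos(pi / N * (i + 0.5) * k)
  // with s_0 = sqrt(1 / N) and s_k = sqrt(2 / N) for k > 0.
  def dct2(x: Array[Double]): Array[Double] = {
    val n = x.length
    Array.tabulate(n) { k =>
      val sum = (0 until n).map(i => x(i) * math.cos(math.Pi / n * (i + 0.5) * k)).sum
      val scale = if (k == 0) math.sqrt(1.0 / n) else math.sqrt(2.0 / n)
      scale * sum
    }
  }
}
```

Because the transform is unitary, the sum of squares of the output equals the sum of squares of the input, which gives a quick sanity check.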
StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies. So the most frequent label gets index 0. If the input column is numeric, we cast it to string and index the string values. When downstream pipeline components such as Estimator or Transformer make use of this string-indexed label, you must set the input column of the component to this string-indexed column name. In many cases, you can set the input column with setInputCol.
Examples
Assume that we have the following DataFrame with columns id and category:
id | category
----|----------
0 | a
1 | b
2 | c
3 | a
4 | a
5 | c
category is a string column with three labels: “a”, “b”, and “c”. Applying StringIndexer with category as the input column and categoryIndex as the output column, we should get the following:
id | category | categoryIndex
----|----------|---------------
0 | a | 0.0
1 | b | 2.0
2 | c | 1.0
3 | a | 0.0
4 | a | 0.0
5 | c | 1.0
“a” gets index 0 because it is the most frequent, followed by “c” with index 1 and “b” with index 2.
Additionally, there are two strategies regarding how StringIndexer will handle unseen labels when you have fit a StringIndexer on one dataset and then use it to transform another: throw an exception (the default), or skip the rows containing the unseen label.
Examples
Let’s go back to our previous example but this time reuse our previously defined StringIndexer on the following dataset:
id | category
----|----------
0 | a
1 | b
2 | c
3 | d
If you’ve not set how StringIndexer handles unseen labels or set it to “error”, an exception will be thrown. However, if you had called setHandleInvalid("skip"), the following dataset will be generated:
id | category | categoryIndex
----|----------|---------------
0 | a | 0.0
1 | b | 2.0
2 | c | 1.0
Notice that the row containing “d” does not appear.
Refer to the StringIndexer Scala docs for more details on the API.
import org.apache.spark.ml.feature.StringIndexer
val df = sqlContext.createDataFrame(
Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
).toDF("id", "category")
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("categoryIndex")
val indexed = indexer.fit(df).transform(df)
indexed.show()
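The frequency-based ordering can be reproduced in a few lines of plain Scala. StringIndexerSketch is a hypothetical helper; breaking ties between equally frequent labels alphabetically is an assumption of this sketch.

```scala
object StringIndexerSketch {
  // Assign index 0 to the most frequent label, 1 to the next, and so on.
  // Assumption: labels with equal frequency are ordered alphabetically.
  def fit(labels: Seq[String]): Map[String, Double] =
    labels.groupBy(identity).toSeq
      .sortBy { case (label, occurrences) => (-occurrences.size, label) }
      .map(_._1)
      .zipWithIndex
      .map { case (label, i) => label -> i.toDouble }
      .toMap
}
```

For the category column of the example (a, b, c, a, a, c) this yields a -> 0.0, c -> 1.0 and b -> 2.0, as in the table.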
Symmetrically to StringIndexer, IndexToString maps a column of label indices back to a column containing the original labels as strings. The common use case is to produce indices from labels with StringIndexer, train a model with those indices and retrieve the original labels from the column of predicted indices with IndexToString. However, you are free to supply your own labels.
Examples
Building on the StringIndexer example, let’s assume we have the following DataFrame with columns id and categoryIndex:
id | categoryIndex
----|---------------
0 | 0.0
1 | 2.0
2 | 1.0
3 | 0.0
4 | 0.0
5 | 1.0
Applying IndexToString with categoryIndex as the input column and originalCategory as the output column, we are able to retrieve our original labels (they will be inferred from the columns’ metadata):
id | categoryIndex | originalCategory
----|---------------|-----------------
0 | 0.0 | a
1 | 2.0 | b
2 | 1.0 | c
3 | 0.0 | a
4 | 0.0 | a
5 | 1.0 | c
Refer to the IndexToString Scala docs for more details on the API.
import org.apache.spark.ml.feature.{StringIndexer, IndexToString}
val df = sqlContext.createDataFrame(Seq(
(0, "a"),
(1, "b"),
(2, "c"),
(3, "a"),
(4, "a"),
(5, "c")
)).toDF("id", "category")
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("categoryIndex")
.fit(df)
val indexed = indexer.transform(df)
val converter = new IndexToString()
.setInputCol("categoryIndex")
.setOutputCol("originalCategory")
val converted = converter.transform(indexed)
converted.select("id", "originalCategory").show()
One-hot encoding maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
val df = sqlContext.createDataFrame(Seq(
(0, "a"),
(1, "b"),
(2, "c"),
(3, "a"),
(4, "a"),
(5, "c")
)).toDF("id", "category")
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("categoryIndex")
.fit(df)
/*[0,a,0.0]
[1,b,2.0]
[2,c,1.0]
[3,a,0.0]
[4,a,0.0]*/
val indexed = indexer.transform(df)
val encoder = new OneHotEncoder()
.setInputCol("categoryIndex")
.setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)
/*| id| categoryVec|
+---+-------------+
| 0|(2,[0],[1.0])|
| 1| (2,[],[])|
| 2|(2,[1],[1.0])|
| 3|(2,[0],[1.0])|
| 4|(2,[0],[1.0])|
| 5|(2,[1],[1.0])|*/
encoded.select("id", "categoryVec").show()
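In the output above the vectors have length 2 although there are three categories, and “b” (index 2.0) becomes the empty vector (2,[],[]). This is consistent with the last category being dropped, which OneHotEncoder controls through its dropLast parameter. The plain-Scala sketch below illustrates the idea with dense vectors; OneHotSketch is a hypothetical helper.

```scala
object OneHotSketch {
  // Encode a category index as a binary vector. With dropLast = true the
  // vector has numCategories - 1 entries and the last category is all zeros.
  def encode(index: Int, numCategories: Int, dropLast: Boolean = true): Vector[Double] = {
    val size = if (dropLast) numCategories - 1 else numCategories
    Vector.tabulate(size)(i => if (i == index) 1.0 else 0.0)
  }
}
```

Dropping one category keeps the encoded columns from always summing to one, while still leaving every category distinguishable.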
VectorIndexer helps index categorical features in datasets of Vectors. It can both automatically decide which features are categorical and convert original values to category indices. Specifically, it takes an input column of type Vector and a parameter maxCategories; features with at most maxCategories distinct values are declared categorical, and their values are converted to category indices.
Indexing categorical features allows algorithms such as Decision Trees and Tree Ensembles to treat categorical features appropriately, improving performance.
In the example below, we read in a dataset of labeled points and then use VectorIndexer to decide which features should be treated as categorical. We transform the categorical feature values to their indices. This transformed data could then be passed to algorithms such as DecisionTreeRegressor that handle categorical features.
Refer to the VectorIndexer Scala docs for more details on the API.
import org.apache.spark.ml.feature.VectorIndexer
val data = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val indexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexed")
.setMaxCategories(10)
val indexerModel = indexer.fit(data)
val categoricalFeatures: Set[Int] = indexerModel.categoryMaps.keys.toSet
println(s"Chose ${categoricalFeatures.size} categorical features: " +
categoricalFeatures.mkString(", "))
// Create new column "indexed" with categorical values transformed to indices
val indexedData = indexerModel.transform(data)
indexedData.show()
Normalizer is a Transformer which transforms a dataset of Vector rows, normalizing each Vector to have unit norm. It takes parameter p, which specifies the p-norm used for normalization (p = 2 by default). This normalization can help standardize your input data and improve the behavior of learning algorithms.
The following example demonstrates how to load a dataset in libsvm format and then normalize each row to have unit $L^1$ norm and unit $L^\infty$ norm.
Refer to the Normalizer Scala docs for more details on the API.
import org.apache.spark.ml.feature.Normalizer
val dataFrame = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Normalize each Vector using $L^1$ norm.
val normalizer = new Normalizer()
.setInputCol("features")
.setOutputCol("normFeatures")
.setP(1.0)
val l1NormData = normalizer.transform(dataFrame)
l1NormData.show()
// Normalize each Vector using $L^\infty$ norm.
val lInfNormData = normalizer.transform(dataFrame, normalizer.p -> Double.PositiveInfinity)
lInfNormData.show()
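What unit p-norm means for a single row can be checked with a few lines of plain Scala. NormalizerSketch is a hypothetical helper operating on a non-empty Seq[Double] rather than a Spark Vector; leaving an all-zero row unchanged is an assumption of this sketch.

```scala
object NormalizerSketch {
  // Divide a non-empty vector by its p-norm; p = infinity uses the
  // largest absolute value. Assumption: an all-zero vector is returned as is.
  def normalize(v: Seq[Double], p: Double): Seq[Double] = {
    val norm =
      if (p.isPosInfinity) v.map(x => math.abs(x)).max
      else math.pow(v.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)
    if (norm == 0.0) v else v.map(_ / norm)
  }
}
```

For example, (1, 2, 1) has L1 norm 4 and becomes (0.25, 0.5, 0.25), while (1, -2, 1) has L-infinity norm 2 and becomes (0.5, -1, 0.5).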
StandardScaler transforms a dataset of Vector rows, normalizing each feature to have unit standard deviation and/or zero mean. It takes parameters:
withStd: True by default. Scales the data to unit standard deviation.
withMean: False by default. Centers the data with mean before scaling. It will build a dense output, so this does not work on sparse input and will raise an exception.
StandardScaler is an Estimator which can be fit on a dataset to produce a StandardScalerModel; this amounts to computing summary statistics. The model can then transform a Vector column in a dataset to have unit standard deviation and/or zero mean features.
Note that if the standard deviation of a feature is zero, it will return the default 0.0 value in the Vector for that feature.
The following example demonstrates how to load a dataset in libsvm format and then normalize each feature to have unit standard deviation.
Refer to the StandardScaler Scala docs for more details on the API.
import org.apache.spark.ml.feature.StandardScaler
val dataFrame = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val scaler = new StandardScaler()
.setInputCol("features")
.setOutputCol("scaledFeatures")
.setWithStd(true)
.setWithMean(false)
// Compute summary statistics by fitting the StandardScaler.
val scalerModel = scaler.fit(dataFrame)
// Normalize each feature to have unit standard deviation.
val scaledData = scalerModel.transform(dataFrame)
scaledData.show()
MinMaxScaler transforms a dataset of Vector rows, rescaling each feature to a specific range (often [0, 1]). It takes parameters:
min: 0.0 by default. Lower bound after transformation, shared by all features.
max: 1.0 by default. Upper bound after transformation, shared by all features.
MinMaxScaler computes summary statistics on a data set and produces a MinMaxScalerModel. The model can then transform each feature individually such that it is in the given range.
The rescaled value for a feature E is calculated as
$$Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min \qquad (1)$$
For the case $E_{max} == E_{min}$, $Rescaled(e_i) = 0.5 * (max + min)$.
Note that since zero values will probably be transformed to non-zero values, output of the transformer will be DenseVector even for sparse input.
The following example demonstrates how to load a dataset in libsvm format and then rescale each feature to [0, 1].
Refer to the MinMaxScaler Scala docs and the MinMaxScalerModel Scala docs for more details on the API.
import org.apache.spark.ml.feature.MinMaxScaler
val dataFrame = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val scaler = new MinMaxScaler()
.setInputCol("features")
.setOutputCol("scaledFeatures")
// Compute summary statistics and generate MinMaxScalerModel
val scalerModel = scaler.fit(dataFrame)
// rescale each feature to range [min, max].
val scaledData = scalerModel.transform(dataFrame)
scaledData.show()
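The rescaling rule, including the special case where a feature is constant, can be sketched for a single feature column in plain Scala. MinMaxSketch is a hypothetical helper that works on a non-empty Seq[Double] instead of a Vector column.

```scala
object MinMaxSketch {
  // Rescale one feature column to [min, max]. If all values are equal
  // (eMax == eMin), every value maps to the midpoint 0.5 * (max + min).
  def rescale(column: Seq[Double], min: Double = 0.0, max: Double = 1.0): Seq[Double] = {
    val eMin = column.min
    val eMax = column.max
    column.map { e =>
      if (eMax == eMin) 0.5 * (max + min)
      else (e - eMin) / (eMax - eMin) * (max - min) + min
    }
  }
}
```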
Bucketizer transforms a column of continuous features to a column of feature buckets, where the buckets are specified by users. It takes a parameter:
splits: Parameter for mapping continuous features into buckets. With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x,y) except the last bucket, which also includes y. Splits should be strictly increasing. Values at -inf, inf must be explicitly provided to cover all Double values; otherwise, values outside the splits specified will be treated as errors. Two examples of splits are Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity) and Array(0.0, 1.0, 2.0).
Note that if you do not know the upper and lower bounds of the targeted column, you should add Double.NegativeInfinity and Double.PositiveInfinity as the bounds of your splits to prevent a potential out-of-bounds exception from Bucketizer.
Note also that the splits that you provide have to be in strictly increasing order, i.e. s0 < s1 < s2 < ... < sn.
More details can be found in the API docs for Bucketizer.
The following example demonstrates how to bucketize a column of Doubles into a column of bucket indices.
Refer to the Bucketizer Scala docs for more details on the API.
import org.apache.spark.ml.feature.Bucketizer
val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)
val data = Array(-0.5, -0.3, 0.0, 0.2)
val dataFrame = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
val bucketizer = new Bucketizer()
.setInputCol("features")
.setOutputCol("bucketedFeatures")
.setSplits(splits)
// Transform original data into its bucket index.
val bucketedData = bucketizer.transform(dataFrame)
bucketedData.show()
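The bucket rule ([x, y) for every bucket except the last, which also includes its upper bound) can be sketched in plain Scala. BucketizerSketch is a hypothetical helper that uses a simple linear scan; it is not the Bucketizer implementation.

```scala
object BucketizerSketch {
  // Return the index of the bucket that contains value.
  // Buckets are [splits(i), splits(i + 1)); the last one also includes its upper bound.
  def bucket(splits: Array[Double], value: Double): Int = {
    require(value >= splits.head && value <= splits.last, s"$value is outside the splits")
    if (value == splits.last) splits.length - 2
    else splits.lastIndexWhere(_ <= value)
  }
}
```

With the splits used in the example, -0.5 and -0.3 fall into bucket 1, while 0.0 and 0.2 fall into bucket 2.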
ElementwiseProduct multiplies each input vector by a provided “weight” vector, using element-wise multiplication. In other words, it scales each column of the dataset by a scalar multiplier. This represents the Hadamard product between the input vector, v, and the transforming vector, w, to yield a result vector.
$$\begin{pmatrix} v_1 \\ \vdots \\ v_N \end{pmatrix} \circ \begin{pmatrix} w_1 \\ \vdots \\ w_N \end{pmatrix} = \begin{pmatrix} v_1 w_1 \\ \vdots \\ v_N w_N \end{pmatrix}$$
The example below demonstrates how to transform vectors using a transforming vector value.
Refer to the ElementwiseProduct Scala docs for more details on the API.
import org.apache.spark.ml.feature.ElementwiseProduct
import org.apache.spark.mllib.linalg.Vectors
// Create some vector data; also works for sparse vectors
val dataFrame = sqlContext.createDataFrame(Seq(
  ("a", Vectors.dense(1.0, 2.0, 3.0)),
  ("b", Vectors.dense(4.0, 5.0, 6.0)))).toDF("id", "vector")
val transformingVector = Vectors.dense(0.0, 1.0, 2.0)
val transformer = new ElementwiseProduct()
  .setScalingVec(transformingVector)
  .setInputCol("vector")
  .setOutputCol("transformedVector")
// Batch transform the vectors to create new column:
transformer.transform(dataFrame).show()
SQLTransformer implements the transformations which are defined by a SQL statement. Currently only SQL syntax like "SELECT ... FROM __THIS__ ..." is supported, where "__THIS__" represents the underlying table of the input dataset. The select clause specifies the fields, constants, and expressions to display in the output, and can be any select clause that Spark SQL supports. Users can also use Spark SQL built-in functions and UDFs to operate on these selected columns. For example, SQLTransformer supports statements like:
SELECT a, a + b AS a_b FROM __THIS__
SELECT a, SQRT(b) AS b_sqrt FROM __THIS__ where a > 5
SELECT a, b, SUM(c) AS c_sum FROM __THIS__ GROUP BY a, b
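One way to picture the "__THIS__" placeholder is as a token that gets swapped for the name of a table holding the input dataset before the statement runs. The snippet below is a plain Scala illustration of that idea only; `resolveStatement` and the table name `input_table` are hypothetical, and this is not presented as Spark's actual implementation.

```scala
// Hypothetical helper: substitute the __THIS__ placeholder with a concrete table name.
def resolveStatement(statement: String, tableName: String): String = {
  require(statement.contains("__THIS__"), "statement must reference __THIS__")
  statement.replace("__THIS__", tableName)
}

val resolved = resolveStatement("SELECT a, a + b AS a_b FROM __THIS__", "input_table")
println(resolved) // SELECT a, a + b AS a_b FROM input_table
```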
Examples
Assume that we have the following DataFrame with columns id, v1 and v2:
id | v1 | v2
----|-----|-----
0 | 1.0 | 3.0
2 | 2.0 | 5.0
This is the output of the SQLTransformer with statement "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__":
id | v1 | v2 | v3 | v4
----|-----|-----|-----|-----
0 | 1.0 | 3.0 | 4.0 | 3.0
2 | 2.0 | 5.0 | 7.0 | 10.0
Refer to the SQLTransformer Scala docs for more details on the API.
import org.apache.spark.ml.feature.SQLTransformer
val df = sqlContext.createDataFrame(
  Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")
val sqlTrans = new SQLTransformer().setStatement(
  "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
sqlTrans.transform(df).show()
VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees. VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type. In each row, the values of the input columns will be concatenated into a vector in the specified order.
Examples
Assume that we have a DataFrame with the columns id, hour, mobile, userFeatures, and clicked:
id | hour | mobile | userFeatures | clicked
----|------|--------|------------------|---------
0 | 18 | 1.0 | [0.0, 10.0, 0.5] | 1.0
userFeatures is a vector column that contains three user features. We want to combine hour, mobile, and userFeatures into a single feature vector called features and use it to predict clicked. If we set VectorAssembler’s input columns to hour, mobile, and userFeatures and the output column to features, after transformation we should get the following DataFrame:
id | hour | mobile | userFeatures | clicked | features
----|------|--------|------------------|---------|-----------------------------
0 | 18 | 1.0 | [0.0, 10.0, 0.5] | 1.0 | [18.0, 1.0, 0.0, 10.0, 0.5]
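The concatenation step can be sketched in plain Scala as follows. This is an illustration under stated assumptions, not the VectorAssembler implementation: the `Cell` types and `assemble` are hypothetical names, and mapping booleans to 1.0/0.0 is an assumption of this sketch.

```scala
// Hypothetical model of the supported input column types (not Spark API).
sealed trait Cell
case class Num(value: Double) extends Cell
case class Bool(value: Boolean) extends Cell
case class Vec(values: Seq[Double]) extends Cell

// Concatenate the cells of one row into a single flat feature vector, in the given order.
def assemble(cells: Seq[Cell]): Seq[Double] = cells.flatMap {
  case Num(x)  => Seq(x)
  case Bool(b) => Seq(if (b) 1.0 else 0.0) // assumption: true maps to 1.0, false to 0.0
  case Vec(xs) => xs
}

// hour = 18, mobile = 1.0, userFeatures = [0.0, 10.0, 0.5]
val features = assemble(Seq(Num(18), Num(1.0), Vec(Seq(0.0, 10.0, 0.5))))
println(features.mkString("[", ", ", "]")) // [18.0, 1.0, 0.0, 10.0, 0.5]
```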
Refer to the VectorAssembler Scala docs for more details on the API.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.Vectors
val dataset = sqlContext.createDataFrame(
  Seq((0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0))
).toDF("id", "hour", "mobile", "userFeatures", "clicked")
val assembler = new VectorAssembler()
  .setInputCols(Array("hour", "mobile", "userFeatures"))
  .setOutputCol("features")
val output = assembler.transform(dataset)
println(output.select("features", "clicked").first())
QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values. This attempts to find numBuckets partitions based on a sample of the given input data, but it may find fewer depending on the data sample values.
Note that the result may be different every time you run it, since the sampling strategy behind it is non-deterministic.
Examples
Assume that we have a DataFrame with the columns id and hour:
id | hour
----|------
0 | 18.0
1 | 19.0
2 | 8.0
3 | 5.0
4 | 2.2
hour is a continuous feature with Double type. We want to turn the continuous feature into a categorical one. Given numBuckets = 3, we should get the following DataFrame:
id | hour | result
----|------|------
0 | 18.0 | 2.0
1 | 19.0 | 2.0
2 | 8.0 | 1.0
3 | 5.0 | 1.0
4 | 2.2 | 0.0
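To make the binning concrete, here is a simplified plain Scala sketch. It computes exact quantiles over all values instead of sampling, so it is not Spark's algorithm and real runs may choose different splits; `quantileSplits` and `bucketOf` are hypothetical names.

```scala
// Simplified sketch: pick split points at exact quantiles of the full data (no sampling).
def quantileSplits(values: Seq[Double], numBuckets: Int): Array[Double] = {
  require(values.nonEmpty && numBuckets >= 1)
  val sorted = values.sorted.toIndexedSeq
  // One inner split per bucket boundary; duplicates are dropped, so fewer buckets may result.
  val inner = (1 until numBuckets).map(k => sorted(k * sorted.length / numBuckets)).distinct
  ((Double.NegativeInfinity +: inner) :+ Double.PositiveInfinity).toArray
}

// Bucket i holds values in [splits(i), splits(i + 1)); the last bucket also takes +Infinity.
def bucketOf(splits: Array[Double], x: Double): Double = {
  val idx = splits.lastIndexWhere(_ <= x)
  math.min(idx, splits.length - 2).toDouble
}

val hours = Seq(18.0, 19.0, 8.0, 5.0, 2.2)
val hourSplits = quantileSplits(hours, 3)
println(hourSplits.mkString(", "))                     // -Infinity, 5.0, 18.0, Infinity
println(hours.map(bucketOf(hourSplits, _)).mkString(", ")) // 2.0, 2.0, 1.0, 1.0, 0.0
```

For this small dataset the sketch happens to reproduce the result column shown above.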
Refer to the QuantileDiscretizer Scala docs for more details on the API.
import org.apache.spark.ml.feature.QuantileDiscretizer
// Needed for toDF on an RDD when running outside spark-shell.
import sqlContext.implicits._
val data = Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
val df = sc.parallelize(data).toDF("id", "hour")
val discretizer = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("result")
  .setNumBuckets(3)
val result = discretizer.fit(df).transform(df)
result.show()
VectorSlicer is a transformer that takes a feature vector and outputs a new feature vector with a sub-array of the original features. It is useful for extracting features from a vector column.
VectorSlicer accepts a vector column with specified indices, then outputs a new vector column whose values are selected via those indices. There are two types of indices:
Integer indices that represent the indices into the vector, set with setIndices();
String indices that represent the names of features in the vector, set with setNames(). This requires the vector column to have an AttributeGroup, since the implementation matches on the name field of an Attribute.
Specification by integer and by string are both acceptable, and you can use integer indices and string names simultaneously. At least one feature must be selected. Duplicate features are not allowed, so there can be no overlap between selected indices and names. Note that if feature names are selected, an exception is thrown when the input attributes are empty.
The output vector will order features with the selected indices first (in the order given), followed by the selected names (in the order given).
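The selection and ordering rules above can be sketched in plain Scala. `slice` is a hypothetical helper that mimics the described behavior on a single vector; it is not the VectorSlicer API, and attribute names are passed in as a simple array rather than an AttributeGroup.

```scala
// Hypothetical helper (not Spark API): select features by index and by name.
def slice(values: Array[Double], attrNames: Array[String],
          indices: Array[Int], names: Array[String]): Array[Double] = {
  // Resolve each selected name to its position in the vector.
  val nameIndices = names.map { n =>
    val i = attrNames.indexOf(n)
    require(i >= 0, s"unknown feature name: $n")
    i
  }
  // Selected indices come first, followed by the selected names, each in the order given.
  val selected = indices ++ nameIndices
  require(selected.nonEmpty, "at least one feature must be selected")
  require(selected.distinct.length == selected.length, "duplicate features are not allowed")
  selected.map(values(_))
}

val sliced = slice(Array(0.0, 10.0, 0.5), Array("f1", "f2", "f3"), Array(1), Array("f3"))
println(sliced.mkString("[", ", ", "]")) // [10.0, 0.5]
```

Selecting index 1 together with the name "f2" would fail the duplicate check, since both refer to the same feature.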
Examples
Suppose that we have a DataFrame with the column userFeatures:
userFeatures
------------------
[0.0, 10.0, 0.5]
userFeatures is a vector column that contains three user features. Assume that the first column of userFeatures is all zeros, so we want to remove it and select only the last two columns. The VectorSlicer selects the last two elements with setIndices(1, 2), then produces a new vector column named features:
userFeatures | features
------------------|-----------------------------
[0.0, 10.0, 0.5] | [10.0, 0.5]
Suppose also that we have potential input attributes for userFeatures, i.e. ["f1", "f2", "f3"]; then we can use setNames("f2", "f3") to select them.
userFeatures | features
------------------|-----------------------------
[0.0, 10.0, 0.5] | [10.0, 0.5]
["f1", "f2", "f3"] | ["f2", "f3"]
Refer to the VectorSlicer Scala docs for more details on the API.
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
val data = Array(Row(Vectors.dense(-2.0, 2.3, 0.0)))
val defaultAttr = NumericAttribute.defaultAttr
val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName)
val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])
val dataRDD = sc.parallelize(data)
val dataset = sqlContext.createDataFrame(dataRDD, StructType(Array(attrGroup.toStructField())))
val slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")
slicer.setIndices(Array(1)).setNames(Array("f3"))
// or slicer.setIndices(Array(1, 2)), or slicer.setNames(Array("f2", "f3"))
val output = slicer.transform(dataset)
println(output.select("userFeatures", "features").first())
RFormula selects columns specified by an R model formula. It produces a vector column of features and a double column of labels. As when formulas are used in R for linear regression, string input columns will be one-hot encoded, and numeric columns will be cast to doubles. If not already present in the DataFrame, the output label column will be created from the specified response variable in the formula.
Examples
Assume that we have a DataFrame with the columns id, country, hour, and clicked:
id | country | hour | clicked
---|---------|------|---------
7 | "US" | 18 | 1.0
8 | "CA" | 12 | 0.0
9 | "NZ" | 15 | 0.0
If we use RFormula with a formula string of clicked ~ country + hour, which indicates that we want to predict clicked based on country and hour, after transformation we should get the following DataFrame:
id | country | hour | clicked | features | label
---|---------|------|---------|------------------|-------
7 | "US" | 18 | 1.0 | [0.0, 0.0, 18.0] | 1.0
8 | "CA" | 12 | 0.0 | [0.0, 1.0, 12.0] | 0.0
9 | "NZ" | 15 | 0.0 | [1.0, 0.0, 15.0] | 0.0
Refer to the RFormula Scala docs for more details on the API.
import org.apache.spark.ml.feature.RFormula
val dataset = sqlContext.createDataFrame(Seq(
  (7, "US", 18, 1.0),
  (8, "CA", 12, 0.0),
  (9, "NZ", 15, 0.0)
)).toDF("id", "country", "hour", "clicked")
val formula = new RFormula()
  .setFormula("clicked ~ country + hour")
  .setFeaturesCol("features")
  .setLabelCol("label")
val output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()
ChiSqSelector stands for Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector orders features based on a Chi-Squared test of independence from the class, and then filters (selects) the top features which the class label depends on the most. This is akin to yielding the features with the most predictive power.
In short, the Chi-Squared selector applies to labeled data with categorical features and picks the features that the label depends on most.
Examples
Assume that we have a DataFrame with the columns id, features, and clicked, which is used as our target to be predicted:
id | features | clicked
---|-----------------------|---------
7 | [0.0, 0.0, 18.0, 1.0] | 1.0
8 | [0.0, 1.0, 12.0, 0.0] | 0.0
9 | [1.0, 0.0, 15.0, 0.1] | 0.0
If we use ChiSqSelector with numTopFeatures = 1, then according to our label clicked the last column in our features is chosen as the most useful feature:
id | features | clicked | selectedFeatures
---|-----------------------|---------|------------------
7 | [0.0, 0.0, 18.0, 1.0] | 1.0 | [1.0]
8 | [0.0, 1.0, 12.0, 0.0] | 0.0 | [0.0]
9 | [1.0, 0.0, 15.0, 0.1] | 0.0 | [0.1]
Refer to the ChiSqSelector Scala docs for more details on the API.
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vectors
// Needed for toDF on an RDD when running outside spark-shell.
import sqlContext.implicits._
val data = Seq(
  (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
  (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
  (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
)
val df = sc.parallelize(data).toDF("id", "features", "clicked")
val selector = new ChiSqSelector()
  .setNumTopFeatures(1)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")
val result = selector.fit(df).transform(df)
result.show()