Spark ML Feature

Reference: the official Spark ML feature documentation.

Feature Extraction

TF-IDF

Scala demo:

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)

val featurizedData = hashingTF.transform(wordsData)
// alternatively, CountVectorizer can also be used to get term frequency vectors

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)

val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show()
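
Note that setNumFeatures(20) keeps the hashed feature space tiny for this demo; HashingTF defaults to 2^18 (262,144) buckets, and a larger power of two reduces hash collisions in practice.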

Word2Vec

Scala demo:

import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Input data: Each row is a bag of words from a sentence or document.
val documentDF = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Learn a mapping from words to Vectors.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)

val result = model.transform(documentDF)
result.collect().foreach { case Row(text: Seq[_], features: Vector) =>
  println(s"Text: [${text.mkString(", ")}] => \nVector: $features\n") }
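
Spark's Word2Vec transforms each document into the average of the vectors of its words; setVectorSize(3) and setMinCount(0) are demo-scale settings rather than recommended defaults.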

CountVectorizer

CountVectorizer and CountVectorizerModel aim to convert a collection of text documents into vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be used as an Estimator to extract the vocabulary and produce a CountVectorizerModel. The model produces sparse representations of the documents over the vocabulary, which can then be passed to other algorithms such as LDA.

During the fitting process, CountVectorizer ranks terms by their frequency across the corpus and selects the top vocabSize words. An optional parameter, minDF, also affects fitting: it specifies the minimum number of documents a term must appear in (or fraction of documents, if < 1.0) to be included in the vocabulary. Another option is the binary toggle parameter, which controls the output vectors: when set to true, all nonzero counts are set to 1. This is especially useful for discrete probabilistic models that work with binary rather than integer counts; a sketch using this toggle follows the demo below.
Example:

 id | texts
----|----------
 0  | Array("a", "b", "c")
 1  | Array("a", "b", "b", "c", "a")

The texts field of each row is a document of type Array[String]. Calling fit() on a CountVectorizer produces a CountVectorizerModel with the vocabulary (a, b, c); the transformation then adds a new column "vector" to the DataFrame:

 id | texts                           | vector
----|---------------------------------|---------------
 0  | Array("a", "b", "c")            | (3,[0,1,2],[1.0,1.0,1.0])
 1  | Array("a", "b", "b", "c", "a")  | (3,[0,1,2],[2.0,2.0,1.0])

Each vector shows the document's token counts over the vocabulary, printed in Spark's sparse format (size, [indices], [values]).

Scala demo:

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")

// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(3)
  .setMinDF(2)
  .fit(df)

// alternatively, define CountVectorizerModel with a-priori vocabulary
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")

cvModel.transform(df).show(false)
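
As a minimal sketch of the binary toggle described above, reusing the df from the demo: with setBinary(true), every nonzero count in the output becomes 1.0.

// Sketch, assuming the df defined above: same corpus, binary toggle on,
// so each vocabulary term is marked 1.0 if it occurs at all.
val binaryModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(3)
  .setBinary(true)
  .fit(df)

binaryModel.transform(df).show(false)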

Feature Transformation

Tokenizer
StopWordsRemover
n-gram
Binarizer
PCA
PolynomialExpansion
Discrete Cosine Transform (DCT)
StringIndexer
IndexToString
OneHotEncoder
VectorIndexer
Interaction
Normalizer
StandardScaler
MinMaxScaler
MaxAbsScaler
Bucketizer
ElementwiseProduct
SQLTransformer
VectorAssembler
QuantileDiscretizer
Imputer
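
All of the transformers above follow the same setInputCol/setOutputCol pattern. As a minimal sketch (the column names and data are invented for illustration), here are two of them chained: StringIndexer followed by VectorAssembler.

import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Hypothetical toy data; "category" and "amount" are made-up column names.
val raw = spark.createDataFrame(Seq(
  (0, "a", 1.0),
  (1, "b", 0.5),
  (2, "a", 2.5)
)).toDF("id", "category", "amount")

// StringIndexer maps string labels to numeric indices ordered by frequency.
val indexed = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(raw)
  .transform(raw)

// VectorAssembler concatenates the given columns into one feature vector.
new VectorAssembler()
  .setInputCols(Array("categoryIndex", "amount"))
  .setOutputCol("features")
  .transform(indexed)
  .select("id", "features")
  .show(false)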

Feature Selection

VectorSlicer
RFormula
ChiSqSelector
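
As a minimal sketch of ChiSqSelector (the data and column names are invented): it keeps the features with the strongest chi-squared dependence on a categorical label.

import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors

// Toy labeled feature vectors, invented for illustration.
val data = spark.createDataFrame(Seq(
  (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
  (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
  (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
)).toDF("id", "features", "clicked")

// Keep only the single feature most correlated with the label.
val selector = new ChiSqSelector()
  .setNumTopFeatures(1)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")

selector.fit(data).transform(data).show(false)
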
Locality Sensitive Hashing (LSH)

Bucketed Random Projection for Euclidean distance
MinHash for Jaccard distance
Feature transformation
Approximate similarity join
Approximate nearest neighbor search
LSH operations
LSH algorithms
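
A minimal sketch of the LSH operations listed above, using BucketedRandomProjectionLSH (the random-projection family for Euclidean distance); the datasets and thresholds are invented for illustration.

import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors

// Two toy datasets of 2-D points.
val dfA = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 1.0)),
  (1, Vectors.dense(1.0, -1.0))
)).toDF("id", "features")

val dfB = spark.createDataFrame(Seq(
  (4, Vectors.dense(1.0, 0.0)),
  (5, Vectors.dense(-1.0, 0.0))
)).toDF("id", "features")

val brp = new BucketedRandomProjectionLSH()
  .setBucketLength(2.0)
  .setNumHashTables(3)
  .setInputCol("features")
  .setOutputCol("hashes")

val model = brp.fit(dfA)

// Feature transformation: adds the "hashes" column.
model.transform(dfA).show(false)

// Approximate similarity join: pairs within Euclidean distance 1.5.
model.approxSimilarityJoin(dfA, dfB, 1.5, "EuclideanDistance").show(false)

// Approximate nearest neighbor search: the 2 rows of dfA closest to a key.
model.approxNearestNeighbors(dfA, Vectors.dense(1.0, 0.5), 2).show(false)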
