[See the official documentation]
[Reference link]
Feature Extraction
TF-IDF
Scala demo:
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
val sentenceData = spark.createDataFrame(Seq(
(0.0, "Hi I heard about Spark"),
(0.0, "I wish Java could use case classes"),
(1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
val hashingTF = new HashingTF()
.setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
// alternatively, CountVectorizer can also be used to get term frequency vectors
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show()
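The demo above delegates the weighting to HashingTF and IDF. As a minimal plain-Scala sketch (no Spark; the tokenized documents are hard-coded assumptions) of the arithmetic involved: spark.ml's IDF uses the smoothed formula idf(t) = log((numDocs + 1) / (docFreq(t) + 1)), and the TF-IDF weight is the raw term count times idf.

```scala
// Plain-Scala sketch of the TF-IDF weighting used by spark.ml's IDF:
// idf(t) = log((numDocs + 1) / (docFreq(t) + 1)), tfidf = tf * idf.
val docs = Seq(
  Seq("hi", "i", "heard", "about", "spark"),
  Seq("i", "wish", "java", "could", "use", "case", "classes"),
  Seq("logistic", "regression", "models", "are", "neat")
)
val numDocs = docs.size
// Document frequency: the number of documents containing each term.
val docFreq: Map[String, Int] =
  docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => (t, occ.size) }
// Smoothed inverse document frequency.
def idf(term: String): Double =
  math.log((numDocs + 1.0) / (docFreq.getOrElse(term, 0) + 1.0))
// TF-IDF weights for one document: raw count times idf.
def tfidf(doc: Seq[String]): Map[String, Double] =
  doc.groupBy(identity).map { case (t, occ) => (t, occ.size * idf(t)) }
// "spark" occurs in 1 of 3 documents, so idf("spark") = log(4/2) = log 2.
println(tfidf(docs.head)("spark"))
```

HashingTF additionally hashes terms into a fixed number of buckets (20 in the demo), so distinct terms can collide into one index; the sketch skips that step and keys weights by the term itself.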
Word2Vec
Scala demo:
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
// Input data: Each row is a bag of words from a sentence or document.
val documentDF = spark.createDataFrame(Seq(
"Hi I heard about Spark".split(" "),
"I wish Java could use case classes".split(" "),
"Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")
// Learn a mapping from words to Vectors.
val word2Vec = new Word2Vec()
.setInputCol("text")
.setOutputCol("result")
.setVectorSize(3)
.setMinCount(0)
val model = word2Vec.fit(documentDF)
val result = model.transform(documentDF)
result.collect().foreach { case Row(text: Seq[_], features: Vector) =>
println(s"Text: [${text.mkString(", ")}] => \nVector: $features\n") }
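After fitting, Word2VecModel.transform produces one vector per document by averaging the learned vectors of its words. A plain-Scala sketch of that averaging step, using made-up 3-dimensional word vectors (in reality they come from training on the corpus):

```scala
// Plain-Scala sketch of how a trained Word2Vec model turns a document into
// a single vector: average the vectors of its words; words missing from the
// vocabulary contribute nothing. The vectors below are placeholder values.
val wordVectors: Map[String, Array[Double]] = Map(
  "hi"    -> Array(0.1, 0.2, 0.3),
  "spark" -> Array(0.4, 0.0, -0.2)
)
val vectorSize = 3
def docVector(words: Seq[String]): Array[Double] = {
  val sum = Array.fill(vectorSize)(0.0)
  for (w <- words; vec <- wordVectors.get(w); i <- vec.indices) sum(i) += vec(i)
  sum.map(_ / words.size) // divide by the document length
}
val v = docVector(Seq("hi", "spark"))
println(v.mkString("[", ", ", "]"))
```

This is why setVectorSize(3) in the demo yields 3-dimensional document vectors in the "result" column, regardless of document length.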
CountVectorizer
CountVectorizer and CountVectorizerModel aim to convert a collection of text documents into vectors of token counts. When no a-priori dictionary is available, CountVectorizer can be used as an Estimator to extract the vocabulary and produce a CountVectorizerModel.
The model produces a sparse representation of each document over the vocabulary, which can then be passed to other algorithms such as LDA.
During the fitting process, CountVectorizer selects the top vocabSize words, ranked by term frequency across the corpus.
An optional parameter minDF also affects fitting: it specifies the minimum number of documents (or fraction, if < 1.0) a term must appear in to be included in the vocabulary.
Another optional binary toggle parameter controls the output vector: if set to true, all nonzero counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.
Example:
id | texts
----|----------
0 | Array("a", "b", "c")
1 | Array("a", "b", "b", "c", "a")
Each row's texts field is a document of type Array[String]. Calling CountVectorizer's fit() method produces a CountVectorizerModel with vocabulary (a, b, c); the transformation then adds a new column "vector" to the DataFrame:
id | texts | vector
----|---------------------------------|---------------
0 | Array("a", "b", "c") | (3,[0,1,2],[1.0,1.0,1.0])
1 | Array("a", "b", "b", "c", "a") | (3,[0,1,2],[2.0,2.0,1.0])
Each vector represents the token counts of the document over the vocabulary.
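The mapping in the table above can be sketched in plain Scala (no Spark): given the fitted vocabulary, count each vocabulary term's occurrences in a document and emit (index, count) pairs; the binary toggle clamps each nonzero count to 1.0.

```scala
// Plain-Scala sketch of CountVectorizerModel's transform step: count the
// occurrences of each vocabulary term in a document, keeping only terms in
// the vocabulary. With binary = true, every nonzero count becomes 1.0.
val vocabulary = Array("a", "b", "c") // indices 0, 1, 2
def vectorize(doc: Seq[String], binary: Boolean = false): Seq[(Int, Double)] = {
  val counts: Map[String, Double] =
    doc.groupBy(identity).map { case (t, occ) => (t, occ.size.toDouble) }
  vocabulary.zipWithIndex.toSeq.collect {
    case (term, idx) if counts.contains(term) =>
      (idx, if (binary) 1.0 else counts(term))
  }
}
println(vectorize(Seq("a", "b", "b", "c", "a")))                // per-term counts
println(vectorize(Seq("a", "b", "b", "c", "a"), binary = true)) // clamped to 1.0
```

The (index, count) pairs correspond directly to the sparse-vector notation in the table, e.g. (3,[0,1,2],[2.0,2.0,1.0]) for the second document.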
Scala demo:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
val df = spark.createDataFrame(Seq(
(0, Array("a", "b", "c")),
(1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")
// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
.setInputCol("words")
.setOutputCol("features")
.setVocabSize(3)
.setMinDF(2)
.fit(df)
// alternatively, define CountVectorizerModel with a-priori vocabulary
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
.setInputCol("words")
.setOutputCol("features")
cvModel.transform(df).show(false)