CountVectorizer and CountVectorizerModel aim to convert a collection of documents into vectors of token counts. When no a-priori dictionary is available, CountVectorizer can be used as an Estimator to extract the vocabulary and produce a CountVectorizerModel. The model yields a sparse representation of each document over the vocabulary, which can then be passed to other algorithms such as LDA. During fitting, CountVectorizer selects the top terms ordered by term frequency across the corpus; the maximum vocabulary size is controlled by the vocabSize parameter. An optional parameter, minDF, also affects the fitting process: it specifies the minimum number of different documents a term must appear in to be included in the vocabulary.
1. vocabSize: the size of the vocabulary
/**
 * Max size of the vocabulary.
 * Term selection: word-count the whole corpus, then take the top vocabSize terms into the vocabulary.
 * Default: 2^18^
 * @group param
 */
val vocabSize: IntParam =
new IntParam(this, "vocabSize", "max size of the vocabulary", ParamValidators.gt(0))
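A minimal sketch of the effect (the data, column names, and the names docs/top2Model below are made up for illustration): with vocabSize = 2, only the two most frequent terms across the corpus survive.

import org.apache.spark.ml.feature.CountVectorizer

val docs = spark.createDataFrame(Seq(
  (0, Array("a", "b", "d")),
  (1, Array("a", "b")),
  (2, Array("a", "c"))
)).toDF("id", "words")

val top2Model = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(2) // keep only the top-2 terms by corpus frequency
  .fit(docs)

// counts: a -> 3, b -> 2, c -> 1, d -> 1, so the expected vocabulary is Array("a", "b")
println(top2Model.vocabulary.mkString(", "))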
2. minDF: the minimum document frequency of a term in the corpus
/**
 * DF is the number of distinct documents a term appears in; minDF is the
 * lower bound on DF for a term to enter the vocabulary.
 * An integer value is interpreted as an absolute number of documents.
 * Default: 1.0
 * @group param
 */
val minDF: DoubleParam = new DoubleParam(this, "minDF", "Specifies the minimum number of" +
" different documents a term must appear in to be included in the vocabulary." +
" If this is an integer >= 1, this specifies the number of documents the term must" +
" appear in; if this is a double in [0,1), then this specifies the fraction of documents.",
ParamValidators.gtEq(0.0))
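A sketch of the two interpretations, reusing the hypothetical docs from above (3 documents): an integer value is an absolute document count, a double in [0, 1) is a fraction of the corpus.

// minDF = 2.0: a term must appear in at least 2 distinct documents.
// minDF = 0.5: 0.5 * 3 documents = 1.5, so a term needs DF >= 1.5,
// i.e. effectively the same bar of 2 distinct documents here.
val minDfModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setMinDF(2.0) // try .setMinDF(0.5) for the fractional form
  .fit(docs)

// "c" and "d" each appear in only one document, so the expected vocabulary is Array("a", "b")
println(minDfModel.vocabulary.mkString(", "))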
3. minTF: the minimum term frequency within a document
/**
 * Filter to ignore rare words in a document; applied only in the transform of
 * CountVectorizerModel, it does not affect fitting.
 * Default: 1.0
 * @group param
 */
val minTF: DoubleParam = new DoubleParam(this, "minTF", "Filter to ignore rare words in" +
" a document. For each document, terms with frequency/count less than the given threshold are" +
" ignored. If this is an integer >= 1, then this specifies a count (of times the term must" +
" appear in the document); if this is a double in [0,1), then this specifies a fraction (out" +
" of the document's token count). Note that the parameter is only used in transform of" +
" CountVectorizerModel and does not affect fitting.", ParamValidators.gtEq(0.0))
4. binary: binarize the counts
/**
 * Useful for discrete probabilistic models that model binary events
 * (a term occurred or it did not) rather than integer counts.
 * @group param
 */
val binary: BooleanParam =
new BooleanParam(this, "binary", "If True, all non zero counts are set to 1.")
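A sketch: with binary = true, every surviving count is clamped to 1.0, which is the form Bernoulli-style models expect (reusing cvAll from the minTF sketch above).

// reset minTF to its default of 1.0 first, since the setter above mutated the model
cvAll.setMinTF(1.0).setBinary(true).transform(docs).show(false)
// every non-zero count in "features" is now exactly 1.0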
5. A look at fit: building the vocabulary
@Since("2.0.0")
override def fit(dataset: Dataset[_]): CountVectorizerModel = {
transformSchema(dataset.schema, logging = true)
val vocSize = $(vocabSize)
val input = dataset.select($(inputCol)).rdd.map(_.getAs[Seq[String]](0))
// if minDF >= 1.0, use it as an absolute document count;
// otherwise treat it as a fraction and scale by the total number of documents
val minDf = if ($(minDF) >= 1.0) {
$(minDF)
} else {
$(minDF) * input.cache().count()
}
// input is an RDD[Seq[String]]: word-count each document locally, then flatMap
// the per-document maps out as (word, (count, 1)) pairs; reduceByKey then sums
// each word's total count and its document frequency (DF), and the final
// wordCounts is an RDD[(String, Long)] of (word, totalCount)
val wordCounts: RDD[(String, Long)] = input.flatMap { case (tokens) =>
val wc = new OpenHashMap[String, Long]
tokens.foreach { w =>
wc.changeValue(w, 1L, _ + 1L)
}
wc.map { case (word, count) => (word, (count, 1)) }
}.reduceByKey { case ((wc1, df1), (wc2, df2)) =>
(wc1 + wc2, df1 + df2)
}.filter { case (word, (wc, df)) =>
df >= minDf
}.map { case (word, (count, dfCount)) =>
(word, count)
}.cache()
val fullVocabSize = wordCounts.count()
// take up to vocSize top terms (ordered by total count) into the vocabulary
val vocab = wordCounts
.top(math.min(fullVocabSize, vocSize).toInt)(Ordering.by(_._2))
.map(_._1)
require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as necessary.")
copyValues(new CountVectorizerModel(uid, vocab).setParent(this))
}
Tips: the overall implementation is fairly simple and easy to follow. The main thing to watch is the vocabulary size: if your new data is much larger than the training sample, many terms will not exist in the model's vocabulary, so those features are silently dropped at transform time and the resulting vectors can be distorted.
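A quick sketch of that failure mode (hypothetical data, reusing minDfModel from the minDF sketch): a term never seen during fit contributes nothing to the output vector.

val unseen = spark.createDataFrame(Seq(
  (0, Array("a", "x", "x", "x"))
)).toDF("id", "words")

// "x" is not in minDfModel's vocabulary, so its three occurrences are
// silently ignored; only the count of "a" shows up in the features vector.
minDfModel.transform(unseen).show(false)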
6. A look at transform: generating the vectors
override def transform(dataset: Dataset[_]): DataFrame = {
transformSchema(dataset.schema, logging = true)
if (broadcastDict.isEmpty) {
val dict = vocabulary.zipWithIndex.toMap
broadcastDict = Some(dataset.sparkSession.sparkContext.broadcast(dict))
}
val dictBr = broadcastDict.get
val minTf = $(minTF)
val vectorizer = udf { (document: Seq[String]) =>
// holds the term counts for this document
val termCounts = new OpenHashMap[Int, Double]
// total number of tokens in this document
var tokenCount = 0L
document.foreach { term =>
dictBr.value.get(term) match {
case Some(index) => termCounts.changeValue(index, 1.0, _ + 1.0)
case None => // ignore terms not in the vocabulary
}
tokenCount += 1
}
// compute the effective minTF threshold: an absolute count if minTF >= 1.0,
// otherwise a fraction of the document's token count
val effectiveMinTF = if (minTf >= 1.0) minTf else tokenCount * minTf
// apply the minTF filter to the counts
val effectiveCounts = if ($(binary)) {
// in binary mode the only values are 0 and 1
termCounts.filter(_._2 >= effectiveMinTF).map(p => (p._1, 1.0)).toSeq
} else {
termCounts.filter(_._2 >= effectiveMinTF).toSeq
}
// build the sparse output vector
Vectors.sparse(dictBr.value.size, effectiveCounts)
}
dataset.withColumn($(outputCol), vectorizer(col($(inputCol))))
}
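To make the effectiveMinTF branch concrete, here is a worked trace through the udf above, with assumed inputs:

// document = Seq("a", "b", "b", "c", "a"), minTf = 0.5
// tokenCount = 5  =>  effectiveMinTF = 5 * 0.5 = 2.5
// termCounts: a -> 2.0, b -> 2.0, c -> 1.0
// no count reaches 2.5, so effectiveCounts is empty and the result is
// the empty sparse vector Vectors.sparse(vocabSize, Seq())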
7. Example:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
val df = spark.createDataFrame(Seq(
(0, Array("a", "b", "c")),
(1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")
val cvModel: CountVectorizerModel = new CountVectorizer()
.setInputCol("words")
.setOutputCol("features")
.setVocabSize(3)
.setMinDF(2)
.fit(df)
// alternatively, build a CountVectorizerModel directly from an a-priori vocabulary
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
.setInputCol("words")
.setOutputCol("features")
cvModel.transform(df).show(false)
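Running cvModel.transform(df).show(false) should print something close to the following (the fitted vocabulary comes out as (a, b, c), ordered by total corpus count; with minDF = 2 every term survives, since all three appear in both documents):

+---+---------------+-------------------------+
|id |words          |features                 |
+---+---------------+-------------------------+
|0  |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
|1  |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+---+---------------+-------------------------+

The a-priori model cvm defines the same vocabulary by hand, so cvm.transform(df) should yield identical vectors.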
CountVectorizer is typically used together with IDF to weight the resulting term distribution, and the vectors it produces can also be fed into LDA for further processing.
Next up: NGram.