Spark MLlib CountVectorizer Source Code Analysis

CountVectorizer and CountVectorizerModel convert a collection of text documents into vectors of token counts. When no a-priori dictionary is available, CountVectorizer can be used as an Estimator to extract the vocabulary and produce a CountVectorizerModel. The model produces a sparse representation of each document over the vocabulary, which can then be passed to other algorithms such as LDA. During fitting, CountVectorizer selects terms ordered by their frequency across the corpus, from highest to lowest; the maximum vocabulary size is controlled by the vocabSize parameter. An optional parameter, minDF, also affects fitting: it specifies the minimum number (or fraction, if less than 1.0) of distinct documents a term must appear in to be included in the vocabulary.
1. vocabSize: the maximum size of the vocabulary

/**
   * Max size of the vocabulary.
   * Term selection: a word count is computed over the corpus, and the top vocabSize
   * terms by count are kept in the vocabulary.
   * Default: 2^18^ (math.pow(2, 18))
   * @group param
   */
  val vocabSize: IntParam =
    new IntParam(this, "vocabSize", "max size of the vocabulary", ParamValidators.gt(0))
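
A minimal sketch (plain Scala, not the Spark source) of how this top-vocabSize selection behaves: terms are ranked by their corpus-wide counts and everything beyond vocabSize is dropped. The counts below are made up for illustration.

val corpusCounts = Seq(("a", 10L), ("b", 7L), ("c", 3L), ("d", 1L)) // hypothetical corpus counts
val sketchVocabSize = 2
val vocab = corpusCounts.sortBy(-_._2).take(sketchVocabSize).map(_._1)
// vocab == Seq("a", "b"); "c" and "d" never make it into the vocabulary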

2. minDF: minimum document frequency

 /**
   * DF is the number of documents in the corpus that a term appears in; minDF is the
   * lower bound on DF for a term to enter the vocabulary.
   * A value >= 1 is treated as an absolute number of documents;
   * a value in [0, 1) is treated as a fraction of the documents.
   * Default: 1.0
   * @group param
   */
   val minDF: DoubleParam = new DoubleParam(this, "minDF", "Specifies the minimum number of" +
    " different documents a term must appear in to be included in the vocabulary." +
    " If this is an integer >= 1, this specifies the number of documents the term must" +
    " appear in; if this is a double in [0,1), then this specifies the fraction of documents.",
    ParamValidators.gtEq(0.0))
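
A small usage sketch of the two interpretations of minDF; the threshold values are only illustrative.

new CountVectorizer().setMinDF(2)    // integer >= 1: a term must appear in at least 2 documents
new CountVectorizer().setMinDF(0.5)  // double in [0,1): a term must appear in at least half of the documents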

3. minTF: minimum term frequency within a document
/**
   * TF is a term's frequency within a document: the term's count divided by the
   * document's total token count.
   * Default: 1.0
   * @group param
   */
  val minTF: DoubleParam = new DoubleParam(this, "minTF", "Filter to ignore rare words in" +
    " a document. For each document, terms with frequency/count less than the given threshold are" +
    " ignored. If this is an integer >= 1, then this specifies a count (of times the term must" +
    " appear in the document); if this is a double in [0,1), then this specifies a fraction (out" +
    " of the document's token count). Note that the parameter is only used in transform of" +
    " CountVectorizerModel and does not affect fitting.", ParamValidators.gtEq(0.0))

4. binary

/**
   * Useful for discrete probabilistic models that model binary events (a term occurs or
   * does not occur) rather than integer counts.
   * @group param
   */
  val binary: BooleanParam =
    new BooleanParam(this, "binary", "If True, all non zero counts are set to 1.")
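
A short sketch of enabling binary mode, which suits Bernoulli-style models (e.g. Bernoulli naive Bayes); the vocabulary is illustrative.

new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")
  .setBinary(true)  // every non-zero count in the output vector becomes 1.0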

5. The fit method: building (packing) the vocabulary

@Since("2.0.0")
  override def fit(dataset: Dataset[_]): CountVectorizerModel = {
    transformSchema(dataset.schema, logging = true)
    val vocSize = $(vocabSize)
    val input = dataset.select($(inputCol)).rdd.map(_.getAs[Seq[String]](0))
    // 如果传入的minDF大于等于1 则使用传入参数 否则使用传入minDF * dataFrame.count()
    val minDf = if ($(minDF) >= 1.0) {
      $(minDF)
    } else {
      $(minDF) * input.cache().count()
    }
    // input:RDD[Seq[String]] 类型 每个文档计算wordcount 然后使用flatMap展开
    // 输出类型为RDD[(String, Long)] 再调用reduceByKey计算word的DF逆词频
    val wordCounts: RDD[(String, Long)] = input.flatMap { case (tokens) =>
      val wc = new OpenHashMap[String, Long]
      tokens.foreach { w =>
        wc.changeValue(w, 1L, _ + 1L)
      }
      wc.map { case (word, count) => (word, (count, 1)) }
    }.reduceByKey { case ((wc1, df1), (wc2, df2)) =>
      (wc1 + wc2, df1 + df2)
    }.filter { case (word, (wc, df)) =>
      df >= minDf
    }.map { case (word, (count, dfCount)) =>
      (word, count)
    }.cache()
    val fullVocabSize = wordCounts.count()

    // 使用top算子去固定数量的词放入到词典中
    val vocab = wordCounts
      .top(math.min(fullVocabSize, vocSize).toInt)(Ordering.by(_._2))
      .map(_._1)

    require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as necessary.")
    copyValues(new CountVectorizerModel(uid, vocab).setParent(this))
  }

Tip: the overall implementation is fairly straightforward. The main thing to watch is the vocabulary size: if the data you later transform is much larger (or distributed differently) than the training corpus, many terms will not appear in the model's vocabulary, so those features are silently ignored and the resulting vectors can misrepresent the data.
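
A hedged illustration of the tip above, assuming a model whose vocabulary was learned on a small corpus: terms missing from that vocabulary contribute nothing to the output vector.

val smallVocabModel = new CountVectorizerModel(Array("a", "b"))  // hypothetical two-term vocabulary
  .setInputCol("words")
  .setOutputCol("features")
// Transforming a new document Seq("a", "z", "z") ignores "z" entirely,
// yielding the sparse vector (2,[0],[1.0]); the signal carried by "z" is lost.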

6. The transform method: generating the output vectors

override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema, logging = true)
    if (broadcastDict.isEmpty) {
      val dict = vocabulary.zipWithIndex.toMap
      broadcastDict = Some(dataset.sparkSession.sparkContext.broadcast(dict))
    }
    val dictBr = broadcastDict.get
    val minTf = $(minTF)
    val vectorizer = udf { (document: Seq[String]) =>
      // term counts for this document, keyed by vocabulary index
      val termCounts = new OpenHashMap[Int, Double]
      // total number of tokens in this document
      var tokenCount = 0L
      document.foreach { term =>
        dictBr.value.get(term) match {
          case Some(index) => termCounts.changeValue(index, 1.0, _ + 1.0)
          case None => // ignore terms not in the vocabulary
        }
        tokenCount += 1
      }
      // compute the effective minTF threshold: an absolute count, or a fraction of the token count
      val effectiveMinTF = if (minTf >= 1.0) minTf else tokenCount * minTf
      val effectiveCounts = if ($(binary)) {
        // in binary mode every surviving count becomes 1.0
        termCounts.filter(_._2 >= effectiveMinTF).map(p => (p._1, 1.0)).toSeq
      } else {
        termCounts.filter(_._2 >= effectiveMinTF).toSeq
      }
      // build the sparse output vector
      Vectors.sparse(dictBr.value.size, effectiveCounts)
    }
    dataset.withColumn($(outputCol), vectorizer(col($(inputCol))))
  }
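
For reference, a minimal sketch of the last step above: Vectors.sparse assembles the output vector from (index, value) pairs against the vocabulary size.

import org.apache.spark.ml.linalg.Vectors

val v = Vectors.sparse(5, Seq((1, 2.0), (3, 1.0)))  // 5-term vocabulary, counts at indices 1 and 3
// v.toString == "(5,[1,3],[2.0,1.0])"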

7. Example:

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")

// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(3)
  .setMinDF(2)
  .fit(df)

// alternatively, define a CountVectorizerModel with an a-priori vocabulary
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")

cvModel.transform(df).show(false)
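
Running the snippet above should print roughly the following (the indices assigned to "a" and "b" may swap, since both appear three times in the corpus):

+---+---------------+-------------------------+
|id |words          |features                 |
+---+---------------+-------------------------+
|0  |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
|1  |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+---+---------------+-------------------------+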

CountVectorizer is commonly paired with IDF to obtain TF-IDF weights that reflect how informative each term is across the corpus (a sketch follows below).
The count vectors it produces can also be fed directly into LDA.
An analysis of the NGram transformer will follow in a later post.
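
A minimal, hedged sketch of the CountVectorizer + IDF combination mentioned above; the column names are illustrative and df reuses the DataFrame from the example.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{CountVectorizer, IDF}

val cv = new CountVectorizer().setInputCol("words").setOutputCol("rawFeatures")
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")  // reweights counts by inverse document frequency
val tfidfModel = new Pipeline().setStages(Array(cv, idf)).fit(df)
tfidfModel.transform(df).select("features").show(false)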
