CountVectorizer and CountVectorizerModel convert a collection of text documents into vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can extract the vocabulary itself and produce a CountVectorizerModel. The model generates a sparse representation of each document over that vocabulary, which can then be passed to other algorithms such as LDA.
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.CountVectorizerModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class CountVectorizerDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local")
                .appName("CountVectorizer")
                .getOrCreate();
        List<Row> data = Arrays.asList(
                RowFactory.create(0, Arrays.asList("Jason", "David")),
                RowFactory.create(1, Arrays.asList("David", "Martin")),
                RowFactory.create(2, Arrays.asList("Martin", "Jason")),
                RowFactory.create(3, Arrays.asList("Jason", "Daiel")),
                RowFactory.create(4, Arrays.asList("Daiel", "Martin")),
                RowFactory.create(5, Arrays.asList("Moahmed", "Jason")),
                RowFactory.create(6, Arrays.asList("David", "David")),
                RowFactory.create(7, Arrays.asList("Jason", "Martin"))
        );
        StructType schema = new StructType(new StructField[]{
                new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
                new StructField("name", DataTypes.createArrayType(DataTypes.StringType), false, Metadata.empty())
        });
        Dataset<Row> df = spark.createDataFrame(data, schema);
        CountVectorizerModel model = new CountVectorizer()
                .setInputCol("name")
                .setOutputCol("features")
                // cap the vocabulary at the 3 most frequent terms
                .setVocabSize(3)
                // a term must appear in at least 2 documents to enter the vocabulary
                .setMinDF(2)
                .fit(df);
        // print the learned vocabulary, then the transformed DataFrame
        Arrays.stream(model.vocabulary())
                .forEach(System.out::println);
        model.transform(df).show();
        spark.stop();
    }
}
The learned vocabulary is:
Jason
Martin
David
After the transformation:
+---+----------------+-------------------+
| id| name| features|
+---+----------------+-------------------+
| 0| [Jason, David]|(3,[0,2],[1.0,1.0])|
| 1| [David, Martin]|(3,[1,2],[1.0,1.0])|
| 2| [Martin, Jason]|(3,[0,1],[1.0,1.0])|
| 3| [Jason, Daiel]| (3,[0],[1.0])|
| 4| [Daiel, Martin]| (3,[1],[1.0])|
| 5|[Moahmed, Jason]| (3,[0],[1.0])|
| 6| [David, David]| (3,[2],[2.0])|
| 7| [Jason, Martin]|(3,[0,1],[1.0,1.0])|
+---+----------------+-------------------+
The features column holds a sparse vector in the form (size, [indices], [values]): the vocabulary size, the vocabulary indices of the terms present in the document, and the corresponding term frequencies. For example, (3,[0,2],[1.0,1.0]) in row 0 means Jason (index 0) and David (index 2) each occur once, and (3,[2],[2.0]) in row 6 means David occurs twice. Note that Daiel and Moahmed never appear in features: Moahmed occurs in only one document and fails the minDF threshold, and with the vocabulary capped at 3 terms, Daiel does not make the cut either, so both tokens are simply ignored.
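To make the encoding concrete, here is a minimal plain-Java sketch (no Spark needed; the class and method names are hypothetical, not part of the Spark API) of the per-row counting that CountVectorizerModel.transform effectively performs: each token is looked up in the vocabulary and its count incremented, while out-of-vocabulary tokens such as Daiel are dropped.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration only.
public class SparseCountSketch {
    // Returns dense term counts of `tokens` against `vocab`;
    // tokens not found in the vocabulary are ignored.
    public static double[] encode(List<String> tokens, List<String> vocab) {
        double[] counts = new double[vocab.size()];
        for (String t : tokens) {
            int i = vocab.indexOf(t);
            if (i >= 0) counts[i] += 1.0;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> vocab = Arrays.asList("Jason", "Martin", "David");
        // Row 6 of the example, [David, David], maps to (3,[2],[2.0]):
        System.out.println(Arrays.toString(
                encode(Arrays.asList("David", "David"), vocab))); // [0.0, 0.0, 2.0]
        // Row 3, [Jason, Daiel]: Daiel is out of vocabulary, so only index 0 is set:
        System.out.println(Arrays.toString(
                encode(Arrays.asList("Jason", "Daiel"), vocab))); // [1.0, 0.0, 0.0]
    }
}
```

The dense array here corresponds to the sparse form shown in the table: the nonzero positions are the indices array and the nonzero values are the counts array.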