Spark - Learning CountVectorizer

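CountVectorizer and CountVectorizerModel convert a collection of text documents into vectors of token counts. When no dictionary is available in advance, CountVectorizer can extract the vocabulary from the documents and produce a CountVectorizerModel. The model generates a sparse representation of each document over that vocabulary, which can then be passed to other algorithms such as LDA.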

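import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.CountVectorizerModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class CountVectorizerDemo {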

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local")
                .appName("CountVectorizer")
                .getOrCreate();

        List<Row> data = Arrays.asList(
                RowFactory.create(0, Arrays.asList("Jason", "David")),
                RowFactory.create(1, Arrays.asList("David", "Martin")),
                RowFactory.create(2, Arrays.asList("Martin", "Jason")),
                RowFactory.create(3, Arrays.asList("Jason", "Daiel")),
                RowFactory.create(4, Arrays.asList("Daiel", "Martin")),
                RowFactory.create(5, Arrays.asList("Moahmed", "Jason")),
                RowFactory.create(6, Arrays.asList("David", "David")),
                RowFactory.create(7, Arrays.asList("Jason", "Martin"))
        );

        StructType schema = new StructType(new StructField[]{
                new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
                new StructField("name", DataTypes.createArrayType(DataTypes.StringType), false, Metadata.empty())
        });
        Dataset<Row> df = spark.createDataFrame(data, schema);

        CountVectorizerModel model = new CountVectorizer()
                .setInputCol("name")
                .setOutputCol("features")
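                // Maximum number of terms kept in the vocabulary
                .setVocabSize(3)
                // Minimum number of documents a term must appear in to enter the vocabulary
                .setMinDF(2)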
                .fit(df);

        Arrays.stream(model.vocabulary())
                .forEach(System.out::println);
        model.transform(df).show();

        spark.stop();
    }
}

The resulting vocabulary is:

Jason
Martin
David
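
To see why these three terms are selected, the fitting step can be approximated in plain Java without Spark. This is only a sketch of the selection rule, not Spark's implementation: the class `VocabularySketch` and the helper `buildVocabulary` are hypothetical names, and the tie-break between terms with equal counts (document frequency, then name) is an assumption made to keep the result deterministic, since the post does not say how ties are ordered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class VocabularySketch {

    // Hypothetical helper: keeps terms that occur in at least minDF documents,
    // then returns the vocabSize terms with the highest total count.
    static List<String> buildVocabulary(List<List<String>> docs, int vocabSize, int minDF) {
        Map<String, Integer> termCount = new HashMap<>(); // total occurrences of each term
        Map<String, Integer> docFreq = new HashMap<>();   // number of documents containing each term
        for (List<String> doc : docs) {
            for (String token : doc) {
                termCount.merge(token, 1, Integer::sum);
            }
            for (String token : new HashSet<>(doc)) {
                docFreq.merge(token, 1, Integer::sum);
            }
        }

        List<String> candidates = new ArrayList<>();
        for (String term : termCount.keySet()) {
            if (docFreq.get(term) >= minDF) {
                candidates.add(term);
            }
        }

        // Highest total count first. Tie-break (an assumption): higher document
        // frequency first, then alphabetical order.
        candidates.sort((a, b) -> {
            int byCount = Integer.compare(termCount.get(b), termCount.get(a));
            if (byCount != 0) return byCount;
            int byDocFreq = Integer.compare(docFreq.get(b), docFreq.get(a));
            return byDocFreq != 0 ? byDocFreq : a.compareTo(b);
        });
        return new ArrayList<>(candidates.subList(0, Math.min(vocabSize, candidates.size())));
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("Jason", "David"),
                Arrays.asList("David", "Martin"),
                Arrays.asList("Martin", "Jason"),
                Arrays.asList("Jason", "Daiel"),
                Arrays.asList("Daiel", "Martin"),
                Arrays.asList("Moahmed", "Jason"),
                Arrays.asList("David", "David"),
                Arrays.asList("Jason", "Martin"));
        System.out.println(buildVocabulary(docs, 3, 2));
    }
}
```

With the sample data, "Moahmed" appears in only one document and is dropped by the minDF threshold, while "Daiel" passes the threshold but has a lower total count than the three terms that fit within the vocabulary size.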

After the transformation:

+---+----------------+-------------------+
| id|            name|           features|
+---+----------------+-------------------+
|  0|  [Jason, David]|(3,[0,2],[1.0,1.0])|
|  1| [David, Martin]|(3,[1,2],[1.0,1.0])|
|  2| [Martin, Jason]|(3,[0,1],[1.0,1.0])|
|  3|  [Jason, Daiel]|      (3,[0],[1.0])|
|  4| [Daiel, Martin]|      (3,[1],[1.0])|
|  5|[Moahmed, Jason]|      (3,[0],[1.0])|
|  6|  [David, David]|      (3,[2],[2.0])|
|  7| [Jason, Martin]|(3,[0,1],[1.0,1.0])|
+---+----------------+-------------------+
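
The rows above can be reproduced by hand. The following plain-Java sketch counts, for one document, how often each vocabulary term occurs and ignores tokens that are not in the vocabulary. It is an illustration only: `TransformSketch` and `encode` are hypothetical names, and the output string simply imitates the format printed in the table.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class TransformSketch {

    // Hypothetical helper: returns "(size,[indices],[counts])" for one document.
    static String encode(List<String> vocabulary, List<String> doc) {
        // TreeMap keeps the vocabulary indices in ascending order
        TreeMap<Integer, Double> counts = new TreeMap<>();
        for (String token : doc) {
            int index = vocabulary.indexOf(token);
            if (index >= 0) { // tokens outside the vocabulary are skipped
                counts.merge(index, 1.0, Double::sum);
            }
        }
        String indices = counts.keySet().stream()
                .map(i -> Integer.toString(i))
                .collect(Collectors.joining(","));
        String values = counts.values().stream()
                .map(v -> Double.toString(v))
                .collect(Collectors.joining(","));
        return "(" + vocabulary.size() + ",[" + indices + "],[" + values + "])";
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("Jason", "Martin", "David");
        System.out.println(encode(vocabulary, Arrays.asList("Jason", "David"))); // (3,[0,2],[1.0,1.0])
        System.out.println(encode(vocabulary, Arrays.asList("Jason", "Daiel"))); // (3,[0],[1.0])
        System.out.println(encode(vocabulary, Arrays.asList("David", "David"))); // (3,[2],[2.0])
    }
}
```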

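The features column is a sparse vector. Its first entry is the vector size (here the vocabulary size, 3), and the last two entries are the array of vocabulary indices and the array of term counts at those indices.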
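
As a small illustration of how to read this representation, the sketch below expands a (size, indices, values) triple into a dense array. `SparseVectorSketch` and `toDense` are hypothetical names used only for this example. In a Spark program, the vector's own `toArray()` method should give the same dense view.

```java
import java.util.Arrays;

public class SparseVectorSketch {

    // Hypothetical helper: expands (size, indices, values) into a dense array.
    static double[] toDense(int size, int[] indices, double[] values) {
        double[] dense = new double[size]; // positions not listed stay 0.0
        for (int i = 0; i < indices.length; i++) {
            dense[indices[i]] = values[i];
        }
        return dense;
    }

    public static void main(String[] args) {
        // Row 0: (3,[0,2],[1.0,1.0]) means Jason once, Martin never, David once
        System.out.println(Arrays.toString(
                toDense(3, new int[]{0, 2}, new double[]{1.0, 1.0}))); // [1.0, 0.0, 1.0]
        // Row 6: (3,[2],[2.0]) means David twice
        System.out.println(Arrays.toString(
                toDense(3, new int[]{2}, new double[]{2.0})));         // [0.0, 0.0, 2.0]
    }
}
```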
