Word Cloud Statistics with Java and ElasticSearch

Using ElasticSearch's analysis (tokenization) and aggregation features to compute word-cloud statistics over the keywords in a body of text.

This article tokenizes news posts collected from Weibo, counts term frequencies, and finally generates a word cloud. The code is as follows:

    public List<String> wordCloudCount(Class<?> clazz, String keywords) {
        // Full-text query for the given keywords
        BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
        boolQuery.must(QueryBuilders.queryStringQuery(keywords));
        // Terms aggregation on the analyzed "content" field, keeping the top 30 terms
        TermsAggregationBuilder builder = AggregationBuilders.terms("word_count").field("content").size(30);
        // Read the index name and type from the entity's @Document annotation
        Document document = clazz.getAnnotation(Document.class);
        SearchQuery searchQuery = new NativeSearchQueryBuilder()
                .withIndices(document.indexName())
                .withTypes(document.type())
                .withQuery(boolQuery)
                .addAggregation(builder)
                .build();
        // Execute the query and extract only the aggregation results
        Aggregations aggregations = elasticsearchTemplate.query(searchQuery,
                SearchResponse::getAggregations);
        StringTerms termsAgg = (StringTerms) aggregations.asMap().get("word_count");
        LinkedList<String> wordList = new LinkedList<>();
        for (StringTerms.Bucket bucket : termsAgg.getBuckets()) {
            wordList.add(bucket.getKeyAsString());
        }
        // Filter out stopwords, listed one per line in stopwords.txt
        try (BufferedReader bufferedReader = new BufferedReader(new FileReader("stopwords.txt"))) {
            List<String> stopwords = new ArrayList<>();
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                stopwords.add(line);
            }
            wordList.removeIf(stopwords::contains);
        } catch (IOException e) {
            log.info("Failed to read stopwords file", e);
        }
        return wordList;
    }

Tokenization and aggregation are performed through Spring Data's ElasticsearchTemplate. Note that after tokenization the stopwords, i.e. the entries in stopwords.txt, must be filtered out; the method then returns the remaining top terms, which the terms aggregation orders by frequency.
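If the word cloud needs per-term weights rather than just an ordered list, the same stopword filter can be applied to a word-to-count map built from the aggregation buckets (each bucket also exposes a document count). The sketch below shows only that filtering step, using plain JDK collections; the words, counts, and the `StopwordFilter` class are hypothetical stand-ins for the real aggregation result:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class StopwordFilter {

    // Remove stopwords from a word -> frequency map, preserving the
    // original (frequency-ordered) bucket order.
    public static Map<String, Long> filter(Map<String, Long> wordCounts, Set<String> stopwords) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : wordCounts.entrySet()) {
            if (!stopwords.contains(e.getKey())) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical aggregation result: term -> doc count
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("the", 100L);
        counts.put("news", 42L);
        counts.put("weibo", 17L);

        // "the" is removed; insertion order of the rest is preserved
        System.out.println(filter(counts, Set.of("the", "a")));
    }
}
```

Keeping the counts lets the word-cloud renderer scale each term's font size by its frequency instead of treating all terms equally.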
