Scraping data with Python and generating a word cloud of hot keywords

This is a small demo I wrote while taking the course 用Python玩转数据 ("Data Processing Using Python") on a Chinese MOOC platform.

Implementation steps

1. Scrape data from a website; in this case, book comments from Douban

Fetch the page with the Requests library's get()

Parse the fetched page with the BeautifulSoup library.

Write the results to a file
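The three sub-steps above can be sketched as follows. To keep the example self-contained, a hardcoded HTML snippet stands in for the page fetched with requests.get(); the 'comment-content' class mirrors Douban's markup but the snippet itself is illustrative:

```python
from bs4 import BeautifulSoup

# A hardcoded snippet standing in for a fetched Douban comments page.
html = '''
<div class="comment-list">
  <p class="comment-content">Great book.</p>
  <p class="comment-content">Worth rereading.</p>
</div>
'''

# Parse the page and pull out every comment paragraph.
soup = BeautifulSoup(html, 'html.parser')
comments = [p.get_text() for p in soup.find_all('p', 'comment-content')]

# Write one comment per line, ready for line-by-line segmentation later.
with open('subjects.txt', 'w', encoding='utf-8') as f:
    for c in comments:
        f.write(c + '\n')
```

Writing one comment per line matters: the segmentation step reads the file back with readlines(), so each comment must end with a newline.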

2. Segment the scraped text into words

Use the jieba segmenter to tokenize the text line by line; a single line of code looks like:

word_list = pseg.cut(subject)

3. Remove stop words

Many words such as "的" ("of") and "我们" ("we"), along with punctuation, contribute nothing to hot-keyword analysis, so they should be filtered out. For example:

stop_words = set(line.strip() for line in open('stopwords.txt', encoding='utf-8'))
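A self-contained sketch of the filtering step, with the stop-word set written inline instead of loaded from stopwords.txt (the words and sentence are illustrative):

```python
# Inline stand-in for the contents of stopwords.txt.
stop_words = {'这', '本', '的', '很'}

# A pre-segmented sentence, as produced by the previous step.
words = ['这', '本', '书', '的', '故事', '很', '精彩']

# Keep only the words that are not stop words.
filtered = [w for w in words if w not in stop_words]
print(filtered)
```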

4. Select nouns

jieba uses conventional part-of-speech tags: 'n' is a noun, 'a' an adjective, 'v' a verb. Nouns best represent the hot topics in the data, so we keep only nouns for further processing. The code to collect all nouns into a list is:

        for word, flag in word_list:
            if not word in stop_words and flag == 'n':
                commentlist.append(word)

5. Draw the word cloud by word frequency

Join all the nouns into a single string and pass it to WordCloud(); by default, WordCloud counts word frequencies internally and ranks the words. font_path takes a font file (required for Chinese characters), and mask takes an image whose shape the word cloud will fill:

    content = ' '.join(commentlist)
    wordcloud = WordCloud(font_path='simhei.ttf', background_color="grey",  mask=mask_image, max_words=40).generate(content)

Full code

    import jieba.posseg as pseg
    import matplotlib.pyplot as plt
    from os import path
    import requests
    from imageio import imread  # replacement for scipy.misc.imread, removed in SciPy 1.2
    from wordcloud import WordCloud
    from bs4 import BeautifulSoup

    # This program scrapes Douban book comments and extracts their keywords.

    def fetch_douban_comments():
        """Scrape the Douban comments page and write them to subjects.txt."""
        r = requests.get('https://book.douban.com/subject/1109968/comments/')
        soup = BeautifulSoup(r.text, 'lxml')
        pattern = soup.find_all('p', 'comment-content')
        with open('subjects.txt', 'w', encoding='utf-8') as f:
            for s in pattern:
                if s.string:                  # skip comments containing nested markup
                    f.write(s.string + '\n')  # one comment per line for readlines() below

    def extract_words():
        with open('subjects.txt', 'r', encoding='utf-8') as f:
            comment_subjects = f.readlines()
        # load stop words
        stop_words = set(line.strip() for line in open('stopwords.txt', encoding='utf-8'))
        commentlist = []
        for subject in comment_subjects:
            if subject.isspace():
                continue
            # segment words line by line
            word_list = pseg.cut(subject)
            for word, flag in word_list:
                if word not in stop_words and flag == 'n':
                    commentlist.append(word)
        d = path.dirname(__file__)
        mask_image = imread(path.join(d, "mickey.png"))
        content = ' '.join(commentlist)
        wordcloud = WordCloud(font_path='simhei.ttf', background_color="grey",
                              mask=mask_image, max_words=40).generate(content)
        # Display the generated image:
        plt.imshow(wordcloud)
        plt.axis("off")
        wordcloud.to_file('wordcloud.jpg')
        plt.show()

    if __name__ == "__main__":
        fetch_douban_comments()
        extract_words()

Results:

Since the chosen mask image was Mickey Mouse, the final word cloud takes that shape.
