Python: Scraping Douban Movie Comments + Sentiment Analysis and Visualization

Visual sentiment analysis of Douban movie reviews

  • Python: scraping Douban movie comments + sentiment analysis and visualization
    • Technical overview
    • Scraping Douban movie comment data with Python
    • Analyzing the Douban comment data
      • Viewing the data
      • Showing all columns when the display width is insufficient
      • Showing detailed information about the data
      • Inspecting missing values
      • Dropping missing values
      • Viewing data information
      • Statistics by date
      • Daily comment-count trend after the movie's release
      • Daily rating trend after the movie's release
      • Rating pie chart
      • Generating a word cloud
      • Reading the image back and displaying it
  • Summary

Python: Scraping Douban Movie Comments + Sentiment Analysis and Visualization

What is data analysis?
Data analysis is the process of using appropriate statistical methods and tools to organize, summarize, and digest large amounts of collected data, extract the valuable information in it, and uncover causal relationships, internal connections, and business patterns, so as to make the most of the data and reach well-founded conclusions.

Technical overview

  1. The Python language and a development environment (python, pycharm, anaconda)
  2. Python data-analysis libraries (numpy, pandas)
  3. A Python visualization library (matplotlib)
  4. A Chinese word-segmentation library (jieba)
  5. A word-cloud library (wordcloud); a quick import check of all of these follows below
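As a sanity check before starting, the sketch below simply imports each of the libraries above and prints its version; if any import fails, install that package first.

# quick environment check (sketch): everything below assumes these imports succeed
import numpy
import pandas
import matplotlib
import jieba
import wordcloud

for mod in (numpy, pandas, matplotlib, jieba, wordcloud):
    # some packages may not expose __version__, hence the fallback
    print(mod.__name__, getattr(mod, "__version__", "unknown"))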

Scraping Douban movie comment data with Python


#!/usr/bin/python3
# -*- coding:utf-8 -*-
# author: 恒仔仔

# ====================================================
# Description: scrape Douban movie comment data
# ====================================================

import urllib.request
from bs4 import BeautifulSoup
import random
import time
import csv
from tqdm import tqdm
import string


def getHTML(url, movieid):
    """Fetch the HTML of the given url."""
    id = movieid
    user_agents = list({
       'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
       'Opera/8.0 (Windows NT 5.1; U; en)',
       'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
       'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
       'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
       'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
       'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2 ',
       'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
       'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
       'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
       "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
       'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
       'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
       'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
       'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
       'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
       'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0) ',
       "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
       "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
       "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
       "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
       "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
       "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
       "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
       "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
       "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
       "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
       "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
       "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
       "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
       "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
       "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
       "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36",
       "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14"})

    headers = {
        # Note: with a logged-in account's cookie you can scrape at most about 500
        # comments; without logging in, at most about 200.
        # To avoid getting an account permanently banned, add your own IP proxy,
        # or stay logged out and scrape only a small sample for analysis.
        'Cookie': '你自己的cookie',  # your own cookie here
        'User-Agent': str(random.choice(user_agents)),
        'Referer': 'https://movie.douban.com/subject/' + id + '/comments?status=P',
        'Connection': 'keep-alive'
    }
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')

    return content
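
# Optional (sketch): the header comment above suggests adding an IP proxy.
# One way to wire a proxy into urllib is shown below; the address is a
# hypothetical placeholder, so the lines stay commented out by default.
# proxy_handler = urllib.request.ProxyHandler({
#     'http': 'http://127.0.0.1:8080',
#     'https': 'http://127.0.0.1:8080',
# })
# opener = urllib.request.build_opener(proxy_handler)
# urllib.request.install_opener(opener)  # subsequent urlopen calls use the proxy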


def getComment(url, movieid):
    """Parse one HTML page of comments."""
    html = getHTML(url, movieid)
    bs = BeautifulSoup(html, 'html.parser')

    # comment authors
    one_page_authors = []
    authors = bs.select(".comment-info a")
    for author in authors:
        one_page_authors.append(author.text)

    # comment text
    one_page_comments = []
    comments = bs.select(".comment .short")
    for comment in comments:
        # strip all ASCII punctuation and whitespace
        content_str = ''.join(c for c in comment.text if c not in string.punctuation) \
            .replace(" ", "").replace("\n", "")
        one_page_comments.append(content_str)

    # comment ratings
    one_page_rates = []
    rates = bs.select(".rating")
    for rate in rates:
        # the class list looks like ['allstar50', 'rating']; the digits encode the score
        rate_str = rate.get("class")[0]
        rate_score = int([c for c in rate_str if c.isdigit()][0])
        one_page_rates.append(rate_score)

    # rating titles (e.g. 推荐, 力荐)
    one_page_titles = []
    titles = bs.select(".rating")
    for title in titles:
        one_page_titles.append(title.get("title"))

    # comment dates
    one_page_dates = []
    dates = bs.select(".comment-time")
    for date in dates:
        one_page_dates.append(date.get("title"))

    # "useful" vote counts
    one_page_uses = []
    uses = bs.select(".votes")
    for u in uses:
        one_page_uses.append(u.text)

    return [one_page_authors, one_page_comments, one_page_rates, one_page_titles, one_page_dates, one_page_uses]
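
def strip_punctuation(text):
    """Optional fuller cleaner (sketch, not used above): string.punctuation
    covers only ASCII punctuation, so full-width Chinese punctuation such as
    ，。！ slips through the inline cleanup in getComment. The regex word
    class matches letters, digits and underscore (including CJK characters),
    so removing everything outside it strips punctuation and whitespace alike;
    underscores are dropped separately."""
    import re  # local import so the sketch stays self-contained
    return re.sub(r'[^\w]', '', text).replace('_', '')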


def generateURL(movieid):
    """Generate all of the URLs to be scraped."""
    urls = []
    id = movieid
    # positively rated comments
    page_number = 25
    for page in range(page_number):
        url = 'https://movie.douban.com/subject/' + id + '/comments?start=' + str(20 * page) + '&limit=20&sort=new_score&status=P&percent_type=h'
        urls.append(url)

    page_number = 25
    # neutrally rated comments
    for page in range(page_number):
        url = 'https://movie.douban.com/subject/' + id + '/comments?start=' + str(20 * page) + '&limit=20&sort=new_score&status=P&percent_type=m'
        urls.append(url)

    page_number = 25
    # negatively rated comments
    for page in range(page_number):
        url = 'https://movie.douban.com/subject/' + id + '/comments?start='+ str(20 * page) + '&limit=20&sort=new_score&status=P&percent_type=l'
        urls.append(url)

    page_number = 5
    # latest comments, sorted by time
    for page in range(page_number):
        url = 'https://movie.douban.com/subject/' + id + '/comments?start=' + str(20 * page) + '&limit=20&sort=time&status=P'
        urls.append(url)

    # "want to watch" comments (status=F)
    page_number = 25
    for page in range(page_number):
        url = 'https://movie.douban.com/subject/' + id + '/comments?start=' + str(20 * page) + '&limit=20&sort=new_score&status=F'
        urls.append(url)

    return urls


if __name__ == '__main__':
    file = open('movie.csv', mode="w", encoding="utf-8", newline="")
    csv_writer = csv.writer(file)
    movieid = str()  # fill in the movie's Douban id here
    # generate all URLs to scrape
    urls = generateURL(movieid)
    print(urls)
    times = list(range(8, 16))  # sleep 8-15 seconds between pages

    for url in tqdm(urls):
        print(url)
        # each URL is one page of comment data
        [authors, comments, rates, titles, dates, uses] = getComment(url, movieid)

        result_list = []
        # write the results to the file;
        # column order: author, comment, rate, title, date, uses
        for i in range(len(authors)):
            result_list.append([authors[i], comments[i], rates[i], titles[i], dates[i], uses[i]])
        csv_writer.writerows(result_list)

        time.sleep(random.choice(times))

    file.close()

Analyzing the Douban comment data

Import the required libraries

import jieba
import wordcloud
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

To fix matplotlib's rendering of Chinese characters (this setting works on Windows only):

plt.rcParams['font.sans-serif'] = ['SimHei']  # set the default font
plt.rcParams['axes.unicode_minus'] = False  # keep the minus sign '-' from rendering as a box in saved figures
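
On macOS or Linux, SimHei is usually not installed. One alternative (a sketch; the font path is an assumption, substitute any CJK-capable font file present on your system) is to register a font file explicitly:

from matplotlib import font_manager

# hypothetical macOS example; point this at a CJK-capable font on your machine
font_path = "/System/Library/Fonts/PingFang.ttc"
font_manager.fontManager.addfont(font_path)
font_name = font_manager.FontProperties(fname=font_path).get_name()
plt.rcParams['font.sans-serif'] = [font_name]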

Viewing the data

filepath = "movie.csv"
data = pd.read_csv(filepath, names=["date", "rate", "title", "uses", "name", "comment"], usecols=[0, 1, 2, 3, 4, 5])
display("数据集有{}条记录。".format(len(data)))

display(data.head())

Showing all columns when the display width is insufficient

Just set an option:

# if some columns are truncated in the output, raise the display limit
pd.set_option("display.max_columns", 20)
display(data.head())

Showing detailed information about the data

data.info() 

Inspecting missing values

data[data["rate"].isnull()].head()

Dropping missing values

data.dropna(axis=0, inplace=True)
data.sample(10)
data.info()

Viewing data information

display(data.columns)
data.describe()

Simple frequency statistics on the ratings

data["rate"].plot(kind="hist")
data["rate"].plot(kind="kde")
data.groupby("rate").size().plot(kind="bar")

Statistics by date

data["date"] = data["date"].apply(lambda x: str(x).split(" ")[0])
data.head()
data["title"].value_counts()

Enter whatever cutoff date you need:

data[data["date"] > "2020-10-08"].sample(30)
data[data["date"] > "2020-10-08"]["date"].value_counts()

Count the comments for each day:

data[data["date"] > "2019-07-26"].groupby("date").size().plot(kind="bar")

Daily comment-count trend after the movie's release

import pandas as pd
import matplotlib.pyplot as plt

# read the file; the column names match the order the crawler writes
df = pd.read_csv("c:/movie.csv",
                 names=["name", "comment", "rate", "title", "date", "uses"],
                 usecols=[0, 1, 2, 3, 4, 5])

# drop rows that contain null fields
df.dropna(axis=0, inplace=True)

# keep only the year-month-day part of the date, dropping the time of day
df["date"] = df["date"].apply(lambda x: str(x).split(" ")[0])
df["count"] = 1

# keep only comments posted after the release date
df1 = df[df["date"] > "2019-10-08"]

# count comments per day, sorted by count:
# df_result = df1.groupby("date")["count"].agg(["count"]).sort_values("count", ascending=False)
# count comments per day, in date order
df_result = df1.groupby("date")["count"].agg(["count"])

df_result.plot(kind='bar')
plt.show()
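
If the raw daily bars look noisy, a moving average smooths the trend. A short sketch continuing from df_result above:

# optional smoothing (sketch): 7-day moving average of the daily comment counts
df_result["7-day avg"] = df_result["count"].rolling(7, min_periods=1).mean()
df_result.plot()
plt.show()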

Daily rating trend after the movie's release

["rate"].agg(["mean"]).sort_values("mean", ascending=False)
# 统计每天评论的平均分
df_result = df1.groupby("date")["rate"].agg(["mean"])

df_result.plot(kind='bar')
plt.show()

Rating pie chart

# tally how many comments gave each rating
# (assumption: the pie should show the rating distribution, so this
# aggregation replaces the per-day df_result from the previous step)
df_result = df.groupby("rate")["count"].agg(["count"])

# draw the pie chart
df_result.plot.pie(subplots=True, figsize=(6, 6), fontsize=18, counterclock=False, startangle=-270)
plt.title("Rating pie chart", fontsize=16, fontweight="bold")
plt.ylabel("", fontsize=12, fontweight="bold")
plt.show()

Generating a word cloud

filepath = "c:/nezha.csv"
file = open(filepath, mode="r", encoding="utf-8")
content = file.read().replace("推荐", "").replace("力荐", "")
file.close()

# 分词,并生成词云图
ls = jieba.lcut(content)
txt = " ".join(ls)
w = wordcloud.WordCloud(font_path='c:\windows\Fonts\STZHONGS.TTF', width=1200, height=500, background_color='white')
w.generate(txt)
w.to_file('movie.png')
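
Before trusting the cloud, it can help to inspect the most frequent tokens directly. A short sketch reusing the ls token list from above; single-character tokens are filtered out since they are mostly noise:

from collections import Counter

# top 20 tokens of length > 1, reusing `ls` from the block above
words = [t for t in ls if len(t) > 1]
for word, freq in Counter(words).most_common(20):
    print(word, freq)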

Reading the image back and displaying it

import matplotlib.image as imgplt
x = imgplt.imread("movie.png")
plt.imshow(x)
plt.axis("off")  # hide the axes around the image
plt.show()
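
The title promises sentiment analysis, and the word cloud above only hints at it. A minimal sketch using the SnowNLP library (an assumption: SnowNLP is not part of the stack listed earlier and needs a pip install snownlp first) scores each comment between 0 (negative) and 1 (positive) and plots the distribution:

from snownlp import SnowNLP

# score every comment: SnowNLP's .sentiments is the probability that the
# text is positive, between 0 (negative) and 1 (positive)
texts = data["comment"].astype(str)
data["sentiment"] = texts.apply(lambda t: SnowNLP(t).sentiments if t.strip() else None)

# distribution of sentiment scores across all comments
data["sentiment"].plot(kind="hist", bins=20, title="Sentiment score distribution")
plt.show()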

Summary

The data-analysis workflow

  1. Define the goal and approach / form a hypothesis
  2. Collect the data
    business data from databases + log data + public publications + the internet + market research
  3. Process / organize the data
    Data processing means cleaning and reshaping the large volume of collected data into a form suitable for analysis.
    It mainly covers data cleaning, data transformation, data extraction, and data computation.
  4. Analyze the data / test the hypothesis
    Python's Numpy/Pandas, SPSS/SAS, Matlab/R, RDBMS/MySQL/Hive
  5. Present the data / visualize it
    bar charts, line charts, scatter plots, pie charts, horizontal bar charts, radar charts, maps, heat maps, bubble charts, area charts…
  6. Write the report
    Three requirements: a sound analysis framework + clear conclusions + recommendations/solutions
