Preface:
Tools: Python 3.7, VS Code, Chrome
Install beautifulsoup4, jieba, and wordcloud with pip (pip install <package>); urllib is part of the Python standard library and needs no installation.
1. Analyzing the Douban pages
First, let's take a look at Douban's search page.
Looking at the navigation bar on the left together with the URL, we find that the value after cat and the book or movie title after q drive the search results. The pattern is:
Books (读书): 1001
Movies (电影): 1002
Music (音乐): 1003
For example, https://www.douban.com/search?cat=1001&q=看见 searches the book category for 看见.
Viewing the page source (F12) shows that everything we need sits under <a> tags. Thanks to Douban's ranking, we can simply take the top search result as the item to crawl; all we need from it is its sid, and the crawler for the detail page will take care of the rest.
The source code is given below:
import ssl
import string
import urllib
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup


def create_url(keyword: str, kind: str) -> str:
    '''
    Create url through keywords
    Args:
        keyword: the keyword you want to search
        kind: a string indicating the kind of search result
            type: 读书; num: 1001
            type: 电影; num: 1002
            type: 音乐; num: 1003
    Returns: url
    '''
    num = ''
    if kind == '读书':
        num = 1001
    elif kind == '电影':
        num = 1002
    elif kind == '音乐':
        num = 1003
    url = 'https://www.douban.com/search?cat=' + \
        str(num) + '&q=' + keyword
    return url


def get_html(url: str) -> str:
    '''send a request'''
    headers = {
        # 'Cookie': your cookie,
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
        'Connection': 'keep-alive'
    }
    # skip HTTPS certificate verification
    ssl._create_default_https_context = ssl._create_unverified_context
    s = urllib.parse.quote(url, safe=string.printable)  # safe: characters that must not be percent-encoded
    req = urllib.request.Request(url=s, headers=headers)
    response = urllib.request.urlopen(req)
    content = response.read().decode('utf-8')
    return content
def get_content(keyword: str, kind: str) -> str:
    '''
    Fetch the search page and return the first result
    Args:
        keyword: the keyword you want to search
        kind: a string indicating the kind of search result
            type: 读书; num: 1001
            type: 电影; num: 1002
            type: 音乐; num: 1003
    Returns: the html string of the first <h3> search result
    '''
    url = create_url(keyword=keyword, kind=kind)
    html = get_html(url)
    # print(html)
    soup_content = BeautifulSoup(html, 'html.parser')
    contents = soup_content.find_all('h3', limit=1)
    result = str(contents[0])
    return result
def find_sid(raw_str: str) -> str:
    '''
    find sid in raw_str
    Args:
        raw_str: a html info string that contains the sid
    Returns:
        sid
    '''
    assert type(raw_str) == str, \
        '''the type of raw_str must be str'''
    start_index = raw_str.find('sid:')
    sid = raw_str[start_index + 5: start_index + 13]
    sid = sid.strip(', ')  # strip() returns a new string, so assign it back
    return sid


if __name__ == "__main__":
    raw_str = get_content('看见', '读书')
    print(find_sid(raw_str))
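The slice in find_sid assumes the sid starts exactly five characters after 'sid:' and is at most eight characters long. If you would rather not depend on fixed offsets, a regular expression is more robust; a small sketch under the same assumption that the result's HTML contains a 'sid: <digits>' fragment (find_sid_re is just an illustrative name):
import re

def find_sid_re(raw_str: str) -> str:
    '''Extract the sid with a regular expression instead of fixed offsets.'''
    match = re.search(r'sid:\s*(\d+)', raw_str)
    assert match is not None, 'no sid found in raw_str'
    return match.group(1)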
This gives us the sid that uniquely identifies the book (or movie).
Next, let's look at the page we actually want to crawl and inspect its source (F12).
A quick look shows that the comments we need all sit under <span class="short"> tags, while the author, timestamp, and star rating we also want are hidden in a few other child tags. The code is as follows:
comments = soupComment.findAll('span', 'short')
time = soupComment.select( '.comment-item > div > h3 > .comment-info > span:nth-of-type(2)')
name = soupComment.select('.comment-item > div > h3 > .comment-info > a')
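To show how these selectors fit together, here is a minimal sketch that parses one comment page into (author, time, comment) records. It simply reuses the three selectors above and assumes they return entries in matching order; the star rating is left out because its selector is not shown here, and parse_comment_page is only an illustrative helper name:
from bs4 import BeautifulSoup

def parse_comment_page(page_html: str):
    '''page_html: the HTML of one comment page, fetched as in the crawler below.'''
    soup = BeautifulSoup(page_html, 'html.parser')
    comments = soup.find_all('span', 'short')
    times = soup.select('.comment-item > div > h3 > .comment-info > span:nth-of-type(2)')
    names = soup.select('.comment-item > div > h3 > .comment-info > a')
    # zip relies on the three lists lining up one-to-one per comment
    return [(n.get_text(strip=True), t.get_text(strip=True), c.get_text(strip=True))
            for n, t, c in zip(names, times, comments)]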
Comment page 1 url: https://book.douban.com/subject/20427187/comments/hot?p=1
Comment page 2 url: https://book.douban.com/subject/20427187/comments/hot?p=2
...
Comment page n url: https://book.douban.com/subject/20427187/comments/hot?p=n
Paging through the comments reveals the URL pattern: only the value after p changes from page to page.
2. Scraping the Douban comment data
We need to disguise the crawler with request headers to get past the site's anti-crawling checks:
headers = {
    # 'Cookie': your cookie,
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
    'Referer': 'https://movie.douban.com/subject/20427187/comments?status=P',
    'Connection': 'keep-alive'
}
As for the cookie, log in to your Douban account in the browser first, then look for it under F12 -> Network -> All -> Headers.
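If you do add a cookie, one way to keep it out of the script itself is to read it from an environment variable and merge it into the headers at run time; a small sketch, where the variable name DOUBAN_COOKIE is purely hypothetical:
import os

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
    'Connection': 'keep-alive'
}
# DOUBAN_COOKIE is an example environment variable holding the raw cookie string
cookie = os.environ.get('DOUBAN_COOKIE')
if cookie:
    headers['Cookie'] = cookie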
The crawler code is as follows:
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import time
import jieba
import wordcloud
import crawler_tools
def creat_url(num):
    '''Build the list of comment-page urls for the given sid.'''
    urls = []
    for page in range(1, 20):
        url = 'https://book.douban.com/subject/' + \
            str(num) + '/comments/hot?p=' + str(page)
        urls.append(url)
    print(urls)
    return urls


def get_html(urls):
    '''Fetch every comment page and return a list of html strings.'''
    headers = {
        # 'Cookie': your cookie,
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
        'Connection': 'keep-alive'
    }
    contents = []
    for url in urls:
        print('Crawling: ' + url)
        req = urllib.request.Request(url=url, headers=headers)
        response = urllib.request.urlopen(req)
        contents.append(response.read().decode('utf-8'))
        time.sleep(10)  # pause between requests to avoid hammering the site
    return contents


def get_comment(num):
    urls = creat_url(num)
    all_comments = []
    for html in get_html(urls):
        soupComment = BeautifulSoup(html, 'html.parser')
        comments = soupComment.findAll('span', 'short')
        for comment in comments:
            # print(comment.getText()+'\n')
            all_comments.append(comment.getText() + '\n')
    print(all_comments)
    with open('数据.txt', 'a', encoding='utf-8') as f:
        for sentence in all_comments:
            f.write(sentence)


raw_str = crawler_tools.get_content('看见', '读书')
sid = crawler_tools.find_sid(raw_str)
print('sid:' + sid)
get_comment(sid)
3. Data cleaning, feature extraction, and word-cloud display
First we segment the text with jieba and use the library's built-in TF-IDF algorithm to weight the resulting terms.
Then we generate the word cloud with the wordcloud library; the main parameters are:
font_path='FZQiTi-S14S.TTF',   # font
max_words=66,                  # maximum number of words to display
max_font_size=600,             # maximum font size
random_state=666,              # random seed
width=1400, height=900,        # image size
background_color='black',      # background color
stopwords=stopWords_list       # stopword list
Let's compare the word cloud built from the processed data with a plain one.
The data-processing code is as follows:
import jieba
import jieba.analyse
import wordcloud

# read the comments scraped earlier
f = open('/Users/money666/Desktop/The_new_crawler/看见.txt',
         'r', encoding='utf-8')
contents = f.read()
f.close()

# read the stopword file and turn it into a list
stopWords_dic = open(
    '/Users/money666/Desktop/stopwords.txt', 'r', encoding='gb18030')
stopWords_content = stopWords_dic.read()
stopWords_list = stopWords_content.splitlines()
stopWords_dic.close()

# let jieba's TF-IDF implementation pick the 75 most significant terms
keywords = jieba.analyse.extract_tags(
    contents, topK=75, withWeight=False)
print(keywords)

w = wordcloud.WordCloud(background_color='black',
                        font_path='/Users/money666/Desktop/字体/粗黑.TTF',
                        width=1400, height=900, stopwords=stopWords_list)
txt = ' '.join(keywords)
w.generate(txt)
w.to_file('/Users/money666/Desktop/The_new_crawler/看见.png')
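Note that the script above discards the TF-IDF weights (withWeight=False) and lets generate() recount frequencies from the joined keyword string. If you want the word sizes to follow the TF-IDF weights directly, you can feed them to the cloud as frequencies; a minimal sketch under that assumption, reusing the contents and stopWords_list variables from above (weighted_cloud is just an illustrative helper name):
import jieba.analyse
import wordcloud

def weighted_cloud(contents, stop_words, out_path):
    '''Build a word cloud whose word sizes follow the TF-IDF weights.'''
    # keep each keyword's TF-IDF weight instead of discarding it
    pairs = jieba.analyse.extract_tags(contents, topK=75, withWeight=True)
    freqs = {word: weight for word, weight in pairs if word not in stop_words}
    w = wordcloud.WordCloud(font_path='/Users/money666/Desktop/字体/粗黑.TTF',
                            background_color='black', width=1400, height=900)
    w.generate_from_frequencies(freqs)  # size words by the supplied weights
    w.to_file(out_path)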
4. Problems and solutions
1. pip timeout
Method 1: create or edit the pip.conf configuration file:
$ sudo vi ~/.pip/pip.conf
[global]
timeout = 500    # pip timeout in seconds
Method 2: use a domestic mirror
Replace the official index with a mirror, as follows:
1. pip install redis -i https://pypi.douban.com/simple
   -i: specifies the mirror address
2. Create or edit pip.conf to point at the mirror:
[global]
timeout =6000
index-url = http://pypi.douban.com/simple/
[install]
use-mirrors =true
mirrors = http://pypi.douban.com/simple/
trusted-host = pypi.douban.com
Note: pip.conf may live in any of several locations (create it if it does not exist), and its location can also be set through an environment variable:
Linux:
  /etc/pip.conf
  ~/.pip/pip.conf
  ~/.config/pip/pip.conf
Windows:
  %APPDATA%\pip\pip.ini
  %HOME%\pip\pip.ini
  C:\Documents and Settings\All Users\Application Data\PyPA\pip\pip.conf (Windows XP)
  C:\ProgramData\PyPA\pip\pip.conf (Windows 7 and later)
macOS:
  ~/Library/Application Support/pip/pip.conf
  ~/.pip/pip.conf
  /Library/Application Support/pip/pip.conf