Greetings, fellow developers. If you have better ideas, please don't hesitate to share them.
After a day and a half of tinkering, I finally solved my Douban scraping problem,
specifically the part where Douban demands a login and a captcha before it will keep serving data.
You can download the resulting data here:
"""
Link: https://pan.baidu.com/s/1StbBu4DDh0dQAwf8Ph5I9g   Extraction code: up6r
"""
The full code is available on my GitHub.
"""
Let's begin
"""
Here is the backstory: I have a media-asset table full of film records, with fields such as director, cast, and genre. Unfortunately the table is riddled with missing values, and it is not linked to up-to-date Douban ratings.
To build relationships between films I need to fill in as many of these attributes as possible, so scraping Douban to enrich the table became the obvious solution.
The plan: for each film name in my database, search Douban, keep the top-k best-matching entries, and download their posters, ratings, release years, and so on.
I did consider the mighty Scrapy framework, but since I query titles one at a time the framework would not buy me much, and I am still new to it (two days with a book plus a few blog posts). So I set the big gun aside and built what I need with requests and XPath.
It turns out that if you request "https://movie.douban.com/j/subject_suggest?q=" + "movie name", Douban returns information about that film, or about films whose titles look similar.
For example, searching for Terminator: https://movie.douban.com/j/subject_suggest?q=终结者
Reading the JSON this endpoint returns, it is easy to pick out the poster address ("img") and the link to the film's detail page ("url").
Take the first entry as an example:
{"episode":"",
"img":"https://img3.doubanio.com\/view\/photo\/s_ratio_poster\/public\/p1910909085.webp",
"title":"终结者2:审判日",
"url":"https:\/\/movie.douban.com\/subject\/1291844\/?suggest=%E7%BB%88%E7%BB%93%E8%80%85",
"type":"movie",
"year":"1991",
"sub_title":"Terminator 2: Judgment Day",
"id":"1291844"},
So my task becomes fairly simple:
1. Combine mainURL = "https://movie.douban.com/j/subject_suggest?q=" with the movie name to form the request URL mainURL + NAME.
2. requests.get each URL and take the response. The body is JSON; parsing it yields the poster address and the detail-page link m_url.
3. Lightly normalize the poster address and detail-page link into regular URLs (the content stays the same; here it is enough to replace every "\/" with "/").
4. Save the poster straight to disk with urllib.request.urlretrieve(img, filename).
5. requests.get(m_url) to fetch the detail page and pull out the film information. Looking at the page source, the film's own metadata is wrapped in a <script type="application/ld+json"> block, while the related films are listed inside div[@class="recommendations-bd"].
With XPath, both can be extracted by the following code:
# collect film information from a detail page
# (User_Agents is a list of User-Agent strings and ipGet() a proxy-pool helper defined elsewhere in my project)
def summaryCraw(url):
    while True:
        try:
            time.sleep(1)
            mresponse = requests.get(url, headers={'User-Agent': random.choice(User_Agents)},
                                     proxies=ipGet()).content
            break
        except:
            time.sleep(300)
            print("summaryCraw try to reconnect....")
    if len(mresponse):
        html = etree.HTML(mresponse, parser=etree.HTMLParser(encoding='utf-8'))
        # the film's own metadata, embedded as JSON-LD
        mInfo = html.xpath('//script[@type="application/ld+json"]/text()')
        # titles of the recommended films, taken from the img alt attribute
        recommendation_m = html.xpath('//div[@class="recommendations-bd"]//img/@alt')
    else:
        return 0
    return mInfo, recommendation_m
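The mInfo returned above is the raw JSON-LD text of that <script> block. A small sketch of turning it into usable fields follows; the field names are the schema.org Movie properties Douban usually emits (name, datePublished, genre, director, aggregateRating), so treat them as assumptions rather than a guaranteed schema:

import json

def parse_ldjson(mInfo):
    # mInfo is the list returned by the xpath above; use the first <script> block if present
    if not mInfo:
        return {}
    data = json.loads(mInfo[0], strict=False)   # strict=False tolerates stray control characters
    return {
        'name': data.get('name'),
        'year': data.get('datePublished'),
        'genres': data.get('genre'),
        'directors': [d.get('name') for d in data.get('director', [])],
        'rating': (data.get('aggregateRating') or {}).get('ratingValue'),
    }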
How do we cope with Douban's anti-scraping measures?
Once the request count climbs, Douban starts demanding a login before it will serve the information above, and if the request rate gets too high the IP may be banned.
Several people have tackled similar problems before, for example:
https://blog.csdn.net/qingminxiehui/article/details/81671161
https://blog.csdn.net/loner_fang/article/details/81132571
Some anti-scraping strategies: https://www.cnblogs.com/micro-chen/p/8676312.html
Scraping, anti-scraping, and anti-anti-scraping: https://www.zhihu.com/question/28168585
Broadly speaking, sites defend themselves by checking request headers, throttling or banning busy IPs, forcing logins, and serving captchas; scrapers answer with rotated User-Agents, randomized delays, proxy IPs, and multiple accounts.
My approach here is to log in before scraping, route requests through proxy IPs, and rotate among several accounts.
The Douban login request carries a form payload (including a captcha field when one is shown).
We pack the account credentials into a form of that shape and submit it with Session.post to log in.
If a captcha appears, the captcha image is saved locally, typed in by hand, added to the same form, and then submitted.
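The original screenshot of that request is not reproduced here; the field names below are taken from the login code further down, so the form looks roughly like this:

# form submitted to https://www.douban.com/accounts/login
data = {
    "source": "None",
    "redir": "https://www.douban.com",
    "form_email": email,                 # account name
    "form_password": passwd,             # account password
    "captcha-solution": captcha_value,   # only when a captcha was shown
    "captcha-id": captcha_id,            # only when a captcha was shown
    "login": "登录",
}
session.post(url, data=data, headers={'User-Agent': agents}, proxies=proxies)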
Putting it all together, the concrete steps are:
1. Build a proxy IP pool.
"""
1. Scrape proxy IPs from the XiciDaili site.
2. Check each scraped IP against a chosen target URL to confirm it still works.
3. Save the working IPs to the given path.
4. Turn the stored IPs into the proxies argument used by requests.get; proxies is a dict such as {'http': 'http://122.114.31.177:808'}.
"""
Code adapted from: https://blog.csdn.net/OnlyloveCuracao/article/details/80968233
import threading
import random
import requests
import datetime
from bs4 import BeautifulSoup

# ---------------------------------- file helpers ----------------------------------
# append a line of text to the file
def write(path, text):
    with open(path, 'a', encoding='utf-8') as f:
        f.writelines(text)
        f.write('\n')

# empty the file
def truncatefile(path):
    with open(path, 'w', encoding='utf-8') as f:
        f.truncate()

# read the file into a list of stripped lines
def read(path):
    with open(path, 'r', encoding='utf-8') as f:
        txt = []
        for s in f.readlines():
            txt.append(s.strip())
    return txt

# -----------------------------------------------------------------------------------
# elapsed time between start and end, formatted as hh:mm:ss
def gettimediff(start, end):
    seconds = (end - start).seconds
    m, s = divmod(seconds, 60)
    h, m = divmod(m, 60)
    diff = ("%02d:%02d:%02d" % (h, m, s))
    return diff

# -----------------------------------------------------------------------------------
# return a random request header
def getheaders():
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]  # note: the original listing was missing a comma after the first entry, which silently merged two strings
    UserAgent = random.choice(user_agent_list)
    headers = {'User-Agent': UserAgent}
    return headers

# ---------------------------- check whether an ip works ----------------------------
def checkip(targeturl, ip):
    headers = getheaders()                                        # random request header
    proxies = {"http": "http://" + ip, "https": "http://" + ip}   # proxy ip
    try:
        response = requests.get(url=targeturl, proxies=proxies, headers=headers, timeout=5).status_code
        if response == 200:
            return True
        else:
            return False
    except:
        return False

# --------------------------------- fetch proxy ips ---------------------------------
# free proxies from XiciDaili
def findip(type, pagenum, targeturl, path):  # ip type, page number, target url, output path
    list = {'1': 'http://www.xicidaili.com/nt/',  # domestic regular proxies
            '2': 'http://www.xicidaili.com/nn/',  # domestic high-anonymity proxies
            '3': 'http://www.xicidaili.com/wn/',  # domestic https proxies
            '4': 'http://www.xicidaili.com/wt/'}  # foreign http proxies
    url = list[str(type)] + str(pagenum)          # build the page url
    headers = getheaders()                        # random request header
    html = requests.get(url=url, headers=headers, timeout=5).text
    soup = BeautifulSoup(html, 'lxml')
    all = soup.find_all('tr', class_='odd')
    for i in all:
        t = i.find_all('td')
        ip = t[1].text + ':' + t[2].text
        is_avail = checkip(targeturl, ip)
        if is_avail == True:
            write(path=path, text=ip)
            print(ip)

# ---------------------------- multithreaded entry point ----------------------------
def getip(targeturl, path):
    truncatefile(path)                  # clear the file before crawling
    start = datetime.datetime.now()     # start time
    threads = []
    for type in range(4):               # 4 ip types, first 3 pages each: 12 threads in total
        for pagenum in range(3):
            t = threading.Thread(target=findip, args=(type + 1, pagenum + 1, targeturl, path))
            threads.append(t)
    print('start fetching proxy ips')
    for s in threads:                   # start all threads
        s.start()
    for e in threads:                   # wait for every thread to finish
        e.join()
    print('finished fetching')
    end = datetime.datetime.now()       # end time
    diff = gettimediff(start, end)      # elapsed time
    ips = read(path)                    # read back the ips that were stored
    print('fetched %s proxy ips in %s \n' % (len(ips), diff))

# ---------------------------------------- run ---------------------------------------
if __name__ == '__main__':
    path = 'ip.txt'                                 # file that stores the scraped ips
    targeturl = 'http://www.cnblogs.com/TurboWay/'  # url used to validate ip availability
    getip(targeturl, path)
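Once ip.txt exists, each request can draw a random proxy from it. A tiny usage sketch of my own (the full spider below does the same thing in its getip() helper):

import random
import requests

with open('ip.txt', encoding='utf-8') as f:
    proxy_pool = [{'http': 'http://' + line.strip()} for line in f if line.strip()]

proxies = random.choice(proxy_pool)
resp = requests.get('https://movie.douban.com/j/subject_suggest?q=终结者',
                    headers={'User-Agent': 'Mozilla/5.0'}, proxies=proxies, timeout=10)
print(resp.status_code)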
2. Register several accounts.
Each login randomly uses one of them.
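A minimal sketch of that rotation (the credentials here are placeholders, mirroring the name_pass dict in the login code below):

import random

# account -> password; placeholder values, not real credentials
name_pass = {
    'user1@example.com': 'password1',
    'user2@example.com': 'password2',
}
email = random.choice(list(name_pass))
passwd = name_pass[email]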
3. Space out logins and requests.
Every request is preceded by a short random delay, and failures back off for a while before retrying:
while True:
    try:
        time.sleep(random.random() * 5)   # random delay before each request
        # ... do the actual scraping here ...
        break
    except:
        time.sleep(300)                   # back off before retrying, same pattern as summaryCraw above
4. Use a different header for each login to imitate different browsers.
I use the open-source package fake_useragent:
from fake_useragent import UserAgent
ua = UserAgent(verify_ssl=False)
# if ua = UserAgent() raises an error in your environment, add verify_ssl=False
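Each request then simply draws a fresh User-Agent string, which is exactly how the spider below uses it:

agents = ua.random                   # a random browser User-Agent string
headers = {'User-Agent': agents}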
The code for logging in, saving the captcha image, and entering it by hand is as follows:
# -*- coding: utf-8 -*-
# @Time    : 19-1-16 2:35 PM
# @Author  : jayden.zheng
# @FileName: login.py
# @Software: PyCharm
# @content :
import requests
import urllib.request
from lxml import etree
import random

# The proxy ip pool has to be built beforehand;
# every login goes out through a different proxy ip.
def login(session, agents, proxies):
    # account -> password pairs (redacted placeholders)
    name_pass = {'@163.com': '98', '@qq.com': 'w534', '@qq.com': 'q98', '18300@qq.com': 'qw98'}
    url = 'https://www.douban.com/accounts/login'
    #proxies_ = ipGet()
    captchav, captchai = get_captcha(url, agents, proxies)
    email = random.choice(list(name_pass))   # pick one account at random
    passwd = name_pass[email]
    if len(captchav) > 0:
        # a captcha is present: save the image locally and type it in by hand
        urllib.request.urlretrieve(captchav[0], "/home/lenovo/dev/captcha.jpg")
        captcha_value = input('Open captcha.jpg and enter the captcha: ')
        print('captcha entered:', captcha_value)
        data = {
            "source": "None",
            "redir": "https://www.douban.com",
            "form_email": email,
            "form_password": passwd,
            "captcha-solution": captcha_value,
            "captcha-id": captchai[0],
            "login": "登录",
        }
    else:
        # no captcha this time
        print('no captcha')
        data = {
            "source": "None",
            "redir": "https://www.douban.com",
            "form_email": email,
            "form_password": passwd,
            "login": "登录",
        }
    print('logging in......')
    #print('account: %20s password: %20s' % (email, passwd))
    return session.post(url, data=data, headers={'User-Agent': agents}, proxies=proxies)

# fetch the captcha image url and captcha id from the login page
def get_captcha(url, agents, proxies):
    html = requests.get(url, headers={'User-Agent': agents}, proxies=proxies).content
    html = etree.HTML(html, parser=etree.HTMLParser(encoding='utf-8'))
    captcha_value = html.xpath('//*[@id="captcha_image"]/@src')
    captcha_id = html.xpath('//*[@name="captcha-id"]/@value')
    return captcha_value, captcha_id
And the scraping workflow itself:
# -*- coding: utf-8 -*-
# @Time    : 19-1-15 2:53 PM
# @Author  : jayden.zheng
# @FileName: spider.py
# @Software: PyCharm
# @content :
from lxml import etree
import requests
import time
import urllib3
import warnings
import random
import urllib.request
import json                       # used to parse the suggest endpoint's response
import pandas as pd
from login import login
from fake_useragent import UserAgent

# silence warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
ua = UserAgent(verify_ssl=False)

from wsInput import dataInput     # loads the media-asset table described above
movie = dataInput()
mainUrl = "https://movie.douban.com/j/subject_suggest?q="

# load the proxy ip pool produced by the proxy script above
def getip():
    ipdf = pd.read_csv('ip.txt', header=None)
    ipdf.columns = ['ip']
    ipdf['ip'] = 'http://' + ipdf['ip']
    ipdf['ip'] = ipdf['ip'].apply(lambda x: {'http': x})
    return ipdf['ip']

# save the poster image locally
def outputImg(c2code, img):
    imgTmp = img.split('.')[-1].lower()
    if imgTmp == 'jpg':
        filename = "/home/lenovo/dev/data/Douban/img/" + str(c2code) + ".jpg"
    elif imgTmp == 'png':
        filename = "/home/lenovo/dev/data/Douban/img/" + str(c2code) + ".png"
    elif imgTmp == 'jpeg':
        filename = "/home/lenovo/dev/data/Douban/img/" + str(c2code) + ".jpeg"
    else:
        filename = "/home/lenovo/dev/data/Douban/img/" + str(c2code) + ".pdf"
    urllib.request.urlretrieve(img, filename)

# collect film information from a detail page
def summaryCraw(url, agents, proxies):
    while True:
        try:
            time.sleep(random.random() * 5)
            mresponse = requests.get(url, headers={'User-Agent': agents}, proxies=proxies).content
            break
        except:
            time.sleep(200)
            print("summaryCraw try to reconnect....")
    if len(mresponse):
        html = etree.HTML(mresponse, parser=etree.HTMLParser(encoding='utf-8'))
        mInfo = html.xpath('//script[@type="application/ld+json"]/text()')
        recommendation_m = html.xpath('//div[@class="recommendations-bd"]//img/@alt')
    else:
        return 0
    return mInfo, recommendation_m

# pick the matching candidates, store their posters and detail-page info
def infoCompare(response, qstr, movie, agents, proxies):
    imageDf = pd.DataFrame()
    c2code = movie[movie['NAME'] == qstr]['C2CODE'].iloc[0]
    imageLst = []
    movieUrlLst = []
    yearLst = []
    nameLst = []
    minfoLSt = []
    recommendationLst = []
    candidates = json.loads(response)   # safer than eval() on the raw response text
    for item in candidates:
        # keep candidates whose title contains the query and that carry all the needed fields,
        # or the only candidate when there is exactly one
        if (qstr in item['title'] and all(k in item for k in ('img', 'url', 'year'))) \
                or len(candidates) == 1:
            img = item['img'].replace('\\/', '/')   # a no-op after json.loads, kept as a safeguard
            url = item['url'].replace('\\/', '/')
            nameLst.append(item['title'])
            imageLst.append(img)
            movieUrlLst.append(url)
            yearLst.append(item.get('year', ''))
            print("img store ------------")
            print(img)
            outputImg(c2code, img)
            mInfo, recommendation_m = summaryCraw(url, agents, proxies)
            minfoLSt.append(mInfo)
            recommendationLst.append(recommendation_m)
            c2code = c2code + "_"   # suffix so later matches do not overwrite the first poster
        else:
            continue
    imageDf['name'] = nameLst
    imageDf['img'] = imageLst
    imageDf['url'] = movieUrlLst
    imageDf['year'] = yearLst
    imageDf['mInfo'] = minfoLSt
    imageDf['recommendation_m'] = recommendationLst
    return imageDf

# fetch the suggest json for one film name
def targetUrl(mainUrl, qstr, agents, proxies):
    while True:
        try:
            time.sleep(random.random() * 3)
            response = requests.get(mainUrl + qstr, headers={'User-Agent': agents}, proxies=proxies).text
            break
        except:
            time.sleep(120)
            print("targetUrl try to reconnect....")
    return response

# main loop: log in, query the suggest endpoint for every film, collect the results
def urlCollect(mainUrl, movie, ipLst):
    movieLst = movie['NAME']
    mlen = len(movie)
    urlDf = pd.DataFrame()
    crawlcount = 1
    for qstr in movieLst:
        print("\ncraw times %d / %d ---@--- %s -------------->" % (crawlcount, mlen, qstr))
        session = requests.Session()
        agents = ua.random                 # fresh User-Agent for this round
        proxies = random.choice(ipLst)     # fresh proxy ip for this round
        req = login(session, agents, proxies)
        print(req)
        # NOTE: to actually reuse the login cookies, the follow-up requests would have to go
        # through this same session (session.get) rather than the bare requests.get used below.
        response = targetUrl(mainUrl, qstr, agents, proxies)
        crawlcount = crawlcount + 1
        if len(response):
            imageDf = infoCompare(response, qstr, movie, agents, proxies)
            urlDf = pd.concat([urlDf, imageDf], sort=True)
            if crawlcount % 100 == 0:
                urlDf.to_csv("urlDf.csv")  # checkpoint every 100 films
        else:
            continue
    return urlDf

ipLst = getip()
urlDf = urlCollect(mainUrl, movie, ipLst)
print(urlDf)
urlDf.to_csv("urlDf.csv")
The full code is available on my GitHub.