1. The target URL: the Baidu Baike page for Youth With You Season 2 (青春有你2)
https://baike.baidu.com/item/%E9%9D%92%E6%98%A5%E6%9C%89%E4%BD%A0%E7%AC%AC%E4%BA%8C%E5%AD%A3
2. Writing the crawler in PyCharm
Before writing any code, let's briefly walk through the crawling flow.
First, in the contestant list every name is shown in blue, i.e. it carries a hyperlink; we need to extract that link's URL, since clicking through leads to the contestant's detail page.
Second, each detail page shows the contestant's portrait on the right. (Some contestants have no detail info, or their page doesn't match the layout this generic crawler expects, which raises an error; wrapping the download in try/except handles that.) We only need to download this image.
Finally, the crawler downloads every contestant's image in one batch and stores them together in one place.
(1) Take the URL above and issue a GET request with the requests module to fetch the list page:
url = 'https://baike.baidu.com/item/%E9%9D%92%E6%98%A5%E6%9C%89%E4%BD%A0%E7%AC%AC%E4%BA%8C%E5%AD%A3'
page_text = requests.get(url=url, headers=headers).text
(2) With the page fetched, open Chrome, right-click the page and choose Inspect to find which HTML tags the contestant entries live under.
(3) Use XPath to extract the href attribute of each name link. The extracted href is relative, so by itself it cannot open the contestant's detail page; prepend https://baike.baidu.com to form the absolute URL.
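Step (3) builds the absolute URL by string concatenation; the standard library's `urljoin` does the same thing and also handles edge cases such as an href that is already absolute. (The example href below is made up for illustration.)

```python
from urllib.parse import urljoin

base = 'https://baike.baidu.com'
href = '/item/SomeContestant'   # a relative href as returned by the XPath
full_url = urljoin(base, href)
print(full_url)                 # https://baike.baidu.com/item/SomeContestant
```

If the site ever returns an absolute href, `urljoin` passes it through unchanged instead of producing a malformed double-prefixed URL.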
parser = etree.HTMLParser(encoding="utf-8")  # optional; etree.HTML also works without an explicit parser
tree = etree.HTML(page_text, parser=parser)
tr_list_obj = tree.xpath('/html/body/div[3]/div[3]/div/div[1]/div/table[7]/tr')[1:]  # no tbody in the XPath; tr_list_obj is a list -- print it to inspect
(4) Batch-collect every contestant's detail-page URL with a simple for loop.
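Steps (3) and (4) can be sketched on a tiny hard-coded HTML fragment, so the example runs without hitting the live site (the fragment and its hrefs are made up; the real page uses the deep XPath shown above). Note that lxml, unlike the browser's DevTools, does not insert a tbody element, which is why the XPath skips it:

```python
from lxml import etree

html = '''
<table>
  <tr><th>Name</th></tr>
  <tr><td><div><a href="/item/A">A</a></div></td></tr>
  <tr><td><div><a href="/item/B">B</a></div></td></tr>
</table>'''

tree = etree.HTML(html)
rows = tree.xpath('//table/tr')[1:]       # skip the header row, as in step (3)
detail_urls = ['https://baike.baidu.com' + r.xpath('./td/div/a/@href')[0]
               for r in rows]
print(detail_urls)
```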
(5) Once each contestant's detail-page URL is obtained, we download the images in bulk, again using XPath to locate the image element:
try:
    # the portrait sits inside <div class="summary-pic"> on the detail page
    img_src = detail_tree.xpath('.//div[@class="summary-pic"]/a/img/@src')[0]
    filenamepath = './青春有你' + '-' + filename
    img_data = requests.get(url=img_src, headers=headers).content
    with open(filenamepath, 'wb') as fp:
        fp.write(img_data)
    print(filename, 'downloaded')
except Exception:
    # no summary picture, or a different page layout -- skip this contestant
    print(filename, 'failed')
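The try/except above catches everything, including bugs unrelated to the network. A slightly more defensive variant (the helper's name and signature are my own, not part of the original script) narrows the failure modes, adds a timeout, and rejects HTTP error responses:

```python
import requests

def download_image(img_url, path, headers, timeout=10):
    """Fetch one image URL and write the bytes to `path`.

    Returns True on success, False on any network or file error.
    (Hypothetical helper -- a sketch, not the original author's code.)
    """
    try:
        resp = requests.get(img_url, headers=headers, timeout=timeout)
        resp.raise_for_status()   # turn HTTP 4xx/5xx into an exception
        with open(path, 'wb') as fp:
            fp.write(resp.content)
        return True
    except (requests.RequestException, OSError):
        return False
```

Inside the loop, one call per contestant replaces the whole try/except body, and the boolean return value drives the success/failure message.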
The complete code is as follows:
import requests
from lxml import etree

url = 'https://baike.baidu.com/item/%E9%9D%92%E6%98%A5%E6%9C%89%E4%BD%A0%E7%AC%AC%E4%BA%8C%E5%AD%A3'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
}
page_text = requests.get(url=url, headers=headers).text
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.HTML(page_text, parser=parser)
tr_list_obj = tree.xpath('/html/body/div[3]/div[3]/div/div[1]/div/table[7]/tr')[1:]  # no tbody in the XPath
for tr_list in tr_list_obj:
    td_list = tr_list.xpath('./td')
    # contestant's detail-page URL (the href is relative, so prepend the domain)
    url_name = 'https://baike.baidu.com' + td_list[0].xpath('./div/a/@href')[0]
    # contestant's name
    name = td_list[0].xpath('./div/a/text()')[0]
    # contestant's birthplace
    city = td_list[1].xpath('./div/text()')[0]
    # contestant's zodiac sign
    xingzuo = td_list[2].xpath('./div/text()')[0]
    # contestant's flower motto (花语)
    huayu = td_list[3].xpath('./div/text()')[0]
    # image file name
    filename = name + '.jpg'
    # fetch the contestant's detail page
    xueyuan_page_text = requests.get(url=url_name, headers=headers).text
    detail_tree = etree.HTML(xueyuan_page_text, parser=parser)
    # some detail pages use a different layout or lack the summary picture,
    # so the XPath lookup below can fail -- guard it with try/except
    try:
        img_src = detail_tree.xpath('.//div[@class="summary-pic"]/a/img/@src')[0]
        filenamepath = './青春有你' + '-' + filename
        img_data = requests.get(url=img_src, headers=headers).content
        with open(filenamepath, 'wb') as fp:
            fp.write(img_data)
        print(filename, 'downloaded')
    except Exception:
        print(filename, 'failed')
print('Crawl finished!')