2021-02-05

Baidu Tieba image scraping exercise

Requirement: download the images from one Tieba thread

Approach: find the URL(s) of the image(s), then save each image

1 Analyzing the page


Image URL:
https://imgsa.baidu.com/forum/w%3D415/sign=53b2a2cf0c3387449cc52e7d640ed937/4c53ca8065380cd734b55e72a144ad34588281b0.jpg

This is the image's URL, but it turns out that it does not appear anywhere in the page source.

  • To check: copy the image URL, open the page source, press Ctrl+F, then paste with Ctrl+V
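
The manual Ctrl+F check can also be scripted. This is only a minimal sketch, assuming the page HTML has already been fetched into a string. `url_in_source` is a hypothetical helper and `sample_html` is made-up page source used for illustration.

```python
def url_in_source(html: str, image_url: str) -> bool:
    """Return True if the image URL, or at least its file name, appears in the HTML."""
    # The file name is the last path segment. A page may embed the same image
    # under a different host or size prefix, so the file name is checked too.
    file_name = image_url.rsplit('/', 1)[-1]
    return image_url in html or file_name in html


# Made-up page source: the picture container is empty because the images
# are loaded later by a separate request.
sample_html = '<html><body><div id="pic_list"></div></body></html>'
image_url = ('https://imgsa.baidu.com/forum/w%3D415/sign=53b2a2cf0c3387449cc52e7d640ed937/'
             '4c53ca8065380cd734b55e72a144ad34588281b0.jpg')

found = url_in_source(sample_html, image_url)
print(found)
```

If this prints False for the real page, the images are most likely loaded by a separate request, which is what the Network panel is for.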

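Finding the real data endpoint in the Network panel
There are two options: analyze the data endpoint in the browser's Network panel, or simulate a browser with selenium (not covered yet).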

The data turns out to come from:
https://tieba.baidu.com/photo/g/bw/picture/list?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&alt=jview&rn=200&tid=1934517161&pn=1&ps=1&pe=40&info=1&_=1616760409348
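
To see what this endpoint is being asked for, it can help to split the query string into its parameters. A small sketch using only the standard library; what each parameter means on Baidu's side is not documented here, so any interpretation (for example, that `ps` and `pe` mark a start and end index) is a guess from the names.

```python
from urllib.parse import urlsplit, parse_qs

url = ('https://tieba.baidu.com/photo/g/bw/picture/list?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B'
       '&alt=jview&rn=200&tid=1934517161&pn=1&ps=1&pe=40&info=1&_=1616760409348')

# parse_qs returns a list per key; every key occurs once here, so keep the first value.
params = {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}

for key, value in params.items():
    print(key, '=', value)
```

The percent-encoded `kw` value decodes to the forum name, and `tid` matches the thread ID in the URL.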


2 Code implementation

import os
import re

import requests

name = 1  # running counter, used as the image file name
url = 'https://tieba.baidu.com/photo/g/bw/picture/list?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&alt=jview&rn=200&tid=1934517161&pn=1&ps=1&pe=40&info=1&_=1616760409348'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
}
os.makedirs('img', exist_ok=True)  # open() below fails if the folder does not exist
res = requests.get(url, headers=headers)
print(res.text)  # inspect the raw response
# Each image URL sits in a "murl" field of the response text
img_urls = re.findall('"murl":"(.*?)"', res.text)
for img_url in img_urls:
    # print(img_url)
    res2 = requests.get(img_url)
    with open('img/%d.jpg' % name, 'wb') as file_obj:
        file_obj.write(res2.content)
    print('Downloading image %d.jpg' % name)
    name += 1

The result shows that the number of images is wrong. Scrolling further down the page makes two more requests appear in the Network panel.



Compare the addresses of the three requests to find the pattern:

https://tieba.baidu.com/photo/g/bw/picture/list?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&alt=jview&rn=200&tid=1934517161&pn=1&ps=1&pe=40&info=1&_=1616760409348
https://tieba.baidu.com/photo/g/bw/picture/list?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&alt=jview&rn=200&tid=1934517161&pn=1&ps=40&pe=79&wall_type=h&_=1616762893617
https://tieba.baidu.com/photo/g/bw/picture/list?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&alt=jview&rn=200&tid=1934517161&pn=1&ps=79&pe=118&wall_type=h&_=1616762897873
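
Spotting the difference by eye is easy with three URLs, but the comparison can also be done mechanically. A rough sketch that parses the three captured URLs and keeps only the parameters whose values differ; `query_params` is a hypothetical helper, and a parameter that is absent from a URL shows up as None.

```python
from urllib.parse import urlsplit, parse_qs

BASE = 'https://tieba.baidu.com/photo/g/bw/picture/list?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&alt=jview&rn=200&tid=1934517161&pn=1'
urls = [
    BASE + '&ps=1&pe=40&info=1&_=1616760409348',
    BASE + '&ps=40&pe=79&wall_type=h&_=1616762893617',
    BASE + '&ps=79&pe=118&wall_type=h&_=1616762897873',
]


def query_params(url):
    """Return the query string of a URL as a flat dict (first value per key)."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


parsed = [query_params(u) for u in urls]
all_keys = sorted(set().union(*parsed))

# Keep only the parameters that are not identical across all three requests.
varying = {
    key: [p.get(key) for p in parsed]
    for key in all_keys
    if len({p.get(key) for p in parsed}) > 1
}

for key, values in varying.items():
    print(key, values)
```

Only `ps` and `pe` change in a regular way (both advance by 39). The `_` value changes on every request, and `info` / `wall_type` differ between the first request and the later ones.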
import os
import re
import time

import requests
"""
https://tieba.baidu.com/photo/g/bw/picture/list?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&alt=jview&rn=200&tid=1934517161&pn=1&ps=1&pe=40&info=1&_=1612525105173
https://tieba.baidu.com/photo/g/bw/picture/list?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&alt=jview&rn=200&tid=1934517161&pn=1&ps=40&pe=79&wall_type=h&_=1612529311001
https://tieba.baidu.com/photo/g/bw/picture/list?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&alt=jview&rn=200&tid=1934517161&pn=1&ps=79&pe=118&wall_type=h&_=1612529314701
ps: 1 40 79   pe: 40 79 118   Pattern: both advance by 39 per request
"""
name = 1  # running counter, used as the image file name
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
}
os.makedirs('img', exist_ok=True)  # open() below fails if the folder does not exist
for i in range(1,80,39):  # i takes the values 1, 40, 79
    url = 'https://tieba.baidu.com/photo/g/bw/picture/list?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&alt=jview&rn=200&tid=1934517161&pn=1&ps='+str(i)+'&pe='+str(39+i) +'&info=1&_=1616760409348'
    # print(url)
    res = requests.get(url, headers=headers)
    # print(res.text)
    img_urls = re.findall('"murl":"(.*?)"', res.text)
    for img_url in img_urls:
        # print(img_url)
        res2 = requests.get(img_url)
        with open('img/%d.jpg' % name, 'wb') as file_obj:
            file_obj.write(res2.content)
        time.sleep(1)  # slow down between downloads, after the file is closed
        print('Downloading image %d.jpg' % name)
        name += 1
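
String concatenation works, but the query can also be assembled from a dict, which makes the changing parameters easier to see. A sketch using the standard library's `urlencode`; `BASE` and `build_list_url` are hypothetical names, and the parameter values are copied from the first captured URL. The `_` value looks like a millisecond timestamp, but that is an assumption; it is reused unchanged here.

```python
from urllib.parse import urlencode

BASE = 'https://tieba.baidu.com/photo/g/bw/picture/list'


def build_list_url(ps, pe):
    """Build the picture-list URL for one ps/pe window."""
    params = {
        'kw': '海贼王',          # urlencode percent-encodes this as UTF-8
        'alt': 'jview',
        'rn': 200,
        'tid': 1934517161,
        'pn': 1,
        'ps': ps,
        'pe': pe,
        'info': 1,
        '_': 1616760409348,    # copied from the captured request (assumed timestamp)
    }
    return BASE + '?' + urlencode(params)


# Same windows as the loop above: (1, 40), (40, 79), (79, 118)
urls = [build_list_url(i, i + 39) for i in range(1, 80, 39)]
for u in urls:
    print(u)
```

Adjacent windows share a boundary index (40 and 79). Whether the endpoint treats `pe` as inclusive is not stated in these notes, so it may be worth checking the downloads for duplicates.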

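Takeaways from this example
1 Check whether the data is in the page source. If it is, request the page and parse it directly; if it is not, analyze the requests in the Network panel. Each site has to be examined case by case.
2 Work out the pattern in the URLs and generate them accordingly
for i in range(1,80,39):
    print(i)
Image file names
Define a variable and increase its value by 1 on every pass through the loop

Regular expression: img_urls = re.findall('"murl":"(.*?)"',res.text)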
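
The `?` after `.*` is what makes this pattern work: it stops at the first closing quote instead of the last one. A small demonstration on made-up text; `sample` only imitates the shape of the response, and every field other than `murl` is hypothetical.

```python
import re

# Made-up response text with two "murl" fields on a single line.
sample = ('{"pic_list":[{"murl":"https://example.com/a.jpg","w":1},'
          '{"murl":"https://example.com/b.jpg","w":2}]}')

# Non-greedy: each match ends at the nearest closing quote.
img_urls = re.findall(r'"murl":"(.*?)"', sample)

# Greedy: the match runs on to the last quote in the text, swallowing both URLs.
greedy = re.findall(r'"murl":"(.*)"', sample)

print(img_urls)
print(len(greedy))
```

With the non-greedy form, `img_urls` holds the two URLs separately; the greedy form returns a single oversized match.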
