图片获取太麻烦?python爬虫实战:百度百万级图片采集

python爬虫实战:百度百万级图片采集

  • 项目分析
  • 项目解析
  • 源码展示

项目分析

效果展示



开发工具

开发环境: Python3.7 + win10

开发工具: pycharm + chrome

主要内容:

1.获取网址数据
2.正则提取数据
3.保存表格数据

项目解析

找到图片的动态数据接口
图片获取太麻烦?python爬虫实战:百度百万级图片采集_第1张图片

找到图片的url地址

图片获取太麻烦?python爬虫实战:百度百万级图片采集_第2张图片

请求当前动态的接口网址

提取图片的网址信息

再次对图片网址发送请求

保存数据

源码展示

import requests  # 导入请求的工具包
import re  # 正则匹配工具包

# 添加请求头
headers = {
     
    # cookie信息
    "Cookie": "BDqhfp=%E6%98%8E%E6%98%9F%E5%8D%95%E4%BA%BA%E7%85%A7%26%260-10-1undefined%26%261132%26%263; PSTM=1606885275; BAIDUID=D9B7A2A3C7555B9A30BC448DE032D13B:FG=1; BIDUPSID=5EEB8A912FC8FDCB0C187D519FABA455; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BDSFRCVID=RxtOJeC62CaLjt3rvnJAhLqAYfS_AG3TH6ao5bUjY3stbaPxsXJ7EG0P8f8g0KubzcDrogKKLmOTHpKF_2uxOjjg8UtVJeC6EG0Ptf8g0f5; H_BDCLCKID_SF=tJPD_CLKtDI3fP36qRQtbt00qxby26nfa6T9aJ5nJDoNqIopb5bK54CkXM7rbxok3gQ3Lqo8QpP-HJ7zbxRqQhkD3NQXJU3p2erEKl0MLU7tbb0xynoDMbtNMfnMBMnramOnaPJc3fAKftnOM46JehL3346-35543bRTLnLy5KJYMDF4jj-hj5QLjaRf-b-X2CjyWb88Kb7VbUo95MnkbfJBD4bKWPTJt5rqWqcp2pRaEfTI0pnNQTt7yajK25QaQCQko-O2KJjmJ-Oyy6JpQT8reMDOK5OibCrE3hb-ab3vOpRzXpO1KMPzBN5thURB2DkO-4bCWJ5TMl5jDh3Mb6ksDMDtqtJHKbDDVILMJMK; BDUSS=g5U3pnMlhCRWJWT1lYQzR3SVBuSUN0MUNkWERaSDJHV2xKN3NHUDJzMFhGZTlmRUFBQUFBJCQAAAAAAAAAAAEAAAAIs6iyAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABeIx18XiMdfRE; BDUSS_BFESS=g5U3pnMlhCRWJWT1lYQzR3SVBuSUN0MUNkWERaSDJHV2xKN3NHUDJzMFhGZTlmRUFBQUFBJCQAAAAAAAAAAAEAAAAIs6iyAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABeIx18XiMdfRE; delPer=0; PSINO=6; __yjsv5_shitong=1.0_7_65901da49a5a59037e920cbd2020d5073dda_300_1606918958343_113.240.215.138_cdf8224b; H_PS_PSSID=1468_32855_33059_33098_33100_33199_33147_22160; BA_HECTOR=20050k2gakah8100r71fsguaf0q; BDRCVFR[X_XKQks0S63]=mk3SLVN4HKm; firstShowTip=1; cleanHistoryStatus=0; BDRCVFR[dG2JNJb_ajR]=mk3SLVN4HKm; userFrom=tupian.baidu.com; BDRCVFR[-pGxjrCMryR]=mk3SLVN4HKm; indexPageSugList=%5B%22%E6%98%8E%E6%98%9F%E5%8D%95%E4%BA%BA%E7%85%A7%22%2C%22%E6%98%8E%E6%98%9F%22%5D",
    # 用户代理
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36",
    # 请求数据来源
    "Referer": "https://tupian.baidu.com/search/index",
    "Host": "tupian.baidu.com"
}

key = input("请输入要下载的图片:")
# 保存图片的地址
path = r"E:\\python_project\\vip_course\\百度图片\\图片\\"
# 请求数据接口
for i in range(5, 50):
    url = "https://tupian.baidu.com/search/acjson?tn=resultjson_com&logid=11528842549528169565&ipn=rj&ct=201326592&is=&fp=result&queryWord={}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=©right=&word={}&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn={}&rn=30&gsm=78&1606974013784=".format(key, key, i*30)
    # 发送请求
    response = requests.get(url, headers=headers)
    # 正则匹配数据
    url_list = re.findall('"thumbURL":"(.*?)",', response.text)

    print(url_list)
    # 循环取出图片url 和 name
    for new_url in url_list:
        # 再次对图片发送请求
        result = requests.get(new_url).content
        # 分割网址获取图片名字
        name = new_url.split("/")[-1]
        print(name)
        # 写入文件
        with open(path + name, "wb")as f:
            f.write(result)
①3000多本Python电子书有
②Python开发环境安装教程有
③Python400集自学视频有
④软件开发常用词汇有
⑤Python学习路线图有
⑥项目源码案例分享有

如果你用得到的话可以直接拿走,在我的QQ技术交流群里(纯技术交流和资源共享,广告勿入)可以自助拿走,群号是764406565。

你可能感兴趣的:(爬虫,python,爬虫,程序人生,经验分享)