Python Crawler Basics: Downloading Images

  @All members: A concrete layout diagram and the source code are provided for reference; if anything is unclear, you can download the archive and try it yourself.
  Original work, for reference only.
  **Python crawler for downloading images:**
  Download link: https://download.csdn.net/download/ganyonjie/12555125 — feel free to download
  **C# generic login UI (detailed version):**
  Download link: https://download.csdn.net/download/ganyonjie/11431098 — feel free to download
  **C++ Snake game:**
  Download link: https://download.csdn.net/download/ganyonjie/11431277 — feel free to download
Batch-downloading and saving music:

See: https://blog.csdn.net/ganyonjie/article/details/107002977

Directory structure after the crawler finishes:

(figure: directory tree created under the save path)

Topics covered:

  1. Using re regular expressions
  2. Using the requests package
  3. Using BeautifulSoup
  4. Extracting content from web pages
  5. Recursive downloading
  6. File operations with os
  7. Saving images to disk
  8. Writing a log file
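The core extraction step used by the crawler — stripping quote and bracket characters so URLs become whitespace-delimited tokens, matching image extensions with a regex, and normalizing protocol-relative `//host/...` links — can be sketched on its own. The function name here is illustrative, not part of the original script:

```python
import re

def extract_image_urls(html):
    """Pull .jpg/.png/.gif URLs out of raw HTML text, mirroring the
    token-based approach used in the main crawler below."""
    # Strip quotes/brackets so URLs become whitespace-delimited tokens.
    text = re.sub(r"['\"(){}\[\],;]", " ", html)
    urls = []
    for token in re.findall(r"\S+\.(?:jpg|png|gif)", text):
        if token.startswith("http"):
            urls.append(token)
        elif token.startswith("//"):
            # Protocol-relative URL: prepend a scheme.
            urls.append(f"http:{token}")
    return urls
```

Note that this treats the page as plain text rather than walking the DOM; it also picks up image URLs that appear in inline scripts or CSS, which is why the main script applies the same regex to the stringified BeautifulSoup output.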

The source code is as follows:


import requests
import re
from bs4 import BeautifulSoup
import os

# Global state
oldUrlList = []  # URLs already crawled, to avoid revisiting them

def DownPicture(url, savePath, logPath):
    # Skip URLs that were already crawled
    if url in oldUrlList:
        with open(logPath, 'a+') as flog:
            flog.write(f"{url} saved!\n")
        return
    # Record this URL as visited
    oldUrlList.append(url)
    # Fetch the current page
    try:
        rConnect = requests.get(url, timeout=2)
    except requests.RequestException:
        with open(logPath, 'a+') as flog:
            flog.write(f"{url} connect failed!\n")
        return
    # Flatten the page: strip quote/bracket characters so URLs become
    # whitespace-delimited tokens for the regexes below
    xml = str(BeautifulSoup(rConnect.text, "lxml"))
    xml = re.sub(r"[\'\"(){}\[\],;]", " ", xml)
    # Collect the image URLs on the current page
    imgList = []
    for i in re.findall(r"\S+\.jpg|\S+\.png|\S+\.gif", xml):
        if i[:4] == "http":
            imgList.append(i)
        elif i[:2] == "//":
            # Protocol-relative URL: prepend a scheme
            imgList.append(f"http:{i}")
    # Only save when the page actually contains images
    if len(imgList) != 0:
        print(url, "downloading images ...")
        # De-duplicate, then save each image
        imgSet = set(imgList)
        for i in imgSet:
            # Build the target path: one sub-folder per extension,
            # file name taken from the last 10 characters of the URL
            pathdir = f"{savePath}\\{i[-3:]}"
            path = f"{pathdir}\\{i[-10:]}"
            path = re.sub(r"[/]", "", path)
            # Create the extension folder if it does not exist yet
            if not os.path.exists(pathdir):
                os.makedirs(pathdir, exist_ok=True)
            if os.path.exists(path):
                continue
            # Fetch the image data
            try:
                buf = requests.get(i, timeout=2)
            except requests.RequestException:
                continue
            # Write the image to disk
            with open(path, "wb") as f:
                f.write(buf.content)
            print(f"{path}, image {i[-10:]} saved ok")
    # Find second-level sites linked from the current page
    httpList = re.findall(r"\S+\.com|\S+\.cn", xml)
    # Normalize the discovered URLs
    srcList = []
    for i in httpList:
        if i[:4] == "http":
            srcList.append(i)
        elif i[:2] == "//":
            srcList.append(f"http:{i}")
    # Nothing to recurse into: log it and stop
    if not srcList:
        with open(logPath, 'a+') as flog:
            flog.write(f"{url} is blank\n")
        return
    with open(logPath, 'a+') as flog:
        flog.write(f"{url} current page saved; recursing into child URLs\n")
    # Recursively crawl the child pages for more images
    srcSet = set(srcList)
    for i in srcSet:
        DownPicture(i, savePath, logPath)
    return f"{url} SAVE END!"


if __name__ == "__main__":
    url = "https://www.taobao.com/"
    savePath = "E:\\img"  # no trailing backslash, so joined paths stay clean
    os.makedirs(savePath, exist_ok=True)
    logPath = f"{savePath}\\log.txt"
    DownPicture(url, savePath, logPath)
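Because DownPicture calls itself for every discovered site, a deep link graph can exceed Python's default recursion limit (about 1000 frames). The same traversal can be written iteratively with an explicit queue; the sketch below is a hypothetical alternative, where `fetch_links` and `process_page` are caller-supplied hooks standing in for the link-extraction and image-saving parts of the script:

```python
from collections import deque

def crawl_iteratively(start_url, fetch_links, process_page, max_pages=100):
    """Breadth-first variant of the recursive DownPicture traversal.

    fetch_links(url) -> list of child URLs; process_page(url) does the
    per-page work (e.g. saving images). max_pages bounds the crawl.
    """
    seen = {start_url}          # plays the same role as oldUrlList
    queue = deque([start_url])  # explicit work queue instead of the call stack
    visited_order = []
    while queue and len(visited_order) < max_pages:
        url = queue.popleft()
        process_page(url)
        visited_order.append(url)
        for child in fetch_links(url):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return visited_order
```

Using a set for de-duplication also makes the membership check O(1), whereas `url in oldUrlList` scans the whole list on every call.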

