Scraping Images from Xiaohua Wang (校花网)

Xiaohua Wang (校花网): http://www.521609.com/daxuexiaohua/

Implementation:

import os
import random

import requests
from lxml import etree

header_list = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"},
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"},
    {
        "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"},
    {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"},
    {"User-Agent": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)"},
    {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"},
    {"User-Agent": "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11"}
]

BASE_URL = "http://www.521609.com"
SAVE_DIR = "./piture/"

# Create the output directory up front; open() will not create it.
os.makedirs(SAVE_DIR, exist_ok=True)

# List pages are numbered list2.html through list358.html.
for page in range(2, 358 + 1):
    try:
        page_url = BASE_URL + "/daxuexiaohua/list" + str(page) + ".html"
        print(page_url)
        # Headers must be passed by keyword: the second positional
        # argument of requests.get() is params, not headers.
        response = requests.get(page_url, headers=random.choice(header_list))
        # The list pages are decoded as GBK.
        res = response.content.decode("gbk")

        parser = etree.HTML(res, etree.HTMLParser())
        result_list = parser.xpath("/html/body//div[@class='index_img list_center']/ul/li")
        for item in result_list:
            # The alt text is used as the file name, the src as the image path.
            name = item.xpath("./a[1]/img/@alt")[0]
            img_url = BASE_URL + item.xpath("./a[1]/img/@src")[0]
            # Pick a fresh User-Agent for every image request.
            response_picture = requests.get(img_url, headers=random.choice(header_list))
            with open(os.path.join(SAVE_DIR, name + ".jpg"), "wb") as fs:
                fs.write(response_picture.content)
    except Exception as e:
        # Any failure skips the rest of the current page and moves on.
        print(e)
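
The script uses each image's alt text directly as a file name and builds the image URL by string concatenation. Both can fail on unusual input, for example an alt text containing a slash. The following is a minimal sketch of two helpers that could make those steps more robust. The names `safe_filename` and `absolute_url` and the sample path `/uploads/x.jpg` are hypothetical and are not part of the original script; only the site root is taken from it.

```python
import re
from urllib.parse import urljoin

BASE_URL = "http://www.521609.com"  # site root taken from the script above

# Characters that Windows forbids in file names, plus ASCII control characters.
_UNSAFE = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(name, default="unnamed"):
    """Replace unsafe characters in `name` with underscores (hypothetical helper)."""
    cleaned = _UNSAFE.sub("_", name).strip(" .")
    # Fall back to a default when nothing usable is left.
    return cleaned or default


def absolute_url(src, base=BASE_URL):
    """Resolve an image src attribute against the site root (hypothetical helper)."""
    # urljoin handles root-relative, relative, and already absolute values.
    return urljoin(base + "/", src)


print(safe_filename("a/b:c"))          # slash and colon are replaced
print(absolute_url("/uploads/x.jpg"))  # hypothetical image path
```

In the download loop, `name` could be passed through `safe_filename` before opening the file, and the `@src` value through `absolute_url` instead of prepending the site root by hand.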

Result:

[Image 1: images scraped from Xiaohua Wang]
