Python project: crawling the 妹子图 (girl pics) from jandan.net (煎蛋), part 1

A quick hands-on exercise: scraping the girl pics.

Page URL format

http://jandan.net/ooxx/page-1777#comments
Only the page number (1777 here) changes from page to page.
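
Generating the crawl targets is therefore just string concatenation over a page range; a minimal sketch (the 1530-1882 range matches the full script below):

# build one URL per gallery page
base = "http://jandan.net/ooxx/page-"
url_lists = [base + str(i) + '#comments' for i in range(1530, 1883)]
print(url_lists[0])  # -> http://jandan.net/ooxx/page-1530#comments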

Inspecting the page source shows that each picture appears in two places.

One is the thumbnail:

<img src="http://ww1.sinaimg.cn/mw600/4bf31e43jw1f09htnzkh5j20dw0kumz0.jpg" />

The other is a link to the full-size original:

<a href="http://ww1.sinaimg.cn/large/4bf31e43jw1f09htnzkh5j20dw0kumz0.jpg" target="_blank" class="view_img_link">[查看原图]</a>

We scrape the originals, locating the <a> tags by their class and target attributes.
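
With BeautifulSoup that is a single find_all call; a minimal sketch run against the snippet above:

from bs4 import BeautifulSoup

html = ('<a href="http://ww1.sinaimg.cn/large/4bf31e43jw1f09htnzkh5j20dw0kumz0.jpg" '
        'target="_blank" class="view_img_link">[查看原图]</a>')

soup = BeautifulSoup(html, 'lxml')
# select only the anchors jandan uses for the full-size image links
for a in soup.find_all("a", target="_blank", class_="view_img_link"):
    print(a['href'])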

The end result is one TXT file of image URLs per page; part 2 covers merging the files and downloading the images.

The full source follows.

You will need to find a proxy IP file yourself :-D
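
The loader below only prepends a scheme to each line, so ip2.txt is assumed to hold one bare proxy per line in host:port form; a minimal sketch of how one entry feeds into requests:

import requests

# assumed ip2.txt format: one "host:port" per line, e.g. 121.40.78.90:8080
proxy = "http://" + "121.40.78.90:8080".strip()
# route plain-HTTP traffic through the chosen proxy
res = requests.get("http://jandan.net/ooxx/page-1777#comments",
                   proxies={'http': proxy})
print(res.status_code)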

# coding:utf-8
####################################################
# coding by 刘云飞
####################################################

import requests
import os
import time
import random
from bs4 import BeautifulSoup

base_url = "http://jandan.net/ooxx/page-"
img_lists = []      # every full-size image URL found, across all pages
url_lists = []      # gallery page URLs to crawl
not_url_lists = []  # pages that failed and need a retry
ips = []            # proxy pool loaded from ip2.txt

# load the proxy pool: one "host:port" per line in ip2.txt
with open('ip2.txt', 'r') as f:
    for line in f:
        ips.append("http://" + line.strip())

headers = {
    'Host': 'jandan.net',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/42.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Referer': 'http://jandan.net/ooxx/',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
}

# build the list of gallery pages to crawl (1530 through 1882)
for i in range(1530, 1883):
    url_lists.append(base_url + str(i) + '#comments')


def writeToTxt(name, urls):
    # dump one URL per line into the given text file
    with open(name, 'w+') as f:
        for urlOne in urls:
            f.write(urlOne + "\n")


def get_img_url(url):
    single_ip_addr = random.choice(ips)  # pick a random proxy for this request
    lists_tmp = []
    # the page number is the four digits right after ".../ooxx/page-"
    page = int(url[28:32])
    filename = str(page) + ".txt"
    proxies = {'http': single_ip_addr}
    try:
        res = requests.get(url, headers=headers, proxies=proxies)
        print(res.status_code)
        if res.status_code == 200:
            soup = BeautifulSoup(res.text, 'lxml')
            results = soup.find_all("a", target="_blank", class_="view_img_link")
            for img in results:
                lists_tmp.append(img['href'])
                img_lists.append(img['href'])  # also collect globally for img_url.txt
            print(url + "  --->>>> page done!")
            writeToTxt(filename, lists_tmp)
        else:
            not_url_lists.append(url)
            print("not ok")
    except Exception:  # bad proxy, timeout or parse error: queue for retry
        not_url_lists.append(url)
        print("not ok")


# crawl every page, skipping those whose TXT file already exists
for url in url_lists:
    page = int(url[28:32])
    filename = str(page) + ".txt"
    if os.path.exists(filename):
        print(url + "   already fetched, skipping")
    else:
        # time.sleep(1)  # optional throttle between requests
        get_img_url(url)

print(img_lists)

# persist every full-size image URL found in this run
with open("img_url.txt", 'w+') as f:
    for url in img_lists:
        f.write(url + "\n")

print("Found " + str(len(img_lists)) + " images in total.")
print("all done!!!")

with open("not_url_lists.txt", 'w+') as f:
    for url in not_url_lists:
        f.write(url + "\n")
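
Since the failed pages end up in not_url_lists.txt, a second pass can retry just those; a minimal sketch, assuming get_img_url from the script above is still in scope:

# reload the recorded failures and give each page another attempt
with open("not_url_lists.txt", 'r') as f:
    retry_urls = [line.strip() for line in f if line.strip()]

not_url_lists.clear()  # start a fresh failure list for this pass
for url in retry_urls:
    get_img_url(url)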
