Otaku Treats: Simple Image Scraping with Python

Beta version; the code is rough.

Usage instructions below use Windows as the example, with Python 2.7.6.

  1. Make sure Python is installed; the default Windows install path is C:\Python27. If it is not installed, download it first from https://www.python.org/download/releases/2.7.6
  2. Download mechanize (mechanize-0.2.5.zip) and BeautifulSoup (beautifulsoup4-4.3.2.tar.gz)
  3. Extract mechanize-0.2.5.zip to C:\mechanize-0.5, open a command prompt (Windows key + R, type cmd, press Enter), and run the following two commands:
    cd C:\mechanize-0.2.5
    
    C:\Python27\python setup.py install
  4. Extract beautifulsoup4-4.3.2.tar.gz to C:\beautifulsoup4-4.3.2, open a command prompt, and run:
    cd C:\beautifulsoup4-4.3.2
    
    C:\Python27\python setup.py install
  5. Copy the code below and save it to any directory (e.g. C:\picture\meizitu_spider.py)
  6. Open a command prompt and run:
    cd C:\picture
    
    C:\Python27\python meizitu_spider.py
  7. Check the folder C:\picture\MeiZiTu
  8. Enjoy :-)
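Before running the scraper, it helps to confirm that both third-party packages actually installed. This small check is not part of the original script; it is a convenience snippet that runs under both Python 2 and 3:

```python
def check_deps():
    # Try to import each scraper dependency; return the names that fail.
    missing = []
    for name in ("mechanize", "bs4"):
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing

# An empty list means both libraries are ready to use.
print(check_deps())
```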

 

The code:

#!/usr/local/bin/python
# -*- coding: utf-8 -*-
# Filename: meizitu_spider.py

import os
import mechanize
from bs4 import BeautifulSoup

# Set up the browser: ignore robots.txt and present an IE user agent.
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)")]
#br.set_proxies({"http": "proxy.host.com:port"})


def parse_url(url):
    """Fetch a URL and return it parsed as a BeautifulSoup tree."""
    br.open(url)
    response = br.response()
    # The site serves GB-encoded pages; gb18030 is a superset of GB2312/GBK.
    soup = BeautifulSoup(response.read(), from_encoding='gb18030')
    return soup


def find_next_page(soup):
    """Return the 'next page' link, or None on the last page."""
    # The second-to-last <li> in the pager holds the next-page link;
    # on the last page it contains no <a>, so find('a') returns None.
    page_nums = soup.find('div', id='wp_page_numbers').find_all('li')
    next_page_wrapper = page_nums[-2]
    return next_page_wrapper.find('a')


host = "http://www.meizitu.com/"
next_page_uri = ''

page_count = 1
parent_folder = 'MeiZiTu'
if not os.path.exists(parent_folder):
    os.mkdir(parent_folder)

while True:
    print 'Start to parse PAGE %d' % page_count

    soup = parse_url(host + 'a/' + next_page_uri)
    next_page = find_next_page(soup)

    if next_page is None:
        break

    next_page_uri = next_page.get('href')

    # Each album on the listing page sits in a div.metaRight block.
    for pic_link_wrapper in soup.find_all('div', attrs={'class': 'metaRight'}):
        pic_link = pic_link_wrapper.find('a')
        album_soup = parse_url(pic_link.get('href'))
        album_name = os.path.join(parent_folder, pic_link.get_text())
        # Skip albums already downloaded on a previous run.
        if os.path.exists(album_name):
            continue

        os.mkdir(album_name)

        # Download every image in the album into its own folder.
        for img in album_soup.find('div', id='picture').find_all('img'):
            img_src = img.get('src')
            img_name = img_src[img_src.rindex('/')+1:]
            picture_data = mechanize.urlopen(img_src)

            with open(os.path.join(album_name, img_name), 'wb') as picture:
                picture.write(picture_data.read())

    page_count += 1
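One fragile spot in the listing above: `img_src[img_src.rindex('/')+1:]` raises ValueError if an image src ever lacks a slash, and it keeps any query string in the file name. A sturdier alternative, sketched as a hypothetical helper (not in the original script) using only the standard library:

```python
import posixpath

try:
    from urlparse import urlparse      # Python 2
except ImportError:
    from urllib.parse import urlparse  # Python 3


def image_name(url):
    # Take the last component of the URL path, ignoring any ?query part.
    return posixpath.basename(urlparse(url).path)


print(image_name('http://example.com/pics/01.jpg?s=small'))  # 01.jpg
```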

 

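A related caveat: the album folder name is taken verbatim from the link text, and Windows rejects paths containing characters such as `?`, `:`, or `|`, as well as trailing dots and spaces. If `os.mkdir` ever fails on an album, a sanitizing helper along these lines (hypothetical, not part of the original code) would help:

```python
import re


def safe_folder_name(text):
    # Replace characters Windows forbids in file names, then trim the
    # trailing dots and spaces that Windows also rejects.
    cleaned = re.sub(r'[<>:"/\\|?*]', '_', text)
    return cleaned.rstrip('. ')


print(safe_folder_name('Album: best of 2014?'))  # Album_ best of 2014_
```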