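Scraping Images with a Python Crawler

I. Requirement:

Use Python to download the images on a Neihan Duanzi (neihan8.com) web page to the local disk.

II. Implementation:

1. Get the URL of the page to crawl

2. Set the headers

3. Request the page content and parse the HTML into an XML-style element tree

4. Extract the image addresses and download the images

III. Walkthrough, using the page in the image below as an example: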

[Image 1: Scraping images with a Python crawler]

1. Get the URL of the page to crawl:

url="http://www.neihan8.com/gaoxiaomanhua/index_2.html"

2. Set the headers:

headers={"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"}

3. Request the page content and parse the HTML into an XML-style element tree:

request = urllib2.Request(url,headers=headers)

response = urllib2.urlopen(request).read()

content = etree.HTML(response)  # etree must be imported beforehand: from lxml import etree
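The snippet above targets Python 2, where `urllib2` exists. In Python 3 that module was split into `urllib.request` and `urllib.error`, so a rough equivalent of the request step might look like the sketch below. The helper names `build_request` and `fetch` are hypothetical, and `fetch` is defined but not called here so the example runs without network access.

```python
import urllib.request

URL = "http://www.neihan8.com/gaoxiaomanhua/index_2.html"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
}


def build_request(url, headers=HEADERS):
    """Hypothetical helper: wrap a URL and its headers in a Request object."""
    return urllib.request.Request(url, headers=headers)


def fetch(url, timeout=10):
    """Hypothetical helper: send the request and return the body as bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


request = build_request(URL)
# Request capitalizes header names internally, so the key is "User-agent".
print(request.get_header("User-agent"))
print(request.get_method())
```

Calling `fetch(URL)` would return the raw bytes that can then be handed to `etree.HTML`.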

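4. Extract the image addresses and download the images. Using the page structure shown in the image above, we work out the XPath that selects each image's address.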

linklist = content.xpath('/html/body/div[@class="main wrap"]//div[@class="left"]/div[@class="pic-column-list mt10"]/div/a/img/@src')
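To check what this XPath returns without touching the network, it can be run against a small hand-written HTML fragment. The fragment below is made up to mimic the class names used in the expression; it is not the real page, and the `example.com` image addresses are placeholders.

```python
from lxml import etree

# Made-up HTML that mirrors the structure the XPath expects (not the real page).
SAMPLE_HTML = """
<html><body>
  <div class="main wrap">
    <div class="left">
      <div class="pic-column-list mt10">
        <div><a href="/a.html"><img src="http://example.com/img/a.jpg"></a></div>
        <div><a href="/b.html"><img src="http://example.com/img/b.jpg"></a></div>
      </div>
    </div>
  </div>
</body></html>
"""

content = etree.HTML(SAMPLE_HTML)
# Selecting @src yields a list of strings, one per matching <img> element.
linklist = content.xpath(
    '/html/body/div[@class="main wrap"]//div[@class="left"]'
    '/div[@class="pic-column-list mt10"]/div/a/img/@src'
)
for link in linklist:
    print(link)
```

Note that `[@class="main wrap"]` compares the whole attribute value, so the match fails if the page adds or reorders class names.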

Note: linklist holds every value matched by this XPath, so to download the images we need to take the links out one at a time.

for link in linklist:
    image_request = urllib2.Request(link)
    response = urllib2.urlopen(image_request).read()
    # Use the last 10 characters of the image URL as the local file name
    fileName = link[-10:]
    with open(fileName, "wb") as f:
        f.write(response)
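Naming the file after the last ten characters of the URL is simple, but it can cut a longer name short or give two different images the same name. One possible alternative, shown here only as a sketch, is to use the last segment of the URL path. The helper `filename_from_url` and the `example.com` address are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(link, default="image.jpg"):
    """Hypothetical helper: use the last path segment of the URL as the file name."""
    path = urlparse(link).path            # drops any query string or fragment
    name = posixpath.basename(path)       # text after the final "/"
    return name or default                # fall back when the URL ends with "/"


sample = "http://example.com/uploads/2017/funny-comic-001.jpg?v=2"
print(filename_from_url(sample))
```

The fallback name is a design choice for URLs with no file segment; a real crawler would probably want unique names instead.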



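The sections above explained each step of the flow. All the code was typed by hand, and this is my first article, so please forgive the rough edges. The complete code follows.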

import urllib2

from lxml import etree


class Spider:

    def __init__(self):
        # Index of the list page to crawl
        self.pageNum = 2
        # Flag kept from the original; not used below
        self.switch = True

    def loadImage(self):
        # Fetch the list page and collect every image address on it
        url = "http://www.neihan8.com/gaoxiaomanhua/index_" + str(self.pageNum) + ".html"
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"}
        request = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(request).read()
        content = etree.HTML(response)
        linklist = content.xpath('/html/body/div[@class="main wrap"]//div[@class="left"]/div[@class="pic-column-list mt10"]/div/a/img/@src')
        for image_link in linklist:
            print("downLoading...")
            self.writeImage(image_link)

    def writeImage(self, link_address):
        # Download one image and save it under the last 10 characters of its URL
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"}
        download_request = urllib2.Request(link_address, headers=headers)
        response = urllib2.urlopen(download_request).read()
        fileName = link_address[-10:]
        with open(fileName, "wb") as f:
            f.write(response)
        print("downLoad---FINISH")


if __name__ == "__main__":
    spider = Spider()
    spider.loadImage()
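The listing fetches only the single page named by `pageNum`. If several list pages were wanted, their URLs could be generated first and then fetched one by one. The sketch below assumes every list page follows the same `index_N.html` pattern, which the article shows only for page 2; the helper `page_urls` is hypothetical.

```python
# Assumption: every list page follows the index_N.html pattern seen for page 2.
BASE_URL = "http://www.neihan8.com/gaoxiaomanhua/index_"


def page_urls(first_page, last_page):
    """Hypothetical helper: build list-page URLs for an inclusive page range."""
    return [BASE_URL + str(page) + ".html" for page in range(first_page, last_page + 1)]


for url in page_urls(2, 4):
    print(url)
```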
