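Working overtime today. What a grind!!
I was bored, so I wrote an image-scraping crawler in Python. It turned out pretty well, haha.
Here is the code first (Python version: 2.7.9):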
__author__ = 'bloodchilde'

import urllib2
import re
import os


class Spider:
    def __init__(self):
        self.siteUrl = "http://sc.chinaz.com/biaoqing/"
        self.user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko'
        # Send a browser-like User-Agent with every request
        self.headers = {'User-Agent': self.user_agent}

    def getPage(self, pageIndex):
        # Fetch the HTML source of one listing page, e.g. index_3.html
        url = self.siteUrl + "index_" + str(pageIndex) + ".html"
        request = urllib2.Request(url, headers=self.headers)
        response = urllib2.urlopen(request)
        return response.read().decode("utf-8")

    def getContents(self, pageIndex):
        # Return a list of [title, imageUrl] pairs found on the page
        page = self.getPage(pageIndex)
        pattern = re.compile('''<div.*?class='num_1'.*?>.*?<p>.*?<a.*?href='.*?'.*?target='_blank'.*?title='(.*?)'.*?><img.*?src2="(.*?)".*?>.*?</a>.*?</p>.*?</div>''', re.S)
        items = re.findall(pattern, page)
        contents = []
        for item in items:
            contents.append([item[0], item[1]])
        return contents

    def mk_dir(self, path):
        # Create the directory if it is missing; return True only when it was created
        if not os.path.exists(path):
            os.makedirs(path)
            return True
        return False

    def downImage(self, url, dirname):
        # Request the image URL; the response body is the raw image data
        request = urllib2.Request(url, headers=self.headers)
        response = urllib2.urlopen(request)
        imageContents = response.read()
        # The last segment of the URL is the file name (with extension)
        imageName = str(url.split(u"/")[-1])
        print imageName
        path = u"C:/Users/bloodchilde/Desktop/image_python/" + dirname
        self.mk_dir(path)
        imagePath = path + u"/" + imageName
        # Write in binary mode so the image bytes are saved unchanged
        with open(imagePath, 'wb') as f:
            f.write(imageContents)

    def downLoadAllPicture(self, pageIndex):
        # Download every image listed on one page, one folder per title
        contents = self.getContents(pageIndex)
        for item in contents:
            dirname = item[0]
            imageUrl = item[1]
            self.downImage(imageUrl, dirname)


demo = Spider()
for page in range(3, 100):
    demo.downLoadAllPicture(page)
That is a lot of images, and the download was done in no time. Now let's walk through the program:
First, my target page is:
http://sc.chinaz.com/biaoqing/index_3.html
The program's job is to download the emoticon images from this page.
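The script does not stop at one page: it loops over page indexes and builds each listing URL from the same pattern as the one above. A minimal sketch of that URL construction follows. The names SITE_URL and page_url are only for illustration (in the script this logic lives inside getPage), and it is written so that it also runs on Python 3.

```python
# Base URL taken from the script; SITE_URL and page_url are illustrative names.
SITE_URL = "http://sc.chinaz.com/biaoqing/"


def page_url(page_index):
    """Build the listing-page URL for a given page index."""
    return SITE_URL + "index_" + str(page_index) + ".html"


print(page_url(3))  # http://sc.chinaz.com/biaoqing/index_3.html
```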
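The approach:
1. Fetch the HTML source of the page
2. Parse the source to get the URLs of the images to download (with a regex)
3. Send a new request to each image URL; the response body is the image data itself, contents
4. The image URL from step 2 also gives the image's file name (with extension), imageName
5. Create a local file named imageName and write contents into it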
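Steps 4 and 5 can be tried without touching the network. The sketch below is only an illustration under a few assumptions: the helper names image_name_from_url and save_image are made up here, the bytes and the example.com URL are placeholders rather than real site data, and the file goes into a temporary directory instead of the Desktop path used in the script. It is written to run on Python 3.

```python
import os
import tempfile


def image_name_from_url(image_url):
    """Step 4: the last path segment of the URL is the file name."""
    return image_url.split("/")[-1]


def save_image(contents, image_url, base_dir, dirname):
    """Step 5: write the raw bytes to base_dir/dirname/<file name>."""
    target_dir = os.path.join(base_dir, dirname)
    if not os.path.exists(target_dir):
        os.makedirs(target_dir)
    image_path = os.path.join(target_dir, image_name_from_url(image_url))
    # Binary mode keeps the image bytes exactly as received
    with open(image_path, "wb") as f:
        f.write(contents)
    return image_path


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()  # throwaway folder for the demo
    fake_bytes = b"GIF89a"         # placeholder, not a real image
    fake_url = "http://example.com/img/smile_1.gif"  # hypothetical URL
    print(save_image(fake_bytes, fake_url, demo_dir, "smile"))
```

In the real script the bytes would come from response.read() in step 3 rather than from a placeholder.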
Open http://sc.chinaz.com/biaoqing/index_3.html, view the page source, and find the snippet that needs to be processed, as follows:
The corresponding regex is:
'''<div.*?class='num_1'.*?>.*?<p>.*?<a.*?href='.*?'.*?target='_blank'.*?title='(.*?)'.*?><img.*?src2="(.*?)".*?>.*?</a>.*?</p>.*?</div>'''
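To see what the two capture groups return, the regex can be run against a small hand-written fragment. Note that SAMPLE_HTML below is made up to fit the shape the pattern expects; it is not copied from the real page, and the example.com URLs and the extract_items name are hypothetical. The sketch runs on Python 3.

```python
import re

# Same pattern as above; re.S lets "." match newlines too.
PATTERN = re.compile(
    r'''<div.*?class='num_1'.*?>.*?<p>.*?<a.*?href='.*?'.*?target='_blank'.*?title='(.*?)'.*?><img.*?src2="(.*?)".*?>.*?</a>.*?</p>.*?</div>''',
    re.S,
)

# Hand-written sample, NOT the real page source.
SAMPLE_HTML = """
<div class='num_1'>
  <p>
    <a href='/biaoqing/a.htm' target='_blank' title='Pack A'><img src2="http://example.com/img/a_1.gif" alt="a"></a>
  </p>
</div>
<div class='num_1'>
  <p>
    <a href='/biaoqing/b.htm' target='_blank' title='Pack B'><img src2="http://example.com/img/b_1.gif" alt="b"></a>
  </p>
</div>
"""


def extract_items(page):
    """Return a list of (title, image_url) tuples, one per matched block."""
    return re.findall(PATTERN, page)


for title, image_url in extract_items(SAMPLE_HTML):
    print(title, image_url)
```

The first group (title) becomes the folder name and the second (the src2 attribute) is the image URL to download. A regex like this is brittle if the site changes its markup, so treat it as tied to the page layout at the time of writing.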
Pretty simple, isn't it?