After a few days of study and half a day of analysis, I put together a first version of a simple crawler.
What it does: scrape the girl-photo pages on NetEase (网易).
Starting link: http://help.3g.163.com/15/0601/17/AR1QC1OQ00964JJI.html
The approach:
1. Scrape the girl photos on the current page.
2. From the current page, find the link of the next page to scrape.
3. Go back to step 1.
The implementation class:
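The three steps above amount to a standard crawl loop. As a sketch, it can be written like this; the callables fetch, extract_pics and find_next are hypothetical stand-ins for the real page-download and regex code shown below:

```python
# A sketch of the three-step loop above. The helper callables (fetch,
# extract_pics, find_next) are hypothetical stand-ins for the real
# page-download and regex code in the crawler itself.
def crawl(start_link, fetch, extract_pics, find_next, max_pages=100):
    """Scrape pictures page by page: download the current page, collect
    its picture links, then follow the next-page link until none is left
    (a page limit keeps a bad next-link from looping forever)."""
    visited = []   # pages already scraped, to avoid revisiting
    pics = []      # every picture link collected so far
    link = start_link
    while link and len(visited) < max_pages:
        page = fetch(link)               # step 1: fetch the current page
        pics.extend(extract_pics(page))  # step 1: collect its picture URLs
        visited.append(link)
        link = find_next(page)           # step 2: find the next page to scrape
        if link in visited:              # step 3: loop, unless already seen
            break
    return pics, visited
```

With stub callables backed by a dict of fake pages, crawl walks the chain of pages until find_next returns nothing.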
import re
import urllib
import urllib2

class GetPic():
    def __init__(self, curreLink, headers):
        print "begin initGetpic"
        self.firstlink = curreLink
        self.headers = headers
        self.picpath = "E:\\picwy\\"  # directory where the scraped pictures are saved
        print "end pic"

    def getThepage(self):  # fetch the HTML of the current page
        print "begin content"
        req = urllib2.Request(self.firstlink, headers=self.headers)
        content = urllib2.urlopen(req).read()
        print "end content"
        return content

    def getThePic(self, piclink, filename):  # save one picture, given its URL, to the given path
        print "begin get pic"
        urllib.urlretrieve(piclink, filename)
        print "end content"

    def saveThePic(self, pic_content, filename):  # kept for later extension; unused for now
        fopen = open(filename, "wb")  # binary mode, since picture data is not text
        fopen.write(pic_content)
        fopen.close()

    def getfromEverPage(self, piclinks):  # minimal reconstruction: called from main but missing above
        for i, piclink in enumerate(piclinks):
            self.getThePic(piclink, self.picpath + str(i) + ".jpg")

    def getNextLike(self, curmsg):  # find the link of the next page to scrape
        m = re.findall(r'', curmsg)  # pattern lost from the post; it matched next-page links
        for eachitem in m:
            print "itemNext" + eachitem
        thetemitem = [m[0]]
        if thetemitem in item:  # first candidate already visited, so take the second
            print "==========================="
            return m[1]
        return m[0]
The r'' regular expression in getNextLike is meant to extract the link of the next page to scrape. Because the next page cannot be known in advance, this step is fragile and easily raises exceptions during testing; I plan to improve it later.
(A screenshot here showed where the next-page link sits in the page source.)
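The exact patterns were stripped out when the post was published, so as an illustration only (not the original expressions, and with hypothetical URLs), this is how re.findall pulls picture links out of an HTML snippet:

```python
import re

# Illustration of the technique, not the original pattern: pull the src
# attributes of <img> tags out of an HTML snippet (URLs are made up).
html = ('<p><img src="http://img1.cache.netease.com/girl/a.jpg">'
        '<img src="http://img1.cache.netease.com/girl/b.jpg"></p>')
pic_links = re.findall(r'<img src="(http://[^"]+\.jpg)"', html)
# pic_links now holds the two .jpg URLs, in page order
```

The same findall-with-one-capture-group shape works for the next-page link too; only the pattern changes.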
item = []  # records the pages that have been scraped

def main():
    print "begin main"
    curtLink = "http://help.3g.163.com/15/0325/15/ALIHIPS800964JJI.html"  # first page to scrape
    item.append([curtLink])  # record the scraped page in item
    while curtLink:
        headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
        geetPic = GetPic(curtLink, headers)
        Mypage = geetPic.getThepage()
        m = re.findall(r'', Mypage)  # pattern lost from the post; it matched the picture URLs on the current page
        curtLink = geetPic.getNextLike(Mypage)
        print "curtLink=====" + curtLink
        geetPic.getfromEverPage(m)
        item.append([curtLink])
    print "End main"
    print str(item)
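The code above is Python 2 only (urllib2, print statements). As a sketch, not the original author's code, the two network operations look like this on Python 3; the GBK decode is an assumption about the page encoding, not taken from the post:

```python
import urllib.request

# Python 3 sketch of the crawler's two network operations. Assumption:
# the pages decode as GBK (common for Chinese sites, not confirmed here).
def get_the_page(link, headers):
    # Fetch the raw HTML of one page, sending the crawler's headers.
    req = urllib.request.Request(link, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("gbk", errors="replace")

def get_the_pic(pic_link, filename):
    # Download a single picture straight to disk.
    urllib.request.urlretrieve(pic_link, filename)
```

urllib.request merges what Python 2 split across urllib and urllib2, so both the Request-based fetch and urlretrieve come from the one module.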