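Crawling NetEase beauty pictures with Python

After a few days of study and half a day of analysis, I have a first version of a simple crawler working.

What the crawler does: downloads beauty pictures from NetEase.

Crawl link: http://help.3g.163.com/15/0601/17/AR1QC1OQ00964JJI.html

The approach is as follows:

1. First, download the pictures on the current page.

2. On the current page, find the link to the next page whose pictures should be downloaded.

3. Go back to step 1.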
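The three steps above can be sketched as a single loop. This is only a minimal illustration, not the code used later in this post: the function name `crawl`, its callback parameters, the `max_pages` limit, and the in-memory `fake_site` data are all made up for the example, and it is written so that it runs without any network access.

```python
def crawl(start_url, fetch_page, extract_images, find_next, max_pages=50):
    """Follow 'next page' links from start_url, collecting image URLs."""
    visited = []   # pages already crawled, in order
    images = []    # image URLs found so far
    url = start_url
    # Stop when there is no next link, the link was already crawled,
    # or the (hypothetical) page limit is reached.
    while url and url not in visited and len(visited) < max_pages:
        visited.append(url)
        page = fetch_page(url)                 # step 1: get the current page
        images.extend(extract_images(page))    # step 1: collect its pictures
        url = find_next(page)                  # step 2: find the next page
        # step 3: the loop goes back to step 1
    return visited, images


# Usage with an in-memory "site" instead of real HTTP requests.
fake_site = {
    "page1": {"images": ["a.jpg"], "next": "page2"},
    "page2": {"images": ["b.jpg", "c.jpg"], "next": "page1"},  # links back
}
visited, images = crawl(
    "page1",
    fetch_page=lambda url: fake_site[url],
    extract_images=lambda page: page["images"],
    find_next=lambda page: page["next"],
)
print(visited)
print(images)
```

Keeping a list of visited pages is what stops the loop when a page links back to one that was already crawled.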

The class is implemented as follows:

# Python 2 code (urllib2 and the print statement)
import re
import urllib
import urllib2


class GetPic():
    def __init__(self, curreLink, headers):
        print "begin initGetpic"
        self.firstlink = curreLink
        self.headers = headers
        self.picpath = "E:\\picwy\\"  # directory where downloaded pictures are stored
        print "end pic"

    def getThepage(self):  # fetch the content of the current page
        print "begin content"
        req = urllib2.Request(self.firstlink, headers=self.headers)
        content = urllib2.urlopen(req).read()
        print "end content"
        return content

    def getThePic(self, piclink, filename):  # download the picture at piclink and save it as filename
        print "begin get pic"
        urllib.urlretrieve(piclink, filename)
        print "end get pic"

    def saveThePic(self, pic_content, filename):  # kept for future extension; not used yet
        fopen = open(filename, "wb")  # binary mode, so the image bytes are written unchanged
        fopen.write(pic_content)
        fopen.close()





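    def getNextLike(self, curmsg):  # get the link of the next page to crawl
        m = re.findall(r'...', curmsg)  # the actual pattern is missing from this text
        if not m:  # no candidate link found, so tell the caller to stop
            return None
        for eachitem in m:
            print "itemNext" + eachitem
        thetemitem = [m[0]]
        if thetemitem in item and len(m) > 1:  # the first candidate was already crawled
            print "==========================="
            return m[1]
        return m[0]

The regular expression passed to re.findall is meant to pick out the next page to crawl. Because the next page is not predictable, this step tends to raise exceptions during testing; I plan to improve it later.

[Figure: location of the next-page link in the page]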
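One way to make the next-link step less fragile is to return the first candidate that has not been crawled yet, and `None` when there is none. The sketch below is an assumption-laden illustration in Python 3 syntax, not the method above: `NEXT_LINK_RE` is a hypothetical pattern (the pattern actually used in this post is not available), `pick_next_link` is a made-up helper name, and the sample HTML and URLs are invented.

```python
import re

# Hypothetical pattern: absolute links to article pages on the same host.
NEXT_LINK_RE = re.compile(r'href="(http://help\.3g\.163\.com/[^"]+\.html)"')


def pick_next_link(html, visited):
    """Return the first candidate link that is not in visited, or None."""
    for link in NEXT_LINK_RE.findall(html):
        if link not in visited:
            return link
    return None  # nothing new to crawl: the caller should stop


# Invented sample markup with one already-crawled link and one new link.
sample = (
    '<a href="http://help.3g.163.com/15/0001/aa.html">prev</a>'
    '<a href="http://help.3g.163.com/15/0002/bb.html">next</a>'
)
seen = {"http://help.3g.163.com/15/0001/aa.html"}
print(pick_next_link(sample, seen))
print(pick_next_link("<p>no links</p>", seen))
```

Returning `None` instead of indexing into an empty list lets a `while` loop over the current link end cleanly when a page has no usable link.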

    def getfromEverPage(self, m):  # download every picture whose URL is in m
        print "begin getfromEverPage"
        for item in m:
            item = item.replace("\"", "")
            print item
            pos = item.rfind("/")
            if pos != -1:
                filename = item[pos + 1:]
                print "filename:" + filename
                Myfilepath = self.picpath + filename
                self.getThePic(item, Myfilepath)
        print "End getfromEverPage"
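Taking everything after the last "/" works for plain image URLs, but a URL with a query string would put the "?..." part into the file name. As a hedged alternative, the sketch below (Python 3 standard library, whereas the code in this post is Python 2) derives the file name from the URL path only. The helper name `local_path_for` and the example URLs are made up for illustration.

```python
import os
import posixpath
from urllib.parse import urlsplit


def local_path_for(pic_url, save_dir):
    """Map a picture URL to a file path inside save_dir, or None if it has no file name."""
    url_path = urlsplit(pic_url.strip('"')).path  # drops any ?query and #fragment
    filename = posixpath.basename(url_path)       # text after the last "/"
    if not filename:
        return None
    return os.path.join(save_dir, filename)


print(local_path_for('"http://example.com/img/a.jpg?x=1"', "pics"))
print(local_path_for("http://example.com/", "pics"))
```

Using `os.path.join` also avoids having to end the save directory with a path separator by hand.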



     

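The main flow is implemented as follows:

item = []  # pages crawled so far; also read by GetPic.getNextLike


def main():
    print "begin main"
    curtLink = "http://help.3g.163.com/15/0325/15/ALIHIPS800964JJI.html"  # first page to crawl
    item.append([curtLink])  # record the crawled page in item
    while curtLink:
        headers = {'User-Agent': ' Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
        geetPic = GetPic(curtLink, headers)
        Mypage = geetPic.getThepage()
        # This regular expression yields the picture addresses on the current page
        # (the actual pattern is missing from this text).
        m = re.findall(r'...', Mypage)
        curtLink = geetPic.getNextLike(Mypage)
        print "curtLink=====" + str(curtLink)
        geetPic.getfromEverPage(m)
        item.append([curtLink])
    print "End main"
    print str(item)


if __name__ == "__main__":
    main()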
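As an alternative to a regular expression for finding the picture addresses, the standard library's HTML parser can collect the `src` attribute of every `<img>` tag. This is a swapped-in technique shown only as a sketch in Python 3 syntax, not what the code above does. The names `ImgSrcCollector` and `image_urls` and the sample markup are made up for the example, and a real page would probably need extra filtering to skip icons and other unrelated images.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def image_urls(html):
    """Return the src values of all <img> tags in html, in document order."""
    collector = ImgSrcCollector()
    collector.feed(html)
    return collector.sources


# Invented sample markup: two images with src, one without.
page = '<p><img src="http://example.com/a.jpg"><img alt="x"><img src="b.png"/></p>'
print(image_urls(page))
```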

