Python Web Crawler: Scraping Emoji Packs from Chinaz (站长素材)

Because I don't check group messages very often, my collection of emoji stickers is small, so whenever an emoji battle breaks out in a group chat I end up on the losing side. I recently took Professor Song Tian's "Python Web Crawling and Information Extraction" course on China University MOOC (中国大学MOOC), so I decided to write a crawler that collects emoji packs from the web. A quick search showed that Chinaz (站长素材) has a rich collection: 446 pages with 10 packs per page, which adds up to more than 4,000 packs and close to ten thousand emojis. Let's see who still dares to challenge me to an emoji battle.

Technical approach

requests + BeautifulSoup

Page analysis

The first page of emoji packs on Chinaz looks like this:

[Figure 1: first page of the emoji pack listing]

As you can see, the URL of the first page is: http://sc.chinaz.com/biaoqing/index.html

Clicking the page buttons at the bottom shows that the second page's URL is: http://sc.chinaz.com/biaoqing/index_2.html

From this we can infer that the 446th page's URL is: http://sc.chinaz.com/biaoqing/index_446.html
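
Based on this pattern, the full set of listing-page URLs can be built up front. Here is a minimal sketch (the count of 446 pages simply reflects the listing at the time of writing):

# Build the listing-page URLs from the observed pattern:
# page 1 is index.html, page N (N >= 2) is index_N.html
base = "http://sc.chinaz.com/biaoqing/"
page_urls = [base + "index.html"]
page_urls += [base + "index_{}.html".format(n) for n in range(2, 447)]
print(page_urls[:3])  # sanity check: the first three URLs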

Next, analyze the source code of the pack listing on each page:

[Figure 2: source of the pack listing page]

Then analyze the page that contains all the emojis of a single pack:

[Figure 3: pack detail page]

[Figure 4: source of the pack detail page]
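
Judging from the screenshots and from the selectors used in the full code below, each pack on a listing page sits inside a div with class "up", and its "num_1" child holds the a tag that carries the pack's title and the link to the pack's detail page. A minimal parsing sketch under that assumption follows; the HTML snippet is a made-up stand-in for the real page:

from bs4 import BeautifulSoup

# Toy HTML mimicking the listing-page structure seen in the screenshots
sample_html = '''
<div class="up">
  <div class="num_1">
    <a href="http://sc.chinaz.com/biaoqing/sample.html" title="sample pack">sample pack</a>
  </div>
</div>
'''
soup = BeautifulSoup(sample_html, "html.parser")
for div in soup.find_all("div", attrs={"class": "up"}):
    a = div.find("div", attrs={"class": "num_1"}).find("a")
    print(a["title"], a["href"])  # pack title and detail-page URL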

Steps

1. Get the link and title of every emoji pack shown on each listing page.

2. Get the links of all emojis in each pack.

3. Download each emoji from the links collected in step 2; the emojis of each pack go into their own folder, named after the pack's title attribute.

Code

#-*-coding:utf-8-*-
'''
Created on 2017-03-18
@author: lavi
'''
import bs4
from bs4 import BeautifulSoup
import re
import requests
import os
import traceback
'''
Fetch the HTML text of a page
'''
def getHtmlText(url):
    try:
        r = requests.get(url,timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""
'''
Fetch the binary content of a URL (e.g. an image)
'''    
def getImgContent(url):
    head = {"user-agent":"Mozilla/5.0"}
    try:
        r = requests.get(url,headers=head,timeout=30)
        print("status_code:"+r.status_code)
        r.raise_for_status()
        return r.content
    except:
        return None
    
'''
Collect the title and link of every emoji pack on a listing page
'''    
def getTypeUrlList(html,typeUrlList):
    soup = BeautifulSoup(html,'html.parser')
    divs = soup.find_all("div", attrs={"class":"up"})
    for div in divs:
        a = div.find("div", attrs={"class":"num_1"}).find("a")
        title = a.attrs["title"]
        typeUrl = a.attrs["href"]
        typeUrlList.append((title,typeUrl))
'''
For each pack, collect the URLs of all of its emoji images
'''
def getImgUrlList(typeUrlList,imgUrlDict):
    for title, url in typeUrlList:
        title_imgUrlList = []
        html = getHtmlText(url)
        soup = BeautifulSoup(html,"html.parser")
        #print(soup.prettify())
        
        div = soup.find("div", attrs={"class":"img_text"})
        #print(type(div))
        imgDiv = div.next_sibling.next_sibling
        #print(type(imgDiv))
        imgs = imgDiv.find_all("img")
        for img in imgs:
            src = img.attrs["src"]
            title_imgUrlList.append(src)
        imgUrlDict[title] = title_imgUrlList
'''
Download every emoji and save it into a folder named after its pack's title
'''
def getImage(imgUrlDict,file_path):
    head = {"user-agent":"Mozilla/5.0"}
    countdir = 0
    for title,imgUrlList in imgUrlDict.items():
        #print(title+":"+str(imgUrlList))
        try:
            dir_path = file_path + title
            if not os.path.exists(dir_path):
                os.mkdir(dir_path)
            countfile = 0
            for imgUrl in imgUrlList:
                path = dir_path + "/" + imgUrl.split("/")[-1]
                #print(path)
                #print(imgUrl)
                if not os.path.exists(path):
                    r = requests.get(imgUrl,headers=head,timeout=30)
                    r.raise_for_status()
                    with open(path,"wb") as f:
                        f.write(r.content)
                    countfile = countfile + 1
                    print("Progress within current pack: {:.2f}%".format(countfile*100/len(imgUrlList)))
            countdir = countdir + 1
            print("文件夹进度{:.2f}%".format(countdir*100/len(imgUrlDict)))
            
        except:
            traceback.print_exc()
            #print("from getImage: download failed")
        
def main():
    #To avoid filling up the disk, only crawl the first 30 pages (roughly 300 emoji packs) instead of all 446
    pages = 30
    root = "http://sc.chinaz.com/biaoqing/"
    url = "http://sc.chinaz.com/biaoqing/index.html"
    file_path = "e://biaoqing/"
    imgUrlDict = {}
    typeUrlList = []
    html = getHtmlText(url)
    getTypeUrlList(html,typeUrlList)
    getImgUrlList(typeUrlList,imgUrlDict)
    getImage(imgUrlDict,file_path)
    #The first page (index.html) was handled above; the remaining pages are index_2.html ... index_30.html
    for page in range(2, pages + 1):
        url = root + "index_" + str(page) + ".html"
        imgUrlDict = {}
        typeUrlList = []
        html = getHtmlText(url)
        getTypeUrlList(html,typeUrlList)
        getImgUrlList(typeUrlList,imgUrlDict)
        getImage(imgUrlDict,file_path)
        
main()
Results

[Figure 5: downloaded emoji pack folders]

[Figure 6: downloaded emojis inside a pack folder]

If you've been losing emoji battles in your group chats, just run the program above... No need to thank me, March is Learn-from-Lei-Feng month. Haha, now come and have an emoji battle with me.
