Since I rarely check group chat messages, I have only a small collection of emoji packs, so I always end up on the losing side whenever a meme battle breaks out in a group. I recently took Song Tian's course "Python Web Crawling and Information Extraction" on China University MOOC (中国大学MOOC), so I decided to write a crawler that scrapes emoji packs from the web. A quick search showed that 站长素材 (sc.chinaz.com) has a rich collection: 446 pages with 10 packs per page, which comes to more than 4,000 packs and nearly ten thousand individual images. Let's see who still dares to challenge me to a meme battle after this.
Technical Approach
requests + BeautifulSoup
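As a quick orientation before the page analysis, here is a minimal sketch of the requests + BeautifulSoup workflow the crawler is built on; the page-title check at the end is only an illustrative smoke test, not part of the crawler itself.

import requests
from bs4 import BeautifulSoup

# Fetch the first index page and parse it
resp = requests.get("http://sc.chinaz.com/biaoqing/index.html", timeout=30)
resp.raise_for_status()
resp.encoding = resp.apparent_encoding

soup = BeautifulSoup(resp.text, "html.parser")
# Print the page title to confirm that fetching and parsing both work
print(soup.title.string if soup.title else "no <title> found")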
Page Analysis
站长素材 lists the emoji packs ten to a page.
The URL of the first page is: http://sc.chinaz.com/biaoqing/index.html
Clicking the pagination buttons at the bottom shows that the URL of the second page is: http://sc.chinaz.com/biaoqing/index_2.html
From this we can infer that the URL of page 446 is: http://sc.chinaz.com/biaoqing/index_446.html
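Given that pattern, the full list of index-page URLs can be generated up front; a small sketch follows (note that page 1 has no numeric suffix, which the loop in the final code also has to account for):

# Build the list of index-page URLs from the observed pattern:
# page 1 is index.html, pages 2..446 are index_<n>.html
root = "http://sc.chinaz.com/biaoqing/"
page_count = 446

page_urls = [root + "index.html"]
page_urls += [root + "index_{}.html".format(n) for n in range(2, page_count + 1)]

print(page_urls[0])   # http://sc.chinaz.com/biaoqing/index.html
print(page_urls[1])   # http://sc.chinaz.com/biaoqing/index_2.html
print(page_urls[-1])  # http://sc.chinaz.com/biaoqing/index_446.html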
Next, look at the HTML behind the pack list on each index page: every pack sits in a div with class "up", and the div with class "num_1" inside it holds an <a> tag whose title attribute is the pack's name and whose href points to the pack's own page.
Then look at the page that holds all the images of a single pack: the images are <img> tags inside the div that immediately follows the div with class "img_text", and each src attribute is the URL of one image.
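To make that structure concrete, here is a hedged sketch that runs the same BeautifulSoup navigation the crawler uses against simplified, hand-written fragments of the two kinds of markup; the real pages of course carry many more attributes and wrappers.

from bs4 import BeautifulSoup

# Simplified stand-in for one pack entry on an index page
list_html = '''
<div class="up">
  <div class="num_1"><a href="http://sc.chinaz.com/biaoqing/sample-pack.html" title="sample pack"></a></div>
</div>
'''
soup = BeautifulSoup(list_html, "html.parser")
a = soup.find("div", attrs={"class": "up"}).find("div", attrs={"class": "num_1"}).find("a")
print(a["title"], a["href"])  # the pack's name and the URL of its own page

# Simplified stand-in for a pack page: the images live in the div right after "img_text"
pack_html = '''
<div class="img_text">sample pack</div>
<div>
  <img src="http://example.com/1.gif"/>
  <img src="http://example.com/2.gif"/>
</div>
'''
soup = BeautifulSoup(pack_html, "html.parser")
img_div = soup.find("div", attrs={"class": "img_text"}).next_sibling.next_sibling
print([img["src"] for img in img_div.find_all("img")])  # the image URLs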
Steps
1. Get the link and title of every pack shown on each index page.
2. Get the links of all the images in each pack.
3. Download the images using those links, putting the images of each pack into their own folder named after the pack's title attribute (see the note on sanitizing that title right after this list).
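One practical caveat the code below does not handle: a pack title can contain characters that Windows forbids in folder names, which would make os.mkdir fail. A small hypothetical helper (not part of the original crawler) that could be applied to the title before creating the folder:

import re

def safe_dir_name(title):
    # Replace characters that Windows forbids in folder names; purely an optional precaution
    return re.sub(r'[\\/:*?"<>|]', "_", title).strip()

print(safe_dir_name('斗图?必备:表情'))  # prints 斗图_必备_表情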
Code
# -*- coding: utf-8 -*-
'''
Created on 2017-03-18
@author: lavi
'''
import os
import traceback

import requests
from bs4 import BeautifulSoup


def getHtmlText(url):
    '''Fetch a page and return its decoded text, or "" on any failure.'''
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""


def getImgContent(url):
    '''Fetch binary content (e.g. an image) and return it, or None on failure.'''
    head = {"user-agent": "Mozilla/5.0"}
    try:
        r = requests.get(url, headers=head, timeout=30)
        print("status_code: " + str(r.status_code))
        r.raise_for_status()
        return r.content
    except:
        return None


def getTypeUrlList(html, typeUrlList):
    '''Collect a (title, url) pair for every emoji pack listed on one index page.'''
    soup = BeautifulSoup(html, 'html.parser')
    divs = soup.find_all("div", attrs={"class": "up"})
    for div in divs:
        a = div.find("div", attrs={"class": "num_1"}).find("a")
        title = a.attrs["title"]
        typeUrl = a.attrs["href"]
        typeUrlList.append((title, typeUrl))


def getImgUrlList(typeUrlList, imgUrlDict):
    '''For each pack page, collect the URLs of all the images it contains.'''
    for title, url in typeUrlList:
        title_imgUrlList = []
        html = getHtmlText(url)
        soup = BeautifulSoup(html, "html.parser")
        div = soup.find("div", attrs={"class": "img_text"})
        # The div holding the images is the element sibling right after "img_text"
        imgDiv = div.next_sibling.next_sibling
        imgs = imgDiv.find_all("img")
        for img in imgs:
            src = img.attrs["src"]
            title_imgUrlList.append(src)
        imgUrlDict[title] = title_imgUrlList


def getImage(imgUrlDict, file_path):
    '''Download every image, one sub-folder per pack, named after the pack title.'''
    head = {"user-agent": "Mozilla/5.0"}
    countdir = 0
    for title, imgUrlList in imgUrlDict.items():
        try:
            pack_dir = file_path + title
            if not os.path.exists(pack_dir):
                os.mkdir(pack_dir)
            countfile = 0
            for imgUrl in imgUrlList:
                path = pack_dir + "/" + imgUrl.split("/")[-1]
                if not os.path.exists(path):
                    r = requests.get(imgUrl, headers=head, timeout=30)
                    r.raise_for_status()
                    with open(path, "wb") as f:
                        f.write(r.content)
                countfile = countfile + 1
                print("progress within current pack: {:.2f}%".format(countfile * 100 / len(imgUrlList)))
            countdir = countdir + 1
            print("overall pack progress: {:.2f}%".format(countdir * 100 / len(imgUrlDict)))
        except:
            traceback.print_exc()


def main():
    # To keep the disk from filling up, only fetch the first 30 pages (roughly 300 packs)
    pages = 30
    root = "http://sc.chinaz.com/biaoqing/"
    url = "http://sc.chinaz.com/biaoqing/index.html"
    file_path = "e://biaoqing/"
    imgUrlDict = {}
    typeUrlList = []
    # Page 1 has no numeric suffix, so handle it separately
    html = getHtmlText(url)
    getTypeUrlList(html, typeUrlList)
    getImgUrlList(typeUrlList, imgUrlDict)
    getImage(imgUrlDict, file_path)
    # Pages 2..pages follow the index_<n>.html pattern
    for page in range(2, pages + 1):
        url = root + "index_" + str(page) + ".html"
        imgUrlDict = {}
        typeUrlList = []
        html = getHtmlText(url)
        getTypeUrlList(html, typeUrlList)
        getImgUrlList(typeUrlList, imgUrlDict)
        getImage(imgUrlDict, file_path)


main()
Results
If you've been losing meme battles in your group chats, just run the program above... no need to thank me: March is Learn-from-Lei-Feng Month. Haha, now come and have a meme battle with me.