This article walks through a Python crawler for Weibo post content (images can be scraped in much the same way). Starting from a given seed user id, it collects the users that the seed user follows and keeps crawling until the stopping condition is met.
I. Project structure:
1. main.py contains the overall crawl logic.
2. url_manager.py manages the URLs.
3. html_parser.py bundles the page downloader, the page parser, and the saving of posts into one module. (In principle these should be separate, but I merged them here for convenience.)
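Judging from the import in main.py (from craw4weibo import html_parser, url_manager), the three modules presumably live in a package named craw4weibo; a rough layout (the __init__.py is my assumption) would be:

craw4weibo/
    __init__.py       # makes the directory importable as a package
    main.py           # SpiderMain: the crawl loop
    url_manager.py    # UrlManager: sets of new/old URLs
    html_parser.py    # HtmlParser: download, parse, save posts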
II. Program walkthrough:
1. Main module main.py:
Code:
from craw4weibo import html_parser, url_manager


class SpiderMain(object):
    def __init__(self):
        self.urls = url_manager.UrlManager()    # URL manager
        self.parser = html_parser.HtmlParser()  # downloader / parser / saver in one

    def craw(self, uid):
        count = 1
        root_url = 'https://weibo.cn/%s' % uid
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print 'craw %d : %s' % (count, new_url)
                new_urls = self.parser.parse(new_url)
                self.urls.add_new_urls(new_urls)
                if count == 1:          # only crawl one user; raise this to crawl more
                    break
                count = count + 1
            except:
                print "craw failed"


if __name__ == "__main__":
    root_uid = "#Your root id"
    spider = SpiderMain()
    spider.craw(root_uid)
In the SpiderMain class, the craw method follows this logic:
1. Build the root URL from the seed user id and add it to the URL manager.
2. Enter the while loop as long as the URL manager's new_urls set is not empty.
3. Take a new URL out of new_urls and pass it to the parse function in html_parser, which downloads and parses the page, saves the posts we want, and returns the newly discovered URLs.
4. Add the URLs returned in step 3 to the URL manager and check whether the crawl limit has been reached. Here I only crawl a single user, so the limit is count == 1.
5. Repeat from step 2 until the while loop exits.
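If you want to crawl more than one user, a minimal variation of craw might look like the sketch below; max_users is a hypothetical parameter of mine, not part of the original code, and everything else mirrors the loop above:

    def craw(self, uid, max_users=10):      # max_users is an assumed, illustrative parameter
        count = 1
        self.urls.add_new_url('https://weibo.cn/%s' % uid)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print 'craw %d : %s' % (count, new_url)
                self.urls.add_new_urls(self.parser.parse(new_url))
                if count >= max_users:      # stop once enough users have been crawled
                    break
                count = count + 1
            except:
                print "craw failed"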
2. URL manager url_manager.py:
class UrlManager(object):
    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
The functionality is simple: store new and old URLs and hand out URLs to crawl. The one thing to be careful about is deduplication: when adding a new URL, check whether it is already in the unused set or the used set, and if so, ignore it.
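A quick check of the deduplication behaviour, using the UrlManager class above (the URLs are made up for illustration):

manager = UrlManager()
manager.add_new_url('https://weibo.cn/1234567890')      # accepted into new_urls
manager.add_new_url('https://weibo.cn/1234567890')      # duplicate of a pending URL, ignored
url = manager.get_new_url()                             # handed out and moved to old_urls
manager.add_new_urls(['https://weibo.cn/1234567890',    # already in old_urls, ignored
                      'https://weibo.cn/1112223334'])   # unseen, accepted
print manager.has_new_url()                             # True: only the second URL is pending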
3. Page downloading, parsing, saving, and new-URL extraction, html_parser.py:
import sys
import requests
import time
from lxml import etree


class HtmlParser(object):
    def _save_user_data(self, url):
        print url
        cookie = "#Your Cookie"     # NOTE: requests expects a dict here, e.g. {'cookie_name': 'value'}
        headers = "#Your header"    # NOTE: also a dict, e.g. {'User-Agent': '...'}
        # fetch the user's home page with requests
        response = requests.get(url, cookies=cookie, headers=headers)
        print "code:", response.status_code
        html_cont = response.content
        # the user id is the last path segment of the URL
        res_url = url.split("/")
        num = len(res_url)
        user_id = res_url[num - 1]
        selector = etree.HTML(html_cont)
        # total number of pages, read from the <input name="mp"> pager field
        pageNum = int(selector.xpath('//input[@name="mp"]')[0].attrib['value'])
        print "page", pageNum
        result = ""
        word_count = 1
        # download the pages in 5 rounds, sleeping between pages and between rounds
        times = 5
        one_step = pageNum / times
        for step in range(times):
            if step < times - 1:
                i = step * one_step + 1
                j = (step + 1) * one_step + 1
            else:
                i = step * one_step + 1
                j = pageNum + 1         # the last round takes the remaining pages
            for page in range(i, j):
                try:
                    # download one page of original posts (filter=1)
                    url = 'https://weibo.cn/%s?filter=1&page=%d' % (user_id, page)
                    lxml = requests.get(url, cookies=cookie, headers=headers).content
                    selector = etree.HTML(lxml)
                    content = selector.xpath('//span[@class="ctt"]')
                    for each in content:
                        text = each.xpath('string(.)')
                        if word_count >= 3:
                            text = "%d: " % (word_count - 2) + text + "\n"
                        else:
                            text = text + "\n\n"
                        result = result + text
                        word_count += 1
                    print 'getting', page, 'page word ok!'
                    sys.stdout.flush()
                except:
                    print page, 'error'
                print page, 'sleep'
                sys.stdout.flush()
                time.sleep(3)           # pause between pages
            print 'round', step + 1, 'done, sleeping'
            time.sleep(10)              # longer pause between rounds
        try:
            file_name = "%s.txt" % user_id
            fo = open(file_name, "wb")
            fo.write(result.encode('utf-8'))
            fo.close()
            print 'finished saving posts'
        except:
            print 'cannot write output file'
        sys.stdout.flush()
    def _get_new_urls(self, url):
        new_urls = set()
        # the user id is the last path segment of the URL
        res_url = url.split("/")
        num = len(res_url)
        user_id = res_url[num - 1]
        cookie = "#Your Cookie"     # dict, same as in _save_user_data
        headers = "#Your header"    # dict, same as in _save_user_data
        url = 'https://weibo.cn/%s/follow' % user_id
        response = requests.get(url, cookies=cookie, headers=headers).content
        selector = etree.HTML(response)
        pageNum = int(selector.xpath('//input[@name="mp"]')[0].attrib['value'])
        times = 5
        one_step = pageNum / times
        for step in range(times):
            if step < times - 1:
                i = step * one_step + 1
                j = (step + 1) * one_step + 1
            else:
                i = step * one_step + 1
                j = pageNum + 1
            for page in range(i, j):
                try:
                    url = 'https://weibo.cn/%s/follow?page=%d' % (user_id, page)
                    lxml = requests.get(url, cookies=cookie, headers=headers).content
                    selector = etree.HTML(lxml)
                    # links to followed users; note there is no tbody in this XPath
                    content = selector.xpath('/html/body/table/tr/td[2]/a[1]')
                    for c in content:
                        temp_url = c.attrib['href']
                        if temp_url is not None:
                            new_urls.add(temp_url)
                    print 'getting follow', page, 'page ok!'
                    sys.stdout.flush()
                except:
                    print page, 'error'
                print page, 'sleep'
                sys.stdout.flush()
                time.sleep(3)
            print 'round', step + 1, 'done, sleeping'
            time.sleep(10)
        return new_urls
    def parse(self, url):
        if url is None:
            return
        print "saving"
        self._save_user_data(url)
        print "getting"
        new_urls = self._get_new_urls(url)
        print "finishing"
        return new_urls
1. Notes on the post-saving code:
The download part is straightforward: the requests library sends requests that imitate a browser (via the cookie and headers) and fetches the posts, and all pages are downloaded and saved over 5 rounds, with the page range for each round computed from pageNum. For example, with pageNum = 23 and times = 5, one_step = 4, so the rounds cover pages 1-4, 5-8, 9-12, 13-16, and 17-23 (the last round absorbs the remainder); a standalone sketch of this computation is shown right below. Beyond that, two points need attention, listed after the sketch:
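A minimal, standalone sketch of that range computation (the function name page_ranges is mine; it mirrors the arithmetic in _save_user_data):

def page_ranges(page_num, times=5):
    # split pages 1..page_num into `times` rounds; the last round absorbs the remainder
    one_step = page_num / times
    ranges = []
    for step in range(times):
        i = step * one_step + 1
        j = (step + 1) * one_step + 1 if step < times - 1 else page_num + 1
        ranges.append((i, j))       # each round then iterates range(i, j)
    return ranges

print page_ranges(23)   # [(1, 5), (5, 9), (9, 13), (13, 17), (17, 24)]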
1) Sleep intervals: Weibo's anti-crawling measures are quite sensitive nowadays; if you request too frequently, the cookie soon stops working and you start getting 403 responses, so sleep intervals are necessary. You can set them to 60 seconds per page and 300 seconds per round, but that is very slow: sleeping one minute after every page and five minutes after every round means a user with 900-plus pages takes a very long time, and I have not solved this problem yet. One way to organise the waiting and retrying is sketched after these notes.
2) XPath issues with selector: an XPath copied straight from Chrome's F12 developer tools has been "optimized" by Chrome and may produce empty matches when you feed it to selector. For example:
content = selector.xpath('/html/body/table/tr/td[2]/a[1]')
Chrome rewrites this as .../table/tbody/tr/..., which matches nothing here, so the tbody has to be removed. For more on XPath, see the 菜鸟教程 (runoob) XPath tutorial.
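A minimal sketch of a throttled fetch helper along the lines of note 1); the function name fetch_with_retry, the retry count, and the delay values are my own choices rather than part of the original code, and cookies/headers are assumed to be the dicts you extract from your logged-in browser session:

import time
import requests

def fetch_with_retry(url, cookies, headers, retries=3, page_delay=60, block_delay=300):
    # fetch one page politely: wait after every request, back off much longer on a 403
    for attempt in range(retries):
        response = requests.get(url, cookies=cookies, headers=headers)
        if response.status_code == 200:
            time.sleep(page_delay)          # normal per-page pause
            return response.content
        if response.status_code == 403:
            print 'got 403, sleeping %d s before retrying' % block_delay
            time.sleep(block_delay)         # likely throttled: wait much longer
        else:
            time.sleep(page_delay)
    return None                             # give up after `retries` attempts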
One final reminder: if you get 403 Forbidden, or page error is printed while crawling, or an index-out-of-range error occurs when reading pageNum (the XPath returns an empty list, so indexing it fails), it usually means your cookie has expired or Sina has temporarily blocked your IP.
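A small defensive sketch of the pageNum extraction that turns those symptoms into readable errors instead of a bare IndexError (get_page_num is my own helper name; cookies and headers are again your own dicts):

import requests
from lxml import etree

def get_page_num(url, cookies, headers):
    response = requests.get(url, cookies=cookies, headers=headers)
    if response.status_code == 403:
        raise RuntimeError('403 Forbidden: the cookie has probably expired or the IP is temporarily blocked')
    pager = etree.HTML(response.content).xpath('//input[@name="mp"]')
    if not pager:
        # empty match: usually the same cookie/IP problem, the pager input was not rendered
        raise RuntimeError('pager <input name="mp"> not found: check your cookie')
    return int(pager[0].attrib['value'])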
References:
1. The imooc web crawler course
2. "Python crawler for a given user's Weibo images and posts, with post classification, usage-habit analysis, and visualised charts"
Many thanks to both!
The source code for this article can be downloaded from GitHub.
P.S. Corrections are welcome if anything here is inaccurate.