[Python in Practice] - No.2 Building a Weibo Crawler in Python

This post walks through a Python crawler for Weibo posts (images could be fetched in much the same way). Given a starting user id as the seed, the crawler collects the users that this user follows and keeps expanding from them until the stop condition is met.

I. Project structure:

(Figure 1: project directory structure)

1. main.py holds the main crawl loop and the overall program logic.

2. url_manager.py manages the URLs to be crawled.

3. html_parser.py bundles the page downloader, the page parser and post saving into one module. (In theory these should be separate modules; they are merged here purely for convenience — a sketch of a standalone downloader follows below.)
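
For reference, splitting out the downloader could look like the minimal sketch below (the file name html_downloader.py, the class HtmlDownloader and its download method are my own naming, not part of this project); html_parser.py would then only parse pages and save posts:

# html_downloader.py -- hypothetical standalone downloader, not part of the original project
import requests


class HtmlDownloader(object):

    def download(self, url, cookies, headers):
        # return the raw page content, or None if the request fails
        if url is None:
            return None
        response = requests.get(url, cookies=cookies, headers=headers)
        if response.status_code != 200:
            return None
        return response.content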


II. Code walkthrough:

1. Main module main.py:

Code:

from craw4weibo import html_parser, url_manager


class SpiderMain(object):

    def __init__(self):
        self.urls = url_manager.UrlManager()
        self.parser = html_parser.HtmlParser()

    def craw(self, uid):
        count = 1
        # build the seed URL from the starting user id
        root_url = 'https://weibo.cn/%s' % uid
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print 'craw %d : %s' % (count, new_url)
                # download, parse and save this user's posts,
                # and collect the URLs of the users he or she follows
                new_urls = self.parser.parse(new_url)
                self.urls.add_new_urls(new_urls)

                # stop condition: only one user is crawled here
                if count == 1:
                    break
                count = count + 1
            except:
                print "craw failed"


if __name__ == "__main__":
    root_uid = "#Your root id"
    spider = SpiderMain()
    spider.craw(root_uid)
 
In the SpiderMain class, the craw method works as follows:

1. Build the seed URL from the starting user id and add it to the URL manager.

2. As long as the URL manager still holds unvisited URLs, stay in the while loop.

3. Take a new URL from the manager and pass it to the parse function of html_parser. parse downloads and parses the page, saves the posts we want, and returns the newly discovered URLs.

4. Add the URLs returned in step 3 to the URL manager and check whether the stop condition has been reached. Here I only crawl a single user, so the limit on count is 1 (a more general limit is sketched after this list).

5. Repeat from step 2 until the while loop exits.
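
If you want to crawl more than one user, the stop condition generalizes easily; a minimal sketch of craw with a configurable limit (the max_users parameter is my own addition, not in the original code):

    def craw(self, uid, max_users=10):
        count = 1
        root_url = 'https://weibo.cn/%s' % uid
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                new_urls = self.parser.parse(new_url)
                self.urls.add_new_urls(new_urls)
                # stop after max_users users instead of just one
                if count >= max_users:
                    break
                count = count + 1
            except:
                print "craw failed"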


2. URL manager url_manager.py:

class UrlManager(object):

    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs that have already been crawled

    def add_new_url(self, url):
        if url is None:
            return
        # ignore URLs we have already seen, whether crawled or not
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        # hand out an unvisited URL and mark it as visited
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url


The logic here is simple: store new and visited URLs and hand them out one by one. The one thing to be careful about is that when adding a URL we check whether it is already in the unvisited set or in the visited set; if it is, the URL is ignored so that no page is crawled twice.
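
A quick usage example of the deduplication behaviour (the uid below is made up; the commented values are what the class above produces):

manager = UrlManager()
manager.add_new_url('https://weibo.cn/1234567890')
manager.add_new_url('https://weibo.cn/1234567890')   # duplicate, ignored
print manager.has_new_url()                          # True
url = manager.get_new_url()                          # moved to old_urls
manager.add_new_url(url)                             # already visited, ignored
print manager.has_new_url()                          # False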


3. Page downloading, parsing, saving and new-URL extraction, html_parser.py:

import sys
import os
import requests
import time
from lxml import etree


class HtmlParser(object):

    def _save_user_data(self, url):
        print url
        # fill in your own weibo.cn login cookie and request headers as dicts,
        # e.g. cookie = {'SUB': '...'}, headers = {'User-Agent': '...'}
        cookie = {}    # Your cookie
        headers = {}   # Your header
        # fetch the user's profile page with requests
        response = requests.get(url, cookies=cookie, headers=headers)
        print "code:", response.status_code
        html_cont = response.content
        # the current user_id is the last segment of the URL
        res_url = url.split("/")
        user_id = res_url[-1]
        selector = etree.HTML(html_cont)
        # total number of post pages, read from the hidden paging input
        pageNum = int(selector.xpath('//input[@name="mp"]')[0].attrib['value'])
        print "page", pageNum
        result = ""
        word_count = 1
        # download the pages in 5 rounds, with a longer pause between rounds
        times = 5
        one_step = pageNum / times
        for step in range(times):
            # pages [i, j) handled in this round; the last round takes all remaining pages
            if step < times - 1:
                i = step * one_step + 1
                j = (step + 1) * one_step + 1
            else:
                i = step * one_step + 1
                j = pageNum + 1
            for page in range(i, j):
                try:
                    # download one page of posts; filter=1 keeps original posts only
                    url = 'https://weibo.cn/%s?filter=1&page=%d' % (user_id, page)
                    lxml = requests.get(url, cookies=cookie, headers=headers).content
                    selector = etree.HTML(lxml)
                    # every post body sits in a <span class="ctt">
                    content = selector.xpath('//span[@class="ctt"]')
                    for each in content:
                        text = each.xpath('string(.)')
                        # the first two ctt spans are profile info, not posts
                        if word_count >= 3:
                            text = "%d: " % (word_count - 2) + text + "\n"
                        else:
                            text = text + "\n\n"
                        result = result + text
                        word_count += 1
                    print 'page', page, 'posts fetched'
                    sys.stdout.flush()

                except:
                    print page, 'error'
                print page, 'sleep'
                sys.stdout.flush()
                time.sleep(3)
            print 'round', step + 1, 'finished, pausing before the next round'
            time.sleep(10)

        try:
            # write all collected posts to <user_id>.txt
            file_name = "%s.txt" % user_id
            fo = open(file_name, "wb")
            fo.write(result.encode('utf-8'))
            fo.close()
            print 'finished saving posts'
        except:
            print 'failed to write file'
        sys.stdout.flush()

    def _get_new_urls(self, url):
        new_urls = set()
        # the current user_id is the last segment of the URL
        res_url = url.split("/")
        user_id = res_url[-1]
        # same cookie / headers dicts as in _save_user_data
        cookie = {}    # Your cookie
        headers = {}   # Your header
        # the user's follow list
        url = 'https://weibo.cn/%s/follow' % user_id
        response = requests.get(url, cookies=cookie, headers=headers).content
        selector = etree.HTML(response)
        # total number of follow-list pages
        pageNum = int(selector.xpath('//input[@name="mp"]')[0].attrib['value'])

        # same 5-round pattern as in _save_user_data
        times = 5
        one_step = pageNum / times
        for step in range(times):
            if step < times - 1:
                i = step * one_step + 1
                j = (step + 1) * one_step + 1
            else:
                i = step * one_step + 1
                j = pageNum + 1
            for page in range(i, j):
                try:
                    url = 'https://weibo.cn/%s/follow?page=%d' % (user_id, page)
                    lxml = requests.get(url, cookies=cookie, headers=headers).content
                    selector = etree.HTML(lxml)
                    # in each row of the follow table, the first link points to the followed user
                    content = selector.xpath('/html/body/table/tr/td[2]/a[1]')
                    for c in content:
                        temp_url = c.attrib['href']
                        if temp_url is not None:
                            new_urls.add(temp_url)
                    print 'follow page', page, 'fetched'
                    sys.stdout.flush()

                except:
                    print page, 'error'
                print page, 'sleep'
                sys.stdout.flush()
                time.sleep(3)
            print 'round', step + 1, 'finished, pausing before the next round'
            time.sleep(10)

        return new_urls

    def parse(self, url):

        if url is None:
            return
        print "saving"
        self._save_user_data(url)
        print "getting"
        new_urls = self._get_new_urls(url)
        print "finishing"
        return new_urls


The main logic consists of two parts: saving the current user's posts, and collecting the users he or she follows as new crawl targets.

1. Saving the posts:

The download part is straightforward: requests fetches the pages while the cookie and headers make the request look like it comes from a logged-in browser. The post pages are downloaded in 5 rounds, and the page range for each round is computed from the total page count; for example, with pageNum = 23 the rounds cover pages 1-4, 5-8, 9-12, 13-16 and 17-23. Two things deserve attention:

1) Sleep intervals. Weibo's anti-crawling measures are quite sensitive: request too frequently and the cookie is invalidated quickly, after which you get 403 responses, so the sleep intervals matter. Setting them to 60 and 300 seconds (one minute after every page, five minutes after every round) keeps the cookie alive, but it is very slow; some users have more than 900 pages of posts, which then takes a very long time. I have not solved this trade-off yet.
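
One possible mitigation, not used in the original code, is to randomize the pause and back off only when a request actually fails; a rough sketch (polite_get is a hypothetical helper, assuming the same cookie/headers setup as above):

import random
import time

import requests


def polite_get(url, cookies, headers, base_sleep=3, max_retries=3):
    # fetch one page, sleeping a randomized interval on success
    # and backing off exponentially on failure
    for attempt in range(max_retries):
        response = requests.get(url, cookies=cookies, headers=headers)
        if response.status_code == 200:
            # a random pause makes the request pattern look less mechanical
            time.sleep(base_sleep + random.uniform(0, 2))
            return response.content
        # 403 or other error: wait longer before the next attempt
        time.sleep(base_sleep * (2 ** (attempt + 1)))
    return None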

2) XPath copied from Chrome. The XPath you get from Chrome's F12 developer tools describes Chrome's own normalized DOM, which can lead to empty matches when you apply it with lxml. For example:

 content = selector.xpath('/html/body/table/tr/td[2]/a[1]')

Chrome copies this path as ../table/tbody/tr/..., but tbody does not exist in the raw HTML, so nothing matches; the tbody step has to be removed, as in the expression above. For more on XPath, see the runoob (菜鸟教程) tutorial.
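
To make the difference concrete (the first expression is what Chrome's "Copy XPath" produces, the second is the tbody-free version that works against the raw weibo.cn HTML):

# as copied from Chrome -- tbody is inserted by the browser and is not in the raw HTML,
# so this returns an empty list:
content = selector.xpath('/html/body/table/tbody/tr/td[2]/a[1]')

# with tbody removed, the follow-list links are matched:
content = selector.xpath('/html/body/table/tr/td[2]/a[1]')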


A final reminder: if you get 403 responses, see "page error" printed while crawling, or hit an index-out-of-range error when reading pageNum (the XPath returns an empty list, so indexing it fails), it usually means your cookie has expired or Sina has temporarily blocked your IP.
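
A defensive check along these lines makes that failure explicit instead of crashing on the empty list; a sketch of what could be added before reading pageNum (not in the original code, and it assumes url, cookie and headers are already set up as above):

response = requests.get(url, cookies=cookie, headers=headers)
if response.status_code == 403:
    # cookie expired or IP temporarily blocked
    print 'got 403, refresh your cookie or wait for a while'
else:
    selector = etree.HTML(response.content)
    mp = selector.xpath('//input[@name="mp"]')
    if not mp:
        # the paging input is missing, usually a login or anti-crawler page
        print 'paging input not found, the cookie is probably invalid'
    else:
        pageNum = int(mp[0].attrib['value'])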


References:

1. The imooc web crawler course

2. "Crawling a given user's Weibo images and posts with Python, classifying the posts and analysing usage habits, with visualised charts"

Many thanks to both.


Download the source code of this post on GitHub.

P.S. Corrections and suggestions are welcome.



