Qutoutiao Crawler (Using the Finance Channel as an Example)

Compared with the "Jinri Toutiao" (今日头条) app, "Qutoutiao" (趣头条) is probably much less familiar to most people. As a news app that attracts readers by rewarding them for reading, Qutoutiao has a very large user base.

The Qutoutiao homepage is shown below. Like other content apps, it is built around list pages (sample URL) and detail pages (sample URL).

[Figure 1: Qutoutiao homepage]

First, define the list-page URLs to be crawled:

# Fragments for composing a list-page URL: base + cid + mid + page + end
base_url = 'http://api.1sapp.com/content/outList?cid='
mid_url = '&tn=1&page='
end_url = '&limit=10&user=temporary1534345404402&show_time=&min_time=&content_type=1&dtu=200'
# Template form of the same URL, with slots for cid and min_time
api_url = 'http://api.1sapp.com/content/outList?cid={}&tn=1&page=1&limit=2&user=temporary1534345404402&show_time=&min_time={}&content_type=1&dtu=200'
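
As a quick illustration, a minimal sketch using the variables just defined (the cid, page, and min_time values here are only examples):

# Example: assemble the list-page URL for the finance channel (cid=10), page 1
cid, page, min_time = '10', 1, ''

# Option 1: concatenate the three fragments
list_url = base_url + cid + mid_url + str(page) + end_url

# Option 2: fill the cid and min_time slots of the template
list_url = api_url.format(cid, min_time)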
  

The list-page URL carries quite a few parameters; this article mainly deals with the following two fields:

  • cid: category code, used to distinguish news categories
  • min_time: the minimum publish time of the requested news, needed when paging back through historical news

The Qutoutiao homepage contains multiple category banners, identified by cid in the request parameters. The main ones are:

    cate_info_dict = {
        # '6': '娱乐',    # Entertainment
        # '255': '推荐',  # Recommended
        # '1': '热点',    # Trending
        # '42': '健康',   # Health
        # '5': '养生',    # Wellness
        # '4': '励志',    # Inspirational
        # '7': '科技',    # Technology
        # '8': '生活',    # Lifestyle
        '10': '财经',    # Finance (the channel crawled in this article)
        # '9': '汽车',    # Autos
        # '18': '星座',   # Horoscope
        # '12': '美食',   # Food
        # '14': '时尚',   # Fashion
        # '16': '旅行',   # Travel
        # '17': '育儿',   # Parenting
        # '13': '体育',   # Sports
        # '15': '军事',   # Military
        # '23': '历史',   # History
        # '30': '收藏',   # Collectibles
        # '19': '游戏',   # Gaming
        # '28': '国际',   # International
        # '40': '新时代', # New Era
        # '50': '房产',   # Real estate
        # '51': '家居',   # Home & living
    }
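
For a quick smoke test outside of Scrapy, the list API can be queried with requests, building on cate_info_dict and api_url defined above. This is only a sketch: the assumption that the article list sits under data['data'] in the response JSON should be verified against a real response.

import requests

for cid, cate_name in cate_info_dict.items():
    resp = requests.get(api_url.format(cid, ''), timeout=10)
    # Assumption: the article list is nested under resp.json()['data']['data'];
    # inspect an actual response and adjust the keys if needed.
    for news in resp.json().get('data', {}).get('data', []):
        print(cate_name, news.get('title', ''), news.get('detail_url', ''))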

 

The key fields extracted from a list page are shown below; they mainly consist of each news item's detail-page URL and the corresponding statistics:

news_url = news['detail_url']  # the detail_url points to the article's main content, returned as JSON
news_stat_info = {
    'read_cnt': news.get('read_count', 0),
    'share_cnt': news.get('share_count', 0),
    'comment_cnt': news.get('comment_count', 0),
    'people_comment_cnt': news.get('people_comment_count', 0),
    'member_id': news.get('member_id', 'UNK'),
    'follow_num': news.get('follow_num', 0),
    'follow_num_show': news.get('follow_num_show', 0),
    'publish_time': news.get('publish_time', -1)
}
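
Since the project is built on Scrapy, a natural way to carry these statistics over to the detail page is Request.meta. The sketch below shows that hand-off; the spider class, callback names, and the response nesting are assumptions for illustration, not the repository's actual code:

import json
import scrapy

class QttSpider(scrapy.Spider):
    name = 'qutoutiao'  # illustrative spider name

    def parse_list(self, response):
        """List-page callback: forward each item's stats to the detail page."""
        resp_json = json.loads(response.text)
        # Assumption: the article list is nested under ['data']['data']
        for news in resp_json.get('data', {}).get('data', []):
            news_url = news['detail_url']
            news_stat_info = {
                'read_cnt': news.get('read_count', 0),
                'share_cnt': news.get('share_count', 0),
                # ... the remaining stat fields listed above ...
                'publish_time': news.get('publish_time', -1),
            }
            # Hand the list-page statistics over to the detail-page callback
            yield scrapy.Request(news_url, callback=self.parse_detail,
                                 meta={'news_stat_info': news_stat_info})

    def parse_detail(self, response):
        # Retrieve the stats that were attached on the list page
        news_stat_info = response.meta['news_stat_info']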

The key fields extracted from the detail page are as follows:

import time

item = {
    "news_id": news_brief_json.get('id', ''),
    "title": news_brief_json.get('title', ''),
    "source": news_brief_json.get('source', ''),
    "url": news_brief_json.get('url', ''),
    "create_time": news_brief_json.get('createTime', ''),
    "publish_info": news_brief_json.get('publish_info', ''),
    "detail": news_brief_json.get('detail', ''),
    "keywords": news_brief_json.get('keywords', ''),
    "description": news_brief_json.get('description', ''),
    "source_site": news_brief_json.get('sourceSite', ''),
    "is_origin": news_brief_json.get('isOrigin', -1),
    "need_statement": news_brief_json.get('needStatement', ''),
    "source_name": news_brief_json.get('sourceName', ''),
    "authorid": news_brief_json.get('authorid', ''),
    # Statistics carried over from the list page (see news_stat_info above)
    "share_cnt": news_stat_info.get('share_cnt', -1),
    "people_comment_cnt": news_stat_info.get("people_comment_cnt", -1),
    "follow_num_show": news_stat_info.get("follow_num_show", -1),
    "read_cnt": news_stat_info.get("read_cnt", -1),
    "comment_cnt": news_stat_info.get("comment_cnt", -1),
    "member_id": news_stat_info.get("member_id", -1),
    "follow_num": news_stat_info.get("follow_num", -1),
    "publish_time": news_stat_info.get("publish_time", ''),
    "is_clean": 0,
    # Record creation / modification timestamps
    "ctime": time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()),
    "mtime": time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
}
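
To persist the item, a minimal Scrapy item pipeline writing to MongoDB could look like the sketch below. The database and collection names (qutoutiao_db.news_brief_collect) match the incremental-crawl code further down; the class name and the upsert-on-news_id strategy are assumptions:

import pymongo

class MongoPipeline(object):
    """Sketch: store each crawled item in qutoutiao_db.news_brief_collect."""

    def open_spider(self, spider):
        self.conn = pymongo.MongoClient(host=spider.settings.get('MONGO_HOST'),
                                        port=spider.settings.get('MONGO_PORT'))
        self.collection = self.conn.qutoutiao_db.news_brief_collect

    def process_item(self, item, spider):
        # Upsert on news_id so a re-crawled article updates in place
        self.collection.update_one({'news_id': item['news_id']},
                                   {'$set': dict(item)}, upsert=True)
        return item

    def close_spider(self, spider):
        self.conn.close()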

The project also supports incremental crawling, i.e. fetching only the newest news. First, look up the publish_time of the most recent news item in the database (code below); during crawling, stop as soon as a fetched news item's publish_time falls below that value.

import pymongo
from scrapy.utils.project import get_project_settings

def get_newest_by_publish_time():
    """Return (news_id, publish_time) of the most recently stored news item."""
    settings = get_project_settings()
    conn = pymongo.MongoClient(host=settings.get('MONGO_HOST'), port=settings.get('MONGO_PORT'))
    news_info_cur = conn.qutoutiao_db.news_brief_collect.find().sort('publish_time', pymongo.DESCENDING).limit(1)
    try:
        return news_info_cur[0].get('news_id', ''), news_info_cur[0].get('publish_time', '')
    except IndexError:
        # Empty collection: nothing has been crawled yet
        return 0, ''
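
In the spider, the returned boundary can then act as the stop condition while paging backwards through a channel. A minimal sketch, with illustrative names:

# At spider start-up (illustrative):
#   self.newest_news_id, self.newest_publish_time = get_newest_by_publish_time()

def is_already_crawled(news, newest_publish_time):
    """True once paging reaches news older than the newest stored item."""
    # Stop condition described above: publish_time below the stored boundary
    return newest_publish_time != '' and news.get('publish_time', 0) < newest_publish_time

Since this lookup sorts on publish_time, creating a descending index on that field with create_index keeps the query cheap as the collection grows.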

 

Other project details:

  • Storage: MongoDB
  • Crawler framework: Scrapy

 

Talk is cheap, show me the code. Rather than rambling on any further, it is more practical to read the code directly. Repository:

https://github.com/yscoder-github/news-spider

 

Edited on 2019-09-03
