Compared with the Toutiao (今日头条) app, Qutoutiao (趣头条) is probably far less familiar to most readers. Qutoutiao is a news app that attracts users with "rewarded reading", and it has a very large user base.
Qutoutiao's home page is much like that of other content apps: it consists of a list page (sample URL) and a detail page (sample URL).
First, define the list-page URLs to be crawled:
# pieces of the list-page URL: base + cid, paging, and fixed query parameters
base_url = 'http://api.1sapp.com/content/outList?cid='
mid_url = '&tn=1&page='
end_url = '&limit=10&user=temporary1534345404402&show_time=&min_time=&content_type=1&dtu=200'
# equivalent single template, with cid and min_time left as placeholders
api_url = 'http://api.1sapp.com/content/outList?cid={}&tn=1&page=1&limit=2&user=temporary1534345404402&show_time=&min_time={}&content_type=1&dtu=200'
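The template form is filled with a cid and a min_time; the values below are illustrative examples, not the project's defaults:

# page 1 of category 10 (财经) with no min_time floor -- example values only
url = api_url.format('10', '')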
The list-page URL carries quite a few parameters; this article mainly works with the ones below. The home page shows several category banners, each identified in the request by a cid value (a request sketch follows the dict):
cate_info_dict = {
# '6': '娱乐',
# '255': '推荐',
# '1': '热点',
# '42': '健康',
# '5': '养生',
# '4': '励志',
# '7': '科技',
# '8': '生活',
'10': '财经',
# '9': '汽车',
# '18': '星座',
# '12': '美食',
# '14': '时尚',
# '16': '旅行',
# '17': '育儿',
# '13': '体育',
# '15': '军事',
# '23': '历史',
# '30': '收藏',
# '19': '游戏',
# '28': '国际',
# '40': '新时代',
# '50': '房产',
# '51': '家居',
}
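With the pieces above, the list pages for each enabled category can be fetched by concatenating the URL fragments per cid and page. A minimal sketch using requests (the helper name, paging depth, and the assumption that the endpoint answers plain GETs are mine, not the project's exact code):

import requests

def iter_list_pages(cid, max_pages=5):
    """Yield the parsed JSON of successive list pages for one category."""
    for page in range(1, max_pages + 1):
        url = '{}{}{}{}{}'.format(base_url, cid, mid_url, page, end_url)
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        yield resp.json()

for cid, cate_name in cate_info_dict.items():
    for page_json in iter_list_pages(cid):
        pass  # extract the per-article fields shown below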
The key fields taken from the list page are shown below; they mainly cover each article's detail-page URL and the corresponding engagement statistics:
news_url = news['detail_url'] # detail url contains main part of content using json format
news_stat_info = {
'read_cnt': news.get('read_count', 0),
'share_cnt': news.get('share_count', 0),
'comment_cnt': news.get('comment_count', 0),
'people_comment_cnt': news.get('people_comment_count', 0),
'member_id': news.get('member_id', 'UNK'),
'follow_num': news.get('follow_num', 0),
'follow_num_show': news.get('follow_num_show', 0),
'publish_time': news.get('publish_time', -1)
}
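For context, each news dict above is one entry of a list-page response. How the payload nests its items is my assumption (APIs like this commonly wrap them under a data key), so the project's actual parsing may differ:

# page_json: one page of list-API JSON, as fetched above
for news in page_json.get('data', {}).get('data', []):  # nesting is an assumption
    news_url = news['detail_url']  # JSON endpoint carrying the article body
    # build news_stat_info from the counters as shown above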
The key fields taken from the detail page are as follows:
item = {
"news_id": news_brief_json.get('id',''),
"title": news_brief_json.get('title', ''),
"source": news_brief_json.get('source', ''),
"url": news_brief_json.get('url', ''),
"create_time": news_brief_json.get('createTime', ''),
"publish_info": news_brief_json.get('publish_info', ''),
"detail": news_brief_json.get('detail', ''),
"keywords": news_brief_json.get('keywords', ''),
"description": news_brief_json.get('description', ''),
"source_site": news_brief_json.get('sourceSite', '') ,
"is_origin": news_brief_json.get('isOrigin', -1) ,
"need_statement": news_brief_json.get('needStatement', ''),
"source_name": news_brief_json.get('sourceName', '') ,
"authorid": news_brief_json.get('authorid', ''),
"share_cnt": news_stat_info.get('share_cnt', -1),
"people_comment_cnt": news_stat_info.get("people_comment_cnt", -1),
"follow_num_show": news_stat_info.get("follow_num_show", -1),
"read_cnt": news_stat_info.get("read_cnt", -1),
"comment_cnt": news_stat_info.get("comment_cnt", -1),
"member_id": news_stat_info.get("member_id", -1),
"follow_num": news_stat_info.get("follow_num", -1),
"publish_time": news_stat_info.get("publish_time", ''),
"is_clean": 0,
"ctime": time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()),
"mtime": time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
}
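Since detail_url serves the article body as JSON (per the comment earlier), news_brief_json can be obtained with a plain GET. A sketch with error handling omitted (note that the item dict also needs import time for its ctime/mtime stamps):

import requests

resp = requests.get(news_url, timeout=10)
resp.raise_for_status()
news_brief_json = resp.json()
# merge with news_stat_info to build the item shown above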
The project also supports incremental crawling, i.e. fetching only the newest articles. It first looks up the publish_time of the most recent article already stored in the database (code below); during crawling, as soon as an article's publish_time falls below that value, the spider stops.
import pymongo
from scrapy.utils.project import get_project_settings

def get_newest_by_publish_time():
    """Return (news_id, publish_time) of the most recently published stored article."""
    settings = get_project_settings()
    conn = pymongo.MongoClient(host=settings.get('MONGO_HOST'), port=settings.get('MONGO_PORT'))
    news_info_cur = conn.qutoutiao_db.news_brief_collect.find().sort('publish_time', pymongo.DESCENDING).limit(1)
    try:
        return news_info_cur[0].get('news_id', ''), news_info_cur[0].get('publish_time', '')
    except IndexError:
        # empty collection: no cutoff yet, crawl everything
        return 0, ''
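One way to apply that cutoff inside a Scrapy callback is sketched below; CloseSpider is Scrapy's standard mechanism for halting a run, but the surrounding loop is illustrative rather than the project's exact code:

from scrapy.exceptions import CloseSpider

newest_id, newest_time = get_newest_by_publish_time()

# inside the spider's parse callback:
# for news in page_items:
#     if newest_time != '' and news.get('publish_time', 0) < newest_time:
#         raise CloseSpider('reached already-crawled news')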
Other project information:
Talk is cheap, show me the code. Rather than rambling on any further, it is more practical to read the code directly. Repository:
https://github.com/yscoder-github/news-spider