Scraping the Weibo Hot Search List
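I am new to web scraping and this is a practice project. My skills are limited, so treat it as a reference only; feedback is welcome.
The code is built on the requests and lxml packages, and it also records the time at which the hot searches were scraped.
I have not yet worked out how to write the pinned hot search entry to the txt file.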
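One way the pinned entry might be handled, as a rough sketch rather than a tested solution: instead of indexing `[0]` directly, fall back to a default whenever an XPath query returns an empty list. This assumes the pinned row still has a title but no rank text and no heat count, which has not been checked against the live page. The helper names `first_or` and `format_row` and the label `'top'` are made up for illustration.

```python
def first_or(results, default=''):
    """Return the first XPath result, stripped, or default if the list is empty."""
    return results[0].strip() if results else default


def format_row(rank_texts, title_texts, hot_texts):
    """Build one output line from the three lists returned by row.xpath(...)."""
    rank = first_or(rank_texts, 'top')  # assumption: the pinned row has no rank text
    title = first_or(title_texts)
    hot = first_or(hot_texts)           # assumption: the pinned row has no heat count
    return rank + '.' + title + '\t' + hot + '\n'


# Stand-ins for what row.xpath(...) might return; this is not real scraped data.
pinned_line = format_row([], ['example pinned topic'], [])
ranked_line = format_row(['1'], ['example topic'], ['12345'])
print(repr(pinned_line))
print(repr(ranked_line))
```

In the script, the three arguments would be the results of the three `xpath(...)` calls on each row, and the loop could then start at the first row instead of the second.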
import requests
from lxml import etree
import datetime

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/84.0.4147.125 Safari/537.36 Edg/84.0.522.59'
    }
    url = 'https://s.weibo.com/top/summary?Refer=top_hot&topnav=1&wvr=6'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # One <tr> per hot search entry; rows[0] is the pinned entry.
    # Named rows rather than list so the built-in list is not shadowed.
    rows = tree.xpath('//div[@class="data"]//tbody/tr')
    # %Y-%m-%d is used instead of %F, which is not available on every platform.
    now_time = datetime.datetime.now().strftime('%Y-%m-%d %A %H:%M:%S') + '\n'

    # A clumsy solution that cannot output the pinned hot search.
    # Take len(rows) and subtract 1, so that range(len(rows) - 1) combined
    # with rows[i + 1] never runs past the last index.
    # rows is indexed from 0, so rows[i + 1] starts at rows[1] and skips
    # the pinned entry at rows[0].
    # This does the same as
    #     for row in rows:
    #         rank = row.xpath('./td[@class="td-01 ranktop"]/text()')[0]
    #         data = row.xpath('./td[@class="td-02"]/a/text()')[0]
    #         hot = row.xpath('./td[@class="td-02"]/span/text()')[0]
    # except that it skips the pinned entry.
    # When writing inside a loop, the file has to be opened outside the
    # loop; otherwise each pass overwrites the previous one.
    with open('微博热搜.txt', 'w', encoding='utf-8') as fp:
        fp.write(now_time)
        for i in range(len(rows) - 1):
            row = rows[i + 1]
            rank = row.xpath('./td[@class="td-01 ranktop"]/text()')[0]
            data = row.xpath('./td[@class="td-02"]/a/text()')[0]
            hot = row.xpath('./td[@class="td-02"]/span/text()')[0]
            hot_search = rank + '.' + data + '\t' + hot + '\n'
            print(hot_search)
            fp.write(hot_search)
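The note in the script about opening the file outside the loop can be checked with a small experiment. The sketch below writes the same made-up lines twice into a temporary directory: once reopening the file with mode `'w'` on every pass, and once opening it a single time before the loop. The file names and sample lines are placeholders, not scraped data.

```python
import os
import tempfile

# Made-up sample lines in the same shape the script writes.
lines = ['1.example A\t100\n', '2.example B\t90\n', '3.example C\t80\n']

tmp_dir = tempfile.mkdtemp()
inside_path = os.path.join(tmp_dir, 'opened_inside_loop.txt')
outside_path = os.path.join(tmp_dir, 'opened_once.txt')

# Reopening with mode 'w' on every pass truncates the file each time.
for line in lines:
    with open(inside_path, 'w', encoding='utf-8') as fp:
        fp.write(line)

# Opening once before the loop keeps every line.
with open(outside_path, 'w', encoding='utf-8') as fp:
    for line in lines:
        fp.write(line)

with open(inside_path, encoding='utf-8') as fp:
    inside_content = fp.read()
with open(outside_path, encoding='utf-8') as fp:
    outside_content = fp.read()

print(repr(inside_content))   # only the last line survives
print(repr(outside_content))  # all three lines are kept
```

Opening with mode `'a'` inside the loop would also keep every line, but it would keep appending across separate runs of the script as well, which is probably not wanted for a snapshot file.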