A Simple Python Implementation for Scraping the Weibo and Zhihu Hot Lists


Only the lxml module needs to be installed; urllib is part of the Python 3 standard library.

Scraping the Zhihu hot list

import urllib.request
from lxml import etree

def zhihu():
    url = "https://www.zhihu.com/billboard"
    # Download the page and decode the response bytes as UTF-8
    with urllib.request.urlopen(url) as page:
        html = page.read().decode("utf-8")
    tree = etree.HTML(html)
    # Each hot-list title sits in an element with class "HotList-itemTitle"
    items = tree.xpath('//*[@class="HotList-itemTitle"]')
    text_list = []
    for index, item in enumerate(items, start=1):
        # Zero-pad the rank to two digits; item.text is None for empty elements
        text_list.append(f"{index:02d} : {item.text or ''}")
        print(text_list[-1])
    return text_list

zhihu()

Scraping the Weibo hot list

Only the XPath that etree uses to locate the elements differs; the rest of the code is essentially the same.

import urllib.request
from lxml import etree

def weibo():
    url = "https://s.weibo.com/top/summary"
    # Download the page and decode the response bytes as UTF-8
    with urllib.request.urlopen(url) as page:
        html = page.read().decode("utf-8")
    tree = etree.HTML(html)
    # Each hot-list title is the link inside a "td-02" cell of the ranking table
    items = tree.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr/td[@class="td-02"]/a')
    text_list = []
    for index, item in enumerate(items, start=1):
        # Zero-pad the rank to two digits; item.text is None for empty elements
        text_list.append(f"{index:02d} : {item.text or ''}")
        print(text_list[-1])
    return text_list

weibo()
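
Since the two functions differ only in the URL and the XPath, they could be folded into one helper. The following is a minimal sketch rather than part of the original code: extract_titles and format_ranked are hypothetical names, and SAMPLE_HTML is made-up markup that only mimics the Zhihu class name used above, so the example runs without network access.

```python
from lxml import etree


def extract_titles(html, xpath):
    """Return the stripped text of every element matched by xpath."""
    tree = etree.HTML(html)
    # item.text is None for elements without direct text, so fall back to ""
    return [(item.text or "").strip() for item in tree.xpath(xpath)]


def format_ranked(titles):
    """Prefix each title with a zero-padded, 1-based rank."""
    return [f"{rank:02d} : {title}" for rank, title in enumerate(titles, start=1)]


# Made-up sample markup (an assumption) that mimics the Zhihu class name
SAMPLE_HTML = """
<html><body>
  <div class="HotList-itemTitle">First topic</div>
  <div class="HotList-itemTitle">Second topic</div>
</body></html>
"""

ZHIHU_XPATH = '//*[@class="HotList-itemTitle"]'

if __name__ == "__main__":
    for line in format_ranked(extract_titles(SAMPLE_HTML, ZHIHU_XPATH)):
        print(line)
```

To use it against a live page, pass the decoded HTML from urllib.request.urlopen together with the XPath for that site.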

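Additional notes

1. Besides the titles, you can also scrape the heat value and the link of each entry. The approach is the same: select the elements with an etree XPath, read out the information, and store it in a list.
2. Note that the response bytes need to be decoded as UTF-8.
3. The list contents can also be written to a file with Python's built-in file I/O.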
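
The notes above can be sketched as follows. This is an illustration under stated assumptions, not a verified description of the live Weibo page: SAMPLE_HTML is made up, the heat value is assumed to sit in a span next to the link, and extract_entries, save_entries, and BASE_URL are hypothetical names.

```python
from urllib.parse import urljoin

from lxml import etree

# Assumed base for resolving relative links found in the table
BASE_URL = "https://s.weibo.com"

# Made-up sample markup; the <span> holding the heat value is an assumption
SAMPLE_HTML = """
<div id="pl_top_realtimehot"><table><tbody>
  <tr><td class="td-02"><a href="/weibo?q=a">Topic A</a><span>12345</span></td></tr>
  <tr><td class="td-02"><a href="/weibo?q=b">Topic B</a></td></tr>
</tbody></table></div>
"""


def extract_entries(html):
    """Collect the title, absolute link, and heat value of each row."""
    tree = etree.HTML(html)
    entries = []
    for cell in tree.xpath('//*[@id="pl_top_realtimehot"]//td[@class="td-02"]'):
        link = cell.find("a")
        if link is None:
            continue  # skip cells that contain no link
        heat = cell.findtext("span")  # None when the row has no heat value
        entries.append({
            "title": (link.text or "").strip(),
            "url": urljoin(BASE_URL, link.get("href", "")),
            "heat": (heat or "").strip(),
        })
    return entries


def save_entries(entries, path):
    """Write one tab-separated line per entry, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for rank, entry in enumerate(entries, start=1):
            f.write(f"{rank:02d}\t{entry['title']}\t{entry['heat']}\t{entry['url']}\n")
```

Calling save_entries(extract_entries(html), "hot.txt") writes the ranked entries to a text file; passing encoding="utf-8" to open keeps the Chinese titles intact regardless of the platform default.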
