Scraper practice: crawling the hot questions and answers from Zhihu's Explore page without logging in

Fetch the page source with requests, then parse it with the lxml parsing library.
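As a minimal offline sketch of the same requests + lxml workflow, the inline HTML string below stands in for `response.text`; the `id` and structure are made up for illustration, not taken from Zhihu's actual markup:

```python
from lxml import etree

# Stand-in for the HTML that requests would return in response.text
html_doc = '<div id="feed"><h2><a>Question 1</a></h2><h2><a>Question 2</a></h2></div>'

tree = etree.HTML(html_doc, etree.HTMLParser())
# text() at the end of an XPath returns the matched text nodes as a list
titles = tree.xpath('//*[@id="feed"]/h2/a/text()')
print(titles)  # ['Question 1', 'Question 2']
```

The same pattern applies to the real page: swap `html_doc` for the downloaded source and adjust the XPath to the live DOM.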

import os
import requests
from lxml import etree

# Crawl the hot questions on Zhihu's Explore page without logging in

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
                  ' AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
}

session = requests.Session()
# A timeout keeps the request from hanging indefinitely if the server stalls
response = session.get("https://www.zhihu.com/explore", headers=headers, timeout=10)
html = etree.HTML(response.text, etree.HTMLParser())
titles = html.xpath('//*[@id="js-explore-tab"]/div[1]/div/div/h2/a/text()')
answers = html.xpath('//*[@id="js-explore-tab"]/div[1]/div/div/div/div[4]/textarea[@class="content"]/text()')

print(type(titles[0]))
# The type is lxml.etree._ElementUnicodeResult, which is a subclass of str
# In Python 3, a string (str) is text stored under the Unicode rules,
# while data stored in a concrete encoding such as UTF-8 or ASCII is of type bytes
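The str/bytes split described above can be checked directly with a small standalone sketch, independent of lxml:

```python
# In Python 3, str is Unicode text; bytes is encoded binary data
text = "知乎"
data = text.encode("utf-8")  # str -> bytes under the UTF-8 encoding

print(type(text))  # <class 'str'>
print(type(data))  # <class 'bytes'>
assert data.decode("utf-8") == text  # decoding round-trips back to the same str
```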

with open('zhihu.txt', 'w', encoding='utf-8') as file:
    # zip truncates to the shorter list if titles and answers differ in length
    for title, answer in zip(titles, answers):
        file.write(title + answer)
        file.write('\n' + '*' * 50)
print("end")
