Python in Practice: Scraping 古诗文网 (gushiwen.org) with Regular Expressions

Crawl target

[Figure 1: screenshot of the target page]
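The fields to extract are:

  • title: the title of the poem
  • dynasty: the dynasty it dates from
  • author: the author
  • content: the body text
  • tag: the tags attached to the poem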
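
As a rough sketch, each scraped poem can be thought of as a record with these five string fields. The `Poem` type and the `make_poem` helper below are hypothetical names used only for illustration; the script itself works with plain dicts.

```python
from typing import TypedDict


class Poem(TypedDict):
    """One scraped poem (hypothetical type, for illustration only)."""
    title: str
    dynasty: str
    author: str
    content: str
    tag: str


def make_poem(title: str, dynasty: str, author: str, content: str, tag: str) -> Poem:
    # Hypothetical helper: bundle the five fields into one record.
    return {
        'title': title,
        'dynasty': dynasty,
        'author': author,
        'content': content,
        'tag': tag,
    }
```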

Implementation

'''
@Description: Scrape gushiwen.org with regular expressions
@Author: sikaozhifu
@Date: 2020-06-09 14:55:44
@LastEditTime: 2020-06-09 15:55:47
@LastEditors: Please set LastEditors
'''
import requests
import re
from lxml import etree
poems = []


def parse_url(url):  # Fetch the URL and parse the returned document
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    text = response.text
    # NOTE: the HTML tags inside the four patterns below are assumptions about
    # the page markup (div.cont, p.source, div.contson); adjust them to the live page.
    titles = re.findall(r'<div\sclass="cont">.*?<b>(.*?)</b>', text, re.DOTALL)
    dynasties = re.findall(r'<p class="source">.*?<a.*?>(.*?)</a>', text, re.DOTALL)
    authors = re.findall(r'<p class="source">.*?<a.*?>.*?<a.*?>(.*?)</a>', text, re.DOTALL)
    contents_temp = re.findall(r'<div class="contson".*?>(.*?)</div>', text, re.DOTALL)
    contents = []
    for content in contents_temp:
        # Strip leftover inline tags (e.g. <br />) from the poem body
        contents.append(re.sub('<.*?>', '', content).strip())
    # Parse with lxml
    html = etree.HTML(text)
    tag_element = html.xpath('//div[@class = "tag"]')
    tags = []
    for tag_temp in tag_element:
        tag = tag_temp.xpath('string(.)')
        # Remove spaces, carriage returns and newlines from the tag field
        tag = tag.replace('\n', '').replace('\r', '').replace(' ', '')
        tags.append(tag)
    for value in list(zip(titles, dynasties, authors, contents, tags)):
        title, dynasty, author, content, tag = value
        poem = {
            'title': title,
            'dynasty': dynasty,
            'author': author,
            'content': content,
            'tag': tag
        }
        poems.append(poem)


def main():
    for x in range(1, 11):  # 10 pages in total
        url = 'https://www.gushiwen.org/default_%s.aspx' % x
        parse_url(url)
    print(poems)


if __name__ == "__main__":
    main()
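
To see how the regex-based extraction behaves without hitting the network, here is a minimal sketch that runs on an inline sample. The `SAMPLE` markup and the `extract` function are hypothetical; the tag and class names are assumptions about the page structure and may not match the live site.

```python
import re

# Hypothetical sample markup; the real page structure may differ.
SAMPLE = '''
<div class="cont">
  <p><a href="#"><b>静夜思</b></a></p>
  <p class="source"><a href="#">唐代</a><span>:</span><a href="#">李白</a></p>
  <div class="contson" id="c1">
    床前明月光，<br />疑是地上霜。
  </div>
</div>
'''


def extract(text):
    # re.DOTALL lets "." match newlines, so one pattern can span several lines.
    # The non-greedy ".*?" stops at the first closing tag that can complete the match.
    titles = re.findall(r'<div\sclass="cont">.*?<b>(.*?)</b>', text, re.DOTALL)
    dynasties = re.findall(r'<p class="source">.*?<a.*?>(.*?)</a>', text, re.DOTALL)
    authors = re.findall(r'<p class="source">.*?<a.*?>.*?<a.*?>(.*?)</a>', text, re.DOTALL)
    raw = re.findall(r'<div class="contson".*?>(.*?)</div>', text, re.DOTALL)
    # Drop inline tags such as <br /> and trim surrounding whitespace.
    contents = [re.sub(r'<.*?>', '', c).strip() for c in raw]
    return list(zip(titles, dynasties, authors, contents))


print(extract(SAMPLE))
```

One caveat with this design: the fields are matched independently and then paired by position. If one pattern misses an entry on a page, the later values in that list shift and get paired with the wrong poem, and `zip` silently truncates to the shortest list.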

Results

[Figure 2: screenshot of the script's output]
