Test file: test.html
<div class="right-content">
<ul class="news-1" data-sudaclick="news_p">
<li><a href="https://news.sina.com.cn/c/2019-12-30/doc-iihnzhfz9312920.shtml" target="_blank">星的光点点洒于午夜 人人开开心心说说故事</a></li>
<li><a href="https://news.sina.com.cn/c/2019-12-31/doc-iihnzhfz9358830.shtml" target="_blank">偏偏今宵所想讲不太易 迟疑地望你想说又复迟疑</a></li>
<li><a href="https://news.sina.com.cn/c/2019-12-30/doc-iihnzhfz9330965.shtml" target="_blank">秋风将涌起的某夜 遗留她的窗边有个故事</a></li>
<li><a href="https://news.sina.com.cn/c/2019-12-31/doc-iihnzhfz9360933.shtml" target="_blank">孤单单的小伙子不顾寂寞 徘徊树下直至天际露月儿</a></li>
</ul>
</div>
XPath (XML Path Language) is a query language for selecting nodes in XML and HTML documents.
Official specification: https://www.w3.org/TR/xpath/all/
Common syntax:
Expression | Description |
---|---|
nodename | Selects all child nodes of the named node |
/ | Selects direct children of the current node (or starts the path from the root) |
// | Selects descendant nodes at any depth |
. | Selects the current node |
.. | Selects the parent of the current node |
@ | Selects attributes |
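As a quick check of the expressions above, here is a minimal sketch run against an inline snippet (the markup and URL are illustrative, not taken from the page):

```python
from lxml import etree

# Inline snippet mirroring the structure of test.html (illustrative)
snippet = '<ul class="news-1"><li><a href="https://example.com/a">first</a></li></ul>'
html = etree.HTML(snippet)

print(html.xpath('//a/@href'))        # @ selects an attribute
print(html.xpath('//a/..')[0].tag)    # .. steps up to the parent <li>
print(html.xpath('//ul[@class="news-1"]/li/a/text()'))
```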
import requests
from lxml import etree
res = requests.get('https://news.sina.com.cn/china/')
res.encoding = 'utf-8'
html = etree.HTML(res.text)
html.xpath('//ul[@class="news-1"]/li/a/text()')
from lxml import etree
# test.html is saved as UTF-8, so the text parses without mojibake
html = etree.parse('./test.html', etree.HTMLParser())
html.xpath('//ul[@class="news-1"]/li/a/text()')
Beautiful Soup is a Python library for parsing HTML and XML.
Website: https://www.crummy.com/software/BeautifulSoup/
Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Supported parsers:
Parser | Usage |
---|---|
Python standard library | BeautifulSoup(markup, "html.parser") |
lxml HTML parser | BeautifulSoup(markup, "lxml") |
lxml XML parser | BeautifulSoup(markup, "xml") |
html5lib | BeautifulSoup(markup, "html5lib") |
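The parser name is simply the second argument to the constructor. A minimal sketch using the standard-library parser, which needs no extra install (the markup here is illustrative):

```python
from bs4 import BeautifulSoup

# Illustrative markup; "html.parser" ships with the standard library
markup = '<ul class="news-1"><li><a href="https://example.com/a">first</a></li></ul>'
soup = BeautifulSoup(markup, 'html.parser')
print(soup.a['href'])   # attribute access on the first <a>
print(soup.a.string)    # its text content
```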
import requests
from bs4 import BeautifulSoup
res = requests.get('https://news.sina.com.cn/china/')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'lxml')
soup
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('./test.html', 'r', encoding='utf-8'), 'lxml')
soup
soup.find_all(name="ul", attrs={"class": "news-1"})[0].find_all(name='li')
Output:
[<li><a href="https://news.sina.com.cn/c/2019-12-30/doc-iihnzhfz9312920.shtml" target="_blank">星的光点点洒于午夜 人人开开心心说说故事</a></li>,
<li><a href="https://news.sina.com.cn/c/2019-12-31/doc-iihnzhfz9358830.shtml" target="_blank">偏偏今宵所想讲不太易 迟疑地望你想说又复迟疑</a></li>,
<li><a href="https://news.sina.com.cn/c/2019-12-30/doc-iihnzhfz9330965.shtml" target="_blank">秋风将涌起的某夜 遗留她的窗边有个故事</a></li>,
<li><a href="https://news.sina.com.cn/c/2019-12-31/doc-iihnzhfz9360933.shtml" target="_blank">孤单单的小伙子不顾寂寞 徘徊树下直至天际露月儿</a></li>]
soup.find_all(name="ul", attrs={"class": "news-1"})[0].find_all(name='li')[2].a.string
Output:
'秋风将涌起的某夜 遗留她的窗边有个故事'
soup.find_all(name="ul", attrs={"class": "news-1"})[0].find_all(name='li')[2].a['href']
Output:
'https://news.sina.com.cn/c/2019-12-30/doc-iihnzhfz9330965.shtml'
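Besides find_all, Beautiful Soup also supports CSS selectors through select(). A small sketch on an inline snippet (illustrative markup):

```python
from bs4 import BeautifulSoup

markup = ('<ul class="news-1">'
          '<li><a href="https://example.com/1">one</a></li>'
          '<li><a href="https://example.com/2">two</a></li></ul>')
soup = BeautifulSoup(markup, 'html.parser')
# CSS selector: <a> elements under ul.news-1
links = soup.select('ul.news-1 li a')
print([a.get_text() for a in links])
print(links[1]['href'])
```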
pyquery is a jQuery-like Python library that uses CSS selectors; compare it with BeautifulSoup's soup.select().
Documentation: https://pythonhosted.org/pyquery/
import requests
from pyquery import PyQuery as pq
#res = requests.get('https://news.sina.com.cn/china/')
#res.encoding = 'utf-8'
#doc = pq(res.text)
doc = pq(url='https://news.sina.com.cn/china/', encoding='utf-8')
doc
import requests
from pyquery import PyQuery as pq
#doc = pq(filename="./test.html", encoding='utf-8')  # raises an encoding error
doc = pq(open('./test.html', 'r', encoding='utf-8').read())
doc
next(doc('.news-1').find('a').items()).text()
Output:
'星的光点点洒于午夜 人人开开心心说说故事'
next(doc('.news-1').find('a').items()).attr.href
Output:
'https://news.sina.com.cn/c/2019-12-30/doc-iihnzhfz9312920.shtml'
import re
# re.search(pattern, string, flags) returns the first match or None
# (the markup below is illustrative)
html = '<li><a href="https://example.com/1" target="_blank">first item</a></li>'
result = re.search(r'<a href="(.*?)".*?>(.*?)</a>', html, re.S)
if result:
    print(result.group(1), result.group(2))
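re.search stops at the first match; to pull out every link in one call, re.findall with the same kind of pattern is a common sketch (the markup here is illustrative):

```python
import re

html = ('<li><a href="https://example.com/1">one</a></li>'
        '<li><a href="https://example.com/2">two</a></li>')
# Non-greedy groups capture (href, link text) for every <a>
pairs = re.findall(r'<a href="(.*?)">(.*?)</a>', html, re.S)
print(pairs)
```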