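BeautifulSoup4: the beautifulsoup library parses, traverses, and maintains the "tag tree" of an HTML or XML document. BeautifulSoup4 does not strictly depend on the lxml library, since it can use Python's built-in html.parser; install lxml first only if you plan to use the lxml parsers listed below. Installation follows the same steps as for the requests library (pip install beautifulsoup4).
Usage: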
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>', 'html.parser')
# Test
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
r.text
demo = r.text
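soup = BeautifulSoup(demo, "html.parser")  # parse demo as HTML
soup2 = BeautifulSoup(open("D://demo.html"), "html.parser")  # alternatively, parse a local file (the path must exist)
print(soup.prettify())  # format the Beautiful Soup document tree as a Unicode string, with each XML/HTML tag on its own line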
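The test above needs network access. As a minimal offline sketch of the same steps, the snippet below parses a made-up HTML string (sample_html is an assumption for illustration, not the contents of demo.html) and pretty-prints it:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
sample_html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p class='intro'>Hello</p></body></html>"
)

# Build the tag tree with Python's built-in HTML parser.
soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns a str in which each tag sits on its own line.
formatted = soup.prettify()
print(formatted)

# Individual tags are reachable as attributes of the soup object.
title_text = soup.title.string
print(title_text)
```

Working from a string like this makes it easier to experiment with the API before pointing the code at a live page.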
Basic parsers:
bs4's HTML parser: BeautifulSoup(mk, 'html.parser') (requires bs4)
lxml's HTML parser: BeautifulSoup(mk, 'lxml') (requires lxml)
lxml's XML parser: BeautifulSoup(mk, 'xml') (requires lxml)
html5lib's parser: BeautifulSoup(mk, 'html5lib') (requires html5lib)
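Since only html.parser is guaranteed to be present, one possible approach is to try the optional parsers first and fall back. The helper make_soup below is a hypothetical name introduced for this sketch, not part of bs4; FeatureNotFound is the exception bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: try each parser name in order.

    Returns a (soup, parser_name) pair for the first parser that is installed.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, used_parser = make_soup("<p>hi</p>")
print(used_parser)
print(soup.p.string)
```

Different parsers can build slightly different trees from broken markup, so it may be safer to pin one parser explicitly once the choice is made.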
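Basic elements:
Tag: a tag, the basic unit of the document, opened with <...> and closed with </...>,
Name: the tag's name, i.e. the text inside <>, accessed as tag.name,
Attributes: the tag's attributes, accessed as tag.attrs (a dictionary),
NavigableString: the string between a tag's opening and closing markers, accessed as tag.string,
Comment: the comment portion of a string inside a tag, a special kind of NavigableString; it can be swapped for another node with comment.replace_with(cdata)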
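The element types listed above can be checked offline with a small sketch. The markup string is a made-up sample; the point is that .string returns a Comment for the b tag and a plain NavigableString for the p tag, so a type check is needed to tell them apart.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample markup: one comment and one ordinary string.
markup = (
    '<b><!--This is a comment--></b>'
    '<p class="note" id="p1">This is not a comment</p>'
)
soup = BeautifulSoup(markup, "html.parser")

tag = soup.p
print(isinstance(tag, Tag))   # True
print(tag.name)               # p
print(tag.attrs)              # class is multi-valued, so it is stored as a list

comment_node = soup.b.string  # the comment inside <b>
text_node = soup.p.string     # the ordinary text inside <p>

# Comment is a subclass of NavigableString, so test for Comment explicitly.
is_comment = isinstance(comment_node, Comment)
is_plain_text = isinstance(text_node, NavigableString) and not isinstance(text_node, Comment)
print(is_comment, is_plain_text)
```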
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
r.text
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
#print(soup.title)
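tag = soup.a  # get the a tag; this returns only the first a tag in the document
aptag = soup.a.parent.name  # name of the first a tag's parent tag
ta = tag.attrs  # the tag's attributes, stored as a dictionary
tac = tag.attrs['class']  # value of the class attribute
tah = tag.attrs['href']  # the link stored in the href attribute
tat = type(tag.attrs)  # type of the attributes object (dict)
tta = type(tag)  # type of the tag (bs4.element.Tag)
tcont = tag.string  # content between the a tags, i.e. the string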
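# newsoup.b.string has the Comment type and holds "This is a comment"; newsoup.p.string is an ordinary NavigableString
newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")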
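Downward traversal of the tag tree:
.contents: a list of child nodes; all of the tag's children are stored in the list
.children: an iterator over the child nodes, similar to .contents, used to loop (for) over the children
.descendants: an iterator over all descendant nodes (children, grandchildren, and so on), used to loop (for) over them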
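The difference between children and descendants is easiest to see on a small tree. The sketch below uses a made-up markup string and keeps only Tag nodes, since strings are also returned as children and descendants.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample: body has two <p> children, and the first <p> contains a <b>.
markup = "<body><p>One <b>bold</b></p><p>Two</p></body>"
soup = BeautifulSoup(markup, "html.parser")

# .contents is a list, so len() works on it directly.
contents_len = len(soup.body.contents)

# .children only goes one level down.
child_names = [c.name for c in soup.body.children if isinstance(c, Tag)]

# .descendants walks the whole subtree in document order.
descendant_names = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(contents_len)      # 2
print(child_names)       # ['p', 'p']
print(descendant_names)  # ['p', 'b', 'p']
```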
Upward traversal of the tag tree:
.parent: the node's parent tag
.parents: an iterator over the node's ancestor tags, used to loop over the ancestors
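As a small sketch on made-up markup, the ancestor chain of a tag ends at the BeautifulSoup object itself, whose name is "[document]"; the iterator stops there and never yields None.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with a link nested three levels deep.
markup = "<html><body><div><a href='#'>link</a></div></body></html>"
soup = BeautifulSoup(markup, "html.parser")

# Collect the names of all ancestors, nearest first.
parent_names = [p.name for p in soup.a.parents]
print(parent_names)  # ['div', 'body', 'html', '[document]']

# The BeautifulSoup object is the root, so it has no parent.
root_parent = soup.parent
print(root_parent)   # None
```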
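Sibling traversal of the tag tree:
Note: sibling traversal only happens among nodes that share the same parent. A sibling may be a string (NavigableString) rather than a tag.
.next_sibling: the next sibling node in HTML document order
.previous_sibling: the previous sibling node in HTML document order
.next_siblings: an iterator (for) over all following sibling nodes in HTML document order
.previous_siblings: an iterator (for) over all preceding sibling nodes in HTML document order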
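Because siblings include the text between tags, the next sibling of a tag is often a string. The sketch below, on made-up markup, shows this and two ways to reach the next sibling that is actually a tag: filtering with isinstance, or bs4's find_next_sibling method.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample: two links separated by plain text inside one <p>.
markup = "<p><a id='first'>A</a> and <a id='second'>B</a>.</p>"
soup = BeautifulSoup(markup, "html.parser")

first = soup.find("a", id="first")

# The immediate next sibling is the text node " and ", not the second link.
next_node = first.next_sibling
print(repr(next_node))

# Keep only siblings that are tags.
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]

# find_next_sibling skips over strings and returns the next matching tag.
next_tag = first.find_next_sibling("a")
print(next_tag["id"])  # second

# The first link has nothing before it under the same parent.
prev_node = first.previous_sibling
print(prev_node)       # None
```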
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
r.text
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
# downward traversal
sh = soup.head  # the head tag
shc = soup.head.contents  # the children of the head tag
sbc = soup.body.contents  # the children of the body tag, as a list
sn = len(sbc)  # number of child nodes of body
# loop over the children of body
for child in soup.body.children:
    print(child)
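# upward traversal
stp = soup.title.parent  # the parent tag of title
shp = soup.html.parent  # html is the top-level tag; its parent is the BeautifulSoup object, i.e. the whole document
sop = soup.parent  # the parent of soup is None
# loop upward through the ancestors
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)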
# sibling traversal
sans = soup.a.next_sibling  # the next sibling node of the a tag
sanbs = soup.a.next_sibling.next_sibling
saps = soup.a.previous_sibling  # the previous sibling node of the a tag
sapspa = soup.a.previous_sibling.previous_sibling  # None
# loop over preceding and following siblings
for sibling in soup.a.previous_siblings:
    print(sibling)
for sibling in soup.a.next_siblings:
    print(sibling)
# HTML formatting and encoding with the bs4 library
soup = BeautifulSoup("<p>中文</p>", "html.parser")
sps = soup.p.string
print(soup.p.prettify())
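To make the encoding behaviour concrete, here is a minimal sketch using the same Chinese sample string. It assumes nothing beyond bs4 itself: prettify() returns a Unicode str, while Tag.encode() produces bytes in the requested encoding.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>中文</p>", "html.parser")

# Inside the tree, text is always held as Unicode.
text = soup.p.string
print(text)

# prettify() gives a formatted str, still Unicode.
pretty = soup.p.prettify()
print(pretty)

# encode() converts the tag to bytes; UTF-8 is requested explicitly here.
encoded = soup.p.encode("utf-8")
print(encoded)
```

Keeping everything as str until the final write, and encoding only at that point, tends to avoid mixed-encoding problems.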