Python Web Scraping Study Notes (BeautifulSoup4: upward, downward, and sideways traversal of the tag tree; HTML formatting)

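BeautifulSoup4: the beautifulsoup library parses, traverses, and maintains the "tag tree" of a document. BeautifulSoup4 does not strictly require lxml, because Python's built-in html.parser works out of the box. Install lxml first only if you want to use the 'lxml' or 'xml' parsers; installation works the same way as for the requests library.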

Usage:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>data</p>', 'html.parser')

 

# Test

import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")  # parse demo as HTML

soup2 = BeautifulSoup(open("D://demo.html"), "html.parser")  # parse a local file by passing an open file object
print(soup.prettify())  # format the Beautiful Soup document tree as a Unicode string, with each XML/HTML tag on its own line
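The local-file call above can be tried without a network connection or a file at D://demo.html. The following is a minimal sketch, assuming a made-up sample page (SAMPLE_HTML) written to a temporary file; only the BeautifulSoup(file_object, "html.parser") call comes from the notes, everything else is illustrative.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a saved copy of a page.
SAMPLE_HTML = (
    "<html><head><title>Demo page</title></head>"
    "<body><p>Hello</p></body></html>"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.html")

    # Save the sample so there is a real file to read back.
    with open(path, "w", encoding="utf-8") as f:
        f.write(SAMPLE_HTML)

    # Pass the open file object to BeautifulSoup; the with-block closes it afterwards.
    with open(path, encoding="utf-8") as f:
        file_soup = BeautifulSoup(f, "html.parser")

title_text = file_soup.title.string
print(title_text)
print(file_soup.prettify())
```

Opening the file in a with-block and naming the encoding explicitly avoids leaving the handle open and avoids depending on the platform's default encoding.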

 

Basic parsers:

bs4's HTML parser: BeautifulSoup(mk, 'html.parser') (requires bs4 only)

lxml's HTML parser: BeautifulSoup(mk, 'lxml') (requires lxml)

lxml's XML parser: BeautifulSoup(mk, 'xml') (requires lxml)

html5lib's parser: BeautifulSoup(mk, 'html5lib') (requires html5lib)
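Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. Below is a minimal sketch of that idea; make_soup and its preferred order are hypothetical names chosen for illustration, while FeatureNotFound is the exception bs4 raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser in `preferred` that is installed.

    This helper is illustrative and not part of bs4 itself.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # The parser library is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>data</p>")
print(used, soup.p.string)
```

Since html.parser ships with Python, keeping it last in the list means the function should always find something to use.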

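Basic elements:

Tag: a tag, the basic unit of information, opened with <name> and closed with </name>

Name: the tag's name, i.e. the text inside <>, accessed with .name

Attributes: the tag's attributes, accessed as a dictionary with .attrs

NavigableString: the string between the opening and closing tags, accessed with .string

Comment: the comment part of a string inside a tag, a special kind of NavigableString; it can be swapped for another node, e.g. comment.replace_with(cdata)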

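import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
# print(soup.title)
tag = soup.a  # the first <a> tag; attribute access only ever returns the first match
aptag = soup.a.parent.name  # name of the first <a> tag's parent tag
ta = tag.attrs  # the tag's attributes, stored as a dict
tac = tag.attrs['class']  # value of the class attribute
tah = tag.attrs['href']  # value of the href attribute, i.e. the link
tat = type(tag.attrs)  # type of the attributes object (dict)
tta = type(tag)  # type of the tag (bs4.element.Tag)
tcont = tag.string  # the string between <a> and </a>
# The <b> and <p> tag names below are assumed for illustration.
newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
# newsoup.b.string is of type Comment and its text is "This is a comment";
# newsoup.p.string is an ordinary NavigableString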
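Since .string returns a comment's text without the <!-- --> markers, a comment and ordinary text look the same when printed, and the reliable way to tell them apart is by type. The sketch below shows one way to do that; the markup and the <b>/<p> tag names are assumptions made for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Illustrative markup: one tag holding a comment, one holding ordinary text.
markup = "<b><!--This is a comment--></b><p>This is not a comment</p>"
newsoup = BeautifulSoup(markup, "html.parser")

b_string = newsoup.b.string  # a Comment (a subclass of NavigableString)
p_string = newsoup.p.string  # a plain NavigableString

# Printing hides the difference, so show the type name as well.
print(b_string, type(b_string).__name__)
print(p_string, type(p_string).__name__)

# Keep only strings that are not comments.
visible = [
    node
    for node in newsoup.descendants
    if isinstance(node, NavigableString) and not isinstance(node, Comment)
]
print(visible)
```

Filtering on isinstance(node, Comment) is a common way to keep commented-out markup from leaking into extracted text.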

 

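Downward traversal of the tag tree:

.contents: a list of child nodes; all direct children are stored in the list

.children: an iterator over the child nodes, similar to .contents, used to loop (for) over the direct children

.descendants: an iterator over all descendant nodes (children, grandchildren, and so on), used to loop (for) over them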
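The difference between direct children and all descendants is easier to see on a small document. This is a minimal sketch on made-up markup; note that text nodes, including the newline between the two paragraphs, count as nodes too.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative markup: two paragraphs separated by a newline.
html = "<body><p>one <b>bold</b></p>\n<p>two</p></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# .contents is a list, so it supports len() and indexing.
direct_count = len(body.contents)

# .children is an iterator over the same direct children; keep only the tags.
child_tags = [child.name for child in body.children if isinstance(child, Tag)]

# .descendants walks the whole subtree in document order.
descendant_tags = [node.name for node in body.descendants if isinstance(node, Tag)]
descendant_count = len(list(body.descendants))

print(direct_count)      # 3: <p>, the newline string, <p>
print(child_tags)        # ['p', 'p']
print(descendant_tags)   # ['p', 'b', 'p']
print(descendant_count)  # 7: three tags plus four strings
```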

Upward traversal of the tag tree:

.parent: the node's parent tag

.parents: an iterator over the node's ancestor tags, used to loop over the ancestors
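A short sketch of the upward direction, using made-up markup: walking .parents from a link reaches the BeautifulSoup object itself, whose name is "[document]" and whose own parent is None.

```python
from bs4 import BeautifulSoup

# Illustrative markup with a link nested three levels deep.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Collect the names of every ancestor of the first <a> tag, innermost first.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']

# The top-level <html> tag's parent is the soup object; the soup has no parent.
print(soup.html.parent is soup)  # True
print(soup.parent)               # None
```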

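Sideways (sibling) traversal of the tag tree:

Note: sibling traversal only happens among nodes that share the same parent. A sibling is not necessarily a tag; it may also be a NavigableString.

.next_sibling: returns the next sibling node in HTML text order

.previous_sibling: returns the previous sibling node in HTML text order

.next_siblings: an iterator (for) that returns all following sibling nodes in HTML text order

.previous_siblings: an iterator (for) that returns all preceding sibling nodes in HTML text order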
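Because the text between tags is also a sibling, .next_sibling often returns a string rather than the next tag. The sketch below, on made-up markup, shows this and two ways to get tags only: filtering by type, or bs4's find_next_sibling method.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative markup: two links inside one paragraph, separated by text.
html = "<p>See <a id='link1'>Basic</a> and <a id='link2'>Advanced</a>.</p>"
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The immediate neighbours of the first link are strings, not tags.
next_node = first.next_sibling          # ' and '
previous_node = first.previous_sibling  # 'See '

# Nothing comes before 'See ' inside <p>, so this is None.
before_previous = previous_node.previous_sibling

# Keep only tags among the following siblings.
following_tags = [s for s in first.next_siblings if isinstance(s, Tag)]

# Or jump directly to the next sibling that is an <a> tag.
next_link = first.find_next_sibling("a")

print(repr(next_node), repr(previous_node), before_previous)
print([t["id"] for t in following_tags], next_link["id"])
```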

import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
# Downward traversal
sh = soup.head  # the <head> tag
shc = soup.head.contents  # the children of <head>
sbc = soup.body.contents  # the children of <body>, as a list
sn = len(sbc)  # number of direct children of <body>
# Loop over the direct children of <body>
for child in soup.body.children:
    print(child)

# Upward traversal
stp = soup.title.parent  # the parent tag of <title>
shp = soup.html.parent  # <html> is the top-level tag; its parent is the soup object itself, named [document]
sop = soup.parent  # the soup object has no parent, so this is None
# Loop over the ancestors of the first <a> tag; the last one is the soup object ([document])
for parent in soup.a.parents:
    print(parent.name)

# Sideways traversal
sans = soup.a.next_sibling  # the next sibling of the first <a> tag
sanbs = soup.a.next_sibling.next_sibling
saps = soup.a.previous_sibling  # the previous sibling of the first <a> tag
sapspa = soup.a.previous_sibling.previous_sibling  # None
# Loop over the preceding and following siblings
for sibling in soup.a.previous_siblings:
    print(sibling)
for sibling in soup.a.next_siblings:
    print(sibling)

# HTML formatting and encoding with the bs4 library

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>中文</p>", "html.parser")

sps = soup.p.string

print(soup.p.prettify())
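To make the formatting and encoding behaviour concrete, here is a minimal sketch on the same one-paragraph document. It assumes only documented bs4 behaviour: prettify() returns a Unicode str with the tag and its text on separate lines, and encode("utf-8") returns the markup as UTF-8 bytes.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>中文</p>", "html.parser")

# The parsed text is an ordinary Unicode string, whatever the input encoding was.
text = soup.p.string

# prettify() returns a str with each tag and each string on its own line.
pretty = soup.p.prettify()
pretty_lines = [line.strip() for line in pretty.splitlines() if line.strip()]

# encode() returns bytes, here UTF-8 encoded and without pretty-printing.
encoded = soup.p.encode("utf-8")

print(text)
print(pretty)
print(encoded)
```

Working with str internally and encoding only when writing out is what lets non-ASCII text such as 中文 pass through unchanged.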
