BeautifulSoup Notes



Reference: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#id14

What is BeautifulSoup

BeautifulSoup is an HTML/XML parser written in Python. It handles malformed markup gracefully and builds a parse tree from the document.

Environment: CentOS 7 (Linux)

- Installation

pip install beautifulsoup4
easy_install beautifulsoup4   # legacy alternative to pip

- Main parsers and their trade-offs

Parser           | Usage                                                             | Pros                                                                     | Cons
Python stdlib    | BeautifulSoup(markup, "html.parser")                              | built into Python; decent speed; tolerant of bad markup                  | poor tolerance in versions before Python 2.7.3 / 3.2.2
lxml HTML parser | BeautifulSoup(markup, "lxml")                                     | very fast; tolerant of bad markup                                        | requires the lxml C library
lxml XML parser  | BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml") | very fast; the only XML parser                                           | requires the lxml C library
html5lib         | BeautifulSoup(markup, "html5lib")                                 | most tolerant; parses pages the way a browser does; produces valid HTML5 | very slow; extra Python dependency

lxml is the recommended parser: it is the most efficient.
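To make the parser choice concrete, here is a minimal sketch (the markup string is invented for illustration). It uses the stdlib html.parser backend, so nothing beyond beautifulsoup4 is needed; swap in "lxml" for speed once that C library is installed:

```python
from bs4 import BeautifulSoup

# Deliberately malformed markup: neither <p> tag is closed.
markup = "<p>Hello<p>World"

# html.parser ships with Python, so this runs with no extra C library.
soup = BeautifulSoup(markup, "html.parser")
# The lenient parser auto-closes the first <p>, yielding two tags.
print([tag.name for tag in soup.find_all("p")])  # ['p', 'p']
```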


Usage

Parsing a local document:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("XXX.html"), 'lxml')      # parse from a file handle
soup = BeautifulSoup("<html>data</html>", 'lxml')   # parse from a string

Detailed usage (matching)

  1. Tags

Note: soup.b             # only gets the first <b> tag
soup.find_all('b')       # gets every <b> tag

  • Get a tag: soup.b
soup = BeautifulSoup('<b class="boldest">Extremely bold</b><b>Extremely bold1</b><p>test</p>', 'lxml')
tag = soup.b        # only gets the first <b> tag
print(tag)          # <b class="boldest">Extremely bold</b>
print(type(tag))    # <class 'bs4.element.Tag'>
  • Get the tag name: tag.name
soup = BeautifulSoup('<b class="boldest">Extremely bold</b><b>Extremely bold1</b><p>test</p>', 'lxml')
tag = soup.b
print(tag.name)     # b
  • Get an attribute value: tag['class']
soup = BeautifulSoup('<b class="boldest">Extremely bold</b><b>Extremely bold1</b><p>test</p>', 'lxml')
tag = soup.b
print(tag['class'])  # ['boldest']
  • Get all attribute names and values: tag.attrs

Note: XML does not have multi-valued attributes

soup = BeautifulSoup('<b class="boldest mybold" id="bold dd">Extremely bold</b><b>Extremely bold1</b><p>test</p>', 'lxml')
tag = soup.b
print(tag['class'])  # ['boldest', 'mybold']
print(tag['id'])     # bold dd
print(tag.attrs)     # {'class': ['boldest', 'mybold'], 'id': 'bold dd'}
  • Modify an attribute: tag['class'] = 'change' / tag['class'] = 'change muil' (equivalently tag['class'] = ['change', 'muil'])
soup = BeautifulSoup('<b class="boldest">Extremely bold</b><b>Extremely bold1</b><p>test</p>', 'lxml')
tag = soup.b
tag['class'] = 'change'
tag['id'] = '1'
print(soup)
# <html><body><b class="change" id="1">Extremely bold</b><b>Extremely bold1</b><p>test</p></body></html>
tag['class'] = 'change muil'   # equivalently: tag['class'] = ['change', 'muil']
print(soup)
# <html><body><b class="change muil" id="1">Extremely bold</b><b>Extremely bold1</b><p>test</p></body></html>
  • Delete an attribute: del tag['class']
soup = BeautifulSoup('<b class="boldest" id="bold dd">Extremely bold</b><b>Extremely bold1</b><p>test</p>', 'lxml')
tag = soup.b
del tag['class']
del tag['id']
print(soup)
# <html><body><b>Extremely bold</b><b>Extremely bold1</b><p>test</p></body></html>
print(tag['class'])      # KeyError: 'class'
print(tag.get('class'))  # None
  2. Text content
  • Get the text: tag.string
soup = BeautifulSoup('<b class="boldest">Extremely bold</b><b>Extremely bold1</b><p>test</p>', 'lxml')
tag = soup.b
print(tag.string)   # Extremely bold
print(soup.string)  # None (more than one child, so .string is ambiguous)
  • Replace the text: tag.string.replace_with("replaced")
soup = BeautifulSoup('<b class="boldest">Extremely bold</b><b>Extremely bold1</b><p>test</p>', 'lxml')
tag = soup.b
tag.string.replace_with("replaced")
print(tag)          # <b class="boldest">replaced</b>

Summary:
soup.a and the like only find the first match;

soup.find_all('a') finds them all.

  • List a tag's direct children: head.contents
soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", 'lxml')
head = soup.head
print(head)
# <head><title>The Dormouse's story</title></head>
print(head.contents)
# [<title>The Dormouse's story</title>]
print(head.contents[0].contents)
# ["The Dormouse's story"]

  • Iterate over a tag's direct children (a generator): for i in head.children
soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", 'lxml')
head = soup.head
print(head)
# <head><title>The Dormouse's story</title></head>
for i in head.children:
    print(i)
# <title>The Dormouse's story</title>

  • Iterate over all descendants (a generator): for i in head.descendants
soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", 'lxml')
head = soup.head
print(head)
# <head><title>The Dormouse's story</title></head>
for i in head.descendants:
    print(i)
# <title>The Dormouse's story</title>
# The Dormouse's story

  • Iterate over all text content (a generator): for i in head.strings
soup = BeautifulSoup("<html><head><title>The Dormouse's story</title></head><p>ppppp</p></html>", 'lxml')
head = soup.html
print(head)
# <html><head><title>The Dormouse's story</title></head><body><p>ppppp</p></body></html>
for i in head.strings:
    print(i)
# The Dormouse's story
# ppppp
  • Iterate over all text content, skipping extra whitespace and blank lines: for i in head.stripped_strings
soup = BeautifulSoup("<html><head><title>  The Dormouse's   story \n\r  </title></head><p>ppppp</p></html>", 'lxml')
head = soup.html
print(head)
for i in head.stripped_strings:
    print(i)
# The Dormouse's   story
# ppppp
  • Get an element's direct parent: head.parent

Note: soup.html.parent is the BeautifulSoup object itself (printing it prints the whole document),
and soup.parent is None.

soup = BeautifulSoup("<html><head><title>The Dormouse's story</title></head><p>ppppp</p></html>", 'lxml')
print(soup)
# <html><head><title>The Dormouse's story</title></head><body><p>ppppp</p></body></html>
p = soup.p
print(p.string)          # ppppp
print(p.string.parent)   # <p>ppppp</p> (a string's parent is the tag that contains it)
print(p.parent)          # <body><p>ppppp</p></body>
print(soup.html.parent)  # the whole document again
  • Get all of a node's parents (a generator): for i in head.parents
soup = BeautifulSoup("<html><head><title>The Dormouse's story</title></head><p>ppppp</p></html>", 'lxml')
p = soup.p
print(p.parents)   # a generator object
for i in p.parents:
    print(i.name)
# body
# html
# [document]
  • Get sibling nodes (note: whitespace and punctuation strings count as siblings too):
    Next sibling: soup.b.next_sibling
    Previous sibling: soup.c.previous_sibling
    All siblings:
    for sibling in soup.a.next_siblings
    for sibling in soup.find(id="link3").previous_siblings
soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'lxml')
print(soup)
# <html><body><a><b>text1</b><c>text2</c></a></body></html>
print(soup.b.next_sibling)
# <c>text2</c>
print(soup.c.previous_sibling)
# <b>text1</b>

Summary:
Direct children of an element: head.contents / head.children
Direct parent of an element: head.parent
All descendants of an element: for i in head.descendants
All parents of an element: for i in head.parents
Sibling nodes: .next_sibling / .previous_sibling
All sibling nodes: .next_siblings / .previous_siblings
One element: soup.a
All elements: soup('a')
One text value: soup.string
All text values: for i in head.strings
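The navigation summary above can be sketched end to end; the tiny markup below is invented for illustration and uses the stdlib html.parser backend:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><a>one</a><b>two</b><c>three</c></body>", "html.parser")

print([child.name for child in soup.body.children])  # ['a', 'b', 'c']
print(soup.b.parent.name)        # body
print(soup.b.next_sibling)       # <c>three</c>
print(soup.b.previous_sibling)   # <a>one</a>
print(list(soup.body.strings))   # ['one', 'two', 'three']
```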


  • Searching:

find_all

Note 1: calling a tag's find_all() method searches all of that tag's descendants; to search only its direct children, pass recursive=False.

Note 2: soup.find_all("a") and soup("a") are equivalent.

Note 3: the find() method returns only the first match.
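A minimal sketch of the find() / find_all() difference, with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

print(soup.find("p"))            # <p>one</p>  (only the first match)
print(len(soup.find_all("p")))   # 2           (every match, in a list)
print(soup.find("div"))          # None        (find returns None on no match)
```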

Find all <b> tags: soup.find_all('b')
Find all tags whose names start with b: soup.find_all(re.compile("^b"))
Find all tags whose names contain d: soup.find_all(re.compile("d"))
Find all <a> and <b> tags: soup.find_all(["a", "b"])
Find all <p> tags whose class is myclass: soup.find_all("p", "myclass")
Find <p> tags whose id is myid: soup.find_all("p", id="myid")
Find the tag with id link2: soup.find_all(id="link2")
Find text containing "text" (note: this returns the text itself, not the enclosing tag): soup.find(string=re.compile("text"))
soup.find_all(string=["Tillie", "Elsie", "Lacie"])
Find all tags whose href contains elsie: soup.find_all(href=re.compile("elsie"))
Match everything: soup.find_all(True)
Match every tag that has an id: soup.find_all(id=True)
Filter on several criteria at once: soup.find_all(href=re.compile("elsie"), id='link1')
Limit the number of results: soup.find_all("a", limit=2)

import re
soup = BeautifulSoup("<b><b1>text1</b1><b2>text2</b2></b>", 'lxml')
print(soup.find_all('b'))
# [<b><b1>text1</b1><b2>text2</b2></b>]
print(soup.find_all(re.compile("^b")))
# [<body>...</body>, <b>...</b>, <b1>text1</b1>, <b2>text2</b2>]
print(soup.find_all(re.compile('d')))
# [<body><b><b1>text1</b1><b2>text2</b2></b></body>]   (only "body" contains a d)
print(soup.find_all(["b1", "b2"]))
# [<b1>text1</b1>, <b2>text2</b2>]

data_soup.find_all(data-foo="value") raises a SyntaxError
data_soup.find_all(attrs={"data-foo": "value"}) works instead
soup.find_all("a", attrs={"class": "sister"})
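A small sketch of the attrs, limit, and recursive options together (the markup is invented for illustration):

```python
from bs4 import BeautifulSoup

html = ('<div>'
        '<a class="sister" data-foo="value">A</a>'
        '<a class="sister">B</a>'
        '<a>C</a>'
        '</div>')
soup = BeautifulSoup(html, "html.parser")

# data-foo is not a valid Python identifier, so it must go through attrs.
print(len(soup.find_all(attrs={"data-foo": "value"})))     # 1
print(len(soup.find_all("a", attrs={"class": "sister"})))  # 2
print(len(soup.find_all("a", limit=2)))                    # 2
# recursive=False only looks at the soup's direct children (just <div> here),
# so no <a> tags are found.
print(len(soup.find_all("a", recursive=False)))            # 0
```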

find_parents() and find_parent()

Note: find_parents() returns a list of all matching parents; find_parent() returns only the first one.

find_next_siblings() and find_next_sibling()
Note: find_next_siblings() returns all matching later siblings; find_next_sibling() returns only the first matching later sibling.

find_previous_siblings() and find_previous_sibling()
Note: find_previous_siblings() returns all matching earlier siblings; find_previous_sibling() returns only the first matching earlier sibling.

find_all_next() and find_next()
Note: find_all_next() returns all matching later nodes; find_next() returns only the first one.

find_all_previous() and find_previous()
Note 1: find_all_previous() returns all matching earlier nodes; find_previous() returns only the first one.
Note 2: find_all_previous("p") returns the first paragraph in the document (the one with class="title"), but it also returns the second paragraph, because that <p> tag contains the <a> tag we started the search from. This is expected: the method finds every <p> tag appearing before the starting tag, and since the enclosing <p> opens before the <a> inside it, it qualifies too.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

a_string = soup.find(string="Lacie")
print(a_string)
# Lacie
print(a_string.find_parents("a"))
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
print(a_string.find_parent("p"))
# the whole <p class="story"> paragraph
print(a_string.find_parents("p", class_="title"))
# []

first_link = soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(first_link.find_all_previous("p"))
# [<p class="story">Once upon a time ...</p>, <p class="title"><b>The Dormouse's story</b></p>]
print(first_link.find_previous("title"))
# <title>The Dormouse's story</title>

select

Find tags layer by layer: soup.select("body a") / soup.select("html head title")
Find the direct children of a tag: soup.select("p > a")
soup.select("p > a:nth-of-type(2)")
soup.select("p > #link1")
Find sibling tags: soup.select("#link1 ~ .sister")   all later siblings
soup.select("#link1 + .sister")   only the adjacent one
Find by CSS class: soup.select(".sister") / soup.select("[class~=sister]")
Find by tag id: soup.select("#link1") / soup.select("a#link2") / soup.select("#link1,#link2")
Find by the presence of an attribute: soup.select('a[href]')
Find by attribute value: soup.select('a[href="http://example.com/elsie"]')   exact match
soup.select('a[href^="http://example.com/"]')   prefix match
soup.select('a[href$="tillie"]')   suffix match
soup.select('a[href*=".com/el"]')   substring match
Find only the first match: soup.select_one(".sister")
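Several of the selector forms above can be sketched together; the markup below is a made-up fragment modeled on the docs' example:

```python
from bs4 import BeautifulSoup

html = ('<p>'
        '<a id="link1" class="sister" href="http://example.com/elsie">Elsie</a>'
        '<a id="link2" class="sister" href="http://example.com/lacie">Lacie</a>'
        '</p>')
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("p a")))                             # 2  (descendant selector)
print(soup.select_one("#link1").get_text())                # Elsie
print(len(soup.select('a[href^="http://example.com/"]')))  # 2  (prefix match)
print(soup.select("#link1 + .sister")[0]["id"])            # link2  (adjacent sibling)
```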

  • Getting text
    Get the text: soup.get_text()
    Join the text pieces with a separator: soup.get_text("|")
    Join with a separator and strip surrounding whitespace: soup.get_text("|", strip=True)
    Or use the .stripped_strings generator and process the list yourself: [text for text in soup.stripped_strings]
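A minimal sketch of the three text-extraction forms (markup invented for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p> Hello <b> world </b></p>", "html.parser")

print(soup.get_text())                  # ' Hello  world '  (raw concatenation)
print(soup.get_text("|", strip=True))   # Hello|world
print(list(soup.stripped_strings))      # ['Hello', 'world']
```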

  • Filter functions
    Match elements that have a class attribute but not an id attribute, then find every tag the function accepts:

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
print(soup.find_all(has_class_but_no_id))

Find all tags whose href attribute does not match a given regex:

import re
def not_lacie(href):
    return href and not re.compile("lacie").search(href)
print(soup.find_all(href=not_lacie))

Find tags that have text immediately before and after them:

from bs4 import NavigableString
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))
for tag in soup.find_all(surrounded_by_strings):
    print(tag)

Match class attributes of a given length:

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6
print(soup.find_all(class_=has_six_characters))

  • Modifying
    Delete: del tag / del tag['class']
    Modify: plain assignment with =
    Extend: append("Bar")
soup = BeautifulSoup("<a>Foo</a>", 'lxml')
soup.a.append("Bar")

print(soup.a)
# <a>FooBar</a>
print(soup.a.contents)
# ['Foo', 'Bar']

Add: append(new_string) or NavigableString(" there")

from bs4 import NavigableString
soup = BeautifulSoup("<b></b>", 'lxml')
tag = soup.b
tag.append("Hello")
new_string = NavigableString(" there")
tag.append(new_string)
print(tag)
# <b>Hello there</b>
print(tag.contents)
# ['Hello', ' there']

Add a comment: soup.new_string("Nice to see you.", Comment)

from bs4 import Comment
new_comment = soup.new_string("Nice to see you.", Comment)
tag.append(new_comment)
print(tag)
# <b>Hello there<!--Nice to see you.--></b>
print(tag.contents)
# ['Hello', ' there', 'Nice to see you.']

Create a tag: soup.new_tag("a", href="http://www.example.com")

soup = BeautifulSoup("<b></b>", 'lxml')
original_tag = soup.b

new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
print(original_tag)
# <b><a href="http://www.example.com"></a></b>

new_tag.string = "Link text."
print(original_tag)
# <b><a href="http://www.example.com">Link text.</a></b>

Append to the end: append()
Insert at a given position: insert()
Insert before a tag or text node: soup.b.string.insert_before(tag)   inserts tag before b.string
Insert after a tag or text node: soup.b.i.insert_after(soup.new_string(" ever "))

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'lxml')
tag = soup.a

tag.insert(1, "but did not endorse ")
print(tag)
# <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
print(tag.contents)
# ['I linked to ', 'but did not endorse ', <i>example.com</i>]

Remove a tag's contents: tag.clear()
Detach a tag from the tree and return it: x = soup.i.extract()
Remove a node from the tree and destroy it completely: soup.i.decompose()
Replace a node with a new tag or text node: a_tag.i.replace_with(new_tag)
x = a_tag.i.replace_with(new_tag) returns the node that was replaced
Wrap an element in the specified tag and return the wrapper: soup.p.string.wrap(soup.new_tag("b"))
Remove the tags inside a tag while keeping the text (commonly used to unwrap markup): a_tag.i.unwrap()
x = a_tag.i.unwrap() returns the tag that was removed

  • Output
    Pretty-print: soup.prettify()
  • Encoding detection
    BeautifulSoup detects and guesses the document encoding automatically
    The detected encoding: soup.original_encoding
    Specify the encoding yourself: soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
    Exclude a guess from consideration: soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])
    Pass an encoding to prettify(): soup.prettify("latin-1")
    Encode a subtree: soup.p.encode("utf-8")
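A minimal sketch of encoding detection, feeding bytes with a declared charset (the markup is invented for illustration):

```python
from bs4 import BeautifulSoup

# Bytes in, Unicode out: BeautifulSoup sniffs the declared charset.
markup = '<meta charset="utf-8"><p>中文</p>'.encode("utf-8")
soup = BeautifulSoup(markup, "html.parser")

print(soup.original_encoding)   # utf-8
print(soup.p.string)            # 中文
```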
