[Python Web Scraping Self-Study Notes] ----- Beautiful Soup Usage

Reference: https://cuiqingcai.com/1319.html

Introduction

Its main job is extracting data from web pages. Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you need to scrape, so a complete application takes very little code.

Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You don't need to think about encodings unless the document doesn't declare one, in which case Beautiful Soup cannot detect the encoding automatically; then you only need to state the original encoding yourself.
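For example, when a document carries no encoding declaration, you can pass the original encoding as a hint via the from_encoding parameter (a minimal sketch; the GB2312 bytes here are just an illustration):

```python
from bs4 import BeautifulSoup

# Bytes with no encoding declaration; we state the original encoding ourselves
raw = '<p>你好</p>'.encode('gb2312')
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gb2312')

print(soup.p.string)           # 你好 -- stored internally as Unicode
print(soup.original_encoding)  # the encoding Beautiful Soup used to decode the input
```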

Beautiful Soup works with excellent Python parsers such as lxml and html5lib, letting users choose between flexible parsing strategies and raw speed.

Installation

In Python 3 the module is imported as bs4; it is installed with pip install beautifulsoup4.
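A quick sanity check after installing (assumes the package was installed with pip install beautifulsoup4):

```python
# Verify the package imports and report its version
import bs4
print(bs4.__version__)
```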

Using Beautiful Soup

Creating a BeautifulSoup Object

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the BeautifulSoup object (naming a parser explicitly avoids a warning)
soup = BeautifulSoup(html, 'html.parser')
# Pretty-print the soup object's contents
print(soup.prettify())

The Four Kinds of Objects

Beautiful Soup turns a complex HTML document into a complex tree structure. Every node is a Python object, and all objects fall into the following four kinds:

  • Tag
  • NavigableString
  • BeautifulSoup
  • Comment

(1) Tag: a tag in the HTML

print(soup.title)
print(soup.head)
print(soup.a)
print(soup.p)

<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

# Verify the object's type
print(type(soup.a))

<class 'bs4.element.Tag'>

Attributes

name: the soup object itself is special; its name is [document]. For other tags, the output is the name of the tag itself.

print(soup.name)
print(soup.head.name)

[document]
head

attrs: lets you view, modify, and delete a tag's attributes.

# View attributes
print(soup.p.attrs)
print(soup.p['class'])
print(soup.p.get('class'))
# Modify an attribute
soup.p['class'] = 'newClass'
print(soup.p)
# Delete an attribute
del soup.p['class']
print(soup.p)

{'class': ['title'], 'name': 'dromouse'}
['title']
['title']

<p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>

<p name="dromouse"><b>The Dormouse's story</b></p>

(2) NavigableString: a traversable string

# Get the text inside a tag
print(soup.p.string)
print(type(soup.p.string))

The Dormouse's story
<class 'bs4.element.NavigableString'>

(3) BeautifulSoup: represents the entire content of a document

print(type(soup.name))
print(soup.name)
print(soup.attrs)


<class 'str'>
[document]
{}  # an empty dict

(4) Comment: a special type of NavigableString. It is actually the content of a comment, but with the comment markers already stripped off.

print(soup.a)
print(soup.a.string)
print(type(soup.a.string))


<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
<class 'bs4.element.Comment'>
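Because a Comment prints just like an ordinary string, it is safer to check the type before using .string; a small sketch using the same <a> markup as above:

```python
from bs4 import BeautifulSoup, Comment

html = '<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>'
soup = BeautifulSoup(html, 'html.parser')

text = soup.a.string
if isinstance(text, Comment):
    # the <!-- --> markers have already been stripped off
    print('comment:', text)
```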

Traversing the Document Tree

(1) Direct children

The tag.contents attribute returns a tag's children as a list; you can also index into the list to get a particular element.

print(soup.head.contents)
print(soup.head.contents[0])

[<title>The Dormouse's story</title>]
<title>The Dormouse's story</title>

tag.children returns a generator rather than a list; iterate over it to get all the child nodes.

print(soup.head.children)
for child in soup.body.children:
    print(child)




<list_iterator object at 0x...>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

(2) All descendants

The tag.descendants attribute recursively loops over all of a tag's descendants; like children, it must be iterated to get the contents.

for child in soup.descendants:
    print(child)

<html><head><title>The Dormouse's story</title></head>
<body>
... (the entire document) ...
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story
<body>
... (the entire body) ...
</body>
... (each node in turn, down to every individual string) ...

This prints every node: first html, then head, peeling off one layer at a time.

(3) Node content

The tag.string attribute: if a tag has exactly one NavigableString child, the tag can use .string to get it; if a tag has exactly one child tag, .string also works and returns that child's content.

print(soup.head.string)
print(soup.title.string)

The Dormouse's story
The Dormouse's story

If a tag contains multiple children, it cannot decide which child's content .string should return, so .string outputs None.

print(soup.html.string)

None
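When .string comes back None, get_text() is a common fallback: it joins all the strings inside the tag (a minimal sketch on a tiny made-up document):

```python
from bs4 import BeautifulSoup

html = '<p>Hello <b>world</b></p>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.p.string)      # None: the <p> has more than one child
print(soup.p.get_text())  # Hello world
```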

(4) Multiple strings

The tag.strings attribute yields all the strings; iterate over it to retrieve them.

for string in soup.strings:
    print(repr(string))

"The Dormouse's story"
'\n'
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'
'\n'

The tag.stripped_strings attribute: the output above contains many extra spaces and blank lines; .stripped_strings removes that surplus whitespace.

for string in soup.stripped_strings:
    print(repr(string))

"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
','
'Lacie'
'and'
'Tillie'
';\nand they lived at the bottom of a well.'

(5) Parent node

The tag.parent attribute

p = soup.p
print(p.parent.name)
content = soup.head.title.string
print(content.parent.name)

body
title

(6) All ancestors

The tag.parents attribute: recursively yields all of an element's ancestor nodes.

content = soup.head.title.string
for parent in content.parents:
    print(parent.name)

title
head
html
[document]

(7) Sibling nodes

The tag.next_sibling attribute: gets the node's next sibling. Returns None if there isn't one.

The tag.previous_sibling attribute: gets the node's previous sibling. Returns None if there isn't one.

Note: in a real document, .next_sibling and .previous_sibling are usually strings or whitespace, because whitespace and newlines also count as nodes, so the result you get may be a blank or a newline.

print(soup.p.next_sibling)
print(soup.p.previous_sibling)
print(soup.p.next_sibling.next_sibling)

# blank: the next sibling is a newline node
# blank: the previous sibling here is also a whitespace node; if none existed, None would be returned

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
# the next sibling's next sibling is a visible node
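To skip over those whitespace nodes, one option is to loop until a real Tag appears (a sketch on a made-up two-paragraph document):

```python
from bs4 import BeautifulSoup, Tag

html = '<div><p id="a">one</p>\n<p id="b">two</p></div>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('p', id='a')
print(repr(first.next_sibling))   # '\n' -- the whitespace node

# advance past any string nodes until a Tag appears
sib = first.next_sibling
while sib is not None and not isinstance(sib, Tag):
    sib = sib.next_sibling
print(sib)                        # <p id="b">two</p>
```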

(8) All siblings

The tag.next_siblings and tag.previous_siblings attributes: iterate over the current node's following or preceding siblings.

for sibling in soup.a.next_siblings:
    print(repr(sibling))

',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'

(9) Next and previous nodes

The tag.next_element and tag.previous_element attributes: the next or previous parsed object, ignoring the tag hierarchy.

print(soup.head.next_element)

<title>The Dormouse's story</title>

(10) All next and previous nodes

The tag.next_elements and tag.previous_elements attributes: iterators that walk forward or backward over the document's parsed content.

for element in soup.a.next_elements:
    print(repr(element))

' Elsie '
',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
'Lacie'
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
'\n'

Searching the Document Tree

(1) find_all(name, attrs, recursive, text, **kwargs): searches all tag children of the current tag and checks whether they match the filter conditions.

The name parameter: finds all tags named name; string objects are automatically ignored.

  • Passing a string
print(soup.find_all('b'))
print(soup.find_all('a'))

[<b>The Dormouse's story</b>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
  • Passing a regular expression: Beautiful Soup uses the pattern's match() method to match content.
import re
for tag in soup.find_all(re.compile('^b')):  # tags whose names start with b
    print(tag.name)

body
b
  • Passing a list: Beautiful Soup returns everything matching any element of the list.
print(soup.find_all(['a','b']))

[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
  • Passing a boolean: True matches any tag.
for tag in soup.find_all(True):
    print(tag.name)

html
head
title
body
p
b
p
a
a
a
p
  • Passing a function: if no built-in filter fits, define a function that takes a single tag argument and returns True if the current tag matches, False otherwise.
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
print(soup.find_all(has_class_but_no_id))
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]

Keyword parameters: if a named argument is not one of the built-in search parameter names, the search treats it as a tag attribute of that name. For example, if the call includes an argument named id, Beautiful Soup searches every tag's id attribute.

import re
print(soup.find_all(id = 'link2'))
print(soup.find_all(href = re.compile('elsie')))
print(soup.find_all(href = re.compile('elsie'),id = 'link1'))
print(soup.find_all('a',class_ = 'sister'))  # to filter on class, note that class is a Python keyword, so append an underscore

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The text parameter: searches the document's string content. It accepts the same kinds of values as the name parameter: a string, a regular expression, a list, or True.

print(soup.find_all(text='Elsie'))
print(soup.find_all(text=['Tillie','Elsie','Lacie']))
print(soup.find_all(text=re.compile('Dormouse')))

[]
['Lacie', 'Tillie']
["The Dormouse's story", "The Dormouse's story"]

The limit parameter: find_all() returns all matches, which can be slow on a large document tree. If you don't need every result, use limit to cap the number returned, much like the LIMIT keyword in SQL: once the number of matches reaches the limit, the search stops and returns.

print(soup.find_all('a',limit=2))
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

The recursive parameter: when you call a tag's find_all() method, Beautiful Soup searches all of the tag's descendants. To search only the tag's direct children, pass recursive=False.

print(soup.html.find_all('title'))
print(soup.html.find_all('title',recursive = False))
print(soup.html.find_all('head',recursive = False))

[<title>The Dormouse's story</title>]
[]
[<head><title>The Dormouse's story</title></head>]

(2) find(name, attrs, recursive, text, **kwargs): the difference from find_all() is that find_all() returns a list of matching elements, while find() returns the first result directly.
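The practical difference shows up when nothing matches: find() returns None while find_all() returns an empty list (a minimal sketch on a one-paragraph document):

```python
from bs4 import BeautifulSoup

html = "<p class='title'><b>The Dormouse's story</b></p>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('b'))          # <b>The Dormouse's story</b>
print(soup.find('table'))      # None
print(soup.find_all('table'))  # []
```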

(3) find_parents() and find_parent(): search the current node's ancestors, using the same search interface as ordinary tag searches.

(4) find_next_siblings() and find_next_sibling(): the former returns all matching siblings that follow the node; the latter returns the first matching following sibling.

(5) find_previous_siblings() and find_previous_sibling(): the former returns all matching siblings that precede the node; the latter returns the first matching preceding sibling.

(6) find_all_next() and find_next(): iterate over the tags and strings after the current tag via the .next_elements attribute.

(7) find_all_previous() and find_previous(): iterate over the tags and strings before the current node via the .previous_elements attribute.

CSS Selectors

Use soup.select(); the return type is a list.
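Since select() always returns a list, you index with [0] for a single element; select_one() does that for you and returns the first match or None (a minimal sketch):

```python
from bs4 import BeautifulSoup

html = "<p class='title'><b>The Dormouse's story</b></p>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('p.title'))      # a list of matching tags
print(soup.select_one('p.title'))  # the first match only, or None
```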

(1) Search by tag name

print(soup.select('title'))
print(soup.select('a'))
print(soup.select('b'))

[<title>The Dormouse's story</title>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<b>The Dormouse's story</b>]

(2) Search by class name: prefix the class name with a dot (.).

print(soup.select('.sister'))
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(3) Search by id: prefix the id with #.

print(soup.select('#link1'))
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

(4) Combined selectors

print(soup.select('p #link1'))
# Search for direct children
print(soup.select('head > title'))

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
[<title>The Dormouse's story</title>]

(5) Search by attribute: attributes go in square brackets. An attribute selector must be attached directly to its tag with no space in between, or the selector won't match.

print(soup.select('a[class="sister"]'))
print(soup.select('a[href="http://example.com/elsie"]'))
print(soup.select('p a[href="http://example.com/elsie"]'))

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

(6) You can also iterate over the results and call get_text() to extract their content.

print(type(soup.select('title')))
print(soup.select('title')[0].get_text())
for title in soup.select('title'):
    print(title.get_text())


<class 'list'>
The Dormouse's story
The Dormouse's story

 
