Web Scraping Notes: BeautifulSoup in Detail

BeautifulSoup

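Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case you only need to state the original encoding.
Beautiful Soup works on top of Python parsers such as lxml and html5lib, so you can choose between different parsing strategies or trade flexibility for speed.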
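Stating the original encoding can be sketched as follows. This is a minimal example, assuming the input bytes are Latin-1 encoded (the sample bytes are made up for illustration); from_encoding is the constructor argument Beautiful Soup offers for this, and html.parser is used here only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical input: Latin-1 bytes with no charset declaration ("caf\u00e9" is "cafe" with an accented e).
raw = "<p>caf\u00e9</p>".encode("latin-1")

# State the source encoding explicitly instead of relying on detection.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string                  # decoded to a Unicode string
utf8_bytes = soup.p.encode("utf-8")   # serialized back out as UTF-8 bytes
print(utf8_bytes)
```

Inside the tree everything is Unicode; the encoding only matters when bytes come in or go out.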

Usage in Detail

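Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; lenient with malformed documents | Less lenient in Python versions before 2.7.3 and 3.2.2
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; lenient with malformed documents | Requires an external C library
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only XML parser supported | Requires an external C library
html5lib | BeautifulSoup(markup, "html5lib") | Most lenient; parses documents the way a browser does; produces valid HTML5 | Slow; requires an external Python package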
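Because lxml and html5lib are separate installs, a script may want to fall back to the built-in parser when they are missing. A minimal sketch, assuming a hypothetical helper named make_soup (not part of Beautiful Soup); FeatureNotFound is the exception bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: try parsers in order, skipping any that are not installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not available, try the next one
    raise RuntimeError("no parser available")


soup, used = make_soup("<p>Hello</p>")
print(used, soup.p.get_text())
```

Different parsers can build different trees from the same broken markup, so a fallback like this suits well-formed input better than heavily malformed pages.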

Basic Usage

html = """
The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were , Lacie and Tillie; and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # parse with the lxml parser
print(soup.prettify())              # pretty-print the tree; unclosed tags are closed automatically
print(soup.title.string)            # print the text inside the title tag

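prettify() pretty-prints the document. The html given above is incomplete (some elements have only an opening tag); the parser fills in the missing closing tags, so the output of soup.prettify() is complete markup. soup.title.string prints the text of the title tag.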
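The same steps can be tried on a self-contained snippet. The markup below is hypothetical (tag and attribute names are assumptions for illustration), with the closing p, body and html tags deliberately left out; html.parser is used so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: </p>, </body> and </html> are intentionally missing.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">...
"""

soup = BeautifulSoup(html, "html.parser")
pretty = soup.prettify()        # indented markup, with the missing end tags added
title_text = soup.title.string  # text inside <title>
print(pretty)
print(title_text)
```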

Tag Selectors
Selecting elements

html = """
The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were , Lacie and Tillie; and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # choose the parser
print(soup.title)        # select the title tag (a tag is the part written in <>)
print(type(soup.title))  # bs4.element.Tag
print(soup.head)         # select the head tag
print(soup.p)            # select the p tag

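The p tag is a special case here: the document contains several, but only the first is printed. Selecting by tag name this way always returns a single element, the first match.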
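A short sketch of the first-match behaviour, using hypothetical markup with two paragraphs. It also shows that a tag name absent from the document yields None rather than an error.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup for illustration.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">first paragraph</p>
<p class="story">second paragraph</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p            # attribute access returns only the first <p>
all_p = soup.find_all("p")  # find_all returns every <p>
missing = soup.table        # a tag that does not occur gives None

print(isinstance(first_p, Tag), first_p["class"], len(all_p), missing)
```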
Getting the tag name

html = """
The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were , Lacie and Tillie; and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)  # name of the title tag

The output is the name of the tag, title.

Getting tag attributes

html = """
The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were , Lacie and Tillie; and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])  # value of the name attribute, via the attrs dictionary
print(soup.p['name'])        # same value, via direct indexing

Both forms return the value of the name attribute of the first p tag.
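Indexing raises KeyError when the attribute is absent, so .get() is often the safer choice. A minimal sketch on hypothetical markup (the attribute values are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = '<p class="title lead" name="dormouse">The Dormouse\'s story</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

name_value = p["name"]  # same as p.attrs["name"]; raises KeyError if absent
classes = p["class"]    # class is multi-valued, so it comes back as a list
missing = p.get("id")   # .get() returns None instead of raising

print(name_value, classes, missing)
```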

Getting the content of a tag

html = """
The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were , Lacie and Tillie; and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)  # text of the p tag; only the first p tag is read

This returns the text of the p tag; only the first p tag is read.
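One caveat worth checking: .string is None when a tag has more than one child, while get_text() joins all text inside the tag. A sketch on hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Their names were <a href="#">Lacie</a> and <a href="#">Tillie</a>.</p>
"""
soup = BeautifulSoup(html, "html.parser")
title_p, story_p = soup.find_all("p")

single = title_p.string    # one child chain (<p> -> <b> -> text), so the text is returned
mixed = story_p.string     # several children, so .string is None
full = story_p.get_text()  # concatenates all text inside the tag

print(single, mixed, full)
```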

Nested selection


html = """
The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were , Lacie and Tillie; and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)  # nested selection: chain tag names with dots, one level at a time

Child and descendant nodes


html = """

    
        The Dormouse's story
    
    
        

Once upon a time there were three little sisters; and their names were Elsie Lacie and Tillie and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)  # direct children of the p tag, as a list

contents returns the direct children of a tag (tags and text alike), not all of its descendants. The result is a list.

html = """

    
        The Dormouse's story
    
    
        

Once upon a time there were three little sisters; and their names were Elsie Lacie and Tillie and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

children also yields the direct children of a tag, but as an iterator rather than a list.


html = """

    
        The Dormouse's story
    
    
        

Once upon a time there were three little sisters; and their names were Elsie Lacie and Tillie and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)  # all descendants, not only direct children; the result is an iterator
for i, child in enumerate(soup.p.descendants):
    print(i, child)

descendants returns every descendant node, not only the direct children. The result is an iterator.
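The difference between the three is easiest to see by counting nodes. A minimal sketch on hypothetical markup in which one link wraps a span:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup for illustration.
html = '<p><a href="#"><span>Elsie</span></a> and <a href="#">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

direct = p.contents               # list of direct children: <a>, " and ", <a>
direct_iter = list(p.children)    # the same nodes, produced lazily
everything = list(p.descendants)  # children, grandchildren, ... in document order

# Keep only tags to show which elements the walk passes through.
tag_names = [node.name for node in everything if isinstance(node, Tag)]
print(len(direct), len(direct_iter), len(everything), tag_names)
```

Text nodes count as nodes too, which is why descendants yields more items than the number of tags.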

Parent and ancestor nodes


html = """

    
        The Dormouse's story
    
    
        

Once upon a time there were three little sisters; and their names were Elsie Lacie and Tillie and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)

soup.a.parent returns the parent node of the first a tag.

html = """

    
        The Dormouse's story
    
    
        

Once upon a time there were three little sisters; and their names were Elsie Lacie and Tillie and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.parents)))

soup.a.parents yields the parent node and every ancestor above it.
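Printing only the names makes the ancestor chain easier to read. A sketch on hypothetical markup; the last entry, [document], is the BeautifulSoup object itself, which sits at the top of the tree.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = "<html><body><p><a href='#'>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")
a = soup.a

parent_name = a.parent.name                       # the direct parent
ancestor_names = [tag.name for tag in a.parents]  # nearest ancestor first

print(parent_name, ancestor_names)
```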

Sibling nodes

html = """

    
        The Dormouse's story
    
    
        

Once upon a time there were three little sisters; and their names were Elsie Lacie and Tillie and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))      # all siblings after the first a tag
print(list(enumerate(soup.a.previous_siblings)))  # all siblings before it

Siblings are nodes at the same level, that is, nodes that share the same parent.
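A small sketch of sibling navigation on hypothetical markup. Text between tags counts as a sibling too, and previous_siblings runs backwards from the starting node.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = ('<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a>'
        ' and <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, "html.parser")
middle = soup.find("a", id="link2")

after = [str(node) for node in middle.next_siblings]       # text and tags that follow
before = [str(node) for node in middle.previous_siblings]  # nearest sibling first
next_tag = middle.find_next_sibling("a")                   # skips the text node in between

print(after, before, next_tag["id"])
```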

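Standard selector: find_all
The methods above select by tag name, which is rarely enough in practice because an HTML document contains many tags with the same name. Other selectors are needed.
find_all(name, attrs, recursive, text, **kwargs)
Searches the document by tag name, attributes, or text content. Since Beautiful Soup 4.4.0 the text argument is called string; text is the older name.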

Selecting by tag name (name)


html='''

Hello

  • Foo
  • Bar
  • Jay
  • Foo
  • Bar
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))

find_all('ul') finds every ul tag, not just the first. The result is returned as a list.
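A few more forms of the name argument, sketched on hypothetical markup (the class and id values are assumptions for illustration). limit caps the number of results, and a list of names matches any of them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<div class="panel">
  <div class="panel-heading"><h4>Hello</h4></div>
  <ul class="list" id="list-1">
    <li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li>
  </ul>
  <ul class="list" id="list-2">
    <li class="element">Foo</li><li class="element">Bar</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

lists = soup.find_all("ul")                                 # every <ul>
items_per_list = [len(ul.find_all("li")) for ul in lists]   # nested search inside each <ul>
first_two = [li.get_text() for li in soup.find_all("li", limit=2)]  # stop after two matches
several = [tag.name for tag in soup.find_all(["h4", "ul"])]         # any of these names

print(items_per_list, first_two, several)
```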

Nested selection by tag name

html='''

Hello

  • Foo
  • Bar
  • Jay
  • Foo
  • Bar
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):  # the result is a list, so it can be iterated
    print(ul.find_all('li'))

Selecting with attrs

html='''

Hello

  • Foo
  • Bar
  • Jay
  • Foo
  • Bar
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))  # find tags whose id is list-1
print('----')
print(soup.find_all(attrs={'name': 'elements'}))

attrs takes a dictionary of attribute names and values.
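The dictionary form is needed for attributes that cannot be written as keyword arguments, such as name (which clashes with the first parameter of find_all) or data-* attributes (which are not valid Python identifiers). A sketch on hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<ul id="list-1" name="elements">
  <li data-role="item">Foo</li>
  <li data-role="item">Bar</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_name_attr = soup.find_all(attrs={"name": "elements"})   # the HTML attribute called name
by_data_attr = soup.find_all(attrs={"data-role": "item"})  # hyphenated attribute name
both = soup.find_all("ul", attrs={"id": "list-1", "name": "elements"})  # all conditions must hold

print(len(by_name_attr), len(by_data_attr), len(both))
```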

attrs can be omitted: pass the attribute directly as a keyword argument


html='''

Hello

  • Foo
  • Bar
  • Jay
  • Foo
  • Bar
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))  # class is a Python keyword, so the argument is spelled class_
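How class_ matches is worth a quick check: a single class name matches any tag that carries it, even alongside other classes. A sketch on hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<ul id="list-1">
  <li class="element">Foo</li>
  <li class="element active">Bar</li>
  <li class="other">Jay</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="list-1")
# Matches every tag whose class list contains "element".
by_class = [li.get_text() for li in soup.find_all(class_="element")]
# The full string matches only that exact class attribute value.
exact = [li.get_text() for li in soup.find_all(class_="element active")]

print(len(by_id), by_class, exact)
```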

Selecting by text content (text)

html='''

Hello

  • Foo
  • Bar
  • Jay
  • Foo
  • Bar
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))  # find text nodes whose content is Foo

The result is the matching text itself, not the tags that contain it.
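Since the match is a text node, the enclosing tag is reached through .parent. The sketch below uses the newer argument name string (Beautiful Soup 4.4.0 and later) on hypothetical markup, and also shows a regular expression as the search value.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = "<ul><li>Foo</li><li>Bar</li><li>Jay</li><li>Foo</li></ul>"
soup = BeautifulSoup(html, "html.parser")

matches = soup.find_all(string="Foo")                      # text nodes equal to "Foo"
exact = [str(s) for s in matches]
owners = [s.parent.name for s in matches]                  # tag that holds each text node
pattern = [str(s) for s in soup.find_all(string=re.compile("^(Bar|Jay)$"))]
li_tags = soup.find_all("li", string="Foo")                # with a name, tags are returned

print(exact, owners, pattern, len(li_tags))
```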

The find selector
find(name, attrs, recursive, text, **kwargs)
find returns a single element (the first match); find_all returns all matching elements.
It is used the same way as find_all.


html='''

Hello

  • Foo
  • Bar
  • Jay
  • Foo
  • Bar
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))
print(type(soup.find('ul')))
print(soup.find('page'))  # None: the document has no page tag
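Because find returns None when nothing matches, it helps to check the result before using it. A minimal sketch on hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = '<ul id="list-1"><li>Foo</li><li>Bar</li></ul>'
soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li")   # first match only
missing = soup.find("page")  # no such tag, so None

# Guard against None before calling methods on the result.
page_text = missing.get_text() if missing is not None else "(no page tag)"

print(first_li.get_text(), missing, page_text)
```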

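Other find methods

find_parents() and find_parent()
find_parents() returns all ancestor nodes; find_parent() returns the direct parent.
find_next_siblings() and find_next_sibling()
find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
find_previous_siblings() and find_previous_sibling()
find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
find_all_next() and find_next()
find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node after it.
find_all_previous() and find_previous()
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node before it.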
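These methods can be compared side by side by starting from one element. A sketch on hypothetical markup in which every li carries an id so the results are easy to read:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<div id="box">
  <ul id="list-1"><li id="a">Foo</li><li id="b">Bar</li><li id="c">Jay</li></ul>
  <ul id="list-2"><li id="d">Foo</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
b = soup.find("li", id="b")

parent_id = b.find_parent("ul")["id"]                             # nearest matching ancestor
ancestors = [t.name for t in b.find_parents(["ul", "div"])]       # nearest first
next_ids = [li["id"] for li in b.find_next_siblings("li")]        # same parent, after b
prev_id = b.find_previous_sibling("li")["id"]                     # same parent, just before b
later_ids = [li["id"] for li in b.find_all_next("li")]            # anywhere later in the document
earlier_ids = [li["id"] for li in b.find_all_previous("li")]      # anywhere earlier in the document

print(parent_id, ancestors, next_ids, prev_id, later_ids, earlier_ids)
```

The sibling methods stay inside the same parent, while find_all_next() and find_all_previous() cross into other parts of the document.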

CSS selectors
Pass a CSS selector directly to select() to make a selection.
Prefix a class name with a dot (.) and an id with a hash (#).

html='''

Hello

  • Foo
  • Bar
  • Jay
  • Foo
  • Bar
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))  # class=panel-heading nested inside class=panel
print(soup.select('ul li'))                  # li tags nested inside ul tags
print(soup.select('#list-2 .element'))       # class=element nested inside id=list-2
print(type(soup.select('ul')[0]))            # select by tag name
html='''

Hello

  • Foo
  • Bar
  • Jay
  • Foo
  • Bar
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

A tag name is written as-is, with no prefix.
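The selector forms above can be tried on a self-contained snippet. The markup is hypothetical (class and id values are assumptions for illustration); select_one() is the single-result counterpart of select().

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<div class="panel">
  <div class="panel-heading"><h4>Hello</h4></div>
  <ul class="list" id="list-1">
    <li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li>
  </ul>
  <ul class="list" id="list-2">
    <li class="element">Foo</li><li class="element">Bar</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

headings = soup.select(".panel .panel-heading")   # class inside class
items_in_list2 = [li.get_text() for li in soup.select("#list-2 .element")]  # class inside id
first_item = soup.select_one("ul li")             # first match only, or None
direct = soup.select("ul#list-1 > li")            # direct children of one specific <ul>

print(len(headings), items_in_list2, first_item.get_text(), len(direct))
```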

Getting attributes with CSS selectors
Use [] to read an attribute.

html='''

Hello

  • Foo
  • Bar
  • Jay
  • Foo
  • Bar
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])        # read the id attribute
    print(ul.attrs['id'])  # same value through the attrs dictionary

Getting text with CSS selectors
get_text() returns the text content.

html='''

Hello

  • Foo
  • Bar
  • Jay
  • Foo
  • Bar
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())  # get_text() returns the text content
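Attribute access and get_text() are usually combined to turn markup into plain data. A minimal sketch on hypothetical markup, collecting each list id together with its item texts; strip=True trims the whitespace around each piece of text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<ul id="list-1">
  <li class="element"> Foo </li>
  <li class="element"> Bar </li>
</ul>
<ul id="list-2">
  <li class="element"> Jay </li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# One dictionary per <ul>: its id attribute plus the cleaned text of its items.
records = [
    {"list": ul["id"], "items": [li.get_text(strip=True) for li in ul.select("li")]}
    for ul in soup.select("ul")
]
print(records)
```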

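Summary

  • Prefer the lxml parser; fall back to html.parser when necessary
  • Selecting by tag name filters weakly but is fast
  • Use find() and find_all() to match a single result or multiple results
  • Use select() if you are comfortable with CSS selectors
  • Remember the common ways to get attribute and text values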

Author: 电气-余登武
