Contents
Basic Usage
Tag Selectors
Selecting Elements
Getting the Name
Getting Attributes
Getting Content
Nested Selection
Children and Descendants
Parents and Ancestors
Siblings
Standard Selectors
name
attrs
With attrs:
Without attrs (more convenient):
text
find(name, attrs, recursive, text, **kwargs)
find_parents() find_parent()
find_next_siblings() find_next_sibling()
find_previous_siblings() find_previous_sibling()
find_all_next() find_next()
find_all_previous() and find_previous()
CSS Selectors
Iterative Selection
Getting Attributes
Getting Content
Summary
BeautifulSoup is a flexible and convenient web-page parsing library. It is efficient, supports multiple parsers, and lets you extract information from web pages without writing regular expressions.
Purpose: parsing web pages
| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, 'html.parser') | Built into Python, moderate speed, good document fault tolerance | Poor document fault tolerance in versions before Python 2.7.3 / 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, 'lxml') | Fast, good document fault tolerance | Requires the C library to be installed |
| lxml XML parser | BeautifulSoup(markup, 'xml') | Fast, the only parser that supports XML | Requires the C library to be installed |
| html5lib | BeautifulSoup(markup, 'html5lib') | Best fault tolerance, parses documents the way a browser does, generates HTML5-format documents | Slow, depends on an external Python library |
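To illustrate the usage column, the same markup can be handed to different parsers simply by changing the second argument. A minimal sketch (the one-line sample markup below is made up for demonstration):
from bs4 import BeautifulSoup
markup = '<p class="title">The Dormouse\'s story'  # deliberately unclosed tag
# Both parsers repair the incomplete markup and close the <p> tag.
print(BeautifulSoup(markup, 'html.parser').p)
# <p class="title">The Dormouse's story</p>
print(BeautifulSoup(markup, 'lxml').p)
# <p class="title">The Dormouse's story</p>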
from bs4 import BeautifulSoup
html = """web page markup"""
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())  # pretty-print the HTML and auto-complete missing tags
print(soup.title.string)  # print the text of the <title> tag
from bs4 import BeautifulSoup
html = """web page markup"""
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
# prints the <title> tag together with its content
print(type(soup.title))
# <class 'bs4.element.Tag'>
print(soup.head)
# prints the <head> tag together with its content
print(soup.p)
# prints the first <p> tag together with its content
from bs4 import BeautifulSoup
html = """web page markup"""
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)  # print the tag's name
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
Output:
dromouse
dromouse
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.string)
Output:
The Dormouse's story
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.head.title.string)
Output:
The Dormouse's story
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
    </body>
</html>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)  # get all direct child nodes
Output:
['\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n and\n ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n and they lived at the bottom of a well.\n ']
The result is returned as a list.
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
    </body>
</html>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)  # get all direct child nodes
for i, child in enumerate(soup.p.children):
    # enumerate() yields each child together with its index
    print(i, child)
The result is an iterator, so it has to be looped over (unlike .contents, which is a list).
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
    </body>
</html>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)  # get all descendant nodes (returns a generator)
for i, child in enumerate(soup.p.descendants):
    print(i, child)
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
    </body>
</html>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)  # print the parent node of the first <a> tag
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
    </body>
</html>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.parents)))  # print all ancestor nodes
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
    </body>
</html>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))  # siblings after the <a> tag
print(list(enumerate(soup.a.previous_siblings)))  # siblings before the <a> tag
Output:
[(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n and\n '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n and they lived at the bottom of a well.\n ')]
[(0, '\n Once upon a time there were three little sisters; and their names were\n ')]
find_all(name,attrs,recursive,text,**kwargs)
Purpose: search the document by tag name, attributes, or text content.
Searching by tag name
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))  # extract all <ul> tags
print(type(soup.find_all('ul')[0]))
Output:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))
Output:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
Attributes are passed to attrs as a dictionary: the keys are attribute names and the values are attribute values.
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))  # class is a Python keyword, so use class_ instead
Output:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
Selecting by text content
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))
Output:
['Foo', 'Foo']
The matched text strings themselves are returned, not the tags.
Use: matching by content.
find() returns a single element; find_all() returns all matching elements.
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))
print(type(soup.find('ul')))
print(soup.find('page'))
Output:
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<class 'bs4.element.Tag'>
None
find_parents() returns all ancestor nodes; find_parent() returns the direct parent.
find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node after it.
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node before it.
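A minimal sketch of these methods, reusing the ul/li markup from the examples above (the variable name first_li is just illustrative):
from bs4 import BeautifulSoup
html = '''
<div class="panel">
    <ul class="list" id="list-1">
        <li class="element">Foo</li>
        <li class="element">Bar</li>
        <li class="element">Jay</li>
    </ul>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
first_li = soup.find('li')
print(first_li.find_parent('ul')['id'])           # nearest matching ancestor: list-1
print([p.name for p in first_li.find_parents()])  # all ancestors: ['ul', 'div', 'body', 'html', '[document]']
print(first_li.find_next_sibling('li').string)    # first following sibling <li>: Bar
print(len(first_li.find_next_siblings('li')))     # all following sibling <li> tags: 2
print(first_li.find_next('li').string)            # first matching node after it in document order: Bar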
With select() you can pass in a CSS selector directly to make a selection.
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))
Output:
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])        # method 1
    print(ul.attrs['id'])  # method 2
Output:
list-1
list-1
list-2
list-2
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())
Output:
Foo
Bar
Jay
Foo
Bar
The lxml parser is recommended; use html.parser when necessary (e.g. when the markup is extremely messy).
Tag (attribute-style) selection is fast but its filtering ability is weak.
Use find() and find_all() to query for a single match or multiple matches.
If you are familiar with CSS selectors, select() is recommended.
Remember the common ways of getting attribute values and text.
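Putting these recommendations together, a minimal end-to-end sketch reusing the ul/li markup from the examples above:
from bs4 import BeautifulSoup
html = '''
<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
</ul>
'''
soup = BeautifulSoup(html, 'lxml')
# find_all(): query by tag name and attributes
for li in soup.find_all('li', class_='element'):
    print(li.get_text())          # text value: Foo, then Bar
# select(): the same query with a CSS selector
ul = soup.select('#list-1')[0]
print(ul['id'], ul.attrs['id'])   # attribute value: list-1 list-1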