BeautifulSoup

Import and usage

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

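Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in Python versions before 2.7.3 or 3.2.2
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires a C library to be installed
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires a C library to be installed
html5lib | BeautifulSoup(markup, "html5lib") | Most tolerant; parses documents the way a browser does; produces HTML5-format documents | Slow; requires the external html5lib package (pure Python, no C extension)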
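As a rough illustration of the trade-offs in the table, the sketch below tries the faster lxml parser first and falls back to the built-in html.parser when lxml is not installed. The helper name make_soup and the sample markup are illustrative assumptions, not part of Beautiful Soup; FeatureNotFound is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first available parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one in the list.
            continue
    raise RuntimeError("none of the requested parsers is installed")


# Deliberately unclosed tags, to exercise the parser's tolerance of bad markup.
soup, parser_used = make_soup("<p>Hello <b>world")
print(parser_used)      # "lxml" if it is installed, otherwise "html.parser"
print(soup.b.string)    # the parser closes the <b> tag for us
```

Listing html.parser last means the call still works on a machine with only the standard library plus bs4.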

Example 1

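html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""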

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.title)

print(soup.title.string)

print(type(soup.title))

print(soup.head)

print(soup.p)

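<title>The Dormouse's story</title>

The Dormouse's story

<class 'bs4.element.Tag'>

<head><title>The Dormouse's story</title></head>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>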

Basic usage

The difference between soup.title and soup.title.string is that soup.title returns the whole title tag, while .string returns only the text inside it.
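A minimal sketch of that difference, using a short made-up document and html.parser (chosen here only so that no C library is needed). It also shows a case the text above does not mention: .string is None when a tag has more than one child, and get_text() can be used instead.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Illustrative document, not the one used in the article.
doc = ("<html><head><title>The Dormouse's story</title></head>"
       "<body><p>Once <b>upon</b> a time</p></body></html>")
soup = BeautifulSoup(doc, "html.parser")

title_tag = soup.title           # the whole <title> element
title_text = soup.title.string   # only the text inside it

print(type(title_tag).__name__)   # Tag
print(type(title_text).__name__)  # NavigableString
print(title_tag)                  # markup, including the <title> tags
print(title_text)                 # text only

# <p> has three children (text, <b>, text), so .string cannot pick one.
mixed = soup.p.string             # None
full_text = soup.p.get_text()     # all text under <p>, concatenated
```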

Getting attributes

print(soup.p.attrs['name'])

print(soup.p['name'])

dromouse

dromouse
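Both forms read the same attribute dictionary. The sketch below, on a one-line document assumed for illustration, adds two details that are easy to trip over: indexing a missing attribute raises KeyError, while .get() returns None or a default, and class comes back as a list because it can hold several values.

```python
from bs4 import BeautifulSoup

# Illustrative one-paragraph document.
doc = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(doc, "html.parser")
p = soup.p

name_via_attrs = p.attrs["name"]   # explicit dictionary access
name_via_index = p["name"]         # shorthand for the same lookup

missing = p.get("id")              # None: the attribute does not exist
fallback = p.get("id", "no-id")    # default value instead of None

try:
    p["id"]                        # indexing a missing attribute fails
    raised = False
except KeyError:
    raised = True

classes = p["class"]               # a list, since class is multi-valued
print(name_via_attrs, name_via_index, missing, fallback, raised, classes)
```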

Nested selection

print(soup.head.title.string)

The Dormouse's story
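Each step in such a chain returns the first matching tag inside the previous one, or None when nothing matches, so a long chain can fail with AttributeError. A small sketch with an assumed document:

```python
from bs4 import BeautifulSoup

# Illustrative document.
doc = ("<html><head><title>The Dormouse's story</title></head>"
       "<body><p>text</p></body></html>")
soup = BeautifulSoup(doc, "html.parser")

# head -> title -> text of the title
nested = soup.head.title.string

# <title> lives under <head>, so searching inside <body> finds nothing.
in_body = soup.body.title          # None

# Guard before chaining further, otherwise None.string raises AttributeError.
footer = soup.footer               # None: there is no <footer> tag
footer_text = footer.string if footer is not None else None

print(nested, in_body, footer_text)
```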

Child and descendant nodes

.contents returns a list of all direct child nodes of a tag.

print(soup.p.contents)

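['\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n and\n ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n and they lived at the bottom of a well.\n ']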
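The output above has the story paragraph as the first p tag, and the later descendants output shows Elsie nested one level deeper, which suggests this section was run against a slightly different document from Example 1. The sketch below is one such document. Its markup is an assumption reconstructed from the printed results, and html.parser is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed markup: the story paragraph comes first, and the first link
# wraps its text in a <span>. Not taken verbatim from the article.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
 Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">
  <span>Elsie</span>
 </a>
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
 and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
 and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
contents = soup.p.contents          # a plain list of direct children

print(type(contents).__name__)      # list
print(len(contents))                # text and tag nodes, interleaved

# Keep only the tag children; the rest are text nodes (mostly whitespace).
tag_names = [node.name for node in contents if isinstance(node, Tag)]
print(tag_names)
```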

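.children yields the same direct child nodes as .contents, but it returns an iterator rather than a list, so you need to loop over it to get the nodes.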

print(soup.p.children)

for i, child in enumerate(soup.p.children):

    print(i, child)

Output

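0

Once upon a time there were three little sisters; and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">

<span>Elsie</span>

</a>

2

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

4

and

5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

6

and they lived at the bottom of a well.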
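Because .children is an iterator, it has no length and cannot be indexed, and it is often filtered while looping. A minimal sketch on a made-up paragraph, keeping only the tag children and skipping the text nodes:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative paragraph with three links.
doc = ('<p>Once upon a time <a id="link1">Elsie</a>, '
       '<a id="link2">Lacie</a> and <a id="link3">Tillie</a>.</p>')
soup = BeautifulSoup(doc, "html.parser")

children = soup.p.children                 # an iterator, consumed as it is read
is_iterator = iter(children) is children   # true for iterators, false for lists

# Materialise the iterator when a list is needed; it matches .contents.
as_list = list(soup.p.children)
same_as_contents = as_list == soup.p.contents

# Text nodes are children too, so filter by type to get only the tags.
link_ids = [child["id"] for child in soup.p.children if isinstance(child, Tag)]
print(is_iterator, same_as_contents, link_ids)
```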

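.descendants yields every descendant node: children, their children, and so on, not only the direct children (compare with the .children output above). It also returns an iterator.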

print(soup.p.descendants)

for i, child in enumerate(soup.p.descendants):

    print(i, child)

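0

Once upon a time there were three little sisters; and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">

<span>Elsie</span>

</a>

2

3 <span>Elsie</span>

4 Elsie

5

6

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

8 Lacie

9

and

10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

11 Tillie

12

and they lived at the bottom of a well.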
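To make the contrast with .children concrete, the sketch below counts both on a small assumed fragment. Descendants are visited depth-first in document order, so a tag is followed by everything inside it before its next sibling appears.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Illustrative fragment: the first link nests its text inside a <span>.
doc = "<p>names: <a><span>Elsie</span></a> and <a>Lacie</a></p>"
soup = BeautifulSoup(doc, "html.parser")

direct = list(soup.p.children)         # only the first level under <p>
everything = list(soup.p.descendants)  # every level, depth-first

# Split the descendants into tags and text nodes.
tag_names = [node.name for node in everything if isinstance(node, Tag)]
texts = [str(node) for node in everything if isinstance(node, NavigableString)]

print(len(direct), len(everything))
print(tag_names)
print(texts)
```

The nested span and the text inside each tag account for the extra entries in the descendants list.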
