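BeautifulSoup is a flexible and convenient library for parsing web pages. It is efficient, supports several parsers, and lets you extract information from a page without writing regular expressions.
Comparison of the available parsers: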
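Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Less tolerant of malformed documents in versions before Python 2.7.3 or 3.2.2 |
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires installing a C library |
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires installing a C library |
html5lib | BeautifulSoup(markup, "html5lib") | Most tolerant; parses documents the way a browser does; produces valid HTML5 | Slow; requires installing an external Python package |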
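Since the parsers in the table differ in what has to be installed, one possible approach is to try a preferred parser and fall back to the built-in one. The sketch below is only an illustration: `make_soup` and its `preferred` parameter are hypothetical names, not part of Beautiful Soup. It relies on `bs4.FeatureNotFound`, which Beautiful Soup raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: parse with the first installed parser in `preferred`."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


# The fragment is deliberately missing its closing tags.
soup = make_soup("<p>Hello <b>world")
text = soup.get_text()
print(text)
```

Both parsers repair the unclosed tags, so the extracted text is the same whichever one ends up being used.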
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)
Result:
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title" name="dromouse">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
The Dormouse's story
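The parser automatically completes the document (here it adds the missing closing body and html tags), and prettify() prints it in a standard layout with one tag per line.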
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)
Result:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
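The output above shows that accessing a tag by name, as in soup.p, returns only the first matching tag even though the document contains several p tags. A minimal sketch of that behaviour follows; it uses a shortened document and the built-in "html.parser" so that it runs without lxml, both of which are choices made for this illustration.

```python
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time...</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p            # attribute access: only the first <p>
same_p = soup.find("p")     # the equivalent explicit call
all_p = soup.find_all("p")  # every <p> in the document
no_table = soup.table       # a tag that does not exist gives None

print(first_p.name)
print(first_p is same_p)
print(len(all_p))
print(no_table)
```

When every occurrence is needed, find_all() is the call to reach for; attribute access is a shortcut for the first match.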
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
Result:
dromouse
dromouse
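Both forms above raise KeyError when the attribute is missing. As a hedged illustration of the alternatives, the sketch below reads attributes from a one-line document invented for this example, again with "html.parser" so that lxml is not required. It also shows that class is returned as a list, because an HTML element can carry several classes.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" name="dromouse">text</p>', "html.parser")
p = soup.p

attrs = p.attrs                  # dictionary of all attributes
name = p["name"]                 # raises KeyError if the attribute is missing
missing = p.get("id")            # returns None instead of raising
fallback = p.get("id", "no-id")  # or a default of your choice
classes = p["class"]             # multi-valued attribute: a list of strings

print(attrs)
print(name, missing, fallback, classes)
```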
Use the .string attribute to get the text content of a tag.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)
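One detail worth knowing about .string is that it only returns text when a tag has a single child (or a single chain of nested children); for a tag with several children it returns None. The sketch below illustrates this on a shortened document made up for this example, parsed with "html.parser" so that lxml is not needed.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once <a>Lacie</a> and more</p>'
)
soup = BeautifulSoup(html, "html.parser")
title_p, story_p = soup.find_all("p")

single = title_p.string         # one nested child, so the text is returned
several = story_p.string        # several children, so this is None
all_text = story_p.get_text()   # get_text() joins every piece of text

print(single)
print(several)
print(all_text)
```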
The .contents attribute returns all direct child nodes of a tag as a list.
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)
Output:
['\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1">\n<span>Elsie</span>\n</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n and\n ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n and they lived at the bottom of a well.\n ']
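The result is a list, which is hard to read. The .children attribute yields the same child nodes as an iterator, so they can be printed one at a time.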
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
Result:
<list_iterator object at 0x1064f7dd8>
0
Once upon a time there were three little sisters; and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2
3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4
and
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6
and they lived at the bottom of a well.
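As the output shows, the children of a tag include the text between tags as well as the tags themselves. If only the tags are wanted, one option is to filter on the node type. The sketch below assumes a shortened, single-line document written for this example and uses "html.parser" so that it runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<p>Once <a id="link1">Elsie</a>, <a id="link2">Lacie</a>'
    ' and <a id="link3">Tillie</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# .contents is a list; .children iterates over the same nodes lazily.
child_count = len(soup.p.contents)

# Keep only the children that are tags, skipping the text nodes.
tag_children = [child for child in soup.p.children if isinstance(child, Tag)]
ids = [tag["id"] for tag in tag_children]

print(child_count)
print(ids)
```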
Use .descendants to get all descendant nodes (children, grandchildren, and so on):
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)
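To make the difference between children and descendants concrete, the sketch below counts both for a small nested fragment. The fragment is invented for this illustration and parsed with "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p>one <a><span>Elsie</span></a> two</p>", "html.parser")

# Direct children of <p>: the text 'one ', the <a> tag, and the text ' two'.
children = list(soup.p.children)

# Descendants also include what is nested inside <a>: the <span> and its text.
descendants = list(soup.p.descendants)
descendant_tags = [node.name for node in descendants if isinstance(node, Tag)]

print(len(children), len(descendants))
print(descendant_tags)
```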
The .parent attribute returns a node's parent.
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)
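This prints the parent of the first a tag, which is the p tag.
The .parents attribute yields all ancestor nodes, the reverse of .descendants.
Use .next_siblings to get the sibling nodes that follow a node.
Use .previous_siblings to get the sibling nodes that precede it.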
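The ancestor and sibling attributes are described above without an example, so here is a minimal sketch of how they might be used. The document is a small fragment invented for this illustration, written without whitespace between tags so that every sibling is a tag, and parsed with "html.parser" so that lxml is not needed.

```python
from bs4 import BeautifulSoup

html = '<div><p id="a">A</p><p id="b">B</p><p id="c">C</p></div>'
soup = BeautifulSoup(html, "html.parser")
b = soup.find(id="b")

parent_name = b.parent.name                              # the immediate parent
ancestors = [node.name for node in b.parents]            # every ancestor, nearest first
following = [node["id"] for node in b.next_siblings]     # siblings after b
preceding = [node["id"] for node in b.previous_siblings] # siblings before b

print(parent_name)
print(ancestors)
print(following, preceding)
```

The last ancestor is the BeautifulSoup object itself, whose name is '[document]'. In a document with line breaks between tags, the sibling iterators would also yield those whitespace strings.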
find_all() searches the document by tag name, attributes, or text content.
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))
The find_all() method returns every tag with the given name.
The result is a list of the ul tags, each of type Tag:
[<ul class="list" id="list-1">\n<li class="element">Foo</li>\n<li class="element">Bar</li>\n<li class="element">Jay</li>\n</ul>,
<ul class="list list-small" id="list-2">\n<li class="element">Foo</li>\n<li class="element">Bar</li>\n</ul>]
<class 'bs4.element.Tag'>
Because each result is a Tag, find_all() can be called on it again. The following code finds the li tags under every ul tag:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
Result:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))
The attrs argument takes a dictionary of attribute names and values. Result:
[<ul class="list" id="list-1" name="elements">\n<li class="element">Foo</li>\n<li class="element">Bar</li>\n<li class="element">Jay</li>\n</ul>]
[<ul class="list" id="list-1" name="elements">\n<li class="element">Foo</li>\n<li class="element">Bar</li>\n<li class="element">Jay</li>\n</ul>]
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))  # class is a Python keyword, so the argument is spelled class_
Result:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
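One point the class_ example does not show is how tags with several classes are matched. As a hedged sketch: class_ matches when any one of a tag's classes equals the given value, and the limit argument caps the number of results. The document below is a shortened version of the one above, and "html.parser" is used so that the example runs without lxml.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Foo</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any single class, so both <ul> tags are found here.
lists = soup.find_all(class_="list")

# Combining a tag name with class_ narrows the search.
small = soup.find_all("ul", class_="list-small")

# limit stops the search after the given number of matches.
first_two = soup.find_all("li", limit=2)

print([ul["id"] for ul in lists])
print([ul["id"] for ul in small])
print(len(first_two))
```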
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))
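Searching by text gives:
['Foo', 'Foo']
The text argument matches text content, so the results are strings rather than tags. Newer versions of Beautiful Soup call this argument string and keep text as the older name.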
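Text searches do not have to be exact matches. The sketch below, which assumes Beautiful Soup 4.4.0 or later (where the argument is named string), shows an exact match, a regular-expression match, and a search that returns the enclosing tags instead of the strings. The list items are invented for this illustration, and "html.parser" is used so that lxml is not required.

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>Foo</li><li>Bar</li><li>Food</li></ul>", "html.parser")

# Exact match on the text: returns strings, not tags.
exact = soup.find_all(string="Foo")

# A compiled regular expression matches any text it can find a match in.
prefixed = soup.find_all(string=re.compile("^Foo"))

# Adding a tag name returns the tags whose text matches.
tags = soup.find_all("li", string=re.compile("^Foo"))

print(exact)
print(prefixed)
print([tag.get_text() for tag in tags])
```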
find() takes the same arguments as find_all(), but it returns only the first match (or None when nothing matches) instead of a list.
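Because find() returns None when nothing matches, calling a method on the result without checking it raises AttributeError. A minimal sketch of a guarded lookup follows; the document is a small fragment made up for this example and parsed with "html.parser" so that lxml is not needed.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul id="list-1"><li>Foo</li><li>Bar</li></ul>', "html.parser")

first_li = soup.find("li")     # the first <li> only
missing = soup.find("table")   # no <table> in the document, so None

# Guard against None before using the result.
first_text = first_li.get_text() if first_li is not None else None
missing_text = missing.get_text() if missing is not None else None

print(first_text)
print(missing_text)
```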
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
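# select() takes a CSS selector and returns a list of matching tags
print(soup.select('.panel .panel-heading'))  # elements with class panel-heading inside an element with class panel
print(soup.select('ul li'))  # li tags under ul tags
print(soup.select('#list-2 .element'))  # elements with class element inside the element whose id is list-2
print(type(soup.select('ul')[0]))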
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>
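A few other selector forms may be useful alongside the ones above. The sketch below shows select_one(), which returns the first match or None, together with a child combinator and an attribute selector. It uses a shortened version of the document and "html.parser" so that lxml is not required; CSS selector support in recent Beautiful Soup versions comes from the soupsieve package, which is normally installed with it.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel-body">
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

second_list = soup.select_one("#list-2")            # first match only, or None
small_items = soup.select("ul.list-small > li")     # li tags that are direct children
first_items = soup.select('ul[id="list-1"] li')     # attribute selector
nothing = soup.select_one("table")                  # no match

print(second_list["class"])
print(len(small_items), len(first_items))
print(nothing)
```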
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
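    print(ul['id'])
    print(ul.attrs['id'])
Square brackets are enough to read an attribute: ul['id'] returns the id attribute of ul, and ul.attrs['id'] is equivalent.
Result: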
list-1
list-1
list-2
list-2
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())
The get_text() method returns the text content of each tag:
Foo
Bar
Jay
Foo
Bar
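When get_text() is called on a tag that contains several nested tags, the pieces of text are concatenated as they are, including any surrounding whitespace. One way to get cleaner output is to pass a separator and strip=True, or to iterate over stripped_strings. The sketch below uses a small fragment invented for this illustration and "html.parser" so that lxml is not needed.

```python
from bs4 import BeautifulSoup

html = "<div><h4> Hello </h4><ul><li>Foo</li>\n<li>Bar</li></ul></div>"
soup = BeautifulSoup(html, "html.parser")

raw = soup.div.get_text()                      # text joined as-is, whitespace included
clean = soup.div.get_text(" ", strip=True)     # pieces stripped, joined with a space
pieces = list(soup.div.stripped_strings)       # the same pieces as a list

print(repr(raw))
print(clean)
print(pieces)
```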