Reposted from http://www.crifan.com/python_third_party_lib_html_parser_beautifulsoup/
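When writing crawlers or parsing web pages in Python, for example as described in:
"How to use Python, C# and other languages to fetch static web pages, fetch dynamic web pages, and simulate logging in to websites"
you frequently need to parse HTML and similar markup.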
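For extracting content from simple HTML, Python's built-in regular expression module, re, is sufficient.
For complex HTML, and especially for invalid or buggy HTML, it is better to use a dedicated HTML parsing library.
Among Python's dedicated HTML parsing libraries, BeautifulSoup is one of the easiest to use.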
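For the simple case, a regular expression really is enough. The sketch below is only an illustration in Python 3: the sample markup and the function name extract_h1_user are made up for this example, and the pattern assumes the tag is written exactly in this form.

```python
import re

# Hypothetical sample markup, chosen only for illustration.
SIMPLE_HTML = '<div class="icon_col"><h1 class="h1user">crifan</h1></div>'


def extract_h1_user(html):
    """Return the text of the first <h1 class="h1user"> element, or None."""
    # Non-greedy match so we stop at the first closing </h1>.
    match = re.search(r'<h1\s+class="h1user">(.*?)</h1>', html, re.S)
    return match.group(1) if match else None


user = extract_h1_user(SIMPLE_HTML)
print("user =", user)
```

A pattern like this breaks as soon as the attribute order, quoting, or nesting changes, which is the reason to reach for a real parser on anything less regular.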
BeautifulSoup is a Python library dedicated to parsing HTML/XML.
Its main features:
It can parse even buggy, malformed HTML.
It is powerful.
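To make the "buggy HTML" point concrete, here is a minimal sketch in Python 3 using the newer bs4 package. It assumes bs4 is installed and uses Python's built-in "html.parser" backend. The broken markup and the helper name parse_broken are invented for this example.

```python
try:
    from bs4 import BeautifulSoup  # third-party package, assumed to be installed
except ImportError:
    BeautifulSoup = None  # lets this sketch load even when bs4 is missing

# Hypothetical broken markup: the <h1> tag is never closed.
BROKEN_HTML = '<div class="icon_col"><h1 class="h1user">crifan</div>'


def parse_broken(html):
    """Parse html with bs4 and return the first <h1> tag it finds."""
    if BeautifulSoup is None:
        raise RuntimeError("bs4 is not installed")
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("h1")
```

Calling parse_broken(BROKEN_HTML) should give back a usable tag object instead of raising a parse error, even though the input is not well-formed.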
The BeautifulSoup home page is:
http://www.crummy.com/software/BeautifulSoup/
BeautifulSoup has two major versions.
The earlier one is the 3.x series.
Its latest available online documentation is:
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
The Chinese version is:
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html
Many 3.x releases can be downloaded from:
http://www.crummy.com/software/BeautifulSoup/bs3/download//3.x/
for example version 3.0.6, the one I use most often:
http://www.crummy.com/software/BeautifulSoup/bs3/download//3.x/BeautifulSoup-3.0.6.py
The latest version, v4, has been renamed bs4.
Note:
With bs4, BeautifulSoup is imported like this:
from bs4 import BeautifulSoup
After that, BeautifulSoup can be used directly, just as in 3.x.
For details, see:
[Solved] In Python 3, bs4 (Beautifulsoup 4) is already installed, yet the error still occurs: ImportError: No module named BeautifulSoup
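That ImportError shows up when code written for 3.x does "from BeautifulSoup import ..." while only bs4 is installed. One possible way to tolerate both module names is sketched below. The helper name load_beautifulsoup is made up for this example, and it simply reports (None, None) when neither version can be imported.

```python
def load_beautifulsoup():
    """Return (BeautifulSoup class, major version), or (None, None) if unavailable."""
    try:
        from bs4 import BeautifulSoup  # 4.x lives in the bs4 package
        return BeautifulSoup, 4
    except ImportError:
        pass
    try:
        from BeautifulSoup import BeautifulSoup  # 3.x is a single module
        return BeautifulSoup, 3
    except ImportError:
        return None, None


soup_class, major = load_beautifulsoup()
print("BeautifulSoup major version found:", major)
```

Trying bs4 first is a design choice for this sketch; code that depends on 3.x-only behaviour would want the opposite order.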
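The bs4 online documentation is at:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
The bs4 releases can be downloaded from:
http://www.crummy.com/software/BeautifulSoup/bs4/download/
For example, the latest version at the time of writing is: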
http://www.crummy.com/software/BeautifulSoup/bs4/download/beautifulsoup4-4.1.3.tar.gz
Versions up to 3.0.6 do not need to be installed, which makes them the simplest to use. Just download the version you want, for example:
http://www.crummy.com/software/BeautifulSoup/bs3/download//3.x/BeautifulSoup-3.0.6.py
This gives you BeautifulSoup-3.0.6.py; rename it to BeautifulSoup.py.
Then put it in the same directory as your current Python file. For example, my current Python file is:
D:\tmp\tmp_dev_root\python\beautifulsoup_demo\beautifulsoup_demo.py
so the module goes into
D:\tmp\tmp_dev_root\python\beautifulsoup_demo\
next to beautifulsoup_demo.py.
As for how to install a third-party Python module in general: in short, change into the module's directory and run:
setup.py install
For a detailed explanation, see:
In your Python file, here beautifulsoup_demo.py, simply import it.
As for the sample HTML code, for example use:
Related reference documentation:
For 3.x:
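Before importing, it can be handy to check whether the module is actually visible to the interpreter. The sketch below assumes Python 3 and uses the standard library's importlib.util.find_spec. The helper name is_importable is made up for this example.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# Check both the 4.x package name and the 3.x module name.
for name in ("bs4", "BeautifulSoup"):
    print(name, "importable:", is_importable(name))
```

If both lines print False, the module is neither installed nor sitting next to the script.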
find(name, attrs, recursive, text, **kwargs)
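The simplest and most basic task, extracting one piece of content from the HTML, comes down to calling the corresponding find function.
The complete code is: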
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Function:
[Tutorial] BeautifulSoup: a third-party Python library for parsing HTML
Author: Crifan Li
Version: 2012-12-26
Contact: admin at crifan dot com
"""

# Python 2 with BeautifulSoup 3.x (BeautifulSoup.py in the same directory)
from BeautifulSoup import BeautifulSoup

def beautifulsoupDemo():
    demoHtml = """
<html>
<body>
<div class="icon_col">
<h1 class="h1user">crifan</h1>
</div>
</body>
</html>
"""
    soup = BeautifulSoup(demoHtml)
    print "type(soup)=", type(soup)
    #type(soup)= <type 'instance'>
    print "soup=", soup

    # 1. extract content
    # method 1: positional arguments
    #h1userSoup = soup.find("h1", {"class": "h1user"})
    # method 2: keyword arguments
    h1userSoup = soup.find(name="h1", attrs={"class": "h1user"})
    # more can be found at:
    #http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html#find%28name,%20attrs,%20recursive,%20text,%20**kwargs%29
    print "h1userSoup=", h1userSoup
    #h1userSoup= <h1 class="h1user">crifan</h1>
    h1userUnicodeStr = h1userSoup.string
    print "h1userUnicodeStr=", h1userUnicodeStr
    #h1userUnicodeStr= crifan

if __name__ == "__main__":
    beautifulsoupDemo()
The output is:
D:\tmp\tmp_dev_root\python\beautifulsoup_demo>beautifulsoup_demo.py
type(soup)= <type 'instance'>
soup=
<html>
<body>
<div class="icon_col">
<h1 class="h1user">crifan</h1>
</div>
</body>
</html>

h1userSoup= <h1 class="h1user">crifan</h1>
h1userUnicodeStr= crifan
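For readers on Python 3 and bs4, the same find call can be sketched as follows. This is an adaptation, not the demo itself: it assumes bs4 is installed, passes the "html.parser" backend explicitly, and wraps the lookup in a helper called extract_h1_text, a name invented for this example.

```python
try:
    from bs4 import BeautifulSoup  # third-party package, assumed to be installed
except ImportError:
    BeautifulSoup = None  # lets this sketch load even when bs4 is missing

DEMO_HTML = """
<html>
<body>
<div class="icon_col">
<h1 class="h1user">crifan</h1>
</div>
</body>
</html>
"""


def extract_h1_text(html):
    """Return the text of the first <h1 class="h1user">, or None if absent."""
    if BeautifulSoup is None:
        raise RuntimeError("bs4 is not installed")
    soup = BeautifulSoup(html, "html.parser")
    # Same keyword arguments as in the 3.x demo: tag name plus attribute filter.
    h1 = soup.find(name="h1", attrs={"class": "h1user"})
    return None if h1 is None else h1.get_text()
```

With bs4 available, extract_h1_text(DEMO_HTML) should yield the same user name that the 3.x demo prints.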
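If you need to change a value in the original HTML, see the explanation on the official site:
It later turned out that, in the 3.x version used here, only the values of a Tag's attributes can be changed, not the value (the text) of the Tag itself.
The complete sample code is: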
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Function:
[Tutorial] BeautifulSoup: a third-party Python library for parsing HTML
Author: Crifan Li
Version: 2013-02-01
Contact: admin at crifan dot com
"""

# Python 2 with BeautifulSoup 3.x (BeautifulSoup.py in the same directory)
from BeautifulSoup import BeautifulSoup

def beautifulsoupDemo():
    demoHtml = """
<html>
<body>
<div class="icon_col">
<h1 class="h1user">crifan</h1>
</div>
</body>
</html>
"""
    soup = BeautifulSoup(demoHtml)
    print "type(soup)=", type(soup)
    #type(soup)= <type 'instance'>
    print "soup=", soup

    print '{0:=^80}'.format(" 1. extract content ")
    # method 1: positional arguments
    #h1userSoup = soup.find("h1", {"class": "h1user"})
    # method 2: keyword arguments
    h1userSoup = soup.find(name="h1", attrs={"class": "h1user"})
    # more can be found at:
    #http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html#find%28name,%20attrs,%20recursive,%20text,%20**kwargs%29
    print "h1userSoup=", h1userSoup
    #h1userSoup= <h1 class="h1user">crifan</h1>
    h1userUnicodeStr = h1userSoup.string
    print "h1userUnicodeStr=", h1userUnicodeStr
    #h1userUnicodeStr= crifan

    print '{0:=^80}'.format(" 2. demo change tag value and property ")
    print '{0:-^80}'.format(" 2.1 can NOT change tag value ")
    print "old tag value=", soup.body.div.h1.string
    #old tag value= crifan
    changedToString = u"CrifanLi"
    # the assignment is accepted, but the printed tag below is unchanged
    soup.body.div.h1.string = changedToString
    print "changed tag value=", soup.body.div.h1.string
    #changed tag value= CrifanLi
    print "After changed tag value, new h1=", soup.body.div.h1
    #After changed tag value, new h1= <h1 class="h1user">crifan</h1>

    print '{0:-^80}'.format(" 2.2 can change tag property ")
    soup.body.div.h1['class'] = "newH1User"
    print "changed tag property value=", soup.body.div.h1
    #changed tag property value= <h1 class="newH1User">crifan</h1>

if __name__ == "__main__":
    beautifulsoupDemo()
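The limitation demonstrated above is what the 3.x demo shows. As far as I know, bs4 behaves differently: assigning to a tag's .string replaces its contents in the tree. The sketch below illustrates that under Python 3, assuming bs4 is installed. The helper name rename_user is invented for this example.

```python
try:
    from bs4 import BeautifulSoup  # third-party package, assumed to be installed
except ImportError:
    BeautifulSoup = None  # lets this sketch load even when bs4 is missing

DEMO_HTML = '<div class="icon_col"><h1 class="h1user">crifan</h1></div>'


def rename_user(html, new_text, new_class):
    """Parse html, then change both the text and the class of the first <h1>."""
    if BeautifulSoup is None:
        raise RuntimeError("bs4 is not installed")
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1", attrs={"class": "h1user"})
    h1.string = new_text    # replaces the tag's text content
    h1["class"] = new_class  # replaces the attribute value
    return soup
```

Printing rename_user(DEMO_HTML, "CrifanLi", "newH1User") should show both changes in the serialized markup, in contrast to the 3.x output where the text stayed as crifan.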
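More usage notes and lessons learned have been partly collected in:
[Summary] Lessons learned from using the third-party Python library BeautifulSoup
[Notes] Points to watch when using BeautifulSoup, the HTML processing library for Python
When time permits, everything will be consolidated into: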