First, the official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
If you have time, it is worth reading the package documentation itself.
Beautiful Soup has one very important advantage over other HTML parsing approaches: the HTML is broken down into objects, so the whole document can be navigated like dictionaries and lists.
Compared with regex-based scrapers, you skip the high cost of learning regular expressions.
Compared with XPath-based parsing it also saves learning time, even though XPath is already the simpler of the two (the scraping framework Scrapy uses XPath).
On Linux, any of the following will install it:
apt-get install python-bs4
easy_install beautifulsoup4
pip install beautifulsoup4
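A quick way to confirm the install worked (the bs4 module exposes a __version__ attribute):
python -c "import bs4; print(bs4.__version__)"
# prints the installed version, e.g. 4.x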
Now for how to use BeautifulSoup.
Parsing HTML comes down to extracting data, and there are really only a few tasks:
1. Getting the content of a specified tag, e.g. pulling the text "hello, watsy" or "hello, beautiful soup." out of the tag that wraps it.
2. Getting an attribute of a specified tag, e.g. the href behind a link labelled "watsy's blog".
3. Locating the right tag in the first place, which is what the search methods are for (a small sketch follows this list).
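As a tiny sketch of those three tasks on a made-up snippet (the markup and URL below are purely illustrative):
from bs4 import BeautifulSoup

# illustrative markup -- not from any real page
snippet = '<div><a href="http://example.com">watsy\'s blog</a></div>'
soup = BeautifulSoup(snippet)

print(soup.a.string)    # 1. the tag's content   -> watsy's blog
print(soup.a['href'])   # 2. the tag's attribute -> http://example.com
print(soup.find('a'))   # 3. finding the tag in the first place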
The usage examples below are taken from the official documentation:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
Pretty-print the document:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
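As a small aside, BeautifulSoup(html_doc) lets bs4 pick a parser automatically; newer bs4 releases warn about that, so you may prefer to name the parser explicitly:
# "html.parser" ships with the standard library; "lxml" is faster if it is installed.
soup = BeautifulSoup(html_doc, "html.parser")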
soup.title
# <title>The Dormouse's story</title>
soup.title.name
# u'title'
soup.title.string
# u'The Dormouse's story'
soup.title.parent.name
# u'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
1. Get a tag:
soup.title
2. Get the tag's name:
soup.title.name
3. Get the title tag's content:
soup.title.string
4. Get the name of the title tag's parent:
soup.title.parent.name
See how object-like the whole interface is?
Next, how to extract attributes such as href.
soup.p['class']
# u'title'
The general pattern is soup.some_tag['attribute_name'].
The most common case is pulling the URL out of a link such as the "watsy's blog" anchor mentioned earlier, and the code is simply
soup.a['href']
Pretty easy, right?
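A very common follow-up is collecting every link on the page; a short sketch using .get(), which returns None instead of raising KeyError when the attribute is missing:
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie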
Now for the important part: searching and extracting across the whole document.
soup provides find and find_all for this; find is implemented internally as a call to find_all, so only find_all is covered here.
def find_all(self, name=None, attrs={}, recursive=True, text=None,
             limit=None, **kwargs):
The first argument is the tag name; the second is an attrs filter; recursive controls whether the search descends beyond direct children; text matches against string content; limit caps the number of results; and **kwargs lets you pass attribute filters as keyword arguments (a sketch covering attrs, recursive, limit and class_ follows the official examples below).
Example usage.
By tag name:
soup.find_all('b')
# [<b>The Dormouse's story</b>]
A regular expression:
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title
A list of names:
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
A function:
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]
Searching by tag name plus attribute (here the CSS class):
soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]
Filtering by tag name:
soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Filtering by tag attribute:
soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
Filtering text with a regular expression:
import re
soup.find(text=re.compile("sisters"))
# u'Once upon a time there were three little sisters; and their names were\n'
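The examples above do not show the attrs, recursive and limit parameters from the signature; a short sketch (results shown for the same Dormouse document):
# attrs as an explicit dict -- equivalent to the keyword form id="link2"
soup.find_all("a", attrs={"id": "link2"})
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

# limit stops the search after N matches
soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

# recursive=False only looks at direct children of the tag you call it on
soup.html.find_all("title", recursive=False)
# []   -- <title> lives under <head>, not directly under <html>

# "class" is a reserved word in Python, so CSS classes are filtered with class_
soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]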
To get at the text itself, .string returns the content of a single tag, and .strings iterates over every string in the document:
title_tag = soup.title
title_tag.string
# u'The Dormouse's story'
for string in soup.strings:
    print(repr(string))
# u"The Dormouse's story"
# u'\n\n'
# u"The Dormouse's story"
# u'\n\n'
# u'Once upon a time there were three little sisters; and their names were\n'
# u'Elsie'
# u',\n'
# u'Lacie'
# u' and\n'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# u'...'
# u'\n'
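The bare '\n\n' entries above are just whitespace between tags; .stripped_strings skips them:
for string in soup.stripped_strings:
    print(repr(string))
# u"The Dormouse's story"
# u"The Dormouse's story"
# u'Once upon a time there were three little sisters; and their names were'
# u'Elsie'
# ... and so on, with the whitespace-only strings removed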
Child nodes are exposed through .contents:
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# [u'The Dormouse's story']
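.contents returns a list; .children walks the same child nodes as an iterator, which reads better in a loop:
for child in head_tag.children:
    print(child)
# <title>The Dormouse's story</title>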
Finally, a compressed recap on real data:
soup = BeautifulSoup(data)                            # data: the HTML string you fetched
soup.title                                            # the <title> tag
soup.p['title']                                       # attribute lookup on the first <p>, same pattern as soup.p['class'] above
divs = soup.find_all('div', content='tpc_content')    # keyword filter on the content attribute
divs[0].contents[0].string                            # text of the first child of the first match
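To tie the recap together, a minimal end-to-end sketch -- assuming, hypothetically, a locally saved page.html whose post bodies are marked <div class="tpc_content"> (if the attribute really is content="tpc_content", keep the keyword filter used above):
from bs4 import BeautifulSoup

with open("page.html") as f:          # page.html is a hypothetical saved page
    data = f.read()

soup = BeautifulSoup(data)
print(soup.title.string)              # text of the <title> tag

# class_ filters on the CSS class, since "class" is reserved in Python
for div in soup.find_all("div", class_="tpc_content"):
    print(div.get_text())             # all text inside the post body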