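Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with the parser of your choice to provide ways of navigating and searching the document.
Installation: on recent Debian or Ubuntu releases, install it directly with apt-get install python-bs4 (python3-bs4 for Python 3).
You can also run pip install beautifulsoup4 (pip install bs4 installs the same library).
Because the third-party lxml parser is much faster than the HTML parser in the Python standard library, we install and use lxml. It is installed the same way as bs4.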
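The parser is selected by the second argument passed to BeautifulSoup. As a minimal sketch, assuming lxml may or may not be installed on your machine, the helper below falls back to the standard library's html.parser. The name pick_parser is hypothetical and not part of bs4.

```python
import importlib.util

from bs4 import BeautifulSoup


def pick_parser():
    """Return 'lxml' when it is installed, else the built-in 'html.parser'."""
    # pick_parser is a hypothetical helper for this sketch, not a bs4 API.
    return "lxml" if importlib.util.find_spec("lxml") else "html.parser"


parser = pick_parser()
soup = BeautifulSoup("<p>hello</p>", parser)
print(parser, soup.p.string)
```

Naming the parser explicitly also avoids the warning bs4 prints when it has to guess one.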
First, import the library: from bs4 import BeautifulSoup. We use the following HTML for practice.
html_str = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
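# <1> Create a BeautifulSoup object from a string
# (from_encoding only applies to bytes input, so it is omitted for a str)
soup = BeautifulSoup(html_str, 'lxml')
# <2> Or create it from an HTML file
soup = BeautifulSoup(open('index.html'), 'lxml')
# Pretty-print the contents of the soup object
print(soup.prettify())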
Output:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
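The two ways of creating the object (from a string and from a file) can also be written with explicit encoding handling and a file handle that is closed afterwards. This is a minimal sketch, not the only way to do it: it uses html.parser so it runs without lxml, and the markup and the temporary file are made up for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Illustrative markup, encoded to bytes as it would arrive from disk or network.
html_bytes = "<html><head><title>Caf\u00e9</title></head></html>".encode("utf-8")

# from_encoding is a decoding hint for bytes input; it is ignored for str input.
soup_from_bytes = BeautifulSoup(html_bytes, "html.parser", from_encoding="utf-8")

# Write the same markup to a temporary file, then parse it from disk.
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write(html_bytes)
    path = tmp.name

try:
    # The with-block closes the file handle once parsing is done.
    with open(path, "rb") as fh:
        soup_from_file = BeautifulSoup(fh, "html.parser", from_encoding="utf-8")
finally:
    os.remove(path)

print(soup_from_bytes.title.string)
print(soup_from_file.title.string)
```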
<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
The HTML tags above, such as title, head, a and p, together with their contents, are Tag objects.
So how do we get them? Let's take a look.
print(soup.title)
print(soup.head)
print(soup.a)
print(soup.p)
You can also print the type: print(type(soup.p))
# <class 'bs4.element.Tag'>
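Besides its type, a Tag exposes its name and its attributes. The sketch below uses the first paragraph of the practice HTML and html.parser (an assumption, so that it runs without lxml). Note that .name is the tag name, while the HTML attribute called name is read with a key lookup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p
print(isinstance(p, Tag))  # soup.p is a Tag object
print(p.name)              # the tag name, not the "name" attribute
print(p["name"])           # attribute lookup by key
print(p["class"])          # class is multi-valued, so it comes back as a list
print(p.get("id"))         # .get() returns None for a missing attribute
print(p.attrs)             # all attributes as a dict
```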
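We have obtained the tags, but how do we extract the text inside them? That is what the .string attribute is for.
print(soup.p.string)
# The Dormouse's story
print(type(soup.p.string))
# <class 'bs4.element.NavigableString'>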
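One caveat with .string: it only works when a tag contains a single string, possibly through a single chain of nested tags. For a tag with several children it returns None, and get_text() is the usual alternative. A small sketch, using shortened markup and html.parser as assumptions:

```python
from bs4 import BeautifulSoup

html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once upon a time <a id="link2">Lacie</a> lived here.</p>'
)
soup = BeautifulSoup(html, "html.parser")
title_p, story_p = soup.find_all("p")

print(title_p.string)      # single chain of children, so .string reaches the text
print(story_p.string)      # several children, so .string is None
print(story_p.get_text())  # get_text() joins all nested text
```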
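A BeautifulSoup object represents the whole document. Most of the time we can treat it as a Tag object, and we can get its type, name and attributes.
print(type(soup.name))
# <class 'str'>
print(soup.name)
# [document]
print(soup.attrs)  # the document itself has no attributes
# {}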
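The reason the document object can be treated as a Tag is that the BeautifulSoup class is a subclass of Tag. A short sketch with made-up markup and html.parser (both assumptions) that checks this:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<html><body><p>hi</p></body></html>", "html.parser")

print(isinstance(soup, Tag))   # BeautifulSoup is a subclass of Tag
print(soup.name)               # special name of the document object
print(soup.attrs)              # the document itself has no attributes
print(soup.find("p").string)   # Tag methods such as find() work on it directly
```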
A Comment object is a special type of NavigableString; its output does not include the comment markers.
print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
print(soup.a.string)
# Elsie
print(type(soup.a.string))
# <class 'bs4.element.Comment'>
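Because .string silently drops the comment markers, a comment can be mistaken for ordinary text. One way to guard against that, sketched here with shortened markup and html.parser as assumptions, is to check the type before using the value:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = (
    '<a id="link1"><!-- Elsie --></a>'
    '<a id="link2">Lacie</a>'
)
soup = BeautifulSoup(html, "html.parser")

visible = []
for a in soup.find_all("a"):
    text = a.string
    # Skip strings that are really HTML comments.
    if isinstance(text, Comment):
        continue
    visible.append(str(text))

print(visible)
```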
A. Pass a string
print(soup.find_all('b'))
# [<b>The Dormouse's story</b>]
print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
B. Pass a regular expression
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
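C. Pass a list
soup.find_all(['a', 'b'])
# Keyword arguments filter on attributes
soup.find_all(id='link2')
# The text argument searches strings instead of tags
soup.find_all(text="Elsie")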
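The filters above can be combined in a single find_all call. The sketch below is illustrative only: it uses html.parser and a simplified copy of the practice HTML in which the first link holds plain text instead of a comment.

```python
import re

from bs4 import BeautifulSoup

# Simplified sample: the first link contains plain text here (an assumption).
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with underscore) filters on the class attribute; limit caps the results.
by_class = soup.find_all("a", class_="sister", limit=2)
# Any other keyword is treated as an attribute filter.
by_id = soup.find_all(id="link2")
# Attribute values can be matched with a regular expression.
by_href = soup.find_all(href=re.compile("tillie$"))
# A function filter receives each tag and returns True to keep it.
by_func = soup.find_all(lambda tag: tag.name == "p" and tag.has_attr("class"))
# find() returns only the first match (or None).
first_a = soup.find("a")

print([a["id"] for a in by_class])
print(by_id[0].string, by_href[0]["id"], len(by_func), first_a["id"])
```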
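1. Find by tag name
print(soup.select('a'))
2. Find by class name
print(soup.select('.sister'))
3. Find by id
print(soup.select('#link1'))
4. Combined lookup
# Combined lookup works the same way as in a CSS file: tag names, class names and ids are combined using the same rules.
print(soup.select('p #link1'))
5. Find by attribute
print(soup.select('a[class="sister"]'))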
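A few more selector forms are sketched below: the child combinator, select_one for a single result, and an attribute suffix match. The markup is a simplified copy of the practice HTML and html.parser is used, both as assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Simplified sample: the first link contains plain text here (an assumption).
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# "p.story > a": <a> tags that are direct children of <p class="story">.
children = soup.select("p.story > a")
# select_one returns the first match, or None when nothing matches.
second = soup.select_one("#link2")
missing = soup.select_one("#no-such-id")
# [href$="..."] matches attribute values that end with the given text.
ends_with = soup.select('a[href$="tillie"]')

print(len(children), second.get_text(), missing, ends_with[0]["id"])
```

select always returns a list, so select_one saves an index lookup when only one element is expected.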
6. Get the content
soup = BeautifulSoup(html_str, 'lxml')
print(type(soup.select('title')))
print(soup.select('title')[0].get_text())
for title in soup.select('title'):
    print(title.get_text())
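Putting selection and content extraction together, a common pattern is to collect one record per matched tag. A minimal sketch, assuming html.parser and a simplified copy of the practice HTML in which every link holds plain text:

```python
from bs4 import BeautifulSoup

# Simplified sample: every link contains plain text here (an assumption).
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# One dict per link: attributes via key lookup, text via get_text().
links = [
    {"id": a["id"], "href": a["href"], "text": a.get_text(strip=True)}
    for a in soup.select("a.sister")
]

for link in links:
    print(link["id"], link["text"], link["href"])
```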