5. BeautifulSoup: A Detailed Guide and Practical Usage

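BeautifulSoup is a flexible and convenient library for parsing web pages. It is efficient, supports several parsers, and lets you extract information from a page without writing regular expressions.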

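Parsers

Comparison of the available parsers:

Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Less tolerant in versions before Python 2.7.3 or 3.2.2
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires an external C library
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires an external C library
html5lib | BeautifulSoup(markup, "html5lib") | Best tolerance of malformed documents; parses the way a browser does; produces valid HTML5 | Slow; requires an external Python package (html5lib)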
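
Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The following is a minimal sketch of that idea, not a recommended recipe: make_soup and its preferred parameter are hypothetical names, and the sketch assumes bs4 raises FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: try each parser name in order, return (soup, name)."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


# Deliberately unclosed markup: either parser closes the tags.
soup, parser_used = make_soup("<p>Hello<b>world")
print(parser_used)
print(soup.b.string)
```

The parsers can build slightly different trees from the same broken markup (lxml wraps fragments in html and body tags, html.parser does not), so code that depends on the exact tree shape should pin one parser.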

Basic usage

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

The input ends without closing its body and html tags. The parser supplies them automatically, and prettify() prints the repaired document with consistent indentation.

Tag selectors

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)

Output:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
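
Note that soup.p prints only one p tag even though the document contains three: accessing a tag by name returns the first match. The short sketch below illustrates this on a made-up two-paragraph document (the HTML string is an assumption for illustration, and html.parser is used only to avoid an extra dependency).

```python
from bs4 import BeautifulSoup

html = '<p class="title"><b>first</b></p><p class="story">second</p>'
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p               # only the first <p> in the document
all_p = soup.find_all("p")     # every <p>

print(first_p.name)            # the tag's name
print(first_p["class"])        # class can hold several values, so it is a list
print(len(all_p))
print(soup.table)              # a tag that does not exist gives None
```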

Getting attributes

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])

Output:

dromouse
dromouse
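
Both forms above raise KeyError when the attribute is missing. As a hedged sketch of the alternatives, the snippet below uses Tag.get(), which returns None or a default instead; the one-line HTML string is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title" name="dromouse">x</p>', "html.parser")
p = soup.p

name_attr = p["name"]             # raises KeyError if the attribute is absent
missing = p.get("id")             # None when the attribute is absent
fallback = p.get("id", "no-id")   # or a default of your choice
all_attrs = p.attrs               # dictionary of every attribute on the tag

print(name_attr, missing, fallback)
print(all_attrs)
```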

Getting content

Use the .string attribute to get the text inside a tag.

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)
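
The first p tag has a single child (the b tag), so .string can reach through it to the text. When a tag has several children, .string is None and get_text() is the usual alternative. The sketch below shows both cases on a shortened, made-up version of the document.

```python
from bs4 import BeautifulSoup

html = (
    "<p><b>The Dormouse's story</b></p>"
    '<p class="story">Once <a>Lacie</a> and</p>'
)
soup = BeautifulSoup(html, "html.parser")

title_text = soup.p.string             # single child chain: p -> b -> text
story = soup.find("p", class_="story")
story_string = story.string            # several children, so this is None
story_text = story.get_text()          # concatenates all text inside the tag

print(title_text)
print(story_string)
print(story_text)
```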

Nested selection

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)
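
Chaining works because each step returns a Tag, which supports the same attribute access. A missing step returns None, so chaining past it raises AttributeError. The sketch below, on a made-up document, shows one way to guard against that; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p><b>bold</b></p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

title_text = soup.head.title.string   # head -> title -> text
bold_text = soup.body.p.b.string      # body -> p -> b -> text

# There is no <table>, so soup.body.table is None; check before going deeper.
table = soup.body.table
cell_text = table.td.string if table is not None else None

print(title_text, bold_text, cell_text)
```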

Children and descendants

The .contents attribute returns a list of all direct children of a tag.

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)

Output:

['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']

The result is a list, which is hard to read when printed all at once. The .children attribute gives an iterator over the same direct children, so they can be taken out one at a time.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

Output:


<list_iterator object at 0x1064f7dd8>
0 
            Once upon a time there were three little sisters; and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4  
            and

5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
            and they lived at the bottom of a well.
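
As the numbered output shows, whitespace between tags counts as a child too. A common follow-up, sketched here on a made-up fragment, is to keep only the children that are tags by checking for bs4.element.Tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<p>\n<a>Elsie</a>\n<a>Lacie</a>\n</p>"
soup = BeautifulSoup(html, "html.parser")

contents = soup.p.contents    # a list: newline strings and <a> tags interleaved
tag_children = [child for child in soup.p.children if isinstance(child, Tag)]
names = [a.string for a in tag_children]

print(len(contents))
print(names)
```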

descendants

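The .descendants attribute returns a generator over all descendants of a tag: its children, their children, and so on.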

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)
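
To make the difference from .children concrete, the sketch below counts both on a small made-up fragment. Text nodes are included in both counts.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<p>Once <a><span>Elsie</span></a> and <a>Lacie</a></p>"
soup = BeautifulSoup(html, "html.parser")

children = list(soup.p.children)        # direct children only
descendants = list(soup.p.descendants)  # walks the whole subtree in document order
descendant_tags = [d.name for d in descendants if isinstance(d, Tag)]

print(len(children), len(descendants))
print(descendant_tags)
```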

parent

The .parent attribute returns the parent of a node.

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)

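This prints the parent of the first a tag, which is the p tag, together with everything it contains.

parents

The .parents attribute yields every ancestor of a node in turn, from its parent up to the document root. It is the counterpart of .descendants.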
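
No example is given for .parents, so here is a minimal sketch on a made-up document. It collects the name of each ancestor of the span tag; the last ancestor is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a><span>Elsie</span></a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk upward from <span>: nearest ancestor first, document root last.
ancestor_names = [ancestor.name for ancestor in soup.span.parents]
print(ancestor_names)
```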

Siblings

Use .next_siblings to iterate over the siblings that follow a node.
Use .previous_siblings to iterate over the siblings that precede it.
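
Both attributes are generators and include text nodes, and .previous_siblings starts with the nearest sibling. The sketch below, on a made-up fragment, also uses find_next_sibling() and find_previous_sibling() to jump straight to the neighbouring tags.

```python
from bs4 import BeautifulSoup

html = (
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a>'
    ' and <a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
link2 = soup.find(id="link2")

following = list(link2.next_siblings)       # text " and ", then the link3 tag
preceding = list(link2.previous_siblings)   # nearest first: text ", ", then link1

next_tag = link2.find_next_sibling("a")     # skips the text node
prev_tag = link2.find_previous_sibling("a")

print(following)
print(preceding)
print(next_tag["id"], prev_tag["id"])
```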

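Standard selectors

find_all(name, attrs, recursive, text, **kwargs)

Searches the document by tag name, attributes, or text content. In Beautiful Soup 4.4.0 and later the text argument is also available under the name string.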

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))

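find_all() returns every tag that matches the given name. Here the result is a list containing the two ul tags, and each item in it is a Tag object:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>

Because each result is a Tag, find_all() can be called on it again. The following code finds the li tags under each ul tag: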

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
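
The examples above use only the name argument. The sketch below, on a made-up fragment, tries the recursive argument from the signature, plus limit and a list of names, which find_all() also accepts.

```python
from bs4 import BeautifulSoup

html = '<div><ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul></div>'
soup = BeautifulSoup(html, "html.parser")

first_two = soup.find_all("li", limit=2)                   # stop after two matches
div_direct = soup.div.find_all("li", recursive=False)      # direct children of <div> only
ul_direct = soup.ul.find_all("li", recursive=False)        # direct children of <ul> only
ul_or_li = soup.find_all(["ul", "li"])                     # a list matches any of the names

print(len(first_two), len(div_direct), len(ul_direct), len(ul_or_li))
```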

attrs

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))

attrs takes a dictionary that maps attribute names to values. Both calls match the same ul tag:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

Common attributes such as id and class can also be passed directly as keyword arguments:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))  # class is a Python keyword, so use class_

Output:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
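
Two details are worth trying out: class_ matches any one of a tag's classes, and attributes whose names are not valid Python identifiers (such as data-* attributes) can only be passed through attrs. The sketch below uses a made-up fragment; the data-kind attribute is an assumption added for illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<ul class="list" id="list-1" data-kind="main"></ul>'
    '<ul class="list list-small" id="list-2"></ul>'
)
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all(class_="list")                # matches both: each has class "list"
small = soup.find_all(class_="list-small")             # only the second
by_data = soup.find_all(attrs={"data-kind": "main"})   # data-kind cannot be a keyword argument
combined = soup.find_all("ul", id="list-2", class_="list")  # all filters must match

print(len(by_class), small[0]["id"], by_data[0]["id"], combined[0]["id"])
```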

Selecting by text

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))

This searches for text nodes. Output:

['Foo', 'Foo']

The matches are the text strings themselves, not the tags that contain them.
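
To get from a matched string back to its tag, use .parent, or combine a tag name with the text filter. The sketch below assumes Beautiful Soup 4.4.0 or later, where the filter is spelled string= (the name recent versions prefer over text=), and also shows a regular-expression match. The HTML fragment is made up for illustration.

```python
import re

from bs4 import BeautifulSoup

html = "<ul><li>Foo</li><li>Bar</li><li>Jay</li></ul>"
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Foo")                # text nodes equal to "Foo"
pattern = soup.find_all(string=re.compile("^Ba"))  # text nodes matching a regex
tags = soup.find_all("li", string="Jay")           # <li> tags whose text is "Jay"
owner = exact[0].parent                            # the tag containing the matched text

print([str(s) for s in exact], [str(s) for s in pattern])
print(tags, owner.name)
```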

find

find() takes the same arguments as find_all(), but returns only the first matching element rather than a list, or None if nothing matches.
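
Since find() can return None, it is worth checking the result before using it. A minimal sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

html = "<ul><li>Foo</li><li>Bar</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li")      # a single Tag, not a list
missing = soup.find("table")    # nothing matches, so None

# Guard against None before reading attributes of the result.
missing_text = missing.get_text() if missing is not None else None

print(first_li.string, missing, missing_text)
```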

CSS selectors

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))  # class "panel-heading" inside class "panel"
print(soup.select('ul li'))  # li tags under ul tags
print(soup.select('#list-2 .element'))  # '#' selects by id: tags with class "element" inside the tag with id "list-2"
print(type(soup.select('ul')[0]))

Output:

[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>
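
select() always returns a list. When only the first match is wanted, select_one() returns a single Tag or None. The sketch below, on a compact made-up version of the document, also tries a child combinator and an attribute selector; it assumes a bs4 version recent enough to provide select_one().

```python
from bs4 import BeautifulSoup

html = (
    '<ul class="list" id="list-1"><li class="element">Foo</li>'
    '<li class="element">Bar</li><li class="element">Jay</li></ul>'
    '<ul class="list list-small" id="list-2"><li class="element">Foo</li>'
    '<li class="element">Bar</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

second_list = soup.select_one("#list-2")             # first match only
small_items = soup.select("ul.list-small > li")      # direct li children of ul.list-small
first_items = soup.select('ul[id="list-1"] li')      # attribute selector
nothing = soup.select_one("table")                   # no match gives None

print(second_list["id"], len(small_items), len(first_items), nothing)
```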

Getting attributes

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

Attributes are read with square brackets: ul['id'] returns the id attribute of the ul tag, and ul.attrs['id'] is equivalent. Output:

list-1
list-1
list-2
list-2

Getting content

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())

The get_text() method returns the text content of a tag. Output:

Foo
Bar
Jay
Foo
Bar
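
Called on a tag with several children, get_text() joins all the text inside it, whitespace included. Its separator and strip arguments, and the related .stripped_strings generator, help clean that up. A minimal sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

html = "<ul>\n<li>Foo</li>\n<li> Bar </li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")

raw = soup.ul.get_text()                  # keeps the newlines and spaces as they are
clean = soup.ul.get_text(" ", strip=True) # strip each piece, join with a space
parts = list(soup.ul.stripped_strings)    # the stripped pieces as a list

print(repr(raw))
print(clean)
print(parts)
```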

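Summary

  • Prefer the lxml parser; fall back to html.parser when necessary.
  • Tag selection (soup.tag) offers little filtering but is fast.
  • Use find() and find_all() to match a single result or multiple results.
  • If you are comfortable with CSS selectors, use select().
  • Remember the common ways to get attribute values and text.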
