BeautifulSoup4 Essentials: A Helper Library for Python Web Scraping

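BeautifulSoup is an excellent third-party Python library for extracting data of interest from HTML or XML files, and it lets you choose which parser to use. BeautifulSoup 3 is no longer maintained, so new projects should use beautifulsoup4; the latest version at the time of writing was 4.5.0. Install it with pip install beautifulsoup4, then import it with from bs4 import BeautifulSoup. The examples below give a brief tour of what BeautifulSoup4 can do; for more detailed and complete documentation, see https://www.crummy.com/software/BeautifulSoup/bs4/doc/.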


>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('hello world!', 'lxml')  # missing tags are added and completed automatically
<html><body><p>hello world!</p></body></html>
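How much markup gets added depends on the parser. The sketch below is not part of the original session; it contrasts the built-in html.parser with lxml on the same fragment, and assumes lxml may not be installed (hence the try/except). The variable names are illustrative only.

```python
from bs4 import BeautifulSoup, FeatureNotFound

fragment = 'hello world!'

# html.parser ships with Python and leaves a bare fragment unwrapped.
plain = BeautifulSoup(fragment, 'html.parser')
print(plain)

# Unclosed tags are still closed when the tree is serialized.
unclosed = BeautifulSoup('<p>hello <b>world', 'html.parser')
print(unclosed)

# lxml wraps the fragment in <html><body><p>; it is an optional dependency,
# and BeautifulSoup raises FeatureNotFound when the parser is missing.
try:
    wrapped = BeautifulSoup(fragment, 'lxml')
    print(wrapped)
except FeatureNotFound:
    print('lxml is not installed')
```

Naming the parser explicitly keeps results the same across machines, because the default choice depends on which parsers happen to be installed.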

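>>> html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""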

>>> soup = BeautifulSoup(html_doc, 'html.parser')  # lxml or another parser also works

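>>> print(soup.prettify())  # pretty-print the parsed document
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>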

>>> soup.title  # access a specific tag
<title>The Dormouse's story</title>
>>> soup.title.name  # tag name
'title'
>>> soup.title.text  # tag text
"The Dormouse's story"
>>> soup.title.string
"The Dormouse's story"
>>> soup.title.parent  # parent tag
<head><title>The Dormouse's story</title></head>
>>> soup.head
<head><title>The Dormouse's story</title></head>
>>> soup.b
<b>The Dormouse's story</b>
>>> soup.body.b
<b>The Dormouse's story</b>
>>> soup.name   # the BeautifulSoup object itself can be treated as a tag
'[document]'
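For the title tag, .text and .string return the same value, but the two attributes differ once a tag has more than one child. A minimal sketch, using a made-up one-line document rather than the html_doc of this article:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <b>world</b></p>', 'html.parser')

# <b> has exactly one child, a string, so .string returns that string.
b_string = soup.b.string

# <p> has two children (a string and the <b> tag), so .string is None,
# while .text concatenates every string nested inside the tag.
p_string = soup.p.string
p_text = soup.p.text

print(repr(b_string), repr(p_string), repr(p_text))
```

In practice this means .string is best checked against None before use, while .text always returns a string.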

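>>> soup.body
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>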

>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>> soup.p['class']  # tag attribute
['title']
>>> soup.p.get('class')  # an attribute can also be read this way
['title']
>>> soup.p.text
"The Dormouse's story"
>>> soup.p.contents
[<b>The Dormouse's story</b>]

>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> soup.a.attrs  # all attributes of the tag
{'class': ['sister'], 'href': 'http://example.com/elsie', 'id': 'link1'}

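>>> soup.find_all('a')  # find every <a> tag
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.find_all(['a', 'b'])   # find several kinds of tag at once
[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> import re
>>> soup.find_all(href=re.compile("elsie"))  # find tags whose href contains a given keyword
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
>>> soup.find(id='link3')
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
>>> soup.find_all('a', id='link3')
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]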
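The filters above can be combined with a few other find_all() arguments. The following is a sketch only: the three-link snippet, including the "brother" class and the Tom link, is invented for illustration and does not appear in the original document.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = '''
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="brother" href="http://example.com/tom" id="link3">Tom</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# class is a Python keyword, so the keyword argument is spelled class_.
sisters = soup.find_all('a', class_='sister')

# A compiled regular expression is searched for in the attribute value.
numbered = soup.find_all(id=re.compile(r'^link[12]$'))

# limit stops the search after the given number of matches.
first_only = soup.find_all('a', limit=1)

# find() returns None when nothing matches, so test the result before use.
missing = soup.find('a', id='no-such-id')

print([a.text for a in sisters], [a['id'] for a in numbered], missing)
```

Note that find_all() returns an empty list on no match, whereas find() returns None; code that chains attribute access onto find() should guard against that.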

>>> for link in soup.find_all('a'):
    print(link.text, ':', link.get('href'))

Elsie : http://example.com/elsie
Lacie : http://example.com/lacie
Tillie : http://example.com/tillie
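The loop uses link.get('href') rather than link['href']. One reason to prefer get() is that it returns None for a missing attribute, whereas indexing raises KeyError. A small sketch, using a made-up snippet in which the second anchor has no href:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <a> deliberately lacks an href attribute.
html = '<a href="http://example.com/elsie">Elsie</a><a name="top">Top</a>'
soup = BeautifulSoup(html, 'html.parser')

# Collect link text -> URL, skipping anchors without an href.
links = {}
for link in soup.find_all('a'):
    href = link.get('href')      # None when the attribute is absent
    if href is not None:
        links[link.text] = href

# Indexing the same tag raises KeyError instead of returning None.
try:
    soup.find_all('a')[1]['href']
    index_raised = False
except KeyError:
    index_raised = True

print(links, index_raised)
```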

>>> print(soup.get_text())  # return all text in the document

The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.

...


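>>> soup.a['id'] = 'test_link1'  # change the value of a tag attribute
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="test_link1">Elsie</a>
>>> soup.a.string.replace_with('test_Elsie')  # change the tag text; the replaced string is returned
'Elsie'
>>> soup.a.string
'test_Elsie'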
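Attributes behave like dictionary entries, so they can also be added and deleted, and assigning to .string is a shorter alternative to replace_with(). A minimal sketch on a single made-up tag; the target attribute is added purely for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" id="link1">Elsie</a>', 'html.parser')
link = soup.a

link['id'] = 'test_link1'     # overwrite an existing attribute
link['target'] = '_blank'     # assigning a new key adds the attribute
del link['href']              # del removes an attribute
link.string = 'test_Elsie'    # assigning .string replaces the tag's contents

print(soup)
```

The changes are made to the tree in memory; serializing the soup with str() or prettify() is what produces the modified markup.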

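>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="test_link1">
    test_Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>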

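>>> for child in soup.body.children:   # iterate over direct children
    print(child)



<p class="title"><b>The Dormouse's story</b></p>



<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="test_link1">test_Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>



<p class="story">...</p>

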

>>> for string in soup.strings:  # iterate over every string; output omitted
    print(string)
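Iterating over .strings yields whitespace-only and newline-padded strings as well. When that is unwanted, .stripped_strings and get_text() with strip=True are possible alternatives. A sketch on a shortened, made-up paragraph:

```python
from bs4 import BeautifulSoup

html = '<p>Once upon a time\n<a>Elsie</a>,\n<a>Lacie</a> and\n<a>Tillie</a>;\n</p>'
soup = BeautifulSoup(html, 'html.parser')

# .strings yields every string exactly as it appears, newlines included.
raw = list(soup.strings)

# .stripped_strings trims each string and skips those that become empty.
clean = list(soup.stripped_strings)

# get_text() accepts a separator and strip=True to join the trimmed strings.
joined = soup.get_text(' ', strip=True)

print(raw)
print(clean)
print(joined)
```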

>>> test_doc = '

'

>>> s = BeautifulSoup(test_doc, 'lxml')

>>> for child in s.html.children:   # iterate over direct children
    print(child)

>>> for child in s.html.descendants: # iterate over all descendants
    print(child)
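The difference between the two loops is easiest to see on a small tree. The document below is a stand-in invented for this sketch, not the test_doc used above, and html.parser is used so that no extra parser is required.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical document for illustration only.
doc = '<html><head></head><body><p>one</p><p>two</p></body></html>'
s = BeautifulSoup(doc, 'html.parser')

# .children yields only the nodes directly under <html>.
child_names = [child.name for child in s.html.children
               if isinstance(child, Tag)]

# .descendants walks the whole subtree depth-first, strings included.
descendants = [node.name if isinstance(node, Tag) else str(node)
               for node in s.html.descendants]

print(child_names)
print(descendants)
```

Both attributes are generators, so wrap them in list() when the result needs to be indexed or traversed more than once.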
