Python Web Scraping: Basic Usage of the BeautifulSoup Library

Contents

Basic usage
Tag selectors
Selecting elements
Getting the tag name
Getting attributes
Getting text content
Nested selection
Child and descendant nodes
Parent and ancestor nodes
Sibling nodes
Standard selectors
name
attrs
Using attrs
Without attrs (more convenient)
text
find(name, attrs, recursive, text, **kwargs)
find_parents() and find_parent()
find_next_siblings() and find_next_sibling()
find_previous_siblings() and find_previous_sibling()
find_all_next() and find_next()
find_all_previous() and find_previous()
CSS selectors
Iterative selection
Getting attributes
Getting text content
Summary


BeautifulSoup is a flexible and convenient web-page parsing library. It is efficient, supports multiple parsers, and lets you extract information from web pages without writing regular expressions.

Purpose: parsing web pages.

Parsers

Python standard library
Usage: BeautifulSoup(markup, 'html.parser')
Pros: built into Python, moderate speed, good error tolerance
Cons: poor error tolerance in versions before Python 2.7.3 and 3.2.2

lxml HTML parser
Usage: BeautifulSoup(markup, 'lxml')
Pros: fast, good error tolerance
Cons: requires the lxml C library to be installed

lxml XML parser
Usage: BeautifulSoup(markup, 'xml')
Pros: fast, the only parser that supports XML
Cons: requires the lxml C library to be installed

html5lib
Usage: BeautifulSoup(markup, 'html5lib')
Pros: best error tolerance, parses documents the same way a browser does, produces valid HTML5
Cons: very slow, requires an external Python package
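For reference, a minimal sketch of how the parser is chosen (the markup string here is invented for illustration, and lxml/html5lib must be installed separately for the last two calls to work):

from bs4 import BeautifulSoup

markup = '<p>Hello'  # hypothetical, deliberately unclosed markup
print(BeautifulSoup(markup, 'html.parser').prettify())  # built-in parser, no extra install
print(BeautifulSoup(markup, 'lxml').prettify())         # needs: pip install lxml
print(BeautifulSoup(markup, 'html5lib').prettify())     # needs: pip install html5lib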

 

Basic usage

from bs4 import BeautifulSoup
html = """...page HTML..."""
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())    # pretty-print the HTML and auto-complete missing tags
print(soup.title.string)  # print the text of the <title> tag
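For instance, a small sketch (with markup invented here) of how prettify() indents the document while the parser fills in the missing closing tags:

from bs4 import BeautifulSoup

html = '<html><head><title>Demo</title></head><body><p>unclosed paragraph'  # hypothetical snippet
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())    # indented output; </p>, </body>, </html> are completed by the parser
print(soup.title.string)  # Demo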

Tag selectors

Selecting elements

from bs4 import BeautifulSoup
html = """...page HTML..."""
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
# prints the <title> tag together with its contents
print(type(soup.title))
# <class 'bs4.element.Tag'>
print(soup.head)
# prints the <head> tag together with its contents
print(soup.p)
# prints the first <p> tag together with its contents

Getting the tag name

from bs4 import BeautifulSoup
html = """...page HTML..."""
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)  # prints the tag's name, i.e. 'title'

Getting attributes

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])

Output:

dromouse
dromouse
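One detail worth noting (not covered in the original post): multi-valued attributes such as class come back as a list, while other attributes come back as plain strings. A minimal sketch:

from bs4 import BeautifulSoup

html = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, 'lxml')
print(soup.p['name'])   # dromouse   -- a plain string
print(soup.p['class'])  # ['title']  -- class is multi-valued, so it is returned as a list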

Getting text content

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)

Output:

The Dormouse's story
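Note that .string only works when a tag contains a single string (possibly nested in one child tag); if a tag has several children, .string returns None and get_text() is the safer choice. A minimal sketch with invented markup:

from bs4 import BeautifulSoup

html = '<p>Once upon a time there were <a>three</a> little sisters</p>'  # hypothetical snippet
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)      # None -- the <p> has more than one child node
print(soup.p.get_text())  # Once upon a time there were three little sisters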

Nested selection

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)

Output:

The Dormouse's story

Child and descendant nodes

Example 1:

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)  # all direct children of the first <p> tag

Output:

['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']

.contents returns the children as a list.

Example 2:

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)  # an iterator over the direct children
for i, child in enumerate(soup.p.children):  # enumerate() yields (index, child) pairs
    print(i, child)

.children returns an iterator, so it has to be looped over (or converted to a list) to inspect the nodes.

Example 3:

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)  # an iterator over all descendants (children, grandchildren, ...)
for i, child in enumerate(soup.p.descendants):
    print(i, child)

Parent and ancestor nodes

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)  # print the parent node of the first <a> tag

 

html = """

    
        The Dormouse's story
    
    
        

            Once upon a time there were three little sisters; and their names were                              Elsie                          Lacie             and             Tillie             and they lived at the bottom of a well.         

        

...

""" from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(list(enumerate(soup.a.parents)))#输出所有祖先节点

Sibling nodes

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))      # siblings that come after the first <a> tag
print(list(enumerate(soup.a.previous_siblings)))  # siblings that come before the first <a> tag

Output:

[(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]
[(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')]

Standard selectors

find_all(name, attrs, recursive, text, **kwargs)

Purpose: search the document by tag name, attributes, or text content.

name

Search by tag name.

Example 1:

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))          # extract every <ul> tag
print(type(soup.find_all('ul')[0]))

Output:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>

Example 2:

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

attrs

Using attrs:

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))

Output:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

The attributes are passed as a dictionary: each key is an attribute name and each value is the attribute value to match.

Without attrs (more convenient):

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))  # class is a Python keyword, so the trailing underscore is required

Output:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]

text

Select nodes by their text content.

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))

Output:

['Foo', 'Foo']

This returns the matching text strings themselves rather than the tags that contain them, so it is mainly useful for content matching rather than for locating elements.
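As a small extension (not in the original post), text also accepts a compiled regular expression for fuzzy matching. A minimal sketch with invented markup:

import re
from bs4 import BeautifulSoup

html = '<ul><li>Foo</li><li>Bar</li><li>Foobar</li></ul>'  # hypothetical snippet
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text=re.compile('Foo')))  # ['Foo', 'Foobar']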

find(name, attrs, recursive, text, **kwargs)

find() returns the first matching element (or None if nothing matches), while find_all() returns all matching elements.

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))
print(type(soup.find('ul')))
print(soup.find('page'))

Output:

<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<class 'bs4.element.Tag'>
None

find_parents() and find_parent()

find_parents() returns all ancestor nodes; find_parent() returns the direct parent. A sketch is shown below.
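A minimal sketch (the markup here is invented for illustration):

from bs4 import BeautifulSoup

html = '<body><p class="story"><a id="link1">Elsie</a></p></body>'  # hypothetical snippet
soup = BeautifulSoup(html, 'lxml')
a = soup.a
print(a.find_parent())                     # the direct parent: the <p> tag
print([t.name for t in a.find_parents()])  # all ancestors: ['p', 'body', 'html', '[document]']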

     

find_next_siblings() and find_next_sibling()

find_next_siblings() returns all siblings that come after the node; find_next_sibling() returns the first such sibling. A sketch is shown below.
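A minimal sketch (markup invented for illustration):

from bs4 import BeautifulSoup

html = '<ul><li>Foo</li><li>Bar</li><li>Jay</li></ul>'  # hypothetical snippet
soup = BeautifulSoup(html, 'lxml')
first = soup.li
print(first.find_next_sibling())   # <li>Bar</li>
print(first.find_next_siblings())  # [<li>Bar</li>, <li>Jay</li>]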

     

find_previous_siblings() and find_previous_sibling()

find_previous_siblings() returns all siblings that come before the node; find_previous_sibling() returns the first such sibling (the closest one). A sketch is shown below.
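A minimal sketch (markup invented for illustration):

from bs4 import BeautifulSoup

html = '<ul><li>Foo</li><li>Bar</li><li>Jay</li></ul>'  # hypothetical snippet
soup = BeautifulSoup(html, 'lxml')
last = soup.find_all('li')[-1]
print(last.find_previous_sibling())   # <li>Bar</li>
print(last.find_previous_siblings())  # [<li>Bar</li>, <li>Foo</li>]  (closest sibling first)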

     

find_all_next() and find_next()

find_all_next() returns all qualifying nodes that appear after the current node in the document; find_next() returns the first one. A sketch is shown below.
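A minimal sketch (markup invented for illustration):

from bs4 import BeautifulSoup

html = '<div><h4>Hello</h4><ul><li>Foo</li><li>Bar</li></ul></div>'  # hypothetical snippet
soup = BeautifulSoup(html, 'lxml')
h4 = soup.h4
print(h4.find_next('li'))      # <li>Foo</li> -- the first <li> appearing after the <h4>
print(h4.find_all_next('li'))  # [<li>Foo</li>, <li>Bar</li>]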

     

find_all_previous() and find_previous()

find_all_previous() returns all qualifying nodes that appear before the current node in the document; find_previous() returns the first one. A sketch is shown below.
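A minimal sketch (markup invented for illustration):

from bs4 import BeautifulSoup

html = '<div><h4>Hello</h4><ul><li>Foo</li><li>Bar</li></ul></div>'  # hypothetical snippet
soup = BeautifulSoup(html, 'lxml')
last_li = soup.find_all('li')[-1]
print(last_li.find_previous('h4'))      # <h4>Hello</h4>
print(last_li.find_all_previous('li'))  # [<li>Foo</li>]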

     

CSS selectors

Pass a CSS selector directly to select() to make a selection.

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

Output:

[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>

Iterative selection

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

Getting attributes

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])        # method 1
    print(ul.attrs['id'])  # method 2

Output:

list-1
list-1
list-2
list-2
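As a small addition (not in the original post), bs4 also provides select_one(), which returns only the first match of a CSS selector. A minimal sketch with invented markup:

from bs4 import BeautifulSoup

html = '<ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>'  # hypothetical snippet
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('#list-1 .element'))  # <li class="element">Foo</li>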

Getting text content

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())

Output:

Foo
Bar
Jay
Foo
Bar

Summary

• The lxml parser is recommended; fall back to html.parser when necessary (e.g. when the markup is extremely messy).

• Tag selectors (soup.tag) are fast but offer only weak filtering.

• Use find() and find_all() to match a single result or multiple results.

• If you are comfortable with CSS selectors, use select().

• Remember the common ways to get attribute values and text.


     
