Python爬虫库BeautifulSoup的介绍与简单使用实例

BeautifulSoup是一个可以从HTML或XML文件中提取数据的Python库,本文为大家介绍下Python爬虫库BeautifulSoup的介绍与简单使用实例其中包括了,BeautifulSoup解析HTML,BeautifulSoup获取内容,BeautifulSoup节点操作,BeautifulSoup获取CSS属性等实例
一、介绍

BeautifulSoup库是灵活又方便的网页解析库,处理高效,支持多种解析器。利用它不用编写正则表达式即可方便地实现网页信息的提取。

Python常用解析库Python爬虫库BeautifulSoup的介绍与简单使用实例_第1张图片
二、快速开始

给定html文档,产生BeautifulSoup对象

from bs4 import BeautifulSoup
html_doc = """
The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.

...

""" soup = BeautifulSoup(html_doc,'lxml')

输出完整文本

print(soup.prettify())

 
 
  The Dormouse's story
 
 
 
 

The Dormouse's story

Once upon a time there were three little sisters; and their names were Elsie , Lacie and Tillie ; and they lived at the bottom of a well.

...

浏览结构化数据

print(soup.title) #标签及内容
print(soup.title.name) #<title>name属性
print(soup.title.string) #<title>内的字符串
print(soup.title.parent.name) #<title>的父标签name属性(head)
print(soup.p) # 第一个<p></p>
print(soup.p['class']) #第一个<p></p>的class
print(soup.a) # 第一个<a></a>
print(soup.find_all('a')) # 所有<a></a>
print(soup.find(id="link3")) # 所有id='link3'的标签
</code></pre> 
  <pre><code><title>The Dormouse's story
title
The Dormouse's story
head

The Dormouse's story

['title'] Elsie [Elsie, Lacie, Tillie]

找出所有标签内的链接

for link in soup.find_all('a'):
  print(link.get('href'))
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

获得所有文字内容

print(soup.get_text())
The Dormouse's story
 
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

自动补全标签并进行格式化

html = """
The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were , Lacie and Tillie; and they lived at the bottom of a well.

...

""" from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml')#传入解析器:lxml print(soup.prettify())#格式化代码,自动补全 print(soup.title.string)#得到title标签里的内容

标签选择器

选择元素

html = """
The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were , Lacie and Tillie; and they lived at the bottom of a well.

...

""" from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml')#传入解析器:lxml print(soup.title)#选择了title标签 print(type(soup.title))#查看类型 print(soup.head)

获取标签名称

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')#传入解析器:lxml
print(soup.title.name)

获取标签属性

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')#传入解析器:lxml
print(soup.p.attrs['name'])#获取p标签中,name这个属性的值
print(soup.p['name'])#另一种写法,比较直接

获取标签内容

print(soup.p.string)

标签嵌套选择

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')#传入解析器:lxml
print(soup.head.title.string)

子节点和子孙节点

html = """

  
    The Dormouse's story
  
  
    

Once upon a time there were three little sisters; and their names were Elsie Lacie and Tillie and they lived at the bottom of a well.

...

""" from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml')#传入解析器:lxml print(soup.p.contents)#获取指定标签的子节点,类型是list

另一个方法,child:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')#传入解析器:lxml
print(soup.p.children)#获取指定标签的子节点的迭代器对象
for i,children in enumerate(soup.p.children):#i接受索引,children接受内容
    print(i,children)

输出结果与上面的一样,多了一个索引。注意,只能用循环来迭代出子节点的信息。因为直接返回的只是一个迭代器对象。

获取子孙节点:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')#传入解析器:lxml
print(soup.p.descendants)#获取指定标签的子孙节点的迭代器对象
for i,child in enumerate(soup.p.descendants):#i接受索引,child接受内容
    print(i,child)

父节点和祖先节点

parent

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')#传入解析器:lxml
print(soup.a.parent)#获取指定标签的父节点

parents

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')#传入解析器:lxml
print(list(enumerate(soup.a.parents)))#获取指定标签的祖先节点

兄弟节点

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')#传入解析器:lxml
print(list(enumerate(soup.a.next_siblings)))#获取指定标签的后面的兄弟节点
print(list(enumerate(soup.a.previous_siblings)))#获取指定标签的前面的兄弟节点

标准选择器

find_all( name , attrs , recursive , text , **kwargs )

可根据标签名、属性、内容查找文档。

name

html='''

Hello

  • Foo
  • Bar
  • Jay
  • Foo
  • Bar
''' from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.find_all('ul'))#查找所有ul标签下的内容 print(type(soup.find_all('ul')[0]))#查看其类型

下面的例子就是查找所有ul标签下的li标签:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
  print(ul.find_all('li'))

attrs(属性)

通过属性进行元素的查找

html='''

Hello

  • Foo
  • Bar
  • Jay
  • Foo
  • Bar
''' from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.find_all(attrs={'id': 'list-1'}))#传入的是一个字典类型,也就是想要查找的属性 print(soup.find_all(attrs={'name': 'elements'}))

查找到的是同样的内容,因为这两个属性是在同一个标签里面的。

特殊类型的参数查找

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))#id是个特殊的属性,可以直接使用
print(soup.find_all(class_='element')) #class是关键字所以要用class_

text

根据文本内容来进行选择:

html='''

Hello

  • Foo
  • Bar
  • Jay
  • Foo
  • Bar
''' from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.find_all(text='Foo'))#查找文本为Foo的内容,但是返回的不是标签

所以说这个text在做内容匹配的时候比较方便,但是在做内容查找的时候并不是太方便。

方法

find

find用法和findall一模一样,但是返回的是找到的第一个符合条件的内容输出。

ind_parents(), find_parent()

find_parents()返回所有祖先节点,find_parent()返回直接父节点。

find_next_siblings() ,find_next_sibling()

find_next_siblings()返回后面的所有兄弟节点,find_next_sibling()返回后面的第一个兄弟节点

find_previous_siblings(),find_previous_sibling()

find_previous_siblings()返回前面所有兄弟节点,find_previous_sibling()返回前面第一个兄弟节点

find_all_next(),find_next()

find_all_next()返回节点后所有符合条件的节点,find_next()返回后面第一个符合条件的节点

find_all_previous(),find_previous()

find_all_previous()返回节点前所有符合条件的节点,find_previous()返回前面第一个符合条件的节点

CSS选择器 通过select()直接传入CSS选择器即可完成选择

html='''

Hello

  • Foo
  • Bar
  • Jay
  • Foo
  • Bar
''' from bs4 import BeautifulSoup soup = BeautifulSoup(html, 'lxml') print(soup.select('.panel .panel-heading'))#.代表class,中间需要空格来分隔 print(soup.select('ul li')) #选择ul标签下面的li标签 print(soup.select('#list-2 .element')) #'#'代表id。这句的意思是查找id为"list-2"的标签下的,class=element的元素 print(type(soup.select('ul')[0]))#打印节点类型

再看看层层嵌套的选择:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

获取属性

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
  print(ul['id'])# 用[ ]即可获取属性
  print(ul.attrs['id'])#另一种写法

获取内容

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
  print(li.get_text())

用get_text()方法就能获取内容了。

总结

推荐使用lxml解析库,必要时使用html.parser

标签选择筛选功能弱但是速度快 建议使用find()、find_all() 查询匹配单个结果或者多个结果

如果对CSS选择器熟悉建议使用select()

记住常用的获取属性和文本值的方法
推荐我们的Python学习扣qun:913066266 ,看看前辈们是如何学习的!从基础的python脚本到web开发、爬虫、django、数据挖掘等【PDF,实战源码】,零基础到项目实战的资料都有整理。送给每一位python的小伙伴!每天都有大牛定时讲解Python技术,分享一些学习的方法和需要注意的小细节,点击加入我们的 python学习者聚集地

你可能感兴趣的:(python爬虫教程,python,编程语言)