beautifulsoup4 Tutorial (2): The Four Object Types in bs4


3. The Four Object Types

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four types:
Tag
NavigableString
BeautifulSoup
Comment
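As a quick orientation, the sketch below prints the class of one object of each type. The two-tag snippet is a hypothetical cut-down of the tutorial's document, and it assumes Python's built-in html.parser rather than lxml so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one ordinary tag and one tag that contains only a comment.
snippet = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<a id="link1"><!-- Elsie --></a>'
)
soup = BeautifulSoup(snippet, "html.parser")

# Map each expression to the class name of the object it returns.
kinds = {
    "soup": type(soup).__name__,
    "soup.p": type(soup.p).__name__,
    "soup.b.string": type(soup.b.string).__name__,
    "soup.a.string": type(soup.a.string).__name__,
}
for expr, kind in kinds.items():
    print(expr, "->", kind)
```

Each of the four types is covered in its own subsection.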

3.1 Tag
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p>Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, Lacie and Tillie; and they lived at the bottom of a well.</p>

<p>...</p>
</body></html>
"""
# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
# soup = BeautifulSoup(open('index.html'))
soup = BeautifulSoup(html, features="lxml")

# Print the first tag of each kind
print(soup.title)
print(soup.head)
print(soup.a)
print(soup.p)

result:
<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
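  1. Writing soup followed by a tag name (for example soup.title) is an easy way to get that tag together with its contents.
  2. It returns only the first tag in the whole document that matches.
  3. The type of these objects is <class 'bs4.element.Tag'>.
  4. A Tag object has two important attributes:
  • name

The name of the tag, such as head or p. The BeautifulSoup object itself has the special name [document].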

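from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p>Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, Lacie and Tillie; and they lived at the bottom of a well.</p>

<p>...</p>
</body></html>
"""
# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
# soup = BeautifulSoup(open('index.html'))
soup = BeautifulSoup(html, features="lxml")

# Print the name of the document object and of the <head> tag
print(soup.name)
print(soup.head.name)

result:
[document]
head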
  • attrs

Returns the tag's attributes as a dictionary.

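from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p>Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, Lacie and Tillie; and they lived at the bottom of a well.</p>

<p>...</p>
</body></html>
"""
# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
# soup = BeautifulSoup(open('index.html'))
soup = BeautifulSoup(html, features="lxml")

# Read all attributes through the Tag object's attrs attribute
print(soup.p.attrs)
# Read a single attribute
print(soup.p.attrs['class'])
print(soup.p.get('class'))

result:
{'class': ['title'], 'name': 'dromouse'}
['title']
['title']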
  • Because attrs is a dictionary, attributes can also be modified and deleted.
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p>Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, Lacie and Tillie; and they lived at the bottom of a well.</p>

<p>...</p>
</body></html>
"""
# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
# soup = BeautifulSoup(open('index.html'))
soup = BeautifulSoup(html, features="lxml")

# Modify an attribute of the Tag object
soup.p['class'] = "newClassname"
print(soup.p)
# Delete an attribute of the Tag object
del soup.p['class']
print(soup.p)

result:
<p class="newClassname" name="dromouse"><b>The Dormouse's story</b></p>
<p name="dromouse"><b>The Dormouse's story</b></p>
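Reading an attribute that does not exist behaves differently depending on how it is accessed. The following is a minimal sketch of that difference; the one-line snippet and the "no-id" default are illustrative choices, and html.parser is assumed instead of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the first <p> of the tutorial document on its own.
snippet = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
tag = BeautifulSoup(snippet, "html.parser").p

# get() returns None, or a chosen default, when the attribute is absent.
missing = tag.get("id")
fallback = tag.get("id", "no-id")

# Square brackets raise KeyError for a missing attribute, like a dict.
try:
    tag["id"]
    raised = False
except KeyError:
    raised = True

# has_attr() tests for presence without reading the value.
has_class = tag.has_attr("class")

print(missing, fallback, raised, has_class)
```

When scraping pages whose markup varies, get() is usually the safer choice because a missing attribute does not interrupt the run.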

3.2 NavigableString

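  • Purpose: gets the text inside a tag
  • Literal meaning: a string that can be navigated
  • Usage: soup.p.string
  • Object type: <class 'bs4.element.NavigableString'>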
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p>Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, Lacie and Tillie; and they lived at the bottom of a well.</p>

<p>...</p>
</body></html>
"""
# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
# soup = BeautifulSoup(open('index.html'))
soup = BeautifulSoup(html, features="lxml")

# Get the text inside the tag
print(soup.p.string)
print(type(soup.p.string))

result:
The Dormouse's story
<class 'bs4.element.NavigableString'>
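Two details of .string are easy to miss: the returned NavigableString is still an ordinary Python string that remembers its place in the tree, and .string is None when a tag has more than one child. Here is a small sketch of both, using a hypothetical two-paragraph snippet and html.parser instead of lxml; find_all is used only to pick out the two paragraphs.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one <p> with a single nested string, one with mixed children.
snippet = "<p><b>The Dormouse's story</b></p><p>Lacie and <i>Tillie</i></p>"
soup = BeautifulSoup(snippet, "html.parser")
first, second = soup.find_all("p")

text = first.string               # found through the single <b> child
is_str = isinstance(text, str)    # NavigableString is a subclass of str
plain = str(text)                 # plain copy with no link back to the tree
parent_name = text.parent.name    # the string still knows its parent tag

mixed = second.string             # None: this <p> has more than one child

print(is_str, plain, parent_name, mixed)
```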

3.3 BeautifulSoup

  • The document object, which represents the content of the whole document.
  • It can be treated as a Tag object.
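from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p>Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, Lacie and Tillie; and they lived at the bottom of a well.</p>

<p>...</p>
</body></html>
"""
# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
# soup = BeautifulSoup(open('index.html'))
soup = BeautifulSoup(html, features="lxml")

print(soup.name)
print(type(soup.name))
# Note: soup.attr (without the s) would search for an <attr> tag and print None
print(soup.attrs)

result:
[document]
<class 'str'>
{}

On Python 2 the second line is <type 'unicode'>.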
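The statement that the document object can be treated as a Tag can be checked directly. The sketch below assumes a hypothetical one-line document and html.parser rather than lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical minimal document.
soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)

is_tag = isinstance(soup, Tag)            # BeautifulSoup is a subclass of Tag
doc_name = soup.name                      # special name of the document object
doc_attrs = soup.attrs                    # the document itself has no attributes
title_text = str(soup.find("title").string)  # Tag methods work on soup too

print(is_tag, doc_name, doc_attrs, title_text)
```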

3.4 Comment

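  • A Comment object is a special type of NavigableString object.
  • If the content inside a tag is a comment, for example <!-- Elsie -->, then .string returns a Comment object instead of a plain NavigableString, and the comment markers are stripped from the output.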
print(soup.a)
print(soup.a.string)
print(type(soup.a.string))

result:
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
<class 'bs4.element.Comment'>
  • Because the printed text of a Comment looks like ordinary text, check whether the object is a Comment or a NavigableString before using it.
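One way to perform that check is isinstance against bs4.element.Comment. The following is a sketch, not the only approach: the helper name visible_text and the two-link snippet are hypothetical, and html.parser is assumed instead of lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical snippet: one link containing only a comment, one with real text.
snippet = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(snippet, "html.parser")


def visible_text(tag):
    """Return the tag's string, or None if it is a comment or has no single string."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


results = [visible_text(a) for a in soup.find_all("a")]
print(results)  # [None, 'Lacie']
```

Comment must be tested first because every Comment is also a NavigableString, so a check for NavigableString alone would accept both.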
