Python Web Scraping and Information Extraction - Day 6 - The Beautiful Soup Library

Installing the Beautiful Soup library:

pip install beautifulsoup4

 

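A quick test of the Beautiful Soup installation

Demo HTML page: http://python123.io/ws/demo.html

1. Get the HTML source by hand

Open the page in a browser, right-click, and choose "View page source".

2. Get it with requests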

import requests

r = requests.get("http://python123.io/ws/demo.html")

r.text

demo = r.text
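
The fetch above can be wrapped in a small helper that also checks the HTTP status and sets the encoding. This is only a sketch: the function name `fetch_html` and the 10-second timeout are assumptions made for illustration, not part of the original notes.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded HTML text of url, raising on HTTP errors."""
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()              # raises requests.HTTPError on 4xx/5xx
    r.encoding = r.apparent_encoding  # encoding guessed from the response body
    return r.text


# Usage (needs network access):
# demo = fetch_html("http://python123.io/ws/demo.html")
```

Returning the text from a function keeps the later parsing steps independent of how the page was obtained.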

 

Testing the installation with Beautiful Soup

from bs4 import BeautifulSoup
soup = BeautifulSoup(demo,"html.parser")
print(soup.prettify())
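
The same check can be run without network access. The sketch below parses a short inline string instead of the downloaded page; `sample_html` is a made-up stand-in for `demo`, not the real content of demo.html.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real demo.html).
sample_html = "<html><head><title>Test page</title></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns a str with each tag and string on its own line,
# indented according to its depth in the tag tree.
pretty = soup.prettify()
print(pretty)
```

If the import succeeds and an indented tree is printed, the installation works.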


from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>',"html.parser")

Basic elements of the Beautiful Soup library

 

HTML document <==> tag tree

Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree".

 

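<p>…</p>: a tag (Tag)

Name: the tag's name, which appears in pairs (in the opening and the closing tag)

Attributes: zero or more per tag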

 

The Beautiful Soup library is also called beautifulsoup4 or bs4.

The conventional imports are shown below; in practice you mainly use the BeautifulSoup class.

from bs4 import BeautifulSoup
import bs4

HTML document <==> tag tree <==> BeautifulSoup

>>> from bs4 import BeautifulSoup

>>> soup = BeautifulSoup("data","html.parser")

>>> soup2 = BeautifulSoup(open("D://demo.html"),"html.parser")
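
When the source is a file, a `with` block makes sure the handle is closed after parsing. The sketch below writes its own temporary file so it can run anywhere; the file content and path are made up for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Write a small hypothetical HTML file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "demo.html")
with open(path, "w", encoding="utf-8") as f:
    f.write("<html><body><p>data</p></body></html>")

# BeautifulSoup accepts an open file handle as well as a string;
# the with-block closes the file once parsing is done.
with open(path, encoding="utf-8") as f:
    soup2 = BeautifulSoup(f, "html.parser")

print(soup2.p.string)
```

Passing an explicit `encoding` avoids depending on the platform default when the file is opened.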

 

Beautiful Soup parsers

soup = BeautifulSoup("data","html.parser")

Parser                  Usage                               Requirement
bs4's HTML parser       BeautifulSoup(mk,'html.parser')     install bs4
lxml's HTML parser      BeautifulSoup(mk,'lxml')            pip install lxml
lxml's XML parser       BeautifulSoup(mk,'xml')             pip install lxml
html5lib's parser       BeautifulSoup(mk,'html5lib')        pip install html5lib
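
Because lxml and html5lib are optional installs, a script can try them in order and fall back to the built-in parser. This is a minimal sketch: the helper name `make_soup` and the preference order are assumptions. bs4 raises `FeatureNotFound` when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>data</p>")
print(used, soup.p.string)
```

Different parsers may build slightly different trees from invalid HTML, so it can be worth logging which one was actually used.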


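Basic elements of the BeautifulSoup class

(1) Tag

The most basic unit of information organization; <> and </> mark its start and end.

Any tag that exists in HTML syntax can be accessed as soup.<tag>.

When the HTML document contains several <tag> elements with the same name, soup.<tag> returns the first one.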

>>> from bs4 import BeautifulSoup

>>> soup = BeautifulSoup(demo,"html.parser")

>>> soup.title

<title>This is a python demo page</title>

>>> tag = soup.a

>>> tag

<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>
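
To make the "first match only" rule concrete, the sketch below compares attribute access with `find_all`, which returns every match. The HTML fragment is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two <a> tags.
html = '<p><a href="/one" id="link1">First</a> and <a href="/two" id="link2">Second</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a                  # attribute access: only the first <a>
all_links = soup.find_all("a")  # every <a>, in document order
missing = soup.table            # no <table> in this fragment, so None

print(first["id"], len(all_links), missing)
```

Since a missing tag yields `None` rather than an exception, it is worth checking the result before chaining further attribute access.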

 

(2) Tag name

The name of <p>…</p> is 'p'. Format: <tag>.name

Every <tag> has its own name, obtained through <tag>.name; it is a string.

>>> soup.a.name

'a'

>>> soup.a.parent.name

'p'

>>> soup.a.parent.parent.name

'body'
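
Rather than chaining `.parent` by hand, the `.parents` generator walks all the way up the tree. A small sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <a> nested in <p>, <body>, <html>.
html = "<html><body><p><a href='/x'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk from <a> up to the root, collecting each ancestor's name.
names = [parent.name for parent in soup.a.parents]
print(names)
```

The last ancestor is the BeautifulSoup object itself, whose name is the special string '[document]'.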

 

(3) Tag attrs (attributes)

Attributes are organized as a dictionary. Format: <tag>.attrs

A <tag> can have zero or more attributes; .attrs is a dict.

>>> tag = soup.a

>>> tag.attrs

{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}

>>> tag.attrs['class']

['py1']

>>> tag.attrs['href']

'http://www.icourse163.org/course/BIT-268001'

>>> type(tag.attrs)

<class 'dict'>

>>> type(tag)

<class 'bs4.element.Tag'>
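
Besides `tag.attrs[...]`, a tag can be indexed directly, and `get` avoids a KeyError for attributes that may be absent. The sketch below uses a made-up `<a>` tag with an example.com URL.

```python
from bs4 import BeautifulSoup

# Hypothetical tag; the URL and class names are placeholders.
html = '<a href="http://example.com/" class="py1 big" id="link1">Basic Python</a>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

href = tag["href"]          # shorthand for tag.attrs["href"]
classes = tag["class"]      # class is multi-valued, so it comes back as a list
title = tag.get("title")    # missing attribute: None instead of KeyError
has_id = tag.has_attr("id")

print(href, classes, title, has_id)
```

This also explains why `tag.attrs['class']` shows a list in the session above: one element can carry several classes.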

 

(4) Tag NavigableString

The non-attribute string inside a tag, that is, the string between <> and </>. Format: <tag>.string

>>> soup.a

<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>

>>> soup.a.string

'Basic Python'

>>> soup.p

<p class="title"><b>The demo python introduces several python courses.</b></p>

>>> soup.p.string

'The demo python introduces several python courses.'

>>> type(soup.p.string)

<class 'bs4.element.NavigableString'>

 

A NavigableString can span multiple levels: <tag>.string still returns the text when it is wrapped in a single nested tag.
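
The sketch below shows where that rule stops applying: `.string` reaches through a single nested child but returns None once a tag has more than one child. The fragment is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical fragment: one tag with a single child, one with mixed content.
html = "<p><b>bold text</b></p><div>plain <i>mixed</i></div>"
soup = BeautifulSoup(html, "html.parser")

nested = soup.p.string      # <p> has one child <b>, so .string reaches through it
mixed = soup.div.string     # <div> has two children, so .string is None
full = soup.div.get_text()  # get_text() joins all descendant strings

print(repr(nested), mixed, repr(full))
```

When a tag may hold mixed content, `get_text()` is usually the safer way to read its text.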

 

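(5) Tag Comment

The comment part of the string inside a tag; it has its own special type, Comment.

>>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>","html.parser")

>>> newsoup.b.string

'This is a comment'

>>> type(newsoup.b.string)

<class 'bs4.element.Comment'>

>>> newsoup.p.string

'This is not a comment'

>>> type(newsoup.p.string)

<class 'bs4.element.NavigableString'>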
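
Because `.string` drops the `<!-- -->` markers, a comment and ordinary text look the same when printed, so code that extracts text usually has to check the type. A minimal sketch, with `is_comment` as a hypothetical helper name:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)


def is_comment(node):
    """True when node is an HTML comment rather than ordinary text."""
    return isinstance(node, Comment)


print(is_comment(newsoup.b.string), is_comment(newsoup.p.string))

# Keep only ordinary text nodes, skipping comments.
# Comment is a subclass of NavigableString, so both checks are needed.
visible = [
    str(node)
    for node in newsoup.descendants
    if isinstance(node, NavigableString) and not is_comment(node)
]
print(visible)
```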



