Beautiful Soup 4

Installing Beautiful Soup 4

pip install beautifulsoup4

What is Beautiful Soup?

Beautiful Soup is a Python library for parsing HTML and XML. It makes it easy to extract data from web pages.

Parsers supported by Beautiful Soup

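Parser | Usage | Advantages
Python standard library | BeautifulSoup(markup, 'html.parser') | Built into Python, moderate speed, lenient with malformed documents
lxml HTML parser | BeautifulSoup(markup, 'lxml') | Fast, lenient with malformed documents
lxml XML parser | BeautifulSoup(markup, 'xml') | Fast, the only XML parser supported
html5lib | BeautifulSoup(markup, 'html5lib') | Most lenient, parses documents the way a browser does, produces valid HTML5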
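
Only html.parser ships with Python; lxml and html5lib are separate packages. The following is a minimal sketch of picking whichever parser happens to be installed. The helper name parse_with_available and the preference order are assumptions made for illustration, not part of Beautiful Soup itself.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_available(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Return (parser_name, soup) using the first parser that is installed.

    Hypothetical helper: the preference order is an assumption.
    """
    for name in parsers:
        try:
            return name, BeautifulSoup(markup, name)
        except FeatureNotFound:
            # Beautiful Soup raises FeatureNotFound when the requested
            # parser library is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


parser_name, soup = parse_with_available("<p>Hello, <b>world</b></p>")
print(parser_name, soup.b.get_text())
```

Because the parsers can build different trees from the same malformed markup, it is usually safer to name one parser explicitly in real code rather than rely on a fallback like this.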

Usage

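  1. Import: from bs4 import BeautifulSoup
  2. Initialize: soup = BeautifulSoup(html_text, 'lxml')
  3. Search methods and their parameters:
    find(): returns the first matching node
    find_all(): returns all matching nodes
    name: a tag name, a regular expression, or a list of tag names such as ['a', 'img']
    attrs: a dict of attribute names and values that the tag must match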
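
The steps above can be sketched as follows. The HTML snippet is made up for illustration, and html.parser is used here only so the sketch runs without extra packages.

```python
import re

from bs4 import BeautifulSoup

# Made-up markup, used only to exercise find() and find_all().
html = """
<div id="main">
  <a href="/home" class="nav">Home</a>
  <a href="/about" class="nav">About</a>
  <img src="/logo.png" alt="logo">
  <h1>Title</h1>
  <h2>Subtitle</h2>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, or None when nothing matches.
first_link = soup.find("a")
missing = soup.find("table")

# find_all() returns every match, in document order.
all_links = soup.find_all("a")

# name can be a list of tag names...
links_and_images = soup.find_all(["a", "img"])

# ...or a compiled regular expression matched against tag names.
headings = soup.find_all(re.compile(r"^h[1-6]$"))

# attrs is a dict of attribute names and the values they must have.
nav_links = soup.find_all(attrs={"class": "nav"})

print(first_link["href"], len(all_links), [h.name for h in headings])
```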

A short example

from bs4 import BeautifulSoup

def parse_page_data(self, response):
    # Excerpt from a class (note `self`); `response` is expected to be the page's HTML text.
    # Use bs4 to extract the data.
    soup = BeautifulSoup(response, 'lxml')
    # Take every <dl> inside the first element whose class is "scores_List".
    ranks = soup.find_all(attrs={'class': 'scores_List'})[0].find_all('dl')

    for dl in ranks:
        school_info = {}
        school_info['url'] = dl.select('dt a')[0].attrs['href']
        school_info['icon'] = dl.select('dt a img')[0].attrs['src']
        school_info['name'] = dl.select('dt > strong a')[0].text
        school_info['address'] = dl.select('dd > ul > li')[0].text
        # Join the text of every <span> in the second <li> with commas.
        school_info['test'] = ','.join([span.text for span in dl.select('dd > ul > li')[1].select('span')])
        school_info['type'] = dl.select('dd > ul > li')[2].text
        print(school_info)
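
The example relies on select(), which takes a CSS selector and returns a list of matching tags; select_one() returns only the first match, or None. Below is a minimal, self-contained sketch of the same extraction. The markup, the URLs, and the function name parse_school_list are hypothetical: they only mirror the class name and nesting that the selectors above expect, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the structure the selectors expect.
SAMPLE_HTML = """
<div class="scores_List">
  <dl>
    <dt>
      <a href="https://example.com/school/1"><img src="https://example.com/icon/1.png"></a>
      <strong><a href="https://example.com/school/1">Example University</a></strong>
    </dt>
    <dd>
      <ul>
        <li>Location: Example City</li>
        <li>Features: <span>Tag A</span> <span>Tag B</span></li>
        <li>Type: Comprehensive</li>
      </ul>
    </dd>
  </dl>
</div>
"""


def parse_school_list(html):
    """Return one dict per <dl> found inside the scores_List container."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(attrs={"class": "scores_List"})
    if container is None:
        return []

    schools = []
    for dl in container.find_all("dl"):
        items = dl.select("dd > ul > li")  # the three <li> rows
        schools.append({
            "url": dl.select_one("dt a")["href"],
            "icon": dl.select_one("dt a img")["src"],
            "name": dl.select_one("dt > strong a").get_text(strip=True),
            "address": items[0].get_text(strip=True),
            "test": ",".join(s.get_text(strip=True) for s in items[1].select("span")),
            "type": items[2].get_text(strip=True),
        })
    return schools


schools = parse_school_list(SAMPLE_HTML)
print(schools)
```

Returning a list instead of printing inside the loop is a design choice of this sketch; it makes the result easier to reuse or check.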
