Stage 3, Day 27: Crawlers, Part 2

Using Parsing Libraries for Scraping

Previously, our crawlers extracted information by matching with regular expressions. A more common approach is to use a parsing library. Two options are popular at the moment: XPath (typically via lxml) and Beautiful Soup.

1. XPath

Before using it, install the lxml library into the current virtual environment. lxml can parse both HTML and XML and supports XPath queries.

pip install lxml

The XPath matching rules are covered in detail below:


import requests
from lxml import etree


# Fetch the page HTML
def get_one_page():
    url = "https://www.douban.com/group/explore"
    headers = {
        "User-Agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)"
    }
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        text = response.content.decode('utf-8')
        return text
    return None


# Parse the page
def parse_with_xpath(html):
    etree_html = etree.HTML(html)
    # print(etree_html)

    channel_result = etree_html.xpath('//div[@class="channel-item"]')
    for channel in channel_result:
        title = channel.xpath('./div[@class="bd"]/h3/a/text()')[0]
        print(title)

    # title_result = etree_html.xpath('//div[@class="channel-item"]/div[@class="bd"]/h3/a/text()')
    # print(title_result)

    # Match all nodes: //*
    # result = etree_html.xpath('//*')
    # print(result)
    # print(len(result))

    # Match all a nodes in the document: //a; get their text with text()
    # result = etree_html.xpath('//a/text()')
    # print(result)

    # Select direct children of an element: /
    # result = etree_html.xpath('//div/p/text()')
    # print(result)

    # Select all descendants of an element: //
    # result = etree_html.xpath('//div[@class="channel-item"]')
    # result = etree_html.xpath('//div[@class="channel-item"] | //span[@class="pubtime"]/../span/a/text()')
    # print(len(result))
    # print(result)


    # Parent node: .. (moves up one level)
    # result = etree_html.xpath('//span[@class="pubtime"]/../span/a/text()')
    # print(len(result))
    # print(result)

    # Attribute matching: [@class="xxx"]
    # Text matching: text(); get all text under a node with //text()
    # result = etree_html.xpath('//div[@class="article"]//text()')
    # print(result)

    # Attribute extraction: @href
    # result = etree_html.xpath('//div[@class="article"]/div/div/@class')[0]
    # result = etree_html.xpath('//div[@class="bd"]/h3/a/@href')
    # print(result)

    # Multi-valued attribute matching: contains(@class, "xx")
    # result = etree_html.xpath('//div[contains(@class, "grid-16-8")]//div[@class="likes"]/text()[1]')
    # print(result)

    # Multiple-attribute matching and operators: or, and, mod, //book | //cd, + - * div = != < > <= >=
    # result = etree_html.xpath('//span[@class="pubtime" and contains(text(), "12:")]/text()')
    # print(result)

    # Positional selection: [1] [last()] [position() < 3] [last() - 2]
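    # For example (hypothetical selectors, reusing the page structure above):
    # result = etree_html.xpath('//div[@class="channel-item"][position() < 3]//h3/a/text()')
    # result = etree_html.xpath('//div[@class="channel-item"][last()]//h3/a/text()')
    # print(result)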
    # Node axes
    result = etree_html.xpath('//div/child::div[@class="likes"]/following-sibling::*')
    print(result)
    print(len(result))

    # //li/ancestor::*   ancestor axis: all ancestor nodes
    # //li/ancestor::div   only the div ancestors
    # //li/attribute::*   attribute axis: all attribute values of the li node
    # //li/child::a[@href="link1.html"]   child axis: direct child nodes
    # //li/descendant::span   descendant axis: all span descendants
    # //li/following::*   all nodes after the closing tag of the current node
    # //li/following-sibling::*   all siblings after the current node
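    # For example (hypothetical, exercising the axes listed above):
    # result = etree_html.xpath('//div[@class="likes"]/ancestor::div[1]')
    # result = etree_html.xpath('//div[@class="channel-item"]/attribute::*')
    # print(result)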


    # result = etree_html.xpath('//div[@class="channel-item"][1]/following-sibling::*')
    # print(result)
    # print(len(result))

    
    # result = etree_html.xpath('//div[contains(@class, "channel-group-rec")]//div[@class="title"]/following::*[1]/text()')
    # print(result)


def main():
    html = get_one_page()
    # print(html)
    if html:
        parse_with_xpath(html)

if __name__ == '__main__':
    main()
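
The example above depends on the live douban page, whose structure may change. Here is a self-contained sketch of the same XPath rules run against an inline HTML string; the markup is hypothetical but mirrors the structure used above:

from lxml import etree

sample = """
<div class="channel-item">
    <div class="bd"><h3><a href="/topic/1">First title</a></h3></div>
    <div class="likes">12 likes</div>
</div>
<div class="channel-item">
    <div class="bd"><h3><a href="/topic/2">Second title</a></h3></div>
    <div class="likes">34 likes</div>
</div>
"""

root = etree.HTML(sample)  # lxml wraps the fragment in html/body automatically
print(root.xpath('//div[@class="bd"]/h3/a/text()'))   # text: ['First title', 'Second title']
print(root.xpath('//div[@class="bd"]/h3/a/@href'))    # attributes: ['/topic/1', '/topic/2']
print(root.xpath('//div[@class="channel-item"][1]//a/text()'))  # positional: ['First title']
print(root.xpath('//div[@class="likes"]/../div[@class="bd"]//a/text()'))  # parent axis: both titles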

For a systematic introduction to XPath, see http://www.runoob.com/xpath/xpath-tutorial.html

2. Beautiful Soup

Likewise, install beautifulsoup4 into the current virtual environment.

pip install beautifulsoup4
from bs4 import BeautifulSoup


Usage:
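The snippets in this section need an html string to work on. A minimal sample to experiment with (the markup is hypothetical; any page source would do):

html = """
<html><head><title>this is title</title></head>
<body>
<p class="title" name="dromouse">The Dormouse's story</p>
<p class="story">Once upon a time there were
  <a href="link1.html" class="sister" id="link1">Elsie</a> and
  <a href="link2.html" class="sister" id="link2">Lacie</a>.
</p>
<img src="demo.png">
<ul id="list-1"><li>item 1</li><li>item 2</li></ul>
</body></html>
"""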
Create the document object:

soup = BeautifulSoup(html, "lxml")  # build a BeautifulSoup object using the lxml parser

print(soup.prettify())  # print the page with indented formatting
print(soup.title.string)  # get the page title text
print(soup.head)  # get the page head
print(soup.p)  # get the first p tag in the page

# Get the name of a node
print(soup.title.name)

# Get node attributes
print(soup.img.attrs["src"])
print(soup.p.attrs)
print(soup.p.attrs["name"])
print(soup.p["class"])

# Get the text contained in a node
print(soup.p.string)


Nested selection

# Every node returned by soup is of type bs4.element.Tag, so selections can be chained
print(soup.head.title.string)  # this is title


Associated selection

Some elements have no distinctive attributes to locate them by directly. In that case, first select a node that can be located, then navigate from it to its child, parent, or sibling nodes.

print(soup.p.contents)  # list of the p node's direct children
print(soup.p.descendants)  # all descendants of the p node (a generator)
print(soup.a.parent)  # parent node of the a node
print(soup.a.parents)  # all ancestor nodes (a generator)
print(soup.a.next_sibling)  # next sibling node
print(soup.a.previous_sibling)  # previous sibling node
print(soup.a.next_siblings)  # all following siblings (a generator)
print(soup.a.previous_siblings)  # all preceding siblings (a generator)
print(list(soup.a.parents)[0].attrs['class'])

Method selectors

Find nodes by attributes and text:
print(soup.find_all(name="ul"))
for ul in soup.find_all(name="ul"):
    print(ul.find_all(name="li"))
    for li in ul.find_all(name="li"):
        print(li.string)

print(soup.find_all(attrs={"id": "list-1"}))

CSS selectors

soup.select('.panel .panel_heading')
soup.select('ul li')
soup.select('#id1 .element')
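
select() returns a list of Tag objects, so results can be iterated and queried further. For example, with the hypothetical sample above:

for li in soup.select('#list-1 li'):
    print(li.get_text())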

The way bs4 navigates to elements and tags follows the structure of the HTML itself!
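For instance, with the hypothetical sample above, chained tag access walks the same path as the HTML nesting:

print(soup.body.ul.li.string)  # item 1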
