Python web scraping: data parsing

IP proxies

If your IP gets blocked, route requests through a proxy, e.g. one from
http://www.goubanjia.com/

HTTPConnectionPool errors

- Causes:
	- A burst of high-frequency requests got the IP banned
	- The connections in the HTTP connection pool were exhausted
- Fixes:
	- use a proxy
	- add Connection: "close" to the request headers (see the sketch below)
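A minimal sketch of both fixes with requests; the proxy address below is a placeholder, not a working proxy:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36",
    "Connection": "close",   # ask the server not to keep the connection alive, freeing pool resources
}
# placeholder address; substitute a proxy you actually obtained (e.g. from a proxy site)
proxies = {"https": "https://1.2.3.4:8888"}

response = requests.get("https://www.example.com", headers=headers,
                        proxies=proxies, timeout=5)
print(response.status_code)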

Data parsing

Data parsing is what turns a general crawler into a focused crawler: instead of saving whole pages, we extract only the data we care about.

Ways to implement data parsing

  • regex: fast to run, but slow to write
  • bs4
  • xpath: the most broadly applicable
  • pyquery

General principle of data parsing

  • The data we want lives inside tags or in tag attributes, so parsing always boils down to two steps (see the sketch below):
    • locate the tag
    • extract its text or attribute value
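A tiny illustration of those two steps on a hand-written snippet, using a regex (the first technique above); the markup and URL are made up for the example:

import re

html = '<a id="feng" href="https://www.example.com">example link</a>'

# step 1: locate the <a> tag; step 2: pull out its href attribute and its text
href, text = re.search(r'<a id="feng" href="(.*?)">(.*?)</a>', html).groups()
print(href)   # https://www.example.com
print(text)   # example link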

Let's use a regex to scrape a few videos from Qiushibaike

import requests
import re
import os
from urllib import request

dirName = './videos/'
if not os.path.exists(dirName):
    os.mkdir(dirName)

headers = {
   "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"
}
url = 'https://www.qiushibaike.com/video/'

response = requests.get(url=url, headers=headers)
page_text = response.text
# the original regex was swallowed by the page rendering; this reconstruction captures
# the src attribute of every <source> tag on the video list page
ex = r'<source src="(.*?)"'
page_list_video = re.findall(ex, page_text, re.S)
for v in page_list_video:
    v = "https:" + v                          # the src values are protocol-relative
    video_name = dirName + v.split("/")[-1]
    # Option 1:
    # response_video = requests.get(v, headers=headers).content  # bytes data
    # with open(video_name, "wb") as fp:
    #     fp.write(response_video)
    # Option 2:
    request.urlretrieve(url=v, filename=video_name)

Parsing with bs4

How bs4 parsing works

  • Instantiate a BeautifulSoup object and load the page source to be parsed into it
  • Call the BeautifulSoup object's corresponding methods and attributes to locate tags and extract data
  • Environment setup
    • pip install bs4
    • pip install lxml

Instantiating BeautifulSoup

  • BeautifulSoup(fp, 'lxml'): loads a locally stored HTML file into the BeautifulSoup instance
  • BeautifulSoup(page_text, 'lxml'): loads page source fetched from the internet into the BeautifulSoup instance (both forms are sketched below)
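A minimal sketch of both forms; test.html here is a hypothetical local file:

from bs4 import BeautifulSoup
import requests

# from a locally stored HTML file (test.html is a made-up name)
with open('test.html', 'r', encoding='utf-8') as fp:
    soup_local = BeautifulSoup(fp, 'lxml')

# from page source fetched over the network
page_text = requests.get('https://www.example.com').text
soup_remote = BeautifulSoup(page_text, 'lxml')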


Locating tags (see the walkthrough after this list)

  • soup.tagName: locates the first tag named tagName in the document
  • Attribute-based location:
    • soup.find('tagName', attrName='value')
    • soup.find_all('tagName', attrName='value'): returns a list
      (for the class attribute, use the keyword class_)
  • Selector-based location: soup.select('CSS selector')
    • hierarchy selectors: > means one level down, a space means any number of levels
    • soup.select('#feng')
    • soup.select('.tang > ul > li')
    • soup.select('.tang li')
  • Getting text
    • string: only the tag's own (direct) text
    • text: all of the text, including descendants
      • a_tag = soup.select('#feng')[0].string
      • a_tag = soup.select('#feng')[0].text
  • Getting an attribute
    • tagName['attrName']
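A quick walkthrough of those operations on a hand-written snippet; the markup, the #feng id and the .tang class simply mirror the examples above:

from bs4 import BeautifulSoup

html = '''
<div class="tang">
  <ul>
    <li><a id="feng" href="https://www.example.com">A line of <b>poetry</b></a></li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, 'lxml')

print(soup.a)                               # first <a> tag in the document
print(soup.find('a', id='feng'))            # attribute-based lookup
print(soup.select('.tang > ul > li > a'))   # selector, > = one level per step
print(soup.select('#feng')[0].text)         # all text: "A line of poetry"
print(soup.select('#feng')[0].string)       # direct text only -> None here (the <b> child splits it)
print(soup.select('#feng')[0]['href'])      # attribute value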

Let's download Romance of the Three Kingdoms with BeautifulSoup

from bs4 import BeautifulSoup
import requests

url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
}
response_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(response_text, 'lxml')
a_list = soup.select('.book-mulu > ul > li > a')    # one <a> per chapter in the table of contents
fp = open('./sg.txt', 'w', encoding='utf-8')
for a in a_list:
    title = a.string                                 # the chapter title is the <a> tag's direct text
    content_url = "https://www.shicimingju.com" + a['href']
    content_text = requests.get(url=content_url, headers=headers).text
    detail_soup = BeautifulSoup(content_text, 'lxml')    # parse the chapter detail page
    content = detail_soup.find('div', class_="chapter_content").text
    fp.write(f"{title}\n{content}\n\n")
    print(f'{title} downloaded!')
fp.close()

Parsing with xpath

How xpath parsing works

  • Instantiate an etree object and load the page source to be parsed into it
  • Call the etree object's xpath method with different xpath expressions to locate tags and extract data
  • Environment setup
    • pip install lxml
  • Instantiating an etree object (both forms are sketched below)
    • etree.parse("text.html"): from a locally stored HTML file
    • etree.HTML(page_text): from page source fetched over the network
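A minimal sketch of both forms; text.html is a hypothetical local file:

import requests
from lxml import etree

# from a local file; passing an HTMLParser makes parse tolerant of real-world, non-strict HTML
tree_local = etree.parse("text.html", etree.HTMLParser())

# from page source fetched over the network
page_text = requests.get("https://www.example.com").text
tree_remote = etree.HTML(page_text)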

xpath expressions

  • A leading /: the expression must locate tags level by level, starting from the root
    • tree.xpath('/html/body/div/p')
  • A leading //: the expression can start locating tags from anywhere in the document
    • tree.xpath('//p')
  • A / that is not leftmost: one level down
  • A // that is not leftmost: skips across any number of levels
    • tree.xpath('/html/body//p')
  • Two expressions can be joined with the pipe character | so one xpath covers more cases (all of these are shown in the walkthrough after this list)
  • Locating tags
    • by attribute: //tagName[@attrName="value"]
      • //div[@class='song']
    • by index: //tagName[index] (indexing starts at 1)
  • Getting text
    • /text(): the tag's own (direct) text
      • tree.xpath('//a[@id="feng"]/text()')[0]
    • //text(): all the text, including descendants
  • Getting an attribute: /@attrName
    • tree.xpath('//a[@id="feng"]/@href')
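A short walkthrough of those expressions on a hand-written snippet; the markup, the feng id and the song class simply mirror the examples above:

from lxml import etree

html = '''
<html><body>
  <div class="song">
    <p>first</p>
    <p>second</p>
    <a id="feng" href="https://www.example.com">a line of text</a>
  </div>
</body></html>
'''
tree = etree.HTML(html)

print(tree.xpath('/html/body/div/p/text()'))        # level by level from the root
print(tree.xpath('//p[2]/text()'))                   # from anywhere; index starts at 1
print(tree.xpath("//div[@class='song']//text()"))    # all text nested under the div
print(tree.xpath('//a[@id="feng"]/text()')[0])       # direct text of the <a>
print(tree.xpath('//a[@id="feng"]/@href')[0])        # attribute value
print(tree.xpath('//p/text() | //a/text()'))         # two expressions joined with |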

Let's scrape some Qiushibaike text posts with xpath

import requests
from lxml import etree

url = 'https://www.qiushibaike.com/text/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
}
response_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(response_text)
div_list = tree.xpath("//div[@class='article block untagged mb15 typs_hot']")
for div in div_list:
    author = div.xpath("./div/a[2]/h2/text()")[0]   # relative (local) parsing within this div
    content = div.xpath("./a[1]/div/span/text()")   # the span contains <br> tags, so this comes back as a list of pieces
    content = "".join(content)
    print(author, content)

Joining two expressions with the pipe character |

import requests
from lxml import etree

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"
}
url = "https://www.aqistudy.cn/historydata/"
response_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(response_text)
# the first branch grabs the hot-city links, the second the full city list; | merges both result sets
citys = tree.xpath("//div[@class='bottom']/ul/li/a/text() | //div[@class='bottom']/ul/div[2]/li/a/text()")
print(citys)

Fixing garbled Chinese text

data.encode("iso-8859-1").decode("gbk")  # iso-8859-1 maps every byte one-to-one, so encoding with it recovers the raw bytes; decoding with gbk then applies the page's real encoding

Let's scrape a few images from the 4K beauty-wallpaper section

import requests
from lxml import etree
import os

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"
}

dirName = './mv'
if not os.path.exists(dirName):
    os.mkdir(dirName)


url = "http://pic.netbian.com/4kmeinv/index_%d.html"      # page template for pages 2 and up
for page in range(1, 10):
    if page == 1:
        new_url = 'http://pic.netbian.com/4kmeinv/index.html'   # the first page has no index number
    else:
        new_url = url % page

    response_text = requests.get(url=new_url, headers=headers).text
    tree = etree.HTML(response_text)
    a_list = tree.xpath("//div[@class='slist']/ul/li/a")
    for i in a_list:
        pic_path = "http://pic.netbian.com/" + i.xpath("./img/@src")[0]
        pic_name = i.xpath("./b/text()")[0] + ".jpg"
        pic_name = pic_name.encode("iso-8859-1").decode("gbk")      # fix the garbled Chinese file name
        pic = requests.get(url=pic_path, headers=headers).content
        with open(dirName+"/"+pic_name, "wb") as fp:
            fp.write(pic)
