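Web Crawler Study Notes: xpath, bs4, selenium, scrapy...

Web Crawlers

I. Introduction

1. What is a crawler

1.1 The concept of a crawler (Spider)

A crawler fetches data from the network; it is also called a data collection program.

The data comes from the network, where it may be served by Web servers (Nginx/Apache), database servers (MySQL, Redis), search indexes (Elasticsearch), big-data stores (HBase/Hive), video/image repositories (FTP), cloud storage (OSS), and so on.

The data being crawled should be public, and it should be used for non-profit purposes.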

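1.2 Python crawlers

- A crawler script (program) written in Python.
- It can crawl data on a schedule, in a fixed quantity, and from specified targets (Web sites).
- It mainly relies on multi-threading/multi-processing (or a single thread), network request libraries, data parsing, data storage, and task scheduling.
- A Python crawler engineer can also carry out API testing, functional testing, performance testing, and integration testing.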

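2. The relationship between a crawler and the Web backend

A crawler uses a network request library and therefore plays the role of the client; the Web backend returns data in response to each request.

In other words, a crawler:

- takes a target URL,
- sends an HTTP request to the Web server,
- receives the response data correctly,
- then parses and stores the data according to its type (Content-Type).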
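
The last step, choosing how to parse by Content-Type, can be sketched as follows. This is a minimal illustration, not part of the original notes: the function name parse_body is hypothetical, and it only distinguishes JSON, text, and everything else.

```python
import json
from email.message import Message


def parse_body(content_type_header: str, body: bytes):
    """Decode a response body according to its Content-Type header."""
    msg = Message()
    msg['Content-Type'] = content_type_header
    mimetype = msg.get_content_type()              # e.g. 'application/json'
    charset = msg.get_content_charset() or 'utf-8'  # fall back when no charset is declared

    if mimetype == 'application/json':
        return json.loads(body.decode(charset))    # dict or list
    if mimetype.startswith('text/'):
        return body.decode(charset)                # str (HTML, plain text, ...)
    return body                                    # images, archives, ...: keep the raw bytes


print(parse_body('application/json; charset=utf-8', b'{"a": 1}'))
```

Binary responses are returned untouched so that they can be written to disk in 'wb' mode.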

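Before sending a request, a crawler usually needs to disguise itself as a browser (UA spoofing):

# UA: User-Agent (the identity of whatever is sending the request)
# UA detection: a site's server inspects the User-Agent of each incoming request. If it identifies
# a browser, the request is treated as normal. If it does not, the request is treated as abnormal
# (a crawler) and the server is likely to reject it.

# UA spoofing: make the crawler's requests carry the User-Agent of a real browser

Requests sent this way are much more likely to get a 200 response.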
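
A minimal sketch of UA spoofing with the built-in urllib, assuming a Firefox User-Agent string. The request object is only built here, not sent, so the snippet runs without network access.

```python
from urllib.request import Request

# A browser-like User-Agent string (any current browser UA would do).
BROWSER_UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) '
              'Gecko/20100101 Firefox/79.0')

# Build the request object; nothing is sent until urlopen(request) is called.
request = Request('http://www.hao123.com', headers={'User-Agent': BROWSER_UA})

# urllib stores header names capitalized, hence 'User-agent' here.
print(request.get_header('User-agent'))
```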

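3. Python libraries used for crawling

Network requests:

  • urllib: built in
  • requests / urllib3: third-party
  • tornado client: asynchronous requests
  • selenium (UI automation testing) / Splash (WebKit-based): rendering pages that rely on dynamic JavaScript
  • appium (crawling or UI testing of mobile apps)

Data parsing:

  • re (regular expressions)
  • xpath
  • bs4
  • json: data from RESTful APIs

Data storage:

  • pymysql
  • mongodb
  • elasticsearch: the ES search engine (not to be confused with "ES" in JavaScript, which stands for ECMAScript; JavaScript = ECMAScript + BOM + DOM)

Multitasking:

  • multi-threading (threading) and thread-safe queues (queue)
  • coroutines (asyncio, gevent/eventlet)

Crawler frameworks:

  • scrapy
  • scrapy-redis: distributed (multi-machine) crawling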

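4. Common anti-crawling measures

  • User-Agent (UA) checks
  • Login restrictions (Cookie/Token)
  • Request-rate limits (typically worked around with IP proxies)
  • CAPTCHAs (image codes solved through a recognition service such as 云打码, click-the-text or click-the-object images, sliders)
  • Content rendered by dynamic JavaScript (handled with Selenium/Splash or by calling the underlying API directly)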

II. The urllib library [important]

2.1 The urllib library

from urllib.request import urlopen

# Send the request
resp = urlopen('http://www.hao123.com')
assert resp.code == 200
print('Request succeeded')
# Save the downloaded page
# f receives the value returned by __enter__() of the object that open() returns
with open('a.html', 'wb') as f:
    f.write(resp.read())

2.1.1 The urllib.request module

urlopen(url | request: Request, data=None): data must be of type bytes

from urllib.request import urlopen  # import the urlopen function from urllib.request
  • urlopen(url, data=None) sends a request to the url directly. If data is not None the request is a POST; otherwise it is a GET.
  • urlopen() returns a response object.
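
The GET/POST rule can be checked without sending anything, because a Request object reports the method it would use. A small sketch; the URL is a placeholder chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = 'http://example.com/api'  # placeholder URL, used only for illustration

# No data: the request will be sent as GET.
get_request = Request(url)

# data must be bytes, so the form fields are URL-encoded and then encoded to bytes.
form = urlencode({'kw': 'python'}).encode('utf-8')
post_request = Request(url, data=form)

print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
```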

urlretrieve(url, filename): download the resource at url into the given file

from urllib.request import urlretrieve, url2pathname
# url2pathname converts a url into a local path
import hashlib
import os

# Download an image
def download_img(url):
    # File name = md5 of the url + the file extension taken from the url
    filename = hashlib.md5(url.encode('utf-8')).hexdigest() + os.path.splitext(url2pathname(url))[-1]
    urlretrieve(url, filename)  # save the image to a file

if __name__ == '__main__':
    url = 'http://xxx.xxx.xxx/1.png'
    # print(url2pathname(url))  # P:\xxx.xxx.xxx\1.png
    # print(os.path.splitext(url2pathname(url))[-1])  # the url's file extension: .png
    download_img('https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1598329718420&di=f35bcbfa8ce0d360f4ac5ee60efe0b39&imgtype=0&src=http%3A%2F%2Fe.hiphotos.baidu.com%2Fzhidao%2Fpic%2Fitem%2Fb3119313b07eca8031a31eeb902397dda1448313.jpg')

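build_opener(*handlers): builds an opener object, which plays the role of a browser

  • opener.open(url|request, data=None) sends the request

Request: the class used to build a request

  • Use this class to customize a request object, for example to imitate a browser or a logged-in session

  • Constructor:

    Request(url, data=None, headers={}, method=None)

from urllib.request import Request, urlopen  # import the Request class and urlopen from urllib.request
from http.client import HTTPResponse

url = 'http://www.hao123.com'
default_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0'
}

# Wrap the url and headers in a request object
request = Request(url, headers=default_headers)  # create the request instance
resp: HTTPResponse = urlopen(request)  # send the request

HTTPHandler: handler for HTTP requests

ProxyHandler(proxies={'http': 'http://proxy_ip:port'}): routes requests through a proxy

HTTPCookieProcessor(CookieJar()): stores and sends cookies

  • http.cookiejar.CookieJar class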
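
The handlers above are combined through build_opener(). The sketch below only assembles the opener; the proxy address is hypothetical and no request is sent.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener

# Hypothetical local proxy; replace it with a proxy you are allowed to use.
proxy_handler = ProxyHandler({'http': 'http://127.0.0.1:8888'})

# The CookieJar collects cookies from responses and replays them on later requests.
cookie_jar = CookieJar()
cookie_handler = HTTPCookieProcessor(cookie_jar)

# build_opener() chains the custom handlers with the default ones.
opener = build_opener(proxy_handler, cookie_handler)
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]  # headers sent with every request

# opener.open(url) would now go through the proxy and keep cookies, e.g.:
# resp = opener.open('http://www.hao123.com', timeout=10)
```

Keeping one opener for a whole session means a login cookie received once is sent automatically with the following requests.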

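2.1.2 urllib.parse

URL parsing and encoding

from urllib.parse import quote, urlencode
  • quote() percent-encodes a single string, typically one value that contains Chinese or other non-ASCII characters.

  • urlencode(query: dict) percent-encodes every key and value in the parameter dict and joins them as key=value&key=value, i.e. the application/x-www-form-urlencoded format, so it can encode several parameters at once.

    [Notes]

    • In "query: dict", the part after the colon is a type annotation for the parameter.
    • When uploading JSON, set Content-Type to application/json.
    • When data is sent, the Content-Type request header defaults to application/x-www-form-urlencoded.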
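
A short comparison of the two functions, using only the standard library. The sample values are illustrative.

```python
from urllib.parse import quote, unquote, urlencode

# quote() escapes one string; ASCII letters and digits are left alone.
print(quote('伯爵'))                   # %E4%BC%AF%E7%88%B5
print(unquote('%E4%BC%AF%E7%88%B5'))   # 伯爵

# urlencode() escapes a whole dict of parameters and joins them with '&'.
print(urlencode({'kw': '伯爵'}))                 # kw=%E4%BC%AF%E7%88%B5
print(urlencode({'wd': 'python 3', 'pn': 10}))   # wd=python+3&pn=10

# quote() keeps '/' by default; pass safe='' to escape it as well.
print(quote('a/b'), quote('a/b', safe=''))       # a/b a%2Fb
```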

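2.1.3 response

from http.client import HTTPResponse   # import the HTTPResponse class from http.client

response: HTTPResponse = urlopen(url, timeout=30)   # send the request
  • response.read()

    Returns the body as bytes, which must be decoded to get text.

    bytes -> str: decode, e.g. response.read().decode()

    str -> bytes: encode

  • response.code / response.status / response.getcode()

    The response status code.

  • readline()

    Reads the current line (as bytes).

  • readlines()

    Reads all remaining lines as bytes and returns a list: [b'', b'']

  • geturl()

    The url of the resource that was actually retrieved (after any redirects).

  • headers / getheaders() / info()

    The response headers.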
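
The bytes/str conversion is the step that most often goes wrong, so here is a small sketch of it in isolation. The helper name decode_body is hypothetical; the bytes literal stands in for what response.read() would return.

```python
from typing import Optional


def decode_body(raw: bytes, charset: Optional[str] = None) -> str:
    """Turn the bytes returned by response.read() into text.

    charset is what resp.headers.get_content_charset() reported, or None.
    """
    # Fall back to UTF-8 when the server did not declare a charset.
    return raw.decode(charset or 'utf-8', errors='replace')


raw = '爬虫'.encode('gbk')        # stands in for response.read()
print(decode_body(raw, 'gbk'))    # 爬虫
print('abc'.encode('utf-8'))      # b'abc'  (str -> bytes)
```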

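2.1.4 Working around SSL problems

Older Python versions may fail to verify a site's SSL certificate. The workaround below switches certificate verification off for every HTTPS request, so use it only for testing.

import ssl  # import the ssl module
ssl._create_default_https_context = ssl._create_unverified_context  # set the default HTTPS context before sending requests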
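
As an alternative that does not change global state, an unverified context can be built with the public ssl API and passed to a single urlopen() call. This is a sketch of a different technique from the one-liner above, and it carries the same caveat: verification is off for that call.

```python
import ssl

# A context that skips certificate checks, to be passed to a single call.
insecure_ctx = ssl.create_default_context()
insecure_ctx.check_hostname = False        # must be disabled before verify_mode
insecure_ctx.verify_mode = ssl.CERT_NONE

# Usage (not executed here):
# from urllib.request import urlopen
# resp = urlopen('https://www.baidu.com', context=insecure_ctx, timeout=10)
```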

2.2 Examples

spider01.py

Goal: try out the methods above against http://hao123.com

from http.client import HTTPResponse, HTTPMessage
from urllib.request import urlopen, Request, urlretrieve

# Works around SSL certificate problems on older Python versions; not needed on 3.7
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

url = 'https://www.baidu.com'  # HTTP over SSL/TLS (TCP three-way handshake, then the TLS handshake)
url2 = 'http://hao123.com'     # plain HTTP (protocol versions: HTTP/1.0, HTTP/1.1, HTTP/2)

# Send the HTTP request
# urlopen() uses GET by default; when data is not None the request is a POST
response: HTTPResponse = urlopen(url2, timeout=30)
print(type(response))  # the type of the response object
body = response.read()  # read the whole body as bytes (the stream can be read only once)
# print(body.decode())  # convert the bytes to a string
print(response.code, response.status, response.getcode())  # the response status code
print(response.geturl())  # the url that was actually retrieved
print(response.headers)

print('*'*30)
print(response.info())

print('--'*30)
print(response.readlines())  # reads the remaining lines as a list [b'', b'']; empty here because read() already consumed the body

spider02.py

import json
from urllib.request import urlopen, Request
from urllib.parse import quote, urlencode, urljoin

from http.client import HTTPResponse, HTTPMessage
from collections import namedtuple

# An immutable tuple class whose fields are read by attribute name
# resp = Response(text=, body=, charset=, ...)
Response = namedtuple('Response', ('text', 'body', 'charset', 'mimetype', 'json', 'headers', 'status_code'))

default_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0'
}

def download(url, data=None, headers=None):
    # Merge into a copy so that per-call headers do not leak into later requests
    req_headers = dict(default_headers)
    if headers:
        req_headers.update(headers)

    # Wrap everything in a Request object
    request = Request(url, data=data, headers=req_headers)

    resp: HTTPResponse = urlopen(request)
    body = resp.read()  # the response body as bytes

    # resp.headers -> HTTPMessage
    charset = resp.headers.get_content_charset() or 'utf-8'
    text = body.decode(charset)
    mimetype = resp.headers.get_content_type()

    # Exercise: what other ways are there to create a dict?
    params = {
        'body': body,
        'charset': charset,
        'mimetype': mimetype,
        'headers': dict(resp.headers),
        'text': text,
        'json': json.loads(text) if mimetype == 'application/json' else {},
        'status_code': resp.status
    }

    return Response(**params)

def test_hao123():
    resp = download('http://hao123.com')
    print(resp.status_code)
    print(resp.headers)
    print(resp.text)

def test_baidu():
    resp = download('https://www.baidu.com')
    print(resp.status_code)
    # print(resp.text)
    with open('baidu.html', 'w', encoding=resp.charset) as f:
        f.write(resp.text)

    print('file created')


def baidu_search(wd):
    # quote() percent-encodes (escapes) a single string
    # urlencode() encodes several parameters at once, passed as a dict
    url = 'https://www.baidu.com/s?wd=%s' % quote(wd)  # escape
    headers = {
     
        'Cookie': 'BAIDUID=75523031597EAA3E8BB1A69893E941A3:FG=1; '
                 'BIDUPSID=287D2F5789E4D37673357E46D76A9234; '
                 'PSTM=1597633936; BD_UPN=13314752; '
            'COOKIE_SESSION=15827_2_6_6_7_8_0_1_5_4_268_2_21926_17_0_22_1598118126_1598257071_1598257049%7C'
                 '9%23865216_4_1598257049%7C2; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598;'
                 ' H_PS_645EC=2e46hskUSTamKGaU0Dz33sTXubuzMrp9hK4vtJVqscDTlEg6LOtPFshX9hc; '
                 'BDRCVFR[Fc9oatPmwxn]=srT4swvGNE6uzdhUL68mv3; BDRCVFR[gltLrB7qNCt]=mk3SLVN4HKm; '
                 'BDRCVFR[S4-dAuiWMmn]=I67x6TjHwwYf0; BD_HOME=1; delPer=0; BD_CK_SAM=1; '
                 'PSIN…V-oTH6aogfJ1O1b4WD4UoDdaEG0PSx8g0Kubrn8EogKKy2OTH9DF_2uxOjjg8UtVJeC6EG0Ptf8g0f5; '
                 'H_BDCLCKID_SF=JbAtoKD-JKvJfJjkM4rHqR_Lqxby26nCa6T9aJ5nJDoVSITkhM6Keb8sXN5aJxvC5b7PQbovQpP-HJ7wb-ndeqtT3priW43TMTc8Kl0MLpbtbb0xyn_VMM3beMnMBMn8teOnaITg3fAKftnOM46JehL3346-35543bRTLnLy5KJYMDFRDj0Wej33jaRybTjLHDJKWJbatjrjDCvvQxQcy4LdjG5m2lbifnTHVPJj2CoDhqnqbUJj3Cue3-Aq54RM5eLD2KjtJU3UMM5vQ-OHQfbQ0bbOqP-jW5TaQJuy3R7JOpvwDxnxy5FvQRPH-Rv92DQMVU52QqcqEIQHQT3m5-5bbN3ht6T2-DA_oC8bJKJP; shifen[185669144806_26252]=1598257070; BDSVRTM=212'
    }
    resp = download(url, headers=headers)  # pass headers by keyword; the second positional parameter is data
    if resp.status_code == 200:
        with open('%s.html' % wd , 'w', encoding=resp.charset) as f:
            f.write(resp.text)

        print('Save %s.html File OK' % wd)


def baidu_fanyi(word):
    url = 'https://fanyi.baidu.com/sug'  # POST
    # urlencode() escapes (URL-encodes) several parameters in one call
    data = urlencode({
        'kw': word
    })  # e.g. kw=%E4%BC%AF%E7%88%B5

    # Send the POST request
    resp = download(url, data.encode('utf-8'))
    print(resp.status_code)
    print(resp.json)

def xjgg(page=1, page_size=20):
    print('--download page %s ---' % page)

    url = 'http://www.ccgp-xinjiang.gov.cn/front/search/category'
    data = json.dumps({
        'categoryCode': "ZcyAnnouncement2",
        'pageNo': page,
        'pageSize': page_size,
        'utm': "sites_group_front.5b1ba037.0.0.3e2f7230e5f011ea817d43a31c8c15bd"
    })
    # POST request; the uploaded data is a JSON string
    resp = download(url, data.encode('utf-8'), headers={
        'Content-Type': 'application/json;charset=utf-8'
    })
    print(resp.status_code, resp.mimetype)
    print(resp.json)
    # Save the results as a json file
    ret = []
    for item in resp.json['hits']['hits']:
        item['_source']['id'] = item['_id']
        item['_source']['url'] = urljoin('http://www.ccgp-xinjiang.gov.cn', item['_source']['url'])
        ret.append(item['_source'])

    with open('top_%s.json' % page, 'w', encoding=resp.charset) as f:
        json.dump(ret, f)

    if page < 10:
        xjgg(page + 1, page_size)  # keep the same page size for the following pages

if __name__ == '__main__':
    # test_hao123()
    # test_baidu()
    # baidu_search('基督山伯爵')
    # baidu_fanyi('雎')
    xjgg()

spider03.py


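III. The requests library

requests is also a network request library; it is built on top of urllib3 and offers a more convenient interface.

Typical uses:

- API testing
- Wrapping RESTful web services, such as SDK-style access to the ES search engine (Elasticsearch)
- Calling third-party network services (for example Alibaba Cloud SMS verification codes)
- Downloading site resources (images, web pages, audio, and video)

3.1 Installing the library

pip install requests -i https://mirrors.aliyun.com/pypi/simple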

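3.2 Core functions

  • requests.request(): the base function behind all the request methods

    Parameters of request():

    • method: str, the request method: GET, POST, PUT, DELETE, OPTIONS, HEAD

    • url: str, the resource endpoint (API) to request; in RESTful terms, the URI (Uniform Resource Identifier)

    • params: dict, the query parameters (query string) for the request, e.g. /s?wd=python3

    • data: dict, the form parameters (Form Data) for POST/PUT/PATCH requests. They are placed in the request body, URL-encoded in the same format that urllib.parse.urlencode() produces, and the Content-Type request header defaults to application/x-www-form-urlencoded.

    • json: dict, the data to upload as JSON. The dict is serialized to a JSON string (as json.dumps() does) and placed in the request body, and the Content-Type request header is set to application/json.

    • files: dict with the structure {'name': file-like-object | tuple}. A tuple can take three forms:

      • ('filename', file-like-object), where file-like-object is, for example, the stream object returned by open()

      • ('filename', file-like-object, content_type), where content_type is the mimetype of the file (image/png, image/gif, image/webp)

      • ('filename', file-like-object, content_type, custom-headers)

      files is used to upload files, normally with a POST request; the Content-Type request header is then multipart/form-data.

    • headers/cookies: dict

    • proxies: dict, proxy settings

    • auth: tuple, the username and password used for authentication, in the form ('username', 'pwd')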
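
To make the difference between params, data, and json concrete, the sketch below reproduces the three encodings with the standard library only. It is an illustration of the formats described above, not the code requests runs internally, and the sample values are made up.

```python
import json
from urllib.parse import urlencode

# params: appended to the URL as the query string.
params = {'wd': 'python3', 'pn': 10}
full_url = 'http://www.baidu.com/s' + '?' + urlencode(params)

# data: form fields, URL-encoded into the request body.
form_body = urlencode({'kw': 'good'}).encode('utf-8')
form_type = 'application/x-www-form-urlencoded'

# json: the object is serialized to a JSON string and sent as the body.
json_body = json.dumps({'pageNo': 1, 'pageSize': 20}).encode('utf-8')
json_type = 'application/json'

print(full_url)    # http://www.baidu.com/s?wd=python3&pn=10
print(form_body)   # b'kw=good'
print(json_body)   # b'{"pageNo": 1, "pageSize": 20}'
```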

  • requests.get(): sends a GET request, used to query data

    Available parameters:

    • url
    • params: used when the request path carries query parameters
    • json
    • headers/cookies/auth
    # Baidu search
    import requests
    
    def search(wd):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0',
            # The request header is named 'Cookie' (singular)
            'Cookie': 'BAIDUID=BB1900CD6732DD7E2D3AE34543595BDC:FG=1; BIDUPSID=BB1900CD6732DD7E94833ED9AE81FDDE; PSTM=1597146175; BDORZ=FFFB88E999055A3F8A630C64834BD6D0; BDRCVFR[Fc9oatPmwxn]=srT4swvGNE6uzdhUL68mv3; delPer=0; PSINO=1; H_PS_PSSID=1440_32621_32328_32348_32497_32481; BDRCVFR[dG2JNJb_ajR]=mk3SLVN4HKm; BDRCVFR[-pGxjrCMryR]=mk3SLVN4HKm'
        }
        resp = requests.get('http://www.baidu.com/s', params={
            'wd': wd
        }, headers=headers)
        print(resp.status_code)
        print(resp.text)
        with open('%s.html' % wd, 'wb') as f:
            f.write(resp.content)
            
    if __name__ == '__main__':
        search('python3.8')
    
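  • requests.post(): sends a POST request, used to upload or add data

    Available parameters:

    • url
    • data/files
    • json
    • headers/cookies/auth
    # Baidu Translate
    import requests
    
    def fanyi(kw):
        url = 'https://fanyi.baidu.com/sug'
        resp = requests.post(url, data={
            'kw': kw
        })
    
        # The response object exposes: status_code/headers/cookies/encoding/text/content/json()
        print(resp.json())
       
    if __name__ == '__main__':
        fanyi('good')
    
  • requests.put(): sends a PUT request, used to modify or replace data

  • requests.patch(): sends a PATCH request, used to update data partially. PATCH is not guaranteed to be idempotent, so a repeated request may be processed twice; it is best avoided.

  • requests.delete(): sends a DELETE request, used to delete data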

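3.3 requests.Response

The request methods above return a Response object. Its commonly used attributes are:

  • status_code: the response status code

  • url: the requested url

  • headers: dict, the response headers; equivalent to getheaders() on urllib's response object. Cookies set by the server are also available, already parsed, through the cookies attribute.

  • cookies: an iterable whose elements are Cookie objects (name, value, path)

  • text: the response body as text

  • content: the response body as bytes

  • encoding: the character encoding of the response data, e.g. utf-8, gbk, gb2312

  • json(): if the response body is JSON (application/json), deserializes it into a Python list or dict.

    • Aside: serialization and deserialization in JavaScript
      • JSON.stringify(obj): serialize

      • JSON.parse(text): deserialize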
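
The Python counterparts of JSON.stringify and JSON.parse are json.dumps and json.loads. A minimal round trip, with made-up sample data:

```python
import json

record = {'title': '基督山伯爵', 'tags': ['novel', 'classic'], 'page': 1}

# Serialize: Python object -> JSON string (counterpart of JSON.stringify).
# ensure_ascii=False keeps Chinese characters readable instead of \uXXXX escapes.
text = json.dumps(record, ensure_ascii=False)

# Deserialize: JSON string -> Python object (counterpart of JSON.parse).
restored = json.loads(text)

print(type(text).__name__, type(restored).__name__)  # str dict
print(restored == record)                            # True
```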

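IV. Parsing data with re (regular expressions)

Character classes

  • . any single character except a newline
  • [a-f] any single character in the range
  • \w any letter, digit, or underscore
  • \W any character that \w does not match
  • \d any digit
  • \D any non-digit
  • \s any whitespace character
  • \S any non-whitespace character

Quantifiers

  • * 0 or more
  • + 1 or more
  • ? 0 or 1
  • {n} exactly n
  • {n,} at least n
  • {n,m} between n and m

Groups

  • ( ) an ordinary group; with several groups, search().groups() returns a tuple

  • (?P<name>pattern) a named group; with several groups, search().groupdict() returns a dict whose keys are the group names.

The re module in Python

  • re.compile(): compiles a pattern object once so that it can be reused for many matches
  • re.match(pattern, string): matches only at the start of the string
  • re.search(): finds the first match anywhere in the string
  • re.findall(): returns all matches as a list
  • re.sub(): replaces matches
  • re.split(): splits the string at matches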
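
The functions and group syntax listed above can be tried on a small HTML fragment. The fragment and the pattern are invented for this sketch.

```python
import re

html = '<a href="/page/1/">first</a> <a href="/page/2/">second</a>'

# Compile once, reuse many times; (?P<name>...) defines named groups.
link_re = re.compile(r'<a href="(?P<url>.*?)">(?P<text>\w+)</a>')

first = link_re.search(html)
print(first.groups())      # ('/page/1/', 'first')
print(first.groupdict())   # {'url': '/page/1/', 'text': 'first'}

# findall() returns one tuple per match when the pattern has several groups.
print(link_re.findall(html))   # [('/page/1/', 'first'), ('/page/2/', 'second')]

print(re.sub(r'\d+', 'N', '/page/12/'))   # /page/N/
print(re.split(r'[,;]\s*', 'a, b;c'))     # ['a', 'b', 'c']
```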

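4.1 Example

Downloading the images from the Qiushibaike (糗事百科) image ranking:

import requests
import re
import os

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    # Create a folder to hold all the images
    if not os.path.exists('./qiutuLibs'):
        os.mkdir('./qiutuLibs')

    # URL template shared by all pages
    url = 'https://www.qiushibaike.com/imgrank/page/%d/'
    for pageNum in range(1, 3):
        # URL of the current page
        new_url = url % pageNum

        # General crawl: fetch the whole page behind the url
        page_text = requests.get(url=new_url, headers=headers).text

        # Focused crawl: extract every image address from the page
        # NOTE: the HTML inside the original pattern was lost when this page was extracted;
        # only a ".*?" fragment survived. The pattern below is an assumption about the markup.
        ex = r'<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
        img_src_list = re.findall(ex, page_text, re.S)
        # print(img_src_list)
        for src in img_src_list:
            # Build the complete image URL
            src = 'https:' + src
            # Fetch the image as binary data
            img_data = requests.get(url=src, headers=headers).content
            # Image file name
            img_name = src.split('/')[-1]
            # Path where the image is stored
            imgPath = './qiutuLibs/' + img_name
            with open(imgPath, 'wb') as fp:
                fp.write(img_data)
            print(img_name, 'downloaded successfully!')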

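V. BS4 (BeautifulSoup)

5.1. Introduction

What is BeautifulSoup
BeautifulSoup, like lxml, is an HTML parser; its main job is likewise to parse documents and extract data.
Official documentation:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

Pros and cons:

It is slower than lxml.
Its interface is user-friendly and easy to work with.

5.2. Usage

5.2.1. How the parsing works

1. Locate the tags

2. Extract the data stored in the tags and in their attributes

- How bs4 parses data:
	- 1. Instantiate a BeautifulSoup object and load the page source into it
	- 2. Call the object's attributes or methods to locate tags and extract data

Installation:

- pip install bs4
- pip install lxml (also needed for xpath)

Import BeautifulSoup:

from bs4 import BeautifulSoup

Create the object (instantiation):

From downloaded content: soup = BeautifulSoup('string downloaded from the web', 'lxml')
From a local file: soup = BeautifulSoup(open('1.html'), 'lxml')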

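5.2.2. Methods and attributes for parsing

  1. soup.tagname returns the first tagname tag that appears in the html

  2. soup.find():

    - find('tagName') is equivalent to soup.tagName, e.g. soup.find('div') == soup.div
    - Locating by attribute:
    	- soup.find('div', class_/id/attr='xxx')
    	- soup.find_all('tagName') returns all matching tags (as a list)
    
  3. select:

    - select('a CSS selector (id, class, tag, ...)') returns a list
    - Hierarchical selectors:
    	- soup.select('.xxx > ul > li > a'): > means exactly one level (parent and child)
    	- soup.select('.xxx > ul a'): a space means any number of levels (descendants, not necessarily direct children)
    
  4. Getting the text inside a tag:

    - soup.a.text/string/get_text()
    	- text/get_text(): returns all the text inside the tag, including text in nested tags
    	- string: returns only the tag's own direct text
    
  5. Getting an attribute value

    - soup.a['href']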

5.2.3 Examples

Example:

Goal: crawl every chapter title and chapter text of the novel Romance of the Three Kingdoms (三国演义)

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests
from bs4 import BeautifulSoup
# Goal: crawl every chapter title and chapter text from http://www.shicimingju.com/book/sanguoyanyi.html
if __name__ == "__main__":
    # Fetch the index page
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
    page_text = requests.get(url=url, headers=headers).text

    # Parse the chapter titles and detail-page urls out of the index page
    # 1. Instantiate a BeautifulSoup object and load the page source into it
    soup = BeautifulSoup(page_text, 'lxml')
    # Extract the chapter titles and the detail-page urls
    li_list = soup.select('.book-mulu > ul > li')
    # The with block closes the file even if a request fails part-way through
    with open('./sanguo.txt', 'w', encoding='utf-8') as fp:
        for li in li_list:
            title = li.a.string
            detail_url = 'http://www.shicimingju.com' + li.a['href']
            # Request the detail page and parse out the chapter text
            detail_page_text = requests.get(url=detail_url, headers=headers).text
            # Locate the chapter content in the detail page
            detail_soup = BeautifulSoup(detail_page_text, 'lxml')
            div_tag = detail_soup.find('div', class_='chapter_content')
            # The chapter text
            content = div_tag.text
            fp.write(title + ':' + content + '\n')
            print(title, 'crawled successfully!')

A later example:

import requests
from bs4 import BeautifulSoup

from utils.ua_ import get_ua  # project-local helper (not shown here); presumably returns a User-Agent string

def get(url, callback=None):
    headers = {
        'User-Agent': get_ua()
    }
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 200:
        resp.encoding = 'utf-8' if not resp.encoding else resp.encoding
        if callback is None:
            parse(url, resp.text)
        else:
            callback(url, resp.text)

def start_spider():
    get('http://www.qiushibaike.com/text')

def parse(url, html):
    # print(html)
    soup = BeautifulSoup(html, 'lxml')

    # Get all the articles
    for article in soup.find_all('div', class_='article'):
        # article is an instance of bs4.element.Tag
        print(article.get('class'), type(article))
        img_src = 'https:' + article.find('img').get('src')
        # print(img_src)
        # URL of the detail page
        info_url = 'http://www.qiushibaike.com' + article.find('a', class_="contentHerf").get('href')
        print(info_url)
        get(info_url, parse_content)  # send the request

    # Task 3: get the URL of the next page and request it
    # next_url = ''
    # get(next_url)

def parse_content(url, html):
    soup = BeautifulSoup(html, 'lxml')
    # select() returns a list-like result that can be indexed or iterated
    author_a = soup.select('.side-left-userinfo')[0]
    img_tag = author_a.find('img')
    author_img = 'https:' + img_tag.get('src')
    author_name = img_tag.get('alt')  # author_a.find('span').string
    # .text / .get_text()
    content = soup.find('div', class_="content")
    item_pipeline(dict(author_img=author_img, name=author_name, content=content.text))

def item_pipeline(item):
    print(item)
    # Task 2: save the item

if __name__ == '__main__':
    start_spider()

六、数据解析方式之xpath

xpath属于xml/html解析数据的一种方式, 基于元素(Element)的树形结构(Node > Element)。选择某一元素时,根据元素的路径选择,如 /html/head/title获取</code>标签。</p> </blockquote> <pre><code>- XML 被设计用来传输和存储数据。XML 指可扩展标记语言(EXtensible Markup Language) - HTML 被设计用来显示数据。 </code></pre> <h3>6.1. xpath解析原理:</h3> <pre><code>- 1、实例化一个etree的对象,且需要将被解析的页面源码数据加载到该数据对象中 - 2、调用etree对象中的xpath 方法结合着xpath表达式标签的定位和内容的捕获 </code></pre> <h3>6.2. 环境安装:</h3> <pre><code>- pip install lxml </code></pre> <h3>6.3. 实例化一个etree对象:</h3> <pre><code>- 1、导包: from lxml import etree - 2、将本地的html文档中的源码数据加载到etree对象中: etree.parse(filePath) - 3、可以将从互联网上获取的源码数据加载到该对象中 etree.HTML('page_text') </code></pre> <h3>6.4. xpath(‘xpath表达式’)</h3> <pre><code>- /:表示的是从根节点开始定位。表示的是一个层级。 - //:表示的是多个层级。可以表示从任意位置开始定位。 - 属性定位://div[@class='song'] tag[@attrName="attrValue"] - 索引定位://div[@class="song"]/p[3] 索引是从1开始的。 - 取文本: - /text() 获取的是标签中直系的文本内容 - //text() 标签中非直系的文本内容(所有的文本内容) - 取属性: /@attrName ==>img/src </code></pre> <ol> <li> <p>[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-Hx41BC9a-1602569751032)(E:\07-notes\picture\04-路径查询.png)]</p> <h3>6.5. 
示例1:</h3> <p>需求:爬取58二手房中的房源信息和价格</p> <pre><code class="prism language-python"><span class="token comment">#!/usr/bin/python3</span> <span class="token comment"># 需求:爬取58二手房中的房源信息和价格</span> <span class="token keyword">import</span> requests <span class="token keyword">from</span> lxml <span class="token keyword">import</span> etree <span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span> headers <span class="token operator">=</span> <span class="token punctuation">{ </span> <span class="token string">'User-Agent'</span><span class="token punctuation">:</span> <span class="token string">'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0'</span> <span class="token punctuation">}</span> <span class="token comment"># 爬取到页面源码数据</span> url <span class="token operator">=</span> <span class="token string">'https://xa.58.com/ershoufang/'</span> page_text <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token operator">=</span>url<span class="token punctuation">,</span> headers<span class="token operator">=</span>headers<span class="token punctuation">)</span><span class="token punctuation">.</span>text <span class="token comment"># 数据解析</span> tree <span class="token operator">=</span> etree<span class="token punctuation">.</span>HTML<span class="token punctuation">(</span>page_text<span class="token punctuation">)</span> <span class="token comment"># 存储的就是li标签对象</span> li_list <span class="token operator">=</span> tree<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'//ul[@class="house-list-wrap"]/li'</span><span class="token punctuation">)</span> fp <span class="token operator">=</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span 
class="token string">'58二手房'</span><span class="token punctuation">,</span> <span class="token string">'w'</span><span class="token punctuation">,</span> encoding<span class="token operator">=</span><span class="token string">'utf-8'</span><span class="token punctuation">)</span> <span class="token keyword">for</span> li <span class="token keyword">in</span> li_list<span class="token punctuation">:</span> <span class="token comment"># 局部解析</span> title <span class="token operator">=</span> li<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div[2]/h2/a/text()'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> price <span class="token operator">=</span> li<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div[3]/p//text()'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> price2 <span class="token operator">=</span> li<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div[3]/p//text()'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span> price3 <span class="token operator">=</span> li<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div[3]/p//text()'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">2</span><span class="token punctuation">]</span> fp<span class="token punctuation">.</span>write<span class="token punctuation">(</span>title <span class="token operator">+</span> <span class="token string">'\n'</span> <span 
class="token operator">+</span> price <span class="token operator">+</span> price2 <span class="token operator">+</span> price3 <span class="token operator">+</span> <span class="token string">'\n'</span><span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span>title <span class="token operator">+</span> <span class="token string">'\n'</span> <span class="token operator">+</span> price<span class="token punctuation">,</span> price2<span class="token punctuation">,</span> price3<span class="token punctuation">)</span> </code></pre> <p>示例2:</p> <pre><code>需求:解析下载图片数据 http://pic.netbian.com/4kmeinv/ </code></pre> <pre><code class="prism language-python"><span class="token keyword">import</span> requests <span class="token keyword">from</span> lxml <span class="token keyword">import</span> etree <span class="token keyword">import</span> os <span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span> <span class="token comment"># 创建文件夹,保存所有图片</span> <span class="token keyword">if</span> <span class="token operator">not</span> os<span class="token punctuation">.</span>path<span class="token punctuation">.</span>exists<span class="token punctuation">(</span><span class="token string">'E:/03-图片/03-picture/picbz'</span><span class="token punctuation">)</span><span class="token punctuation">:</span> os<span class="token punctuation">.</span>mkdir<span class="token punctuation">(</span><span class="token string">'E:/03-图片/03-picture/picbz'</span><span class="token punctuation">)</span> headers <span class="token operator">=</span> <span class="token punctuation">{ </span> <span class="token string">'User-Agent'</span><span class="token punctuation">:</span> <span class="token string">'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0'</span> <span class="token 
punctuation">}</span> <span class="token comment"># url = 'http://www.netbian.com/meinv/'</span> url <span class="token operator">=</span> <span class="token string">'http://pic.netbian.com/4kmeinv/index_%d.html'</span> <span class="token keyword">for</span> pageNum <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">,</span><span class="token number">10</span><span class="token punctuation">)</span><span class="token punctuation">:</span> new_url <span class="token operator">=</span> <span class="token builtin">format</span><span class="token punctuation">(</span>url <span class="token operator">%</span> pageNum<span class="token punctuation">)</span> response <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token operator">=</span>url<span class="token punctuation">,</span> headers<span class="token operator">=</span>headers<span class="token punctuation">)</span> <span class="token comment"># 手动设定响应数据的编码格式</span> <span class="token comment"># response.encoding = 'utf-8'</span> page_text <span class="token operator">=</span> response<span class="token punctuation">.</span>text <span class="token comment"># 数据解析:src的属性值 alt属性</span> tree <span class="token operator">=</span> etree<span class="token punctuation">.</span>HTML<span class="token punctuation">(</span>page_text<span class="token punctuation">)</span> li_list <span class="token operator">=</span> tree<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'//ul[@class="clearfix"]/li'</span><span class="token punctuation">)</span> <span class="token keyword">for</span> li <span class="token keyword">in</span> li_list<span class="token punctuation">:</span> img_src <span class="token operator">=</span> <span class="token 
string">'http://pic.netbian.com'</span> <span class="token operator">+</span> li<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./a/img/@src'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> img_name <span class="token operator">=</span> li<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./a/img/@alt'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token operator">+</span> <span class="token string">'.jpg'</span> <span class="token comment"># 通用处理中文乱码的解决方案</span> img_name <span class="token operator">=</span> img_name<span class="token punctuation">.</span>encode<span class="token punctuation">(</span><span class="token string">'iso-8859-1'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>decode<span class="token punctuation">(</span><span class="token string">'gbk'</span><span class="token punctuation">)</span> <span class="token comment"># 请求图片进行持久化存储</span> img_data <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token operator">=</span>img_src<span class="token punctuation">,</span>headers<span class="token operator">=</span>headers<span class="token punctuation">)</span><span class="token punctuation">.</span>content img_path <span class="token operator">=</span> <span class="token string">'E:/03-图片/03-picture/picbz/'</span> <span class="token operator">+</span> img_name <span class="token comment"># print(img_name, img_src)</span> <span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span>img_path<span 
class="token punctuation">,</span> <span class="token string">'wb'</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span> f<span class="token punctuation">.</span>write<span class="token punctuation">(</span>img_data<span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span>img_name<span class="token punctuation">,</span> <span class="token string">'下载成功'</span><span class="token punctuation">)</span> </code></pre> </li> </ol> <h3>6.6 示例</h3> <h4>6.6.1 要求:站长之家照片采集</h4> <pre><code class="prism language-python"><span class="token comment"># xpath 是解决HTML页面中数据的提取</span> <span class="token keyword">import</span> requests <span class="token keyword">from</span> lxml <span class="token keyword">import</span> etree url <span class="token operator">=</span> <span class="token string">'http://sc.chinaz.com/tupian/'</span> resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">)</span> <span class="token keyword">if</span> resp<span class="token punctuation">.</span>status_code <span class="token operator">==</span><span class="token number">200</span><span class="token punctuation">:</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'--请求成功--'</span><span class="token punctuation">)</span> <span class="token comment"># 开始使用xpath</span> <span class="token comment"># 1、获取网页根元素</span> resp<span class="token punctuation">.</span>encoding<span class="token operator">=</span><span class="token string">'utf-8'</span> <span class="token comment"># 获取文本数据之前,可以设置字符集</span> root <span class="token operator">=</span> etree<span class="token punctuation">.</span>HTML<span class="token punctuation">(</span>resp<span class="token punctuation">.</span>text<span class="token 
punctuation">)</span> <span class="token comment"># 将下载的html文本转成xpath的根元素的Element</span> <span class="token comment"># with open('a.html', 'wb') as f:</span> <span class="token comment"># f.write(resp.content) # 将网页写入文件,方便查找标签</span> <span class="token comment"># 根据属性条件者直接或间接查找子元素</span> img_elements <span class="token operator">=</span> root<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'//div[@class="pic_wrap"]//img'</span><span class="token punctuation">)</span> <span class="token comment"># 返回列表类型list[<Element img at 0x24f3f4befc8>,...]</span> <span class="token keyword">print</span><span class="token punctuation">(</span>img_elements<span class="token punctuation">)</span> <span class="token keyword">for</span> img_element <span class="token keyword">in</span> img_elements<span class="token punctuation">:</span> <span class="token keyword">print</span><span class="token punctuation">(</span>img_element<span class="token punctuation">)</span> <span class="token comment"># <Element img at 0x20fa767ed88></span> <span class="token keyword">print</span><span class="token punctuation">(</span>img_element<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'alt'</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token comment"># 霸气手持香烟美女图片</span> <span class="token keyword">print</span><span class="token punctuation">(</span>img_element<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./@alt'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token comment"># 霸气手持香烟美女图片</span> data <span class="token operator">=</span> <span class="token punctuation">{ </span><span 
class="token punctuation">}</span> <span class="token comment"># data['title'] = img_element.xpath('./@alt')[0] # xpath() returns a list of strings</span> <span class="token comment"># data['src'] = img_element.xpath('./@src2')[0]</span> <span class="token comment"># data['src'], data['title'] = img_element.xpath('./@alt |./@src2')</span> data<span class="token punctuation">[</span><span class="token string">'title'</span><span class="token punctuation">]</span> <span class="token operator">=</span> img_element<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'alt'</span><span class="token punctuation">)</span> <span class="token comment"># Read an attribute of the element</span> data<span class="token punctuation">[</span><span class="token string">'src'</span><span class="token punctuation">]</span> <span class="token operator">=</span> img_element<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'src2'</span><span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span>data<span class="token punctuation">)</span> <span class="token comment"># Get the link to the next page</span> <span class="token comment"># xpath() returns a list</span> page_next_url <span class="token operator">=</span> url <span class="token operator">+</span> root<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'//a[@class="nextpage"]/@href'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token keyword">print</span><span class="token punctuation">(</span>page_next_url<span class="token punctuation">)</span> </code></pre> <h4>6.6.2 Code improvements</h4> <p>Wrap the script in functions and follow the next-page link to collect the following pages.</p> <pre><code class="prism language-python"><span class="token keyword">import</span> os <span class="token 
keyword">import</span> random <span class="token keyword">import</span> time <span class="token keyword">from</span> urllib<span class="token punctuation">.</span>request <span class="token keyword">import</span> urlretrieve <span class="token keyword">import</span> requests <span class="token keyword">from</span> lxml <span class="token keyword">import</span> etree <span class="token keyword">from</span> utils<span class="token punctuation">.</span>es <span class="token keyword">import</span> ESIndex index <span class="token operator">=</span> ESIndex<span class="token punctuation">(</span><span class="token string">'chinaz_zhb'</span><span class="token punctuation">,</span> <span class="token string">'10.36.172.79'</span> <span class="token punctuation">,</span> <span class="token number">9200</span><span class="token punctuation">)</span> index<span class="token punctuation">.</span>create_index<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment"># 1. Entry point: download one page</span> <span class="token keyword">def</span> <span class="token function">start_spider</span><span class="token punctuation">(</span>url<span class="token punctuation">,</span> <span class="token operator">**</span>kwargs<span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'Downloading'</span><span class="token punctuation">,</span> url<span class="token punctuation">)</span> resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">,</span> <span class="token operator">**</span>kwargs<span class="token punctuation">)</span> <span class="token keyword">if</span> resp<span class="token punctuation">.</span>status_code <span class="token operator">==</span> <span class="token number">200</span><span class="token 
punctuation">:</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'Downloaded'</span><span class="token punctuation">,</span> url<span class="token punctuation">)</span> <span class="token comment"># Parse with XPath (wrapped in a function)</span> resp<span class="token punctuation">.</span>encoding<span class="token operator">=</span><span class="token string">'utf-8'</span> parse<span class="token punctuation">(</span>url<span class="token punctuation">,</span> resp<span class="token punctuation">.</span>text<span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span>resp<span class="token punctuation">.</span>text<span class="token punctuation">)</span> <span class="token comment"># //div[@class="pic_wrap"]//img</span> <span class="token comment"># index = 1</span> <span class="token comment"># 2. Parse function</span> <span class="token keyword">def</span> <span class="token function">parse</span><span class="token punctuation">(</span>url<span class="token punctuation">,</span> html<span class="token punctuation">)</span><span class="token punctuation">:</span> root <span class="token operator">=</span> etree<span class="token punctuation">.</span>HTML<span class="token punctuation">(</span>html<span class="token punctuation">)</span> <span class="token comment"># Root node of the current page</span> <span class="token comment"># global index</span> <span class="token comment"># with open('%s.html' % index, 'w', encoding='utf-8') as f:</span> <span class="token comment"># f.write(html)</span> <span class="token comment">#</span> <span class="token comment"># index += 1</span> xpath_str <span class="token operator">=</span> <span class="token string">'//div[@class="pic_wrap"]//img'</span> <span class="token keyword">if</span> url<span class="token punctuation">.</span>endswith<span class="token punctuation">(</span><span class="token string">'/'</span><span class="token punctuation">)</span>\ <span 
class="token keyword">else</span> <span class="token string">'//div[@id="container"]//img'</span> <span class="token keyword">print</span><span class="token punctuation">(</span>xpath_str<span class="token punctuation">)</span> <span class="token keyword">for</span> img <span class="token keyword">in</span> root<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span>xpath_str<span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token comment"># xpath() returns a list: [&lt;Element img at 0x24f3f4befc8&gt;, ...]</span> item <span class="token operator">=</span> <span class="token punctuation">{ </span><span class="token punctuation">}</span> item<span class="token punctuation">[</span><span class="token string">'title'</span><span class="token punctuation">]</span> <span class="token operator">=</span> img<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'alt'</span><span class="token punctuation">)</span> <span class="token comment"># Read an attribute of the element</span> item<span class="token punctuation">[</span><span class="token string">'src'</span><span class="token punctuation">]</span> <span class="token operator">=</span> img<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'src2'</span><span class="token punctuation">)</span> <span class="token comment"># Save the item</span> item_pipeline<span class="token punctuation">(</span>item<span class="token punctuation">)</span> <span class="token comment"># Build the URL of the next page</span> <span class="token keyword">if</span> url<span class="token punctuation">.</span>endswith<span class="token punctuation">(</span><span class="token string">'/'</span><span class="token punctuation">)</span><span class="token punctuation">:</span> base_url <span class="token operator">=</span> url <span class="token keyword">else</span><span class="token punctuation">:</span> base_url <span class="token 
operator">=</span> url<span class="token punctuation">[</span><span class="token punctuation">:</span>url<span class="token punctuation">.</span>rfind<span class="token punctuation">(</span><span class="token string">'/'</span><span class="token punctuation">)</span><span class="token operator">+</span><span class="token number">1</span><span class="token punctuation">]</span> next_hrefs <span class="token operator">=</span> root<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'//a[@class="nextpage"]/@href'</span><span class="token punctuation">)</span> <span class="token keyword">if</span> <span class="token keyword">not</span> next_hrefs<span class="token punctuation">:</span> <span class="token keyword">return</span> <span class="token comment"># No next-page link: the last page has been reached</span> next_page_url <span class="token operator">=</span> base_url<span class="token operator">+</span>next_hrefs<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token keyword">print</span><span class="token punctuation">(</span>next_page_url<span class="token punctuation">)</span> time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span>random<span class="token punctuation">.</span>uniform<span class="token punctuation">(</span><span class="token number">0.2</span><span class="token punctuation">,</span> <span class="token number">3.5</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token comment"># Sleep for a random interval</span> <span class="token comment"># Fetch the next page (recursive call)</span> start_spider<span class="token punctuation">(</span>next_page_url<span class="token punctuation">,</span> headers<span class="token operator">=</span><span class="token punctuation">{ </span> <span class="token string">'Referer'</span><span class="token punctuation">:</span> url<span class="token punctuation">,</span> <span class="token string">'User-Agent'</span><span class="token punctuation">:</span> <span class="token string">'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0'</span> <span class="token punctuation">}</span><span class="token punctuation">)</span> <span 
class="token comment"># 3. Item pipeline: save the data</span> <span class="token keyword">def</span> <span class="token function">item_pipeline</span><span class="token punctuation">(</span>item<span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token keyword">print</span><span class="token punctuation">(</span>item<span class="token punctuation">)</span> filename <span class="token operator">=</span> item<span class="token punctuation">[</span><span class="token string">'title'</span><span class="token punctuation">]</span> <span class="token operator">+</span> item<span class="token punctuation">[</span><span class="token string">'src'</span><span class="token punctuation">]</span><span class="token punctuation">[</span>item<span class="token punctuation">[</span><span class="token string">'src'</span><span class="token punctuation">]</span><span class="token punctuation">.</span>rfind<span class="token punctuation">(</span><span class="token string">'.'</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token punctuation">]</span> <span class="token comment"># item['src'][item['src'].rfind('.'):] is the file extension</span> download_img<span class="token punctuation">(</span>item<span class="token punctuation">[</span><span class="token string">'src'</span><span class="token punctuation">]</span><span class="token punctuation">,</span> filename<span class="token punctuation">)</span> <span class="token comment"># Save the item to the Elasticsearch index</span> index<span class="token punctuation">.</span>add_doc<span class="token punctuation">(</span><span class="token string">'images'</span><span class="token punctuation">,</span> <span class="token operator">**</span>item<span class="token punctuation">)</span> <span class="token comment"># Download an image</span> <span class="token keyword">def</span> <span class="token function">download_img</span><span class="token punctuation">(</span>url<span class="token punctuation">,</span> filename<span 
class="token punctuation">)</span><span class="token punctuation">:</span> os<span class="token punctuation">.</span>makedirs<span class="token punctuation">(</span><span class="token string">'images'</span><span class="token punctuation">,</span> exist_ok<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span> <span class="token comment"># Make sure the target directory exists before writing</span> urlretrieve<span class="token punctuation">(</span>url<span class="token punctuation">,</span> os<span class="token punctuation">.</span>path<span class="token punctuation">.</span>join<span class="token punctuation">(</span><span class="token string">'images'</span><span class="token punctuation">,</span> filename<span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span> start_spider<span class="token punctuation">(</span><span class="token string">'http://sc.chinaz.com/tupian/'</span><span class="token punctuation">)</span> </code></pre> <h4>6.6.3 Scraping the classical poetry site (gushiwen.org)</h4> <pre><code class="prism language-python"><span class="token comment"># Scraping the classical poetry site</span> <span class="token keyword">import</span> requests <span class="token keyword">from</span> lxml <span class="token keyword">import</span> etree <span class="token keyword">from</span> lxml<span class="token punctuation">.</span>etree <span class="token keyword">import</span> _Element <span class="token keyword">class</span> <span class="token class-name">GswSpider</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token comment"># Start URLs</span> start_url <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token string">'https://so.gushiwen.org/shiwen/'</span><span class="token punctuation">]</span> base_url <span class="token operator">=</span> <span class="token string">'https://so.gushiwen.org'</span> headers <span class="token operator">=</span> <span class="token punctuation">{ </span> <span class="token string">'User-Agent'</span><span class="token punctuation">:</span> <span class="token string">'Mozilla/5.0 (Windows NT 
10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0'</span> <span class="token punctuation">}</span> DEBUG <span class="token operator">=</span> <span class="token boolean">True</span> <span class="token comment"># Logging helper</span> <span class="token keyword">def</span> <span class="token function">log</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> <span class="token operator">*</span>args<span class="token punctuation">,</span> sep<span class="token operator">=</span><span class="token string">'\n'</span><span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token keyword">if</span> self<span class="token punctuation">.</span>DEBUG<span class="token punctuation">:</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token operator">*</span>args<span class="token punctuation">,</span> sep<span class="token operator">=</span>sep<span class="token punctuation">)</span> <span class="token comment"># Start the spider</span> <span class="token keyword">def</span> <span class="token function">start_spider</span><span class="token punctuation">(</span>self<span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token keyword">for</span> url <span class="token keyword">in</span> self<span class="token punctuation">.</span>start_url<span class="token punctuation">:</span> self<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">)</span> <span class="token comment"># Request wrapper (download)</span> <span class="token keyword">def</span> <span class="token function">get</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> url<span class="token punctuation">,</span> parse_func<span class="token operator">=</span><span class="token boolean">None</span><span class="token punctuation">,</span> <span class="token operator">**</span>kwargs<span 
class="token punctuation">)</span><span class="token punctuation">:</span> resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">,</span> headers<span class="token operator">=</span>self<span class="token punctuation">.</span>headers<span class="token punctuation">)</span> <span class="token comment"># Send the GET request</span> self<span class="token punctuation">.</span>log<span class="token punctuation">(</span>url<span class="token punctuation">,</span> resp<span class="token punctuation">.</span>status_code<span class="token punctuation">,</span> sep<span class="token operator">=</span><span class="token string">' '</span><span class="token punctuation">)</span> <span class="token comment"># Log the URL and status code, e.g. https://so.gushiwen.org/shiwen/ 200</span> <span class="token keyword">if</span> resp<span class="token punctuation">.</span>status_code <span class="token operator">==</span> <span class="token number">200</span><span class="token punctuation">:</span> resp<span class="token punctuation">.</span>encoding <span class="token operator">=</span> <span class="token string">'utf-8'</span> <span class="token comment"># Set the charset</span> <span class="token keyword">if</span> parse_func <span class="token keyword">is</span> <span class="token boolean">None</span><span class="token punctuation">:</span> self<span class="token punctuation">.</span>parse<span class="token punctuation">(</span>url<span class="token punctuation">,</span> resp<span class="token punctuation">.</span>text<span class="token punctuation">)</span> <span class="token keyword">else</span><span class="token punctuation">:</span> parse_func<span class="token punctuation">(</span>url<span class="token punctuation">,</span> resp<span class="token punctuation">.</span>text<span class="token punctuation">,</span> <span class="token operator">**</span>kwargs<span class="token punctuation">)</span> <span 
class="token comment"># Parse function</span> <span class="token keyword">def</span> <span class="token function">parse</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> url<span class="token punctuation">,</span> html<span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token comment"># Parse the category list in the right-hand column</span> root <span class="token operator">=</span> etree<span class="token punctuation">.</span>HTML<span class="token punctuation">(</span>html<span class="token punctuation">)</span> <span class="token comment"># Root node of the current page</span> <span class="token keyword">for</span> a_element <span class="token keyword">in</span> root<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'//div[@class="main3"]/div[last()]/div[1]//a'</span><span class="token punctuation">)</span><span class="token punctuation">:</span> name <span class="token operator">=</span> a_element<span class="token punctuation">.</span>text <span class="token comment"># Link text, e.g. 小学古诗</span> href <span class="token operator">=</span> a_element<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'href'</span><span class="token punctuation">)</span> <span class="token comment"># e.g. https://so.gushiwen.org/gushi/xiaoxue.aspx</span> <span class="token comment"># self.log(name, href, sep='--->') # 小学古诗--->https://so.gushiwen.org/gushi/xiaoxue.aspx</span> self<span class="token punctuation">.</span>get<span class="token punctuation">(</span>href<span class="token punctuation">,</span> self<span class="token punctuation">.</span>parse_gsw_list<span class="token punctuation">)</span> <span class="token comment"># Second-level request</span> <span class="token comment"># Parse all poems listed under one category</span> <span class="token keyword">def</span> <span class="token function">parse_gsw_list</span><span class="token punctuation">(</span>self<span class="token 
punctuation">,</span> url<span class="token punctuation">,</span> html<span class="token punctuation">)</span><span class="token punctuation">:</span> root <span class="token operator">=</span> etree<span class="token punctuation">.</span>HTML<span class="token punctuation">(</span>html<span class="token punctuation">)</span> left_div<span class="token punctuation">:</span> _Element <span class="token operator">=</span> root<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'//div[@class="main3"]/div[1]'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token comment"># self.log(type(left_div)) # &lt;class 'lxml.etree._Element'&gt;</span> cate_name <span class="token operator">=</span> left_div<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'.//h1/text()'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token comment"># Text of the h1 tag, e.g. 小学古诗文</span> <span class="token keyword">for</span> subtype_element <span class="token keyword">in</span> left_div<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'.//div[@class="typecont"]'</span><span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token keyword">try</span><span class="token punctuation">:</span> subtype_name <span class="token operator">=</span> subtype_element<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'.//strong/text()'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span 
class="token comment"># e.g. 一年级上册</span> <span class="token keyword">except</span> IndexError<span class="token punctuation">:</span> subtype_name <span class="token operator">=</span> <span class="token string">''</span> <span class="token comment"># self.log(cate_name, subtype_name, sep='=>') # 小学古诗文=>一年级上册</span> <span class="token keyword">print</span><span class="token punctuation">(</span>subtype_element<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'.//span/a'</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token keyword">for</span> span_a <span class="token keyword">in</span> subtype_element<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'.//span/a'</span><span class="token punctuation">)</span><span class="token punctuation">:</span> href <span class="token operator">=</span> span_a<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'href'</span><span class="token punctuation">)</span> <span class="token comment"># e.g. https://so.gushiwen.org/shiwenv_ef9cd9ba44bb.aspx</span> <span class="token keyword">print</span><span class="token punctuation">(</span>href<span class="token punctuation">)</span> <span class="token keyword">if</span> href<span class="token punctuation">:</span> href <span class="token operator">=</span> href <span class="token keyword">if</span> href<span class="token punctuation">.</span>startswith<span class="token punctuation">(</span><span class="token string">'http'</span><span class="token punctuation">)</span> <span class="token keyword">else</span> self<span class="token punctuation">.</span>base_url <span class="token operator">+</span> href book_name <span class="token operator">=</span> span_a<span class="token punctuation">.</span>text <span class="token comment"># Poem title, e.g. 江南</span> self<span class="token 
punctuation">.</span>log<span class="token punctuation">(</span>cate_name<span class="token punctuation">,</span> subtype_name<span class="token punctuation">,</span> book_name<span class="token punctuation">,</span> href<span class="token punctuation">,</span> sep<span class="token operator">=</span><span class="token string">'=>'</span><span class="token punctuation">)</span> self<span class="token punctuation">.</span>get<span class="token punctuation">(</span>href<span class="token punctuation">,</span> self<span class="token punctuation">.</span>parse_gsw<span class="token punctuation">,</span> category<span class="token operator">=</span>cate_name<span class="token punctuation">,</span> subtype<span class="token operator">=</span>subtype_name<span class="token punctuation">)</span> <span class="token comment"># Third-level request</span> <span class="token comment"># Parse the detail page of a single poem</span> <span class="token keyword">def</span> <span class="token function">parse_gsw</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> url<span class="token punctuation">,</span> html<span class="token punctuation">,</span> category<span class="token operator">=</span><span class="token boolean">None</span><span class="token punctuation">,</span> subtype<span class="token operator">=</span><span class="token boolean">None</span><span class="token punctuation">)</span><span class="token punctuation">:</span> self<span class="token punctuation">.</span>log<span class="token punctuation">(</span><span class="token string">'parse_gsw'</span><span class="token punctuation">,</span> category<span class="token punctuation">,</span> subtype<span class="token punctuation">,</span> sep<span class="token operator">=</span><span class="token string">' '</span><span class="token punctuation">)</span> root <span class="token operator">=</span> etree<span class="token punctuation">.</span>HTML<span class="token punctuation">(</span>html<span class="token 
punctuation">)</span> <span class="token comment"># Extract the poem's content</span> left_div <span class="token operator">=</span> root<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'//div[@class="main3"]/div[1]'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> gsw_element <span class="token operator">=</span> left_div<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div[2]/div[1]'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> name <span class="token operator">=</span> gsw_element<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'.//h1/text()'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> era<span class="token punctuation">,</span> author <span class="token operator">=</span> gsw_element<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./p/a/text()'</span><span class="token punctuation">)</span> content <span class="token operator">=</span> <span class="token string">'\n'</span><span class="token punctuation">.</span>join<span class="token punctuation">(</span><span class="token punctuation">[</span>c<span class="token punctuation">.</span>replace<span class="token punctuation">(</span><span class="token string">'\n'</span><span class="token punctuation">,</span> <span class="token string">''</span><span class="token punctuation">)</span><span class="token punctuation">.</span>strip<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token 
keyword">for</span> c <span class="token keyword">in</span> gsw_element<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div[last()]/text()'</span><span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token comment"># self.log(name, author, era, sep='+')</span> <span class="token comment"># print(content)</span> self<span class="token punctuation">.</span>item_pipeline<span class="token punctuation">(</span><span class="token builtin">dict</span><span class="token punctuation">(</span>name<span class="token operator">=</span>name<span class="token punctuation">,</span> author<span class="token operator">=</span>author<span class="token punctuation">,</span> era<span class="token operator">=</span>era<span class="token punctuation">,</span> category<span class="token operator">=</span>category<span class="token punctuation">,</span> subtype<span class="token operator">=</span>subtype<span class="token punctuation">,</span> content<span class="token operator">=</span>content<span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token keyword">def</span> <span class="token function">item_pipeline</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> item<span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token keyword">print</span><span class="token punctuation">(</span>item<span class="token punctuation">)</span> <span class="token comment"># Placeholder: the item would be written to Elasticsearch here</span> <span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span> spider <span class="token operator">=</span> GswSpider<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment"># 
spider.DEBUG=False</span> spider<span class="token punctuation">.</span>start_spider<span class="token punctuation">(</span><span class="token punctuation">)</span> </code></pre> <p>Notes:</p> <p>1. Example of the tags used to extract the category list (original screenshot 06-古诗文采集标签举例.png is not available).</p> <p>2. Example of the tags used to parse the poem list of one category (original screenshot 07-古诗文采集解析某一分类数据标签举例.png is not available).</p> <h4>6.6.4 Coroutine version of the poetry scraper</h4> <pre><code class="prism language-python"><span class="token comment"># Scraping the classical poetry site</span> <span class="token keyword">import</span> requests <span class="token keyword">from</span> lxml <span class="token keyword">import</span> etree <span class="token keyword">from</span> lxml<span class="token punctuation">.</span>etree <span class="token keyword">import</span> _Element <span class="token keyword">import</span> asyncio <span class="token keyword">class</span> <span class="token class-name">GswSpider</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span> start_urls <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token string">'https://so.gushiwen.org/shiwen/'</span><span class="token punctuation">]</span> base_url <span class="token operator">=</span> <span class="token string">'https://so.gushiwen.org'</span> headers <span class="token operator">=</span> <span class="token punctuation">{ </span> <span class="token string">'User-Agent'</span><span class="token punctuation">:</span> <span class="token string">'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'</span> <span class="token punctuation">}</span> DEBUG <span class="token operator">=</span> <span class="token boolean">True</span> <span class="token comment"># Decorator for generator-based coroutines (deprecated since Python 3.8 and removed in 3.11, so this example needs Python 3.10 or earlier)</span> @asyncio<span class="token punctuation">.</span>coroutine <span class="token keyword">def</span> <span 
class="token function">log</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> <span class="token operator">*</span>args<span class="token punctuation">,</span> sep<span class="token operator">=</span><span class="token string">'\n'</span><span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token keyword">if</span> self<span class="token punctuation">.</span>DEBUG<span class="token punctuation">:</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token operator">*</span>args<span class="token punctuation">,</span> sep<span class="token operator">=</span>sep<span class="token punctuation">)</span> @asyncio<span class="token punctuation">.</span>coroutine <span class="token keyword">def</span> <span class="token function">start_spider</span><span class="token punctuation">(</span>self<span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token keyword">for</span> url <span class="token keyword">in</span> self<span class="token punctuation">.</span>start_urls<span class="token punctuation">:</span> <span class="token keyword">yield</span> <span class="token keyword">from</span> self<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">)</span> @asyncio<span class="token punctuation">.</span>coroutine <span class="token keyword">def</span> <span class="token function">get</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> url<span class="token punctuation">,</span> parse_func<span class="token operator">=</span><span class="token boolean">None</span><span class="token punctuation">,</span> <span class="token operator">**</span>kwargs<span class="token punctuation">)</span><span class="token punctuation">:</span> resp <span class="token operator">=</span> requests<span class="token 
punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">,</span> headers<span class="token operator">=</span>self<span class="token punctuation">.</span>headers<span class="token punctuation">)</span> <span class="token keyword">yield</span> <span class="token keyword">from</span> self<span class="token punctuation">.</span>log<span class="token punctuation">(</span>url<span class="token punctuation">,</span> resp<span class="token punctuation">.</span>status_code<span class="token punctuation">,</span> sep<span class="token operator">=</span><span class="token string">' '</span><span class="token punctuation">)</span> <span class="token keyword">if</span> resp<span class="token punctuation">.</span>status_code <span class="token operator">==</span> <span class="token number">200</span><span class="token punctuation">:</span> resp<span class="token punctuation">.</span>encoding <span class="token operator">=</span> <span class="token string">'utf-8'</span> <span class="token keyword">if</span> parse_func <span class="token keyword">is</span> <span class="token boolean">None</span><span class="token punctuation">:</span> <span class="token keyword">yield</span> <span class="token keyword">from</span> self<span class="token punctuation">.</span>parse<span class="token punctuation">(</span>url<span class="token punctuation">,</span> resp<span class="token punctuation">.</span>text<span class="token punctuation">)</span> <span class="token keyword">else</span><span class="token punctuation">:</span> <span class="token keyword">yield</span> <span class="token keyword">from</span> parse_func<span class="token punctuation">(</span>url<span class="token punctuation">,</span> resp<span class="token punctuation">.</span>text<span class="token punctuation">,</span> <span class="token operator">**</span>kwargs<span class="token punctuation">)</span> @asyncio<span class="token punctuation">.</span>coroutine <span class="token keyword">def</span> <span class="token function">parse</span><span class="token 
punctuation">(</span>self<span class="token punctuation">,</span> url<span class="token punctuation">,</span> html<span class="token punctuation">)</span><span class="token punctuation">:</span> root <span class="token operator">=</span> etree<span class="token punctuation">.</span>HTML<span class="token punctuation">(</span>html<span class="token punctuation">)</span> <span class="token keyword">for</span> a_element <span class="token keyword">in</span> root<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'//div[@class="main3"]/div[last()]/div[1]//a'</span><span class="token punctuation">)</span><span class="token punctuation">:</span> name <span class="token operator">=</span> a_element<span class="token punctuation">.</span>text href <span class="token operator">=</span> a_element<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'href'</span><span class="token punctuation">)</span> <span class="token keyword">yield</span> <span class="token keyword">from</span> self<span class="token punctuation">.</span>log<span class="token punctuation">(</span>name<span class="token punctuation">,</span> href<span class="token punctuation">,</span> sep<span class="token operator">=</span><span class="token string">'->'</span><span class="token punctuation">)</span> <span class="token keyword">yield</span> <span class="token keyword">from</span> self<span class="token punctuation">.</span>get<span class="token punctuation">(</span>href<span class="token punctuation">,</span> self<span class="token punctuation">.</span>parse_gsw_list<span class="token punctuation">)</span> @asyncio<span class="token punctuation">.</span>coroutine <span class="token keyword">def</span> <span class="token function">parse_gsw_list</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> url<span class="token punctuation">,</span> html<span 
class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token comment"># 解析某一个分类下的所有诗文</span> root <span class="token operator">=</span> etree<span class="token punctuation">.</span>HTML<span class="token punctuation">(</span>html<span class="token punctuation">)</span> left_div<span class="token punctuation">:</span> _Element <span class="token operator">=</span> root<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'//div[@class="main3"]/div[1]'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token comment"># self.log(type(left_div))</span> cate_name <span class="token operator">=</span> left_div<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'.//h1/text()'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token keyword">for</span> subtype_element <span class="token keyword">in</span> left_div<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'.//div[@class="typecont"]'</span><span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token keyword">try</span><span class="token punctuation">:</span> subtype_name <span class="token operator">=</span> subtype_element<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'.//strong/text()'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token keyword">except</span><span class="token punctuation">:</span> subtype_name <span class="token 
operator">=</span> <span class="token string">''</span> <span class="token keyword">for</span> span_a <span class="token keyword">in</span> subtype_element<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'.//span/a'</span><span class="token punctuation">)</span><span class="token punctuation">:</span> href <span class="token operator">=</span> span_a<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'href'</span><span class="token punctuation">)</span> <span class="token keyword">if</span> href<span class="token punctuation">:</span> href <span class="token operator">=</span> href <span class="token keyword">if</span> href<span class="token punctuation">.</span>startswith<span class="token punctuation">(</span><span class="token string">'http'</span><span class="token punctuation">)</span> <span class="token keyword">else</span> self<span class="token punctuation">.</span>base_url <span class="token operator">+</span> href book_name <span class="token operator">=</span> span_a<span class="token punctuation">.</span>text <span class="token keyword">yield</span> <span class="token keyword">from</span> self<span class="token punctuation">.</span>log<span class="token punctuation">(</span>cate_name<span class="token punctuation">,</span> subtype_name<span class="token punctuation">,</span> book_name<span class="token punctuation">,</span> href<span class="token punctuation">,</span> sep<span class="token operator">=</span><span class="token string">'>'</span><span class="token punctuation">)</span> <span class="token keyword">yield</span> <span class="token keyword">from</span> self<span class="token punctuation">.</span>get<span class="token punctuation">(</span>href<span class="token punctuation">,</span> self<span class="token punctuation">.</span>parse_gsw<span class="token punctuation">,</span> category<span class="token 
operator">=</span>cate_name<span class="token punctuation">,</span> subtype<span class="token operator">=</span>subtype_name<span class="token punctuation">)</span> @asyncio<span class="token punctuation">.</span>coroutine <span class="token keyword">def</span> <span class="token function">parse_gsw</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> url<span class="token punctuation">,</span> html<span class="token punctuation">,</span> category<span class="token operator">=</span><span class="token boolean">None</span><span class="token punctuation">,</span> subtype<span class="token operator">=</span><span class="token boolean">None</span><span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token keyword">yield</span> <span class="token keyword">from</span> self<span class="token punctuation">.</span>log<span class="token punctuation">(</span><span class="token string">'parse_gsw'</span><span class="token punctuation">,</span> category<span class="token punctuation">,</span> subtype<span class="token punctuation">,</span> sep<span class="token operator">=</span><span class="token string">' '</span><span class="token punctuation">)</span> root <span class="token operator">=</span> etree<span class="token punctuation">.</span>HTML<span class="token punctuation">(</span>html<span class="token punctuation">)</span> left_div <span class="token operator">=</span> root<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'//div[@class="main3"]/div[1]'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token comment"># 提取诗文内容</span> gsw_element <span class="token operator">=</span> left_div<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token 
string">'./div[2]/div[1]'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> name <span class="token operator">=</span> gsw_element<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'.//h1/text()'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> era<span class="token punctuation">,</span> author <span class="token operator">=</span> gsw_element<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./p/a/text()'</span><span class="token punctuation">)</span> content <span class="token operator">=</span> <span class="token string">'\n'</span><span class="token punctuation">.</span>join<span class="token punctuation">(</span><span class="token punctuation">[</span>c<span class="token punctuation">.</span>replace<span class="token punctuation">(</span><span class="token string">'\n'</span><span class="token punctuation">,</span> <span class="token string">''</span><span class="token punctuation">)</span><span class="token punctuation">.</span>strip<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">for</span> c <span class="token keyword">in</span> gsw_element<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div[last()]//text()'</span><span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token comment"># self.log(name, author, era, sep='+')</span> <span class="token keyword">yield</span> <span class="token keyword">from</span> self<span class="token punctuation">.</span>item_pipeline<span class="token punctuation">(</span><span 
class="token builtin">dict</span><span class="token punctuation">(</span>name<span class="token operator">=</span>name<span class="token punctuation">,</span> author<span class="token operator">=</span>author<span class="token punctuation">,</span> era<span class="token operator">=</span>era<span class="token punctuation">,</span> category<span class="token operator">=</span>category<span class="token punctuation">,</span> subtype<span class="token operator">=</span>subtype<span class="token punctuation">,</span> content<span class="token operator">=</span>content<span class="token punctuation">)</span><span class="token punctuation">)</span> @asyncio<span class="token punctuation">.</span>coroutine <span class="token keyword">def</span> <span class="token function">item_pipeline</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> item<span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token keyword">print</span><span class="token punctuation">(</span>item<span class="token punctuation">)</span> <span class="token comment"># 写入到ES中</span> <span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span> spider <span class="token operator">=</span> GswSpider<span class="token punctuation">(</span><span class="token punctuation">)</span> spider<span class="token punctuation">.</span>DEBUG<span class="token operator">=</span><span class="token boolean">False</span> <span class="token comment"># asyncio.run(spider.start_spider()) # 获取事件模型对象</span> loop <span class="token operator">=</span> asyncio<span class="token punctuation">.</span>get_event_loop<span class="token punctuation">(</span><span class="token punctuation">)</span> loop<span class="token punctuation">.</span>run_until_complete<span class="token punctuation">(</span>spider<span class="token 
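punctuation">.</span>start_spider<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token comment"># Start a single coroutine object</span> <span class="token comment"># tasks = [spider.start_spider(), spider.start_spider(), spider.start_spider()]</span> <span class="token comment"># loop.run_until_complete(asyncio.wait(tasks)) # Start several coroutine objects</span> <span class="token comment"># spider.get('https://so.gushiwen.org/shiwenv_e4df1367a39a.aspx', spider.parse_gsw)</span> </code></pre> <h2>7. CAPTCHAs</h2> <h3>7.1 Cookie:</h3> <pre><code>Characteristic of the HTTP/HTTPS protocols: they are stateless.
Why the expected page data was not returned:
    When the second request (for the personal home page) is sent, the server does not know that this request is made in a logged-in state.
cookie: lets the server record state about the client.
    - Manual handling: capture the cookie value with a packet-capture tool and put it into headers. (Not recommended)
    - Automatic handling:
        - Where does the cookie value come from?
            - It is created by the server after the simulated-login POST request.
        session object:
            - Purpose:
                1. It can send requests.
                2. If a cookie is produced during a request, the cookie is automatically stored in and carried by the session object.
        - Create a session object: session = requests.Session()
        - Use the session object to send the simulated-login POST request (the cookie is then stored in the session)
        - Use the session object to send the GET request for the personal home page (the cookie is carried along)
</code></pre> <h2>8. Proxies</h2> <pre><code>Proxy: a way around the anti-crawler mechanism of IP banning.
What a proxy is:
    - A proxy server.
What a proxy does:
    - Gets around access limits placed on your own IP.
    - Hides your real IP.
Proxy-related websites:
    - 快代理 (Kuaidaili)
    - 西祠代理 (Xici proxy)
    - www.goubanjia.com
Types of proxy IP:
    - http: used for URLs of the http protocol
    - https: used for URLs of the https protocol
Anonymity levels of a proxy IP:
    - Transparent: the server knows the request used a proxy and also knows the real IP behind it
    - Anonymous: the server knows a proxy was used but does not know the real IP
    - Elite (high anonymity): the server does not know a proxy was used, let alone the real IP
</code></pre> <h2>9. Multi-task crawlers</h2> <blockquote> <p>Use asynchronous techniques in a crawler to achieve high-performance data crawling</p> </blockquote> <h3>1. 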
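Processes and threads</h3> <ul> <li> <p>multiprocessing module (processes)</p> <ul> <li>Process: the process class</li> <li>Queue: a queue for inter-process communication <ul> <li>put(item, timeout=...)</li> <li>item = get(timeout=...)</li> </ul> </li> </ul> <p>When processes are used:</p> <ul> <li> <p>Separating a server program from its client programs, for example the MySQL server and the MySQL client; services such as Docker, Redis and ElasticSearch all fall into this category</p> </li> <li> <p>Inside service frameworks, for example scrapy, Django and Flask.</p> </li> <li> <p>Separating business tasks, for example scheduled jobs and task orchestration with a process pool, or the Celery framework</p> </li> <li> <p>Diagram of the usage scenarios:</p> <p>[Image unavailable: 08 - process usage scenarios]</p> </li> </ul> </li> <li> <p>threading module (threads)</p> <ul> <li>Thread: the thread class</li> <li>Communication between threads (accessing shared objects) <ul> <li>queue.Queue: the thread queue</li> <li>Callback functions (declared in the main thread, called by the child thread)</li> </ul> </li> </ul> </li> </ul> <h3>2. Processes</h3> <h4>2.1 Process life cycle</h4> <pre><code># Process life cycle (states):
# 1-> Created: Xxx()
# 1->2: Started: .start()
# 2-> Ready, waiting to be run
# 3-> Running: run() is being executed and the CPU has given the process a time slice
# 3->2: Ready: the time slice is used up, so the process returns to the ready state
# 3->5: Finished
# 4-> Blocked: while run() executes, the program hits an I/O operation (file reads/writes, network stream reads/writes, network requests)
# 4->2: when the blocking ends, the process returns to the ready state
# 5-> Finished: the code has run to completion and the program ends
</code></pre> <p>Diagram:</p> <p>[Image unavailable: 11 - process life cycle diagram]</p> <h4>2.2 Communication between processes</h4> <pre><code class="prism language-python"><span class="token comment"># Ways for processes to communicate (each process has its own separate memory):</span> <span class="token comment"># 1. multiprocessing.Queue: process queue</span> <span class="token comment"># 2. multiprocessing.Pipe: process pipe</span> <span class="token comment"># 3. multiprocessing.Manager: shared objects, held in a manager server process and accessed through proxies (C-typed shared memory is provided by multiprocessing.Value / multiprocessing.Array)</span> <span class="token comment"># 4. socket.AF_UNIX sockets on Linux</span> <span class="token comment"># 5. 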
signals (keyboard event handling, etc.)</span> </code></pre> <h4>2.3 Crawler process design diagram:</h4> <p>[Image unavailable: 09 - crawler process design]</p> <h4>2.4 Process queue</h4> <h5>2.4.1 Description</h5> <p>A queue shared between processes, implemented with a pipe and a few locks/semaphores.<br> When a process first puts an item on the queue, a feeder thread is started that transfers objects from a buffer into the pipe.</p> <h5>2.4.2 Methods of Queue</h5> <pre><code>- Queue(maxsize=0): maximum number of items in the queue; &lt;= 0 means unlimited
- Put an item: put(obj[, block[, timeout]])
- Get an item: get([block[, timeout]])
- Queue size: qsize()
- Is it empty: empty()
- Close: close()
</code></pre> <pre><code class="prism language-python"><span class="token comment"># Requirement: the boss hands out 20 jobs to 5 workers</span> <span class="token keyword">from</span> multiprocessing <span class="token keyword">import</span> Process<span class="token punctuation">,</span> Queue <span class="token keyword">import</span> time <span class="token keyword">import</span> os <span class="token keyword">def</span> <span class="token function">boss</span><span class="token punctuation">(</span>q<span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token comment"># The boss assigns the tasks</span> <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">20</span><span class="token punctuation">)</span><span class="token punctuation">:</span> msg <span class="token operator">=</span> <span class="token string">'Boss按排的任务: %d '</span> <span class="token operator">%</span>i q<span class="token punctuation">.</span>put<span class="token punctuation">(</span>msg<span class="token punctuation">)</span> <span class="token comment"># If the queue is not full, the message is stored immediately; otherwise put() blocks until there is room</span> <span class="token keyword">print</span><span class="token punctuation">(</span>time<span class="token punctuation">.</span>strftime<span class="token punctuation">(</span><span class="token string">'%x %X'</span><span class="token punctuation">,</span> time<span class="token punctuation">.</span>localtime<span 
class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">,</span> msg<span class="token punctuation">)</span> <span class="token comment"># 08/29/20 09:12:06 Boss按排的任务: 0</span> time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">0.5</span><span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'--boss任务派发完成----'</span><span class="token punctuation">)</span> <span class="token keyword">def</span> <span class="token function">worker</span><span class="token punctuation">(</span>q<span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token keyword">while</span> <span class="token boolean">True</span><span class="token punctuation">:</span> msg <span class="token operator">=</span> q<span class="token punctuation">.</span>get<span class="token punctuation">(</span>timeout<span class="token operator">=</span><span class="token number">5</span><span class="token punctuation">)</span> <span class="token comment"># 5秒内没有消息,表示工作完成</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'at {} 工人<{}> 收到: {}'</span><span class="token punctuation">.</span><span class="token builtin">format</span><span class="token punctuation">(</span>time<span class="token punctuation">.</span>time<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> os<span class="token punctuation">.</span>getpid<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> msg<span class="token punctuation">)</span><span class="token punctuation">)</span> time<span class="token punctuation">.</span>sleep<span class="token 
punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span> <span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span> q <span class="token operator">=</span> Queue<span class="token punctuation">(</span>maxsize<span class="token operator">=</span><span class="token number">2</span><span class="token punctuation">)</span> <span class="token comment"># 最大的消息数量</span> workers <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span> <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">5</span><span class="token punctuation">)</span><span class="token punctuation">:</span> p <span class="token operator">=</span> Process<span class="token punctuation">(</span>target<span class="token operator">=</span>worker<span class="token punctuation">,</span> args<span class="token operator">=</span><span class="token punctuation">(</span>q<span class="token punctuation">,</span> <span class="token punctuation">)</span><span class="token punctuation">)</span> p<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span> workers<span class="token punctuation">.</span>append<span class="token punctuation">(</span>p<span class="token punctuation">)</span> <span class="token comment"># 将所有工人进程管理起来</span> boss<span class="token punctuation">(</span>q<span class="token punctuation">)</span> <span class="token comment"># 老板开始派活</span> <span class="token keyword">for</span> worker <span class="token keyword">in</span> workers<span class="token punctuation">:</span> worker<span class="token punctuation">.</span>terminate<span class="token punctuation">(</span><span class="token 
punctuation">)</span> <span class="token comment"># Dismiss the workers</span> q<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'--工作完成---over--'</span><span class="token punctuation">)</span> </code></pre> <h5>2.4.3 Multi-task data crawling of the 晒了吗 website</h5> <pre><code class="prism language-python"> </code></pre> <h4>2.5 Process pipes (half/full duplex)</h4> <p>A pipe, also called an unnamed pipe, is the oldest form of IPC (inter-process communication) on UNIX systems.</p> <p>A pipe connects the data streams of different processes.</p> <pre><code>pipe.recv()     receive a message
pipe.send(obj)  send a message (any picklable object)
</code></pre> <h5>2.5.1.1 Half duplex (one-way pipe)</h5> <pre><code class="prism language-python"><span class="token keyword">import</span> time <span class="token keyword">from</span> multiprocessing <span class="token keyword">import</span> Process<span class="token punctuation">,</span>current_process <span class="token keyword">as</span> cp<span class="token punctuation">,</span>Pipe <span class="token keyword">def</span> <span class="token function">send_msg</span><span class="token punctuation">(</span>conn<span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'--send msg--'</span><span class="token punctuation">,</span>cp<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>name<span class="token punctuation">)</span> time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">3</span><span class="token punctuation">)</span> conn<span class="token punctuation">.</span>send<span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">,</span><span class="token number">2</span><span class="token punctuation">,</span><span 
class="token number">3</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token comment"># A pipe can send any picklable object</span> <span class="token comment"># conn.send('are you ok?')</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'--msg已发送--'</span><span class="token punctuation">,</span>cp<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>name<span class="token punctuation">)</span> <span class="token keyword">def</span> <span class="token function">receive_msg</span><span class="token punctuation">(</span>conn<span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'--receive msg--'</span><span class="token punctuation">,</span>cp<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>name<span class="token punctuation">)</span> msg<span class="token operator">=</span>conn<span class="token punctuation">.</span>recv<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment"># Blocks until a message arrives</span> <span class="token keyword">print</span><span class="token punctuation">(</span>cp<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>name<span class="token punctuation">,</span><span class="token string">'receives msg->'</span><span class="token punctuation">,</span>msg<span class="token punctuation">)</span> <span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span> <span class="token comment"># duplex=False creates a one-way pipe</span> <span class="token comment"># conn1 can only receive (read-only); conn2 can only send</span> conn1<span 
class="token punctuation">,</span>conn2<span class="token operator">=</span>Pipe<span class="token punctuation">(</span>duplex<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span> p1<span class="token operator">=</span>Process<span class="token punctuation">(</span>target<span class="token operator">=</span>send_msg<span class="token punctuation">,</span>args<span class="token operator">=</span><span class="token punctuation">(</span>conn2<span class="token punctuation">,</span><span class="token punctuation">)</span><span class="token punctuation">)</span> p2 <span class="token operator">=</span> Process<span class="token punctuation">(</span>target<span class="token operator">=</span>receive_msg<span class="token punctuation">,</span> args<span class="token operator">=</span><span class="token punctuation">(</span>conn1<span class="token punctuation">,</span><span class="token punctuation">)</span><span class="token punctuation">)</span> p2<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span> p1<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span> p1<span class="token punctuation">.</span>join<span class="token punctuation">(</span><span class="token punctuation">)</span> p2<span class="token punctuation">.</span>join<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'--over--'</span><span class="token punctuation">)</span> </code></pre> <h5>2.5.1.2 Full duplex</h5> <pre><code class="prism language-python"><span class="token keyword">import</span> time <span class="token keyword">from</span> multiprocessing <span class="token keyword">import</span> Process<span class="token punctuation">,</span>current_process <span class="token 
keyword">as</span> cp<span class="token punctuation">,</span>Pipe <span class="token keyword">def</span> <span class="token function">a</span><span class="token punctuation">(</span>conn<span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token keyword">print</span><span class="token punctuation">(</span>cp<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>name<span class="token punctuation">,</span><span class="token string">'--a msg--'</span><span class="token punctuation">)</span> time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span> conn<span class="token punctuation">.</span>send<span class="token punctuation">(</span><span class="token string">'are you ok?'</span><span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span>cp<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>name<span class="token punctuation">,</span><span class="token string">'a->b sended msg'</span><span class="token punctuation">)</span> msg<span class="token operator">=</span>conn<span class="token punctuation">.</span>recv<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment">#等待消息到达</span> <span class="token keyword">print</span><span class="token punctuation">(</span>cp<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>name<span class="token punctuation">,</span><span class="token string">'接收到b的消息'</span><span class="token punctuation">,</span>msg<span class="token punctuation">)</span> <span class="token keyword">def</span> <span class="token function">b</span><span class="token punctuation">(</span>conn<span 
class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token keyword">print</span><span class="token punctuation">(</span>cp<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>name<span class="token punctuation">,</span><span class="token string">'--b msg--'</span><span class="token punctuation">)</span> msg<span class="token operator">=</span>conn<span class="token punctuation">.</span>recv<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment">#等待消息到达</span> <span class="token keyword">if</span> msg<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'are you ok?'</span><span class="token punctuation">)</span><span class="token operator">></span><span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">:</span> conn<span class="token punctuation">.</span>send<span class="token punctuation">(</span><span class="token string">'ok!'</span><span class="token punctuation">)</span> <span class="token keyword">else</span><span class="token punctuation">:</span> conn<span class="token punctuation">.</span>send<span class="token punctuation">(</span><span class="token string">'收到你的消息真开心'</span><span class="token punctuation">)</span> <span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span> <span class="token comment"># duplex=True 表示全双工</span> <span class="token comment"># conn1/conn2 可读可写</span> conn1<span class="token punctuation">,</span>conn2<span class="token operator">=</span>Pipe<span class="token punctuation">(</span>duplex<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span> p1<span class="token operator">=</span>Process<span 
class="token punctuation">(</span>target<span class="token operator">=</span>a<span class="token punctuation">,</span>args<span class="token operator">=</span><span class="token punctuation">(</span>conn1<span class="token punctuation">,</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    p2 <span class="token operator">=</span> Process<span class="token punctuation">(</span>target<span class="token operator">=</span>b<span class="token punctuation">,</span> args<span class="token operator">=</span><span class="token punctuation">(</span>conn2<span class="token punctuation">,</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    p2<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span>
    p1<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span>
    p1<span class="token punctuation">.</span>join<span class="token punctuation">(</span><span class="token punctuation">)</span>
    p2<span class="token punctuation">.</span>join<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'--over--'</span><span class="token punctuation">)</span>
</code></pre> <h3>3. 
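Threads</h3> <h4>3.1 The concept of a thread</h4> <pre><code>1. A process contains at least one thread. The process is an abstraction; what actually gets scheduled on the CPU is the threads inside it.
2. Threads do the actual work. A thread uses the resources owned by its process; it is only a unit of scheduling and owns almost no resources of its own.
3. Every process creates one thread by default when it starts; this thread is called the main thread (MainThread).
4. A process (task) may consist of several subtasks. If the process runs only one thread, the subtasks effectively execute serially, i.e. the program has a single path of execution.
</code></pre>
<h4>3.2 Threads and processes</h4> <p><font color="blue"><strong>Once a program is started it has at least one process, and that process has at least one thread</strong></font></p>
<h5>3.2.1 Purpose</h5> <ul> <li>Processes enable multitasking, e.g. running several QQ instances on one computer at the same time.</li> <li>Threads enable multitasking too, e.g. several chat windows inside one QQ instance.</li> </ul>
<h5>3.2.2 Definitions</h5> <ul> <li><strong>A process is an independent unit of resource allocation and scheduling in the system</strong>.</li> <li><strong>A thread is</strong> an entity within a process and <strong>the basic unit of CPU scheduling and dispatch</strong>; it is a unit smaller than a process that can run independently. A thread owns essentially no system resources of its own, only the few that are indispensable while running (such as a program counter, a set of registers and a stack), but it shares all the resources owned by the process with the other threads of the same process.</li> </ul>
<h5>3.2.3 Differences</h5> <ul> <li> <p>A program has at least one process, and a process has at least one thread.</p> </li> <li> <p>Threads are finer-grained than processes (they hold fewer resources), which gives multithreaded programs a high degree of concurrency.</p> </li> <li> <p>Each process has its own memory space while it runs, whereas the threads of one process share memory, which can make a program considerably more efficient.</p> </li> <li> <p><strong>A thread cannot run on its own; it must live inside a process</strong></p> </li> <li> <p>Think of a process as an assembly line in a factory and of its threads as the workers on that line.</p> </li> </ul>
<p><strong>Pros and cons</strong></p> <p>Threads and processes each have pros and cons: threads are cheap to run but make resource management and protection harder; processes are the opposite.</p> <pre><code>Common in practice ---> multiple processes + coroutines
Process: uses more resources, relatively less efficient, easier to manage and maintain
Thread: uses fewer resources, relatively more efficient, harder to manage and maintain
There is a trade-off between the two
</code></pre>
<p><strong>Example:</strong></p> <p>Start 4 <strong>processes</strong> --> as a rule, start as many processes as there are CPUs (a 4-core CPU with 2 hardware threads per core can run 8 processes); this gives true parallelism</p> <p>Start 400 <strong>threads</strong> --> this gives concurrency: one CPU keeps switching between tasks</p>
<h4>3.3. Synchronous and asynchronous</h4> <p>Synchronous: when a process issues a request that takes some time to return, the process waits until the response arrives and only then continues.</p> <p>Asynchronous: the opposite. The process does not wait; it carries on with the following operations regardless of the state of other processes.<br> When a response arrives, the system notifies the process so that it can handle it, which can improve efficiency.</p>
<h4>3.4. Serial, parallel and concurrent execution</h4> <ul> <li> <p>The role of the CPU:</p> <p>Whether tasks run in parallel or concurrently, they appear to the user to run at the same time. Processes and threads are just tasks; the CPU does the real work, and a single-core CPU can execute only one task at any instant.</p> </li> <li> <p>Serial:</p> <p>Multiple tasks run one after another; the next task can start only after the previous one has finished.</p> </li> <li> <p>Parallel:</p> <p>Multiple tasks run at the same time. This requires multiple CPUs: the number of CPUs is the number of tasks that can execute at the same instant.</p> </li> <li> <p>Concurrent:</p> <p>Pseudo-parallelism: tasks look as if they run at the same time, but a single CPU is actually switching back and forth between them.</p> </li> </ul> <h4>3.5. 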
Thread example</h4> <pre><code class="prism language-python"><span class="token keyword">import</span> random
<span class="token keyword">import</span> time
<span class="token keyword">from</span> threading <span class="token keyword">import</span> Thread<span class="token punctuation">,</span> current_thread <span class="token keyword">as</span> ct<span class="token punctuation">,</span> Lock
<span class="token keyword">from</span> queue <span class="token keyword">import</span> Queue

<span class="token keyword">class</span> <span class="token class-name">DownloadThread</span><span class="token punctuation">(</span>Thread<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">def</span> <span class="token function">__init__</span><span class="token punctuation">(</span>self<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token builtin">super</span><span class="token punctuation">(</span>DownloadThread<span class="token punctuation">,</span>self<span class="token punctuation">)</span><span class="token punctuation">.</span>__init__<span class="token punctuation">(</span><span class="token punctuation">)</span>

    <span class="token keyword">def</span> <span class="token function">run</span><span class="token punctuation">(</span>self<span class="token punctuation">)</span> <span class="token operator">-</span><span class="token operator">></span> <span class="token boolean">None</span><span class="token punctuation">:</span>
        <span class="token keyword">global</span> <span class="token builtin">sum</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span>ct<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>name<span class="token punctuation">,</span> <span class="token string">'running...'</span><span class="token punctuation">)</span>
        time<span class="token 
punctuation">.</span>sleep<span class="token punctuation">(</span>random<span class="token punctuation">.</span>uniform<span class="token punctuation">(</span><span class="token number">0.5</span><span class="token punctuation">,</span> <span class="token number">3.5</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
        n <span class="token operator">=</span> random<span class="token punctuation">.</span>randint<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">,</span> <span class="token number">100</span><span class="token punctuation">)</span>

        <span class="token comment"># Using lock as a context manager:</span>
        <span class="token comment"># entering the block acquires it: lock.acquire()</span>
        <span class="token comment"># leaving the block releases it: lock.release()</span>
        <span class="token keyword">with</span> lock<span class="token punctuation">:</span>
            <span class="token keyword">print</span><span class="token punctuation">(</span>ct<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>name<span class="token punctuation">,</span> <span class="token string">'generated'</span><span class="token punctuation">,</span> n<span class="token punctuation">,</span> <span class="token string">'current sum->'</span><span class="token punctuation">,</span> <span class="token builtin">sum</span><span class="token punctuation">)</span>
            <span class="token builtin">sum</span> <span class="token operator">+=</span> n
            time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">0.2</span><span class="token punctuation">)</span>
            <span class="token keyword">print</span><span class="token punctuation">(</span>ct<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>name<span class="token punctuation">,</span> <span class="token string">'current sum->'</span><span class="token 
punctuation">,</span> <span class="token builtin">sum</span><span class="token punctuation">)</span>

<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    <span class="token builtin">sum</span> <span class="token operator">=</span> <span class="token number">100</span>
    <span class="token comment"># the lock can be shared by multiple threads</span>
    lock <span class="token operator">=</span> Lock<span class="token punctuation">(</span><span class="token punctuation">)</span>

    <span class="token comment"># create 10 threads</span>
    ts <span class="token operator">=</span> <span class="token punctuation">[</span> DownloadThread<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">10</span><span class="token punctuation">)</span> <span class="token punctuation">]</span>

    <span class="token comment"># start the threads</span>
    <span class="token keyword">for</span> t <span class="token keyword">in</span> ts<span class="token punctuation">:</span>
        t<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span>

    <span class="token comment"># wait for all threads to finish</span>
    <span class="token keyword">for</span> t <span class="token keyword">in</span> ts<span class="token punctuation">:</span>
        t<span class="token punctuation">.</span>join<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment"># blocking call</span>
</code></pre> <h5>3.5.1. Locks</h5> <pre><code>- Create a lock
    lock = threading.Lock()
    lock = threading.RLock()
- Acquire the lock
    lock.acquire()
- Release the lock
    lock.release()
</code></pre> <h5>3.5.2. 
Thread-local variables</h5> <p><strong>Overview:</strong></p> <p>A ThreadLocal variable is itself a global variable, yet each thread can use it to store private data of its own that is invisible to other threads.</p> <pre><code>I. Understanding ThreadLocal
  ThreadLocal is called a thread-local variable by some and thread-local storage by others; the two mean the same thing.
  ThreadLocal keeps a separate copy of the data for each thread, and each thread accesses only its own copy.
II. Why ThreadLocal exists
  In a multithreaded program every thread can use the global variables of its process. If one thread modifies a global variable, every other thread that computes with it is affected, which can corrupt the data (dirty data). To keep threads from modifying a variable at the same time, thread synchronization mechanisms were introduced: access to global variables is controlled with mutexes, condition variables or read-write locks.
  Global variables alone do not meet the needs of multithreaded code; threads often need private data that other threads cannot see. A thread can therefore use local variables, which only the thread itself can access and other threads in the same process cannot.
  Local variables are sometimes inconvenient, so Python also provides ThreadLocal variables: a global object in which each thread can store its own private data, invisible to other threads.
  ThreadLocal isolates data between threads by making each thread's data private.
</code></pre> <pre><code class="prism language-python"><span class="token keyword">import</span> time
<span class="token keyword">from</span> threading <span class="token keyword">import</span> Thread<span class="token punctuation">,</span> current_thread <span class="token keyword">as</span> ct<span class="token punctuation">,</span> local

<span class="token keyword">class</span> <span class="token class-name">Download</span><span class="token punctuation">(</span>Thread<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">def</span> <span class="token function">__init__</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> url<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token builtin">super</span><span class="token punctuation">(</span>Download<span class="token punctuation">,</span> self<span class="token punctuation">)</span><span class="token punctuation">.</span>__init__<span class="token punctuation">(</span><span class="token punctuation">)</span>
        self<span class="token punctuation">.</span>url <span class="token operator">=</span> url

    <span class="token keyword">def</span> <span class="token function">run</span><span class="token punctuation">(</span>self<span 
class="token punctuation">)</span> <span class="token operator">-</span><span class="token operator">></span> <span class="token boolean">None</span><span class="token punctuation">:</span>
        <span class="token comment"># initialise the request headers</span>
        <span class="token comment"># set a proxy, cookies, etc.</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span>ct<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>name<span class="token punctuation">,</span> <span class="token string">'---set user-agent---'</span><span class="token punctuation">,</span> self<span class="token punctuation">.</span>url<span class="token punctuation">)</span>
        <span class="token comment"># id(current_thread())</span>
        <span class="token comment"># headers is a thread-local variable: when an attribute is added, the current thread's ID is the key and the attribute is the value</span>
        <span class="token comment"># conceptually, the data behind headers looks like {id(thread): {attribute name: attribute value}}</span>
        headers<span class="token punctuation">.</span>user_agent <span class="token operator">=</span> <span class="token string">'%s Firefox 66.72'</span> <span class="token operator">%</span> self<span class="token punctuation">.</span>url
        time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span>ct<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>name<span class="token punctuation">,</span> headers<span class="token punctuation">.</span>user_agent<span class="token punctuation">)</span>
        time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span>

<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span 
class="token punctuation">:</span>
    headers <span class="token operator">=</span> local<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment"># thread-local variable</span>
    <span class="token comment"># Python provides the threading.local class; instantiating it gives a global object,</span>
    <span class="token comment"># but the data one thread stores on it is invisible to other threads (in essence, each thread gets its own separate dictionary behind the object).</span>
    urls <span class="token operator">=</span> <span class="token punctuation">(</span><span class="token string">'http://www.baidu.com'</span><span class="token punctuation">,</span><span class="token string">'http://hao123.com'</span><span class="token punctuation">,</span><span class="token string">'http://jd.com'</span><span class="token punctuation">)</span>
    ts <span class="token operator">=</span> <span class="token punctuation">[</span> Download<span class="token punctuation">(</span>url<span class="token punctuation">)</span> <span class="token keyword">for</span> url <span class="token keyword">in</span> urls <span class="token punctuation">]</span>
    <span class="token keyword">for</span> t <span class="token keyword">in</span> ts<span class="token punctuation">:</span>t<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token keyword">for</span> t <span class="token keyword">in</span> ts<span class="token punctuation">:</span>t<span class="token punctuation">.</span>join<span class="token punctuation">(</span><span class="token punctuation">)</span>

    <span class="token comment"># Output: each thread reads its own headers, so the data is isolated between threads</span>
    <span class="token comment"># Thread-1 ---set user-agent--- http://www.baidu.com</span>
    <span class="token comment"># Thread-2 ---set user-agent--- http://hao123.com</span>
    <span class="token comment"># Thread-3 ---set user-agent--- http://jd.com</span>
    <span class="token comment"># Thread-1 http://www.baidu.com Firefox 66.72</span>
    <span class="token comment"># 
Thread-2 http://hao123.com Firefox 66.72</span>
    <span class="token comment"># Thread-3 http://jd.com Firefox 66.72</span>
</code></pre> <h5>3.5.3. Condition variables</h5> <p>Condition variable (Condition)</p> <p>Purpose:</p> <pre><code>Keeps shared data safe across threads: while the data does not satisfy a condition a thread can suspend itself, and once it does, waiting threads are woken up.
Uses a thread lock internally.
</code></pre> <p>How it works:</p> <pre><code>A mutex protects a shared resource during concurrent access and prevents dirty data. A Python Condition is also associated with a mutex, and additionally provides the wait/notify/notifyAll methods to block the current thread or to tell other threads that the shared resource can now be accessed. In other words, Condition is a mechanism for communication between threads: if thread 1 needs data it blocks and waits; thread 2 produces the data and then notifies thread 1, which goes on to fetch it.
</code></pre> <ul> <li>Usage:</li> </ul> <pre><code>A condition variable is a synchronization mechanism built on state shared between threads
</code></pre> <ul> <li>Two actions:</li> </ul> <pre><code>1. One thread suspends itself, waiting for the condition to become true
2. Another thread makes the condition true
</code></pre> <ul> <li>Combined with a mutex:</li> </ul> <pre><code>- The mutex prevents races and deadlock
- A thread must lock the mutex (Lock) before changing the condition state
- A waiting thread is placed on the list of threads waiting for the condition
- The mutex is then unlocked
</code></pre> <ul> <li>Common functions:</li> </ul> <pre><code class="prism language-python"><span class="token comment"># create a condition variable</span>
cond <span class="token operator">=</span> threading<span class="token punctuation">.</span>Condition<span class="token punctuation">(</span>threading<span class="token punctuation">.</span>Lock<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
</code></pre> <pre><code>acquire(*args) acquires the underlying lock; every other method of the condition variable must be called between acquire() and release()
release() releases the underlying lock
wait([timeout]) waits to be woken up; timeout is the maximum time to wait
notify(n=1) wakes up at most n waiting threads
notifyAll(), notify_all() wake up all waiting threads
</code></pre> <p>Example:</p> <pre><code class="prism language-python"><span class="token keyword">import</span> time
<span class="token keyword">from</span> threading <span class="token keyword">import</span> Thread<span class="token punctuation">,</span> Condition<span class="token punctuation">,</span> Lock<span class="token punctuation">,</span> current_thread <span class="token keyword">as</span> ct
<span class="token keyword">from</span> queue <span class="token keyword">import</span> Queue

<span class="token keyword">class</span> <span 
class="token class-name">ConQueue</span><span class="token punctuation">(</span>Queue<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">def</span> <span class="token function">__init__</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> maxsize<span class="token operator">=</span><span class="token number">10</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token comment"># initialise the queue</span>
        <span class="token builtin">super</span><span class="token punctuation">(</span>ConQueue<span class="token punctuation">,</span> self<span class="token punctuation">)</span><span class="token punctuation">.</span>__init__<span class="token punctuation">(</span>maxsize<span class="token punctuation">)</span>
        self<span class="token punctuation">.</span>cond <span class="token operator">=</span> Condition<span class="token punctuation">(</span>Lock<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token comment"># condition variable, constructed with a lock</span>

    <span class="token keyword">def</span> <span class="token function">consume</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> <span class="token operator">**</span>kwargs<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token comment"># consumer method</span>
        name <span class="token operator">=</span> ct<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>name <span class="token operator">+</span> <span class="token string">'<%s>'</span> <span class="token operator">%</span> ct<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>ident <span class="token comment"># thread ID</span>
        <span class="token keyword">if</span> self<span 
class="token punctuation">.</span>cond<span class="token punctuation">.</span>acquire<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token comment"># acquire the lock</span>
            <span class="token keyword">while</span> self<span class="token punctuation">.</span>empty<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token comment"># while the warehouse is empty</span>
                <span class="token keyword">print</span><span class="token punctuation">(</span>name<span class="token punctuation">,</span> <span class="token string">'the warehouse is empty'</span><span class="token punctuation">)</span>
                self<span class="token punctuation">.</span>cond<span class="token punctuation">.</span>wait<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment"># wait (suspend this thread)</span>
            item <span class="token operator">=</span> self<span class="token punctuation">.</span>get_nowait<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment"># fetch an item without blocking</span>
            self<span class="token punctuation">.</span>cond<span class="token punctuation">.</span>notify<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment"># wake up one waiting thread</span>
            self<span class="token punctuation">.</span>cond<span class="token punctuation">.</span>release<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment"># release the lock</span>
            <span class="token keyword">return</span> item

    <span class="token keyword">def</span> <span class="token function">product</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> item<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token comment"># producer method</span>
        name <span class="token operator">=</span> ct<span class="token 
punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>name <span class="token operator">+</span> <span class="token string">'<%s>'</span> <span class="token operator">%</span> ct<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>ident <span class="token comment"># thread ID</span>
        <span class="token keyword">if</span> self<span class="token punctuation">.</span>cond<span class="token punctuation">.</span>acquire<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
            <span class="token keyword">while</span> self<span class="token punctuation">.</span>full<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
                <span class="token keyword">print</span><span class="token punctuation">(</span>name<span class="token punctuation">,</span> <span class="token string">'the warehouse is full'</span><span class="token punctuation">)</span>
                self<span class="token punctuation">.</span>cond<span class="token punctuation">.</span>wait<span class="token punctuation">(</span><span class="token punctuation">)</span>
            self<span class="token punctuation">.</span>put_nowait<span class="token punctuation">(</span>item<span class="token punctuation">)</span> <span class="token comment"># store the item without blocking</span>
            self<span class="token punctuation">.</span>cond<span class="token punctuation">.</span>notify<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment"># wake up another (consumer) thread</span>
            self<span class="token punctuation">.</span>cond<span class="token punctuation">.</span>release<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment"># release the lock</span>

<span class="token keyword">class</span> <span class="token class-name">ProducterThread</span><span class="token 
punctuation">(</span>Thread<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># producer thread</span>
    <span class="token keyword">def</span> <span class="token function">__init__</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> con_queue<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token builtin">super</span><span class="token punctuation">(</span>ProducterThread<span class="token punctuation">,</span> self<span class="token punctuation">)</span><span class="token punctuation">.</span>__init__<span class="token punctuation">(</span>name<span class="token operator">=</span><span class="token string">'Producer'</span><span class="token punctuation">)</span>
        self<span class="token punctuation">.</span>queue<span class="token punctuation">:</span>ConQueue <span class="token operator">=</span> con_queue

    <span class="token keyword">def</span> <span class="token function">run</span><span class="token punctuation">(</span>self<span class="token punctuation">)</span> <span class="token operator">-</span><span class="token operator">></span> <span class="token boolean">None</span><span class="token punctuation">:</span>
        name <span class="token operator">=</span> ct<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>name <span class="token operator">+</span> <span class="token string">'<%s>'</span> <span class="token operator">%</span> ct<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>ident <span class="token comment"># thread ID</span>
        <span class="token keyword">global</span> num
        <span class="token keyword">while</span> <span class="token boolean">True</span><span class="token punctuation">:</span>
            <span class="token keyword">with</span> lock<span class="token punctuation">:</span>
                item <span class="token 
operator">=</span> <span class="token string">'bread #%s'</span> <span class="token operator">%</span> num
                self<span class="token punctuation">.</span>queue<span class="token punctuation">.</span>product<span class="token punctuation">(</span>item<span class="token punctuation">)</span>
                <span class="token keyword">print</span><span class="token punctuation">(</span>name<span class="token punctuation">,</span> <span class="token string">'produced'</span><span class="token punctuation">,</span> item<span class="token punctuation">)</span>
                num <span class="token operator">+=</span> <span class="token number">1</span>
            time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span>

<span class="token keyword">class</span> <span class="token class-name">ConsumThead</span><span class="token punctuation">(</span>Thread<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># consumer thread</span>
    <span class="token keyword">def</span> <span class="token function">__init__</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> con_queue<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token builtin">super</span><span class="token punctuation">(</span>ConsumThead<span class="token punctuation">,</span> self<span class="token punctuation">)</span><span class="token punctuation">.</span>__init__<span class="token punctuation">(</span>name<span class="token operator">=</span><span class="token string">'Consumer'</span><span class="token punctuation">)</span>
        self<span class="token punctuation">.</span>queue <span class="token operator">=</span> con_queue

    <span class="token keyword">def</span> <span class="token function">run</span><span class="token punctuation">(</span>self<span class="token punctuation">)</span> <span class="token operator">-</span><span 
class="token operator">></span> <span class="token boolean">None</span><span class="token punctuation">:</span>
        name <span class="token operator">=</span> ct<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>name <span class="token operator">+</span> <span class="token string">'<%s>'</span> <span class="token operator">%</span> ct<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>ident
        <span class="token keyword">while</span> <span class="token boolean">True</span><span class="token punctuation">:</span>
            item <span class="token operator">=</span> self<span class="token punctuation">.</span>queue<span class="token punctuation">.</span>consume<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment"># may block while waiting</span>
            <span class="token keyword">print</span><span class="token punctuation">(</span>name<span class="token punctuation">,</span> <span class="token string">'consumed'</span><span class="token punctuation">,</span> item<span class="token punctuation">)</span>
            time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span>

<span class="token keyword">def</span> <span class="token function">start</span><span class="token punctuation">(</span><span class="token operator">*</span>threads<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">for</span> t <span class="token keyword">in</span> threads<span class="token punctuation">:</span>
        t<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span>

<span class="token keyword">def</span> <span class="token function">join</span><span class="token punctuation">(</span><span class="token operator">*</span>threads<span 
class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">for</span> t <span class="token keyword">in</span> threads<span class="token punctuation">:</span>
        t<span class="token punctuation">.</span>join<span class="token punctuation">(</span><span class="token punctuation">)</span>

<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    queue <span class="token operator">=</span> ConQueue<span class="token punctuation">(</span><span class="token number">20</span><span class="token punctuation">)</span>
    cs <span class="token operator">=</span> <span class="token punctuation">[</span>ConsumThead<span class="token punctuation">(</span>queue<span class="token punctuation">)</span> <span class="token keyword">for</span> _ <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">5</span><span class="token punctuation">)</span><span class="token punctuation">]</span>
    ps <span class="token operator">=</span> <span class="token punctuation">[</span>ProducterThread<span class="token punctuation">(</span>queue<span class="token punctuation">)</span> <span class="token keyword">for</span> _ <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span><span class="token punctuation">]</span>
    num <span class="token operator">=</span> <span class="token number">1</span> <span class="token comment"># serial number of the next loaf</span>
    lock <span class="token operator">=</span> Lock<span class="token punctuation">(</span><span class="token punctuation">)</span>
    start<span class="token punctuation">(</span><span class="token operator">*</span>cs<span class="token punctuation">,</span> <span class="token operator">*</span>ps<span 
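class="token punctuation">)</span> <span class="token comment"># * unpacks a list or tuple</span>
    <span class="token comment"># start(*ps)</span>
    join<span class="token punctuation">(</span><span class="token operator">*</span>cs<span class="token punctuation">,</span> <span class="token operator">*</span>ps<span class="token punctuation">)</span>
    <span class="token comment"># join(*ps)</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'--over--'</span><span class="token punctuation">)</span>
</code></pre> <h3>4. Coroutines</h3> <blockquote> <p>Coroutines are an alternative to threads. The difference is that threads are scheduled by the operating system, whereas coroutines are scheduled by the user (the program) itself.</p> <p>Coroutines need an event-listening model (an event loop), which uses I/O multiplexing to schedule among multiple coroutines.</p> </blockquote> <h4>4.1 Definition and principles</h4> <pre><code>- 1. Coroutines are cooperatively scheduled within a single thread; they are also called "micro-threads". Switching between functions (subroutines) happens inside one thread.
- 2. Ordinary functions (subroutines) are called hierarchically: A calls B, B calls C, and the results are returned in reverse order.
- 3. Function calls are implemented with a stack, which holds each function's local variables.
- 4. A function or subroutine call has one entry point and returns once; the call order is fixed.
- 5. A coroutine, by contrast, can be interrupted partway through a function, switch to another subroutine, and resume later where it left off.
</code></pre> <h4>4.2 Event models for coroutines</h4> <p>Event models for coroutines (asynchronous I/O models):</p> <pre><code>- select: polling
- poll: event callbacks
- kqueue/epoll: enhanced event callbacks
</code></pre> <h4>4.3. Three ways to write coroutines</h4> <ul> <li>Based on <strong>generators (a transitional approach)</strong> <ul> <li>yield</li> <li>send()</li> </ul> </li> <li>Python 3.4 introduced the <strong>asyncio module</strong> <ul> <li>@asyncio.coroutine is the coroutine decorator; applying it to a function turns the function into a coroutine function (the decorator was removed in Python 3.11).</li> <li>Inside a coroutine function, yield from suspends the current coroutine and hands control to the coroutine object that follows yield from.</li> <li>asyncio.get_event_loop() returns the event loop object, which runs until all coroutine objects have completed.</li> </ul> </li> <li>Since Python 3.5, <strong>two keywords</strong> are available <ul> <li><strong>async</strong> replaces @asyncio.coroutine</li> <li><strong>await</strong> replaces yield from</li> </ul> </li> </ul> <p>How coroutine objects are run:</p> <pre><code>- loop = asyncio.get_event_loop()
- loop.run_until_complete(coroutine object)
    - calling a user-defined coroutine function (a function decorated with @asyncio.coroutine) returns a coroutine object
    - several user-defined coroutine objects can be wrapped with the asyncio.wait() coroutine function
</code></pre> <h5>4.3.1. 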
Generator-based</h5>
<pre><code class="prism language-python"><span class="token comment"># Coroutine implemented with a generator</span>
<span class="token comment"># yield/send</span>
<span class="token comment"># Fibonacci sequence: 1, 1, 2, 3, 5, 8, ...</span>
<span class="token keyword">import</span> random
<span class="token keyword">import</span> time


<span class="token keyword">def</span> <span class="token function">fib</span><span class="token punctuation">(</span>n<span class="token punctuation">)</span><span class="token punctuation">:</span>
    a<span class="token punctuation">,</span> b <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">1</span>
    index <span class="token operator">=</span> <span class="token number">0</span>
    <span class="token keyword">while</span> index <span class="token operator">&lt;</span> n<span class="token punctuation">:</span>
        wait <span class="token operator">=</span> <span class="token keyword">yield</span> b  <span class="token comment"># hand b to the caller, then pause until the caller sends in the delay "wait"</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span>wait<span class="token punctuation">,</span> <span class="token string">' seconds until the next value is produced'</span><span class="token punctuation">)</span>
        time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span>wait<span class="token punctuation">)</span>
        a<span class="token punctuation">,</span> b <span class="token operator">=</span> b<span class="token punctuation">,</span> a<span class="token operator">+</span>b
        index <span class="token operator">+=</span> <span class="token number">1</span>


<span class="token keyword">def</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    f <span class="token operator">=</span> fib<span class="token punctuation">(</span><span class="token number">10</span><span class="token
punctuation">)</span>
    n <span class="token operator">=</span> <span class="token builtin">next</span><span class="token punctuation">(</span>f<span class="token punctuation">)</span>  <span class="token comment"># start the generator and get the first number from fib</span>
    <span class="token keyword">while</span> <span class="token boolean">True</span><span class="token punctuation">:</span>
        <span class="token keyword">try</span><span class="token punctuation">:</span>
            <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'--->'</span><span class="token punctuation">,</span> n<span class="token punctuation">)</span>
            wait_second <span class="token operator">=</span> random<span class="token punctuation">.</span>uniform<span class="token punctuation">(</span><span class="token number">0.1</span><span class="token punctuation">,</span> <span class="token number">2.0</span><span class="token punctuation">)</span>
            n <span class="token operator">=</span> f<span class="token punctuation">.</span>send<span class="token punctuation">(</span>wait_second<span class="token punctuation">)</span>  <span class="token comment"># send the delay into the fib generator and wait for it to yield the next number</span>
        <span class="token keyword">except</span> StopIteration<span class="token punctuation">:</span>  <span class="token comment"># StopIteration: the generator is exhausted</span>
            <span class="token keyword">break</span>


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    <span class="token comment"># for n in fib(10):</span>
    <span class="token comment">#     print(n, end=' ')</span>
    main<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre>
<h5>4.3.2. 
Using the asyncio module</h5>
<pre><code class="prism language-python"><span class="token comment"># Coroutines written with @asyncio.coroutine and yield from</span>
<span class="token comment"># Note: asyncio.coroutine was removed in Python 3.11, so this example needs Python 3.10 or earlier</span>
<span class="token keyword">import</span> random
<span class="token keyword">import</span> time
<span class="token keyword">import</span> asyncio
<span class="token keyword">from</span> asyncio <span class="token keyword">import</span> coroutine
<span class="token keyword">from</span> utils<span class="token punctuation">.</span>ua_ <span class="token keyword">import</span> <span class="token operator">*</span>  <span class="token comment"># project-local helper module; get_ua() is expected to come from here</span>
<span class="token keyword">import</span> requests


@coroutine
<span class="token keyword">def</span> <span class="token function">get</span><span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'--sending GET request-->'</span><span class="token punctuation">,</span> url<span class="token punctuation">)</span>
    <span class="token comment"># requests.get is a blocking call: the event loop stalls until it returns</span>
    resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">,</span> headers<span class="token operator">=</span><span class="token punctuation">{</span>
        <span class="token string">'User-Agent'</span><span class="token punctuation">:</span> get_ua<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token punctuation">}</span><span class="token punctuation">)</span>
    <span class="token keyword">if</span> resp<span class="token punctuation">.</span>status_code <span class="token operator">==</span> <span class="token number">200</span><span class="token punctuation">:</span>
        resp<span class="token punctuation">.</span>encoding <span class="token operator">=</span> <span class="token string">'utf-8'</span>
        items <span class="token operator">=</span> <span class="token keyword">yield</span> <span class="token
keyword">from</span> parse<span class="token punctuation">(</span>url<span class="token punctuation">,</span> resp<span class="token punctuation">.</span>text<span class="token punctuation">)</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span>url<span class="token punctuation">,</span> <span class="token string">'===parsing finished===>'</span><span class="token punctuation">,</span> items<span class="token punctuation">)</span>


<span class="token keyword">def</span> <span class="token function">download</span><span class="token punctuation">(</span><span class="token operator">*</span>urls<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># Entry point for the download tasks</span>
    <span class="token comment"># Get the event loop that schedules the coroutines</span>
    loop <span class="token operator">=</span> asyncio<span class="token punctuation">.</span>get_event_loop<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token comment"># A single coroutine can be run with loop.run_until_complete(coroutine object)</span>
    <span class="token comment"># Build a batch of coroutine tasks and hand them to the event loop,</span>
    <span class="token comment"># then run the loop until all of them have finished</span>
    loop<span class="token punctuation">.</span>run_until_complete<span class="token punctuation">(</span>
        asyncio<span class="token punctuation">.</span>wait<span class="token punctuation">(</span><span class="token punctuation">[</span>
            get<span class="token punctuation">(</span>url<span class="token punctuation">)</span> <span class="token keyword">for</span> url <span class="token keyword">in</span> urls
        <span class="token punctuation">]</span><span class="token punctuation">)</span>
    <span class="token punctuation">)</span>


@coroutine
<span class="token keyword">def</span> <span class="token function">parse</span><span class="token punctuation">(</span>url<span class="token punctuation">,</span> html<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span
class="token keyword">print</span><span class="token punctuation">(</span>url<span class="token punctuation">,</span> <span class="token string">'parsing'</span><span class="token punctuation">)</span>
    <span class="token comment"># time.sleep() would suspend the whole thread and block the event loop</span>
    <span class="token keyword">yield</span> <span class="token keyword">from</span> asyncio<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span>random<span class="token punctuation">.</span>uniform<span class="token punctuation">(</span><span class="token number">0.1</span><span class="token punctuation">,</span> <span class="token number">3.0</span><span class="token punctuation">)</span><span class="token punctuation">)</span>  <span class="token comment"># suspends only this coroutine; other coroutines keep running</span>
    <span class="token keyword">return</span> <span class="token punctuation">{</span><span class="token string">'url'</span><span class="token punctuation">:</span> url<span class="token punctuation">,</span> <span class="token string">'data'</span><span class="token punctuation">:</span> time<span class="token punctuation">.</span>localtime<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">}</span>


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    <span class="token comment"># download('http://www.baidu.com','http://hao123.com','http://jd.com')</span>
    <span class="token comment"># Running a single coroutine</span>
    coroutine_1 <span class="token operator">=</span> get<span class="token punctuation">(</span><span class="token string">'http://www.baidu.com'</span><span class="token punctuation">)</span>
    asyncio<span class="token punctuation">.</span>get_event_loop<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>run_until_complete<span class="token
punctuation">(</span>coroutine_1<span class="token punctuation">)</span>
</code></pre>
<h5>4.3.3. async and await</h5>
<pre><code class="prism language-python"><span class="token comment"># Coroutines written with async def and await</span>
<span class="token keyword">import</span> random
<span class="token keyword">import</span> time
<span class="token keyword">import</span> asyncio
<span class="token keyword">from</span> utils<span class="token punctuation">.</span>ua_ <span class="token keyword">import</span> <span class="token operator">*</span>  <span class="token comment"># project-local helper module; get_ua() is expected to come from here</span>
<span class="token keyword">import</span> requests


<span class="token keyword">async</span> <span class="token keyword">def</span> <span class="token function">get</span><span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'--sending GET request-->'</span><span class="token punctuation">,</span> url<span class="token punctuation">)</span>
    <span class="token comment"># requests.get is a blocking call: the event loop stalls until it returns</span>
    resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">,</span> headers<span class="token operator">=</span><span class="token punctuation">{</span>
        <span class="token string">'User-Agent'</span><span class="token punctuation">:</span> get_ua<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token punctuation">}</span><span class="token punctuation">)</span>
    <span class="token keyword">if</span> resp<span class="token punctuation">.</span>status_code <span class="token operator">==</span> <span class="token number">200</span><span class="token punctuation">:</span>
        resp<span class="token punctuation">.</span>encoding <span class="token operator">=</span> <span class="token
string">'utf-8'</span>
        items <span class="token operator">=</span> <span class="token keyword">await</span> parse<span class="token punctuation">(</span>url<span class="token punctuation">,</span> resp<span class="token punctuation">.</span>text<span class="token punctuation">)</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span>url<span class="token punctuation">,</span> <span class="token string">'===parsing finished===>'</span><span class="token punctuation">,</span> items<span class="token punctuation">)</span>


<span class="token keyword">def</span> <span class="token function">download</span><span class="token punctuation">(</span><span class="token operator">*</span>urls<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># Entry point for the download tasks</span>
    <span class="token comment"># Get the event loop that schedules the coroutines</span>
    loop <span class="token operator">=</span> asyncio<span class="token punctuation">.</span>get_event_loop<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token comment"># A single coroutine can be run with loop.run_until_complete(coroutine object)</span>
    <span class="token comment"># Wrap each coroutine in a Task: asyncio.wait() rejects bare coroutines since Python 3.11</span>
    loop<span class="token punctuation">.</span>run_until_complete<span class="token punctuation">(</span>
        asyncio<span class="token punctuation">.</span>wait<span class="token punctuation">(</span><span class="token punctuation">[</span>
            loop<span class="token punctuation">.</span>create_task<span class="token punctuation">(</span>get<span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token keyword">for</span> url <span class="token keyword">in</span> urls
        <span class="token punctuation">]</span><span class="token punctuation">)</span>
    <span class="token punctuation">)</span>


<span class="token keyword">async</span> <span class="token keyword">def</span> <span class="token function">parse</span><span class="token punctuation">(</span>url<span class="token punctuation">,</span> html<span class="token
punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span>url<span class="token punctuation">,</span> <span class="token string">'parsing'</span><span class="token punctuation">)</span>
    <span class="token comment"># time.sleep() would suspend the whole thread and block the event loop</span>
    <span class="token keyword">await</span> asyncio<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span>random<span class="token punctuation">.</span>uniform<span class="token punctuation">(</span><span class="token number">0.1</span><span class="token punctuation">,</span> <span class="token number">3.0</span><span class="token punctuation">)</span><span class="token punctuation">)</span>  <span class="token comment"># suspends only this coroutine; other coroutines keep running</span>
    <span class="token keyword">return</span> <span class="token punctuation">{</span><span class="token string">'url'</span><span class="token punctuation">:</span> url<span class="token punctuation">,</span> <span class="token string">'data'</span><span class="token punctuation">:</span> time<span class="token punctuation">.</span>localtime<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">}</span>


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    <span class="token comment"># download('http://www.baidu.com','http://hao123.com','http://jd.com')</span>
    <span class="token comment"># Running a single coroutine</span>
    coroutine_1 <span class="token operator">=</span> get<span class="token punctuation">(</span><span class="token string">'http://www.baidu.com'</span><span class="token punctuation">)</span>
    asyncio<span class="token punctuation">.</span>get_event_loop<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>run_until_complete<span class="token
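punctuation">(</span>coroutine_1<span class="token punctuation">)</span>
</code></pre>
<h2>10. Selenium</h2>
<blockquote> <p>Selenium drives a browser (Chrome, Firefox, IE) to perform browser operations such as opening a URL, clicking buttons or links on a page, and typing text.</p> </blockquote>
<h3>10.1. What is the selenium module</h3>
<ul> <li> <p>A module for browser automation.</p> </li> <li> <p>It drives real browsers through various drivers (FirefoxDriver, InternetExplorerDriver, OperaDriver, ChromeDriver) to run tests.</p> <p>Selenium also supports headless browsers such as HtmlUnit and PhantomJS.</p> </li> </ul>
<h3>10.2. Why use Selenium</h3>
<ul> <li> <p>It simulates a browser and automatically runs the JavaScript on the page, so dynamically loaded content becomes available.</p> </li> <li> <p>Page rendering<br> When the browser requests a page from the server, it runs the page's JavaScript, which turns data into DOM elements (HTML tags).</p> </li> <li> <p>Automated UI testing</p> <ul> <li>Locate an input DOM node</li> <li>Click a DOM node (a button or an a tag)</li> </ul> </li> </ul>
<h3>10.3 Using Selenium</h3>
<p>Install the package:</p>
<pre><code>pip install selenium
</code></pre>
<p>Download a browser driver (Chrome in this case):</p>
<pre><code>- Download location: http://chromedriver.storage.googleapis.com/index.html
- Mapping between driver versions and browser versions: http://blog.csdn.net/huilan_same/article/details/51896672
</code></pre>
<p>Import the module:</p>
<pre><code>from selenium import webdriver
</code></pre>
<p>Instantiate a browser object</p>
<ul> <li> <p>Write the browser automation code</p> <ul> <li> <p>Send a request: get(url)</p> </li> <li> <p>Locate tags: the find family of methods</p> </li> <li> <p>Interact with a tag: send_keys('xxx')</p> </li> <li> <p>Run JavaScript: execute_script('jsCode')</p> </li> <li> <p>Go back and forward: back(), forward()</p> </li> <li> <p>Close the browser: quit()</p> </li> </ul> </li> </ul>
<p>[Summary] Locating elements:</p>
<pre><code>1. find_element_by_id
   - returns a single WebElement
2. find_elements_by_name
   - finds multiple DOM elements by the tag's name attribute
   - form field tags have a name attribute
   - iframe tags have a name attribute
3. find_elements_by_xpath
   - finds DOM elements with an XPath expression
   - the expression is evaluated by the browser, so lxml is not required
4. find_elements_by_tag_name
   - finds multiple DOM elements by tag name
5. find_elements_by_class_name
6. find_elements_by_css_selector
   - #id
   - .class_name
   - div
   - div p>a
7. find_elements_by_link_text
   - finds a tags by their link text

Note: these find_element_by_* helpers are the Selenium 3 API. Selenium 4.3 removed them;
use find_element(By.ID, ...) and find_elements(By.XPATH, ...) there instead.
</code></pre>
<p>Example:</p>
<p>Goal: open Taobao and search for iPhone, then open Baidu -> go back -> go forward</p>
<pre><code class="prism language-python"><span class="token comment"># Goal: open Taobao and search for iPhone, then open Baidu --> go back --> go forward</span>
<span class="token keyword">from</span> selenium <span class="token keyword">import</span> webdriver
<span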
class="token keyword">from</span> time <span class="token keyword">import</span> sleep

<span class="token comment"># Instantiate the browser (Chrome)</span>
path <span class="token operator">=</span> r<span class="token string">'D:\01-soft\12-spider_chromedriver\chromedriver.exe'</span>
bro <span class="token operator">=</span> webdriver<span class="token punctuation">.</span>Chrome<span class="token punctuation">(</span>executable_path<span class="token operator">=</span>path<span class="token punctuation">)</span>
<span class="token comment"># Send the request</span>
bro<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'https://www.taobao.com/'</span><span class="token punctuation">)</span>
<span class="token comment"># Locate the tag</span>
search_input <span class="token operator">=</span> bro<span class="token punctuation">.</span>find_element_by_id<span class="token punctuation">(</span><span class="token string">'q'</span><span class="token punctuation">)</span>
<span class="token comment"># Interact with the tag</span>
search_input<span class="token punctuation">.</span>send_keys<span class="token punctuation">(</span><span class="token string">'Iphone'</span><span class="token punctuation">)</span>

<span class="token comment"># Run a piece of JavaScript</span>
<span class="token comment"># scrollTo(x, y) scrolls the page: x is the horizontal position, y is the vertical position</span>
<span class="token comment"># document.body.scrollHeight is the full height of the page, so this call scrolls to the bottom</span>
bro<span class="token punctuation">.</span>execute_script<span class="token punctuation">(</span><span class="token string">'window.scrollTo(0,document.body.scrollHeight)'</span><span class="token punctuation">)</span>
sleep<span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span>
<span class="token comment"># Click the search button</span>
btn <span class="token operator">=</span> bro<span class="token punctuation">.</span>find_element_by_css_selector<span class="token punctuation">(</span><span
class="token string">'.btn-search'</span><span class="token punctuation">)</span>
btn<span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>

bro<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'https://www.baidu.com'</span><span class="token punctuation">)</span>
sleep<span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span>
<span class="token comment"># Go back</span>
bro<span class="token punctuation">.</span>back<span class="token punctuation">(</span><span class="token punctuation">)</span>
sleep<span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span>
<span class="token comment"># Go forward</span>
bro<span class="token punctuation">.</span>forward<span class="token punctuation">(</span><span class="token punctuation">)</span>
sleep<span class="token punctuation">(</span><span class="token number">5</span><span class="token punctuation">)</span>
<span class="token comment"># Quit (close the browser)</span>
bro<span class="token punctuation">.</span>quit<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre>
<pre><code class="prism language-python"><span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver <span class="token keyword">import</span> Chrome<span class="token punctuation">,</span> Firefox

path <span class="token operator">=</span> r<span class="token string">'D:\01-soft\12-spider_chromedriver\chromedriver.exe'</span>
chrome <span class="token operator">=</span> Chrome<span class="token punctuation">(</span>path<span class="token punctuation">)</span>
<span class="token comment"># chrome = Firefox(executable_path=r'D:\01-soft\12-spider_chromedriver\geckodriver.exe')</span>
<span class="token comment"># Open Bing</span>
chrome<span class="token
punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'http://cn.bing.com'</span><span class="token punctuation">)</span>
<span class="token comment"># Take a screenshot</span>
chrome<span class="token punctuation">.</span>save_screenshot<span class="token punctuation">(</span><span class="token string">'bing.png'</span><span class="token punctuation">)</span>
chrome<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span>  <span class="token comment"># close the current tab; if it is the only tab, the browser exits too</span>
chrome<span class="token punctuation">.</span>quit<span class="token punctuation">(</span><span class="token punctuation">)</span>  <span class="token comment"># quit (close the browser)</span>
</code></pre>
<h3>10.4. Handling iframes with Selenium</h3>
<pre><code>- Handling iframes with Selenium
- If the tag to locate is inside an iframe, switch into it first with switch_to.frame(id)
- Action chains (dragging): from selenium.webdriver import ActionChains
- Create an action chain object: action = ActionChains(bro)
- click_and_hold(div): press and hold the mouse button on the element
- move_by_offset(x, y): move the mouse by an offset from its current position
- perform(): executes the queued actions immediately
- action.release(): releases the held mouse button; like other actions it is queued until perform() is called
</code></pre>
<p>Example 1:</p>
<pre><code class="prism language-python"><span class="token keyword">from</span> selenium <span class="token keyword">import</span> webdriver
<span class="token keyword">from</span> time <span class="token keyword">import</span> sleep
<span class="token comment"># Import the action chain class</span>
<span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver <span class="token keyword">import</span> ActionChains

path <span class="token operator">=</span> r<span class="token string">'D:\01-soft\12-spider_chromedriver\chromedriver.exe'</span>
bro <span class="token operator">=</span> webdriver<span class="token punctuation">.</span>Chrome<span class="token punctuation">(</span>executable_path<span class="token operator">=</span>path<span class="token punctuation">)</span>

bro<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span
class="token string">'https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'</span><span class="token punctuation">)</span>

<span class="token comment"># If the tag to locate is inside an iframe, switch into the iframe before locating it</span>
bro<span class="token punctuation">.</span>switch_to<span class="token punctuation">.</span>frame<span class="token punctuation">(</span><span class="token string">'iframeResult'</span><span class="token punctuation">)</span>  <span class="token comment"># switch the scope used for locating tags</span>
div <span class="token operator">=</span> bro<span class="token punctuation">.</span>find_element_by_id<span class="token punctuation">(</span><span class="token string">'draggable'</span><span class="token punctuation">)</span>

<span class="token comment"># Action chain</span>
action <span class="token operator">=</span> ActionChains<span class="token punctuation">(</span>bro<span class="token punctuation">)</span>
<span class="token comment"># Press and hold the mouse button on the given tag</span>
action<span class="token punctuation">.</span>click_and_hold<span class="token punctuation">(</span>div<span class="token punctuation">)</span>

<span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">5</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># perform() executes the queued actions immediately</span>
    <span class="token comment"># move_by_offset(x, y): x is the horizontal offset, y is the vertical offset</span>
    action<span class="token punctuation">.</span>move_by_offset<span class="token punctuation">(</span><span class="token number">17</span><span class="token punctuation">,</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">.</span>perform<span class="token punctuation">(</span><span class="token punctuation">)</span>
    sleep<span class="token punctuation">(</span><span class="token number">0.5</span><span class="token punctuation">)</span>

<span
class="token comment"># Release the mouse button; release() only takes effect once perform() is called</span>
action<span class="token punctuation">.</span>release<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>perform<span class="token punctuation">(</span><span class="token punctuation">)</span>

bro<span class="token punctuation">.</span>quit<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre>
<p>Example 2:</p>
<p>Goal: simulate logging in to Qzone</p>
<pre><code class="prism language-python"><span class="token keyword">from</span> selenium <span class="token keyword">import</span> webdriver
<span class="token keyword">from</span> time <span class="token keyword">import</span> sleep

path <span class="token operator">=</span> r<span class="token string">'D:\01-soft\12-spider_chromedriver\chromedriver.exe'</span>
bro <span class="token operator">=</span> webdriver<span class="token punctuation">.</span>Chrome<span class="token punctuation">(</span>executable_path<span class="token operator">=</span>path<span class="token punctuation">)</span>

bro<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'https://qzone.qq.com/'</span><span class="token punctuation">)</span>
bro<span class="token punctuation">.</span>switch_to<span class="token punctuation">.</span>frame<span class="token punctuation">(</span><span class="token string">'login_frame'</span><span class="token punctuation">)</span>
a_tag <span class="token operator">=</span> bro<span class="token punctuation">.</span>find_element_by_id<span class="token punctuation">(</span><span class="token string">"switcher_plogin"</span><span class="token punctuation">)</span>
a_tag<span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>

userName_tag <span class="token operator">=</span> bro<span class="token punctuation">.</span>find_element_by_id<span class="token punctuation">(</span><span class="token string">'u'</span><span class="token punctuation">)</span>
password_tag <span class="token
operator">=</span> bro<span class="token punctuation">.</span>find_element_by_id<span class="token punctuation">(</span><span class="token string">'p'</span><span class="token punctuation">)</span>
sleep<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span>
<span class="token comment"># Placeholder credentials; substitute your own account and password</span>
userName_tag<span class="token punctuation">.</span>send_keys<span class="token punctuation">(</span><span class="token string">'your_qq_number'</span><span class="token punctuation">)</span>
sleep<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span>
password_tag<span class="token punctuation">.</span>send_keys<span class="token punctuation">(</span><span class="token string">'your_password'</span><span class="token punctuation">)</span>
sleep<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span>
btn <span class="token operator">=</span> bro<span class="token punctuation">.</span>find_element_by_id<span class="token punctuation">(</span><span class="token string">'login_button'</span><span class="token punctuation">)</span>
btn<span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>

sleep<span class="token punctuation">(</span><span class="token number">3</span><span class="token punctuation">)</span>

bro<span class="token punctuation">.</span>quit<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre>
<h3>10.5. 
Interaction</h3>
<pre><code>1. Click: click()
2. Type: send_keys()
3. Simulate scrolling with JS
   var q = window.document.documentElement.scrollTop=10000
   execute_script() runs JS code
   Example site: Toutiao
</code></pre>
<pre><code class="prism language-python"><span class="token comment"># ......</span>
<span class="token comment"># Simulate scrolling</span>
<span class="token comment"># document.documentElement is the root element of the current page, i.e. the html element</span>
<span class="token comment"># Print the window position and size</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>chrome<span class="token punctuation">.</span>get_window_rect<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> chrome<span class="token punctuation">.</span>get_window_size<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
<span class="token keyword">for</span> n <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">50</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    script <span class="token operator">=</span> <span class="token string">'var q=window.document.documentElement.scrollTop=%s'</span> <span class="token operator">%</span> <span class="token punctuation">(</span><span class="token punctuation">(</span>n<span class="token operator">+</span><span class="token number">1</span><span class="token punctuation">)</span><span class="token operator">*</span><span class="token number">500</span><span class="token punctuation">)</span>
    chrome<span class="token punctuation">.</span>execute_script<span class="token punctuation">(</span>script<span class="token punctuation">)</span>
    time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">0.5</span><span class="token punctuation">)</span>
<span class="token comment"># ......</span>
</code></pre>
<h3>10.6. 
Handling pages that load content with asynchronous AJAX</h3>
<p>Cause:</p>
<p>When a page loads content through asynchronously executed AJAX JavaScript, looking up an element right after driver.get() can raise NoSuchElementException because the element has not been created yet.</p>
<p>Imports:</p>
<pre><code>from selenium.webdriver.common.by import By
from selenium.webdriver.support import ui
from selenium.webdriver.support import expected_conditions as EC
</code></pre>
<p>Solution:</p>
<pre><code class="prism language-python"><span class="token comment"># Block until the elements are visible; after the timeout (60 seconds here) WebDriverWait raises TimeoutException</span>
ui<span class="token punctuation">.</span>WebDriverWait<span class="token punctuation">(</span>driver<span class="token punctuation">,</span> <span class="token number">60</span><span class="token punctuation">)</span><span class="token punctuation">.</span>until<span class="token punctuation">(</span>EC<span class="token punctuation">.</span>visibility_of_all_elements_located<span class="token punctuation">(</span><span class="token punctuation">(</span>By<span class="token punctuation">.</span>CLASS_NAME<span class="token punctuation">,</span> <span class="token string">'soupager'</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
</code></pre>
<h3>10.7. Using switch_to</h3>
<p>Cause:</p>
<pre><code>- The page shows an alert dialog or contains an embedded iframe
- If the element to find is inside the alert or the iframe, the driver must first switch into that alert or iframe
</code></pre>
<p>Solution:</p>
<pre><code>1. Find the iframe tag object
   iframe = driver.find_element_by_id('login_frame')
2. Switch into the iframe
   driver.switch_to.frame(iframe)
</code></pre>
<h3>10.8. 
Getting the browser tabs</h3>
<pre><code class="prism language-python">browser<span class="token punctuation">.</span>window_handles<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token comment"># first tab; normally always present</span>
browser<span class="token punctuation">.</span>window_handles<span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span> <span class="token comment"># second tab, available when the browser opened a resource in a new tab; raises IndexError if it does not exist</span>
</code></pre>
<p>Quit: browser.quit()</p>
<p>Example:</p>
<p>Logging in to a mailbox (embedded iframe)</p>
<pre><code class="prism language-python"><span class="token keyword">import</span> time
<span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver <span class="token keyword">import</span> Chrome

chrome <span class="token operator">=</span> Chrome<span class="token punctuation">(</span>r<span class="token string">'D:\01-soft\12-spider_chromedriver\chromedriver.exe'</span><span class="token punctuation">)</span>
chrome<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'https://mail.qq.com'</span><span class="token punctuation">)</span> <span class="token comment"># blocks until the page load finishes (AJAX content may still arrive later)</span>
<span class="token comment"># The two input boxes below live inside an &lt;iframe&gt;</span>
login_frame <span class="token operator">=</span> chrome<span class="token punctuation">.</span>find_element_by_id<span class="token punctuation">(</span><span class="token string">'login_frame'</span><span class="token punctuation">)</span>
chrome<span class="token punctuation">.</span>switch_to<span class="token punctuation">.</span>frame<span class="token punctuation">(</span>login_frame<span class="token punctuation">)</span> <span class="token comment"># switch into the iframe</span>
uesr_input <span class="token operator">=</span> chrome<span class="token punctuation">.</span>find_element_by_id<span class="token punctuation">(</span><span 
class="token string">'u'</span><span class="token punctuation">)</span>
pwd_input <span class="token operator">=</span> chrome<span class="token punctuation">.</span>find_element_by_id<span class="token punctuation">(</span><span class="token string">'p'</span><span class="token punctuation">)</span>
uesr_input<span class="token punctuation">.</span>send_keys<span class="token punctuation">(</span><span class="token string">'123344556@qq.com'</span><span class="token punctuation">)</span>
pwd_input<span class="token punctuation">.</span>send_keys<span class="token punctuation">(</span><span class="token string">'zdx1233456'</span><span class="token punctuation">)</span>
<span class="token comment"># Find the login button and click it</span>
chrome<span class="token punctuation">.</span>find_element_by_id<span class="token punctuation">(</span><span class="token string">'login_button'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre>
<h3>10.9 Headless browser</h3>
<pre><code class="prism language-python"><span class="token keyword">from</span> selenium <span class="token keyword">import</span> webdriver
<span class="token keyword">from</span> time <span class="token keyword">import</span> sleep
<span class="token comment"># For running without a visible window</span>
<span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>options <span class="token keyword">import</span> Options
<span class="token comment"># For evading automation detection</span>
<span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver <span class="token keyword">import</span> ChromeOptions

<span class="token comment"># Configure headless mode</span>
chrome_options <span class="token operator">=</span> Options<span class="token punctuation">(</span><span class="token punctuation">)</span> 
chrome_options<span class="token punctuation">.</span>add_argument<span class="token punctuation">(</span><span class="token string">'--headless'</span><span class="token punctuation">)</span>
chrome_options<span class="token punctuation">.</span>add_argument<span class="token punctuation">(</span><span class="token string">'--disable-gpu'</span><span class="token punctuation">)</span>

<span class="token comment"># Evade detection. Options and ChromeOptions are the same class,</span>
<span class="token comment"># so the headless flags and this switch go on one object.</span>
chrome_options<span class="token punctuation">.</span>add_experimental_option<span class="token punctuation">(</span><span class="token string">'excludeSwitches'</span><span class="token punctuation">,</span> <span class="token punctuation">[</span><span class="token string">'enable-automation'</span><span class="token punctuation">]</span><span class="token punctuation">)</span>

path <span class="token operator">=</span> r<span class="token string">'D:\01-soft\12-spider_chromedriver\chromedriver.exe'</span>
<span class="token comment"># Pass a single options object: when both chrome_options= and options= are given,</span>
<span class="token comment"># Selenium 3 keeps only one of them and the other settings are lost</span>
bro <span class="token operator">=</span> webdriver<span class="token punctuation">.</span>Chrome<span class="token punctuation">(</span>path<span class="token punctuation">,</span> options<span class="token operator">=</span>chrome_options<span class="token punctuation">)</span>

<span class="token comment"># Headless browser (no visible window), as with PhantomJS</span>
bro<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'https://www.baidu.com'</span><span class="token punctuation">)</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>bro<span class="token punctuation">.</span>page_source<span class="token punctuation">)</span>
sleep<span class="token punctuation">(</span><span class="token 
number">2</span><span class="token punctuation">)</span>
bro<span class="token punctuation">.</span>quit<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre>
<h3>10.10 Simulated login to 12306</h3>
<h4>10.10.1. Using the 超级鹰 (Chaojiying) captcha service</h4>
<h2>十一、The scrapy framework</h2>
<h3>11.1. Introduction</h3>
<p>What is scrapy:</p>
<p>Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs, including data mining, information processing, and archiving historical data.</p>
<pre><code>Features: high-performance persistent storage, asynchronous downloading, high-performance parsing, distributed crawling
</code></pre>
<p>Official sites:</p>
<pre><code>https://doc.scrapy.org/en/latest/
http://www.scrapyd.cn/doc/ (Chinese)
http://scrapy-chs.readthedocs.io/zh_CN/latest/ (Chinese)
</code></pre>
<h3>11.2. Basic use of the scrapy framework</h3>
<pre><code>- Installing the environment:
    - mac or linux: pip install scrapy
    - windows:
        - pip install wheel
        - Download twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
        - Install twisted: pip install Twisted-17.1.0-cp36-cp36m-win_amd64.whl
        - pip install pywin32
        - pip install scrapy
    Test: type the scrapy command in a terminal; if no error is reported, the installation succeeded.
</code></pre>
<ul>
<li> <p>Create a project:</p> <pre><code>- scrapy startproject xxxPro
</code></pre> </li>
<li> <p>cd xxxPro</p> <p>Create a spider file in the spiders subdirectory</p> <pre><code>- scrapy genspider spiderName www.xxx.com
</code></pre> </li>
<li> <p>Run the project:</p> <pre><code>- scrapy crawl spiderName
</code></pre> </li>
</ul>
<h3>11.3. Framework components</h3>
<h4>11.3.1. 
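Five core components</h4>
<ul>
<li> <p><strong>engine</strong>: coordinates the other four components and handles all communication between them; it is the core of the scrapy framework. It runs automatically and needs no attention: it collects every request object and dispatches it to the downloader.</p> </li>
<li> <p><strong>spider</strong>: where the crawler code is written, and <strong>where requests originate</strong>. Requests issued by a spider pass through the engine into the scheduler.</p> <p><strong>It also parses the data once a request succeeds.</strong></p>
<pre><code>- scrapy.Spider: the basic spider
- scrapy.spiders.CrawlSpider
    - a spider class that follows configurable rules
    - scrapy.spiders.Rule: the rule class
- Entry-point function
    - start_requests()
</code></pre> </li>
<li> <p><strong>scheduler</strong>: schedules all requests (higher priority runs first). When a request is due, the engine passes it to the downloader.</p> </li>
<li> <p><strong>downloader</strong>: executes the request, fetches the data from the network, wraps it in a response object, and returns that object to the engine. The engine then hands the response (through the callback) back to the spider for parsing.</p> </li>
<li> <p><strong>item pipeline</strong>: once the spider has finished parsing, the data passes through the engine into the item pipeline, which processes it according to its type (images, text).</p>
<pre><code>1. Clean up HTML data
2. Validate the scraped data (check that the item contains certain fields)
3. Check for duplicates (and drop them)
4. Save the scraped results to a database
5. Download image data
</code></pre>
<p>Scrapy architecture diagram: (figure omitted)</p> </li>
</ul>
<p>Data flow:</p>
<pre><code>1. The engine gets the initial requests to crawl.
2. The engine schedules the requests in the scheduler and asks for the next request to crawl.
3. The scheduler returns the next request to the engine.
4. The engine sends the request to the downloader, passing through the downloader middlewares.
5. Once the page finishes downloading, the downloader returns the response to the engine.
6. The engine passes the response through the spider middlewares to the spider for processing.
7. The spider processes the response and returns the scraped items, plus any new requests, to the engine through the middlewares.
8. The engine sends the processed items to the item pipelines, then passes the new requests to the scheduler and asks for the next request to crawl.
9. The process repeats (from step 1) until every url request has been crawled.
</code></pre>
<h4>11.3.2. Using scrapy</h4>
<ul>
<li>Create a project <ul> <li>scrapy startproject project_name</li> </ul> </li>
<li>Create a spider <ul> <li>scrapy genspider spider_name domain</li> </ul> </li>
<li>Start a spider <ul> <li>scrapy crawl spider_name</li> </ul> </li>
<li>Debug a spider <ul> <li>scrapy shell url</li> <li>scrapy shell <ul> <li>fetch(url)</li> </ul> </li> </ul> </li>
</ul>
<p>Directory structure:</p>
<pre><code>- spiders
    - __init__.py
    - your_spider_file.py
- __init__.py
- items.py        Defines the data structure: a class that inherits from scrapy.Item, whose fields are declared as scrapy.Field()
                  Note: scrapy.Item behaves like a dict, so each element yielded by a spider's parse() can be an Item or a plain dict;
                  when an Item is used, its keys must match the declared Field() names
- middlewares.py  Middlewares, used to adjust how requests and responses are handled
- pipelines.py    Pipeline file; the generated template contains a single class, used for post-processing the scraped data
- settings.py     Configuration file, e.g. whether to obey robots.txt, the User-Agent definition, and so on
</code></pre>
<h4>11.3.3. 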
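Spider file</h4>
<p>Parse functions:</p>
<ul>
<li> <p>parse_detail(self, response: Response)</p>
<ul>
<li> <p>The callback that parses the data. response holds the downloaded data and can be parsed inside this function, usually with xpath. As with parse(), if the function returns a value, that value must be an iterable.</p> </li>
<li> <p>Members of Response:</p>
<pre><code>- selector (a property, not a method)
    - css()  CSS selector; returns an iterable (list) of Selector objects
        - scrapy.selector.SelectorList  selector list
            - x()/xpath()
        - scrapy.selector.Selector  selector
        - Extracting attributes or text with a CSS selector
            - `::text`  extracts the text
            - `::attr("attribute_name")`  extracts an attribute
    - xpath()  XPath expression, written the same way as lxml's xpath()
- Common selector methods
    - css()/xpath()
    - extract()  extracts every match; returns a list
    - extract_first()/get()  extracts the first match; returns a string
</code></pre>
<pre><code>- response is an instance of scrapy.http.HtmlResponse
- response.css('.class_name')  selects the tags with that class attribute
- response.css('.contentHerf::attr("href")')  gets the href attribute of the tag
- response.xpath()  returns a scrapy.selector.SelectorList of Selector objects
                    Internally: self.selector.xpath()
- extract()/getall()  Selector methods that return the selector's content, i.e. the data attribute of each Selector
    response.xpath('//title/text()').extract()  returns a list
    response.css('').xpath()  select the tag with css first, then extract content with xpath
- extract_first()/get()  extracts the first match
- Strings accepted by css(): #id, .class, div, div&gt;p, ::attr('attribute_name'), ::text (tag text)
</code></pre> </li>
<li> <p>The Request class</p> <p>scrapy.http.Request</p> <p>Attributes of a request object:</p>
<pre><code>- url
- callback  the callback function that parses the data
- headers   request headers
- priority  request priority
meta in Request() passes data (metadata) to the next parse function
    Note: meta is a dict, and its values must not be reference objects (observed with scrapy 1.5)
priority in Request() is the request's priority in the scheduler; the higher the value, the earlier it is downloaded
dont_filter=False in Request() filters out duplicate requests; True disables the filter
</code></pre> </li>
</ul> </li>
</ul>
<h4>11.3.4. 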
Example:</h4>
<pre><code class="prism language-python"><span class="token comment"># Example</span>
<span class="token keyword">import</span> scrapy
<span class="token keyword">from</span> scrapy <span class="token keyword">import</span> Request
<span class="token keyword">from</span> scrapy<span class="token punctuation">.</span>http <span class="token keyword">import</span> Response<span class="token punctuation">,</span> HtmlResponse
<span class="token keyword">from</span> scrapy<span class="token punctuation">.</span>selector <span class="token keyword">import</span> SelectorList

<span class="token keyword">from</span> qsbk <span class="token keyword">import</span> cookie_


<span class="token keyword">class</span> <span class="token class-name">TxtSpider</span><span class="token punctuation">(</span>scrapy<span class="token punctuation">.</span>Spider<span class="token punctuation">)</span><span class="token punctuation">:</span>  <span class="token comment"># inherits from the base Spider class</span>
    name <span class="token operator">=</span> <span class="token string">'jokes'</span>  <span class="token comment"># jokes from 糗事百科 (Qiushibaike)</span>
    allowed_domains <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token string">'qiushibaike.com'</span><span class="token punctuation">]</span>  <span class="token comment"># only URLs whose domain (host) is listed here may be downloaded</span>
    start_urls <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token string">'https://www.qiushibaike.com/text/'</span><span class="token punctuation">]</span>  <span class="token comment"># URLs of the initial requests</span>

    BASE_URL <span class="token operator">=</span> <span class="token string">'https://www.qiushibaike.com'</span>

    <span class="token keyword">def</span> <span class="token function">parse</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> response<span class="token punctuation">:</span> HtmlResponse<span class="token punctuation">)</span><span class="token 
punctuation">:</span>
        <span class="token comment"># Get all articles</span>
        <span class="token keyword">for</span> article_div_selector <span class="token keyword">in</span> response<span class="token punctuation">.</span>css<span class="token punctuation">(</span><span class="token string">'.article'</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
            author_item <span class="token operator">=</span> article_div_selector<span class="token punctuation">.</span>css<span class="token punctuation">(</span><span class="token string">'.author img'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">.</span>attrib
            author_item<span class="token punctuation">[</span><span class="token string">'name'</span><span class="token punctuation">]</span> <span class="token operator">=</span> author_item<span class="token punctuation">.</span>pop<span class="token punctuation">(</span><span class="token string">'alt'</span><span class="token punctuation">)</span>
            author_item<span class="token punctuation">[</span><span class="token string">'detail_href'</span><span class="token punctuation">]</span> <span class="token operator">=</span> article_div_selector<span class="token punctuation">.</span>css<span class="token punctuation">(</span><span class="token string">'.contentHerf::attr("href")'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token punctuation">)</span>
            <span class="token comment"># yield author_item</span>
            <span class="token comment"># Request the detail page</span>
            <span class="token comment"># meta in Request() passes data (metadata) to the next parse function</span>
            <span class="token comment"># Note: meta is a dict, and its values must not be reference objects (observed with scrapy 1.5)</span>
            <span class="token comment"># priority in Request() is the request's priority in the scheduler; the higher the value, the earlier it is downloaded</span>
            <span 
class="token comment"># dont_filter=False in Request() filters out duplicate requests; True disables the filter</span>
            <span class="token keyword">yield</span> Request<span class="token punctuation">(</span>self<span class="token punctuation">.</span>BASE_URL <span class="token operator">+</span> author_item<span class="token punctuation">[</span><span class="token string">'detail_href'</span><span class="token punctuation">]</span><span class="token punctuation">,</span>
                          callback<span class="token operator">=</span>self<span class="token punctuation">.</span>parse_detail<span class="token punctuation">,</span>
                          headers<span class="token operator">=</span><span class="token punctuation">{</span><span class="token string">'Referer'</span><span class="token punctuation">:</span> response<span class="token punctuation">.</span>url<span class="token punctuation">}</span><span class="token punctuation">,</span>
                          cookies<span class="token operator">=</span>cookie_<span class="token punctuation">.</span>get_cookies<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
                          meta<span class="token operator">=</span><span class="token punctuation">{</span><span class="token string">'author'</span><span class="token punctuation">:</span> author_item<span class="token punctuation">[</span><span class="token string">'name'</span><span class="token punctuation">]</span><span class="token punctuation">,</span> <span class="token string">'author_head'</span><span class="token punctuation">:</span> author_item<span class="token punctuation">[</span><span class="token string">'src'</span><span class="token punctuation">]</span><span class="token punctuation">}</span><span class="token punctuation">,</span>
                          priority<span class="token operator">=</span><span class="token number">100</span><span class="token punctuation">,</span>
                          dont_filter<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span>

    <span 
class="token keyword">def</span> <span class="token function">parse_detail</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> response<span class="token punctuation">:</span> Response<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token comment"># response.request</span>
        <span class="token comment"># print('parse_detail--&gt;', response.request.meta)</span>
        <span class="token comment"># print('parse_detail--&gt;', response.meta)</span>
        item <span class="token operator">=</span> <span class="token punctuation">{</span>
            <span class="token string">'author'</span><span class="token punctuation">:</span> response<span class="token punctuation">.</span>meta<span class="token punctuation">[</span><span class="token string">'author'</span><span class="token punctuation">]</span><span class="token punctuation">,</span>
            <span class="token string">'author_head'</span><span class="token punctuation">:</span> response<span class="token punctuation">.</span>meta<span class="token punctuation">[</span><span class="token string">'author_head'</span><span class="token punctuation">]</span>
        <span class="token punctuation">}</span>
        item<span class="token punctuation">[</span><span class="token string">'title'</span><span class="token punctuation">]</span> <span class="token operator">=</span> response<span class="token punctuation">.</span>css<span class="token punctuation">(</span><span class="token string">'.article-title::text'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token punctuation">)</span>
        item<span class="token punctuation">[</span><span class="token string">'publish_time'</span><span class="token punctuation">]</span> <span class="token operator">=</span> response<span class="token punctuation">.</span>css<span class="token punctuation">(</span><span class="token 
string">'.stats-time::text'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token punctuation">)</span>
        item<span class="token punctuation">[</span><span class="token string">'content'</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token string">'\n'</span><span class="token punctuation">.</span>join<span class="token punctuation">(</span><span class="token punctuation">[</span>c<span class="token punctuation">.</span>replace<span class="token punctuation">(</span><span class="token string">'\xa0'</span><span class="token punctuation">,</span> <span class="token string">' '</span><span class="token punctuation">)</span> <span class="token keyword">for</span> c <span class="token keyword">in</span> response<span class="token punctuation">.</span>css<span class="token punctuation">(</span><span class="token string">'.content::text'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>getall<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
        <span class="token keyword">yield</span> item

    <span class="token keyword">def</span> <span class="token function">parse_test</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> response<span class="token punctuation">:</span> HtmlResponse<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token comment"># scrapy.http.response.html.HtmlResponse</span>
        <span class="token comment"># print(type(response), response)</span>
        <span class="token comment"># class contentHerf -&gt; the a tag</span>
        <span class="token comment"># css()/xpath() -&gt; scrapy.selector.unified.SelectorList</span>
        <span class="token comment"># scrapy.selector.SelectorList/Selector</span>
        <span class="token 
comment">#   -&gt; xpath()/css() query descendant elements</span>
        <span class="token comment">#   -&gt; get()/extract_first(), getall()/extract() extract the data attribute of the Selector objects</span>
        <span class="token comment">#   -&gt; attrib: the attribute dict of a Selector (the selection must be an element, not an attribute of one)</span>
        <span class="token comment"># Strings accepted by css(): #id, .class, div, div&gt;p, ::attr('attribute_name'), ::text (tag text)</span>
        a_elements<span class="token punctuation">:</span> SelectorList <span class="token operator">=</span> response<span class="token punctuation">.</span>css<span class="token punctuation">(</span><span class="token string">'.contentHerf::attr("href")'</span><span class="token punctuation">)</span>
        <span class="token comment"># list[&lt;Selector&gt;,...]</span>
        author_elements <span class="token operator">=</span> response<span class="token punctuation">.</span>css<span class="token punctuation">(</span><span class="token string">'.author'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'.//img'</span><span class="token punctuation">)</span>
        <span class="token keyword">for</span> i<span class="token punctuation">,</span> author_element <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>author_elements<span class="token punctuation">)</span><span class="token punctuation">:</span>
            item<span class="token punctuation">:</span> <span class="token builtin">dict</span> <span class="token operator">=</span> author_element<span class="token punctuation">.</span>attrib  <span class="token comment"># dict holding the src and alt attributes</span>
            <span class="token comment"># Rename the alt key to name</span>
            item<span class="token punctuation">[</span><span class="token string">'name'</span><span class="token punctuation">]</span> <span class="token operator">=</span> item<span class="token punctuation">.</span>pop<span class="token punctuation">(</span><span class="token string">'alt'</span><span 
class="token punctuation">)</span>
            item<span class="token punctuation">[</span><span class="token string">'detail_href'</span><span class="token punctuation">]</span> <span class="token operator">=</span> a_elements<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token punctuation">)</span>
            <span class="token keyword">yield</span> item
</code></pre>
<h4>11.3.5. scrapy shell</h4>
<p>Terminal debugging tool:</p>
<pre><code>- Run scrapy shell "http://www.baidu.com" in a terminal
    The shell provides a response object that can be used directly
- response.xpath()
    Queries elements by XPath and returns a SelectorList (a list of Selector objects)
- response.css()
    Queries elements by CSS selector and also returns a SelectorList
</code></pre>
<h3>11.4. Persistent storage in scrapy</h3>
<h4>11.4.1 Storage via terminal command</h4>
<pre><code>- Terminal command:
    - Requirement: only the return value of the parse method can be stored, in a local text file
    - Note: the output file type must be one of: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'
    - Command: scrapy crawl xxx -o filePath
    - Pros: concise, efficient, and convenient
    - Cons: fairly limited (data can only be stored in files with the listed extensions)
</code></pre>
<h4>11.4.2. Storage via pipelines</h4>
<pre><code>- Pipelines:
    - Coding steps:
        - Parse the data
        - Define the relevant fields in the item class
        - Wrap the parsed data in an item object
        - Submit the item object to the pipeline for persistence
        - In the pipeline class's process_item, persist the data held by the received item object
        - Enable the pipeline in the settings file
    - Pros:
        - General purpose.
</code></pre>
<h4>11.4.3. 
Example:</h4>
<p>spider file: qiubai.py</p>
<pre><code class="prism language-python"><span class="token keyword">import</span> scrapy
<span class="token keyword">from</span> qiubaiPro<span class="token punctuation">.</span>items <span class="token keyword">import</span> QiubaiproItem


<span class="token keyword">class</span> <span class="token class-name">QiubaiSpider</span><span class="token punctuation">(</span>scrapy<span class="token punctuation">.</span>Spider<span class="token punctuation">)</span><span class="token punctuation">:</span>
    name <span class="token operator">=</span> <span class="token string">'qiubai'</span>
    <span class="token comment"># allowed_domains = ['www.xxx.com']</span>
    start_urls <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token string">'https://www.qiushibaike.com/text/'</span><span class="token punctuation">]</span>

    <span class="token comment"># Variant for storage via terminal command</span>
    <span class="token comment"># def parse(self, response):</span>
    <span class="token comment">#     # Parse: author name + joke content</span>
    <span class="token comment">#     div_list = response.xpath('//div[@id="content-left"]/div')</span>
    <span class="token comment">#     all_data = []  # holds all parsed data</span>
    <span class="token comment">#     for div in div_list:</span>
    <span class="token comment">#         # xpath returns a list whose elements are always Selector objects</span>
    <span class="token comment">#         # extract pulls out the string stored in the Selector's data attribute</span>
    <span class="token comment">#         # author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()</span>
    <span class="token comment">#         author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()</span>
    <span class="token comment">#         # Calling extract on the list pulls the data string out of every Selector in it</span>
    <span class="token comment">#         content = div.xpath('./a[1]/div/span//text()').extract()</span>
    <span class="token comment">#         content = ''.join(content)</span>
    <span class="token comment">#</span>
    <span class="token comment">#         dic = {</span>
    <span class="token comment">#             
'author':author,</span>
    <span class="token comment">#             'content':content</span>
    <span class="token comment">#         }</span>
    <span class="token comment">#</span>
    <span class="token comment">#         all_data.append(dic)</span>
    <span class="token comment">#</span>
    <span class="token comment">#</span>
    <span class="token comment">#     return all_data</span>

    <span class="token keyword">def</span> <span class="token function">parse</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> response<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token comment"># Parse: author name + joke content</span>
        div_list <span class="token operator">=</span> response<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'//div[@id="content-left"]/div'</span><span class="token punctuation">)</span>
        <span class="token keyword">for</span> div <span class="token keyword">in</span> div_list<span class="token punctuation">:</span>
            <span class="token comment"># xpath returns a list whose elements are always Selector objects</span>
            <span class="token comment"># extract pulls out the string stored in the Selector's data attribute</span>
            <span class="token comment"># author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()</span>
            author <span class="token operator">=</span> div<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div[1]/a[2]/h2/text() | ./div[1]/span/h2/text()'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>extract_first<span class="token punctuation">(</span><span class="token punctuation">)</span>
            <span class="token comment"># Calling extract on the list pulls the data string out of every Selector in it</span>
            content <span class="token operator">=</span> div<span class="token 
punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./a[1]/div/span//text()'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>extract<span class="token punctuation">(</span><span class="token punctuation">)</span>
            content <span class="token operator">=</span> <span class="token string">''</span><span class="token punctuation">.</span>join<span class="token punctuation">(</span>content<span class="token punctuation">)</span>

            item <span class="token operator">=</span> QiubaiproItem<span class="token punctuation">(</span><span class="token punctuation">)</span>
            item<span class="token punctuation">[</span><span class="token string">'author'</span><span class="token punctuation">]</span> <span class="token operator">=</span> author
            item<span class="token punctuation">[</span><span class="token string">'content'</span><span class="token punctuation">]</span> <span class="token operator">=</span> content

            <span class="token keyword">yield</span> item  <span class="token comment"># submit the item to the pipeline</span>
</code></pre>
<p>items.py file: define the relevant fields in the item class</p>
<pre><code class="prism language-python"><span class="token keyword">import</span> scrapy


<span class="token keyword">class</span> <span class="token class-name">QiubaiproItem</span><span class="token punctuation">(</span>scrapy<span class="token punctuation">.</span>Item<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># define the fields for your item here like:</span>
    author <span class="token operator">=</span> scrapy<span class="token punctuation">.</span>Field<span class="token punctuation">(</span><span class="token punctuation">)</span>
    content <span class="token operator">=</span> scrapy<span class="token punctuation">.</span>Field<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token comment"># pass</span>
</code></pre>
<p>pipelines.py file</p> 
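<p>Before the full listing, a minimal sketch of the pipeline contract that runs without Scrapy or MySQL: the class exposes the same three hooks (open_spider, process_item, close_spider) and writes each item through a parameterized query. The class name SqlitePipeline, the run_demo helper, and the use of the standard-library sqlite3 module are assumptions made for illustration only; sqlite3 uses ? placeholders where pymysql uses %s.</p>

```python
import sqlite3


class SqlitePipeline:
    """Hypothetical pipeline: same three hooks as a Scrapy pipeline, stdlib storage."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the connection, create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS qiubai (author TEXT, content TEXT)"
        )

    def process_item(self, item, spider):
        # Called once per item. The values travel separately from the SQL text,
        # so quotes inside them cannot break or alter the statement.
        try:
            self.conn.execute(
                "INSERT INTO qiubai VALUES (?, ?)",
                (item["author"], item["content"]),
            )
            self.conn.commit()
        except sqlite3.Error:
            self.conn.rollback()
            raise
        return item  # hand the item to the next pipeline class, if any

    def close_spider(self, spider):
        # Called once when the spider finishes.
        if self.conn is not None:
            self.conn.close()
            self.conn = None


def run_demo():
    # Drive the hooks by hand, the way the framework would.
    pipeline = SqlitePipeline()
    pipeline.open_spider(spider=None)
    pipeline.process_item({"author": 'Tom "T"', "content": "it's fine"}, spider=None)
    rows = pipeline.conn.execute("SELECT author, content FROM qiubai").fetchall()
    pipeline.close_spider(spider=None)
    return rows
```

<p>Passing the values as a separate tuple lets the database driver do the quoting, so authors or jokes that contain quote characters are stored intact.</p>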
<pre><code class="prism language-python"><span class="token keyword">import</span> pymysql


<span class="token keyword">class</span> <span class="token class-name">QiubaiproPipeline</span><span class="token punctuation">(</span><span class="token builtin">object</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    fp <span class="token operator">=</span> <span class="token boolean">None</span>

    <span class="token comment"># Hook method that Scrapy calls exactly once, when the spider starts</span>
    <span class="token keyword">def</span> <span class="token function">open_spider</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span>spider<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'Spider started......'</span><span class="token punctuation">)</span>
        self<span class="token punctuation">.</span>fp <span class="token operator">=</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">'./qiubai.txt'</span><span class="token punctuation">,</span><span class="token string">'w'</span><span class="token punctuation">,</span>encoding<span class="token operator">=</span><span class="token string">'utf-8'</span><span class="token punctuation">)</span>

    <span class="token comment"># Dedicated to processing item objects</span>
    <span class="token comment"># Receives the item objects submitted by the spider file</span>
    <span class="token comment"># Called once for every item received</span>
    <span class="token keyword">def</span> <span class="token function">process_item</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> item<span class="token punctuation">,</span> spider<span class="token punctuation">)</span><span class="token punctuation">:</span>
        author <span class="token operator">=</span> item<span class="token punctuation">[</span><span class="token string">'author'</span><span class="token 
punctuation">]</span>
        content<span class="token operator">=</span> item<span class="token punctuation">[</span><span class="token string">'content'</span><span class="token punctuation">]</span>

        self<span class="token punctuation">.</span>fp<span class="token punctuation">.</span>write<span class="token punctuation">(</span>author<span class="token operator">+</span><span class="token string">':'</span><span class="token operator">+</span>content<span class="token operator">+</span><span class="token string">'\n'</span><span class="token punctuation">)</span>

        <span class="token keyword">return</span> item  <span class="token comment"># the returned item is passed on to the next pipeline class to run</span>

    <span class="token keyword">def</span> <span class="token function">close_spider</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span>spider<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'Spider finished!'</span><span class="token punctuation">)</span>
        self<span class="token punctuation">.</span>fp<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span>


<span class="token comment"># Each pipeline class in the pipelines file stores the data to one platform or medium</span>
<span class="token keyword">class</span> <span class="token class-name">mysqlPileLine</span><span class="token punctuation">(</span><span class="token builtin">object</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    conn <span class="token operator">=</span> <span class="token boolean">None</span>
    cursor <span class="token operator">=</span> <span class="token boolean">None</span>

    <span class="token keyword">def</span> <span class="token function">open_spider</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span>spider<span class="token punctuation">)</span><span class="token 
punctuation">:</span> self<span class="token punctuation">.</span>conn <span class="token operator">=</span> pymysql<span class="token punctuation">.</span>Connect<span class="token punctuation">(</span>host<span class="token operator">=</span><span class="token string">'116.85.7.220'</span><span class="token punctuation">,</span> port<span class="token operator">=</span><span class="token number">3307</span><span class="token punctuation">,</span> user<span class="token operator">=</span><span class="token string">'root'</span><span class="token punctuation">,</span> password<span class="token operator">=</span><span class="token string">'root'</span><span class="token punctuation">,</span> db<span class="token operator">=</span><span class="token string">'qiubai'</span><span class="token punctuation">,</span> charset<span class="token operator">=</span><span class="token string">'utf8'</span><span class="token punctuation">)</span> <span class="token keyword">def</span> <span class="token function">process_item</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span>item<span class="token punctuation">,</span>spider<span class="token punctuation">)</span><span class="token punctuation">:</span> self<span class="token punctuation">.</span>cursor <span class="token operator">=</span> self<span class="token punctuation">.</span>conn<span class="token punctuation">.</span>cursor<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">try</span><span class="token punctuation">:</span> self<span class="token punctuation">.</span>cursor<span class="token punctuation">.</span>execute<span class="token punctuation">(</span><span class="token string">'insert into qiubai values("%s","%s")'</span><span class="token operator">%</span><span class="token punctuation">(</span>item<span class="token punctuation">[</span><span class="token string">"author"</span><span class="token 
punctuation">]</span><span class="token punctuation">,</span>item<span class="token punctuation">[</span><span class="token string">"content"</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">)</span> self<span class="token punctuation">.</span>conn<span class="token punctuation">.</span>commit<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">except</span> Exception <span class="token keyword">as</span> e<span class="token punctuation">:</span> <span class="token keyword">print</span><span class="token punctuation">(</span>e<span class="token punctuation">)</span> self<span class="token punctuation">.</span>conn<span class="token punctuation">.</span>rollback<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">return</span> item <span class="token keyword">def</span> <span class="token function">close_spider</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span>spider<span class="token punctuation">)</span><span class="token punctuation">:</span> self<span class="token punctuation">.</span>cursor<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span> self<span class="token punctuation">.</span>conn<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span> </code></pre> <p>settings.py文件中开启管道</p> <pre><code class="prism language-python"><span class="token comment"># ...</span> ITEM_PIPELINES <span class="token operator">=</span> <span class="token punctuation">{ </span> <span class="token string">'qiubaiPro.pipelines.QiubaiproPipeline'</span><span class="token punctuation">:</span> <span class="token number">300</span><span class="token punctuation">,</span> <span class="token 
string">'qiubaiPro.pipelines.mysqlPileLine'</span><span class="token punctuation">:</span> <span class="token number">301</span><span class="token punctuation">,</span> <span class="token comment">#300表示的是优先级,数值越小优先级越高</span> <span class="token punctuation">}</span> <span class="token comment"># ...</span> </code></pre> <p>【扩展】:</p> <pre><code>- 面试题:将爬取到的数据一份存储到本地一份存储到数据库,如何实现? - 管道文件中一个管道类对应的是将数据存储到一种平台 - 爬虫文件提交的item只会给管道文件中第一个被执行的管道类接受 - process_item中的return item表示将item传递给下一个即将被执行的管道类 </code></pre> <h4>11.4.4. 全站数据爬取</h4> <pre><code>- 基于Spider的全站数据爬取 - 就是将网站中某板块下的全部页码对应的页面数据进行爬取 - 需求:爬取校花网中的照片的名称 - 实现方式: - 将所有页面的url添加到start_urls列表(不推荐) - 自行手动进行请求发送(推荐) - 手动请求发送: - yield scrapy.Request(url,callback):callback专门用做于数据解析 </code></pre> <p>示例:spider.py文件</p> <pre><code class="prism language-python"><span class="token keyword">import</span> scrapy <span class="token keyword">class</span> <span class="token class-name">XiaohuaSpider</span><span class="token punctuation">(</span>scrapy<span class="token punctuation">.</span>Spider<span class="token punctuation">)</span><span class="token punctuation">:</span> name <span class="token operator">=</span> <span class="token string">'xiaohua'</span> <span class="token comment"># allowed_domains = ['www.xxx.com']</span> start_urls <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token string">'http://www.521609.com/meinvxiaohua/'</span><span class="token punctuation">]</span> <span class="token comment">#生成一个通用的url模板(不可变)</span> url <span class="token operator">=</span> <span class="token string">'http://www.521609.com/meinvxiaohua/list12%d.html'</span> page_num <span class="token operator">=</span> <span class="token number">2</span> <span class="token keyword">def</span> <span class="token function">parse</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> response<span class="token punctuation">)</span><span class="token punctuation">:</span> 
        li_list = response.xpath('//*[@id="content"]/div[2]/div[2]/ul/li')
        for li in li_list:
            img_name = li.xpath('./a[2]/b/text() | ./a[2]/text()').extract_first()
            print(img_name)

        if self.page_num &lt;= 11:
            new_url = self.url % self.page_num
            self.page_num += 1
            # Send the request manually; the callback is dedicated to parsing the data
            yield scrapy.Request(url=new_url, callback=self.parse)
</code></pre>
<p>11.5.
Passing data between requests</p>
<pre><code>- Passing data between requests
    - When to use: the data to scrape and parse is not all on the same page (deep crawling).
    - Task: scrape the job titles and job descriptions from BOSS Zhipin (zhipin.com)
</code></pre>
<p>Example: boss.py</p>
<pre><code class="prism language-python">import scrapy
from bossPro.items import BossproItem


class BossSpider(scrapy.Spider):
    name = 'boss'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.zhipin.com/job_detail/?query=python&amp;city=101010100&amp;industry=&amp;position=']

    url = 'https://www.zhipin.com/c101010100/?query=python&amp;page=%d'
    page_num = 2

    # The callback receives the item through response.meta
    # Parse the detail page
    def parse_detail(self, response):
        item = response.meta['item']

        job_desc = response.xpath('//*[@id="main"]/div[3]/div/div[2]/div[2]/div[1]/div//text()').extract()
        job_desc = ''.join(job_desc)
        # print(job_desc)
        item['job_desc'] = job_desc

        yield item

    # Parse the job titles on the listing page
    def parse(self, response):
        li_list = response.xpath('//*[@id="main"]/div/div[3]/ul/li')
        for li in li_list:
            item = BossproItem()

            job_name = li.xpath('.//div[@class="info-primary"]/h3/a/div[1]/text()').extract_first()
            item['job_name'] = job_name
            # print(job_name)
            detail_url = 'https://www.zhipin.com' + li.xpath('.//div[@class="info-primary"]/h3/a/@href').extract_first()
            # Request the detail page to get its page source
            # Send the request manually
            # Passing data: meta={} hands the meta dict to the callback of this request
            yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})

        # Pagination
        if self.page_num &lt;= 3:
            new_url = self.url % self.page_num
            self.page_num += 1
            yield scrapy.Request(new_url, callback=self.parse)
</code></pre>
<h3>11.4. Image data: ImagesPipeline</h3>
<pre><code>- Scraping image data with ImagesPipeline
    - How does scraping string data differ from scraping image data in scrapy?
        - Strings: parse with xpath and submit to a pipeline for persistent storage.
        - Images: xpath yields only the value of the image's src attribute; a separate request to that address is needed to fetch the binary image data.
    - ImagesPipeline:
        - You only parse the value of the img src attribute and submit it to the pipeline; the pipeline then requests the src, fetches the binary image data, and also persists it for us.
    - Workflow:
        - Parse the data (the image address)
        - Submit the item holding the image address to the designated pipeline class
        - In the pipelines file, define a custom pipeline class based on ImagesPipeline
            - get_media_requests
            - file_path
            - item_completed  # returns the item to the next pipeline class
        - In the settings file:
            - Set the image storage directory: IMAGES_STORE = './imgs_zhb'
            - Enable the pipeline: the custom pipeline class
</code></pre>
<p>Example:</p>
<p>Task: scrape the high-resolution images from Chinaz Sucai (站长素材, sc.chinaz.com)</p>
<p>1. Spider script (parses the data): img.py</p>
<pre><code class="prism language-python">import scrapy
from imgspro.items import ImgsproItem


class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://sc.chinaz.com/tupian/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="container"]/div')
        for div in div_list:
            # Note: read the pseudo-attribute src2 rather than src
            src = div.xpath('./div/a/img/@src2').extract_first()
            # print(src)
            # Instantiate the item object
            item = ImgsproItem()
            item['src'] = src

            yield item  # submit the item to the pipeline
</code></pre>
<p>2. items.py: define the fields on the item class</p>
<pre><code class="prism language-python">import scrapy


class ImgsproItem(scrapy.Item):
    # define the fields for your item here like:
    src = scrapy.Field()
    # pass
</code></pre>
<p>3. In the pipelines file, define a custom pipeline class based on ImagesPipeline</p>
<p>pipelines.py</p>
<pre><code class="prism language-python">from scrapy.pipelines.images import ImagesPipeline
import scrapy


class ImgsPileLine(ImagesPipeline):

    # Issue the request for the image data at the image address
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['src'])

    # Specify the path the image is stored under (relative to IMAGES_STORE)
    def file_path(self, request, response=None, info=None):
        imgName = request.url.split('/')[-1]
        return imgName

    def item_completed(self, results, item, info):
        return item  # returned to the next pipeline class to run
</code></pre>
<p>4. In the settings file:</p>
<p>settings.py</p>
<pre><code class="prism language-python">...
LOG_LEVEL = 'ERROR'

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
...
ITEM_PIPELINES = {
    'imgsPro.pipelines.ImgsPileLine': 300,
}
...
# Directory where the images are stored
IMAGES_STORE = './imgs_zhb'
</code></pre>
<h3>11.5. Middleware</h3>
<pre><code>Middleware
- Spider middleware
- Downloader middleware [important]
    - Position: between the engine and the downloader
    - Purpose: intercept every request and response in the whole project
    - Intercepting requests:
        - UA spoofing: process_request
        - Proxy IP: process_exception: return request
    - Intercepting responses:
        - Tamper with the response data or the response object
        - Task: scrape the news data (title and content) from NetEase News
            - 1. Parse the URLs of the five section pages from the NetEase News home page (not dynamically loaded)
            - 2. The news titles in each section are loaded dynamically
            - 3. Parse out the detail-page URL of each news item, fetch the page source, and parse out the news content
</code></pre>
<h4>11.5.1. Intercepting requests:</h4>
<p>middlewares.py</p>
<pre><code class="prism language-python">from scrapy import signals
import random


class MiddleproDownloaderMiddleware(object):
    <span class="token comment"># Not all methods need to be defined.
If a method is not defined,</span>
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    # A pool of User-Agent strings
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
        "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
    # Two proxy lists: one for http, one for https
    PROXY_http = [
        '153.180.102.104:80',
        '195.208.131.189:56055',
    ]
    PROXY_https = [
        '120.83.49.90:9000',
        '95.189.112.214:35508',
    ]

    # Intercept requests
    def process_request(self, request, spider):
        # UA spoofing
        request.headers['User-Agent'] = random.choice(self.user_agent_list)

        # Fixed proxy, set here only to verify that proxying takes effect
        request.meta['proxy'] = 'http://183.146.213.198:80'
        return None

    # Intercept all responses
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    # Intercept requests that raised an exception
    def process_exception(self, request, exception, spider):
        if request.url.split(':')[0] == 'http':
            # Proxy
            request.meta['proxy'] = 'http://' + random.choice(self.PROXY_http)
        else:
            request.meta['proxy'] = 'https://' + random.choice(self.PROXY_https)

        return request  # resend the corrected request object
</code></pre>
<h4>11.5.2.
Intercepting responses</h4>
<p>Goal: crawl news data from NetEase News (news.163.com).</p>
<p>wangyi.py</p>
<pre><code class="prism language-python"><span class="token keyword">import</span> scrapy
<span class="token keyword">from</span> selenium <span class="token keyword">import</span> webdriver
<span class="token keyword">from</span> wangyiPro<span class="token punctuation">.</span>items <span class="token keyword">import</span> WangyiproItem


<span class="token keyword">class</span> <span class="token class-name">WangyiSpider</span><span class="token punctuation">(</span>scrapy<span class="token punctuation">.</span>Spider<span class="token punctuation">)</span><span class="token punctuation">:</span>
    name <span class="token operator">=</span> <span class="token string">'wangyi'</span>
    <span class="token comment"># allowed_domains = ['www.cccom']</span>
    start_urls <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token string">'https://news.163.com/'</span><span class="token punctuation">]</span>
    models_urls <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>  <span class="token comment"># URLs of the five news section pages</span>

    <span class="token comment"># Create one browser object for the whole crawl</span>
    <span class="token keyword">def</span> <span class="token function">__init__</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> <span class="token operator">*</span>args<span class="token punctuation">,</span> <span class="token operator">**</span>kwargs<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token builtin">super</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>__init__<span class="token punctuation">(</span><span class="token operator">*</span>args<span class="token punctuation">,</span> <span class="token operator">**</span>kwargs<span class="token punctuation">)</span>
        <span class="token comment"># executable_path works with Selenium 3; newer Selenium 4 releases expect a Service object instead</span>
        self<span class="token punctuation">.</span>bro <span class="token operator">=</span> webdriver<span class="token punctuation">.</span>Chrome<span class="token punctuation">(</span>executable_path<span class="token operator">=</span>r<span class="token string">'D:\01-soft\12-spider_chromedriver\chromedriver.exe'</span><span class="token punctuation">)</span>

    <span class="token comment"># Extract the URLs of the five section pages from the home page</span>
    <span class="token keyword">def</span> <span class="token function">parse</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> response<span class="token punctuation">)</span><span class="token punctuation">:</span>
        li_list <span class="token operator">=</span> response<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li'</span><span class="token punctuation">)</span>
        alist <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token number">3</span><span class="token punctuation">,</span> <span class="token number">4</span><span class="token punctuation">,</span> <span class="token number">6</span><span class="token punctuation">,</span> <span class="token number">7</span><span class="token punctuation">,</span> <span class="token number">8</span><span class="token punctuation">]</span>
        <span class="token keyword">for</span> index <span class="token keyword">in</span> alist<span class="token punctuation">:</span>
            model_url <span class="token operator">=</span> li_list<span class="token punctuation">[</span>index<span class="token punctuation">]</span><span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./a/@href'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>extract_first<span class="token punctuation">(</span><span class="token punctuation">)</span>
            self<span class="token punctuation">.</span>models_urls<span class="token punctuation">.</span>append<span class="token punctuation">(</span>model_url<span class="token punctuation">)</span>

        <span class="token comment"># Request each section page in turn</span>
        <span class="token keyword">for</span> url <span class="token keyword">in</span> self<span class="token punctuation">.</span>models_urls<span class="token punctuation">:</span>
            <span class="token keyword">yield</span> scrapy<span class="token
punctuation">.</span>Request<span class="token punctuation">(</span>url<span class="token punctuation">,</span> callback<span class="token operator">=</span>self<span class="token punctuation">.</span>parse_model<span class="token punctuation">)</span>

    <span class="token comment"># The news titles on each section page are loaded dynamically</span>
    <span class="token keyword">def</span> <span class="token function">parse_model</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> response<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token comment"># Parse the title and detail-page URL of every news entry on a section page</span>
        div_list <span class="token operator">=</span> response<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'/html/body/div/div[3]/div[4]/div[1]/div/div/ul/li/div/div'</span><span class="token punctuation">)</span>
        <span class="token keyword">for</span> div <span class="token keyword">in</span> div_list<span class="token punctuation">:</span>
            title <span class="token operator">=</span> div<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div/div[1]/h3/a/text()'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>extract_first<span class="token punctuation">(</span><span class="token punctuation">)</span>
            new_detail_url <span class="token operator">=</span> div<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div/div[1]/h3/a/@href'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>extract_first<span class="token punctuation">(</span><span class="token punctuation">)</span>
            item <span class="token operator">=</span> WangyiproItem<span class="token punctuation">(</span><span class="token punctuation">)</span>
            item<span class="token punctuation">[</span><span class="token string">'title'</span><span class="token punctuation">]</span> <span class="token operator">=</span> title
            <span class="token comment"># Request the news detail page and pass the item along in meta</span>
            <span class="token keyword">yield</span> scrapy<span class="token punctuation">.</span>Request<span class="token punctuation">(</span>url<span class="token operator">=</span>new_detail_url<span class="token punctuation">,</span> callback<span class="token operator">=</span>self<span class="token punctuation">.</span>parse_detail<span class="token punctuation">,</span> meta<span class="token operator">=</span><span class="token punctuation">{</span><span class="token string">'item'</span><span class="token punctuation">:</span> item<span class="token punctuation">}</span><span class="token punctuation">)</span>

    <span class="token keyword">def</span> <span class="token function">parse_detail</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> response<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token comment"># Parse the news body</span>
        content <span class="token operator">=</span> response<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="endText"]//text()'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>extract<span class="token punctuation">(</span><span class="token punctuation">)</span>
        content <span class="token operator">=</span> <span class="token string">''</span><span class="token punctuation">.</span>join<span class="token punctuation">(</span>content<span class="token punctuation">)</span>
        item <span class="token operator">=</span> response<span class="token punctuation">.</span>meta<span class="token punctuation">[</span><span class="token string">'item'</span><span class="token punctuation">]</span>
        item<span class="token punctuation">[</span><span class="token string">'content'</span><span class="token punctuation">]</span> <span class="token operator">=</span> content
        <span class="token keyword">yield</span> item

    <span class="token comment"># Quit the browser when the spider closes (Scrapy passes the close reason)</span>
    <span class="token keyword">def</span> <span class="token function">closed</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> reason<span class="token punctuation">)</span><span class="token punctuation">:</span>
        self<span class="token punctuation">.</span>bro<span class="token punctuation">.</span>quit<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre>
<p>middlewares.py</p>
<pre><code class="prism language-python"><span class="token keyword">from</span> scrapy<span class="token punctuation">.</span>http <span class="token keyword">import</span> HtmlResponse
<span class="token keyword">from</span> time <span class="token keyword">import</span> sleep


<span class="token keyword">class</span> <span class="token class-name">WangyiproDownloaderMiddleware</span><span class="token punctuation">(</span><span class="token builtin">object</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># Not all methods need to be defined. If a method is not defined,</span>
    <span class="token comment"># scrapy acts as if the downloader middleware does not modify the</span>
    <span class="token comment"># passed objects.</span>

    <span class="token keyword">def</span> <span class="token function">process_request</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> request<span class="token punctuation">,</span> spider<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token comment"># Called for each request that goes through the downloader</span>
        <span class="token comment"># middleware.</span>
        <span class="token comment"># Must either:</span>
        <span class="token comment"># - return None: continue processing this request</span>
        <span class="token comment"># - or return a Response object</span>
        <span class="token comment"># - or return a Request object</span>
        <span class="token comment"># - or raise IgnoreRequest: process_exception() methods of</span>
        <span class="token comment">#   installed downloader middleware will be called</span>
        <span class="token keyword">return</span> <span class="token boolean">None</span>

    <span class="token comment"># Intercept the responses of the five section pages and replace them</span>
    <span class="token keyword">def</span> <span class="token function">process_response</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> request<span class="token punctuation">,</span> response<span class="token punctuation">,</span> spider<span class="token punctuation">)</span><span class="token punctuation">:</span>  <span class="token comment"># spider: the running spider object</span>
        bro <span class="token operator">=</span> spider<span class="token punctuation">.</span>bro  <span class="token comment"># the browser object created in the spider class</span>
        <span class="token comment"># Pick out the responses to replace:</span>
        <span class="token comment"># the URL identifies the request, and the request identifies its response</span>
        <span class="token keyword">if</span> request<span
class="token punctuation">.</span>url <span class="token keyword">in</span> spider<span class="token punctuation">.</span>models_urls<span class="token punctuation">:</span>
            bro<span class="token punctuation">.</span>get<span class="token punctuation">(</span>request<span class="token punctuation">.</span>url<span class="token punctuation">)</span>  <span class="token comment"># load the section page in the browser</span>
            sleep<span class="token punctuation">(</span><span class="token number">3</span><span class="token punctuation">)</span>
            page_text <span class="token operator">=</span> bro<span class="token punctuation">.</span>page_source  <span class="token comment"># now includes the dynamically loaded news data</span>
            <span class="token comment"># Build a new response that contains the dynamically loaded data</span>
            <span class="token comment"># (obtained through selenium) and return it in place of the original one</span>
            new_response <span class="token operator">=</span> HtmlResponse<span class="token punctuation">(</span>url<span class="token operator">=</span>request<span class="token punctuation">.</span>url<span class="token punctuation">,</span> body<span class="token operator">=</span>page_text<span class="token punctuation">,</span> encoding<span class="token operator">=</span><span class="token string">'utf-8'</span><span class="token punctuation">,</span> request<span class="token operator">=</span>request<span class="token punctuation">)</span>
            <span class="token keyword">return</span> new_response
        <span class="token keyword">else</span><span class="token punctuation">:</span>
            <span class="token comment"># Responses of all other requests pass through unchanged</span>
            <span class="token keyword">return</span> response

    <span class="token keyword">def</span> <span class="token function">process_exception</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> request<span class="token punctuation">,</span> exception<span class="token punctuation">,</span> spider<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token comment"># Called when a download handler or a process_request()</span>
        <span class="token comment"># (from other downloader middleware) raises an exception.</span>
        <span class="token comment"># Must either:</span>
        <span class="token comment"># - return None: continue processing this exception</span>
        <span class="token comment"># - return a Response object: stops process_exception() chain</span>
        <span class="token comment"># - return a Request object: stops process_exception() chain</span>
        <span class="token keyword">pass</span>
</code></pre>
<p>items.py</p>
<pre><code class="prism language-python"><span class="token keyword">import</span> scrapy


<span class="token keyword">class</span> <span class="token class-name">WangyiproItem</span><span class="token punctuation">(</span>scrapy<span class="token punctuation">.</span>Item<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># define the fields for your item here like:</span>
    title <span class="token operator">=</span> scrapy<span class="token punctuation">.</span>Field<span class="token punctuation">(</span><span class="token punctuation">)</span>
    content <span class="token operator">=</span> scrapy<span class="token punctuation">.</span>Field<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre>
<p>settings.py</p>
<pre><code class="prism language-python"><span class="token comment"># Enable the downloader middleware</span>
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
DOWNLOADER_MIDDLEWARES <span class="token operator">=</span> <span class="token punctuation">{</span>
    <span class="token string">'wangyiPro.middlewares.WangyiproDownloaderMiddleware'</span><span class="token punctuation">:</span> <span
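class="token number">543</span><span class="token punctuation">,</span>
<span class="token punctuation">}</span>
</code></pre>
<h4>11.5.3. Spider middleware</h4>
<p>SpiderMiddleware</p>
<pre><code>@classmethod
from_crawler(cls, crawler)
    Called by Scrapy to create the middleware instance.
    The project template also connects the spider_opened signal handler here.

process_spider_input(self, response, spider)
    Should return None or raise an exception.
    Returning None lets the response through to be parsed by the spider.
    Raising an exception skips the callback; the exception is then handled by the
    request's errback or by process_spider_exception().

process_spider_output(self, response, result, spider)
    Returns an iterable of items and requests.
    Default:
        for r in result:
            yield r

process_spider_exception(self, response, exception, spider)
    Returns None or an iterable of Request/Item objects.

process_start_requests(self, start_requests, spider)
    Must return only Request objects (no items).
</code></pre>
<h4>11.5.4. Downloader middleware</h4>
<pre><code>process_request(self, request, spider)
    1. What can it return?
       None (continue processing this request), a Response object,
       a Request object, or it can raise IgnoreRequest.
    2. When is it called?
       For every request on its way to the downloader, before the download happens.

process_response(self, request, response, spider)
    1. What can it return?
       A Response object, a Request object, or it can raise IgnoreRequest.
    2. When is it called?
       When the downloader returns a response.

process_exception(self, request, exception, spider)
    1. What can it return?
       None (continue handling the exception), a Response object or a Request object.
    2. When is it called?
       When a download handler or a process_request() method raises an exception.
</code></pre>
<h3>11.6.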
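Summary of core modules and classes</h3>
<pre><code>scrapy.Spider    base class of ordinary spiders
    - name              spider name, used by the scrapy crawl command
    - start_urls        list of URLs to request first
    - allowed_domains   list of server domains the spider may visit
    - start_requests()  first method run after the spider starts (step one of the flow: issue the requests)
    - logger            the spider's logger
    - parse()           default callback that parses the response of a successful request
</code></pre>
<pre><code>scrapy.Request
    - constructor arguments: url, method, body, encoding, callback, headers, cookies, dont_filter, priority, meta
    - meta    a dict; can be used to set the proxy
</code></pre>
<pre><code>scrapy.http.Response/TextResponse/HtmlResponse
    - status        response status code
    - meta          metadata of the response; contains the meta of its request
    - url           the requested URL
    - request       the request object
    - headers       response headers
    - body          raw bytes
    - text          decoded text
    - css()/xpath() extract HTML element data (Scrapy selectors, built on parsel/lxml)
</code></pre>
<pre><code>scrapy.Item   a dict-like class. Purpose: when the parsed data has different structures,
              use a different Item class for each so the item pipelines can tell them apart.
scrapy.Field  a class used to declare the fields (data attributes) of an Item subclass
</code></pre>
<pre><code>scrapy.signals    signals
    - spider_opened   the spider was opened
    - spider_closed   the spider was closed
    - spider_error    the spider raised an exception
</code></pre>
<pre><code>Priorities:
    - Request priority: higher value == higher priority, lower value == lower priority
    Priorities configured in settings:
    - Pipeline priority:   higher value == lower priority, lower value == higher priority (runs first)
    - Middleware priority: higher value == lower priority, lower value == higher priority
</code></pre>
<h2>12. Rule-based crawling</h2>
<p><font color="blue" size="6"><strong>CrawlSpider</strong></font></p>
<p>CrawlSpider is a class whose parent is scrapy.Spider, so it has everything Spider offers plus features of its own.</p>
<p>CrawlSpider lets you define rules. While parsing HTML it can <strong>extract the links that match a rule and then send requests to those links</strong>, so whenever links need to be followed, CrawlSpider is a good fit.</p>
<h3>12.1.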
Workflow</h3>
<pre><code>- CrawlSpider: a class, a subclass of Spider
- Ways to crawl a whole site
    - With Spider: issue the follow-up requests by hand
    - With CrawlSpider
- Using CrawlSpider:
    - Create a project
    - cd XXX
    - Create the spider file (CrawlSpider):
        - scrapy genspider -t crawl xxx www.xxxx.com
    - Link extractor:
        - Purpose: extract the links that match a given rule (allow)
    - Rule:
        - Purpose: parse the pages behind the extracted links with the given callback

# Goal: crawl the item number, news title and news content from the sun site
- Analysis: the data is not all on one page.
    - 1. Use a link extractor to collect every pagination link
    - 2. Use a second link extractor to collect every news detail-page link
</code></pre>
<p>sun.py</p>
<pre><code class="prism language-python"><span class="token keyword">import</span> scrapy
<span class="token keyword">from</span> scrapy<span class="token punctuation">.</span>linkextractors <span class="token keyword">import</span> LinkExtractor
<span class="token keyword">from</span> scrapy<span class="token punctuation">.</span>spiders <span class="token keyword">import</span> CrawlSpider<span class="token punctuation">,</span> Rule
<span class="token keyword">from</span> sunPro<span class="token punctuation">.</span>items <span class="token keyword">import</span> SunproItem<span class="token punctuation">,</span> DetailItem


<span class="token comment"># Goal: crawl the item number, news title and news content from the sun site</span>
<span class="token keyword">class</span> <span class="token class-name">SunSpider</span><span class="token punctuation">(</span>CrawlSpider<span class="token punctuation">)</span><span class="token punctuation">:</span>
    name <span class="token operator">=</span> <span class="token string">'sun'</span>
    <span class="token comment"># allowed_domains = ['www.xxx.com']</span>
    start_urls <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token string">'http://wz.sun0769.com/index.php/question/questionType?type=4&page='</span><span class="token punctuation">]</span>

    <span class="token comment"># Link extractors: extract the links that match the given rule (allow="regex")</span>
    link <span class="token operator">=</span> LinkExtractor<span class="token punctuation">(</span>allow<span class="token operator">=</span>r<span class="token string">'type=4&page=\d+'</span><span class="token punctuation">)</span>
    link_detail <span class="token operator">=</span> LinkExtractor<span class="token punctuation">(</span>allow<span class="token operator">=</span>r<span class="token string">'question/\d+/\d+\.shtml'</span><span class="token punctuation">)</span>

    rules <span class="token operator">=</span> <span class="token punctuation">(</span>
        <span class="token comment"># Rule: parse the pages behind the extracted links with the given callback</span>
        <span class="token comment"># follow=True: keep applying the link extractor to the pages reached through the links it extracted</span>
        Rule<span class="token punctuation">(</span>link<span class="token punctuation">,</span> callback<span class="token operator">=</span><span class="token string">'parse_item'</span><span class="token punctuation">,</span> follow<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
        Rule<span class="token punctuation">(</span>link_detail<span class="token punctuation">,</span> callback<span class="token operator">=</span><span class="token string">'parse_detail'</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
    <span class="token punctuation">)</span>
    <span class="token comment"># Example detail-page URLs:</span>
    <span class="token comment"># http://wz.sun0769.com/html/question/201907/421001.shtml</span>
    <span class="token comment"># http://wz.sun0769.com/html/question/201907/420987.shtml</span>

    <span class="token comment"># Parse the news number and the news title.</span>
    <span class="token comment"># The two callbacks below cannot pass data to each other through request meta,</span>
    <span class="token comment"># so their results cannot be stored in one item; they go into two separate items instead.</span>
    <span class="token keyword">def</span> <span class="token function">parse_item</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> response<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token comment"># Note: the XPath expression must not contain a tbody tag</span>
        tr_list <span class="token operator">=</span> response<span class="token
punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="morelist"]/div/table[2]//tr/td/table//tr'</span><span class="token punctuation">)</span>
        <span class="token keyword">for</span> tr <span class="token keyword">in</span> tr_list<span class="token punctuation">:</span>
            new_num <span class="token operator">=</span> tr<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./td[1]/text()'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>extract_first<span class="token punctuation">(</span><span class="token punctuation">)</span>
            new_title <span class="token operator">=</span> tr<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./td[2]/a[2]/@title'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>extract_first<span class="token punctuation">(</span><span class="token punctuation">)</span>
            item <span class="token operator">=</span> SunproItem<span class="token punctuation">(</span><span class="token punctuation">)</span>
            item<span class="token punctuation">[</span><span class="token string">'title'</span><span class="token punctuation">]</span> <span class="token operator">=</span> new_title
            item<span class="token punctuation">[</span><span class="token string">'new_num'</span><span class="token punctuation">]</span> <span class="token operator">=</span> new_num
            <span class="token keyword">yield</span> item

    <span class="token comment"># Parse the news content and the news number</span>
    <span class="token keyword">def</span> <span class="token function">parse_detail</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> response<span class="token punctuation">)</span><span class="token punctuation">:</span>
        new_id <span class="token operator">=</span> response<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'/html/body/div[9]/table[1]//tr/td[2]/span[2]/text()'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>extract_first<span class="token punctuation">(</span><span class="token punctuation">)</span>
        new_content <span class="token operator">=</span> response<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'/html/body/div[9]/table[2]//tr[1]//text()'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>extract<span class="token punctuation">(</span><span class="token punctuation">)</span>
        new_content <span class="token operator">=</span> <span class="token string">''</span><span class="token punctuation">.</span>join<span class="token punctuation">(</span>new_content<span class="token punctuation">)</span>
        <span class="token comment"># print(new_id,new_content)</span>
        item <span class="token operator">=</span> DetailItem<span class="token punctuation">(</span><span class="token punctuation">)</span>
        item<span class="token punctuation">[</span><span class="token string">'content'</span><span class="token punctuation">]</span> <span class="token operator">=</span> new_content
        item<span class="token punctuation">[</span><span class="token string">'new_id'</span><span class="token punctuation">]</span> <span class="token operator">=</span> new_id
        <span class="token keyword">yield</span> item
</code></pre>
<p>items.py</p>
<pre><code class="prism language-python"><span class="token keyword">import</span> scrapy


<span class="token keyword">class</span> <span class="token class-name">SunproItem</span><span class="token punctuation">(</span>scrapy<span class="token punctuation">.</span>Item<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># define the fields for your item here like:</span>
    title <span class="token operator">=</span> scrapy<span class="token punctuation">.</span>Field<span class="token punctuation">(</span><span class="token punctuation">)</span>
    new_num <span class="token operator">=</span> scrapy<span class="token punctuation">.</span>Field<span class="token punctuation">(</span><span class="token punctuation">)</span>


<span class="token keyword">class</span> <span class="token class-name">DetailItem</span><span class="token punctuation">(</span>scrapy<span class="token punctuation">.</span>Item<span class="token punctuation">)</span><span class="token punctuation">:</span>
    new_id <span class="token operator">=</span> scrapy<span class="token punctuation">.</span>Field<span class="token punctuation">(</span><span class="token punctuation">)</span>
    content <span class="token operator">=</span> scrapy<span class="token punctuation">.</span>Field<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre>
<p>pipelines.py</p>
<pre><code class="prism language-python"><span class="token keyword">class</span> <span class="token class-name">SunproPipeline</span><span class="token punctuation">(</span><span class="token builtin">object</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">def</span> <span class="token function">process_item</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> item<span class="token punctuation">,</span> spider<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token comment"># How to tell which type of item this is</span>
        <span class="token comment"># (open question: how to keep the two kinds of data consistent when writing them to a database)</span>
        <span class="token keyword">if</span> item<span class="token punctuation">.</span>__class__<span class="token punctuation">.</span>__name__ <span class="token operator">==</span> <span class="token string">'DetailItem'</span><span class="token punctuation">:</span>
            <span class="token keyword">print</span><span class="token
punctuation">(</span>item<span class="token punctuation">[</span><span class="token string">'new_id'</span><span class="token punctuation">]</span><span class="token punctuation">,</span> item<span class="token punctuation">[</span><span class="token string">'content'</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
        <span class="token keyword">else</span><span class="token punctuation">:</span>
            <span class="token keyword">print</span><span class="token punctuation">(</span>item<span class="token punctuation">[</span><span class="token string">'new_num'</span><span class="token punctuation">]</span><span class="token punctuation">,</span> item<span class="token punctuation">[</span><span class="token string">'title'</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
        <span class="token keyword">return</span> item
</code></pre>
<p>settings.py</p>
<pre><code class="prism language-python"><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
ITEM_PIPELINES <span class="token operator">=</span> <span class="token punctuation">{</span>
    <span class="token string">'sunPro.pipelines.SunproPipeline'</span><span class="token punctuation">:</span> <span class="token number">300</span><span class="token punctuation">,</span>
<span class="token punctuation">}</span>
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
</code></pre>
<h2>13. Distributed crawling</h2>
<pre><code>- Distributed crawling
    - Concept: build a cluster of machines that jointly crawl one set of resources.
    - Purpose: speed up data collection.
    - How is it implemented?
        - Install the scrapy-redis component.
        - Plain Scrapy cannot crawl in a distributed fashion on its own; it has to be combined with scrapy-redis.
    - Why can't plain Scrapy do it?
        - Its scheduler cannot be shared across the cluster.
        - Its pipelines cannot be shared across the cluster.
    - What scrapy-redis provides:
        - A pipeline and a scheduler that can be shared, for use with the stock Scrapy framework.
    - Workflow
        - Create a project.
        - Create a CrawlSpider-based spider file.
        - Modify the spider file:
            - Import: from scrapy_redis.spiders import RedisCrawlSpider
            - Comment out start_urls and allowed_domains.
            - Add a new attribute: redis_key = 'sun'  (name of the shared scheduler queue)
            - Write the data-parsing code.
            - Change the spider's parent class to RedisCrawlSpider.
        - Edit the settings file:
            - Use the shareable pipeline:
                ITEM_PIPELINES = {
                    'scrapy_redis.pipelines.RedisPipeline': 400
                }
            - Configure the scheduler:
                # Dedup filter class: stores request fingerprints in a Redis set, so request de-duplication is persistent
                DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
                # Use the scheduler that ships with scrapy-redis
                SCHEDULER = "scrapy_redis.scheduler.Scheduler"
                # Whether the scheduler state persists, i.e. whether the request queue and the fingerprint set
                # in Redis are kept when the crawl ends. True keeps the data; False clears it.
                SCHEDULER_PERSIST = True
            - Specify the Redis server:
        - Redis configuration:
            - The Redis configuration file:
                - Linux or macOS: redis.conf
                - Windows: redis.windows.conf
            - Open the configuration file and change:
                - Delete the line bind 127.0.0.1
                - Turn off protected mode: change protected-mode yes to no
            - Start the Redis server with that configuration file:
                - redis-server CONFIG_FILE
            - Start the client:
                - redis-cli
        - Run the project:
            - scrapy runspider xxx.py
        - Push a start URL onto the scheduler queue:
            - The scheduler queue lives in Redis, so do this from the Redis client
            - lpush xxx www.xxx.com
        - The scraped data is stored in the Redis structure proName:items
</code></pre>
<p>Example:</p>
<p>day10 -> dm530 project</p>
<p>Storage: MongoDB</p>
<p>-> NoSQL仓库.xmind</p>
<h2>14. Incremental crawling</h2>
<pre><code>Incremental crawling
    - Concept: monitor a site for updates and crawl only the data that was newly published.
    - Analysis:
        - Specify a start URL.
        - Use CrawlSpider to collect the links to the other listing pages.
        - Use a Rule to request those listing pages.
        - Parse the source of each listing page for the URL of every movie detail page.
        - Core idea: check whether a movie detail-page URL has been requested before.
            - Store the detail-page URLs that were already crawled.
                - Keep them in a Redis set.
        - Request each new detail-page URL, then parse out the movie name and description.
        - Persist the results.
</code></pre>
<p>movie.py</p>
<pre><code class="prism language-python"><span class="token comment"># -*- coding: utf-8 -*-</span>
<span class="token keyword">import</span> scrapy
<span class="token keyword">from</span> scrapy<span class="token punctuation">.</span>linkextractors <span class="token keyword">import</span> LinkExtractor
<span class="token keyword">from</span> scrapy<span class="token punctuation">.</span>spiders <span class="token keyword">import</span> CrawlSpider<span class="token punctuation">,</span> Rule
<span class="token
keyword">from</span> redis <span class="token keyword">import</span> Redis
<span class="token keyword">from</span> moviePro<span class="token punctuation">.</span>items <span class="token keyword">import</span> MovieproItem


<span class="token keyword">class</span> <span class="token class-name">MovieSpider</span><span class="token punctuation">(</span>CrawlSpider<span class="token punctuation">)</span><span class="token punctuation">:</span>
    name <span class="token operator">=</span> <span class="token string">'movie'</span>
    <span class="token comment"># allowed_domains = ['www.ccc.com']</span>
    start_urls <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token string">'https://www.4567tv.tv/frim/index1.html'</span><span class="token punctuation">]</span>

    rules <span class="token operator">=</span> <span class="token punctuation">(</span>
        Rule<span class="token punctuation">(</span>LinkExtractor<span class="token punctuation">(</span>allow<span class="token operator">=</span>r<span class="token string">'/frim/index1-\d+\.html'</span><span class="token punctuation">)</span><span class="token punctuation">,</span> callback<span class="token operator">=</span><span class="token string">'parse_item'</span><span class="token punctuation">,</span> follow<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
    <span class="token punctuation">)</span>
    <span class="token comment"># Create the Redis connection object</span>
    conn <span class="token operator">=</span> Redis<span class="token punctuation">(</span>host<span class="token operator">=</span><span class="token string">'127.0.0.1'</span><span class="token punctuation">,</span> port<span class="token operator">=</span><span class="token number">6379</span><span class="token punctuation">)</span>

    <span class="token comment"># Parse the movie detail-page URLs out of each listing page</span>
    <span class="token keyword">def</span> <span class="token function">parse_item</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> response<span class="token punctuation">)</span><span class="token punctuation">:</span>
        li_list <span class="token operator">=</span> response<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'/html/body/div[1]/div/div/div/div[2]/ul/li'</span><span class="token punctuation">)</span>
        <span class="token keyword">for</span> li <span class="token keyword">in</span> li_list<span class="token punctuation">:</span>
            <span class="token comment"># Build the detail-page URL</span>
            detail_url <span class="token operator">=</span> <span class="token string">'https://www.4567tv.tv'</span> <span class="token operator">+</span> li<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div/a/@href'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>extract_first<span class="token punctuation">(</span><span class="token punctuation">)</span>
            <span class="token comment"># Add the URL to a Redis set; sadd returns 1 if it was new, 0 if it was already there</span>
            ex <span class="token operator">=</span> self<span class="token punctuation">.</span>conn<span class="token punctuation">.</span>sadd<span class="token punctuation">(</span><span class="token string">'urls'</span><span class="token punctuation">,</span> detail_url<span class="token punctuation">)</span>
            <span class="token keyword">if</span> ex <span class="token operator">==</span> <span class="token number">1</span><span class="token punctuation">:</span>
                <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'This URL has not been crawled yet; fetching its data'</span><span class="token punctuation">)</span>
                <span class="token keyword">yield</span> scrapy<span class="token punctuation">.</span>Request<span class="token punctuation">(</span>url<span class="token operator">=</span>detail_url<span class="token punctuation">,</span> callback<span class="token operator">=</span>self<span class="token punctuation">.</span>parse_detail<span class="token punctuation">)</span>
            <span class="token keyword">else</span><span class="token punctuation">:</span>
                <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'No update yet; there is no new data to crawl'</span><span class="token punctuation">)</span>

    <span class="token comment"># Parse the movie name and description from the detail page for persistent storage</span>
    <span class="token keyword">def</span> <span class="token function">parse_detail</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> response<span class="token punctuation">)</span><span class="token punctuation">:</span>
        item <span class="token operator">=</span> MovieproItem<span class="token punctuation">(</span><span class="token punctuation">)</span>
        item<span class="token punctuation">[</span><span class="token string">'name'</span><span class="token punctuation">]</span> <span class="token operator">=</span> response<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'/html/body/div[1]/div/div/div/div[2]/h1/text()'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>extract_first<span class="token punctuation">(</span><span class="token punctuation">)</span>
        item<span class="token punctuation">[</span><span class="token string">'desc'</span><span class="token punctuation">]</span> <span class="token operator">=</span> response<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'/html/body/div[1]/div/div/div/div[2]/p[5]/span[2]//text()'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>extract<span class="token punctuation">(</span><span class="token punctuation">)</span>
        item<span class="token punctuation">[</span><span class="token
string">'desc'</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token string">''</span><span class="token punctuation">.</span>join<span class="token punctuation">(</span>item<span class="token punctuation">[</span><span class="token string">'desc'</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token keyword">yield</span> item </code></pre> <p>pipelines.py</p> <pre><code class="prism language-python"><span class="token keyword">class</span> <span class="token class-name">MovieproPipeline</span><span class="token punctuation">(</span><span class="token builtin">object</span><span class="token punctuation">)</span><span class="token punctuation">:</span> conn <span class="token operator">=</span> <span class="token boolean">None</span> <span class="token keyword">def</span> <span class="token function">open_spider</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span>spider<span class="token punctuation">)</span><span class="token punctuation">:</span> self<span class="token punctuation">.</span>conn <span class="token operator">=</span> spider<span class="token punctuation">.</span>conn <span class="token keyword">def</span> <span class="token function">process_item</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> item<span class="token punctuation">,</span> spider<span class="token punctuation">)</span><span class="token punctuation">:</span> dic <span class="token operator">=</span> <span class="token punctuation">{ </span> <span class="token string">'name'</span><span class="token punctuation">:</span>item<span class="token punctuation">[</span><span class="token string">'name'</span><span class="token punctuation">]</span><span class="token punctuation">,</span> <span class="token string">'desc'</span><span class="token punctuation">:</span>item<span class="token 
punctuation">[</span><span class="token string">'desc'</span><span class="token punctuation">]</span> <span class="token punctuation">}</span> <span class="token comment"># print(dic)</span> self<span class="token punctuation">.</span>conn<span class="token punctuation">.</span>lpush<span class="token punctuation">(</span><span class="token string">'movieData'</span><span class="token punctuation">,</span>dic<span class="token punctuation">)</span> <span class="token keyword">return</span> item </code></pre> <p><a href="http://img.e-com-net.com/image/info8/0ff08718e7594a32932ee19bd8150369.jpg" target="_blank"><img src="http://img.e-com-net.com/image/info8/0ff08718e7594a32932ee19bd8150369.jpg" alt="爬虫文档学习 xpath bs4 selenium scrapy..._第1张图片" width="650" height="423" style="border:1px solid black;"></a></p> </div> </div> </div> </div> </div> <!--PC和WAP自适应版--> <div id="SOHUCS" sid="1354080890392752128"></div> <script type="text/javascript" src="/views/front/js/chanyan.js"></script> <!-- 文章页-底部 动态广告位 --> <div class="youdao-fixed-ad" id="detail_ad_bottom"></div> </div> <div class="col-md-3"> <div class="row" id="ad"> <!-- 文章页-右侧1 动态广告位 --> <div id="right-1" class="col-lg-12 col-md-12 col-sm-4 col-xs-4 ad"> <div class="youdao-fixed-ad" id="detail_ad_1"> </div> </div> <!-- 文章页-右侧2 动态广告位 --> <div id="right-2" class="col-lg-12 col-md-12 col-sm-4 col-xs-4 ad"> <div class="youdao-fixed-ad" id="detail_ad_2"></div> </div> <!-- 文章页-右侧3 动态广告位 --> <div id="right-3" class="col-lg-12 col-md-12 col-sm-4 col-xs-4 ad"> <div class="youdao-fixed-ad" id="detail_ad_3"></div> </div> </div> </div> </div> </div> </div> <div class="container"> <h4 class="pt20 mb15 mt0 border-top">你可能感兴趣的:(python)</h4> <div id="paradigm-article-related"> <div class="recommend-post mb30"> <ul class="widget-links"> <li><a href="/article/1835511912843014144.htm" title="理解Gunicorn:Python WSGI服务器的基石" target="_blank">理解Gunicorn:Python WSGI服务器的基石</a> <span class="text-muted">范范0825</span> <a class="tag" taget="_blank" 
href="/search/ipython/1.htm">ipython</a> </li> <li><a href="/article/2334.htm" title="Hibernate缓存总结" target="_blank">Hibernate缓存总结</a> <span class="text-muted">cuishikuan</span> <a class="tag" taget="_blank" href="/search/%E5%BC%80%E6%BA%90/1.htm">开源</a><a class="tag" taget="_blank" href="/search/ssh/1.htm">ssh</a><a class="tag" taget="_blank" href="/search/javaweb/1.htm">javaweb</a><a class="tag" taget="_blank" href="/search/hibernate%E7%BC%93%E5%AD%98/1.htm">hibernate缓存</a><a class="tag" taget="_blank" href="/search/%E4%B8%89%E5%A4%A7%E6%A1%86%E6%9E%B6/1.htm">三大框架</a> <div>一、为什么要用Hibernate缓存? Hibernate是一个持久层框架,经常访问物理数据库。 为了降低应用程序对物理数据源访问的频次,从而提高应用程序的运行性能。 缓存内的数据是对物理数据源中的数据的复制,应用程序在运行时从缓存读写数据,在特定的时刻或事件会同步缓存和物理数据源的数据。   二、Hibernate缓存原理是怎样的? 
Hibernate缓存包括两大类:Hib</div> </li> <li><a href="/article/2461.htm" title="CentOs6" target="_blank">CentOs6</a> <span class="text-muted">dalan_123</span> <a class="tag" taget="_blank" href="/search/centos/1.htm">centos</a> <div>首先su - 切换到root下面1、首先要先安装GCC GCC-C++ Openssl等以来模块:yum -y install make gcc gcc-c++ kernel-devel m4 ncurses-devel openssl-devel2、再安装ncurses模块yum -y install ncurses-develyum install ncurses-devel3、下载Erang</div> </li> <li><a href="/article/2588.htm" title="10款用 jquery 实现滚动条至页面底端自动加载数据效果" target="_blank">10款用 jquery 实现滚动条至页面底端自动加载数据效果</a> <span class="text-muted">dcj3sjt126com</span> <a class="tag" taget="_blank" href="/search/JavaScript/1.htm">JavaScript</a> <div>  无限滚动自动翻页可以说是web2.0时代的一项堪称伟大的技术,它让我们在浏览页面的时候只需要把滚动条拉到网页底部就能自动显示下一页的结果,改变了一直以来只能通过点击下一页来翻页这种常规做法。 无限滚动自动翻页技术的鼻祖是微博的先驱:推特(twitter),后来必应图片搜索、谷歌图片搜索、google reader、箱包批发网等纷纷抄袭了这一项技术,于是靠滚动浏览器滚动条</div> </li> <li><a href="/article/2715.htm" title="ImageButton去边框&Button或者ImageButton的背景透明" target="_blank">ImageButton去边框&Button或者ImageButton的背景透明</a> <span class="text-muted">dcj3sjt126com</span> <a class="tag" taget="_blank" href="/search/imagebutton/1.htm">imagebutton</a> <div>在ImageButton中载入图片后,很多人会觉得有图片周围的白边会影响到美观,其实解决这个问题有两种方法 一种方法是将ImageButton的背景改为所需要的图片。如:android:background="@drawable/XXX" 第二种方法就是将ImageButton背景改为透明,这个方法更常用 在XML里;    <ImageBut</div> </li> <li><a href="/article/2842.htm" title="JSP之c:foreach" target="_blank">JSP之c:foreach</a> <span class="text-muted">eksliang</span> <a class="tag" taget="_blank" href="/search/jsp/1.htm">jsp</a><a class="tag" taget="_blank" href="/search/forearch/1.htm">forearch</a> <div>原文出自:http://www.cnblogs.com/draem0507/archive/2012/09/24/2699745.html <c:forEach>标签用于通用数据循环,它有以下属性 属 性 描 述 是否必须 缺省值 items 进行循环的项目 否 无 begin 开始条件 否 0 end 结束条件 否 集合中的最后一个项目 step 步长 否 1</div> </li> <li><a href="/article/2969.htm" title="Android实现主动连接蓝牙耳机" target="_blank">Android实现主动连接蓝牙耳机</a> <span class="text-muted">gqdy365</span> <a class="tag" taget="_blank" 
href="/search/android/1.htm">android</a> <div>在Android程序中可以实现自动扫描蓝牙、配对蓝牙、建立数据通道。蓝牙分不同类型,这篇文字只讨论如何与蓝牙耳机连接。 大致可以分三步: 一、扫描蓝牙设备: 1、注册并监听广播: BluetoothAdapter.ACTION_DISCOVERY_STARTED BluetoothDevice.ACTION_FOUND BluetoothAdapter.ACTION_DIS</div> </li> <li><a href="/article/3096.htm" title="android学习轨迹之四:org.json.JSONException: No value for" target="_blank">android学习轨迹之四:org.json.JSONException: No value for</a> <span class="text-muted">hyz301</span> <a class="tag" taget="_blank" href="/search/json/1.htm">json</a> <div>org.json.JSONException: No value for items  在JSON解析中会遇到一种错误,很常见的错误   06-21 12:19:08.714 2098-2127/com.jikexueyuan.secret I/System.out﹕ Result:{"status":1,"page":1,&</div> </li> <li><a href="/article/3223.htm" title="干货分享:从零开始学编程 系列汇总" target="_blank">干货分享:从零开始学编程 系列汇总</a> <span class="text-muted">justjavac</span> <a class="tag" taget="_blank" href="/search/%E7%BC%96%E7%A8%8B/1.htm">编程</a> <div>程序员总爱重新发明轮子,于是做了要给轮子汇总。 从零开始写个编译器吧系列 (知乎专栏) 从零开始写一个简单的操作系统 (伯乐在线) 从零开始写JavaScript框架 (图灵社区) 从零开始写jQuery框架 (蓝色理想 ) 从零开始nodejs系列文章 (粉丝日志) 从零开始编写网络游戏 </div> </li> <li><a href="/article/3350.htm" title="jquery-autocomplete 使用手册" target="_blank">jquery-autocomplete 使用手册</a> <span class="text-muted">macroli</span> <a class="tag" taget="_blank" href="/search/jquery/1.htm">jquery</a><a class="tag" taget="_blank" href="/search/Ajax/1.htm">Ajax</a><a class="tag" taget="_blank" href="/search/%E8%84%9A%E6%9C%AC/1.htm">脚本</a> <div>jquery-autocomplete学习 一、用前必备 官方网站:http://bassistance.de/jquery-plugins/jquery-plugin-autocomplete/ 当前版本:1.1 需要JQuery版本:1.2.6 二、使用 <script src="./jquery-1.3.2.js" type="text/ja</div> </li> <li><a href="/article/3477.htm" title="PLSQL-Developer或者Navicat等工具连接远程oracle数据库的详细配置以及数据库编码的修改" target="_blank">PLSQL-Developer或者Navicat等工具连接远程oracle数据库的详细配置以及数据库编码的修改</a> <span class="text-muted">超声波</span> <a class="tag" taget="_blank" href="/search/oracle/1.htm">oracle</a><a class="tag" taget="_blank" href="/search/plsql/1.htm">plsql</a> <div>  
在服务器上将Oracle安装好之后接下来要做的就是通过本地机器来远程连接服务器端的oracle数据库,常用的客户端连接工具就是PLSQL-Developer或者Navicat这些工具了。刚开始也是各种报错,什么TNS:no listener;TNS:lost connection;TNS:target hosts...花了一天的时间终于让PLSQL-Developer和Navicat等这些客户</div> </li> <li><a href="/article/3604.htm" title="数据仓库数据模型之:极限存储--历史拉链表" target="_blank">数据仓库数据模型之:极限存储--历史拉链表</a> <span class="text-muted">superlxw1234</span> <a class="tag" taget="_blank" href="/search/%E6%9E%81%E9%99%90%E5%AD%98%E5%82%A8/1.htm">极限存储</a><a class="tag" taget="_blank" href="/search/%E6%95%B0%E6%8D%AE%E4%BB%93%E5%BA%93/1.htm">数据仓库</a><a class="tag" taget="_blank" href="/search/%E6%95%B0%E6%8D%AE%E6%A8%A1%E5%9E%8B/1.htm">数据模型</a><a class="tag" taget="_blank" href="/search/%E6%8B%89%E9%93%BE%E5%8E%86%E5%8F%B2%E8%A1%A8/1.htm">拉链历史表</a> <div>在数据仓库的数据模型设计过程中,经常会遇到这样的需求: 1. 数据量比较大; 2. 表中的部分字段会被update,如用户的地址,产品的描述信息,订单的状态等等; 3. 需要查看某一个时间点或者时间段的历史快照信息,比如,查看某一个订单在历史某一个时间点的状态,    比如,查看某一个用户在过去某一段时间内,更新过几次等等; 4. 变化的比例和频率不是很大,比如,总共有10</div> </li> <li><a href="/article/3731.htm" title="10点睛Spring MVC4.1-全局异常处理" target="_blank">10点睛Spring MVC4.1-全局异常处理</a> <span class="text-muted">wiselyman</span> <a class="tag" taget="_blank" href="/search/spring+mvc/1.htm">spring mvc</a> <div>10.1 全局异常处理 使用@ControllerAdvice注解来实现全局异常处理; 使用@ControllerAdvice的属性缩小处理范围 10.2 演示 演示控制器 package com.wisely.web; import org.springframework.stereotype.Controller; import org.spring</div> </li> </ul> </div> </div> </div> <div> <div class="container"> <div class="indexes"> <strong>按字母分类:</strong> <a href="/tags/A/1.htm" target="_blank">A</a><a href="/tags/B/1.htm" target="_blank">B</a><a href="/tags/C/1.htm" target="_blank">C</a><a href="/tags/D/1.htm" target="_blank">D</a><a href="/tags/E/1.htm" target="_blank">E</a><a href="/tags/F/1.htm" target="_blank">F</a><a href="/tags/G/1.htm" target="_blank">G</a><a href="/tags/H/1.htm" target="_blank">H</a><a href="/tags/I/1.htm" target="_blank">I</a><a href="/tags/J/1.htm" target="_blank">J</a><a href="/tags/K/1.htm" target="_blank">K</a><a 
href="/tags/L/1.htm" target="_blank">L</a><a href="/tags/M/1.htm" target="_blank">M</a><a href="/tags/N/1.htm" target="_blank">N</a><a href="/tags/O/1.htm" target="_blank">O</a><a href="/tags/P/1.htm" target="_blank">P</a><a href="/tags/Q/1.htm" target="_blank">Q</a><a href="/tags/R/1.htm" target="_blank">R</a><a href="/tags/S/1.htm" target="_blank">S</a><a href="/tags/T/1.htm" target="_blank">T</a><a href="/tags/U/1.htm" target="_blank">U</a><a href="/tags/V/1.htm" target="_blank">V</a><a href="/tags/W/1.htm" target="_blank">W</a><a href="/tags/X/1.htm" target="_blank">X</a><a href="/tags/Y/1.htm" target="_blank">Y</a><a href="/tags/Z/1.htm" target="_blank">Z</a><a href="/tags/0/1.htm" target="_blank">其他</a> </div> </div> </div> <footer id="footer" class="mb30 mt30"> <div class="container"> <div class="footBglm"> <a target="_blank" href="/">首页</a> - <a target="_blank" href="/custom/about.htm">关于我们</a> - <a target="_blank" href="/search/Java/1.htm">站内搜索</a> - <a target="_blank" href="/sitemap.txt">Sitemap</a> - <a target="_blank" href="/custom/delete.htm">侵权投诉</a> </div> <div class="copyright">版权所有 IT知识库 CopyRight © 2000-2050 E-COM-NET.COM , All Rights Reserved. <!-- <a href="https://beian.miit.gov.cn/" rel="nofollow" target="_blank">京ICP备09083238号</a><br>--> </div> </div> </footer> <!-- 代码高亮 --> <script type="text/javascript" src="/static/syntaxhighlighter/scripts/shCore.js"></script> <script type="text/javascript" src="/static/syntaxhighlighter/scripts/shLegacy.js"></script> <script type="text/javascript" src="/static/syntaxhighlighter/scripts/shAutoloader.js"></script> <link type="text/css" rel="stylesheet" href="/static/syntaxhighlighter/styles/shCoreDefault.css"/> <script type="text/javascript" src="/static/syntaxhighlighter/src/my_start_1.js"></script> </body> </html>