18. Python Web Scraping: the Scrapy Framework

The Scrapy framework

  • 01. Scrapy links
  • 02. The Scrapy crawl workflow
  • 03. Getting started with Scrapy
  • 04. Common settings in settings.py
    • 4.1. Using the logging module
    • 4.2. Common settings.py configuration for a Scrapy project (to be continued)
  • 05. Scrapy example: Qiushibaike spider
  • 06. scrapy.Request essentials
  • 07. How the parse() method works
  • 08. CrawlSpider spiders
    • CrawlSpider example: WeChat mini-program forum
  • 09. Sending POST requests with Scrapy (Renren login example)
  • 10. Scrapy Douban login example (captcha recognition) (to be written)
  • 11. Downloading images and files with Scrapy (Autohome BMW 5 Series HD images)
  • 12. Downloading images and files with a CrawlSpider (Autohome BMW 5 Series HD images)
  • 13. Downloader middleware: setting random request headers
  • 14. [IP proxy middleware (Kuaidaili)](https://pan.baidu.com/s/1U6KnIFOYhS9NT7iXd4t84g)
  • 15. Scrapy Shell
  • 16. Beating Boss Zhipin's anti-scraping measures (needs revision)
  • 17. Scraping dynamic pages
    • 17.1. Installing Selenium
    • 17.2. Installing chromedriver
    • 17.3. A first small example
    • 17.4. Locating elements
    • 17.5. Working with form elements in Selenium
    • 17.6. Action chains
    • 17.7. Cookie operations
    • 17.8. Page waits
    • 17.9. Switching windows
    • 17.10. Using a proxy with Selenium
    • WebElement objects
  • 18. Scraping Lagou with Selenium
  • 19. Scrapy + Selenium: crawling the whole Jianshu site and storing it in MySQL
  • 20. Setting a proxy and UserAgent with Selenium
  • 21. [http://httpbin.org test endpoint notes](https://blog.csdn.net/chang995196962/article/details/91362364)

01. Scrapy links

  • Scrapy documentation maintained in Chinese
  • Official Scrapy website

02. The Scrapy crawl workflow

(Figure: Scrapy architecture diagram)

  • Scrapy Engine
    • The coordinator: passes data and signals between the other components (already implemented by Scrapy)
  • Scheduler
    • A queue that holds the requests handed over by the engine (already implemented by Scrapy)
  • Downloader
    • Downloads the requests handed over by the engine and returns the responses to the engine (already implemented by Scrapy)
  • Spider
    • Processes the responses passed in by the engine, extracts data and URLs, and hands them back to the engine (written by you)
  • Item Pipeline
    • Processes the items handed over by the engine, e.g. stores them (written by you)
  • Downloader Middlewares
    • Customisable download extensions, e.g. for setting proxies, request headers, cookies
  • Spider Middlewares
    • Customisable hooks for processing requests and filtering responses

03. Getting started with Scrapy

  • Install: conda install scrapy

  • Create a Scrapy project
    scrapy startproject mySpider

  • Generate a spider
    scrapy genspider xiaofan "xiaofan.com"  (scrapy genspider <spider name> <allowed domain>)

  • Extract the data
    flesh out the spider, using XPath and similar methods

  • Save the data
    store the scraped data in a pipeline

  • Run the spider (from the command line)
    scrapy crawl <spider name>

  • Run the spider from a script

    • Create start.py in the project root and simply run that file
    from scrapy import cmdline
    
    cmdline.execute('scrapy crawl qsbk_spider'.split())
    
  • Running several Scrapy spiders at once (a sketch of the crawlall command module follows below)

    from scrapy import cmdline
    
    cmdline.execute('scrapy crawlall'.split())
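
    The crawlall command used above is not built into Scrapy; it has to be provided as a custom command and registered through COMMANDS_MODULE (see item 8 in section 4.2). A minimal sketch of such a command, with the module path and class layout assumed rather than taken from this project:

    # commands/crawlall.py  (assumed location; the commands package needs an __init__.py)
    from scrapy.commands import ScrapyCommand


    class Command(ScrapyCommand):
        requires_project = True

        def short_desc(self):
            return 'Run every spider in the project'

        def run(self, args, opts):
            # schedule every spider known to the project, then start the reactor once
            for spider_name in self.crawler_process.spider_loader.list():
                self.crawler_process.crawl(spider_name)
            self.crawler_process.start()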
    
  • The simplest way to save scraped data is the -o flag, which writes the items out in one of four formats:

    # JSON (single array), non-ASCII characters escaped by default
    scrapy crawl itcast -o teachers.json

    # JSON lines (one object per line), non-ASCII characters escaped by default
    scrapy crawl itcast -o teachers.jsonlines

    # CSV, comma-separated, can be opened in Excel
    scrapy crawl itcast -o teachers.csv

    # XML
    scrapy crawl itcast -o teachers.xml
    
  • Project layout and the role of the main files
    (Figure: screenshot of the generated project structure)

04. Common settings in settings.py

4.1. Using the logging module

  • In a Scrapy project
    • set LOG_LEVEL = "WARNING" in settings
    • set LOG_FILE = "./a.log" in settings  # where the log is written; once set, log output no longer shows in the terminal
    • import logging and instantiate a logger to write log messages from any file (see the sketch below)
  • In a plain Python project
    • import logging
    • logging.basicConfig(...)  # configure the output style and format
    • create a logger: logger = logging.getLogger(__name__)
    • call that logger from any .py file
      (Figure: example of the logging output)
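
A minimal sketch of the Scrapy-project pattern described above; the spider name and URL are placeholders, and the LOG_* values are assumed to be set in settings.py:

import logging

import scrapy

# In a plain (non-Scrapy) project you would configure the output yourself, e.g.:
# logging.basicConfig(level=logging.INFO,
#                     format='%(asctime)s %(filename)s [%(levelname)s]: %(message)s')

logger = logging.getLogger(__name__)


class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['http://example.com']

    def parse(self, response):
        # shows up in the terminal, or in LOG_FILE once that setting is defined
        logger.warning('parsed %s', response.url)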

4.2. Common settings.py configuration for a Scrapy project (to be continued)

# 1. Imports
import logging
import datetime
import os


# 2. Project name  TODO change per project
BOT_NAME = 'position_project'

# 3. Module names
SPIDER_MODULES = ['{}.spiders'.format(BOT_NAME)]
NEWSPIDER_MODULE = '{}.spiders'.format(BOT_NAME)

# 4. Obey robots.txt (defaults to True)
ROBOTSTXT_OBEY = False

# 5. User agent (the browser identity to present)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 ' \
             'Safari/537.36 '

# 6. Default request headers (USER_AGENT is configured separately above)
DEFAULT_REQUEST_HEADERS = {
    "authority": "www.zhipin.com",
    "method": "GET",
    "path": "/c101010100/?query=python&page=1",
    "scheme": "https",
    "accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "accept-encoding":"gzip, deflate, br",
    "accept-language":"zh-CN,zh;q=0.9",
    "cache-control":"max-age=0",
    "sec-fetch-mode":"navigate",
    "sec-fetch-site":"none",
    "sec-fetch-user":"?1",
    "upgrade-insecure-requests":"1",
    "cookie":"_uab_collina=155192752626463196786582; lastCity=101010100; _bl_uid=nCk6U2X3qyL0knn41r97gqj6tbaI; __c=1577356639; __g=-; __l=l=%2Fwww.zhipin.com%2Fweb%2Fcommon%2Fsecurity-check.html%3Fseed%3D4xwicvOb7q2EkZGCt80nTLZ0vDg%252BzlibDrgh%252F8ybn%252BU%253D%26name%3D89ea5a4b%26ts%3D1577356638307%26callbackUrl%3D%252Fc101010100%252F%253Fquery%253Dpython%2526page%253D1%26srcReferer%3D&r=&friend_source=0&friend_source=0; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1577356640; toUrl=https%3A%2F%2Fwww.zhipin.com%2Fc101010100%2F%3Fquery%3Dpython%26page%3D1%26ka%3Dpage-1; __a=29781409.1551927520.1573210066.1577356639.145.7.53.84; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1577413477; __zp_stoken__=7afdOJ%2Bdzh7nyTlE0EwBT40ChjblHK0zWyGrgNKjNseeImeToJrFVjotrvwrJmc4SAz4ALJJLFiwM6VXR8%2FhRZvbdbnbdscb5I9tbPbE0vSsxADMIDYNDK7qJTzOfZJNR7%2BP",
    "referer":"https://www.zhipin.com/c101010100/?query=python&page=1",

}


# 7. Log file naming: a new log file is started every minute
time_str = datetime.datetime.strftime(datetime.datetime.now(), '%Y-%m-%d %H-%M')
LOG_FILE = '{}\\{}\\logs\\{}.log'.format(os.getcwd(), BOT_NAME, time_str)
LOG_LEVEL = 'DEBUG'

# 8. Custom command module (e.g. crawlall) for running several spiders at once
COMMANDS_MODULE = '{}.commands'.format(BOT_NAME)

# 9. Keep Chinese characters readable in Scrapy's JSON output (https://www.cnblogs.com/linkr/p/7995454.html)
FEED_EXPORT_ENCODING = 'utf-8'

# 10. Item pipelines; the lower the value, the earlier the pipeline runs  TODO change per project
ITEM_PIPELINES = {
   '{}.pipelines.PositionProjectPipeline'.format(BOT_NAME): 300,
}

# 11. Throttle the crawl: delay between requests, in seconds
DOWNLOAD_DELAY = 1

# 12. Downloader middlewares  TODO change per project
DOWNLOADER_MIDDLEWARES = {
   '{}.middlewares.RandomUserAgent'.format(BOT_NAME): 1,
}

# 13. Disable cookies
COOKIES_ENABLED = False

05. Scrapy example: Qiushibaike spider

  • qsbk_spider.py
# -*- coding: utf-8 -*-
import scrapy

from qsbk.items import QsbkItem


class QsbkSpiderSpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/page/1/']
    base_domain = "https://www.qiushibaike.com"

    def parse(self, response):
        duanzidivs = response.xpath("//div[@id='content-left']/div")
        for duanzidiv in duanzidivs:
            author = duanzidiv.xpath(".//h2/text()").extract_first().strip()
            content = duanzidiv.xpath(".//div[@class='content']//text()").extract()
            item = QsbkItem(author=author, content=content)
            yield item
        # follow the next page
        next_url = response.xpath("//ul[@class='pagination']/li[last()]/a/@href").get()
        if not next_url:
            return
        else:
            yield scrapy.Request(self.base_domain + next_url, callback=self.parse)
            
  • items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QsbkItem(scrapy.Item):
    author = scrapy.Field()
    content = scrapy.Field()

  • pipelines.py, basic approach (manual json.dumps)
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class QsbkPipeline(object):

    def __init__(self):
        self.fp = open("duanzi.json", "w", encoding="utf-8")

    def open_spider(self, spider):
        print("爬虫开始了...")

    def process_item(self, item, spider):
        item_json = json.dumps(dict(item), indent=4, ensure_ascii=False)
        self.fp.write(item_json+"\n")
        return item

    def close_spider(self, spider):
        self.fp.close()
        print("爬虫结束了...")

  • pipelines.py, exporter approach 1: JsonItemExporter (keeps every item in memory until the end)
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
from scrapy.exporters import JsonItemExporter


class QsbkPipeline(object):

    def __init__(self):
        self.fp = open("duanzi.json", "wb")
        self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding="utf-8", indent=4)
        self.exporter.start_exporting()

    def open_spider(self, spider):
        print("爬虫开始了...")

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.fp.close()
        print("爬虫结束了...")

  • pipelines.py, exporter approach 2: JsonLinesItemExporter (writes each item as it arrives)
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exporters import JsonLinesItemExporter


class QsbkPipeline(object):

    def __init__(self):
        self.fp = open("duanzi.json", "wb")
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding="utf-8", indent=4)

    def open_spider(self, spider):
        print("爬虫开始了...")

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
        print("爬虫结束了...")

  • Exporting to a CSV file
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exporters import JsonLinesItemExporter, CsvItemExporter


class QsbkPipeline(object):
    def __init__(self):
        self.fp = open("qsbk.csv", "wb")
        self.exporter = CsvItemExporter(self.fp,  encoding='utf-8')

    def open_spider(self, spider):
        print('Spider started...')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        print('Spider finished...')
        self.fp.close()

06. scrapy.Request essentials

  • An excerpt of the Request source:
# excerpt
class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None, 
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None):

        self._encoding = encoding  # this one has to be set first
        self.method = str(method).upper()
        self._set_url(url)
        self._set_body(body)
        assert isinstance(priority, int), "Request priority not an integer: %r" % priority
        self.priority = priority

        assert callback or not errback, "Cannot use errback without a callback"
        self.callback = callback
        self.errback = errback

        self.cookies = cookies or {}
        self.headers = Headers(headers or {}, encoding=encoding)
        self.dont_filter = dont_filter

        self._meta = dict(meta) if meta else None

    @property
    def meta(self):
        if self._meta is None:
            self._meta = {}
        return self._meta
  • The most commonly used parameters:
url: the URL to request and process next

callback: the function that will handle the Response returned for this request

method: usually left at the default GET; can be set to "GET", "POST", "PUT" and so on, and must be uppercase

headers: headers sent with the request; usually not needed. Typical content:
        # familiar if you have ever written a crawler by hand
        Host: media.readthedocs.org
        User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0
        Accept: text/css,*/*;q=0.1
        Accept-Language: zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3
        Accept-Encoding: gzip, deflate
        Referer: http://scrapy-chs.readthedocs.org/zh_CN/0.24/
        Cookie: _ga=GA1.2.1612165614.1415584110;
        Connection: keep-alive
        If-Modified-Since: Mon, 25 Aug 2014 21:59:35 GMT
        Cache-Control: max-age=0

meta: frequently used; a dict for passing data between requests

        request_with_cookies = Request(
            url="http://www.example.com",
            cookies={'currency': 'USD', 'country': 'UY'},
            meta={'dont_merge_cookies': True}
        )

encoding: just keep the default 'utf-8'

dont_filter: tells the scheduler not to filter this request out as a duplicate; useful when you deliberately issue the same request more than once. Defaults to False.

errback: the function to call if the request fails
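
A short sketch that ties the common parameters together (the URLs and field names are placeholders, not taken from any project above): the first request carries data through meta, the callback reads it back from response.meta, and errback logs failed downloads:

import scrapy


class RequestDemoSpider(scrapy.Spider):
    name = 'request_demo'
    start_urls = ['http://example.com/list']

    def parse(self, response):
        for href in response.xpath('//a/@href').getall():
            yield scrapy.Request(
                response.urljoin(href),
                callback=self.parse_detail,
                errback=self.on_error,
                meta={'list_url': response.url},  # handed on to the callback
                dont_filter=False,                # keep the duplicate filter on
            )

    def parse_detail(self, response):
        # whatever was put into meta above is available here
        yield {'list_url': response.meta['list_url'], 'detail_url': response.url}

    def on_error(self, failure):
        # failure is a twisted Failure wrapping the original exception
        self.logger.error('request failed: %r', failure)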

07. How the parse() method works

  1. Because parse uses yield rather than return, it is treated as a generator; Scrapy pulls the results out one at a time and checks what type each one is.
  2. Requests are added to the crawl queue, items are handed to the pipelines, and anything else raises an error.
  3. When Scrapy pulls a request it does not send it immediately; it just puts it on the queue and keeps pulling from the generator.
  4. Once the requests yielded so far have been queued, the items follow; each item goes to the matching pipeline for processing.
  5. parse() is attached to requests as their callback, e.g. scrapy.Request(url, callback=self.parse).
  6. Scheduled requests eventually produce response objects, which are fed back into parse(); this repeats until the scheduler has no requests left (a recursive pattern).
  7. When the generator is exhausted, parse() is done and the engine carries on with whatever remains in the queue and the pipelines.
  8. Before the items from a page are processed, the requests already sitting in the queue are handled first; only then are the items extracted.
  9. The Scrapy engine and scheduler take care of all of this.

08. CrawlSpider spiders

  • Creation command: scrapy genspider -t crawl <spider name> <domain to crawl>

CrawlSpider example: WeChat mini-program forum

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from wxapp.items import WxappItem


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=2']

    rules = (
        # rule for the list pages: they are only followed for their detail links and do not need parsing themselves
        Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=1'), follow=True),
        # rule for the detail pages: do not follow further links from them, which avoids duplicates
        Rule(LinkExtractor(allow=r'.+article-.+\.html'), callback="parse_detail", follow=False)
    )

    def parse_detail(self, response):
        title = response.xpath("//div[@class='cl']/h1/text()").get()
        item = WxappItem(title=title)
        return item

  • Note: never use parse as the callback. CrawlSpider relies on the parse method to implement its own logic, so if you override parse, the CrawlSpider will fail to work.

09. Sending POST requests with Scrapy (Renren login example)

  • Use yield scrapy.FormRequest(url, formdata, callback) to send a POST request.
  • If the very first request of the crawl needs to be a POST, override the spider's start_requests(self) method; the URLs in start_urls will then no longer be requested.
# -*- coding: utf-8 -*-
import scrapy


class RenrenSpider(scrapy.Spider):
    name = 'renren'
    allowed_domains = ['renren.com']
    start_urls = ['http://renren.com/']

    def start_requests(self):
        """
        Override start_requests to simulate logging in to Renren.
        """
        url = "http://www.renren.com/PLogin.do"
        data = {"email": "[email protected]", "password": "fanjianhaiabc123"}
        # POST requests go through FormRequest; this performs the login
        request = scrapy.FormRequest(url, formdata=data, callback=self.parse_page)
        yield request

    def parse_page(self, response):
        """
        After a successful login, visit the personal profile page.
        """
        # plain GET request for the profile page
        request = scrapy.Request(url="http://www.renren.com/446858319/profile", callback=self.parse_profile)
        yield request

    def parse_profile(self, response):
        with open("profile.html", "w", encoding="utf-8") as fp:
            fp.write(response.text)

10. Scrapy Douban login example (captcha recognition) (to be written)

11. Downloading images and files with Scrapy (Autohome BMW 5 Series HD images)

  • Approach 1: download the images by hand with urllib
    bmw5_spider.py
# -*- coding: utf-8 -*-
import scrapy
from bmw5.items import Bmw5Item


class Bmw5SpiderSpider(scrapy.Spider):
    name = 'bmw5_spider'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html']

    def parse(self, response):
        uiboxs = response.xpath("//div[@class='uibox']")[1:]
        for uibox in uiboxs:
            category = uibox.xpath(".//div[@class='uibox-title']/a/text()").get()
            print(category)
            urls = uibox.xpath(".//ul/li/a/img/@src").getall()
            urls = list(map(lambda url: response.urljoin(url), urls))
            item = Bmw5Item(category=category, urls=urls)
            yield item

  • pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
from urllib import request


class Bmw5Pipeline(object):

    def __init__(self):
        self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
        if not os.path.exists(self.path):
            os.mkdir(self.path)

    def process_item(self, item, spider):
        category = item['category']
        urls = item['urls']

        category_path = os.path.join(self.path, category)
        if not os.path.exists(category_path):
            os.mkdir(category_path)
        for url in urls:
            image_name = url.split('_')[-1]
            request.urlretrieve(url, os.path.join(category_path, image_name))
        return item

  • items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class Bmw5Item(scrapy.Item):
    category = scrapy.Field()
    urls = scrapy.Field()
  • Approach 2: the built-in Images Pipeline
    1. Define an Item with two fields, image_urls and images; image_urls holds the list of image URLs to download.
    2. When a download finishes, the download details (storage path, source URL, image checksum, etc.) are stored in the item's images field.
    3. Set IMAGES_STORE in settings.py; it is the directory the images are saved into.
    4. Enable the pipeline by adding scrapy.pipelines.images.ImagesPipeline: 1 to ITEM_PIPELINES.

The built-in Files Pipeline (a settings sketch covering both pipelines follows this list)

    1. Define an Item with two fields, file_urls and files; file_urls holds the list of file URLs to download.
    2. When a download finishes, the download details (storage path, source URL, checksum, etc.) are stored in the item's files field.
    3. Set FILES_STORE in settings.py; it is the directory the files are saved into.
    4. Enable the pipeline by adding scrapy.pipelines.files.FilesPipeline: 1 to ITEM_PIPELINES.
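
A minimal settings.py sketch for the two built-in pipelines just described; the directory names are assumptions, any writable path works:

import os

# where ImagesPipeline / FilesPipeline save the downloaded content
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
FILES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'files')

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
    # or, for plain files:
    # 'scrapy.pipelines.files.FilesPipeline': 1,
}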

Customising the Images Pipeline

bmw5_spider.py

# -*- coding: utf-8 -*-
import scrapy
from bmw5.items import Bmw5Item


class Bmw5SpiderSpider(scrapy.Spider):
    name = 'bmw5_spider'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html']

    def parse(self, response):
        uiboxs = response.xpath("//div[@class='uibox']")[1:]
        for uibox in uiboxs:
            category = uibox.xpath(".//div[@class='uibox-title']/a/text()").get()
            print(category)
            urls = uibox.xpath(".//ul/li/a/img/@src").getall()
            urls = list(map(lambda url: response.urljoin(url), urls))
            item = Bmw5Item(category=category, image_urls=urls)
            yield item

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os

from scrapy.pipelines.images import ImagesPipeline

from bmw5.settings import IMAGES_STORE


class BMWImagesPipeline(ImagesPipeline):
    """
    Custom image download pipeline
    """
    def get_media_requests(self, item, info):
        # called before the download requests are sent;
        # in fact it is this method itself that issues the download requests
        request_objs = super(BMWImagesPipeline, self).get_media_requests(item, info)

        for request_obj in request_objs:
            request_obj.item = item

        return request_objs

    def file_path(self, request, response=None, info=None):
        # called when the image is about to be stored; returns the storage path for this image
        path = super(BMWImagesPipeline, self).file_path(request, response, info)
        # read the category from the request's item
        category = request.item['category']
        image_store = IMAGES_STORE
        category_path = os.path.join(image_store, category)
        if not os.path.exists(category_path):
            os.mkdir(category_path)

        image_name = path.replace("full/", "")
        image_path = os.path.join(category_path, image_name)

        return image_path

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class Bmw5Item(scrapy.Item):
    category = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

settings.py

ITEM_PIPELINES = {
    # 'bmw5.pipelines.Bmw5Pipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'bmw5.pipelines.BMWImagesPipeline': 1,
}

12. Downloading images and files with a CrawlSpider (Autohome BMW 5 Series HD images)

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from bmw5.items import Bmw5Item


class Bmw5SpiderSpider(CrawlSpider):
    name = 'bmw5_spider'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html']

    rules = {
        Rule(LinkExtractor(allow=r"https://car.autohome.com.cn/pic/series/65.+"), callback="parse_page", follow=True),
    }

    def parse_page(self, response):
        category = response.xpath("//div[@class='uibox']/div/text()").get()
        srcs = response.xpath("//div[contains(@class,'uibox-con')]/ul/li//img/@src").getall()
        srcs = list(map(lambda url: response.urljoin(url.replace("240x180_0_q95_c42", "1024x0_1_q95")), srcs))
        item = Bmw5Item(category=category, image_urls=srcs)
        yield item

13. Downloader middleware: setting random request headers

  • Pick a random request header for every request (Chrome, Firefox, Safari, ...)
  • Link to a collection of User-Agent strings

httpbin.py

# -*- coding: utf-8 -*-
import scrapy
import json


class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/user-agent']

    def parse(self, response):
        useragent = json.loads(response.text)['user-agent']
        print('=' * 30)
        print(useragent)
        print('=' * 30)
        yield scrapy.Request(self.start_urls[0], dont_filter=True)

middlewares.py

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

import random


class UserAgentDownloadMiddleware(object):
    USER_AGENTS = ['Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko)',
                   'Mozilla/5.0 (compatible; ABrowse 0.4; Syllable)',
                   'Mozilla/4.0 (compatible; MSIE 7.0; America Online Browser 1.1; Windows NT 5.1; (R1 1.5); .NET CLR 2.0.50727; InfoPath.1)']

    def process_request(self, request, spider):
        """
        Called by the downloader before each request is sent; a good place to set a random proxy IP, request headers, and so on.
        request: the request object about to be sent
        spider: the spider that issued the request
        Return value:
            1. None: Scrapy keeps processing this request and runs the remaining middlewares
            2. A Response object: Scrapy skips the other process_request methods and returns this response directly;
                the process_response() methods of the active middlewares are still called for it
            3. A Request object: the original request is dropped and the returned request is downloaded instead
            4. If this method raises an exception, process_exception is called
        """
        useragent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent'] = useragent

  • middlewares.py, improved version
    • Note: USER_AGENT_LIST is kept in a separate module (a sketch of that module follows the code below); see also the random-request-header setup above
import random
from position_project.conf.user_agent import USER_AGENT_LIST


class RandomUserAgent(object):

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)
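
The USER_AGENT_LIST imported above is assumed to live in a small conf/user_agent.py module inside the project; a minimal sketch (the strings are just sample desktop browsers, extend the list as needed):

# position_project/conf/user_agent.py  (the conf package needs an __init__.py)
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.2 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:68.0) Gecko/20100101 Firefox/68.0',
]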

14. IP proxy middleware (Kuaidaili)

  • Open (shared) proxies

class IPProxyDownloadMiddleware(object):
    """
    Open proxies (these are paid, not free, proxies)
    """
    PROXIES = ["178.44.170.152:8080", "110.44.113.182:8000"]
    def process_request(self, request, spider):
        proxy = random.choice(self.PROXIES)
        request.meta['proxy'] = proxy
  • Dedicated proxy
import base64
class IPPxoxyDownloadMiddleware(object):
    """
    Dedicated proxy with Basic authentication
    """
    def process_request(self,request, spider):
        proxy = '121.199.6.124:16816'
        user_password = '970138074:rcdj35ur'
        request.meta['proxy'] = proxy
        # bytes
        b64_user_password = base64.b64encode(user_password.encode("utf-8"))
        request.headers["Proxy-Authorization"] = 'Basic ' + b64_user_password.decode("utf-8")

15. Scrapy Shell

  • Launch command
scrapy shell "http://www.itcast.cn/channel/teacher.shtml"
  • Use response inside the shell to try out XPath expressions (examples below)
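
A few typical calls inside the shell; the fetched page is exposed as response, and the XPath below is only an example for that page:

# status and final URL of the fetched page
response.status
response.url

# try out XPath expressions interactively before putting them in a spider
response.xpath("//div[@class='li_txt']/h3/text()").getall()

# open the response Scrapy actually received in a browser
view(response)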

16. Beating Boss Zhipin's anti-scraping measures (needs revision)

  • spiders.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from boss.items import BossItem


class ZhipingSpider(CrawlSpider):
    name = 'zhipin'
    allowed_domains = ['zhipin.com']
    start_urls = ['http://www.zhipin.com/c101010100/?query=python&page=1']

    rules = (
        # rule for the list pages, e.g. https://www.zhipin.com/c101010100/?query=python&page=1
        Rule(LinkExtractor(allow=r'.+\?query=python&page=\d+'), follow=True),
        # rule for the detail pages
        Rule(LinkExtractor(allow=r'.+job_detail/.+\.html'), callback="parse_job", follow=False),

    )

    def parse_job(self, response):
        print("*" * 100)
        name = response.xpath("//div[@class='name']/h1/text()").get()
        salary = response.xpath("//div[@class='name']/span[@class='salary']/text()").get()
        job_info = response.xpath("//div[@class='job-sec']//text()").getall()
        job_info = list(map(lambda x: x.strip(), job_info))
        job_info = "".join(job_info)
        job_info = job_info.strip()
        print(job_info)
        item = BossItem(name=name, salary=salary, job_info=job_info)
        yield item


  • settings.py
DEFAULT_REQUEST_HEADERS = {
    "accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "accept-encoding":"gzip, deflate, br",
    "accept-language":"zh-CN,zh;q=0.9",
    "cache-control":"max-age=0",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    "sec-fetch-mode":"navigate",
    "sec-fetch-site":"none",
    "sec-fetch-user":"?1",
    "upgrade-insecure-requests":"1",
    "cookie":"_uab_collina=155192752626463196786582; lastCity=101010100; __c=1565492379; toUrl=/; __zp_stoken__=a32dy4M8VTtvU41ADf0l5K0oReZKFror7%2F2qFAGN5RbBdirT9P%2F2zhugmroLb2ZzmyLVH7BYC%2B3ELS5F05bZCcNIRA%3D%3D; sid=sem; __g=sem; __l=l=%2Fwww.zhipin.com%2F%3Fsid%3Dsem_pz_bdpc_dasou_title&r=https%3A%2F%2Fsp0.baidu.com%2F9q9JcDHa2gU2pMbgoY3K%2Fadrc.php%3Ft%3D06KL00c00fDIFkY0IWPB0KZEgsAN9DqI00000Kd7ZNC00000LI-XKC.THdBULP1doZA80K85yF9pywdpAqVuNqsusK15ynsmWIWry79nj0snynYPvD0IHY3rjm3nDcswWDzPHwaP1RYPRPAPjN7PRPafRfYwD77nsK95gTqFhdWpyfqn1czPjmsPjnYrausThqbpyfqnHm0uHdCIZwsT1CEQLILIz4lpA-spy38mvqVQ1q1pyfqTvNVgLKlgvFbTAPxuA71ULNxIA-YUAR0mLFW5Hb4rHf%26tpl%3Dtpl_11534_19713_15764%26l%3D1511867677%26attach%3Dlocation%253D%2526linkName%253D%2525E6%2525A0%252587%2525E5%252587%252586%2525E5%2525A4%2525B4%2525E9%252583%2525A8-%2525E6%2525A0%252587%2525E9%2525A2%252598-%2525E4%2525B8%2525BB%2525E6%2525A0%252587%2525E9%2525A2%252598%2526linkText%253DBoss%2525E7%25259B%2525B4%2525E8%252581%252598%2525E2%252580%252594%2525E2%252580%252594%2525E6%252589%2525BE%2525E5%2525B7%2525A5%2525E4%2525BD%25259C%2525EF%2525BC%25258C%2525E6%252588%252591%2525E8%2525A6%252581%2525E8%2525B7%25259F%2525E8%252580%252581%2525E6%25259D%2525BF%2525E8%2525B0%252588%2525EF%2525BC%252581%2526xp%253Did(%252522m3224604348_canvas%252522)%25252FDIV%25255B1%25255D%25252FDIV%25255B1%25255D%25252FDIV%25255B1%25255D%25252FDIV%25255B1%25255D%25252FDIV%25255B1%25255D%25252FH2%25255B1%25255D%25252FA%25255B1%25255D%2526linkType%253D%2526checksum%253D8%26wd%3Dboss%25E7%259B%25B4%25E8%2581%2598%26issp%3D1%26f%3D8%26ie%3Dutf-8%26rqlang%3Dcn%26tn%3Dbaiduhome_pg%26inputT%3D3169&g=%2Fwww.zhipin.com%2Fuser%2Fsem7.html%3Fsid%3Dsem%26qudao%3Dbaidu3%26plan%3DPC-%25E9%2580%259A%25E7%2594%25A8%25E8%25AF%258D%26unit%3DPC-zhaopin-hexin%26keyword%3Dboss%25E7%259B%25B4%25E8%2581%2598%25E4%25BC%2581%25E4%25B8%259A%25E6%258B%259B%25E8%2581%2598; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1565493077,1565494665,1565494677,1565504545; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1565505516; __a=29781409.1551927520.1553506739.1565492379.86.5.40.25"
}

17. Scraping dynamic pages

  • Analyse the AJAX endpoints the page calls and request those endpoints directly from code, or
  • use Selenium + chromedriver to drive a real browser and read the rendered data
    • common Selenium operations
    • link to the Selenium-Python documentation (Chinese translation)

17.1. Installing Selenium

  • conda install selenium

17.2. Installing chromedriver

  • Download link
  • After downloading, put it in a plain-ASCII directory that does not need special permissions
  • Note: the chromedriver version must match the browser version; the 32-bit build also works on 64-bit systems

17.3. A first small example

from selenium import webdriver
import time

driver_path = r"D:\chromedriver\chromedriver.exe"

driver = webdriver.Chrome(executable_path=driver_path)

driver.get("http://www.baidu.com")
# page_source gives the rendered HTML source of the page
print(driver.page_source)

time.sleep(3)
driver.close()

17.4. Locating elements

  • If you only need to parse data out of the page, hand page_source to lxml: lxml is implemented in C, so parsing is faster there
  • If you need to interact with elements, for example type into an input box or click a button, you must use the element-finding methods Selenium provides
from selenium import webdriver
import time
from lxml import etree

driver_path = r"D:\chromedriver\chromedriver.exe"

driver = webdriver.Chrome(executable_path=driver_path)

driver.get("http://www.baidu.com")
# page_source gives the rendered HTML source of the page
print(driver.page_source)

# inputTag = driver.find_element_by_id("kw")
# inputTag = driver.find_element_by_name("wd")
# inputTag = driver.find_element_by_class_name("s_ipt")
inputTag = driver.find_element_by_xpath("//input[@class='s_ipt']")
inputTag.send_keys("迪丽热巴")
htmlE = etree.HTML(driver.page_source)

print(htmlE)
time.sleep(3)
driver.close()


17.5. Working with form elements in Selenium

  • Text inputs
inputTag = driver.find_element_by_xpath("//input[@class='s_ipt']")
inputTag.send_keys("迪丽热巴")

time.sleep(3)

inputTag.clear()
  • Checkboxes
inputTag = driver.find_element_by_name("remember")
inputTag.click()
  • Select boxes (see the sketch below)
  • Buttons (see the sketch below)
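
A sketch of the two remaining operations; the element names are placeholders except for Baidu's "su" button, which is the same one used in the action-chain example below:

from selenium.webdriver.support.ui import Select

# select boxes: wrap the <select> element, then pick an option
select_tag = Select(driver.find_element_by_name("city"))  # "city" is a placeholder name
select_tag.select_by_index(1)                  # by position
select_tag.select_by_value("beijing")          # by the option's value attribute
select_tag.select_by_visible_text("Beijing")   # by the visible text

# buttons: locate the element and click it
submit_btn = driver.find_element_by_id("su")
submit_btn.click()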

17.6. Action chains

from selenium import webdriver
import time
from selenium.webdriver.common.action_chains import  ActionChains

driver_path = r"D:\chromedriver\chromedriver.exe"
driver = webdriver.Chrome(executable_path=driver_path)
driver.get("http://www.baidu.com")

inputTag = driver.find_element_by_xpath("//input[@class='s_ipt']")
submitBtn = driver.find_element_by_id('su')

actions = ActionChains(driver)
actions.move_to_element(inputTag)
actions.send_keys_to_element(inputTag, '黄渤')
actions.move_to_element(submitBtn)
actions.click(submitBtn)
actions.perform()

time.sleep(6)

inputTag.clear()


driver.close()


17.7. Cookie operations
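
A minimal sketch of the common cookie calls, run against a page that has already been loaded (the cookie names are just examples):

from selenium import webdriver

driver_path = r"D:\chromedriver\chromedriver.exe"
driver = webdriver.Chrome(executable_path=driver_path)
driver.get("http://www.baidu.com")

# read all cookies for the current domain
for cookie in driver.get_cookies():
    print(cookie)

# read a single cookie by name (returns None if it does not exist)
print(driver.get_cookie("BAIDUID"))

# add a cookie, delete one by name, or delete them all
driver.add_cookie({"name": "test_cookie", "value": "hello"})
driver.delete_cookie("test_cookie")
driver.delete_all_cookies()

driver.close()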

17.8. Page waits

  • Implicit waits
driver.implicitly_wait(10)
  • Explicit waits
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver_path = r"D:\chromedriver\chromedriver.exe"
driver = webdriver.Chrome(executable_path=driver_path)
driver.get("http://www.douban.com")
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'app-title'))
)
print(element)



17.9. Switching windows

from selenium import webdriver

driver_path = r"D:\chromedriver\chromedriver.exe"
driver = webdriver.Chrome(executable_path=driver_path)
driver.get("http://www.jd.com")

driver.execute_script("window.open('https://www.douban.com/')")
print(driver.window_handles)
driver.switch_to.window(driver.window_handles[1])

print(driver.current_url)




17.10. Using a proxy with Selenium

from selenium import webdriver

driver_path = r"D:\chromedriver\chromedriver.exe"

options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://60.17.239.207:31032")
driver = webdriver.Chrome(executable_path=driver_path, chrome_options=options)

driver.get("http://www.jd.com")

WebElement objects
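
Everything returned by the find_element_* methods is a WebElement; a short sketch of its commonly used members, shown against the Baidu search box used in the earlier examples:

input_tag = driver.find_element_by_id("kw")   # Baidu's search box

print(input_tag.tag_name)                     # tag name, e.g. "input"
print(input_tag.get_attribute("class"))       # read any HTML attribute
print(input_tag.is_displayed())               # whether the element is visible
print(input_tag.text)                         # visible text (empty for an input box)
input_tag.send_keys("scrapy")                 # type into it
input_tag.clear()                             # clear what was typed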

18. Scraping Lagou with Selenium

  • The traditional way (requests + lxml)
import requests
from lxml import etree
import time
import re

# request headers
HEADERS = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    "Referer": "https://www.lagou.com/jobs/list_python?labelWords=$fromSearch=true&suginput=",
    "Host": "www.lagou.com",
}

def request_list_page():
    url1 = 'https://www.lagou.com/jobs/list_python?labelWords=$fromSearch=true&suginput='

    url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'

    # data controls the pagination

    for page in range(1, 2):
        data = {
            'first': 'false',
            'pn': page,
            'kd': 'python'
        }
        s = requests.Session()  # create a session
        response = s.get(url=url1, headers=HEADERS, timeout=3)
        cookie = s.cookies  # grab the cookies set by the list page
        respon = s.post(url=url, headers=HEADERS, data=data, cookies=cookie, timeout=3)
        time.sleep(7)
        result = respon.json()
        positions = result['content']['positionResult']['result']
        for position in positions:
            positionId = position['positionId']
            position_url = "https://www.lagou.com/jobs/{}.html".format(positionId)
            parse_position_detail(position_url, s)
            break

def parse_position_detail(url, s):
    response = s.get(url, headers=HEADERS)
    text = response.text
    htmlE = etree.HTML(text)
    position_name = htmlE.xpath("//div[@class='job-name']/@title")[0]
    job_request_spans = htmlE.xpath("//dd[@class='job_request']//span")
    salary = job_request_spans[0].xpath("./text()")[0].strip()
    education = job_request_spans[3].xpath("./text()")[0]
    education = re.sub(r"[/ \s]", "", education)
    print(education)
    job_detail = htmlE.xpath("//div[@class='job-detail']//text()")
    job_detail = "".join(job_detail).strip()
    print(job_detail)


if __name__ == '__main__':
    request_list_page()

  • The Selenium + chromedriver way
import re
import time

from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class LagouSpider(object):
    """
    Selenium + ChromeDriver Lagou spider
    """
    driver_path = r"D:\chromedriver\chromedriver.exe"

    def __init__(self):
        self.driver = webdriver.Chrome(executable_path=LagouSpider.driver_path)
        # this URL is not the real endpoint that serves the job postings
        self.url = 'https://www.lagou.com/jobs/list_python?labelWords=$fromSearch=true&suginput='
        # list of scraped job postings
        self.positions = []

    def run(self):
        self.driver.get(self.url)
        while True:
            WebDriverWait(self.driver, 10).until(
                # the wait can only target an element, not a specific attribute of it
                EC.presence_of_element_located((By.XPATH, "//div[@class='pager_container']/span[last()]"))
            )

            source = self.driver.page_source
            self.parse_list_page(source)
            next_btn = self.driver.find_element_by_xpath("//div[@class='pager_container']/span[last()]")
            if "pager_next_disabled" in next_btn.get_attribute("class"):
                break
            else:
                next_btn.click()

    def parse_list_page(self, source):
        htmlE = etree.HTML(source)
        links = htmlE.xpath("//a[@class='position_link']/@href")
        for link in links:
            self.request_detail_page(link)
            time.sleep(1)

    def request_detail_page(self, url):
        # self.driver.get(url)
        self.driver.execute_script("window.open('{}')".format(url))
        self.driver.switch_to.window(self.driver.window_handles[1])

        WebDriverWait(self.driver, 10).until(
            # EC.presence_of_element_located((By.XPATH, "//div[@class='job-name']/@title"))
            # the wait can only locate an element, not an attribute of it
            EC.presence_of_element_located((By.XPATH, "//div[@class='job-name']"))
        )

        page_source = self.driver.page_source
        self.parse_detail_page(page_source)
        # close the detail tab
        self.driver.close()
        # switch back to the job-list window
        self.driver.switch_to.window(self.driver.window_handles[0])

    def parse_detail_page(self, source):
        htmlE = etree.HTML(source)
        position_name = htmlE.xpath("//div[@class='job-name']/h2/text()")[0]
        company = htmlE.xpath("//div[@class='job-name']/h4/text()")[0]
        job_request_spans = htmlE.xpath("//dd[@class='job_request']//span")
        salary = job_request_spans[0].xpath("./text()")[0].strip()
        salary = re.sub(r"[/ \s]", "", salary)
        city = job_request_spans[1].xpath("./text()")[0].strip()
        city = re.sub(r"[/ \s]", "", city)
        experience = job_request_spans[2].xpath("./text()")[0].strip()
        experience = re.sub(r"[/ \s]", "", experience)
        education = job_request_spans[3].xpath("./text()")[0]
        education = re.sub(r"[/ \s]", "", education)
        type = job_request_spans[4].xpath("./text()")[0]
        type = re.sub(r"[/ \s]", "", type)
        job_detail = htmlE.xpath("//div[@class='job-detail']//text()")
        job_detail = "".join(job_detail).strip()
        print("职位:%s" % position_name)
        print("单位:%s" % company)
        print("")
        print(salary + "/" + city + "/" + experience + "/" + education + "/" + type)
        print("")
        print(job_detail)

        position = {
            'name': position_name,
            'company': company,
            'salary': salary,
            'city': city,
            'experience': experience,
            'education': education,
            'desc': job_detail
        }
        self.positions.append(position)
        # print(position)
        print("=" * 100)


if __name__ == '__main__':
    spider = LagouSpider()
    spider.run()

19. Scrapy + Selenium: crawling the whole Jianshu site and storing it in MySQL

Note: the site currently uses CSS obfuscation (scrambled class names), so the selectors below may need updating.

  • The Item class (target fields)
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class JianshuProjectItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
    article_id = scrapy.Field()
    origin_url = scrapy.Field()
    author = scrapy.Field()
    avatar = scrapy.Field()
    pub_time = scrapy.Field()

  • The settings module
# 1. Imports
import logging
import datetime
import os


# 2. Project name  TODO change per project
BOT_NAME = 'jianshu_project'

# 3. Module names
SPIDER_MODULES = ['{}.spiders'.format(BOT_NAME)]
NEWSPIDER_MODULE = '{}.spiders'.format(BOT_NAME)

# 4. Obey robots.txt (defaults to True)
ROBOTSTXT_OBEY = False

# 5. User agent (the browser identity to present)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 ' \
             'Safari/537.36 '

# 6. Default request headers (USER_AGENT is configured separately above)
DEFAULT_REQUEST_HEADERS = {
    "accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "accept-encoding":"gzip, deflate, br",
    "accept-language":"zh-CN,zh;q=0.9",
}


# 7. Log file naming: a new log file is started every minute
time_str = datetime.datetime.strftime(datetime.datetime.now(), '%Y-%m-%d %H-%M')
LOG_FILE = '{}\\{}\\logs\\{}.log'.format(os.getcwd(), BOT_NAME, time_str)
LOG_LEVEL = 'DEBUG'

# 8. Custom command module (e.g. crawlall) for running several spiders at once
COMMANDS_MODULE = '{}.commands'.format(BOT_NAME)

# 9. Keep Chinese characters readable in Scrapy's JSON output (https://www.cnblogs.com/linkr/p/7995454.html)
FEED_EXPORT_ENCODING = 'utf-8'

# 10. Item pipelines; the lower the value, the earlier the pipeline runs  TODO change per project
ITEM_PIPELINES = {
   # '{}.pipelines.JianshuProjectPipeline'.format(BOT_NAME): 300,
   '{}.pipelines.JianshuTwistedPipeline'.format(BOT_NAME): 300,
}

# 11. Throttle the crawl: delay between requests, in seconds
DOWNLOAD_DELAY = 1

# 12. Downloader middlewares  TODO change per project
DOWNLOADER_MIDDLEWARES = {
   '{}.middlewares.RandomUserAgent'.format(BOT_NAME): 1,
   '{}.middlewares.SeleniumDownloadMiddleware'.format(BOT_NAME): 2
}

# 13. Disable cookies
COOKIES_ENABLED = False
  • The spider class
# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from jianshu_project.items import JianshuProjectItem


class JianshuSpider(CrawlSpider):
    name = 'jianshu'
    allowed_domains = ['jianshu.com']
    start_urls = ['https://www.jianshu.com/']

    rules = (
        Rule(LinkExtractor(allow=r'.*/p/[0-9a-z]{12}.*'), callback='parse_detail', follow=True),
    )

    def parse_detail(self, response):
        title = response.xpath('//div[@id="__next"]/div[1]/div/div/section[1]/h1/text()').extract_first()
        avatar = response.xpath("//div[@class='_2mYfmT']//a[@class='_1OhGeD']/img/@src").extract_first()
        author = response.xpath("//span[@class='FxYr8x']/a/text()").extract_first()
        pub_time = response.xpath(
            '//div[@id="__next"]/div[1]/div/div/section[1]/div[1]/div/div/div[2]/time/text()').extract_first()
        url = response.url
        url1 = url.split('?')[0]
        article_id = url1.split('/')[-1]

        content = response.xpath("//article[@class='_2rhmJa']").extract_first()

        item = JianshuProjectItem(
            title=title,
            avatar=avatar,
            author=author,
            pub_time=pub_time,
            origin_url=url,
            article_id=article_id,
            content=content
        )
        yield item

  • The pipeline module
import pymysql
from pymysql import cursors
from twisted.enterprise import adbapi


class JianshuProjectPipeline(object):
    """同步入庫"""
    def __init__(self):
        dbparams = {
            'host': 'mini1',
            'port': 3306,
            'user': 'root',
            'password': '123456',
            'database': 'db_jianshu',
            'charset': 'utf8'
        }
        self.conn = pymysql.connect(**dbparams)
        self.cursor = self.conn.cursor()
        self._sql = None

    def process_item(self, item, spider):
        print('*' * 300)
        print(item)
        self.cursor.execute(self.sql, (item['title'], item['content'],
                                       item['author'], item['avatar'],
                                       item['pub_time'], item['article_id'],
                                       item['origin_url']))
        self.conn.commit()
        return item

    @property
    def sql(self):
        if not self._sql:
            self._sql = """
            insert into tb_article (id,title,content,author,avatar,pub_time,article_id, origin_url) values(null,%s,%s,%s,%s,%s,%s,%s)
            """
            return self._sql
        return self._sql


class JianshuTwistedPipeline(object):
    """异步入库"""

    def __init__(self):
        dbparams = {
            'host': 'mini1',
            'port': 3306,
            'user': 'root',
            'password': '123456',
            'database': 'db_jianshu',
            'charset': 'utf8',
            'cursorclass': cursors.DictCursor
        }
        self.dbpool = adbapi.ConnectionPool('pymysql', **dbparams)
        self._sql = None

    @property
    def sql(self):
        if not self._sql:
            self._sql = """
                insert into tb_article (id,title,content,author,avatar,pub_time,article_id, origin_url) values(null,%s,%s,%s,%s,%s,%s,%s)
                """
            return self._sql
        return self._sql

    def process_item(self, item, spider):
        defer = self.dbpool.runInteraction(self.insert_item, item)
        defer.addErrback(self.handle_error, item, spider)
        return item

    def insert_item(self, cursor, item):
        cursor.execute(self.sql, (item['title'], item['content'],
                                  item['author'], item['avatar'],
                                  item['pub_time'], item['article_id'],
                                  item['origin_url']))

    def handle_error(self, error, item, spider):
        print('*' * 100)
        print('error:', error)
        print('*' * 100)

  • The middleware module
import random
from jianshu_project.conf.user_agent import USER_AGENT_LIST
from selenium import webdriver
import time
from scrapy.http.response.html import HtmlResponse


class RandomUserAgent(object):

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)


class SeleniumDownloadMiddleware(object):

    def __init__(self):
        self.driver = webdriver.Chrome(executable_path=r'D:\chromedriver\chromedriver.exe')

    def process_request(self, request, spider):

        self.driver.get(request.url)
        time.sleep(1)
        try:
            while True:
                loadMore = self.driver.find_element_by_class_name('load-more')
                loadMore.click()
                time.sleep(0.3)

                if not loadMore:
                    break
        except Exception as e:
            pass

        source = self.driver.page_source
        response = HtmlResponse(url=self.driver.current_url, body=source, request=request, encoding='utf-8')
        return response

  • The remaining pieces not shown here are covered in the earlier sections

20. Setting a proxy and UserAgent with Selenium

import random
from useragent_demo.conf.user_agent import USER_AGENT_LIST
from selenium import webdriver
import time
from scrapy.http.response.html import HtmlResponse


class RandomUserAgent(object):

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)


class SeleniumDownloadMiddleware1(object):

    def process_request(self, request, spider):
        options = webdriver.ChromeOptions()
        options.add_argument('user-agent={}'.format(request.headers['User-Agent']))  # set a random request header
        # options.add_argument('--proxy-server={}'.format(request.meta['proxy']))  # set a proxy (address taken from request.meta)
        driver = webdriver.Chrome(chrome_options=options, executable_path=r'D:\chromedriver\chromedriver.exe')

        driver.get(request.url)
        source = driver.page_source
        response = HtmlResponse(url=driver.current_url, body=source, request=request, encoding='utf-8')
        driver.close()
        return response


class SeleniumDownloadMiddleware(object):

    def __init__(self):
        # Chrome options can only be applied when the driver is created, so this
        # variant, which reuses one driver for every request, keeps a single
        # User-Agent (use the class above if a per-request User-Agent is needed)
        self.driver = webdriver.Chrome(executable_path=r'D:\chromedriver\chromedriver.exe')

    def process_request(self, request, spider):
        self.driver.get(request.url)
        time.sleep(1)
        source = self.driver.page_source
        response = HtmlResponse(url=self.driver.current_url, body=source, request=request, encoding='utf-8')
        return response

21. http://httpbin.org test endpoint notes
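
httpbin.org simply echoes back whatever the client sends, which makes it handy for checking that the random User-Agent and proxy middlewares above actually take effect. A quick sketch with requests; the proxy address is the one listed in section 14 and may well be stale:

import requests

# /user-agent echoes the User-Agent header, /ip echoes the source address
print(requests.get('http://httpbin.org/user-agent',
                   headers={'User-Agent': 'my-test-agent/1.0'}).json())
print(requests.get('http://httpbin.org/ip',
                   proxies={'http': 'http://178.44.170.152:8080'}).json())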
