Mastering Python web scraping, from simple to complex

I. urllib

urllib is a module for making HTTP requests that ships with Python and can be used for scraping.

Purpose: it lets your code simulate a browser issuing requests.

Usage flow:

  • Specify the url
  • Send the request
  • Get the page data
  • Persist the data

 

1. A first urllib scraper

# 需求: 爬取搜狗首页的页面数据
import urllib.request
# 1. 指定url
url = 'https://www.sogou.com/'

# 2. 发起请求:urlopen可以根据指定的url发起请求,并返回一个响应对象
response = urllib.request.urlopen(url=url)

# 3. 获取页面数据:read函数返回的就是响应对象中存储的页面数据
page_text = response.read()

# 4. 进行持久化存储
with open('./sougou.html', 'wb') as f:
    f.write(page_text)
    print("Done")

 

2. URL encoding with urllib

# 需求:爬取指定词条所对应的页面数据
import urllib.request
import urllib.parse

# 指定url
url = 'https://www.sogou.com/web?query='

# url特性:url不可以存在非ASCII编码的字符数据,汉字并不在ASCII编码当中
word = urllib.parse.quote("宝马")
url += word

# 发请求
response = urllib.request.urlopen(url=url)

# 获取页面数据
page_text = response.read()

# 持久化存储
with open('./bmw.html', 'wb') as f:
    f.write(page_text)

 

3. POST requests with urllib

# urllib模块发起的post请求
# 需求:爬取百度翻译的翻译结果
import urllib.request
import urllib.parse

# 1. 指定url
url = 'https://fanyi.baidu.com/sug'

# post请求携带的参数进行处理
# 流程:
# 1). 将post请求参数封装到字典
data = {
    'kw': "苹果"
}

# 2). 使用parse模块中的urlencode(返回值类型为str)进行编码处理
query = urllib.parse.urlencode(data)

# 3). 将步骤2的编码结果转换成byte类型
data = query.encode()

# 2. 发起post请求:urlopen函数的data参数表示的就是经过处理之后的post请求携带的参数
response = urllib.request.urlopen(url=url, data=data)

print(response.read())

 

 

 

II. The requests module

requests is a third-party Python library for sending network requests (install it with pip install requests); like urllib, it is used to simulate a browser issuing requests.

 

1. GET requests with requests:

1) A simple GET request

import requests
# 需求:爬取搜狗首页的页面数据
url =  'https://www.sogou.com/'

# 发起get请求,get方法会返回请求成功的相应对象
response = requests.get(url=url)

# 获取响应中的数据值:text可以获取响应对象中字符串形式的页面数据
page_data = response.text

# response对象中其他的重要属性
# content获取的是response对象中二进制(byte)类型的页面数据
# print(response.content)
# 返回一个响应状态码
# print(response.status_code)
# 返回响应头信息
# print(response.headers)
# 获取请求的url
# print(response.url)

# 持久化操作
with open('./sougou_requests.html', 'w',encoding='utf-8') as f:
    f.write(page_data)

 

2) GET requests that carry parameters

Method 1:

# requests模块处理携带参数的get请求
# 需求:指定一个词条,获取搜狗搜索结果所对应的页面数据

# 1. 指定url
url = 'https://www.sogou.com/web?query=宝马&ie=utf-8'

response = requests.get(url=url)

page_text = response.text

with open('./bmw_requests.html', 'w', encoding='utf-8') as f:
    f.write(page_text)

Method 2:

# requests模块处理携带参数的get请求
# 需求:指定一个词条,获取搜狗搜索结果所对应的页面数据

url = 'https://www.sogou.com/web'
# 将参数封装到字典中
params = {
    'query': '宝马',
    'ie': 'utf-8'
}

response = requests.get(url=url, params=params)

print(response.text)

 

3) Custom request headers

# 自定义请求头信息
import requests

url = 'https://www.sogou.com/web'
params = {
    'query': '宝马',
    'ie': 'utf-8'
}

# 自定义请求头信息
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
}

response = requests.get(url=url, params=params, headers=headers)

print(response.text)

 

2. POST requests with requests

import requests

# 1. 指定post请求的url
url = 'https://accounts.douban.com/login'

# 2. 发起post请求
data = {
    'source': 'movie',
    'redir': 'https://movie.douban.com/',
    'form_email' : '[email protected]',
    'form_password' : 'xxx',
    'login' : '登录'
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
}
response = requests.post(url=url, data=data, headers=headers)

# 3. 获取响应对象中的页面数据
page_text = response.text

# 4. 持久化操作
with open('./douban_request.html', 'w', encoding='utf-8') as f:
    f.write(page_text)

 

3. GET requests for AJAX data

import requests

url = 'https://movie.douban.com/j/chart/top_list?'

# 封装ajax的get请求中携带的参数
params = {
    'type' : '13',
    'interval_id' : '100:90',
    'action': '',
    'start': '0',
    'limit': '20',
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
}
response = requests.get(url, params=params, headers=headers)

print(response.text)

 

4. POST requests for AJAX data

import requests

# 1. 指定url
post_url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'

# 处理post请求的参数
data = {
    'cname': '',
    'pid': '',
    'keyword': '上海',
    'pageIndex': 1,
    'pageSize': 10
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
}

# 2. 发起基于ajax的post请求
response = requests.post(url=post_url, data=data, headers=headers)
print(response.text)

 

5. A hands-on example

import requests
import os

# 创建一个文件夹
if not os.path.exists('./zhihu_pages'):
    os.mkdir('./zhihu_pages')

word = input('enter a word: ')

# 动态指定页码的范围
start_pageNum = int(input('enter a start pageNum: '))
end_pageNum = int(input('enter a end pageNum: '))

# 指定url:设计成一个具有通用的url
url = 'http://zhihu.sogou.com/zhihu'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
}

for page in range(start_pageNum, end_pageNum+1):
    params = {
        'query': word,
        'page': page,
        'ie': 'utf-8'
    }
    response = requests.get(url=url, params=params, headers=headers)
    
    # 获取响应中的页面数据
    page_text = response.text
    
    # 进行持久化存储
    filename = word + str(page) + '.html'
    filePath = 'zhihu_pages/' + filename
    with open(filePath, 'w', encoding='utf-8') as f:
        f.write(page_text)
        print("文件第%s页数据写入成功" % page)

 

6. Carrying cookies with requests

Purpose of cookies: the server uses cookies to keep track of client state.

Flow: 1. perform the login request (to obtain the cookie); 2. carry that cookie along when requesting the personal home page.

import requests

# 获取session对象
session = requests.session()

# 1. 发起登录请求:将cookie获取,并存储到session对象中
login_url = 'https://accounts.douban.com/login'
data = {
    'source': 'None',
    'redir': 'https://www.douban.com/people/186539740/',
    'form_email' : '[email protected]',
    'form_password' : 'xxx',
    'login' : '登录'
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
}
# 使用session发起post请求
login_response = session.post(url=login_url, data=data, headers=headers)

# 2. 对个人主页发起请求(session(cookie)),获取相应页面数据
url = 'https://www.douban.com/people/186539740/'
response = session.get(url=url, headers=headers)
page_text = response.text

with open('./douban_person1.html', 'w', encoding='utf-8') as f:
    f.write(page_text)

 

7. Using proxies with requests

Proxy types: 1. forward proxy: fetches data on behalf of the client; 2. reverse proxy: serves data on behalf of the server.

Sites offering free proxy IPs: www.goubanjia.com; Kuaidaili; Xici proxy

import requests

url = 'https://www.baidu.com/s?ie=utf-8&wd=ip'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
}
# 将代理ip封装到字典当中
proxy = {
    'http': '39.137.77.66:8080'
}
# 更换网路IP
response = requests.get(url=url, proxies=proxy, headers=headers)

with open('./proxy1.html', 'w', encoding='utf-8') as f:
    f.write(response.text)

 

8. Parsing the data

1) Regular expressions

# 需求:使用正则对糗事百科中的图片数据进行解析和下载
import requests
import re
import os

# 指定url
url = 'https://www.qiushibaike.com/pic/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
}
# 发起请求
response = requests.get(url=url, headers=headers)

# 获取页面数据
page_text = response.text

# 数据解析
img_list = re.findall('<div class="thumb">.*?<img src="(.*?)" alt=.*?>.*?</div>', page_text, re.S)

# 创建一个存储图片数据的文件夹
if not os.path.exists('./imgs_qiushi'):
    os.mkdir('imgs_qiushi')

for url in img_list:
    # 将图片的url进行拼接,拼接成一个完整的url
    img_url = 'https:' + url

    # 持久化存储:存储的是图片的数据,并不是url
    # 获取图片二进制的数据值
    img_data = requests.get(url=img_url, headers=headers).content

    imgName = url.split('/')[-1]
    imgPath = 'imgs_qiushi/' + imgName
    with open(imgPath, 'wb') as f:
        f.write(img_data)
        print(imgName + "写入完成")

 

2)xpath

Workflow for using xpath in a scraper:

(1) Install: pip install lxml

(2) Import: from lxml import etree

(3) Create an etree object and use it to parse the target data

  • Local file: tree = etree.parse("path to a local file")
  •             tree.xpath("xpath expression")
  • Network: tree = etree.HTML("page data fetched over the network")
  •             tree.xpath("xpath expression")

 

Common xpath expressions:

(1) Locating by attribute:

# find the div whose class attribute is "song": "//div[@class='song']"

(2) Locating by hierarchy & index:

# find the a tag that is a direct child of the 2nd li under the ul directly inside the div with class "tang": "//div[@class='tang']/ul/li[2]/a"

(3) Logical operators:

# find the a tag whose href is empty and whose class is "du": "//a[@href='' and @class='du']"

(4) Fuzzy matching:

# "//div[contains(@class, 'ng')]"

# "//div[starts-with(@class, 'ta')]"

(5) Extracting text:

# / gets the text directly under a tag

# // gets the text under a tag and under all of its descendants

# "//div[@class='song']/p[1]/text()"

# "//div[@class='tang']//text()"

(6) Extracting attributes:

# "//div[@class='tang']//li[2]/a/@href"
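These expressions can be tried out on a small in-memory document before pointing them at a real page. A minimal, self-contained sketch (the HTML snippet below is made up purely for illustration):

from lxml import etree

html = """
<div class="song">
    <p>first paragraph</p>
    <a href="" class="du">empty link</a>
</div>
<div class="tang">
    <ul>
        <li><a href="/one">one</a></li>
        <li><a href="/two">two</a></li>
    </ul>
</div>
"""

tree = etree.HTML(html)
print(tree.xpath("//div[@class='song']/p[1]/text()"))       # ['first paragraph']
print(tree.xpath("//a[@href='' and @class='du']/text()"))    # ['empty link']
print(tree.xpath("//div[@class='tang']/ul/li[2]/a/text()"))  # ['two']
print(tree.xpath("//div[@class='tang']//li[2]/a/@href"))     # ['/two']
print(tree.xpath("//div[contains(@class, 'ng')]/@class"))    # ['song', 'tang']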

# 需求:使用xpath对段子网中的段子内容和标题进行解析,并持久化存储
import requests
from lxml import etree

# 1. 指定url
url = 'https://ishuo.cn/joke'

# 2. 发起请求
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
}
response = requests.get(url=url, headers=headers)

# 3. 获取页面内容
page_text = response.text

# 4. 数据解析
tree = etree.HTML(page_text)
# 获取所有的li标签,
li_list = tree.xpath("//div[@id='list']/ul/li")
with open('./duanzi.txt', 'w', encoding='utf-8') as f:
    for li in li_list:
        content = li.xpath("./div[@class='content']/text()")[0]
        title = li.xpath("./div[@class='info']/a/text()")[0]
        # 5. 持久化
        f.write(title + ":" + content + "\n\n")

 

3)bs4

Core idea: convert the HTML document into a BeautifulSoup object, then use that object's attributes and methods to locate the content you want inside the HTML document.

Attributes and methods:

(1) Look up by tag name

soup.a  # only finds the first matching tag

(2) Get attributes

soup.a.attrs  # all attributes and values of the a tag, returned as a dict

soup.a.attrs['href']  # get the href attribute

soup.a['href']  # shorthand for the same thing

(3) Get the content

soup.a.string

soup.a.text

soup.a.get_text()

Note: if the tag contains other tags as well, .string returns None, while the other two can still retrieve the text content.
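A quick way to see the difference (the markup below is made up for illustration):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="#">hi <b>there</b></a>', 'lxml')
print(soup.a.string)      # None - the <a> tag holds more than a single string
print(soup.a.text)        # 'hi there'
print(soup.a.get_text())  # 'hi there'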

(4) find: return the first matching tag

soup.find('a')  # first match

soup.find('a', title='xxx')

soup.find('a', alt='xxx')

soup.find('a', class_='xxx')

soup.find('a', id='xxx')

(5) find_all: return all matching tags

soup.find_all('a')

soup.find_all(['a', 'b'])   # all a and b tags

soup.find_all('a', limit=2)  # only the first two matches

(6) Selecting content with CSS selectors

select: soup.select('#feng')

- Common selectors: tag selector (a), class selector (.), id selector (#), hierarchy selectors

Hierarchy selectors:

div .dudu #lala ,meme .xixi

div > p > a > .lala

Note: select always returns a list, so you index into it to get the tag object you want, as in the snippet below.
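For example, even an id selector comes back wrapped in a list (this assumes the page actually contains an element with id="feng"):

tags = soup.select('#feng')   # always a list
if tags:
    print(tags[0].text)       # index into the list to get the Tag object itself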

# 需求:爬取古诗文网中三国小说里的标题和内容
import requests
from bs4 import BeautifulSoup
url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
}

# 根据url获取页面内容中指定的标题所对应的文章内容
def get_content(url):
    content_page = requests.get(url=url, headers=headers).text
    soup = BeautifulSoup(content_page, 'lxml')
    div = soup.find('div', class_='chapter_content')
    return div.text
    
page_text = requests.get(url=url, headers=headers).text

soup = BeautifulSoup(page_text, 'lxml')
a_list = soup.select('.book-mulu > ul > li > a')

with open('./threekingdoms.txt', 'w', encoding='utf-8') as f:
    for a in a_list:
        title = a.string
        content_url = 'http://www.shicimingju.com' + a['href']
        content = get_content(content_url)
        f.write(title + ":\n" + content + "\n\n\n")
        print(title + ": 已被写入")

 

9. Scraping data that is loaded dynamically

1) selenium: a third-party library that drives a real browser to perform automated actions

(1) Environment setup

a. Install: pip install selenium

b. Get the browser driver:

Chrome driver download: chromedriver.storage.googleapis.com/index.html

The driver version must match your browser version; see the mapping table at http://blog.csdn.net/huilan_same/article/details/51896672

from selenium import webdriver

# 创建一个浏览器对象executable_path驱动的路径
b = webdriver.Chrome(executable_path='./chromedriver')

#get方法可以指定一个url,让浏览器进行请求
url = 'https://www.baidu.com'
b.get(url)

# 使用下面的方法,查找指定的元素进行操作即可
# find_element_by_id   根据id找节点
# find_elements_by_name   根据name找
# find_elements_by_xpath   根据xpath查找
# find_elements_by_tag_name   根据标签找
# find_elements_by_class_name   根据class名字查找

# 让百度进行指定词条的搜索
text = b.find_element_by_id('kw') # 定位到了text文本框
text.send_keys('人民币') # send_keys表示向文本框中录入指定内容

button = b.find_element_by_id('su')
button.click() # click表示的是点击操作

b.quit() # 关闭浏览器

 

2)phantomjs

from selenium import webdriver

b = webdriver.PhantomJS(executable_path='phantomjs')

# 打开浏览器
b.get('https://www.baidu.com')

# 截屏
b.save_screenshot('./1.png')

text = b.find_element_by_id('kw')
text.send_keys('人民币')

b.save_screenshot('./2.png')
b.quit()

 

3) A hands-on example

from selenium import webdriver

b = webdriver.PhantomJS(executable_path='phantomjs')
url = 'https://movie.douban.com/typerank?type_name=%E5%8A%A8%E4%BD%9C&type=5&interval_id=100:90&action='
b.get(url)

js = 'window.scrollTo(0, document.body.scrollHeight)'
b.execute_script(js)

page_text = b.page_source
print(page_text)

 

 

 

Extras:

Case 1: scraping the Autohome news page

'''
爬取汽车之家新闻页面数据
'''
import requests
from bs4 import BeautifulSoup
import os

# 1. 伪造浏览器发送请求
r1 = requests.get(url="https://www.autohome.com.cn/news/")

r1.encoding = "gbk"

# 2. 去响应的响应体中解析出我们想要的数据
soup = BeautifulSoup(r1.text, "html.parser")

# 3. 按照规则找,div标签且id="auto-channel-lazyload-article"找匹配成功的第一个
container = soup.find(name="div", attrs={"id": "auto-channel-lazyload-article"})

# 4. container中找所有的li标签,返回的是个列表
li_list = container.find_all(name="li")

for tag in li_list:
    title = tag.find(name="h3")
    if not title:
        continue
    summary = tag.find(name='p')
    a_tag = tag.find(name='a')
    url = "https:" + a_tag.attrs.get('href')
    img = tag.find(name="img")
    img_url = "https:" + img.get("src")
    print(title.text)
    print(summary.text)
    print(url)
    print(img_url)

    r2 = requests.get(url=img_url)
    img_name = img_url.rsplit('/', maxsplit=1)[1]
    img_path = os.path.join("imgs", img_name)
    with open(img_path, 'wb') as f:
        f.write(r2.content)

    print("_______________________________________________________________")

 

Case 2: scraping the Chouti front-page news and saving the cover images

'''
爬取抽屉热榜新闻
'''
import requests
from bs4 import BeautifulSoup
import os

r1 = requests.get(
    url="https://dig.chouti.com/",
    headers={
        'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
    }
)

s1 = BeautifulSoup(r1.text, 'html.parser')

container = s1.find(name="div", attrs={'id': 'content-list'})
news_list = container.find_all(name='div', attrs={'class': 'item'})

for news in news_list:
    title = news.find(name='a', attrs={'class': 'show-content'})
    summary = news.find(name='span', attrs={'class': 'summary'})
    print("Title: ", title.text)

    img_div = news.find(name='div', attrs={'class': 'news-pic'})
    img_url = "https:" + img_div.find(name='img').get('original')

    r2 = requests.get(img_url)
    img_name = img_url.rsplit('/', maxsplit=1)[1].split('?')[0]
    img_path = os.path.join('imgs', img_name)
    with open(img_path, 'wb') as f:
        f.write(r2.content)

    if not summary:
        print("----------------------------------------------------------")
        continue
    print("Summary: ",summary.text)
    print("----------------------------------------------------------")



 

Case 3: logging in to Chouti and upvoting a post

'''
通过代码进行自动登录,然后进行点赞
'''
import requests
import bs4

# 第一部分:登录
r1 = requests.get(
    url="https://dig.chouti.com/",
    headers={
        'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
    }
)
r1_cookie_dict = r1.cookies.get_dict()

r2 = requests.post(
    url="https://dig.chouti.com/login",
    headers={
        'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
    },
    data={
        'phone':"999999",
        'password': "XXXXXX",
        'oneMonth': 1,
    },
    cookies = r1_cookie_dict
)
print(r2.text)

# 第二部分:点赞
r3 = requests.post(
    url='https://dig.chouti.com/link/vote?linksId=20843176',
    headers={
        'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
    },
    cookies=r1_cookie_dict
)
print(r3.text)

 

Case 4: scraping Douyin short videos

import requests

user_id = '58841646784'

# 获取所有作品
"""
 signature = _bytedAcrawler.sign('用户ID')
 douyin_falcon:node_modules/byted-acrawler/dist/runtime
"""
import subprocess
signature = subprocess.getoutput('node s1.js %s' %user_id)

user_video_list = []

# ############################# 获取个人作品 ##########################
user_video_params = {
    'user_id': str(user_id),
    'count': '21',
    'max_cursor': '0',
    'aid': '1128',
    '_signature': signature,
    'dytk': 'b4dceed99803a04a1c4395ffc81f3dbc' # '114f1984d1917343ccfb14d94e7ce5f5'
}

def get_aweme_list(max_cursor=None):
    if max_cursor:
        user_video_params['max_cursor'] = str(max_cursor)
    res = requests.get(
        url="https://www.douyin.com/aweme/v1/aweme/post/",
        params=user_video_params,
        headers={
            'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
            'x-requested-with':'XMLHttpRequest',
            'referer':'https://www.douyin.com/share/user/58841646784',
        }
    )
    content_json = res.json()
    aweme_list = content_json.get('aweme_list', [])

    user_video_list.extend(aweme_list)
    if content_json.get('has_more') == 1:
        return get_aweme_list(content_json.get('max_cursor'))


get_aweme_list()


# ############################# 获取喜欢作品 ##########################


favor_video_list = []

favor_video_params = {
    'user_id': str(user_id),
    'count': '21',
    'max_cursor': '0',
    'aid': '1128',
    '_signature': signature,
    'dytk': 'b4dceed99803a04a1c4395ffc81f3dbc'
}


def get_favor_list(max_cursor=None):
    if max_cursor:
        favor_video_params['max_cursor'] = str(max_cursor)
    res = requests.get(
        url="https://www.douyin.com/aweme/v1/aweme/favorite/",
        params=favor_video_params,
        headers={
            'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
            'x-requested-with':'XMLHttpRequest',
            'referer':'https://www.douyin.com/share/user/58841646784',
        }
    )
    content_json = res.json()
    aweme_list = content_json.get('aweme_list', [])
    favor_video_list.extend(aweme_list)
    if content_json.get('has_more') == 1:
        return get_favor_list(content_json.get('max_cursor'))


get_favor_list()


# ############################# 视频下载 ##########################
for item in user_video_list:
    video_id = item['video']['play_addr']['uri']

    video = requests.get(
        url='https://aweme.snssdk.com/aweme/v1/playwm/',
        params={
            'video_id':video_id
        },
        headers={
            'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
            'x-requested-with': 'XMLHttpRequest',
            'referer': 'https://www.douyin.com/share/user/58841646784',
        },
        stream=True
    )
    file_name = video_id + '.mp4'
    with open(file_name,'wb') as f:
        for line in video.iter_content():
            f.write(line)


for item in favor_video_list:
    video_id = item['video']['play_addr']['uri']

    video = requests.get(
        url='https://aweme.snssdk.com/aweme/v1/playwm/',
        params={
            'video_id':video_id
        },
        headers={
            'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
            'x-requested-with': 'XMLHttpRequest',
            'referer': 'https://www.douyin.com/share/user/58841646784',
        },
        stream=True
    )
    file_name = video_id + '.mp4'
    with open(file_name, 'wb') as f:
        for line in video.iter_content():
            f.write(line)

 

Other request methods

requests.get(url, params=None, **kwargs)
requests.post(url, data=None, json=None, **kwargs)
requests.put(url, data=None, **kwargs)
requests.head(url, **kwargs)
requests.delete(url, **kwargs)
requests.patch(url, data=None, **kwargs)
requests.options(url, **kwargs)
  
# 以上方法均是在此方法的基础上构建
requests.request(method, url, **kwargs)

 

More parameters

def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
        ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
        or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
        defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
        to add for the file.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How long to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

      >>> import requests
      >>> req = requests.request('GET', 'http://httpbin.org/get')
      <Response [200]>
    """

Parameter examples

def param_method_url():
    # requests.request(method='get', url='http://127.0.0.1:8000/test/')
    # requests.request(method='post', url='http://127.0.0.1:8000/test/')
    pass


def param_param():
    # - 可以是字典
    # - 可以是字符串
    # - 可以是字节(ascii编码以内)

    # requests.request(method='get',
    # url='http://127.0.0.1:8000/test/',
    # params={'k1': 'v1', 'k2': '水电费'})

    # requests.request(method='get',
    # url='http://127.0.0.1:8000/test/',
    # params="k1=v1&k2=水电费&k3=v3&k3=vv3")

    # requests.request(method='get',
    # url='http://127.0.0.1:8000/test/',
    # params=bytes("k1=v1&k2=k2&k3=v3&k3=vv3", encoding='utf8'))

    # 错误
    # requests.request(method='get',
    # url='http://127.0.0.1:8000/test/',
    # params=bytes("k1=v1&k2=水电费&k3=v3&k3=vv3", encoding='utf8'))
    pass


def param_data():
    # 可以是字典
    # 可以是字符串
    # 可以是字节
    # 可以是文件对象

    # requests.request(method='POST',
    # url='http://127.0.0.1:8000/test/',
    # data={'k1': 'v1', 'k2': '水电费'})

    # requests.request(method='POST',
    # url='http://127.0.0.1:8000/test/',
    # data="k1=v1; k2=v2; k3=v3; k3=v4"
    # )

    # requests.request(method='POST',
    # url='http://127.0.0.1:8000/test/',
    # data="k1=v1;k2=v2;k3=v3;k3=v4",
    # headers={'Content-Type': 'application/x-www-form-urlencoded'}
    # )

    # requests.request(method='POST',
    # url='http://127.0.0.1:8000/test/',
    # data=open('data_file.py', mode='r', encoding='utf-8'), # 文件内容是:k1=v1;k2=v2;k3=v3;k3=v4
    # headers={'Content-Type': 'application/x-www-form-urlencoded'}
    # )
    pass


def param_json():
    # 将json中对应的数据进行序列化成一个字符串,json.dumps(...)
    # 然后发送到服务器端的body中,并且Content-Type是 {'Content-Type': 'application/json'}
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     json={'k1': 'v1', 'k2': '水电费'})


def param_headers():
    # 发送请求头到服务器端
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     json={'k1': 'v1', 'k2': '水电费'},
                     headers={'Content-Type': 'application/x-www-form-urlencoded'}
                     )


def param_cookies():
    # 发送Cookie到服务器端
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data={'k1': 'v1', 'k2': 'v2'},
                     cookies={'cook1': 'value1'},
                     )
    # 也可以使用CookieJar(字典形式就是在此基础上封装)
    from http.cookiejar import CookieJar
    from http.cookiejar import Cookie

    obj = CookieJar()
    obj.set_cookie(Cookie(version=0, name='c1', value='v1', port=None, domain='', path='/', secure=False, expires=None,
                          discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False,
                          port_specified=False, domain_specified=False, domain_initial_dot=False, path_specified=False)
                   )
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data={'k1': 'v1', 'k2': 'v2'},
                     cookies=obj)


def param_files():
    # 发送文件
    # file_dict = {
    # 'f1': open('readme', 'rb')
    # }
    # requests.request(method='POST',
    # url='http://127.0.0.1:8000/test/',
    # files=file_dict)

    # 发送文件,定制文件名
    # file_dict = {
    # 'f1': ('test.txt', open('readme', 'rb'))
    # }
    # requests.request(method='POST',
    # url='http://127.0.0.1:8000/test/',
    # files=file_dict)

    # 发送文件,定制文件名
    # file_dict = {
    # 'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf")
    # }
    # requests.request(method='POST',
    # url='http://127.0.0.1:8000/test/',
    # files=file_dict)

    # 发送文件,定制文件名
    # file_dict = {
    #     'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf", 'application/text', {'k1': '0'})
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    pass


def param_auth():
    from requests.auth import HTTPBasicAuth, HTTPDigestAuth

    ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
    print(ret.text)

    # ret = requests.get('http://192.168.1.1',
    # auth=HTTPBasicAuth('admin', 'admin'))
    # ret.encoding = 'gbk'
    # print(ret.text)

    # ret = requests.get('http://httpbin.org/digest-auth/auth/user/pass', auth=HTTPDigestAuth('user', 'pass'))
    # print(ret)
    #


def param_timeout():
    # ret = requests.get('http://google.com/', timeout=1)
    # print(ret)

    # ret = requests.get('http://google.com/', timeout=(5, 1))
    # print(ret)
    pass


def param_allow_redirects():
    ret = requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)
    print(ret.text)


def param_proxies():
    # proxies = {
    # "http": "61.172.249.96:80",
    # "https": "http://61.185.219.126:3128",
    # }

    # proxies = {'http://10.20.1.128': 'http://10.10.1.10:5323'}

    # ret = requests.get("http://www.proxy360.cn/Proxy", proxies=proxies)
    # print(ret.headers)


    # from requests.auth import HTTPProxyAuth
    #
    # proxyDict = {
    # 'http': '77.75.105.165',
    # 'https': '77.75.105.165'
    # }
    # auth = HTTPProxyAuth('username', 'mypassword')
    #
    # r = requests.get("http://www.google.com", proxies=proxyDict, auth=auth)
    # print(r.text)

    pass


def param_stream():
    ret = requests.get('http://127.0.0.1:8000/test/', stream=True)
    print(ret.content)
    ret.close()

    # from contextlib import closing
    # with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
    # # 在此处理响应。
    # for i in r.iter_content():
    # print(i)


def requests_session():
    import requests

    session = requests.Session()

    ### 1、首先登陆任何页面,获取cookie

    i1 = session.get(url="http://dig.chouti.com/help/service")

    ### 2、用户登陆,携带上一次的cookie,后台对cookie中的 gpsd 进行授权
    i2 = session.post(
        url="http://dig.chouti.com/login",
        data={
            'phone': "8615131255089",
            'password': "xxxxxx",
            'oneMonth': ""
        }
    )

    i3 = session.post(
        url="http://dig.chouti.com/link/vote?linksId=8589623",
    )
    print(i3.text)

 

The bs4 module

BeautifulSoup is a module that takes an HTML or XML string, parses it into a structured form, and then lets you use the methods it provides to quickly locate specific elements, which makes finding elements in HTML or XML much simpler.

Install: pip3 install beautifulsoup4

Usage example:

from bs4 import BeautifulSoup
 
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
    ...
</body>
</html>
"""
 
soup = BeautifulSoup(html_doc, features="lxml")

 

1. name, the tag's name

# tag = soup.find('a')
# name = tag.name # 获取
# print(name)
# tag.name = 'span' # 设置
# print(soup)

 

2. attrs, the tag's attributes

# tag = soup.find('a')
# attrs = tag.attrs    # 获取
# print(attrs)
# tag.attrs = {'ik':123} # 设置
# tag.attrs['id'] = 'iiiii' # 设置
# print(soup)

 

3. children, all direct child tags

# body = soup.find('body')
# v = body.children

 

4. descendants, all descendant tags

# body = soup.find('body')
# v = body.descendants

 

5. clear, remove all of a tag's children (the tag itself is kept)

# tag = soup.find('body')
# tag.clear()
# print(soup)

 

6. decompose, recursively remove the tag and everything inside it

# body = soup.find('body')
# body.decompose()
# print(soup)

 

7. extract, recursively remove the tag and return what was removed

# body = soup.find('body')
# v = body.extract()
# print(soup)

 

8. decode, convert to a string (including the current tag); decode_contents (excluding the current tag)

# body = soup.find('body')
# v = body.decode()
# v = body.decode_contents()
# print(v)

 

9. encode, convert to bytes (including the current tag); encode_contents (excluding the current tag)

# body = soup.find('body')
# v = body.encode()
# v = body.encode_contents()
# print(v)

 

10. find, get the first matching tag

# tag = soup.find('a')
# print(tag)
# tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
# print(tag)

 

11. find_all, get all matching tags

# tags = soup.find_all('a')
# print(tags)
 
# tags = soup.find_all('a',limit=1)
# print(tags)
 
# tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# # tags = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
# print(tags)
 
 
# ####### 列表 #######
# v = soup.find_all(name=['a','div'])
# print(v)
 
# v = soup.find_all(class_=['sister0', 'sister'])
# print(v)
 
# v = soup.find_all(text=['Tillie'])
# print(v, type(v[0]))
 
 
# v = soup.find_all(id=['link1','link2'])
# print(v)
 
# v = soup.find_all(href=['link1','link2'])
# print(v)
 
# ####### 正则 #######
import re
# rep = re.compile('p')
# rep = re.compile('^p')
# v = soup.find_all(name=rep)
# print(v)
 
# rep = re.compile('sister.*')
# v = soup.find_all(class_=rep)
# print(v)
 
# rep = re.compile('http://www.oldboy.com/static/.*')
# v = soup.find_all(href=rep)
# print(v)
 
# ####### 方法筛选 #######
# def func(tag):
# return tag.has_attr('class') and tag.has_attr('id')
# v = soup.find_all(name=func)
# print(v)
 
 
# ## get,获取标签属性
# tag = soup.find('a')
# v = tag.get('id')
# print(v)

 

12. has_attr, check whether the tag has a given attribute

# tag = soup.find('a')
# v = tag.has_attr('id')
# print(v)

 

13. get_text, get the text inside the tag

# tag = soup.find('a')
# v = tag.get_text('id')
# print(v)

 

14. index, get a tag's index position inside another tag

# tag = soup.find('body')
# v = tag.index(tag.find('div'))
# print(v)
 
# tag = soup.find('body')
# for i,v in enumerate(tag):
# print(i,v)

 

15. is_empty_element, whether the tag is an empty (void) or self-closing tag,

     i.e. one of: 'br', 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base'

# tag = soup.find('br')
# v = tag.is_empty_element
# print(v)

 

16. Related tags of the current tag

# soup.next
# soup.next_element
# soup.next_elements
# soup.next_sibling
# soup.next_siblings
 
#
# tag.previous
# tag.previous_element
# tag.previous_elements
# tag.previous_sibling
# tag.previous_siblings
 
#
# tag.parent
# tag.parents

 

17. Searching among a tag's related tags

# tag.find_next(...)
# tag.find_all_next(...)
# tag.find_next_sibling(...)
# tag.find_next_siblings(...)
 
# tag.find_previous(...)
# tag.find_all_previous(...)
# tag.find_previous_sibling(...)
# tag.find_previous_siblings(...)
 
# tag.find_parent(...)
# tag.find_parents(...)
 
# 参数同find_all

 

18. select, select_one: CSS selectors

soup.select("title")
 
soup.select("p:nth-of-type(3)")
 
soup.select("body a")
 
soup.select("html head title")
 
tag = soup.select("span,a")
 
soup.select("head > title")
 
soup.select("p > a")
 
soup.select("p > a:nth-of-type(2)")
 
soup.select("p > #link1")
 
soup.select("body > a")
 
soup.select("#link1 ~ .sister")
 
soup.select("#link1 + .sister")
 
soup.select(".sister")
 
soup.select("[class~=sister]")
 
soup.select("#link1")
 
soup.select("a#link2")
 
soup.select('a[href]')
 
soup.select('a[href="http://example.com/elsie"]')
 
soup.select('a[href^="http://example.com/"]')
 
soup.select('a[href$="tillie"]')
 
soup.select('a[href*=".com/el"]')
 
 
from bs4.element import Tag
 
def default_candidate_generator(tag):
    for child in tag.descendants:
        if not isinstance(child, Tag):
            continue
        if not child.has_attr('href'):
            continue
        yield child
 
tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator)
print(type(tags), tags)
 
from bs4.element import Tag
def default_candidate_generator(tag):
    for child in tag.descendants:
        if not isinstance(child, Tag):
            continue
        if not child.has_attr('href'):
            continue
        yield child
 
tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator, limit=1)
print(type(tags), tags)

 

19. A tag's content

# tag = soup.find('span')
# print(tag.string)          # 获取
# tag.string = 'new content' # 设置
# print(soup)
 
# tag = soup.find('body')
# print(tag.string)
# tag.string = 'xxx'
# print(soup)
 
# tag = soup.find('body')
# v = tag.stripped_strings  # 递归内部获取所有标签的文本
# print(v)

 

20. append, add a tag at the end of the current tag's contents

# tag = soup.find('body')
# tag.append(soup.find('a'))
# print(soup)
#
# from bs4.element import Tag
# obj = Tag(name='i',attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('body')
# tag.append(obj)
# print(soup)

 

21. insert, insert a tag at a given position inside the current tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('body')
# tag.insert(2, obj)
# print(soup)

 

22. insert_after, insert_before: insert after or before the current tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('body')
# # tag.insert_before(obj)
# tag.insert_after(obj)
# print(soup)

 

23. replace_with, replace the current tag with the given tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('div')
# tag.replace_with(obj)
# print(soup)

 

24. Creating relationships between tags

# tag = soup.find('div')
# a = soup.find('a')
# tag.setup(previous_sibling=a)
# print(tag.previous_sibling)

 

25. wrap, wrap the current tag inside the given tag

# from bs4.element import Tag
# obj1 = Tag(name='div', attrs={'id': 'it'})
# obj1.string = '我是一个新来的'
#
# tag = soup.find('a')
# v = tag.wrap(obj1)
# print(soup)
 
# tag = soup.find('a')
# v = tag.wrap(soup.find('p'))
# print(soup)

 

26. unwrap, remove the current tag but keep its contents

# tag = soup.find('a')
# v = tag.unwrap()
# print(soup)

 

 

 

III. Scrapy

Installing Scrapy on Windows

a. pip3 install wheel

b. Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

c. In the download directory, run pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl

d. pip3 install pywin32

e. pip3 install scrapy

 

1. Basic commands

1. scrapy startproject <project_name>
   - create a new project in the current directory (similar to Django)
 
2. scrapy genspider [-t template] <name> <domain>
   - create a spider
   e.g.:
      scrapy genspider -t xmlfeed autohome autohome.com.cn
   PS:
      list the available templates: scrapy genspider -l
      show a template: scrapy genspider -d <template_name>
 
3. scrapy list
   - list the spiders in the project
 
4. scrapy crawl <spider_name>
   - run a single spider

 

2. Project layout and a quick look at a spider

project_name/

   scrapy.cfg # the project's main configuration entry (the real scraping settings live in settings.py)

   project_name/

       __init__.py

       items.py # data models for structured data, similar to Django's Model

       pipelines.py # data persistence

       settings.py # configuration, e.g. recursion depth, concurrency, download delay

       spiders/ # spider directory

           __init__.py

           spider1.py

           spider2.py

           spider3.py

File descriptions:

  • scrapy.cfg  the project's main configuration entry (the real scraping settings live in settings.py)
  • items.py    data models for structured data, like Django's Model
  • pipelines    data-processing behavior, e.g. persisting the structured data
  • settings.py configuration, e.g. recursion depth, concurrency, download delay
  • spiders      spider directory: the spider files and their rules are written here

    Note: spider files are usually named after the site's domain

3. Minimal configuration

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

 

4. Persisting the scraped data

1) To files on disk

(1) Via a terminal command

a. Make sure the parse method returns an iterable object (holding the parsed page content), as in the sketch below

b. Use a terminal command to write the data to a file on disk:

scrapy crawl <spider_file_name> -o <file>.<ext>
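A minimal sketch of what parse needs to look like for this to work (the spider name is made up; the selectors follow the qiushibaike example used later in this post). Run it with something like scrapy crawl qiubai_demo -o qiubai.json:

# -*- coding: utf-8 -*-
import scrapy


class QiubaiDemoSpider(scrapy.Spider):
    name = 'qiubai_demo'
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # yield (or return) an iterable of dict-like objects;
        # the -o exporter serializes whatever parse produces
        for div in response.xpath("//div[@id='content-left']/div"):
            yield {
                'author': div.xpath("./div/a[2]/h2/text()").extract_first(),
                'content': div.xpath(".//div[@class='content']/span/text()").extract_first(),
            }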

(2) Via pipelines

a. items.py: stores the parsed page data

b. pipelines: handles the persistence operations

c. Implementation flow:

  a) store the parsed page data in an item object

  b) use the yield keyword to hand the item over to the pipeline

  c) write the storage code in the pipeline file

  d) enable the pipeline in the settings file (see the snippet below)
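Step d) is just an entry in settings.py; a minimal sketch (the module and class names below follow the qiubai pipeline shown in the next section):

# settings.py
ITEM_PIPELINES = {
    # 'project_name.pipelines.PipelineClassName': priority (lower runs earlier)
    'qiubai.pipelines.QiubaiPipeline': 300,
}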

 

2) To a database

(1) MySQL

Code for pipelines.py:

import pymysql


class QiubaiPipeline(object):
    conn = None
    cursor = None
    def open_spider(self, spider):
        # 1. 连接数据库
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='root', db='qiubai')

    def process_item(self, item, spider):
        # 2. 执行sql语句
        sql = 'insert into qiubai values("%s", "%s")' % (item['author'], item['content'])
        self.cursor = self.conn.cursor()
        try:
            # 3. 提交事务
            self.cursor.execute(sql)
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
Code for the spider file (qiushibaike.py):

# -*- coding: utf-8 -*-
import scrapy
from qiubai.items import QiubaiItem


class QiushibaikeSpider(scrapy.Spider):
    name = 'qiushibaike'
    # allowed_domains = ['www.qiushibaike.com/text']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # 解析段子的内容合作者
        div_list = response.xpath("//div[@id='content-left']/div")

        for div in div_list:
            author = div.xpath("./div/a[2]/h2/text()").extract()[0]
            content = div.xpath(".//div[@class='content']/span/text()").extract()[0]

            item = QiubaiItem(author=author, content=content)
            yield item
Code for items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QiubaiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()

(2)redis

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

import redis


class QiubaiPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = redis.Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        data_dict = {
            'author': item['author'],
            'content': item['content']
        }
        # redis只能存储扁平的值,先把字典序列化成JSON字符串
        self.conn.lpush('data', json.dumps(data_dict))
        return item

    def close_spider(self, spider):
        pass

(3) Implementation flow:

  a) store the parsed page data in an item object

  b) use the yield keyword to hand the item over to the pipeline

  c) write the storage code in the pipeline file

  d) enable the pipeline in the settings file

 

Case 1: scraping the Chouti front-page news and saving it to a local file

Code for chouti.py

# -*- coding: utf-8 -*-
import scrapy
from ..items import XzxItem
from scrapy.http import Request
# import sys, os, io
# sys.stdout=io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://chouti.com/']

    def parse(self, response):
        # 1. 解析文本中的内容,将标题和简介提取出来
        item_list = response.xpath("//div[@id='content-list']/div[@class='item']")

        for item in item_list:
            title = item.xpath(".//div[@class='part1']/a/text()").extract_first().strip()
            # item.xpath(".//div[@class='part1']/a[0]/text()")
            # summary = item.xpath(".//div[@class='area-summary']/span/text()").extract_first().strip()
            # print(summary)
            href = item.xpath(".//div[@class='part1']/a/@href").extract_first().strip()
            yield XzxItem(title=title, href=href)

        # 2. 获取页面
        page_list = response.xpath("//div[@id='dig_lcpage']//a/@href").extract()
        for url in page_list:
            url = "https://dig.chouti.com" + url
            yield Request(url=url, callback=self.parse)  # 可以通过dont_filter=True关掉去重


 

Code for items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class XzxItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    href = scrapy.Field()

 

Code for settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for xzx project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'xzx'

SPIDER_MODULES = ['xzx.spiders']
NEWSPIDER_MODULE = 'xzx.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'xzx.middlewares.XzxSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'xzx.middlewares.XzxDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'xzx.pipelines.XzxPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

CHOUTI_NEWS_PATH = 'x1.log'

 

Code for pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
'''
1. 先去类中找from_crawler
    有:执行且必须返回一个当前类的对象
    没有:不执行,则去执行构造方法并返回一个对象
2. 再去执行对象其他方法
'''

class XzxPipeline(object):
    def __init__(self, file_path):
        self.f = None
        self.file_path = file_path

    @classmethod
    def from_crawler(cls, crawler):
        file_path = crawler.settings.get("CHOUTI_NEWS_PATH")
        return cls(file_path)

    def open_spider(self, spider):
        '''
        爬虫开始执行时,调用
        :param spider:
        :return:
        '''
        print("爬虫开始了")
        self.f = open(self.file_path, "a+", encoding="utf-8")

    def process_item(self, item, spider):
        '''
        :param item: 爬虫yield过来的item对象,封装:title和href
        :param spider: 爬虫对象
        :return:
        '''
        print(item)
        self.f.write(item['title'] + "\n")
        self.f.flush()
        return item

    def close_spider(self, spider):
        '''
        爬虫关闭时,调用
        :param spider:
        :return:
        '''
        self.f.close()
        print("爬虫结束了")

 

5. Scraping data from multiple urls

Solution: send the follow-up requests manually

import scrapy
from qiubai.items import QiubaiItem


class QiushibaikeSpider(scrapy.Spider):
    name = 'qiushibaike'
    # allowed_domains = ['www.qiushibaike.com/text']
    start_urls = ['https://www.qiushibaike.com/text/']

    # 设计一个通用的url模板
    url = 'https://www.qiushibaike.com/text/page/%d'
    pageNum = 1

    def parse(self, response):
        # 解析段子的内容合作者
        div_list = response.xpath("//div[@id='content-left']/div")

        for div in div_list:
            author = div.xpath("./div/a[2]/h2/text()").extract()[0]
            content = div.xpath(".//div[@class='content']/span/text()").extract()[0]

            item = QiubaiItem(author=author, content=content)
            yield item

        # 请求的手动发送
        if self.pageNum <= 13:
            self.pageNum += 1
            new_ulr = format(self.url % self.pageNum)
            yield scrapy.Request(url=new_ulr, callback=self.parse)

 

6. Scrapy's core components

Engine: handles the data flow of the whole system and triggers events (the core of the framework)

Pipeline (Item Pipeline): processes the entities the spider extracts from pages; its main jobs are persisting entities, validating them, and dropping unneeded data. Once a page has been parsed by the spider, the items are sent to the pipeline and pass through several processing steps in order

Scheduler: receives requests from the engine, pushes them onto a queue, and hands them back when the engine asks again. Think of it as a priority queue of urls (the addresses to crawl) that decides what to fetch next and removes duplicate urls

Downloader: downloads page content and returns it to the spiders (the downloader is built on twisted, an efficient asynchronous model)

Spiders: do the main work, extracting the information you need (items) from specific pages; you can also extract links from them and let Scrapy go on to crawl the next page
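To tie these components to code: everything a spider yields is routed by the engine. Request objects go to the scheduler (and from there to the downloader), while items go to the pipelines. A minimal sketch (the spider name and url are placeholders):

import scrapy


class FlowDemoSpider(scrapy.Spider):
    name = 'flow_demo'
    start_urls = ['https://example.com/']

    def parse(self, response):
        # this response was produced by the downloader and handed back through the engine
        for href in response.xpath('//a/@href').extract():
            # a Request goes back to the engine -> scheduler -> downloader
            yield scrapy.Request(response.urljoin(href), callback=self.parse)
        # a plain dict works as an item and is sent to the item pipelines
        yield {'url': response.url}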

 

7. POST requests

How to send a POST request: you must override the start_requests method

# -*- coding: utf-8 -*-
import scrapy


class PostdemoSpider(scrapy.Spider):
    name = 'postDemo'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://fanyi.baidu.com/sug']

    # 该方法其实是父类中的一个方法:该方法可以对start_urls列表中的元素进行get请求的发送
    # 发起post:
        # 1. 将Request方法中method参数赋值成post
        # 2. FormRequest()可以发起post请求(推荐)
    def start_requests(self):
        data = {
            'kw': 'dog'
        }
        for url in self.start_urls:
            yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse)

    def parse(self, response):
        print(response.text)

 

8. Working with cookies

Goal: log in to Douban and fetch the page data of the user's personal home page (a second-level page)

# -*- coding: utf-8 -*-
import scrapy


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['www.douban.com']
    start_urls = ['https://www.douban.com/accounts/login']

    # 重写start_requests方法
    def start_requests(self):
        # 将请求参数封装到字典
        data = {
            'source': 'index_nav',
            'form_email': 'xxxx',
            'form_password': 'xxx'
        }
        for url in self.start_urls:
            yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse)

    def parseSecondPage(self, response):
        f = open('second.html', 'w', encoding='utf-8')
        f.write(response.text)

    def parse(self, response):
        f = open('main.html', 'w', encoding='utf-8')
        f.write(response.text)

        url = 'https://www.douban.com/people/xxx/'
        yield scrapy.Request(url=url, callback=self.parseSecondPage)

 

9. Using proxies

Flow:

1. Write a custom downloader middleware class

1) a plain class (inheriting from object)

2) override its process_request(self, request, spider) method

2. Enable the downloader middleware in the settings file (see the settings snippet after the code below)

from scrapy import signals

# 自定义一个下载中间件的类,在类中实现process_request(处理中间件拦截到的请求)方法
class MyProxy(object):
    def process_request(self, request, spider):
        # 请求ip的更换
        request.meta['proxy'] = 'http://120.76.77.152:9999'
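Step 2 of the flow above, enabling the middleware, is an entry in settings.py; a sketch under the assumption that the MyProxy class lives in the project's middlewares.py (the module path is a placeholder):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyProxy': 543,
}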

 

10. Log levels

ERROR: errors

WARNING: warnings

INFO: general information

DEBUG: debug information (the default)

Set the log level in settings.py: LOG_LEVEL = 'ERROR'

To write the log output to a file instead of showing it in the terminal, also in settings.py: LOG_FILE = 'log.txt'
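Both settings side by side, as they would appear in settings.py:

# settings.py
LOG_LEVEL = 'ERROR'   # only log errors
LOG_FILE = 'log.txt'  # write the log to a file instead of the terminal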

 

11. Passing data along with requests (request meta)

# -*- coding: utf-8 -*-
import scrapy
from moviePro.items import MovieproItem

class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['www.id97.com']
    start_urls = ['http://www.id97.com/movie']

    # 专门用于解析二级子页面中的数据值
    def parseBySecondPage(self, response):
        actor = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[1]/td[2]/a/text()').extract_first()
        language = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[6]/td[2]/text()').extract_first()
        lastTime = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[8]/td[2]/text()').extract_first()

        # 取出Request方法的meta参数传递过来的字典(response.meta)
        item = response.meta['item']
        item['actor'] = actor
        item['language'] = language
        item['lastTime'] = lastTime

        # 将item提交给管道
        yield item

    def parse(self, response):
        # 名称、类型、导演、语言、片场
        div_list = response.xpath('/html/body/div[1]/div[1]/div[2]/div')
        for div in div_list:
            name = div.xpath('.//div[@class="meta"]/h1/a/text()').extract_first()
            # 如下xpath表达式返回的是一个列表,
            type = div.xpath('.//div[@class="otherinfo"]//text()').extract()
            # 将type列表转化成字符串
            type = "".join(type)
            url = div.xpath('.//div[@class="meta"]/h1/a/@href').extract_first()

            # 创建items对象
            item = MovieproItem()
            item['name'] = name
            item['type'] = type
            # 需要对url发起请求,获取页面数据,进行指定数据解析
            # meta参数只可以赋值一个字典(将item对象先封装到字典中)
            yield scrapy.Request(url=url, callback=self.parseBySecondPage, meta={'item': item})

 

12. CrawlSpider

CrawlSpider is a subclass of Spider. It is more powerful because, on top of Spider, it adds link extractors and rule parsers.

Code:

1) Create a spider file based on CrawlSpider

scrapy genspider -t crawl xxx xxx.com

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']

    # 实例化了一个链接提取器对象
    # 链接提取器:用来提取指定的链接(url)
    # allow参数:赋值一个正则表达式
    # 链接提取器就可以根据正则表达式在页面中提取指定的链接
    # 提取到的链接会全部交给规则解析器
    link = LinkExtractor(allow=r'/all/hot/recent/\d+')
    rules = (
        # 实例化了一个规则解析器对象
        # 规则解析器接受了链接提取器发送的链接后,就会对这些链接发起请求,获取链接对应的页面内容,就会根据指定的规则对页面进行解析
        # callback:指定一个解析规则(方法/函数)
        # follow:是否将链接提取器继续作用到链接提取器提取出的链接所表示的页面数据中
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        print(response.text)
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

 

13. Distributed crawling

1) Concept: run the same spider program on multiple machines to crawl a site's data in a distributed way.

2) Plain scrapy cannot do distributed crawling by itself

  Reasons: a. the scheduler cannot be shared across machines

             b. the pipelines cannot be shared across machines

3) The scrapy-redis component: a set of components developed specifically for scrapy that makes distributed crawling possible

Install: pip install scrapy-redis

4) Distributed crawling workflow:

a. Edit the redis configuration file:

  i. comment out bind 127.0.0.1

  ii. protected-mode no  (turn off protected mode)

b. Start the redis server using that configuration file

c. Create a scrapy project and a CrawlSpider-based spider file

d. Import the RedisCrawlSpider class and rewrite the spider to be based on it

e. Replace start_urls with redis_key

f. Point the project's pipeline and scheduler at the ones provided by scrapy-redis

-- Use the scheduler packaged by scrapy-redis: every url is stored in this shared scheduler, so all machines share one scheduler.

# 使用scrapy-redis组件的去重队列
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# 使用scrapy-redis组件自己的调度器
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# 是否允许暂停
SCHEDULER_PERSIST = True

-- Use the pipeline packaged by scrapy-redis: the data each machine scrapes is written into the redis database through this pipeline, so all machines share one pipeline.

ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

g. Configure the redis server's ip address and port

# 如果redis服务器不在自己本机,则需要如下配置:
# REDIS_HOST = 'redis服务的ip地址'
# REDIS_PORT = 6379

h. Run the spider file:

scrapy runspider xxx.py

i. Push the start url into the scheduler's queue, via redis-cli: lpush <queue name (redis_key)> <start url> (see the sketch below)
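Step i can be done from redis-cli or, equivalently, with a few lines of Python using the redis client (the key name matches the redis_key of the spider below, and the url matches its commented-out start url):

import redis

conn = redis.Redis(host='127.0.0.1', port=6379)
# push the start url into the queue the RedisCrawlSpider listens on
conn.lpush('qiubaispider', 'http://www.qiushibaike.com/pic/')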

 

Code:

The spider file

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redisPro.items import RedisproItem
from scrapy_redis.spiders import RedisCrawlSpider

class QiubaiSpider(RedisCrawlSpider):
    name = 'qiubai'
    # allowed_domains = ['www.qiushibaike.com/pic']
    # start_urls = ['http://www.qiushibaike.com/pic/']
    # 调度器队列的名称

    redis_key = 'qiubaispider' # 表示跟start_urls含义是一样
    link = LinkExtractor(allow=r'/pic/page/\d+')
    rules = (
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')

        for div in div_list:
            img_url = "https:" + div.xpath('.//div[@class="thumb"]/a/img/@src').extract_first()
            item = RedisproItem()
            item['img_url'] = img_url

            yield item

The settings.py file

# -*- coding: utf-8 -*-

# Scrapy settings for redisPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'redisPro'

SPIDER_MODULES = ['redisPro.spiders']
NEWSPIDER_MODULE = 'redisPro.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'redisPro (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'redisPro.middlewares.RedisproSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'redisPro.middlewares.RedisproDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    #'redisPro.pipelines.RedisproPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# 使用scrapy-redis组件自己的调度器
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# 是否允许暂停
SCHEDULER_PERSIST = True

# 如果redis服务器不在自己本机,则需要如下配置:
# REDIS_HOST = 'redis服务的ip地址'
# REDIS_PORT = 6379

The items.py file

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class RedisproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    img_url = scrapy.Field()

 

 

 

 

Extras:

1. Logging in to Chouti and upvoting

# -*- coding: utf-8 -*-
import scrapy
from ..items import XzxItem
from scrapy.http import Request
from scrapy.http.cookies import CookieJar
# import sys, os, io
# sys.stdout=io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/r/ask/hot/1']
    cookie_dict = {}
    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        # 1. 去第一次访问页面中获取cookie
        cookie_jar = CookieJar()
        cookie_jar.extract_cookies(response, response.request)

        for k, v in cookie_jar._cookies.items():
            for i,j in v.items():
                for m,n in j.items():
                    self.cookie_dict[m] = n.value
        yield Request(
            url='https://dig.chouti.com/login',
            method='POST',
            body='phone=00000000&password=xxxxxxx&oneMonth=1',
            cookies=self.cookie_dict,
            headers={
                'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
                'content-type':'application/x-www-form-urlencoded; charset=UTF-8',
            },
            callback=self.check_login
        )

    def check_login(self, response):
        yield Request(url='https://dig.chouti.com/', callback=self.index)

    def index(self, response):
        news_id_list = response.xpath('//div[@id="content-list"]//div[@class="part2"]/@share-linkid').extract()

        for news_id in news_id_list:
            news_url = "https://dig.chouti.com/link/vote?linksId=%s" % (news_id,)
            yield Request(
                url=news_url,
                method="POST",
                cookies=self.cookie_dict,
                callback=self.output,
            )

        page_list = response.xpath('//*[@id="dig_lcpage"]//a/@href').extract()
        for url in page_list:
            url = "https://dig.chouti.com" + url
            yield Request(url=url, callback=self.index)

    def output(self, response):
        print(response.text)

    # def parse(self, response):
    #     # 1. 解析文本中的内容,将标题和简介提取出来
    #     item_list = response.xpath("//div[@id='content-list']/div[@class='item']")
    #
    #     for item in item_list:
    #         title = item.xpath(".//div[@class='part1']/a/text()").extract_first().strip()
    #         # item.xpath(".//div[@class='part1']/a[0]/text()")
    #         # summary = item.xpath(".//div[@class='area-summary']/span/text()").extract_first().strip()
    #         # print(summary)
    #         href = item.xpath(".//div[@class='part1']/a/@href").extract_first().strip()
    #         yield XzxItem(title=title, href=href)
    #
    #     # 2. 获取页面
    #     page_list = response.xpath("//div[@id='dig_lcpage']//a/@href").extract()
    #     for url in page_list:
    #         url = "https://dig.chouti.com" + url
    #         yield Request(url=url, callback=self.parse) # 可以将关掉去重dont_filter=True


 

2. XPath syntax

# hxs = HtmlXPathSelector(response)
# print(hxs)
# hxs = Selector(response=response).xpath('//a')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[2]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/text()').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()
# print(hxs)

 

3. Overriding the dedup (duplicate-filter) rule

from scrapy.dupefilters import BaseDupeFilter
import redis
from scrapy.utils.request import request_fingerprint

class XzxDupefilter(BaseDupeFilter):

    def __init__(self,key):
        self.conn = None
        self.key = key

    @classmethod
    def from_settings(cls, settings):
        key = settings.get('DUP_REDIS_KEY')
        return cls(key)

    def open(self):
        self.conn = redis.Redis(host='127.0.0.1',port=6379)

    def request_seen(self, request):
        fp = request_fingerprint(request)
        added = self.conn.sadd(self.key, fp)
        return added == 0
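For this filter to actually be used, it has to be enabled in settings.py; a sketch under the assumption that the class lives in a dupefilters.py module inside the xzx project (the key value is a placeholder):

# settings.py
DUPEFILTER_CLASS = 'xzx.dupefilters.XzxDupefilter'
DUP_REDIS_KEY = 'visited_urls'   # the key read by from_settings above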

 

 

 

 

 

 

 

 

 
