Python Web Scraping Study Notes

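Contents

  • Python Web Scraping Study Notes
    • Preface
    • Chapter 1: Getting to Know Web Crawlers
      • 1.1 What Is a Crawler
      • 1.2 Required Software
      • 1.3 A First Small Crawler
      • 1.4 Anatomy of a Web Request
      • 1.5 The HTTP Protocol
      • 1.6 Getting Started with Requests
        • 1.6.1 Scraping a Sogou Search Page
        • 1.6.2 Fetching Baidu Translate Results
        • 1.6.3 Scraping the Douban Movie Chart
      • 1.7 Closing resp
    • Chapter 2: Parsing and Extracting Data
      • 2.1 Overview of Data Parsing
      • 2.2 Parsing with re: Regular Expressions
      • 2.3 Using Python's re Module
      • 2.4 Scraping the Douban Top 250 Movie Chart
      • 2.5 Scraping Movie Information from Dytt (电影天堂)
      • 2.5 Before bs4: HTML Syntax Rules
      • 2.6 Getting Started with bs4: Vegetable Prices
      • 2.7 bs4 Case Study: Downloading Images from Umei (优美图库)
      • 2.8 Getting Started with XPath
      • 2.9 XPath in Practice: Scraping ZBJ (猪八戒网)
    • Chapter 3: Advanced Requests
      • 3.1 Overview of Advanced Requests
      • 3.2 Handling Cookies: Logging In to a Novel Site
      • 3.3 Anti-Hotlinking: Scraping Pear Video (梨视频)
      • 3.4 Proxies
      • 3.5 Comprehensive Exercise: Scraping NetEase Cloud Music Comments
    • Chapter 4: Asynchronous Programming
      • 4.1 Chapter 4 Overview
      • 4.2 Multithreading
      • 4.3 Multiprocessing
      • 4.4 Introduction to Thread Pools and Process Pools
      • 4.5 Thread Pool Case Study: Scraping Xinfadi Vegetable Prices
      • 4.6 Coroutines
        • 4.6.1 The Concept of Coroutines
        • 4.6.2 Multitask Asynchronous Interaction
        • 4.6.3 About Async Coroutines: Deprecation Warning
      • 4.7 Asynchronous HTTP Requests with the aiohttp Module
      • 4.8 Async Crawler in Practice: Scraping an Entire Novel
      • 4.9 Scraping Video
        • 4.9.1 Comprehensive Exercise: How Video Sites Work
        • 4.9.2 Scraping Yunbo TV (云播TV): Simple Version
    • Chapter 5: selenium
      • 5.1 Introducing selenium
      • 5.2 selenium Operations: Scraping Lagou (拉钩)
      • 5.3 selenium Operations: Switching Between Windows
      • 5.4 selenium Operations: Headless Browser
      • 5.5 selenium Operations: Solving CAPTCHAs with Chaojiying (超级鹰)
      • 5.6 selenium: Using Chaojiying to Log In to Chaojiying
      • 5.7 selenium: Handling the 12306 Login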

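Preface

I took these notes while following the web-scraping video course by the Bilibili uploader 路飞学城IT; see the original videos on Bilibili for the full content. The article is for reference only, and corrections are welcome. Some of the sites used in the examples may no longer work, so substitute a suitable site of your own where needed.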

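Chapter 1: Getting to Know Web Crawlers

1.1 What Is a Crawler

A crawler fetches resources from the internet, such as images, text, and video. It works by using code to imitate a browser and download those resources. A crawler does not have to be written in Python; Java, C, and other languages work too. Python is the usual choice because its syntax is concise and its rich set of third-party libraries makes scraping much easier.

What is robots.txt? It is a file in which a site states which paths crawlers may and may not fetch. To view it, append "/robots.txt" to the site's root URL, for example http://www.bilibili.com/robots.txt.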
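Python's standard library can read these rules for you. The sketch below parses an in-memory robots.txt with urllib.robotparser and asks whether two URLs may be fetched. The rules, the crawler name "my-crawler", and the example.com URLs are made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse rules from a list of lines

# can_fetch(user_agent, url) applies the first rule whose path matches the URL.
public_ok = parser.can_fetch("my-crawler", "http://example.com/index.html")
private_ok = parser.can_fetch("my-crawler", "http://example.com/private/data.html")
print(public_ok, private_ok)
```

For a real site, set_url() followed by read() downloads the file instead of parsing a string.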

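1.2 Required Software

  • Python 3.8
  • PyCharm or another IDE or editor
  • Modules such as requests and urllib (urllib ships with Python; requests is installed with pip)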

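1.3 A First Small Crawler

The first crawler simply downloads the Baidu home page and saves it to a file.

from urllib.request import urlopen

url = "http://www.baidu.com"
resp = urlopen(url)

# Pass encoding="utf-8" explicitly: the Baidu page is UTF-8, but on Windows open()
# defaults to the system locale encoding (gbk on Chinese systems), which would garble the saved page.
with open("myBaidu.html", mode="w", encoding="utf-8") as f:
    f.write(resp.read().decode("utf-8"))
resp.close()
print('success!')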

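1.4 Anatomy of a Web Request

  • Server-side rendering: the server merges the data into the HTML and returns the finished page to the client, so the data is visible in the page source.
  • Client-side rendering: the first request returns only an HTML skeleton and a second request fetches the data, which the browser then displays, so the data is not visible in the page source.

Get comfortable with the browser's packet-capture tool: press F12 and open the Network panel.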

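1.5 The HTTP Protocol

A protocol is an agreed set of rules that lets two computers communicate smoothly. Common protocols include TCP/IP, SOAP, HTTP, and SMTP.

HTTP stands for Hyper Text Transfer Protocol. It is the protocol used to transfer hypertext from World Wide Web (WWW) servers to the local browser; put simply, browsers and servers follow HTTP when they exchange data.

HTTP divides every message into three parts, whether it is a request or a response.

Request:

Request line    -> request method, request URL, protocol
Request headers -> extra information for the server

Request body    -> usually the request parameters

Response:

Status line      -> protocol, status code
Response headers -> extra information for the client

Response body    -> the content the client actually needs (HTML, JSON)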
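To make the three parts concrete, here is a minimal sketch that splits a hand-written raw request into its request line, headers, and body. The helper name split_http_message and the sample message (including the "demo-client" User-Agent) are assumptions for illustration; real code would rely on an HTTP library rather than parsing by hand.

```python
def split_http_message(raw: str):
    """Split a raw HTTP message into its start line, a headers dict, and the body."""
    # A blank line (CRLF CRLF) separates the header section from the body.
    head, _, body = raw.partition("\r\n\r\n")
    start_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return start_line, headers, body


# Hand-written sample request, for illustration only.
raw_request = (
    "POST /sug HTTP/1.1\r\n"
    "Host: fanyi.baidu.com\r\n"
    "User-Agent: demo-client\r\n"
    "\r\n"
    "kw=dog"
)

start_line, headers, body = split_http_message(raw_request)
print(start_line)  # request line: method, URL, protocol
print(headers)     # request headers
print(body)        # request body
```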

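The request headers that matter most for scraping:

  1. User-Agent: identifies the client that sent the request (what the request was sent with)
  2. Referer: anti-hotlinking (which page did this request come from? used by anti-scraping checks)
  3. Cookie: locally stored string data (login information, anti-scraping tokens)

Important response headers:

  1. Set-Cookie: cookies the server asks the client to store (login information, anti-scraping tokens)
  2. Assorted opaque strings (spotting these takes experience; they usually contain the word "token" and guard against attacks and scraping)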
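As a small illustration of the request headers listed above, the sketch below attaches them to a request object from the standard library and prints them back. Nothing is sent over the network; the example.com URLs and the cookie value are hypothetical placeholders.

```python
from urllib.request import Request

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36",
    "Referer": "https://www.example.com/list",  # hypothetical page we "came from"
    "Cookie": "sessionid=abc123",               # hypothetical login cookie
}

# Build the request object only; it is never sent.
req = Request("https://www.example.com/detail", headers=headers)

# urllib stores header names capitalized, e.g. "User-agent".
for name, value in req.header_items():
    print(f"{name}: {value}")
```

The same dict can be passed to requests through its headers argument, as the examples in section 1.6 do.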

1.6 Getting Started with Requests

1.6.1 Scraping a Sogou Search Page

First install the requests module: pip install requests

import requests

url = 'https://www.sogou.com/web?query=周杰伦'
# Copy the User-Agent from the browser: F12 -> Network -> first request -> Request Headers -> user-agent.
# It makes the request identify itself as a browser.
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36"
}
resp = requests.get(url, headers=headers)

print(resp)
print(resp.text)

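1.6.2 Fetching Baidu Translate Results

On Baidu Translate, the URL that returns translation results is https://fanyi.baidu.com/sug

This endpoint uses the POST method: the word to translate is uploaded and the translation is returned. POST parameters are passed through the data argument.

import requests

url = "https://fanyi.baidu.com/sug"
text = input("Enter the English word to translate: ")
data = {
    "kw": text
}
# Send a POST request; the form data goes in a dict passed through the data argument
resp = requests.post(url, data=data)
print(resp.json())  # parse the JSON returned by the server into a dict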
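To see what a dict passed as form data turns into, here is a rough standard-library sketch that form-encodes the same kind of dict and builds (but does not send) a request. It uses urllib rather than requests, and the word "dog" is an arbitrary example, so treat it as an approximation of what happens under the hood.

```python
from urllib.parse import urlencode
from urllib.request import Request

data = {"kw": "dog"}  # arbitrary example word

# Form-encode the dict and convert it to bytes for the request body.
body = urlencode(data).encode("utf-8")

# A urllib Request that carries a body defaults to the POST method.
req = Request("https://fanyi.baidu.com/sug", data=body)

print(req.get_method())
print(body)
```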

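1.6.3 Scraping the Douban Movie Chart

When a crawler does not work, the first thing to try is setting the User-Agent. By default requests sends a User-Agent such as python-requests/2.25.1, which is not a browser identifier.

This example uses the GET method to fetch the Douban movie chart. GET parameters are passed through the params argument.

import requests

url = "https://movie.douban.com/j/chart/top_list"
# Query-string parameters, kept in a dict instead of being written into the URL
param = {
    "type": "24",
    "interval_id": "100:90",
    "action": "",
    "start": 0,
    "limit": 20
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 "
                  "Safari/537.36"
}

resp = requests.get(url, params=param, headers=headers)
print(resp.json())
resp.close()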
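The parameter dict ends up as the query string of the URL. The sketch below shows roughly how, using the standard library's urlencode instead of requests; the variable names base_url, full_url, and query are my own.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "https://movie.douban.com/j/chart/top_list"
param = {
    "type": "24",
    "interval_id": "100:90",
    "action": "",
    "start": 0,
    "limit": 20,
}

# urlencode joins key=value pairs with "&" and percent-encodes special characters.
full_url = base_url + "?" + urlencode(param)
print(full_url)

# Parsing the query string back recovers the original values.
query = parse_qs(urlsplit(full_url).query, keep_blank_values=True)
print(query["interval_id"])
```

Note that the colon in "100:90" is percent-encoded as %3A in the URL and decoded again on the way back.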

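1.7 Closing resp

At the end of the program, close resp to release the connection. If responses are left open, repeated requests may eventually start to fail, so add resp.close() at the end. The same applies to files: close them when you are done.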
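A with block is one way to avoid forgetting the close call: the object is closed automatically when the block ends, even if an exception is raised inside it. The sketch below shows this for a file written to a temporary directory (the file name is arbitrary); requests response objects can be used in a with statement in the same way.

```python
import os
import tempfile

# Write into a fresh temporary directory so the example leaves no stray files behind.
path = os.path.join(tempfile.mkdtemp(), "demo.html")

with open(path, mode="w", encoding="utf-8") as f:
    f.write("<h1>Hello</h1>")
    open_inside = not f.closed  # still open inside the block

# After the block, the file has been closed without an explicit f.close().
print(open_inside, f.closed)
```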

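Chapter 2: Parsing and Extracting Data

2.1 Overview of Data Parsing

The previous chapter covered the basics of fetching a whole web page. Most of the time, though, we need only a small part of the page rather than all of it, which raises the question of how to extract data.

This course presents three parsing approaches:

  1. re parsing
  2. bs4 parsing
  3. xpath parsing

The three can be mixed freely. The work is entirely result-driven: as long as you get the data you want, the approach does not matter. Think about performance only after you are comfortable with all three.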

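2.2 Parsing with re: Regular Expressions

A regular expression is a syntax for matching strings against a pattern.

The page source we fetch is essentially one very long string, so regular expressions are a natural fit for extracting content from it.

Advantages of regular expressions: fast, efficient, and precise

Disadvantages of regular expressions: a steep learning curve for beginners

Regex syntax: metacharacters are combined to match strings. Patterns can be tested online at http://tool.oschina.net/regex/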

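Metacharacters: special symbols with a fixed meaning

Common metacharacters:

.	matches any character except a newline
\w	matches a letter, digit, or underscore
\s	matches any whitespace character
\d	matches a digit
\n	matches a newline
\t	matches a tab

^	matches the start of the string
$	matches the end of the string

\W	matches anything that is not a letter, digit, or underscore
\S	matches any non-whitespace character
\D	matches any non-digit
a|b	matches the character a or the character b
()	matches the expression inside the parentheses and also defines a group
[...]	matches any character in the set
[^...]	matches any character not in the set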

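Quantifiers: control how many times the preceding element repeats

*	zero or more times
+	one or more times
?	zero or one time
{n}	exactly n times
{n,}	n or more times
{n,m}	between n and m times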
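A few of the metacharacters and quantifiers above can be tried directly with re.findall. The sample string below is made up for the demonstration.

```python
import re

# Made-up sample text containing digits, words, spaces, and a tab.
text = "order_42 shipped on 2021-03-15\tto room 7B"

digits = re.findall(r"\d+", text)         # one or more digits
words = re.findall(r"\w+", text)          # runs of letters, digits, or underscores
four_digits = re.findall(r"\d{4}", text)  # exactly four digits in a row
either = re.findall(r"room|order", text)  # alternation: "room" or "order"
spaces = re.findall(r"\s", text)          # every single whitespace character
symbols = re.findall(r"[^\w\s]", text)    # anything that is neither word nor whitespace

print(digits)
print(words)
print(four_digits)
print(either)
print(len(spaces))
print(symbols)
```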

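Greedy and lazy matching

.*		greedy matching
.*?		lazy matching

Lazy matching is the form used most often in scraping, so it deserves close attention.

Lazy matching matches as little as possible. For example:

str: 玩儿吃鸡游戏,晚上一起上游戏,干嘛呢?打游戏啊
reg: 玩儿.*?游戏
# How it works: the regex first matches "玩儿". The lazy ".*?" then consumes as few characters as possible, growing one character at a time until the following "游戏" can match, so the match stops at the first "游戏". The greedy ".*" would instead run on to the last "游戏" and match "玩儿吃鸡游戏,晚上一起上游戏,干嘛呢?打游戏".
Result: 玩儿吃鸡游戏

str: <div class="jay">周杰伦</div><div class="jj">林俊杰</div>
reg: <div class=".*?">.*?</div>
Result: <div class="jay">周杰伦</div>
	<div class="jj">林俊杰</div>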
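The two examples above can be checked in Python. The sketch below runs the greedy and lazy versions of each pattern side by side on the same strings; the variable names are my own.

```python
import re

text = "玩儿吃鸡游戏,晚上一起上游戏,干嘛呢?打游戏啊"

greedy = re.search(r"玩儿.*游戏", text).group()  # runs on to the LAST "游戏"
lazy = re.search(r"玩儿.*?游戏", text).group()   # stops at the FIRST "游戏"
print(greedy)
print(lazy)

html = '<div class="jay">周杰伦</div><div class="jj">林俊杰</div>'

lazy_divs = re.findall(r'<div class=".*?">.*?</div>', html)  # one match per div
greedy_divs = re.findall(r'<div class=".*">.*</div>', html)  # a single match spanning both divs
print(lazy_divs)
print(greedy_divs)
```

The greedy div pattern swallows both elements in one match, which is why scraping patterns almost always use the lazy form.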

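2.3 Using Python's re Module

With the syntax covered, how are regular expressions used in a program?

import re

# findall: returns every match in the string as a list
lst = re.findall(r"\d+", "我的电话号是10086,我的女朋友电话号是10010")
print(lst)

# finditer: returns an iterator over every match; call .group() on each item to get the text
it = re.finditer(r"\d+", "我的电话号是10086,我的女朋友电话号是10010")
for i in it:
    print(i.group())

# search: returns as soon as it finds one result; the result is a match object, call .group() to get the text
s = re.search(r"\d+", "我的电话号是10086,我的女朋友电话号是10010")
print(s.group())

# match: matches only from the start of the string; this string starts with Chinese text,
# so nothing matches and match() returns None
s = re.match(r"\d+", "我的电话号是10086,我的女朋友电话号是10010")
print(s)  # None; calling s.group() here would raise AttributeError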

When a regular expression is long, it can also be compiled ahead of time.

# Precompile the regular expression
obj = re.compile(r"\d+")

ret = obj.finditer("我的电话号是10086,我的女朋友电话号是10010")
for it in ret:
    print(it.group())

ret = obj.findall("sadadsa223dawswefq123fasdigjoihuiohuiogsdf")
print(ret)

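How do we pull a specific piece out of the matched text?

import re

# NOTE: the HTML tags in this sample were stripped during extraction; the div class
# values and span ids below are reconstructed placeholders.
s = """
<div class='jay'><span id='1'>张富帅</span></div>
<div class='jj'><span id='2'>张富贵</span></div>
<div class='jolin'><span id='3'>吕富帅</span></div>
<div class='sylar'><span id='4'>小狗头</span></div>
<div class='tory'><span id='5'>小煞笔</span></div>
"""

# (?P<group_name>pattern) extracts a named piece from the text matched by the regex
obj = re.compile(r"<div class='.*?'><span id='(?P<id>\d+)'>(?P<name>.*?)</span></div>", re.S)  # re.S lets . match newlines
res = obj.finditer(s)
for it in res:
    print(it.group("name"))
    print(it.group("id"))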
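Named groups can also be collected in one call with groupdict(), which is handy when each match should become one CSV row. The sketch below is a minimal, self-contained illustration: the sample HTML is made up, and an in-memory io.StringIO buffer stands in for a real file.

```python
import csv
import io
import re

# Made-up sample markup, for illustration only.
html = "<span id='1'>Alice</span><span id='2'>Bob</span>"
obj = re.compile(r"<span id='(?P<id>\d+)'>(?P<name>.*?)</span>", re.S)

buffer = io.StringIO()  # stands in for a file opened with open(...)
writer = csv.writer(buffer)

rows = []
for m in obj.finditer(html):
    row = m.groupdict()            # e.g. {"id": "1", "name": "Alice"}
    rows.append(row)
    writer.writerow(row.values())  # one CSV line per match

print(rows)
print(buffer.getvalue())
```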

2.4 Scraping the Douban Top 250 Movie Chart

  1. Get the page source with requests
  2. Extract the information we want with re

import requests
import re
import csv

# Fetch the page
url = "http://movie.douban.com/top250"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 "
                  "Safari/537.36"
}
resp = requests.get(url, headers=headers)
page_content = resp.text

# Parse the data: title, director, year, rating, and number of ratings
obj = re.compile(r'
  • .*?
    .*?(?P.*?)</span>.*?'</span> <span class="token string">r'<p class="">.*?导演: (?P<director>.*?) .*?<br>(?P<year>.*?) .*?'</span> <span class="token string">r'<span class="rating_num" property="v:average">(?P<average>.*?)</span>.*?'</span> <span class="token string">r'<span>(?P<people>.*?)</span>'</span><span class="token punctuation">,</span> re<span class="token punctuation">.</span>S<span class="token punctuation">)</span> <span class="token comment"># 开始匹配</span> result <span class="token operator">=</span> obj<span class="token punctuation">.</span>finditer<span class="token punctuation">(</span>page_content<span class="token punctuation">)</span> f <span class="token operator">=</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"top250.csv"</span><span class="token punctuation">,</span> mode<span class="token operator">=</span><span class="token string">"w"</span><span class="token punctuation">,</span> encoding<span class="token operator">=</span><span class="token string">'utf-8'</span><span class="token punctuation">)</span> csvwriter <span class="token operator">=</span> csv<span class="token punctuation">.</span>writer<span class="token punctuation">(</span>f<span class="token punctuation">)</span> <span class="token keyword">for</span> it <span class="token keyword">in</span> result<span class="token punctuation">:</span> <span class="token comment"># print(it.group("title"))</span> <span class="token comment"># print(it.group("director").strip())</span> <span class="token comment"># print(it.group("year").strip())</span> <span class="token comment"># print('评分'+it.group("average"))</span> <span class="token comment"># print(it.group('people'))</span> dic <span class="token operator">=</span> it<span class="token punctuation">.</span>groupdict<span class="token punctuation">(</span><span class="token punctuation">)</span> dic<span class="token punctuation">[</span><span class="token 
string">'director'</span><span class="token punctuation">]</span> <span class="token operator">=</span> dic<span class="token punctuation">[</span><span class="token string">'director'</span><span class="token punctuation">]</span><span class="token punctuation">.</span>strip<span class="token punctuation">(</span><span class="token punctuation">)</span> dic<span class="token punctuation">[</span><span class="token string">'year'</span><span class="token punctuation">]</span> <span class="token operator">=</span> dic<span class="token punctuation">[</span><span class="token string">'year'</span><span class="token punctuation">]</span><span class="token punctuation">.</span>strip<span class="token punctuation">(</span><span class="token punctuation">)</span> csvwriter<span class="token punctuation">.</span>writerow<span class="token punctuation">(</span>dic<span class="token punctuation">.</span>values<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'success'</span><span class="token punctuation">)</span> </code></pre> <h3>2.5 屠戮电影天堂电影信息</h3> <ol> <li>定位到2021必看热片</li> <li>从2021必看热片中提取电影子页面的链接地址</li> <li>请求子页面中的链接地址,拿到我们想要的下载磁链接</li> </ol> <pre><code class="prism language-python"><span class="token keyword">import</span> requests <span class="token keyword">import</span> re <span class="token comment"># 定位阶段</span> domain <span class="token operator">=</span> <span class="token string">"https://dytt89.com/"</span> resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>domain<span class="token punctuation">,</span> verify<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span> <span class="token comment"># verify=False 去掉安全验证</span> resp<span class="token 
punctuation">.</span>encoding <span class="token operator">=</span> <span class="token string">'gb2312'</span> <span class="token comment"># 指定字符集</span> <span class="token comment"># print(resp.text)</span> <span class="token comment"># 提取阶段 拿到<ul>里面的<li></span> obj1 <span class="token operator">=</span> re<span class="token punctuation">.</span><span class="token builtin">compile</span><span class="token punctuation">(</span><span class="token string">r"2021必看热片.*?<ul>(?P<ul>.*?)</ul>"</span><span class="token punctuation">,</span> re<span class="token punctuation">.</span>S<span class="token punctuation">)</span> obj2 <span class="token operator">=</span> re<span class="token punctuation">.</span><span class="token builtin">compile</span><span class="token punctuation">(</span><span class="token string">r"<a href='(?P<href>.*?)'"</span><span class="token punctuation">,</span> re<span class="token punctuation">.</span>S<span class="token punctuation">)</span> obj3 <span class="token operator">=</span> re<span class="token punctuation">.</span><span class="token builtin">compile</span><span class="token punctuation">(</span><span class="token string">r'◎片  名 (?P<movie>.*?)<br />.*?<td style="WORD-WRAP: break-word" bgcolor="#fdfddf"><a href="('</span> <span class="token string">r'?P<download>.*?)"'</span><span class="token punctuation">,</span> re<span class="token punctuation">.</span>S<span class="token punctuation">)</span> result1 <span class="token operator">=</span> obj1<span class="token punctuation">.</span>finditer<span class="token punctuation">(</span>resp<span class="token punctuation">.</span>text<span class="token punctuation">)</span> child_href_list <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span> <span class="token keyword">for</span> it <span class="token keyword">in</span> result1<span class="token punctuation">:</span> ul <span class="token operator">=</span> it<span 
class="token punctuation">.</span>group<span class="token punctuation">(</span><span class="token string">'ul'</span><span class="token punctuation">)</span> <span class="token comment"># 提取子页面链接</span> result2 <span class="token operator">=</span> obj2<span class="token punctuation">.</span>finditer<span class="token punctuation">(</span>ul<span class="token punctuation">)</span> <span class="token keyword">for</span> itt <span class="token keyword">in</span> result2<span class="token punctuation">:</span> <span class="token comment"># 拼接子页面的url地址:域名+子页面地址</span> child_href <span class="token operator">=</span> domain <span class="token operator">+</span> itt<span class="token punctuation">.</span>group<span class="token punctuation">(</span><span class="token string">'href'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>strip<span class="token punctuation">(</span><span class="token string">'/'</span><span class="token punctuation">)</span> child_href_list<span class="token punctuation">.</span>append<span class="token punctuation">(</span>child_href<span class="token punctuation">)</span> <span class="token comment"># 把子页面链接存储起来</span> <span class="token comment"># 提取子页面内容</span> <span class="token keyword">for</span> href <span class="token keyword">in</span> child_href_list<span class="token punctuation">:</span> child_resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>href<span class="token punctuation">,</span> verify<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span> child_resp<span class="token punctuation">.</span>encoding <span class="token operator">=</span> <span class="token string">'gb2312'</span> result3 <span class="token operator">=</span> obj3<span class="token punctuation">.</span>search<span class="token punctuation">(</span>child_resp<span class="token 
punctuation">.</span>text<span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span>result3<span class="token punctuation">.</span>group<span class="token punctuation">(</span><span class="token string">'movie'</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span>result3<span class="token punctuation">.</span>group<span class="token punctuation">(</span><span class="token string">'download'</span><span class="token punctuation">)</span><span class="token punctuation">)</span> </code></pre> <h3>2.5 Bs解析前戏-Html语法规则</h3> <p>bs4解析比较简单,但是需要一定的html知识,然后再去使用bs4去提取,逻辑和编写难度就会非常简单清晰,有前端基础的可略过</p> <p>HTML(Hyper Text Markup Language)超文本标记语言,是我们编写网页的最基本也是最核心的一种语言。其语法规则就是用不同的标签对网页上的内容进行标记,从而使网页显示出不同的展示效果。</p> <pre><code class="prism language-html"><span class="token tag"><span class="token tag"><span class="token punctuation"><</span>h1</span><span class="token punctuation">></span></span> Hello World! <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>h1</span><span class="token punctuation">></span></span> </code></pre> <p>上述代码的含义是在页面显示“Hello World!”一句,但是这句话被</p> <h1>和</h1>标记了。白话就是括起来了,被H1标签括起来了。这个时候,浏览器在展示的时候就会让“Hello World!”这句话加粗加大,变为标题,所以HTML的语法就是用类似这样的标签对页面内容进行标记。不同的标签表现出来的效果也是不一样的。 <p></p> <pre><code class="prism language-html">h1:一级标题 h2:二级标题 p:段落 font:字体(已被废弃,但还能用) body:主体 </code></pre> <p>标签还有很多,这里就不一一列举。接下来是属性</p> <pre><code class="prism language-html"><span class="token tag"><span class="token tag"><span class="token punctuation"><</span>h1</span> <span class="token attr-name">align</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">'</span>center<span class="token punctuation">'</span></span><span class="token punctuation">></span></span> Hello World! 
<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>h1</span><span class="token punctuation">></span></span> <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>li</span> <span class="token attr-name">id</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">'</span>1<span class="token punctuation">'</span></span><span class="token punctuation">></span></span>a<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>li</span><span class="token punctuation">></span></span> <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>li</span> <span class="token attr-name">id</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">'</span>2<span class="token punctuation">'</span></span><span class="token punctuation">></span></span>b<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>li</span><span class="token punctuation">></span></span> <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>li</span> <span class="token attr-name">id</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">'</span>3<span class="token punctuation">'</span></span><span class="token punctuation">></span></span>c<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>li</span><span class="token punctuation">></span></span> </code></pre> <p>其中"align"就是标签属性,"center"就是属性值,后续的bs4解析就是可以根据id的属性值进行检索。</p> <h3>2.6 Bs4解析入门-搞搞菜价</h3> <p>首先pip install bs4安装模块</p> <ol> <li>拿到页面源代码</li> <li>使用bs4进行解析 拿到数据</li> </ol> <p>视频中的网站源代码已改变,因此这里选用的url是:http://www.bjtzh.gov.cn/bjtz/home/jrcj/index.shtml,最后结果类似</p> <pre><code class="prism language-python"><span class="token 
keyword">import</span> requests <span class="token keyword">from</span> bs4 <span class="token keyword">import</span> BeautifulSoup <span class="token keyword">import</span> csv url <span class="token operator">=</span> <span class="token string">"http://www.bjtzh.gov.cn/bjtz/home/jrcj/index.shtml"</span> resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">)</span> resp<span class="token punctuation">.</span>encoding <span class="token operator">=</span> <span class="token string">'utf-8'</span> f <span class="token operator">=</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"vegetable_price.csv"</span><span class="token punctuation">,</span> mode<span class="token operator">=</span><span class="token string">"w"</span><span class="token punctuation">,</span> encoding<span class="token operator">=</span><span class="token string">'utf-8'</span><span class="token punctuation">)</span> csvwriter <span class="token operator">=</span> csv<span class="token punctuation">.</span>writer<span class="token punctuation">(</span>f<span class="token punctuation">)</span> <span class="token comment"># 解析数据</span> <span class="token comment"># 1.把页面源代码交给BeautifulSoup进行处理,生成bs对象</span> page <span class="token operator">=</span> BeautifulSoup<span class="token punctuation">(</span>resp<span class="token punctuation">.</span>text<span class="token punctuation">,</span> <span class="token string">"html.parser"</span><span class="token punctuation">)</span> <span class="token comment"># 指定html解析器</span> <span class="token comment"># 2.从bs对象中查找数据</span> <span class="token comment"># find(标签,属性=值) 只找第一个</span> <span class="token comment"># findall(标签,属性=值) 找到所有的</span> table <span class="token operator">=</span> page<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span 
class="token string">"table"</span><span class="token punctuation">,</span> attrs<span class="token operator">=</span><span class="token punctuation">{ </span> <span class="token string">"style"</span><span class="token punctuation">:</span> <span class="token string">"margin: 0px auto; width: 588px; height: 847px; border-collapse: collapse;"</span><span class="token punctuation">,</span> <span class="token string">"width"</span><span class="token punctuation">:</span> <span class="token string">"588"</span><span class="token punctuation">,</span> <span class="token string">"cellspacing"</span><span class="token punctuation">:</span> <span class="token string">"0"</span><span class="token punctuation">,</span> <span class="token string">"cellpadding"</span><span class="token punctuation">:</span> <span class="token string">"0"</span><span class="token punctuation">,</span> <span class="token string">"border"</span><span class="token punctuation">:</span> <span class="token string">"1"</span><span class="token punctuation">,</span> <span class="token string">"align"</span><span class="token punctuation">:</span> <span class="token string">"center"</span> <span class="token punctuation">}</span><span class="token punctuation">)</span> <span class="token comment"># 拿到所有数据行</span> trs <span class="token operator">=</span> table<span class="token punctuation">.</span>find_all<span class="token punctuation">(</span><span class="token string">"tr"</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">7</span><span class="token punctuation">:</span><span class="token punctuation">]</span> <span class="token keyword">for</span> tr <span class="token keyword">in</span> trs<span class="token punctuation">:</span> <span class="token comment"># 每一行数据</span> tds <span class="token operator">=</span> tr<span class="token punctuation">.</span>find_all<span class="token punctuation">(</span><span class="token 
string">'td'</span><span class="token punctuation">)</span> <span class="token comment"># 拿到每行数据中的td</span> name <span class="token operator">=</span> tds<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">.</span>text <span class="token comment"># .text表示拿到被标签标记的内容</span> kind <span class="token operator">=</span> tds<span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">.</span>text high <span class="token operator">=</span> tds<span class="token punctuation">[</span><span class="token number">2</span><span class="token punctuation">]</span><span class="token punctuation">.</span>text low <span class="token operator">=</span> tds<span class="token punctuation">[</span><span class="token number">3</span><span class="token punctuation">]</span><span class="token punctuation">.</span>text csvwriter<span class="token punctuation">.</span>writerow<span class="token punctuation">(</span><span class="token punctuation">[</span>name<span class="token punctuation">,</span> kind<span class="token punctuation">,</span> high<span class="token punctuation">,</span> low<span class="token punctuation">]</span><span class="token punctuation">)</span> f<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'success!'</span><span class="token punctuation">)</span> </code></pre> <h3>2.7 Bs4解析案例-抓取优美图库图片</h3> <ol> <li>拿到主页面的源代码 提取子页面的链接地址 href</li> <li>通过href拿到子页面的内容,从子页面找到图片的下载地址 img->src</li> <li>下载图片</li> </ol> <pre><code class="prism language-python"><span class="token keyword">import</span> requests <span class="token keyword">from</span> bs4 <span class="token keyword">import</span> BeautifulSoup <span class="token 
keyword">import</span> time url_index <span class="token operator">=</span> <span class="token string">"https://umei.cc"</span> url <span class="token operator">=</span> <span class="token string">"https://umei.cc/bizhitupian/weimeibizhi/"</span> resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">)</span> resp<span class="token punctuation">.</span>encoding <span class="token operator">=</span> <span class="token string">"utf-8"</span> <span class="token comment"># 把源代码交给BeautifulSoup</span> main_page <span class="token operator">=</span> BeautifulSoup<span class="token punctuation">(</span>resp<span class="token punctuation">.</span>text<span class="token punctuation">,</span> <span class="token string">"html.parser"</span><span class="token punctuation">)</span> a_list <span class="token operator">=</span> main_page<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">"div"</span><span class="token punctuation">,</span> class_<span class="token operator">=</span><span class="token string">"TypeList"</span><span class="token punctuation">)</span><span class="token punctuation">.</span>find_all<span class="token punctuation">(</span><span class="token string">"a"</span><span class="token punctuation">)</span> <span class="token comment"># print(a_list)</span> <span class="token keyword">for</span> a <span class="token keyword">in</span> a_list<span class="token punctuation">:</span> href <span class="token operator">=</span> url_index <span class="token operator">+</span> a<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'href'</span><span class="token punctuation">)</span> <span class="token comment"># 直接通过get就可以直接拿到属性值</span> <span class="token comment"># 拿到子页面源代码</span> child_resp <span class="token operator">=</span> 
requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>href<span class="token punctuation">)</span> child_resp<span class="token punctuation">.</span>encoding <span class="token operator">=</span> <span class="token string">"utf-8"</span> <span class="token comment"># 从子页面拿到图片下载链接</span> child_page <span class="token operator">=</span> BeautifulSoup<span class="token punctuation">(</span>child_resp<span class="token punctuation">.</span>text<span class="token punctuation">,</span> <span class="token string">"html.parser"</span><span class="token punctuation">)</span> p <span class="token operator">=</span> child_page<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">"p"</span><span class="token punctuation">,</span> align<span class="token operator">=</span><span class="token string">"center"</span><span class="token punctuation">)</span> img <span class="token operator">=</span> p<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">"img"</span><span class="token punctuation">)</span> src <span class="token operator">=</span> img<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"src"</span><span class="token punctuation">)</span> <span class="token comment"># 下载图片</span> img_resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>src<span class="token punctuation">)</span> <span class="token comment"># img_resp.content # 这里拿到的是字节</span> img_name <span class="token operator">=</span> src<span class="token punctuation">.</span>split<span class="token punctuation">(</span><span class="token string">"/"</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token operator">-</span><span class="token number">1</span><span class="token 
punctuation">]</span> <span class="token comment"># 切割 拿到url中的最后一个/以后的内容</span> <span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"Wallpaper/"</span><span class="token operator">+</span>img_name<span class="token punctuation">,</span> mode<span class="token operator">=</span><span class="token string">'wb'</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span> f<span class="token punctuation">.</span>write<span class="token punctuation">(</span>img_resp<span class="token punctuation">.</span>content<span class="token punctuation">)</span> <span class="token comment"># 图片内容写入文件</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"success!"</span><span class="token punctuation">,</span> img_name<span class="token punctuation">)</span> time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span> <span class="token comment"># 防止访问过多服务器压力过大</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"all over"</span><span class="token punctuation">)</span> </code></pre> <h3>2.8 XPath入门</h3> <p>xpath是在XML文档中搜索内容的一门语言</p> <p>html是xml的一个子集</p> <p>安装lxml模块 pip install lxml</p> <pre><code class="prism language-python"><span class="token keyword">from</span> lxml <span class="token keyword">import</span> etree xml <span class="token operator">=</span> <span class="token triple-quoted-string string">""" <book> <id>1</id> <name>野花遍地香</name> <price>1.23</price> <author> <nick id="10086">周大强</nick> <nick id="10010">周芷若</nick> <nick class="joy">周杰伦</nick> <nick class="jolin">蔡依林</nick> <div> <nick>rerererererer</nick> </div> <div> <nick>rerererererer2</nick> <div> <nick>rerererererer3</nick> </div> 
</div> </author> <partner> <nick id="ppc">胖胖陈</nick> <nick id="ppbc">胖胖不陈</nick> </partner> </book> """</span> tree <span class="token operator">=</span> etree<span class="token punctuation">.</span>XML<span class="token punctuation">(</span>xml<span class="token punctuation">)</span> <span class="token comment"># result = tree.xpath("/book") # /表示层级关系,第一个/是根节点</span> <span class="token comment"># result = tree.xpath("/book/name/text()") # text()表示拿文本</span> <span class="token comment"># result = tree.xpath("/book/author//nick/text()") # 后代 拿出nick里的文本以及三个rerere</span> <span class="token comment"># result = tree.xpath("/book/author/*/nick/text()") # *任意节点,通配符 只拿出re1,re2</span> result <span class="token operator">=</span> tree<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">"/book//nick/text()"</span><span class="token punctuation">)</span> <span class="token comment"># 拿出所有nick的文本</span> <span class="token keyword">print</span><span class="token punctuation">(</span>result<span class="token punctuation">)</span> </code></pre> <p>在html文件中,[]可以表示索引,索引为第几个,例如///</p> <ul> <li>[1]//text()表示第一条</li> <li>中标签的文字内容;</li> <li><p></p> <p>[]里面也可以表示为标签的属性筛选,例如///</p> </li> <li>/[@href=‘dapao’]/text(),表示href为“dapao”的标签的文字内容;</li> <li><p></p> <p>///</p> </li> <li>//@href可以单取a标签href的属性值。</li> <li><p></p> <p><strong>小技巧</strong>:可以从网页中按F12,页面源代码中可以快速复制xpath</p> <h3>2.9 Xpath实战 抓取猪八戒网信息</h3> <ol> <li>拿到页面源代码</li> <li>提取和解析数据</li> </ol> <p>在这里我搜索的是“小程序开发”,遇到许多视频中没有出现的问题,好在通过百度也算是解决了,如果有更好的解决方法麻烦大佬留言</p> <pre><code class="prism language-python"><span class="token keyword">import</span> requests <span class="token keyword">from</span> lxml <span class="token keyword">import</span> etree <span class="token comment"># 我这搜索的是小程序开发,爬取过程中有许多不方便的,尽量尝试搜索英文</span> url <span class="token operator">=</span> <span class="token string">"https://beijing.zbj.com/search/f/?kw=%E5%B0%8F%E7%A8%8B%E5%BA%8F%E5%BC%80%E5%8F%91"</span> resp <span 
class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">)</span>
<span class="token comment"># print(resp.text)</span>

<span class="token comment"># parse the page</span>
html <span class="token operator">=</span> etree<span class="token punctuation">.</span>HTML<span class="token punctuation">(</span>resp<span class="token punctuation">.</span>text<span class="token punctuation">)</span>

<span class="token comment"># select the divs of the service providers</span>
divs <span class="token operator">=</span> html<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">"/html/body/div[6]/div/div/div[3]/div[4]/div[1]/div"</span><span class="token punctuation">)</span>
<span class="token keyword">for</span> div <span class="token keyword">in</span> divs<span class="token punctuation">:</span>  <span class="token comment"># one service provider per iteration</span>
    price_w <span class="token operator">=</span> div<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div/div/a[1]/div[2]/div[1]/span[1]/text()'</span><span class="token punctuation">)</span>
    <span class="token keyword">if</span> <span class="token keyword">not</span> price_w<span class="token punctuation">:</span>  <span class="token comment"># the price was sometimes empty, so skip that entry and move on to the next one</span>
        <span class="token keyword">continue</span>
    price <span class="token operator">=</span> price_w<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span>
    title <span class="token operator">=</span> <span class="token string">"小程序"</span><span class="token punctuation">.</span>join<span class="token punctuation">(</span>div<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div/div/a[1]/div[2]/div[2]/p/text()'</span><span class="token punctuation">)</span><span 
class="token punctuation">)</span>
    company <span class="token operator">=</span> div<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div/div/a[2]/div[1]/p/text()'</span><span class="token punctuation">)</span>  <span class="token comment"># the scraped result contains newline characters</span>
    company <span class="token operator">=</span> <span class="token builtin">list</span><span class="token punctuation">(</span><span class="token builtin">filter</span><span class="token punctuation">(</span><span class="token boolean">None</span><span class="token punctuation">,</span> <span class="token punctuation">[</span>x<span class="token punctuation">.</span>strip<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">for</span> x <span class="token keyword">in</span> company<span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span>  <span class="token comment"># strip the newlines, then drop the empty strings left in the list</span>
    location <span class="token operator">=</span> div<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div/div/a[2]/div[1]/div/span/text()'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span>title<span class="token punctuation">,</span> price<span class="token punctuation">,</span> company<span class="token punctuation">,</span> location<span class="token punctuation">)</span>
</code></pre>
<h2>Chapter 3 Advanced Requests</h2>
<h3>3.1 Overview of Advanced Requests</h3>
<p>We have in fact already used headers in the earlier crawlers. Headers are the request headers of the HTTP protocol. They generally carry data that is unrelated to the request content itself, and sometimes security credentials as well, for example the common User-Agent, token, and cookie.</p> 
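<p>As a minimal sketch of how such headers end up on a request, the snippet below uses requests to build a request without sending it, so the resulting headers can be inspected offline. The URL, the header values, and the cookie are hypothetical placeholders chosen for illustration, not values from any real site.</p>

```python
import requests

# Hypothetical values for illustration only.
session = requests.Session()
# Session-level header: shared by every request made through this session.
session.headers.update({"User-Agent": "Mozilla/5.0 (demo)"})

request = requests.Request(
    "GET",
    "https://example.com/api/items",
    headers={"token": "demo-token"},   # extra header for this one request
    cookies={"sessionid": "abc123"},   # cookies are supplied separately
)

# prepare_request merges the session-level and request-level settings;
# nothing is sent over the network here.
prepared = session.prepare_request(request)

for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

<p>Session-level headers fit values that stay the same across requests (such as User-Agent), while per-request headers fit values that belong to a single call.</p>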
<p>When sending a request with requests, we can put the header information in headers or pass parts of it separately; requests then assembles everything into the complete HTTP request headers for us.</p>
<p>This chapter covers:</p>
<ol> <li>Simulating a browser login -> handling cookies</li> <li>Handling hotlink protection -> scraping 梨视频 (Pear Video) data</li> <li>Proxies -> avoiding IP bans</li> </ol>
<p>Comprehensive exercise: scraping NetEase Cloud Music comments</p>
<h3>3.2 Handling Cookies: Logging In to a Novel Site</h3>
<p>Log in -> obtain the cookie</p>
<p>Request the bookshelf URL with that cookie -> get the bookshelf contents</p>
<p>These two steps must be chained together. We can send both requests through a session: a session can be thought of as a series of requests during which the cookies are not lost.</p>
<pre><code class="prism language-python"><span class="token keyword">import</span> requests

<span class="token comment"># session</span>
session <span class="token operator">=</span> requests<span class="token punctuation">.</span>session<span class="token punctuation">(</span><span class="token punctuation">)</span>
data <span class="token operator">=</span> <span class="token punctuation">{</span>
    <span class="token string">"loginName"</span><span class="token punctuation">:</span> <span class="token string">"13757696746"</span><span class="token punctuation">,</span>
    <span class="token string">"password"</span><span class="token punctuation">:</span> <span class="token string">"123qweasdzxc"</span>
<span class="token punctuation">}</span>

<span class="token comment"># 1. log in</span>
url <span class="token operator">=</span> <span class="token string">"https://passport.17k.com/ck/user/login"</span>
resp <span class="token operator">=</span> session<span class="token punctuation">.</span>post<span class="token punctuation">(</span>url<span class="token punctuation">,</span> data<span class="token operator">=</span>data<span class="token punctuation">)</span>

<span class="token comment"># fetch the bookshelf data</span>
resp_b <span class="token operator">=</span> session<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919"</span><span class="token punctuation">)</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>resp_b<span class="token punctuation">.</span>json<span class="token 
punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>

<span class="token comment"># alternative approach</span>
resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919"</span><span class="token punctuation">,</span> headers<span class="token operator">=</span><span class="token punctuation">{</span>
    <span class="token string">"Cookie"</span><span class="token punctuation">:</span> <span class="token string">"cookie copied from the browser"</span>
<span class="token punctuation">}</span><span class="token punctuation">)</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>resp<span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
</code></pre>
<h3>3.3 Hotlink Protection: Scraping 梨视频 (Pear Video)</h3>
<p>The video URL does not appear in the page source, so the link is presumably generated by JavaScript. Capturing the network requests reveals a URL that closely resembles the real video link, which then has to be patched into the real one:</p>
<ol> <li>Get the contID</li> <li>Get the JSON returned by videoStatus -> srcUrl</li> <li>Fix up the contents of srcUrl</li> <li>Download the video</li> </ol>
<p><strong>What is hotlink protection</strong>: it traces where a request came from. It effectively imposes a hierarchy on page requests: you must arrive at the second page from the first one, and accessing the second page directly is refused. The value that is checked (the Referer header) is the URL of the page one level above the current one.</p>
<pre><code class="prism language-python"><span class="token keyword">import</span> requests

url <span class="token operator">=</span> <span class="token string">"https://www.pearvideo.com/video_1738675"</span>
contID <span class="token operator">=</span> url<span class="token punctuation">.</span>split<span class="token punctuation">(</span><span class="token string">"_"</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span>

videoStatusUrl <span class="token operator">=</span> <span class="token string-interpolation"><span class="token 
string">f"https://www.pearvideo.com/videoStatus.jsp?contId=</span><span class="token interpolation"><span class="token punctuation">{</span>contID<span class="token punctuation">}</span></span><span class="token string">&mrd=0.5611111607819312"</span></span>

headers <span class="token operator">=</span> <span class="token punctuation">{</span>
    <span class="token string">"User-Agent"</span><span class="token punctuation">:</span> <span class="token string">"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 "</span>
                  <span class="token string">"Safari/537.36 "</span><span class="token punctuation">,</span>
    <span class="token comment"># hotlink protection: tell the server which page the request came from</span>
    <span class="token string">"Referer"</span><span class="token punctuation">:</span> url
<span class="token punctuation">}</span>

resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>videoStatusUrl<span class="token punctuation">,</span> headers<span class="token operator">=</span>headers<span class="token punctuation">)</span>
dic <span class="token operator">=</span> resp<span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span>
srcUrl <span class="token operator">=</span> dic<span class="token punctuation">[</span><span class="token string">"videoInfo"</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"videos"</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"srcUrl"</span><span class="token punctuation">]</span>
systemTime <span class="token operator">=</span> dic<span class="token punctuation">[</span><span class="token string">"systemTime"</span><span class="token punctuation">]</span>
srcUrl <span class="token operator">=</span> srcUrl<span class="token punctuation">.</span>replace<span class="token 
punctuation">(</span>systemTime<span class="token punctuation">,</span> <span class="token string-interpolation"><span class="token string">f"cont-</span><span class="token interpolation"><span class="token punctuation">{</span>contID<span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">)</span>

<span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"videos/a.mp4"</span><span class="token punctuation">,</span> mode<span class="token operator">=</span><span class="token string">'wb'</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span>
    f<span class="token punctuation">.</span>write<span class="token punctuation">(</span>requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>srcUrl<span class="token punctuation">)</span><span class="token punctuation">.</span>content<span class="token punctuation">)</span>
<span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"success!"</span><span class="token punctuation">)</span>
</code></pre>
<h3>3.4 Proxies</h3>
<p>How it works: the request is sent through a third-party machine instead of directly from your own.</p>
<pre><code class="prism language-python"><span class="token keyword">import</span> requests

<span class="token comment"># 36.112.139.146</span>
proxies <span class="token operator">=</span> <span class="token punctuation">{</span>
    <span class="token string">"http"</span><span class="token punctuation">:</span> <span class="token string">"http://36.112.139.146:3128"</span>
<span class="token punctuation">}</span>

resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"http://www.baidu.com"</span><span class="token punctuation">,</span> proxies<span class="token 
operator">=</span>proxies<span class="token punctuation">)</span>
resp<span class="token punctuation">.</span>encoding <span class="token operator">=</span> <span class="token string">"utf-8"</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>resp<span class="token punctuation">.</span>text<span class="token punctuation">)</span>
</code></pre>
<h3>3.5 Comprehensive Exercise: Scraping NetEase Cloud Music Comments</h3>
<ol> <li>Find the unencrypted parameters</li> <li>Encrypt the parameters (this must follow NetEase's own logic): params => encText, encSecKey => encSecKey</li> <li>Send the request to NetEase and get the comments</li> </ol>
<p>This crawl runs into very elaborate encryption. After capturing the hot-comments request in the Network panel, you can see that its data is encrypted. The Initiator tab lists the JS calls that produced the request; click the first entry, which is the last script to run, to view its code. Set a breakpoint on that line and step forward until the matching URL appears; under Local in the Scope pane on the right you will see the encrypted data. To find the line where the encryption happens, walk back through the Call Stack pane on the right, checking frame by frame whether the data under Local is already encrypted. At the u0x.be1x frame the data is still unencrypted, so this piece of JS is presumably where the data gets encrypted. Note: the variable names in the JS file change on every refresh.</p>
<pre><code class="prism language-js"> u9l<span class="token punctuation">.</span><span class="token function-variable function">be9V</span> <span class="token operator">=</span> <span class="token keyword">function</span><span class="token punctuation">(</span><span class="token parameter"><span class="token constant">Y9P</span><span class="token punctuation">,</span> e9f</span><span class="token punctuation">)</span> <span class="token punctuation">{ </span> <span class="token keyword">var</span> i9b <span class="token operator">=</span> <span class="token punctuation">{ </span><span class="token punctuation">}</span> <span class="token punctuation">,</span> e9f <span class="token operator">=</span> <span class="token constant">NEJ</span><span class="token punctuation">.</span><span class="token constant">X</span><span class="token punctuation">(</span><span class="token punctuation">{ </span><span class="token punctuation">}</span><span class="token punctuation">,</span> e9f<span class="token punctuation">)</span> <span class="token punctuation">,</span> mo3x <span class="token operator">=</span> <span class="token constant">Y9P</span><span 
class="token punctuation">.</span><span class="token function">indexOf</span><span class="token punctuation">(</span><span class="token string">"?"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>window<span class="token punctuation">.</span>GEnc <span class="token operator">&&</span> <span class="token regex"><span class="token regex-delimiter">/</span><span class="token regex-source language-regex">(^|\.com)\/api</span><span class="token regex-delimiter">/</span></span><span class="token punctuation">.</span><span class="token function">test</span><span class="token punctuation">(</span><span class="token constant">Y9P</span><span class="token punctuation">)</span> <span class="token operator">&&</span> <span class="token operator">!</span><span class="token punctuation">(</span>e9f<span class="token punctuation">.</span>headers <span class="token operator">&&</span> e9f<span class="token punctuation">.</span>headers<span class="token punctuation">[</span>eu0x<span class="token punctuation">.</span>Bl8d<span class="token punctuation">]</span> <span class="token operator">==</span> eu0x<span class="token punctuation">.</span>Io0x<span class="token punctuation">)</span> <span class="token operator">&&</span> <span class="token operator">!</span>e9f<span class="token punctuation">.</span>noEnc<span class="token punctuation">)</span> <span class="token punctuation">{ </span> <span class="token keyword">if</span> <span class="token punctuation">(</span>mo3x <span class="token operator">!=</span> <span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">)</span> <span class="token punctuation">{ </span> i9b <span class="token operator">=</span> j9a<span class="token punctuation">.</span><span class="token function">gX1x</span><span class="token punctuation">(</span><span class="token constant">Y9P</span><span 
class="token punctuation">.</span><span class="token function">substring</span><span class="token punctuation">(</span>mo3x <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token constant">Y9P</span> <span class="token operator">=</span> <span class="token constant">Y9P</span><span class="token punctuation">.</span><span class="token function">substring</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> mo3x<span class="token punctuation">)</span> <span class="token punctuation">}</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>e9f<span class="token punctuation">.</span>query<span class="token punctuation">)</span> <span class="token punctuation">{ </span> i9b <span class="token operator">=</span> <span class="token constant">NEJ</span><span class="token punctuation">.</span><span class="token constant">X</span><span class="token punctuation">(</span>i9b<span class="token punctuation">,</span> j9a<span class="token punctuation">.</span><span class="token function">fP1x</span><span class="token punctuation">(</span>e9f<span class="token punctuation">.</span>query<span class="token punctuation">)</span> <span class="token operator">?</span> j9a<span class="token punctuation">.</span><span class="token function">gX1x</span><span class="token punctuation">(</span>e9f<span class="token punctuation">.</span>query<span class="token punctuation">)</span> <span class="token operator">:</span> e9f<span class="token punctuation">.</span>query<span class="token punctuation">)</span> <span class="token punctuation">}</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>e9f<span class="token punctuation">.</span>data<span class="token punctuation">)</span> <span class="token 
punctuation">{ </span> i9b <span class="token operator">=</span> <span class="token constant">NEJ</span><span class="token punctuation">.</span><span class="token constant">X</span><span class="token punctuation">(</span>i9b<span class="token punctuation">,</span> j9a<span class="token punctuation">.</span><span class="token function">fP1x</span><span class="token punctuation">(</span>e9f<span class="token punctuation">.</span>data<span class="token punctuation">)</span> <span class="token operator">?</span> j9a<span class="token punctuation">.</span><span class="token function">gX1x</span><span class="token punctuation">(</span>e9f<span class="token punctuation">.</span>data<span class="token punctuation">)</span> <span class="token operator">:</span> e9f<span class="token punctuation">.</span>data<span class="token punctuation">)</span> <span class="token punctuation">}</span> i9b<span class="token punctuation">[</span><span class="token string">"csrf_token"</span><span class="token punctuation">]</span> <span class="token operator">=</span> u9l<span class="token punctuation">.</span><span class="token function">gP1x</span><span class="token punctuation">(</span><span class="token string">"__csrf"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token constant">Y9P</span> <span class="token operator">=</span> <span class="token constant">Y9P</span><span class="token punctuation">.</span><span class="token function">replace</span><span class="token punctuation">(</span><span class="token string">"api"</span><span class="token punctuation">,</span> <span class="token string">"weapi"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> e9f<span class="token punctuation">.</span>method <span class="token operator">=</span> <span class="token string">"post"</span><span class="token punctuation">;</span> <span class="token keyword">delete</span> e9f<span class="token 
punctuation">.</span>query<span class="token punctuation">;</span> <span class="token keyword">var</span> bUG7z <span class="token operator">=</span> window<span class="token punctuation">.</span><span class="token function">asrsea</span><span class="token punctuation">(</span><span class="token constant">JSON</span><span class="token punctuation">.</span><span class="token function">stringify</span><span class="token punctuation">(</span>i9b<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">bsB3x</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token string">"流泪"</span><span class="token punctuation">,</span> <span class="token string">"强"</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">bsB3x</span><span class="token punctuation">(</span><span class="token constant">WU8M</span><span class="token punctuation">.</span>md<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">bsB3x</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token string">"爱心"</span><span class="token punctuation">,</span> <span class="token string">"女孩"</span><span class="token punctuation">,</span> <span class="token string">"惊恐"</span><span class="token punctuation">,</span> <span class="token string">"大笑"</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> e9f<span class="token punctuation">.</span>data <span class="token operator">=</span> j9a<span class="token punctuation">.</span><span class="token function">cs0x</span><span class="token punctuation">(</span><span class="token punctuation">{ </span> params<span class="token operator">:</span> bUG7z<span 
class="token punctuation">.</span>encText<span class="token punctuation">,</span> encSecKey<span class="token operator">:</span> bUG7z<span class="token punctuation">.</span>encSecKey <span class="token punctuation">}</span><span class="token punctuation">)</span> <span class="token punctuation">}</span> <span class="token keyword">var</span> cdnHost <span class="token operator">=</span> <span class="token string">"y.music.163.com"</span><span class="token punctuation">;</span> <span class="token keyword">var</span> apiHost <span class="token operator">=</span> <span class="token string">"interface.music.163.com"</span><span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>location<span class="token punctuation">.</span>host <span class="token operator">===</span> cdnHost<span class="token punctuation">)</span> <span class="token punctuation">{ </span> <span class="token constant">Y9P</span> <span class="token operator">=</span> <span class="token constant">Y9P</span><span class="token punctuation">.</span><span class="token function">replace</span><span class="token punctuation">(</span>cdnHost<span class="token punctuation">,</span> apiHost<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token constant">Y9P</span><span class="token punctuation">.</span><span class="token function">match</span><span class="token punctuation">(</span><span class="token regex"><span class="token regex-delimiter">/</span><span class="token regex-source language-regex">^\/(we)?api</span><span class="token regex-delimiter">/</span></span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token punctuation">{ </span> <span class="token constant">Y9P</span> <span class="token operator">=</span> <span class="token string">"//"</span> <span class="token 
operator">+</span> apiHost <span class="token operator">+</span> <span class="token constant">Y9P</span> <span class="token punctuation">}</span> e9f<span class="token punctuation">.</span>cookie <span class="token operator">=</span> <span class="token boolean">true</span> <span class="token punctuation">}</span> <span class="token function">cwR2x</span><span class="token punctuation">(</span><span class="token constant">Y9P</span><span class="token punctuation">,</span> e9f<span class="token punctuation">)</span> <span class="token punctuation">}</span> </code></pre> <p>The process is fairly involved, so it is best to follow along with the video.</p> <p>Stepping through this function line by line, we find:</p> <pre><code class="prism language-js"><span class="token keyword">var</span> bUG7z <span class="token operator">=</span> window<span class="token punctuation">.</span><span class="token function">asrsea</span><span class="token punctuation">(</span><span class="token constant">JSON</span><span class="token punctuation">.</span><span class="token function">stringify</span><span class="token punctuation">(</span>i9b<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">bsB3x</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token string">"流泪"</span><span class="token punctuation">,</span> <span class="token string">"强"</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">bsB3x</span><span class="token punctuation">(</span><span class="token constant">WU8M</span><span class="token punctuation">.</span>md<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">bsB3x</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token string">"爱心"</span><span class="token punctuation">,</span> <span class="token string">"女孩"</span><span class="token 
punctuation">,</span> <span class="token string">"惊恐"</span><span class="token punctuation">,</span> <span class="token string">"大笑"</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> </code></pre> <p>The encryption starts after this line. On closer inspection it maps params => encText and encSecKey => encSecKey, so the next step is to find the window.asrsea() method. Searching for it shows that its value comes entirely from the single statement window.asrsea = d, and scrolling up from there reveals the definition of d:</p> <pre><code class="prism language-js"> <span class="token keyword">function</span> <span class="token function">d</span><span class="token punctuation">(</span><span class="token parameter">d<span class="token punctuation">,</span> e<span class="token punctuation">,</span> f<span class="token punctuation">,</span> g</span><span class="token punctuation">)</span> <span class="token punctuation">{ </span> <span class="token keyword">var</span> h <span class="token operator">=</span> <span class="token punctuation">{ </span><span class="token punctuation">}</span> <span class="token punctuation">,</span> i <span class="token operator">=</span> <span class="token function">a</span><span class="token punctuation">(</span><span class="token number">16</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">return</span> h<span class="token punctuation">.</span>encText <span class="token operator">=</span> <span class="token function">b</span><span class="token punctuation">(</span>d<span class="token punctuation">,</span> g<span class="token punctuation">)</span><span class="token punctuation">,</span> h<span class="token punctuation">.</span>encText <span class="token operator">=</span> <span class="token function">b</span><span class="token punctuation">(</span>h<span class="token punctuation">.</span>encText<span class="token punctuation">,</span> i<span class="token punctuation">)</span><span class="token punctuation">,</span> h<span class="token 
punctuation">.</span>encSecKey <span class="token operator">=</span> <span class="token function">c</span><span class="token punctuation">(</span>i<span class="token punctuation">,</span> e<span class="token punctuation">,</span> f<span class="token punctuation">)</span><span class="token punctuation">,</span> h <span class="token punctuation">}</span> </code></pre>
<p>Of the four parameters of d(): d is the data; e, as a few runs in the console show, is the fixed value 010001; f is a very long string that looks like gibberish; and g is also a fixed value, "0CoJUm6Qyw8W8jud".</p>
<p>With these values known, we can work out what d() actually does: encText is the data passed through b() twice, first with g and then with i, and encSecKey is c(i, e, f). I will not go through the rest of the analysis in detail; the key point is that a() returns a random 16-character string.</p>
<p>Here I scrape the users' nicknames and comments; for the detailed steps, watch the video on Bilibili.</p>
<pre><code class="prism language-python"><span class="token keyword">from</span> Crypto<span class="token punctuation">.</span>Cipher <span class="token keyword">import</span> AES
<span class="token keyword">from</span> base64 <span class="token keyword">import</span> b64encode
<span class="token keyword">import</span> requests
<span class="token keyword">import</span> json

url <span class="token operator">=</span> <span class="token string">"https://music.163.com/weapi/comment/resource/comments/get?csrf_token="</span>

<span class="token comment"># request method: POST</span>
data <span class="token operator">=</span> <span class="token punctuation">{</span>
    <span class="token string">"csrf_token"</span><span class="token punctuation">:</span> <span class="token string">""</span><span class="token punctuation">,</span>
    <span class="token string">"cursor"</span><span class="token punctuation">:</span> <span class="token string">"-1"</span><span class="token punctuation">,</span>
    <span class="token string">"offset"</span><span class="token punctuation">:</span> <span class="token string">"0"</span><span class="token punctuation">,</span>
    <span class="token string">"orderType"</span><span class="token punctuation">:</span> <span class="token string">"1"</span><span class="token punctuation">,</span>
    <span class="token string">"pageNo"</span><span class="token punctuation">:</span> <span class="token string">"1"</span><span class="token 
punctuation">,</span>
    <span class="token string">"pageSize"</span><span class="token punctuation">:</span> <span class="token string">"20"</span><span class="token punctuation">,</span>
    <span class="token string">"rid"</span><span class="token punctuation">:</span> <span class="token string">"R_SO_4_65538"</span><span class="token punctuation">,</span>
    <span class="token string">"threadId"</span><span class="token punctuation">:</span> <span class="token string">"R_SO_4_65538"</span>
<span class="token punctuation">}</span>

<span class="token comment"># values passed to d()</span>
e <span class="token operator">=</span> <span class="token string">"010001"</span>
f <span class="token operator">=</span> <span class="token string">"00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e "</span>
g <span class="token operator">=</span> <span class="token string">"0CoJUm6Qyw8W8jud"</span>
i <span class="token operator">=</span> <span class="token string">"7HCsoSguhIA6SpNw"</span>  <span class="token comment"># fixed by hand; random in the original function</span>

encSecKey <span class="token operator">=</span> <span class="token string">"21fb180e564113d59d37865081a91daf1f775fb67ef063dc046bda9966613ea4a384b597e11ce05c442df9dfa8538347c58aa87d9be92636fbda399b28f04bbf31e91751e25f359a05538b8d5c51999a03e1348e21cbe90fbfa54d013399c0ab240e41c73750ef463542fe5c14637db16abeffa8a2ab74027e085aa570c01395 "</span>


<span class="token comment"># pad the data to a multiple of 16 for the encryption below</span>
<span class="token keyword">def</span> <span class="token function">to_16</span><span class="token punctuation">(</span>data<span class="token punctuation">)</span><span class="token punctuation">:</span>
    pad <span class="token operator">=</span> <span class="token number">16</span> <span class="token operator">-</span> <span class="token builtin">len</span><span 
class="token punctuation">(</span>data<span class="token punctuation">)</span> <span class="token operator">%</span> <span class="token number">16</span>
    data <span class="token operator">+=</span> <span class="token builtin">chr</span><span class="token punctuation">(</span>pad<span class="token punctuation">)</span> <span class="token operator">*</span> pad
    <span class="token keyword">return</span> data


<span class="token keyword">def</span> <span class="token function">get_encSecKey</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># i is fixed, so the result of c() and therefore encSecKey is fixed as well</span>
    <span class="token keyword">return</span> encSecKey


<span class="token keyword">def</span> <span class="token function">get_params</span><span class="token punctuation">(</span>data<span class="token punctuation">)</span><span class="token punctuation">:</span>  <span class="token comment"># data is expected to be a string here</span>
    first <span class="token operator">=</span> enc_params<span class="token punctuation">(</span>data<span class="token punctuation">,</span> g<span class="token punctuation">)</span>
    second <span class="token operator">=</span> enc_params<span class="token punctuation">(</span>first<span class="token punctuation">,</span> i<span class="token punctuation">)</span>
    <span class="token keyword">return</span> second  <span class="token comment"># this is the params value</span>


<span class="token keyword">def</span> <span class="token function">enc_params</span><span class="token punctuation">(</span>data<span class="token punctuation">,</span> key<span class="token punctuation">)</span><span class="token punctuation">:</span>  <span class="token comment"># encryption step</span>
    <span class="token comment"># the AES module comes from an extra package that has to be installed first</span>
    iv <span class="token operator">=</span> <span class="token string">"0102030405060708"</span>
    data <span class="token operator">=</span> to_16<span class="token 
punctuation">(</span>data<span class="token punctuation">)</span>
    aes <span class="token operator">=</span> AES<span class="token punctuation">.</span>new<span class="token punctuation">(</span>key<span class="token operator">=</span>key<span class="token punctuation">.</span>encode<span class="token punctuation">(</span><span class="token string">"utf-8"</span><span class="token punctuation">)</span><span class="token punctuation">,</span> IV<span class="token operator">=</span>iv<span class="token punctuation">.</span>encode<span class="token punctuation">(</span><span class="token string">'utf-8'</span><span class="token punctuation">)</span><span class="token punctuation">,</span> mode<span class="token operator">=</span>AES<span class="token punctuation">.</span>MODE_CBC<span class="token punctuation">)</span>  <span class="token comment"># create the cipher</span>
    bs <span class="token operator">=</span> aes<span class="token punctuation">.</span>encrypt<span class="token punctuation">(</span>data<span class="token punctuation">.</span>encode<span class="token punctuation">(</span><span class="token string">'utf-8'</span><span class="token punctuation">)</span><span class="token punctuation">)</span>  <span class="token comment"># encrypt; the input length must be a multiple of 16</span>
    <span class="token keyword">return</span> <span class="token builtin">str</span><span class="token punctuation">(</span>b64encode<span class="token punctuation">(</span>bs<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string">'utf-8'</span><span class="token punctuation">)</span>  <span class="token comment"># return the result as a string</span>


<span class="token comment"># How the site's JavaScript performs the encryption</span>
<span class="token triple-quoted-string string">"""
    function a(a) {  # returns a random 16-character string
        var d, e, b = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", c = "";
        for (d = 0; a > d; d += 1)  # loops 16 times
            e = Math.random() * b.length,  # random number
            e = Math.floor(e),  # round down
            c += b.charAt(e);  # take the character at that position
        return c
    }
    function b(a, b) {  # a is the content to encrypt
        var c = CryptoJS.enc.Utf8.parse(b)  # b is the key
          , d = CryptoJS.enc.Utf8.parse("0102030405060708")
          , e = CryptoJS.enc.Utf8.parse(a)  # e is the data
          , f = CryptoJS.AES.encrypt(e, c, {  # c is the encryption key
            iv: d,  # initialization vector
            mode: CryptoJS.mode.CBC  # mode: CBC
        });
        return f.toString()
    }
    function c(a, b, c) {
        var d, e;
        return setMaxDigits(131),
        d = new RSAKeyPair(b,"",c),
        e = encryptedString(d, a)
    }
    function d(d, e, f, g) {
        var h = {}  # starts out empty
          , i = a(16);  # i is a random 16-character value; we pin it to a fixed value
        return h.encText = b(d, g),  # g is the key
        h.encText = b(h.encText, i),  # this is params; i is the key here
        h.encSecKey = c(i, e, f),  # this is encSecKey; e and f are constants, so pinning i pins the key
        h
    }
    function e(a, b, d, e) {
        var f = {};
        return f.encText = c(a + e, b, d),
        f
    }

    Two rounds of encryption:
    data + g => b => first ciphertext + i => b => params
"""</span>

<span class="token comment"># Send the request and read the comments from the response</span>
resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>post<span class="token punctuation">(</span>url<span class="token punctuation">,</span> data<span class="token operator">=</span><span class="token punctuation">{</span>
    <span class="token string">"params"</span><span class="token punctuation">:</span> get_params<span class="token punctuation">(</span>json<span class="token punctuation">.</span>dumps<span class="token punctuation">(</span>data<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
    <span class="token string">"encSecKey"</span><span class="token punctuation">:</span> get_encSecKey<span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token punctuation">}</span><span class="token punctuation">)</span>
dic <span class="token operator">=</span> resp<span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span>
hotComments <span class="token operator">=</span> dic<span class="token punctuation">[</span><span class="token string">'data'</span><span class="token
punctuation">]</span><span class="token punctuation">[</span><span class="token string">"hotComments"</span><span class="token punctuation">]</span>
<span class="token keyword">for</span> i <span class="token keyword">in</span> hotComments<span class="token punctuation">:</span>
    username <span class="token operator">=</span> i<span class="token punctuation">[</span><span class="token string">"user"</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"nickname"</span><span class="token punctuation">]</span>
    content <span class="token operator">=</span> i<span class="token punctuation">[</span><span class="token string">"content"</span><span class="token punctuation">]</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span>username<span class="token punctuation">,</span> <span class="token string">":"</span><span class="token punctuation">,</span> content<span class="token punctuation">)</span>
</code></pre>
<h2>Chapter 4: Async</h2>
<h3>4.1 Chapter 4 Overview</h3>
<p>At this point we can handle the basic crawling workflow, but throughput is still low. To speed things up we can use multithreading, multiprocessing, or coroutines to build an asynchronous crawler.</p>
<p>What does asynchronous mean here? Suppose there are ten thousand records to fetch. Fetching them one at a time takes a long time; working asynchronously means running several lines of work at once, so that many records are being fetched at the same time.</p>
<p>In this chapter:</p>
<ol>
<li>Multithreading quick start</li>
<li>Multiprocessing quick start</li>
<li>Thread pools and process pools</li>
<li>Scraping all of Xinfadi</li>
<li>Coroutines</li>
<li>Multi-task asynchronous coroutines</li>
<li>The aiohttp module in detail</li>
<li>Scraping an entire novel</li>
<li>Combined exercise: downloading a movie</li>
</ol>
<h3>4.2 Multithreading</h3>
<ul>
<li>A process is the unit of resource allocation; every process has at least one thread</li>
<li>A thread is the unit of execution</li>
</ul>
<p>First style: pass the target function to Thread</p>
<pre><code class="prism language-python"><span class="token keyword">from</span> threading <span class="token keyword">import</span> Thread


<span class="token keyword">def</span> <span class="token function">func</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token
number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"func "</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span>


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    t <span class="token operator">=</span> Thread<span class="token punctuation">(</span>target<span class="token operator">=</span>func<span class="token punctuation">)</span>  <span class="token comment"># create a thread and give it a task: like hiring a worker, with the job named in the parentheses</span>
    t<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span>  <span class="token comment"># mark the thread as ready to run; when it actually runs is decided by the scheduler</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"main"</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span>
</code></pre>
<p>Second style: subclass Thread and override run()</p>
<pre><code class="prism language-python"><span class="token keyword">from</span> threading <span class="token keyword">import</span> Thread


<span class="token keyword">class</span> <span class="token class-name">MyThread</span><span class="token punctuation">(</span>Thread<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">def</span> <span class="token function">run</span><span class="token punctuation">(</span>self<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token
keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
            <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"child thread"</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span>


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    t <span class="token operator">=</span> MyThread<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token comment"># t.run()  # a plain method call; this would still be single-threaded</span>
    t<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span>  <span class="token comment"># start the thread</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"main thread"</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span>
</code></pre>
<h3>4.3 Multiprocessing</h3>
<p>Multiprocessing code is written almost the same way as multithreading code.</p>
<pre><code class="prism language-python"><span class="token keyword">from</span> multiprocessing <span class="token keyword">import</span> Process


<span class="token keyword">def</span> <span class="token function">fuc</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span
class="token punctuation">(</span><span class="token number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"child process"</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span>


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    p <span class="token operator">=</span> Process<span class="token punctuation">(</span>target<span class="token operator">=</span>fuc<span class="token punctuation">)</span>
    p<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"main process"</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span>
</code></pre>
<p>How do we tell two workers apart? Pass a name in through args. The example below uses Thread; Process takes the same target and args parameters.</p>
<pre><code class="prism language-python"><span class="token keyword">from</span> threading <span class="token keyword">import</span> Thread


<span class="token keyword">def</span> <span class="token function">fuc</span><span class="token punctuation">(</span>name<span class="token punctuation">)</span><span class="token punctuation">:</span>  <span class="token comment"># print the name that was passed in</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1000</span><span class="token
punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span>name<span class="token punctuation">,</span> i<span class="token punctuation">)</span>


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    t1 <span class="token operator">=</span> Thread<span class="token punctuation">(</span>target<span class="token operator">=</span>fuc<span class="token punctuation">,</span> args<span class="token operator">=</span><span class="token punctuation">(</span><span class="token string">"周杰伦"</span><span class="token punctuation">,</span><span class="token punctuation">)</span><span class="token punctuation">)</span>  <span class="token comment"># arguments are passed as a tuple</span>
    t1<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span>
    t2 <span class="token operator">=</span> Thread<span class="token punctuation">(</span>target<span class="token operator">=</span>fuc<span class="token punctuation">,</span> args<span class="token operator">=</span><span class="token punctuation">(</span><span class="token string">"王力宏"</span><span class="token punctuation">,</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    t2<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre>
<h3>4.4 Getting Started with Thread Pools and Process Pools</h3>
<p>Thread pool: create a batch of threads up front, then submit tasks directly to the pool. Scheduling the tasks onto threads is left to the pool.</p>
<pre><code class="prism language-python"><span class="token keyword">from</span> concurrent<span class="token punctuation">.</span>futures <span class="token keyword">import</span> ThreadPoolExecutor<span class="token punctuation">,</span> ProcessPoolExecutor
<span class="token comment"># ThreadPoolExecutor is thread-based and ProcessPoolExecutor is process-based; use whichever fits</span>


<span class="token
keyword">def</span> <span class="token function">fn</span><span class="token punctuation">(</span>name<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span>name<span class="token punctuation">,</span> i<span class="token punctuation">)</span>


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    <span class="token comment"># create the thread pool</span>
    <span class="token keyword">with</span> ThreadPoolExecutor<span class="token punctuation">(</span><span class="token number">50</span><span class="token punctuation">)</span> <span class="token keyword">as</span> t<span class="token punctuation">:</span>  <span class="token comment"># a pool of up to 50 worker threads</span>
        <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">100</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
            t<span class="token punctuation">.</span>submit<span class="token punctuation">(</span>fn<span class="token punctuation">,</span> name<span class="token operator">=</span><span class="token string-interpolation"><span class="token string">f"thread </span><span class="token interpolation"><span class="token punctuation">{</span>i<span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">)</span>
    <span class="token comment"># leaving the with block waits until every task in the pool has finished, and only then continues</span>
    <span class="token keyword">print</span><span class="token
punctuation">(</span><span class="token string">"Done"</span><span class="token punctuation">)</span>
</code></pre>
<h3>4.5 Thread Pool Example: Scraping Xinfadi Vegetable Prices</h3>
<ol>
<li>Work out how to extract the data for a single page</li>
<li>Hand the pages to a thread pool so that many pages are fetched at once</li>
</ol>
<p>The site has since been updated: the data is no longer in the page source and is returned as JSON instead, so the code below differs from the code in the video.</p>
<pre><code class="prism language-python"><span class="token keyword">import</span> requests
<span class="token keyword">import</span> csv
<span class="token keyword">from</span> concurrent<span class="token punctuation">.</span>futures <span class="token keyword">import</span> ThreadPoolExecutor

f <span class="token operator">=</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"data.csv"</span><span class="token punctuation">,</span> mode<span class="token operator">=</span><span class="token string">"w"</span><span class="token punctuation">,</span> encoding<span class="token operator">=</span><span class="token string">"utf-8"</span><span class="token punctuation">,</span> newline<span class="token operator">=</span><span class="token string">""</span><span class="token punctuation">)</span>
csvwriter <span class="token operator">=</span> csv<span class="token punctuation">.</span>writer<span class="token punctuation">(</span>f<span class="token punctuation">)</span>


<span class="token keyword">def</span> <span class="token function">download_one_page</span><span class="token punctuation">(</span>page<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># request the data for one page</span>
    url <span class="token operator">=</span> <span class="token string">"http://www.xinfadi.com.cn/getPriceData.html"</span>
    data <span class="token operator">=</span> <span class="token punctuation">{</span>
        <span class="token string">"limit"</span><span class="token punctuation">:</span> <span class="token string">"20"</span><span class="token punctuation">,</span>
        <span class="token string">"current"</span><span class="token
punctuation">:</span> <span class="token string-interpolation"><span class="token string">f"</span><span class="token interpolation"><span class="token punctuation">{</span>page<span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">,</span>  <span class="token comment"># which page to fetch</span>
    <span class="token punctuation">}</span>
    resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>post<span class="token punctuation">(</span>url<span class="token punctuation">,</span> data<span class="token operator">=</span>data<span class="token punctuation">)</span>
    <span class="token keyword">for</span> txt <span class="token keyword">in</span> resp<span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token string">"list"</span><span class="token punctuation">]</span><span class="token punctuation">:</span>
        <span class="token comment"># pick out the fields we need</span>
        dic <span class="token operator">=</span> <span class="token punctuation">[</span>txt<span class="token punctuation">[</span><span class="token string">"prodName"</span><span class="token punctuation">]</span><span class="token punctuation">,</span> txt<span class="token punctuation">[</span><span class="token string">"prodCat"</span><span class="token punctuation">]</span><span class="token punctuation">,</span> txt<span class="token punctuation">[</span><span class="token string">"lowPrice"</span><span class="token punctuation">]</span><span class="token punctuation">,</span> txt<span class="token punctuation">[</span><span class="token string">"highPrice"</span><span class="token punctuation">]</span><span class="token punctuation">,</span> txt<span class="token punctuation">[</span><span class="token string">"place"</span><span class="token punctuation">]</span><span class="token punctuation">,</span>
txt<span class="token punctuation">[</span><span class="token string">"pubDate"</span><span class="token punctuation">]</span><span class="token punctuation">]</span>
        <span class="token comment"># write the row to the file</span>
        csvwriter<span class="token punctuation">.</span>writerow<span class="token punctuation">(</span>dic<span class="token punctuation">)</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string-interpolation"><span class="token string">f"page </span><span class="token interpolation"><span class="token punctuation">{</span>page<span class="token punctuation">}</span></span><span class="token string"> downloaded"</span></span><span class="token punctuation">)</span>


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    <span class="token comment"># for i in range(1, 17712):  # extremely slow</span>
    <span class="token comment">#     download_one_page(i)</span>
    <span class="token comment"># create the thread pool</span>
    <span class="token keyword">with</span> ThreadPoolExecutor<span class="token punctuation">(</span><span class="token number">50</span><span class="token punctuation">)</span> <span class="token keyword">as</span> t<span class="token punctuation">:</span>
        <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">,</span> <span class="token number">201</span><span class="token punctuation">)</span><span class="token punctuation">:</span>  <span class="token comment"># pages are numbered from 1, as in the serial loop above</span>
            <span class="token comment"># submit the download task to the pool</span>
            t<span class="token punctuation">.</span>submit<span class="token punctuation">(</span>download_one_page<span class="token punctuation">,</span> i<span class="token punctuation">)</span>
    f<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span>  <span class="token comment"># every task has finished; flush and close the CSV file</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"All downloads finished"</span><span class="token
punctuation">)</span>
</code></pre>
<h3>4.6 Coroutines</h3>
<h4>4.6.1 The Coroutine Concept</h4>
<p>When code calls time.sleep(), the current thread is blocked and the CPU is not doing any work for us.</p>
<p>Likewise, a program waiting on input() is blocked.</p>
<p>requests.get(url) also blocks the program until the network response comes back.</p>
<p>In general, a thread is blocked whenever the program is waiting on an I/O operation.</p>
<p><strong>Coroutines</strong>: when the program hits an I/O operation, it can choose to switch to another task. At the micro level, tasks are switched one at a time, and the switch is usually triggered by I/O; at the macro level, what we see is several tasks running together.</p>
<h4>4.6.2 Multi-Task Asynchronous Coroutines</h4>
<pre><code class="prism language-python"><span class="token keyword">import</span> asyncio
<span class="token keyword">import</span> time

<span class="token comment"># async def func():</span>
<span class="token comment">#     print("Hello, my name is Seria")</span>
<span class="token comment">#</span>
<span class="token comment">#</span>
<span class="token comment"># if __name__ == '__main__':</span>
<span class="token comment">#     g = func()  # func is an async coroutine function, so calling it returns a coroutine object</span>
<span class="token comment">#     asyncio.run(g)  # running a coroutine needs the asyncio module</span>

<span class="token comment"># async def func1():</span>
<span class="token comment">#     print("Hello, I am func1")</span>
<span class="token comment">#     # time.sleep(3)  # a synchronous call here breaks the asynchrony</span>
<span class="token comment">#     await asyncio.sleep(3)  # asynchronous wait: switch to the next task while this one is waiting</span>
<span class="token comment">#     print("Hello, I am func1")</span>
<span class="token comment">#</span>
<span class="token comment">#</span>
<span class="token comment"># async def func2():</span>
<span class="token comment">#     print("Hello, I am func2")</span>
<span class="token comment">#     # time.sleep(4)</span>
<span class="token comment">#     await asyncio.sleep(4)</span>
<span class="token comment">#     print("Hello, I am func2")</span>
<span class="token comment">#</span>
<span class="token comment">#</span>
<span class="token comment"># async def func3():</span>
<span class="token comment">#     print("Hello, I am func3")</span>
<span class="token comment">#     # time.sleep(2)</span>
<span class="token comment">#     await asyncio.sleep(2)</span>
<span class="token comment">#     print("Hello, I am func3")</span>
<span class="token comment">#</span>
<span class="token comment"># if __name__ == '__main__':</span>
<span class="token comment">#     f1 = func1()</span>
<span class="token comment">#     f2 = func2()</span>
<span class="token comment">#     f3 = func3()</span>
<span class="token comment">#     tasks = [</span>
<span class="token comment">#         f1, f2, f3</span>
<span class="token comment">#     ]</span>
<span class="token comment">#     t1 = time.time()</span>
<span class="token comment">#     # start several tasks (coroutines) at once</span>
<span class="token comment">#     asyncio.run(asyncio.wait(tasks))</span>
<span class="token comment">#     t2 = time.time()</span>
<span class="token comment">#     print(t2-t1)</span>

<span class="token comment"># The style above is not the recommended one; prefer the style below, because it carries over directly to crawlers</span>
<span class="token comment"># async def func1():</span>
<span class="token comment">#     print("Hello, I am func1")</span>
<span class="token comment">#     await asyncio.sleep(3)</span>
<span class="token comment">#     print("Hello, I am func1")</span>
<span class="token comment">#</span>
<span class="token comment">#</span>
<span class="token comment"># async def func2():</span>
<span class="token comment">#     print("Hello, I am func2")</span>
<span class="token comment">#     await asyncio.sleep(4)</span>
<span class="token comment">#     print("Hello, I am func2")</span>
<span class="token comment">#</span>
<span class="token comment">#</span>
<span class="token comment"># async def func3():</span>
<span class="token comment">#     print("Hello, I am func3")</span>
<span class="token comment">#     await asyncio.sleep(2)</span>
<span class="token comment">#     print("Hello, I am func3")</span>
<span class="token comment">#</span>
<span class="token comment">#</span>
<span class="token comment"># async def main():</span>
<span class="token comment">#     # first style</span>
<span class="token comment">#     # f1 = func1()</span>
<span class="token comment">#     # await f1  # await normally goes in front of a coroutine object</span>
<span class="token comment">#     # second style (recommended)</span>
<span class="token comment">#     tasks = [</span>
<span class="token comment">#         asyncio.create_task(func1()),  # from Python 3.8 on, wrap each coroutine with asyncio.create_task()</span>
<span class="token comment">#         asyncio.create_task(func2()),</span>
<span class="token comment">#         asyncio.create_task(func3())</span>
<span class="token comment">#     ]</span>
<span class="token comment">#     await asyncio.wait(tasks)</span>
<span class="token comment">#</span>
<span class="token comment">#</span>
<span class="token comment"># if __name__ == '__main__':</span>
<span class="token comment">#     asyncio.run(main())</span>

<span class="token comment"># Applying this to a crawler</span>
<span class="token keyword">async</span> <span class="token keyword">def</span> <span class="token function">download</span><span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"starting download"</span><span class="token punctuation">)</span>
    <span class="token keyword">await</span> asyncio<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span>  <span class="token comment"># simulate the network request</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"download finished"</span><span class="token punctuation">)</span>


<span class="token keyword">async</span> <span class="token keyword">def</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    urls <span class="token operator">=</span> <span class="token punctuation">[</span>
        <span class="token string">"http://www.baidu.com"</span><span class="token punctuation">,</span>
        <span class="token string">"http://www.bilibili.com"</span><span class="token punctuation">,</span>
        <span class="token string">"http://www.163.com"</span>
    <span class="token punctuation">]</span>
    tasks <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
    <span class="token keyword">for</span> url <span class="token keyword">in</span> urls<span class="token punctuation">:</span>
        d <span class="token operator">=</span> asyncio<span class="token punctuation">.</span>create_task<span class="token punctuation">(</span>download<span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">)</span>  <span class="token comment"># wrap the coroutine in a task before waiting on it</span>
        tasks<span class="token punctuation">.</span>append<span class="token punctuation">(</span>d<span class="token punctuation">)</span>
    <span class="token keyword">await</span> asyncio<span class="token punctuation">.</span>wait<span class="token punctuation">(</span>tasks<span class="token punctuation">)</span>


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    asyncio<span class="token punctuation">.</span>run<span class="token punctuation">(</span>main<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
</code></pre>
<h4>4.6.3 Async Coroutines: Deprecation Warning</h4>
<p>Since Python 3.8, passing bare coroutine objects to asyncio.wait() is deprecated; wrap each coroutine with asyncio.create_task(), with the coroutine call inside the parentheses. Support for bare coroutines was removed in Python 3.11, where passing them raises an error.</p>
<h3>4.7 Asynchronous HTTP Requests with the aiohttp Module</h3>
<p>First install the module: pip install aiohttp</p>
<p>The synchronous requests.get() call is replaced by its asynchronous counterpart in aiohttp.</p>
<pre><code class="prism language-python"><span class="token keyword">import</span> aiohttp
<span class="token keyword">import</span> asyncio

urls <span class="token operator">=</span> <span class="token punctuation">[</span>
    <span class="token string">"https://img-pre.ivsky.com/img/tupian/pre/202101/31/weiershi_kejiquan.jpg"</span><span class="token punctuation">,</span>
    <span class="token string">"https://img-pre.ivsky.com/img/tupian/pre/202101/31/weiershi_kejiquan-001.jpg"</span><span class="token punctuation">,</span>
    <span class="token string">"https://img-pre.ivsky.com/img/tupian/pre/202101/31/weiershi_kejiquan-003.jpg"</span>
<span class="token punctuation">]</span>


<span class="token keyword">async</span> <span
class="token keyword">def</span> <span class="token function">aiodownload</span><span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">:</span>
    name <span class="token operator">=</span> url<span class="token punctuation">.</span>rsplit<span class="token punctuation">(</span><span class="token string">"/"</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span>
    <span class="token comment"># s = aiohttp.ClientSession() plays the role of requests.session()</span>
    <span class="token comment"># s.get() and s.post() correspond to requests.get() and requests.post()</span>
    <span class="token keyword">async</span> <span class="token keyword">with</span> aiohttp<span class="token punctuation">.</span>ClientSession<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">as</span> session<span class="token punctuation">:</span>
        <span class="token keyword">async</span> <span class="token keyword">with</span> session<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">)</span> <span class="token keyword">as</span> resp<span class="token punctuation">:</span>
            <span class="token comment"># the response is back; write it to a file (the Wallpaper/ directory must already exist)</span>
            <span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"Wallpaper/"</span><span class="token operator">+</span>name<span class="token punctuation">,</span> mode<span class="token operator">=</span><span class="token string">"wb"</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span>
                f<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token keyword">await</span>
resp<span class="token punctuation">.</span>content<span class="token punctuation">.</span>read<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>  <span class="token comment"># reading the body is asynchronous, so it has to be awaited</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span>name<span class="token punctuation">,</span> <span class="token string">"done!"</span><span class="token punctuation">)</span>


<span class="token keyword">async</span> <span class="token keyword">def</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    tasks <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
    <span class="token keyword">for</span> url <span class="token keyword">in</span> urls<span class="token punctuation">:</span>
        tasks<span class="token punctuation">.</span>append<span class="token punctuation">(</span>asyncio<span class="token punctuation">.</span>create_task<span class="token punctuation">(</span>aiodownload<span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    <span class="token keyword">await</span> asyncio<span class="token punctuation">.</span>wait<span class="token punctuation">(</span>tasks<span class="token punctuation">)</span>


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    <span class="token comment"># asyncio.run(main()) raised "RuntimeError: Event loop is closed" here; driving the loop explicitly as below avoided the error</span>
    loop <span class="token operator">=</span> asyncio<span class="token punctuation">.</span>get_event_loop<span class="token punctuation">(</span><span class="token punctuation">)</span>
    loop<span
class="token punctuation">.</span>run_until_complete<span class="token punctuation">(</span>main<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span> </code></pre> <h3>4.8 Async scraping in practice: downloading a whole novel</h3> <ol> <li>Synchronous step: request getCatalog to get the cid and title of every chapter</li> <li>Asynchronous step: request getChapterContent to download the content of every chapter</li> </ol> <pre><code class="prism language-python"><span class="token comment"># http://dushu.baidu.com/api/pc/getCatalog?data={'book_id':'4306063500'} # returns the chapter catalog (titles and cids)</span> <span class="token comment"># The request below returns the content of a single chapter</span> <span class="token comment"># http://dushu.baidu.com/api/pc/getChapterContent?data={'book_id':'4306063500','cid':'4306063500|11348571','need_bookinfo':1}</span> <span class="token keyword">import</span> requests <span class="token keyword">import</span> asyncio <span class="token keyword">import</span> aiohttp <span class="token keyword">import</span> aiofiles <span class="token keyword">import</span> json <span class="token keyword">async</span> <span class="token keyword">def</span> <span class="token function">aiodownload</span><span class="token punctuation">(</span>cid<span class="token punctuation">,</span> b_id<span class="token punctuation">,</span> title<span class="token punctuation">)</span><span class="token punctuation">:</span> data <span class="token operator">=</span> <span class="token punctuation">{ </span> <span class="token string">"book_id"</span><span class="token punctuation">:</span> b_id<span class="token punctuation">,</span> <span class="token string">"cid"</span><span class="token punctuation">:</span> <span class="token string-interpolation"><span class="token string">f"</span><span class="token interpolation"><span class="token punctuation">{ </span>b_id<span class="token punctuation">}</span></span><span class="token string">|</span><span class="token interpolation"><span class="token punctuation">{ </span>cid<span class="token punctuation">}</span></span><span class="token 
string">"</span></span><span class="token punctuation">,</span> <span class="token string">"need_bookinfo"</span><span class="token punctuation">:</span> <span class="token number">1</span> <span class="token punctuation">}</span> data <span class="token operator">=</span> json<span class="token punctuation">.</span>dumps<span class="token punctuation">(</span>data<span class="token punctuation">)</span> url <span class="token operator">=</span> <span class="token string-interpolation"><span class="token string">f"http://dushu.baidu.com/api/pc/getChapterContent?data=</span><span class="token interpolation"><span class="token punctuation">{ </span>data<span class="token punctuation">}</span></span><span class="token string">"</span></span> <span class="token keyword">async</span> <span class="token keyword">with</span> aiohttp<span class="token punctuation">.</span>ClientSession<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">as</span> session<span class="token punctuation">:</span> <span class="token keyword">async</span> <span class="token keyword">with</span> session<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">)</span> <span class="token keyword">as</span> resp<span class="token punctuation">:</span> dic <span class="token operator">=</span> <span class="token keyword">await</span> resp<span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">async</span> <span class="token keyword">with</span> aiofiles<span class="token punctuation">.</span><span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"西游记/"</span> <span class="token operator">+</span> title<span class="token operator">+</span><span class="token string">".txt"</span><span class="token punctuation">,</span> mode<span class="token 
operator">=</span><span class="token string">"w"</span><span class="token punctuation">,</span> encoding<span class="token operator">=</span><span class="token string">"utf-8"</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span> <span class="token keyword">await</span> f<span class="token punctuation">.</span>write<span class="token punctuation">(</span>dic<span class="token punctuation">[</span><span class="token string">"data"</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"novel"</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"content"</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span>title<span class="token punctuation">,</span> <span class="token string">"success"</span><span class="token punctuation">)</span> <span class="token keyword">async</span> <span class="token keyword">def</span> <span class="token function">getCatalog</span><span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">:</span> resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">)</span> dic <span class="token operator">=</span> resp<span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span> tasks <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span> <span class="token keyword">for</span> item <span class="token keyword">in</span> dic<span class="token punctuation">[</span><span class="token string">"data"</span><span class="token punctuation">]</span><span class="token 
punctuation">[</span><span class="token string">"novel"</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"items"</span><span class="token punctuation">]</span><span class="token punctuation">:</span> <span class="token comment"># item就是对应每个章节的名称和id</span> title <span class="token operator">=</span> item<span class="token punctuation">[</span><span class="token string">"title"</span><span class="token punctuation">]</span> cid <span class="token operator">=</span> item<span class="token punctuation">[</span><span class="token string">"cid"</span><span class="token punctuation">]</span> <span class="token comment"># 准备异步任务</span> tasks<span class="token punctuation">.</span>append<span class="token punctuation">(</span>asyncio<span class="token punctuation">.</span>create_task<span class="token punctuation">(</span>aiodownload<span class="token punctuation">(</span>cid<span class="token punctuation">,</span> b_id<span class="token punctuation">,</span> title<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token keyword">await</span> asyncio<span class="token punctuation">.</span>wait<span class="token punctuation">(</span>tasks<span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"All Done"</span><span class="token punctuation">)</span> <span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span> b_id <span class="token operator">=</span> <span class="token string">"4306063500"</span> url <span class="token operator">=</span> <span class="token string">'http://dushu.baidu.com/api/pc/getCatalog?data={"book_id":"'</span> <span class="token operator">+</span> b_id <span class="token operator">+</span> <span class="token 
string">'"}'</span> asyncio<span class="token punctuation">.</span>run<span class="token punctuation">(</span>getCatalog<span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">)</span> </code></pre> <h3>4.9 Scraping video</h3> <h4>4.9.1 Comprehensive exercise: how video sites work</h4> <p>When we build a website, a video file is normally embedded with a video tag. If a video site served its videos that way, however, every playback would amount to downloading the complete file, which is very slow.</p> <p><strong>So what do video sites usually do</strong>?</p> <p>User upload -> transcoding (render the video at several qualities: 2K, 1080p, standard definition) -> segmenting (split the single file into many small files, so that when the user drags the progress bar only the corresponding file has to be loaded)</p> <p>Since the video is cut into a large number of small pieces, a file is needed to record: 1. the playback order of the pieces, 2. the path where each piece is stored. This file is usually an M3U file, and an M3U file whose content is encoded as UTF-8 is an M3U8 file. Nearly all major video platforms today use M3U8 files.</p> <p>Reading an M3U8 file:</p> <pre><code class="prism language-python"><span class="token comment">#EXTM3U</span> <span class="token comment">#EXT-X-VERSION:3</span> <span class="token comment">#EXT-X-TARGETDURATION:13 maximum duration of each video segment, in seconds</span> <span class="token comment">#EXT-X-MEDIA-SEQUENCE:0</span> <span class="token comment">#EXT-X-KEY:METHOD=AES-128,URI="key.key" encryption method of the segments and the address of the key; encrypted segments must be decrypted before they can be played</span> <span class="token comment">#EXTINF:12.600000, duration of the segment that follows, in seconds</span> cFN8o3436000<span class="token punctuation">.</span>ts lines that do not start with <span class="token string">'#'</span> are the addresses of the ts files <span class="token comment">#EXTINF:10.000000,</span> cFN8o3436001<span class="token punctuation">.</span>ts <span class="token comment">#EXTINF:10.000000,</span> cFN8o3436002<span class="token punctuation">.</span>ts <span class="token comment">#EXTINF:10.000000,</span> cFN8o3436003<span class="token punctuation">.</span>ts <span class="token comment">#EXTINF:10.000000,</span> cFN8o3436004<span class="token punctuation">.</span>ts <span class="token comment">#EXTINF:10.000000,</span> cFN8o3436005<span class="token punctuation">.</span>ts <span class="token comment">#EXTINF:6.880000,</span> cFN8o3436006<span class="token punctuation">.</span>ts </code></pre> <p>So the workflow for grabbing a video is:</p> <ol> <li>Find the M3U8 file (by whatever means)</li> <li>Download the ts files listed in the M3U8 file</li> <li>Merge the ts files into a single mp4 file (by whatever means, not necessarily programming)</li> </ol> <h4>4.9.2 
Scraping 云播TV (simple version)</h4> <p>The site used in the video no longer works, so 云播TV is used instead.</p> <p>url: https://www.yunbtv.com/</p> <pre><code class="prism language-python"><span class="token keyword">import</span> requests <span class="token keyword">import</span> re url <span class="token operator">=</span> <span class="token string">"https://video.buycar5.cn/20200813/uNqvsBhl/2000kb/hls/index.m3u8"</span> key_uri <span class="token operator">=</span> <span class="token string">"https://ts1.yuyuangewh.com:9999/20200813/uNqvsBhl/2000kb/hls/key.key"</span> <span class="token comment"># 1. Request the m3u8 file first; printing its content shows that the segments are encrypted</span> m3u8_text <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">,</span> verify<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span> <span class="token comment"># 2. Download the m3u8 file and save it as index.m3u8 (run once so that step 4 has a file to read)</span> <span class="token comment"># with open("download_video/"+"index.m3u8", mode="wb") as f:</span> <span class="token comment"># f.write(m3u8_text.content)</span> <span class="token comment"># m3u8_text.close()</span> <span class="token comment"># print("m3u8 success")</span> <span class="token comment"># 3. Download key.key and save it as key.m3u8</span> <span class="token comment"># key_text = requests.get(key_uri)</span> <span class="token comment"># with open("download_video/"+"key.m3u8", mode="wb") as f:</span> <span class="token comment"># f.write(key_text.content)</span> <span class="token comment"># key_text.close()</span> <span class="token comment"># print("key success")</span> <span class="token comment"># 4. Parse the m3u8 file</span> n <span class="token operator">=</span> <span class="token number">1</span> <span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"download_video/index.m3u8"</span><span class="token punctuation">,</span> mode<span class="token operator">=</span><span 
class="token string">'r'</span><span class="token punctuation">,</span> encoding<span class="token operator">=</span><span class="token string">'utf-8'</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span> <span class="token keyword">for</span> line <span class="token keyword">in</span> f<span class="token punctuation">:</span> line <span class="token operator">=</span> line<span class="token punctuation">.</span>strip<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment"># strip spaces and the trailing newline first</span> <span class="token keyword">if</span> line<span class="token punctuation">.</span>startswith<span class="token punctuation">(</span><span class="token string">"#"</span><span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token comment"># skip lines that start with #</span> <span class="token keyword">continue</span> <span class="token comment"># download the video segment; the output file gets its own name (ts_file) so it does not shadow the playlist handle f</span> resp2 <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>line<span class="token punctuation">,</span> verify<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span> ts_file <span class="token operator">=</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"download_video/"</span><span class="token operator">+</span><span class="token string-interpolation"><span class="token string">f"</span><span class="token interpolation"><span class="token punctuation">{ </span>n<span class="token punctuation">}</span></span><span class="token string">.ts"</span></span><span class="token punctuation">,</span> mode<span class="token operator">=</span><span class="token string">"wb"</span><span class="token punctuation">)</span> ts_file<span class="token punctuation">.</span>write<span class="token punctuation">(</span>resp2<span class="token punctuation">.</span>content<span class="token punctuation">)</span> ts_file<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span> resp2<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span> n <span class="token operator">+=</span> <span class="token number">1</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string-interpolation"><span class="token string">f"第</span><span class="token interpolation"><span class="token punctuation">{ </span>n<span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">}</span></span><span class="token string">个完成"</span></span><span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"All Done"</span><span class="token punctuation">)</span> </code></pre> <p>I put this together from material I found online; it differs from the video and has not been optimized yet.</p> <h2>Chapter 5 selenium</h2> <h3>5.1 Introducing selenium</h3> <p>selenium is an automated testing tool. It can open a browser and operate it the way a person would, and the programmer can extract all kinds of information from the page directly through selenium.</p> <p>Environment setup:</p> <ul> <li>pip install selenium</li> <li>Download the browser driver from http://npm.taobao.org/mirrors/chromedriver</li> <li>Download the build that matches your browser version, unzip it, and put the chromedriver executable in the folder that contains the Python interpreter</li> <li>Let selenium launch Chrome</li> </ul> <pre><code class="prism language-python"><span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver <span class="token keyword">import</span> Chrome <span class="token comment"># 1. Create the browser object</span> web <span class="token operator">=</span> Chrome<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment"># 2. Open a URL</span> web<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"http://www.baidu.com"</span><span 
class="token punctuation">)</span> </code></pre> <h3>5.2 selenium operations: scraping Lagou</h3> <p>This section uses selenium to scrape job postings from the Lagou recruitment site (lagou.com). The find_element_by_* methods in these notes are the selenium 3 API; selenium 4.3 and later removed them in favour of find_element(By.XPATH, ...).</p> <pre><code class="prism language-python"><span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver <span class="token keyword">import</span> Chrome <span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver<span class="token punctuation">.</span>common<span class="token punctuation">.</span>keys <span class="token keyword">import</span> Keys <span class="token keyword">import</span> time web <span class="token operator">=</span> Chrome<span class="token punctuation">(</span><span class="token punctuation">)</span> web<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"http://lagou.com"</span><span class="token punctuation">)</span> <span class="token comment"># Find an element and click it</span> el <span class="token operator">=</span> web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="changeCityBox"]/p[1]/a'</span><span class="token punctuation">)</span> el<span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment"># click event</span> time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span> <span class="token comment"># give the browser a moment</span> <span class="token comment"># Find the search box, type python => press Enter / click the search button</span> web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="search_input"]'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>send_keys<span class="token punctuation">(</span><span class="token string">"python"</span><span class="token 
punctuation">,</span> Keys<span class="token punctuation">.</span>ENTER<span class="token punctuation">)</span> time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span> <span class="token comment"># Locate where the data lives and extract it</span> <span class="token comment"># Find every li on the page that holds a job posting</span> li_list <span class="token operator">=</span> web<span class="token punctuation">.</span>find_elements_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="s_position_list"]/ul/li'</span><span class="token punctuation">)</span> <span class="token keyword">for</span> li <span class="token keyword">in</span> li_list<span class="token punctuation">:</span> job_name <span class="token operator">=</span> li<span class="token punctuation">.</span>find_element_by_tag_name<span class="token punctuation">(</span><span class="token string">"h3"</span><span class="token punctuation">)</span><span class="token punctuation">.</span>text job_price <span class="token operator">=</span> li<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">"./div/div/div[2]/div/span"</span><span class="token punctuation">)</span><span class="token punctuation">.</span>text job_company <span class="token operator">=</span> li<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">"./div/div[2]/div/a"</span><span class="token punctuation">)</span><span class="token punctuation">.</span>text <span class="token keyword">print</span><span class="token punctuation">(</span>job_name<span class="token punctuation">,</span> job_company<span class="token punctuation">,</span> job_price<span class="token punctuation">)</span> </code></pre> <h3>5.3 selenium operations: switching between windows</h3> <pre><code class="prism language-python"><span class="token keyword">from</span> 
selenium<span class="token punctuation">.</span>webdriver <span class="token keyword">import</span> Chrome <span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver<span class="token punctuation">.</span>common<span class="token punctuation">.</span>keys <span class="token keyword">import</span> Keys <span class="token keyword">import</span> time web <span class="token operator">=</span> Chrome<span class="token punctuation">(</span><span class="token punctuation">)</span> web<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"http://lagou.com"</span><span class="token punctuation">)</span> el <span class="token operator">=</span> web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="changeCityBox"]/p[1]/a'</span><span class="token punctuation">)</span> el<span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span> time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span> web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="search_input"]'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>send_keys<span class="token punctuation">(</span><span class="token string">"python"</span><span class="token punctuation">,</span> Keys<span class="token punctuation">.</span>ENTER<span class="token punctuation">)</span> time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span> web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token 
string">'//*[@id="s_position_list"]/ul/li[1]/div[1]/div[1]/div[1]/a/h3'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment"># 在selenium眼中 新窗口是默认切换不过来的</span> web<span class="token punctuation">.</span>switch_to<span class="token punctuation">.</span>window<span class="token punctuation">(</span>web<span class="token punctuation">.</span>window_handles<span class="token punctuation">[</span><span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token comment"># 在新窗口中提取内容</span> job_detail <span class="token operator">=</span> web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="job_detail"]/dd[2]/div'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>text <span class="token keyword">print</span><span class="token punctuation">(</span>job_detail<span class="token punctuation">)</span> <span class="token comment"># 关掉子窗口</span> web<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment"># 变更selenium的窗口视角 回到原本的窗口</span> web<span class="token punctuation">.</span>switch_to<span class="token punctuation">.</span>window<span class="token punctuation">(</span>web<span class="token punctuation">.</span>window_handles<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span>web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token 
string">'//*[@id="s_position_list"]/ul/li[1]/div[1]/div[1]/div[1]/a/h3'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>text<span class="token punctuation">)</span> </code></pre> <h3>5.4 selenium operations: headless browser</h3> <p>When scraping a page we sometimes want the browser to run quietly in the background, without opening a window.</p> <pre><code class="prism language-python"><span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver <span class="token keyword">import</span> Chrome <span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>options <span class="token keyword">import</span> Options <span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver<span class="token punctuation">.</span>support<span class="token punctuation">.</span>select <span class="token keyword">import</span> Select <span class="token keyword">import</span> time <span class="token comment"># Headless browser: prepare the options</span> opt <span class="token operator">=</span> Options<span class="token punctuation">(</span><span class="token punctuation">)</span> opt<span class="token punctuation">.</span>add_argument<span class="token punctuation">(</span><span class="token string">"--headless"</span><span class="token punctuation">)</span> opt<span class="token punctuation">.</span>add_argument<span class="token punctuation">(</span><span class="token string">"--disable-gpu"</span><span class="token punctuation">)</span> web <span class="token operator">=</span> Chrome<span class="token punctuation">(</span>options<span class="token operator">=</span>opt<span class="token punctuation">)</span> <span class="token comment"># pass the options to the browser</span> web<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"http://www.endata.com.cn/BoxOffice/BO/Year/index.html"</span><span class="token punctuation">)</span> <span 
class="token comment"># 定位到下拉列表</span> sel_el <span class="token operator">=</span> web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="OptionDate"]'</span><span class="token punctuation">)</span> <span class="token comment"># 对元素进行包装,包装成下拉菜单</span> sel <span class="token operator">=</span> Select<span class="token punctuation">(</span>sel_el<span class="token punctuation">)</span> <span class="token comment"># 让浏览器进行调整选项</span> <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token builtin">len</span><span class="token punctuation">(</span>sel<span class="token punctuation">.</span>options<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token comment"># i就是每一个下拉框选项的索引位置</span> sel<span class="token punctuation">.</span>select_by_index<span class="token punctuation">(</span>i<span class="token punctuation">)</span> <span class="token comment"># 按照索引切换</span> time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span> table <span class="token operator">=</span> web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="TableList"]/table'</span><span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span>table<span class="token punctuation">.</span>text<span class="token punctuation">)</span> <span class="token comment"># 打印所有文本信息</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"============================================="</span><span class="token 
punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"All Done"</span><span class="token punctuation">)</span> web<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment"># To get the page's Elements source (the HTML after data loading and JS execution), call this before web.close():</span> <span class="token comment"># print(web.page_source)</span> </code></pre> <h3>5.5 selenium operations: handling captchas with 超级鹰</h3> <p>There are two common ways to deal with a captcha:</p> <ol> <li>Image recognition</li> <li>Use an established captcha-solving service available online</li> </ol> <p>超级鹰 (chaojiying) is one such online captcha-recognition service. You need to register an account and buy usage credits yourself. The developer documentation on the official site provides sample code for each language, and running that sample code is enough to get the feature working.</p> <h3>5.6 selenium: using 超级鹰 to log in to 超级鹰</h3> <p>This section uses 超级鹰 to log in to the 超级鹰 website automatically. It is mainly an exercise in using the methods of the 超级鹰 client.</p> <pre><code class="prism language-python"><span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver <span class="token keyword">import</span> Chrome <span class="token keyword">from</span> chaojiying <span class="token keyword">import</span> Chaojiying_Client <span class="token keyword">import</span> time web <span class="token operator">=</span> Chrome<span class="token punctuation">(</span><span class="token punctuation">)</span> web<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"http://www.chaojiying.com/user/login/"</span><span class="token punctuation">)</span> <span class="token comment"># Handle the captcha</span> img <span class="token operator">=</span> web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">"/html/body/div[3]/div/div[3]/div[1]/form/div/img"</span><span class="token punctuation">)</span><span class="token punctuation">.</span>screenshot_as_png chaojiying <span class="token operator">=</span> Chaojiying_Client<span class="token punctuation">(</span><span class="token string">'超级鹰用户名'</span><span class="token punctuation">,</span> <span class="token string">'超级鹰密码'</span><span 
class="token punctuation">,</span> <span class="token string">'ID'</span><span class="token punctuation">)</span> verity_code <span class="token operator">=</span> chaojiying<span class="token punctuation">.</span>PostPic<span class="token punctuation">(</span>img<span class="token punctuation">,</span> <span class="token number">1902</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token string">'pic_str'</span><span class="token punctuation">]</span> <span class="token comment"># 向页面中填入用户名,密码,验证码</span> web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'/html/body/div[3]/div/div[3]/div[1]/form/p[1]/input'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>send_keys<span class="token punctuation">(</span><span class="token string">"超级鹰用户名"</span><span class="token punctuation">)</span> web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'/html/body/div[3]/div/div[3]/div[1]/form/p[2]/input'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>send_keys<span class="token punctuation">(</span><span class="token string">"超级鹰密码"</span><span class="token punctuation">)</span> web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'/html/body/div[3]/div/div[3]/div[1]/form/p[3]/input'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>send_keys<span class="token punctuation">(</span>verity_code<span class="token punctuation">)</span> time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">3</span><span class="token punctuation">)</span> <span class="token comment"># 点击登录</span> web<span class="token 
punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'/html/body/div[3]/div/div[3]/div[1]/form/p[4]/input'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span> </code></pre> <h3>5.7 selenium: solving the 12306 login problem</h3> <p>The 12306 login page no longer uses image captchas, so this section differs from the video.</p> <p>12306 can detect whether your browser is being controlled by automated-testing software, so without special measures you cannot pass the slider verification. The check works like this: type <strong>window.navigator.webdriver</strong> in the browser console, and the Chrome instance driven by our test returns true, whereas a normal browser returns false. 12306 uses that value to decide whether you are running an automated test.</p> <p>Ways to avoid detection:</p> <ul> <li> <p>Chrome versions below 88: when the browser starts (before any page content is loaded), inject JavaScript into the page to remove the webdriver flag, that is, run the following before the web.get() call</p> </li> <li> <pre><code class="prism language-python">web<span class="token punctuation">.</span>execute_cdp_cmd<span class="token punctuation">(</span><span class="token string">"Page.addScriptToEvaluateOnNewDocument"</span><span class="token punctuation">,</span> <span class="token punctuation">{ </span> <span class="token string">"source"</span><span class="token punctuation">:</span> <span class="token triple-quoted-string string">""" navigator.webdriver = undefined; Object.defineProperty(navigator, 'webdriver', { get: () => undefined }); """</span> <span class="token punctuation">}</span><span class="token punctuation">)</span> 
</code></pre> </li> <li> <p>Chrome version 88 or later: import the Options class and pass an extra argument when creating the browser.</p> </li> <li> <pre><code class="prism language-python"><span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver <span class="token keyword">import</span> Chrome
<span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>options <span class="token keyword">import</span> Options

option <span class="token operator">=</span> Options<span class="token punctuation">(</span><span class="token punctuation">)</span>
option<span class="token punctuation">.</span>add_argument<span class="token punctuation">(</span><span class="token string">'--disable-blink-features=AutomationControlled'</span><span class="token punctuation">)</span>
web <span class="token operator">=</span> Chrome<span class="token punctuation">(</span>options<span class="token operator">=</span>option<span class="token punctuation">)</span>
</code></pre> </li> </ul> <p>Here is my code:</p> <pre><code class="prism language-python"><span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver <span class="token keyword">import</span> Chrome
<span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver<span class="token punctuation">.</span>common<span class="token punctuation">.</span>action_chains <span class="token keyword">import</span> ActionChains
<span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>options <span class="token keyword">import</span> Options
<span class="token keyword">import</span> time

option <span class="token operator">=</span> Options<span class="token punctuation">(</span><span class="token punctuation">)</span>
option<span class="token punctuation">.</span>add_argument<span class="token punctuation">(</span><span class="token string">'--disable-blink-features=AutomationControlled'</span><span class="token 
punctuation">)</span>
web <span class="token operator">=</span> Chrome<span class="token punctuation">(</span>options<span class="token operator">=</span>option<span class="token punctuation">)</span>
web<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"https://kyfw.12306.cn/otn/resources/login.html"</span><span class="token punctuation">)</span>
time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span>  <span class="token comment"># wait for the page to respond</span>
<span class="token comment"># Switch to account login</span>
web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="toolbar_Div"]/div[2]/div[2]/ul/li[2]/a'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span>
<span class="token comment"># Fill in the account name and password</span>
web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="J-userName"]'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>send_keys<span class="token punctuation">(</span><span class="token string">"123456789"</span><span class="token punctuation">)</span>
web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="J-password"]'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>send_keys<span class="token punctuation">(</span><span class="token string">"123456789"</span><span class="token punctuation">)</span>
<span class="token 
comment"># Click the login button</span>
web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="J-login"]'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span>
<span class="token comment"># Slider verification: drag the slider with an action chain</span>
span_element <span class="token operator">=</span> web<span class="token punctuation">.</span>find_element_by_xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="nc_1_n1z"]'</span><span class="token punctuation">)</span>
ActionChains<span class="token punctuation">(</span>web<span class="token punctuation">)</span><span class="token punctuation">.</span>drag_and_drop_by_offset<span class="token punctuation">(</span>span_element<span class="token punctuation">,</span> <span class="token number">320</span><span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">.</span>perform<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre></li> </ul> </div> </div> </div> </div> </div> <!--PC和WAP自适应版--> <div id="SOHUCS" sid="1450727448940462080"></div> <script type="text/javascript" src="/views/front/js/chanyan.js"></script> <!-- 文章页-底部 动态广告位 --> <div class="youdao-fixed-ad" id="detail_ad_bottom"></div> </div> <div class="col-md-3"> <div class="row" id="ad"> <!-- 文章页-右侧1 动态广告位 --> <div id="right-1" class="col-lg-12 col-md-12 col-sm-4 col-xs-4 ad"> <div class="youdao-fixed-ad" id="detail_ad_1"> </div> </div> <!-- 文章页-右侧2 动态广告位 --> <div id="right-2" class="col-lg-12 col-md-12 col-sm-4 col-xs-4 ad"> <div class="youdao-fixed-ad" id="detail_ad_2"></div> </div> <!-- 文章页-右侧3 
动态广告位 --> <div id="right-3" class="col-lg-12 col-md-12 col-sm-4 col-xs-4 ad"> <div class="youdao-fixed-ad" id="detail_ad_3"></div> </div> </div> </div> </div> </div> </div> <div class="container"> <h4 class="pt20 mb15 mt0 border-top">你可能感兴趣的:(笔记,python,爬虫,python,爬虫)</h4> <div id="paradigm-article-related"> <div class="recommend-post mb30"> <ul class="widget-links"> <li><a href="/article/2481.htm" title="readonly,只读,不可用" target="_blank">readonly,只读,不可用</a> <span class="text-muted">dashuaifu</span> <a class="tag" taget="_blank" href="/search/js/1.htm">js</a><a class="tag" taget="_blank" href="/search/jsp/1.htm">jsp</a><a class="tag" taget="_blank" href="/search/disable/1.htm">disable</a><a class="tag" taget="_blank" href="/search/readOnly/1.htm">readOnly</a><a class="tag" taget="_blank" href="/search/readOnly/1.htm">readOnly</a> <div>readOnly 和 readonly 不同,在做js开发时一定要注意函数大小写和jsp黄线的警告!!!我就经历过这么一件事: 使用readOnly在某些浏览器或同一浏览器不同版本有的可以实现“只读”功能,有的就不行,而且函数readOnly有黄线警告!!!就这样被折磨了不短时间!!!(期间使用过disable函数,但是发现disable函数之后后台接收不到前台的的数据!!!)   
</div> </li> <li><a href="/article/2608.htm" title="LABjs、RequireJS、SeaJS 介绍" target="_blank">LABjs、RequireJS、SeaJS 介绍</a> <span class="text-muted">dcj3sjt126com</span> <a class="tag" taget="_blank" href="/search/js/1.htm">js</a><a class="tag" taget="_blank" href="/search/Web/1.htm">Web</a> <div>LABjs 的核心是 LAB(Loading and Blocking):Loading 指异步并行加载,Blocking 是指同步等待执行。LABjs 通过优雅的语法(script 和 wait)实现了这两大特性,核心价值是性能优化。LABjs 是一个文件加载器。RequireJS 和 SeaJS 则是模块加载器,倡导的是一种模块化开发理念,核心价值是让 JavaScript 的模块化开发变得更</div> </li> <li><a href="/article/2735.htm" title="[应用结构]入口脚本" target="_blank">[应用结构]入口脚本</a> <span class="text-muted">dcj3sjt126com</span> <a class="tag" taget="_blank" href="/search/PHP/1.htm">PHP</a><a class="tag" taget="_blank" href="/search/yii2/1.htm">yii2</a> <div>入口脚本 入口脚本是应用启动流程中的第一环,一个应用(不管是网页应用还是控制台应用)只有一个入口脚本。终端用户的请求通过入口脚本实例化应用并将将请求转发到应用。 Web 应用的入口脚本必须放在终端用户能够访问的目录下,通常命名为 index.php,也可以使用 Web 服务器能定位到的其他名称。 控制台应用的入口脚本一般在应用根目录下命名为 yii(后缀为.php),该文</div> </li> <li><a href="/article/2862.htm" title="haoop shell命令" target="_blank">haoop shell命令</a> <span class="text-muted">eksliang</span> <a class="tag" taget="_blank" href="/search/hadoop/1.htm">hadoop</a><a class="tag" taget="_blank" href="/search/hadoop+shell/1.htm">hadoop shell</a> <div> cat chgrp chmod chown copyFromLocal copyToLocal cp du dus expunge get getmerge ls lsr mkdir movefromLocal mv put rm rmr setrep stat tail test text </div> </li> <li><a href="/article/2989.htm" title="MultiStateView不同的状态下显示不同的界面" target="_blank">MultiStateView不同的状态下显示不同的界面</a> <span class="text-muted">gundumw100</span> <a class="tag" taget="_blank" href="/search/android/1.htm">android</a> <div>只要将指定的view放在该控件里面,可以该view在不同的状态下显示不同的界面,这对ListView很有用,比如加载界面,空白界面,错误界面。而且这些见面由你指定布局,非常灵活。 PS:ListView虽然可以设置一个EmptyView,但使用起来不方便,不灵活,有点累赘。 <com.kennyc.view.MultiStateView xmlns:android=&qu</div> </li> <li><a href="/article/3116.htm" title="jQuery实现页面内锚点平滑跳转" target="_blank">jQuery实现页面内锚点平滑跳转</a> <span class="text-muted">ini</span> <a class="tag" 
taget="_blank" href="/search/JavaScript/1.htm">JavaScript</a><a class="tag" taget="_blank" href="/search/html/1.htm">html</a><a class="tag" taget="_blank" href="/search/jquery/1.htm">jquery</a><a class="tag" taget="_blank" href="/search/html5/1.htm">html5</a><a class="tag" taget="_blank" href="/search/css/1.htm">css</a> <div>平时我们做导航滚动到内容都是通过锚点来做,刷的一下就直接跳到内容了,没有一丝的滚动效果,而且 url 链接最后会有“小尾巴”,就像#keleyi,今天我就介绍一款 jquery 做的滚动的特效,既可以设置滚动速度,又可以在 url 链接上没有“小尾巴”。   效果体验:http://keleyi.com/keleyi/phtml/jqtexiao/37.htmHTML文件代码: &</div> </li> <li><a href="/article/3243.htm" title="kafka offset迁移" target="_blank">kafka offset迁移</a> <span class="text-muted">kane_xie</span> <a class="tag" taget="_blank" href="/search/kafka/1.htm">kafka</a> <div>在早前的kafka版本中(0.8.0),offset是被存储在zookeeper中的。   到当前版本(0.8.2)为止,kafka同时支持offset存储在zookeeper和offset manager(broker)中。   从官方的说明来看,未来offset的zookeeper存储将会被弃用。因此现有的基于kafka的项目如果今后计划保持更新的话,可以考虑在合适</div> </li> <li><a href="/article/3370.htm" title="android > 搭建 cordova 环境" target="_blank">android > 搭建 cordova 环境</a> <span class="text-muted">mft8899</span> <a class="tag" taget="_blank" href="/search/android/1.htm">android</a> <div>  1 , 安装 node.js        http://nodejs.org      node -v   查看版本   2, 安装 npm   可以先从  https://github.com/isaacs/npm/tags  下载 源码 解压到</div> </li> <li><a href="/article/3497.htm" title="java封装的比较器,比较是否全相同,获取不同字段名字" target="_blank">java封装的比较器,比较是否全相同,获取不同字段名字</a> <span class="text-muted">qifeifei</span> <div> 非常实用的java比较器,贴上代码: import java.util.HashSet; import java.util.List; import java.util.Set; import net.sf.json.JSONArray; import net.sf.json.JSONObject; import net.sf.json.JsonConfig; i</div> </li> <li><a href="/article/3624.htm" title="记录一些函数用法" target="_blank">记录一些函数用法</a> <span class="text-muted">.Aky.</span> <a class="tag" taget="_blank" href="/search/%E4%BD%8D%E8%BF%90%E7%AE%97/1.htm">位运算</a><a class="tag" taget="_blank" href="/search/PHP/1.htm">PHP</a><a class="tag" taget="_blank" 
href="/search/%E6%95%B0%E6%8D%AE%E5%BA%93/1.htm">数据库</a><a class="tag" taget="_blank" href="/search/%E5%87%BD%E6%95%B0/1.htm">函数</a><a class="tag" taget="_blank" href="/search/IP/1.htm">IP</a> <div>高手们照旧忽略。 想弄个全天朝IP段数据库,找了个今天最新更新的国内所有运营商IP段,copy到文件,用文件函数,字符串函数把玩下。分割出startIp和endIp这样格式写入.txt文件,直接用phpmyadmin导入.csv文件的形式导入。(生命在于折腾,也许你们觉得我傻X,直接下载人家弄好的导入不就可以,做自己的菜鸟,让别人去说吧) 当然用到了ip2long()函数把字符串转为整型数</div> </li> <li><a href="/article/3751.htm" title="sublime text 3 rust" target="_blank">sublime text 3 rust</a> <span class="text-muted">wudixiaotie</span> <a class="tag" taget="_blank" href="/search/Sublime+Text/1.htm">Sublime Text</a> <div>1.sublime text 3 => install package => Rust 2.cd ~/.config/sublime-text-3/Packages 3.mkdir rust 4.git clone https://github.com/sp0/rust-style 5.cd rust-style 6.cargo build --release 7.ctrl</div> </li> </ul> </div> </div> </div> <div> <div class="container"> <div class="indexes"> <strong>按字母分类:</strong> <a href="/tags/A/1.htm" target="_blank">A</a><a href="/tags/B/1.htm" target="_blank">B</a><a href="/tags/C/1.htm" target="_blank">C</a><a href="/tags/D/1.htm" target="_blank">D</a><a href="/tags/E/1.htm" target="_blank">E</a><a href="/tags/F/1.htm" target="_blank">F</a><a href="/tags/G/1.htm" target="_blank">G</a><a href="/tags/H/1.htm" target="_blank">H</a><a href="/tags/I/1.htm" target="_blank">I</a><a href="/tags/J/1.htm" target="_blank">J</a><a href="/tags/K/1.htm" target="_blank">K</a><a href="/tags/L/1.htm" target="_blank">L</a><a href="/tags/M/1.htm" target="_blank">M</a><a href="/tags/N/1.htm" target="_blank">N</a><a href="/tags/O/1.htm" target="_blank">O</a><a href="/tags/P/1.htm" target="_blank">P</a><a href="/tags/Q/1.htm" target="_blank">Q</a><a href="/tags/R/1.htm" target="_blank">R</a><a href="/tags/S/1.htm" target="_blank">S</a><a href="/tags/T/1.htm" target="_blank">T</a><a href="/tags/U/1.htm" target="_blank">U</a><a href="/tags/V/1.htm" target="_blank">V</a><a href="/tags/W/1.htm" target="_blank">W</a><a href="/tags/X/1.htm" 
target="_blank">X</a><a href="/tags/Y/1.htm" target="_blank">Y</a><a href="/tags/Z/1.htm" target="_blank">Z</a><a href="/tags/0/1.htm" target="_blank">其他</a> </div> </div> </div> <footer id="footer" class="mb30 mt30"> <div class="container"> <div class="footBglm"> <a target="_blank" href="/">首页</a> - <a target="_blank" href="/custom/about.htm">关于我们</a> - <a target="_blank" href="/search/Java/1.htm">站内搜索</a> - <a target="_blank" href="/sitemap.txt">Sitemap</a> - <a target="_blank" href="/custom/delete.htm">侵权投诉</a> </div> <div class="copyright">版权所有 IT知识库 CopyRight © 2000-2050 E-COM-NET.COM , All Rights Reserved. <!-- <a href="https://beian.miit.gov.cn/" rel="nofollow" target="_blank">京ICP备09083238号</a><br>--> </div> </div> </footer> <!-- 代码高亮 --> <script type="text/javascript" src="/static/syntaxhighlighter/scripts/shCore.js"></script> <script type="text/javascript" src="/static/syntaxhighlighter/scripts/shLegacy.js"></script> <script type="text/javascript" src="/static/syntaxhighlighter/scripts/shAutoloader.js"></script> <link type="text/css" rel="stylesheet" href="/static/syntaxhighlighter/styles/shCoreDefault.css"/> <script type="text/javascript" src="/static/syntaxhighlighter/src/my_start_1.js"></script> </body> </html>