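I took these notes while following the web-scraping video course by the Bilibili uploader 路飞学城IT; for the full material, look up the original videos on Bilibili. The notes are for reference only, and corrections are welcome. Some of the sites used here may no longer work, so substitute a suitable site where needed.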
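A crawler fetches resources from the internet, such as images, text, and video. The principle is to use code that imitates a browser and downloads those resources. Crawlers do not have to be written in Python; Java, C, and other languages work too. Python is the usual choice because it is concise and has a rich set of third-party libraries that make crawling simpler.
What is robots.txt? It is a file in which a site states which paths crawlers may and may not fetch. To view it, append "/robots.txt" to the site's root URL, for example http://www.bilibili.com/robots.txt.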
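As a minimal sketch of how a crawler could honor these rules, Python's standard urllib.robotparser module can parse robots.txt content and answer whether a URL may be fetched. The robots.txt text, the crawler name my-crawler, and the example.com URLs below are made up for illustration; a real crawler would load the site's actual file instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_text):
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS)
# Rules are checked in order, so the Disallow line wins for /private/ paths.
print(parser.can_fetch("my-crawler", "http://example.com/index.html"))
print(parser.can_fetch("my-crawler", "http://example.com/private/a.html"))
```

For a live site, set_url() followed by read() would download the file instead of parse().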
The first small crawler simply downloads the Baidu home page:
from urllib.request import urlopen

url = "http://www.baidu.com"
resp = urlopen(url)
# Pass encoding="utf-8" explicitly: the Baidu page is UTF-8, while open() falls back to the
# locale encoding (GBK on Chinese-language Windows), which would garble the saved page.
with open("myBaidu.html", mode="w", encoding="utf-8") as f:
    f.write(resp.read().decode("utf-8"))
print('success!')
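Get comfortable with the browser's packet-capture tools: press F12, then open the Network tab.
Protocol: an agreement that two computers follow so that they can communicate smoothly. Common protocols include TCP/IP, SOAP, HTTP, and SMTP.
HTTP stands for Hyper Text Transfer Protocol. It is the protocol used to transfer hypertext from World Wide Web (WWW) servers to the local browser; put simply, the data exchange between browser and server follows HTTP.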
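HTTP divides every message into three parts, whether it is a request or a response.
Request:
Request line -> method, request URL, protocol version
Request headers -> extra information for the server
Request body -> usually the request parameters
Response:
Status line -> protocol version, status code
Response headers -> extra information for the client
Response body -> the content the client actually wants (HTML, JSON, etc.)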
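To make the three parts concrete, here is a rough sketch that splits a raw HTTP request into its request line, headers, and body. The request text is hand-written for illustration (the host, header values, and form field are assumptions), and split_http_message is a helper name invented here, not part of any library.

```python
def split_http_message(raw):
    """Split a raw HTTP message into (start_line, headers, body)."""
    # A blank line (CRLF CRLF) separates the head from the body.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


# Hypothetical request, written by hand for this example.
raw_request = (
    "POST /sug HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "User-Agent: demo-client\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    "\r\n"
    "kw=dog"
)

start_line, headers, body = split_http_message(raw_request)
method, path, version = start_line.split(" ")
print(method, path, version)
print(headers["user-agent"])
print(body)
```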
The most common important request-header fields (needed by crawlers):
Some important response-header fields:
First install the requests module: pip install requests
import requests

url = 'https://www.sogou.com/web?query=周杰伦'
headers = {
    # Copy this value from the browser: F12 -> Network -> first request -> Request Headers -> user-agent.
    # It makes the request identify itself as a browser.
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36"
}
resp = requests.get(url, headers=headers)
print(resp)       # the Response object, which shows the status code
print(resp.text)  # the page source
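On Baidu Translate, the URL that returns the translation result is https://fanyi.baidu.com/sug
This example uses the POST method: it submits the word to translate and receives the result. With requests, POST parameters go in the data argument.
import requests

url = "https://fanyi.baidu.com/sug"
text = input("Enter the English word to translate: ")
data = {
    "kw": text
}
# Send the POST request; the form fields are passed as a dict through the data argument
resp = requests.post(url, data=data)
print(resp.json())  # parse the JSON returned by the server into a dict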
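When a crawler does not work, the first thing to try is the User-Agent. By default requests sends something like python-requests/2.25.1 (the number is the installed version), which is not a browser identifier.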
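The same idea applies to the standard library's urllib. As a small sketch, the code below attaches a browser-like User-Agent to a urllib Request object and reads it back without sending anything. The helper name build_request and the query value are assumptions for illustration.

```python
from urllib.request import Request

BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36"


def build_request(url, user_agent=BROWSER_UA):
    """Create a request object that carries a browser-like User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("https://www.sogou.com/web?query=python")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Passing the object to urlopen(req) would send the request with that header.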
This example uses the GET method to fetch the Douban movie chart. With requests, GET query parameters go in the params argument.
import requests

url = "https://movie.douban.com/j/chart/top_list"
# Query-string parameters, repackaged as a dict
param = {
    "type": "24",
    "interval_id": "100:90",
    "action": "",
    "start": 0,
    "limit": 20
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 "
                  "Safari/537.36 "
}
resp = requests.get(url, params=param, headers=headers)
print(resp.json())
resp.close()
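At the end of the program, close resp to release the connection; if responses are left open, repeated requests may eventually be refused. So add resp.close() at the end, and likewise close any file you open.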
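To see what the params argument amounts to, the sketch below builds the query string by hand with the standard library's urlencode. It assumes requests encodes the dict in roughly this way before appending it to the URL; no request is sent.

```python
from urllib.parse import urlencode

base_url = "https://movie.douban.com/j/chart/top_list"
param = {
    "type": "24",
    "interval_id": "100:90",
    "action": "",
    "start": 0,
    "limit": 20,
}

# urlencode percent-encodes reserved characters such as ":" and joins pairs with "&".
query = urlencode(param)
full_url = base_url + "?" + query
print(full_url)
```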
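The previous chapter covered the basics of fetching a whole web page. In most cases, though, we only need a small part of the page rather than all of it, which brings us to data extraction.
This course covers three parsing approaches: regular expressions (re), bs4, and XPath.
The three can be mixed freely. The work is result-driven: as long as you get the data you want, the approach does not matter much. Worry about performance once you have mastered them.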
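Regular Expression: a syntax that uses expressions to match strings.
The page source we fetch is essentially one very long string, so regular expressions are well suited to extracting content from it.
Advantages of regex: fast, efficient, precise
Disadvantage of regex: hard for beginners to pick up
Regex syntax: metacharacters are combined to match strings. An online tester is available at http://tool.oschina.net/regex/
Metacharacter: a special symbol with a fixed meaning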
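Common metacharacters:
. matches any character except a newline
\w matches a letter, digit, or underscore
\s matches any whitespace character
\d matches a digit
\n matches a newline
\t matches a tab
^ matches the start of the string
$ matches the end of the string
\W matches anything that is not a letter, digit, or underscore
\S matches a non-whitespace character
\D matches a non-digit
a|b matches a or b
() matches the expression inside the parentheses and also defines a group
[...] matches any character in the set
[^...] matches any character not in the set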
Quantifiers: control how many times the preceding element repeats
* zero or more times
+ one or more times
? zero or one time
{n} exactly n times
{n,} n or more times
{n,m} n to m times
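A quick way to get a feel for metacharacters and quantifiers is to try them with Python's re module. The sample strings and the helper is_mobile_like (with its simplified rule of 11 digits starting with 1) are assumptions made up for this sketch.

```python
import re


def is_mobile_like(text):
    """Return True if text is exactly 11 digits starting with 1 (a simplified, assumed rule)."""
    return re.fullmatch(r"1\d{10}", text) is not None


# {3,5}: runs of three to five digits; shorter runs are skipped.
runs = re.findall(r"\d{3,5}", "ab12cd345ef678901")
# [^\d\s]+: one or more characters that are neither digits nor whitespace.
words = re.findall(r"[^\d\s]+", "a1 bc22 def")

print(runs)
print(words)
print(is_mobile_like("13800138000"), is_mobile_like("1380013800"))
```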
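Greedy and lazy matching
.* greedy match
.*? lazy match
Lazy matching is what crawlers use most, so it deserves attention.
Lazy matching matches as little as possible. For example:
str: 玩儿吃鸡游戏,晚上一起上游戏,干嘛呢?打游戏啊
reg: 玩儿.*?游戏
# How it works: the pattern first matches "玩儿" and then needs ".*" followed by "游戏". A greedy ".*" consumes as much as possible, so the match would run to the last "游戏": "玩儿吃鸡游戏,晚上一起上游戏,干嘛呢?打游戏". The trailing "?" makes ".*" lazy, so it consumes as little as possible and the match stops at the first "游戏".
Result: 玩儿吃鸡游戏
str: <div class="jay">周杰伦</div><div class="jj">林俊杰</div>
reg: <div class=".*?">.*?</div>
Result: <div class="jay">周杰伦</div>
<div class="jj">林俊杰</div>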
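The difference between greedy and lazy matching can be checked directly with re, using the two sample strings from the examples above. This is only a small demonstration added for illustration.

```python
import re

text = "玩儿吃鸡游戏,晚上一起上游戏,干嘛呢?打游戏啊"

# Greedy: .* runs as far as it can, so the match ends at the last "游戏".
greedy = re.search(r"玩儿.*游戏", text).group()
# Lazy: .*? stops as soon as the rest of the pattern can match.
lazy = re.search(r"玩儿.*?游戏", text).group()

html = '<div class="jay">周杰伦</div><div class="jj">林俊杰</div>'
divs = re.findall(r'<div class=".*?">.*?</div>', html)

print(greedy)
print(lazy)
print(divs)
```

With a greedy pattern in the second example, both div elements would come back as one long match instead of two.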
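Once you know the syntax, how do you use regex in a program?
import re

# findall: return every match in the string as a list
lst = re.findall(r"\d+", "我的电话号是10086,我的女朋友电话号是10010")
print(lst)
# finditer: return an iterator over all matches; call .group() on each match to get its text
it = re.finditer(r"\d+", "我的电话号是10086,我的女朋友电话号是10010")
for i in it:
    print(i.group())
# search: return the first match as a match object; call .group() to get its text
s = re.search(r"\d+", "我的电话号是10086,我的女朋友电话号是10010")
print(s.group())
# match: only matches at the start of the string; this string starts with Chinese text, so the result is None
s = re.match(r"\d+", "我的电话号是10086,我的女朋友电话号是10010")
print(s.group() if s else None)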
When a pattern is long or reused, it can be precompiled:
# Precompile the regular expression
obj = re.compile(r"\d+")
ret = obj.finditer("我的电话号是10086,我的女朋友电话号是10010")
for it in ret:
    print(it.group())
ret = obj.findall("sadadsa223dawswefq123fasdigjoihuiohuiogsdf")
print(ret)
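So how do we pull out just one part of what a pattern matches?
import re

# The markup around these names did not survive in these notes; the tags below are an assumed reconstruction.
s = """
<div class='a'><span id='1'>张富帅</span></div>
<div class='b'><span id='2'>张富贵</span></div>
<div class='c'><span id='3'>吕富帅</span></div>
<div class='d'><span id='4'>小狗头</span></div>
<div class='e'><span id='5'>小煞笔</span></div>
"""
# (?P<group_name>pattern) captures part of the match under a name so it can be read separately
obj = re.compile(r"<div class='.*?'><span id='(?P<id>\d+)'>(?P<name>.*?)</span></div>", re.S)  # re.S lets . match newlines
res = obj.finditer(s)
for it in res:
    print(it.group("name"))
    print(it.group("id"))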
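Named groups pair well with the csv module, because match.groupdict() returns a dict keyed by group name. The sketch below is a hypothetical example: the ROW pattern, the sample text, and the name/score columns are invented for illustration, and the output goes to an in-memory buffer rather than a file.

```python
import csv
import io
import re

# Hypothetical pattern: a word, a comma, then an integer or decimal score.
ROW = re.compile(r"(?P<name>\w+),(?P<score>\d+(?:\.\d+)?)")


def rows_to_csv(text):
    """Write every match in text as one CSV row and return the CSV as a string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "score"], lineterminator="\n")
    writer.writeheader()
    for match in ROW.finditer(text):
        writer.writerow(match.groupdict())  # dict keys are the group names
    return buffer.getvalue()


csv_text = rows_to_csv("Alpha,9.7 Beta,9.5")
print(csv_text)
```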
import requests
import re
import csv

# Fetch the page
url = "http://movie.douban.com/top250"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 "
                  "Safari/537.36 "
}
resp = requests.get(url, headers=headers)
page_content = resp.text
# Parse the data: title, director, year, rating, and number of ratings.
# The first pattern line and the &nbsp; entities were garbled in these notes; they are
# reconstructed here as an assumption, based on the group names used below.
obj = re.compile(r'<li>.*?<div class="item">.*?<span class="title">(?P<title>.*?)</span>.*?'
                 r'<p class="">.*?导演: (?P<director>.*?)&nbsp;.*?<br>(?P<year>.*?)&nbsp;.*?'
                 r'<span class="rating_num" property="v:average">(?P<average>.*?)</span>.*?'
                 r'<span>(?P<people>.*?)</span>', re.S)
# Run the match
result = obj.finditer(page_content)
with open("top250.csv", mode="w", encoding='utf-8', newline="") as f:
    csvwriter = csv.writer(f)
    for it in result:
        # print(it.group("title"))
        # print(it.group("director").strip())
        # print(it.group("year").strip())
        # print('rating ' + it.group("average"))
        # print(it.group('people'))
        dic = it.groupdict()
        dic['director'] = dic['director'].strip()
        dic['year'] = dic['year'].strip()
        csvwriter.writerow(dic.values())
print('success')
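<h3>2.5 Scraping movie information from 电影天堂 (dytt)</h3>
<ol>
<li>Locate the "2021必看热片" (2021 must-see hits) section</li>
<li>Extract the links to the movie sub-pages from that section</li>
<li>Request each sub-page and pull out the download magnet link we want</li>
</ol>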
<pre><code class="prism language-python">import requests
import re

# Locate stage
domain = "https://dytt89.com/"
resp = requests.get(domain, verify=False)  # verify=False skips TLS certificate verification
resp.encoding = 'gb2312'  # set the character set
# print(resp.text)
# Extract stage: get the <li> items inside the <ul>
obj1 = re.compile(r"2021必看热片.*?<ul>(?P<ul>.*?)</ul>", re.S)
obj2 = re.compile(r"<a href='(?P<href>.*?)'", re.S)
obj3 = re.compile(r'◎片 名 (?P<movie>.*?)<br />.*?<td style="WORD-WRAP: break-word" bgcolor="#fdfddf"><a href="'
                  r'(?P<download>.*?)"', re.S)
result1 = obj1.finditer(resp.text)
child_href_list = []
for it in result1:
    ul = it.group('ul')
    # Extract the sub-page links
    result2 = obj2.finditer(ul)
    for itt in result2:
        # Build the sub-page URL: domain + sub-page path
        child_href = domain + itt.group('href').strip('/')
        child_href_list.append(child_href)  # store the sub-page link
# Extract the sub-page content
for href in child_href_list:
    child_resp = requests.get(href, verify=False)
    child_resp.encoding = 'gb2312'
    result3 = obj3.search(child_resp.text)
    if result3 is None:  # skip pages whose layout does not match the pattern
        continue
    print(result3.group('movie'))
    print(result3.group('download'))
</code></pre>
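<p>Joining the domain and a relative link by string concatenation only works while the slashes line up. A possible alternative, sketched here with the standard library's urljoin, resolves a link against the base URL whatever form it takes. The two href values are hypothetical examples, not links taken from the site.</p>

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve each href against base_url and return absolute URLs."""
    return [urljoin(base_url, href) for href in hrefs]


# Hypothetical hrefs: one with a leading slash, one without.
links = absolute_links("https://dytt89.com/", ["/i/12345.html", "i/67890.html"])
for link in links:
    print(link)
```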
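<h3>2.5 Before bs4: HTML syntax basics</h3>
<p>Parsing with bs4 is fairly simple but needs some HTML knowledge; with that in place, extracting data with bs4 becomes straightforward in both logic and code. Readers with front-end experience can skip this part.</p>
<p>HTML (Hyper Text Markup Language) is the most basic and most central language for writing web pages. Its syntax marks page content with different tags so that the content is displayed in different ways.</p>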
<pre><code class="prism language-html"><span class="token tag"><span class="token tag"><span class="token punctuation"><</span>h1</span><span class="token punctuation">></span></span>
Hello World!
<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>h1</span><span class="token punctuation">></span></span>
</code></pre>
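<p>The code above shows the sentence "Hello World!" on the page, but the sentence is marked by an opening h1 tag and a closing h1 tag; in plain words, it is wrapped in an h1 element. When the browser renders it, "Hello World!" becomes bigger and bolder, as a heading. HTML syntax consists of marking page content with tags like this, and different tags produce different effects.</p>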
<pre><code class="prism language-html">h1: level-1 heading
h2: level-2 heading
p: paragraph
font: font (deprecated, but still works)
body: document body
</code></pre>
<p>There are many more tags, which are not all listed here. Next come attributes.</p>
<pre><code class="prism language-html"><span class="token tag"><span class="token tag"><span class="token punctuation"><</span>h1</span> <span class="token attr-name">align</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">'</span>center<span class="token punctuation">'</span></span><span class="token punctuation">></span></span>
Hello World!
<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>h1</span><span class="token punctuation">></span></span>
<span class="token tag"><span class="token tag"><span class="token punctuation"><</span>li</span> <span class="token attr-name">id</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">'</span>1<span class="token punctuation">'</span></span><span class="token punctuation">></span></span>a<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>li</span><span class="token punctuation">></span></span>
<span class="token tag"><span class="token tag"><span class="token punctuation"><</span>li</span> <span class="token attr-name">id</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">'</span>2<span class="token punctuation">'</span></span><span class="token punctuation">></span></span>b<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>li</span><span class="token punctuation">></span></span>
<span class="token tag"><span class="token tag"><span class="token punctuation"><</span>li</span> <span class="token attr-name">id</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">'</span>3<span class="token punctuation">'</span></span><span class="token punctuation">></span></span>c<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>li</span><span class="token punctuation">></span></span>
</code></pre>
<p>Here "align" is an attribute of the tag and "center" is its value. bs4 parsing can later search by attribute values such as id.</p>
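<p>To illustrate how tags, attributes, and attribute values fit together before bringing in bs4, this sketch uses the standard library's html.parser (a different tool from bs4) to collect the text of each li keyed by its id. The class name LiCollector is invented for this example.</p>

```python
from html.parser import HTMLParser


class LiCollector(HTMLParser):
    """Collect the text of every <li> that has an id attribute."""

    def __init__(self):
        super().__init__()
        self.items = {}
        self._current_id = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [("id", "1")]
        if tag == "li":
            self._current_id = dict(attrs).get("id")

    def handle_data(self, data):
        if self._current_id is not None and data.strip():
            self.items[self._current_id] = data.strip()

    def handle_endtag(self, tag):
        if tag == "li":
            self._current_id = None


collector = LiCollector()
collector.feed("<li id='1'>a</li><li id='2'>b</li><li id='3'>c</li>")
print(collector.items)
```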
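<h3>2.6 Getting started with bs4: vegetable prices</h3>
<p>First install the module with pip install bs4</p>
<ol>
<li>Fetch the page source</li>
<li>Parse it with bs4 to get the data</li>
</ol>
<p>The site used in the video has since changed its source, so the URL used here is http://www.bjtzh.gov.cn/bjtz/home/jrcj/index.shtml; the result is similar.</p>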
<pre><code class="prism language-python">import requests
from bs4 import BeautifulSoup
import csv

url = "http://www.bjtzh.gov.cn/bjtz/home/jrcj/index.shtml"
resp = requests.get(url)
resp.encoding = 'utf-8'
f = open("vegetable_price.csv", mode="w", encoding='utf-8', newline="")
csvwriter = csv.writer(f)
# Parse the data
# 1. Hand the page source to BeautifulSoup to build a bs object
page = BeautifulSoup(resp.text, "html.parser")  # use the built-in HTML parser
# 2. Look up data in the bs object
# find(tag, attribute=value) returns only the first match
# find_all(tag, attribute=value) returns every match
table = page.find("table", attrs={
    "style": "margin: 0px auto; width: 588px; height: 847px; border-collapse: collapse;",
    "width": "588",
    "cellspacing": "0",
    "cellpadding": "0",
    "border": "1",
    "align": "center"
})
# Take all data rows (the slice drops the rows above the data)
trs = table.find_all("tr")[7:]
for tr in trs:  # each row
    tds = tr.find_all('td')  # the td cells of the row
    name = tds[0].text  # .text is the content wrapped by the tag
    kind = tds[1].text
    high = tds[2].text
    low = tds[3].text
    csvwriter.writerow([name, kind, high, low])
f.close()
print('success!')
</code></pre>
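<h3>2.7 bs4 case study: downloading images from 优美图库 (umei.cc)</h3>
<ol>
<li>Fetch the main page source and extract the sub-page links (href)</li>
<li>Use each href to fetch the sub-page and find the image download address there (img -> src)</li>
<li>Download the image</li>
</ol>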
<pre><code class="prism language-python">import requests
from bs4 import BeautifulSoup
import time

url_index = "https://umei.cc"
url = "https://umei.cc/bizhitupian/weimeibizhi/"
resp = requests.get(url)
resp.encoding = "utf-8"
# Hand the source to BeautifulSoup
main_page = BeautifulSoup(resp.text, "html.parser")
a_list = main_page.find("div", class_="TypeList").find_all("a")
# print(a_list)
for a in a_list:
    href = url_index + a.get('href')  # get() returns the attribute value directly
    # Fetch the sub-page source
    child_resp = requests.get(href)
    child_resp.encoding = "utf-8"
    # Find the image download link in the sub-page
    child_page = BeautifulSoup(child_resp.text, "html.parser")
    p = child_page.find("p", align="center")
    img = p.find("img")
    src = img.get("src")
    # Download the image
    img_resp = requests.get(src)
    # img_resp.content  # the raw bytes
    img_name = src.split("/")[-1]  # everything after the last / in the URL
    with open("Wallpaper/"+img_name, mode='wb') as f:  # the Wallpaper folder must already exist
        f.write(img_resp.content)  # write the image bytes to the file
    print("success!", img_name)
    time.sleep(1)  # pause so the server is not hit too often
print("all over")
</code></pre>
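<p>The download step writes into a Wallpaper folder and takes the file name from the image URL. The sketch below shows one way to make that step more defensive: create the folder if it is missing and derive the name from the URL path so that a query string does not end up in the file name. The helper save_bytes, the example.com URL, and the fake bytes are assumptions for illustration; no download happens.</p>

```python
import tempfile
from pathlib import Path
from urllib.parse import urlsplit


def save_bytes(content, src, folder):
    """Save content under folder, named after the last path segment of src."""
    folder.mkdir(parents=True, exist_ok=True)  # create the folder when missing
    name = Path(urlsplit(src).path).name       # ignores any ?query part
    target = folder / name
    target.write_bytes(content)
    return target


with tempfile.TemporaryDirectory() as tmp:
    saved = save_bytes(
        b"fake image bytes",
        "https://example.com/img/cat.jpg?size=big",
        Path(tmp) / "Wallpaper",
    )
    print(saved.name, saved.stat().st_size)
```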
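<h3>2.8 Getting started with XPath</h3>
<p>XPath is a language for searching content in XML documents.</p>
<p>HTML has the same kind of tag tree as XML, so XPath can be used to search HTML pages as well.</p>
<p>Install the lxml module: pip install lxml</p>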
<pre><code class="prism language-python">from lxml import etree

xml = """
<book>
    <id>1</id>
    <name>野花遍地香</name>
    <price>1.23</price>
    <author>
        <nick id="10086">周大强</nick>
        <nick id="10010">周芷若</nick>
        <nick class="joy">周杰伦</nick>
        <nick class="jolin">蔡依林</nick>
        <div>
            <nick>rerererererer</nick>
        </div>
        <div>
            <nick>rerererererer2</nick>
            <div>
                <nick>rerererererer3</nick>
            </div>
        </div>
    </author>
    <partner>
        <nick id="ppc">胖胖陈</nick>
        <nick id="ppbc">胖胖不陈</nick>
    </partner>
</book>
"""
tree = etree.XML(xml)
# result = tree.xpath("/book")  # / expresses hierarchy; the leading / is the root
# result = tree.xpath("/book/name/text()")  # text() takes the text
# result = tree.xpath("/book/author//nick/text()")  # // means descendants: the direct nick texts plus the three nested ones
# result = tree.xpath("/book/author/*/nick/text()")  # * is a wildcard for any node: only the first two nested ones
result = tree.xpath("/book//nick/text()")  # the text of every nick
print(result)
</code></pre>
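<p>The same path ideas can be tried without installing anything: the standard library's xml.etree.ElementTree supports a limited subset of XPath (relative paths, //, [@attr='value'], and positional indexes). The sketch below uses a shortened version of the book document; it is not the lxml API, and absolute paths such as /book or the text() function are not available in this subset.</p>

```python
import xml.etree.ElementTree as ET

# Shortened sample document; the "nested" entry is a stand-in added for this sketch.
XML = """
<book>
    <name>野花遍地香</name>
    <author>
        <nick id="10086">周大强</nick>
        <nick class="joy">周杰伦</nick>
        <div><nick>nested</nick></div>
    </author>
</book>
"""

root = ET.fromstring(XML.strip())  # root is the <book> element

direct = [n.text for n in root.findall("./author/nick")]  # children only
everywhere = [n.text for n in root.findall(".//nick")]    # all descendants
by_id = root.find(".//nick[@id='10086']").text            # attribute filter
first = root.find("./author/nick[1]").text                # indexes start at 1

print(direct)
print(everywhere)
print(by_id, first)
```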
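<p>In an HTML document, [] can hold an index that selects the n-th matching element, and it can also hold an attribute filter. For example:</p>
<ul>
<li>/html/body/ul/li[1]/a/text() is the text of the a tag inside the first li;</li>
<li>/html/body/ol/li/a[@href='dapao']/text() is the text of the a tag whose href is "dapao";</li>
<li>/html/body/ol/li/a/@href returns just the href attribute value of the a tags.</li>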
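<li><p></p> <p><strong>Tip</strong>: press F12 on a web page; from the page source shown there you can quickly copy an element's XPath</p> <h3>2.9 XPath in practice: scraping listings from 猪八戒网 (zbj.com)</h3>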
<ol>
<li>Fetch the page source</li>
<li>Extract and parse the data</li>
</ol> <p>在这里我搜索的是“小程序开发”,遇到许多视频中没有出现的问题,好在通过百度也算是解决了,如果有更好的解决方法麻烦大佬留言</p> <pre><code class="prism language-python"><span class="token keyword">import</span> requests
<span class="token keyword">from</span> lxml <span class="token keyword">import</span> etree
<span class="token comment"># 我这搜索的是小程序开发,爬取过程中有许多不方便的,尽量尝试搜索英文</span>
url <span class="token operator">=</span> <span class="token string">"https://beijing.zbj.com/search/f/?kw=%E5%B0%8F%E7%A8%8B%E5%BA%8F%E5%BC%80%E5%8F%91"</span>
resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">)</span>
<span class="token comment"># print(resp.text)</span>
<span class="token comment"># Parse the HTML</span>
html <span class="token operator">=</span> etree<span class="token punctuation">.</span>HTML<span class="token punctuation">(</span>resp<span class="token punctuation">.</span>text<span class="token punctuation">)</span>
<span class="token comment"># Select the div of every service provider on the page</span>
divs <span class="token operator">=</span> html<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">"/html/body/div[6]/div/div/div[3]/div[4]/div[1]/div"</span><span class="token punctuation">)</span>
<span class="token keyword">for</span> div <span class="token keyword">in</span> divs<span class="token punctuation">:</span> <span class="token comment"># one service provider per iteration</span>
    price_w <span class="token operator">=</span> div<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div/div/a[1]/div[2]/div[1]/span[1]/text()'</span><span class="token punctuation">)</span>
    <span class="token keyword">if</span> <span class="token keyword">not</span> price_w<span class="token punctuation">:</span> <span class="token comment"># some entries come back with no price text, so skip them</span>
        <span class="token keyword">continue</span>
    price <span class="token operator">=</span> price_w<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span>
    title <span class="token operator">=</span> <span class="token string">"小程序"</span><span class="token punctuation">.</span>join<span class="token punctuation">(</span>div<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div/div/a[1]/div[2]/div[2]/p/text()'</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    company <span class="token operator">=</span> div<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div/div/a[2]/div[1]/p/text()'</span><span class="token punctuation">)</span> <span class="token comment"># the result contains newline characters</span>
    company <span class="token operator">=</span> <span class="token builtin">list</span><span class="token punctuation">(</span><span class="token builtin">filter</span><span class="token punctuation">(</span><span class="token boolean">None</span><span class="token punctuation">,</span> <span class="token punctuation">[</span>x<span class="token punctuation">.</span>strip<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">for</span> x <span class="token keyword">in</span> company<span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token comment"># strip the newlines, then drop the empty strings from the list</span>
    location <span class="token operator">=</span> div<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'./div/div/a[2]/div[1]/div/span/text()'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span>title<span class="token punctuation">,</span> price<span class="token punctuation">,</span> company<span class="token punctuation">,</span> location<span class="token punctuation">)</span>
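</code></pre> <h2>Chapter 3 Advanced Requests</h2> <h3>3.1 Overview of Advanced Requests</h3> <p>We have already used headers in the earlier crawlers. Headers are the request headers of the HTTP protocol. They generally carry data that is not part of the request content itself, and sometimes security or authentication information, such as the common User-Agent, token, and cookie.</p> <p>When sending a request with requests, we can put the header information in headers or supply it separately; requests then assembles everything into the complete HTTP request headers for us.</p> <p>This chapter covers:</p>
<ol>
<li>Simulating a browser login -> handling cookies</li>
<li>Handling anti-hotlinking -> scraping Pear Video data</li>
<li>Proxies -> avoiding IP bans</li>
</ol> <p>Comprehensive exercise: scraping NetEase Cloud Music comments</p> <h3>3.2 Handling cookies: logging in to a novel site</h3> <p>Log in -> obtain the cookie</p> <p>Request the bookshelf URL with the cookie -> get the bookshelf contents</p> <p>These two steps must be chained together. We can send both requests through a session: a session can be thought of as a series of requests during which the cookie is not lost.</p> <pre><code class="prism language-python"><span class="token keyword">import</span> requests
<span class="token comment"># Session</span>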
session <span class="token operator">=</span> requests<span class="token punctuation">.</span>session<span class="token punctuation">(</span><span class="token punctuation">)</span>
data <span class="token operator">=</span> <span class="token punctuation">{
</span>
<span class="token string">"loginName"</span><span class="token punctuation">:</span> <span class="token string">"13757696746"</span><span class="token punctuation">,</span>
<span class="token string">"password"</span><span class="token punctuation">:</span> <span class="token string">"123qweasdzxc"</span>
<span class="token punctuation">}</span>
<span class="token comment"># 1. Log in</span>
url <span class="token operator">=</span> <span class="token string">"https://passport.17k.com/ck/user/login"</span>
resp <span class="token operator">=</span> session<span class="token punctuation">.</span>post<span class="token punctuation">(</span>url<span class="token punctuation">,</span> data<span class="token operator">=</span>data<span class="token punctuation">)</span>
<span class="token comment"># 2. Fetch the bookshelf data through the same session</span>
resp_b <span class="token operator">=</span> session<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919"</span><span class="token punctuation">)</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>resp_b<span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
<span class="token comment"># Alternative: send the cookie copied from the browser in the headers</span>
resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919"</span><span class="token punctuation">,</span> headers<span class="token operator">=</span><span class="token punctuation">{
</span>
<span class="token string">"Cookie"</span><span class="token punctuation">:</span> <span class="token string">"cookie copied from the browser"</span>
<span class="token punctuation">}</span><span class="token punctuation">)</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>resp<span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
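# --- Supplementary sketch (not from the video) ---
# The "Cookie" header above is one long string copied from the browser.
# requests also accepts a dict through its cookies= parameter, so the copied
# string can be split into name/value pairs first. parse_cookie_string is a
# hypothetical helper name introduced only for this illustration.
def parse_cookie_string(raw):
    cookies = {}
    for pair in raw.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep:  # ignore fragments that contain no "="
            cookies[name] = value
    return cookies

# Usage (placeholder cookie text, not a real login):
# resp = requests.get(shelf_url, cookies=parse_cookie_string("a=1; b=2"))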
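</code></pre> <h3>3.3 Anti-hotlinking: scraping Pear Video</h3> <p>The video URL does not appear in the page source, so the link is presumably generated by JavaScript. Capturing the network traffic reveals a URL that closely resembles the real video link, so the real link has to be pieced together from it:</p>
<ol>
<li>Get the contID</li>
<li>Get the JSON returned by videoStatus -> srcURL</li>
<li>Fix up the contents of srcURL</li>
<li>Download the video</li>
</ol> <p><strong>What is anti-hotlinking</strong>: it is about tracing where a request came from. The pages form a hierarchy during the request process: you must arrive at the second page from the first one, and requesting the second page directly is rejected. The value to send (the Referer header) is the page one level above the page being requested.</p> <pre><code class="prism language-python"><span class="token keyword">import</span> requests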
url <span class="token operator">=</span> <span class="token string">"https://www.pearvideo.com/video_1738675"</span>
contID <span class="token operator">=</span> url<span class="token punctuation">.</span>split<span class="token punctuation">(</span><span class="token string">"_"</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span>
videoStatusUrl <span class="token operator">=</span> <span class="token string-interpolation"><span class="token string">f"https://www.pearvideo.com/videoStatus.jsp?contId=</span><span class="token interpolation"><span class="token punctuation">{</span>contID<span class="token punctuation">}</span></span><span class="token string">&mrd=0.5611111607819312"</span></span>
headers <span class="token operator">=</span> <span class="token punctuation">{
</span>
<span class="token string">"User-Agent"</span><span class="token punctuation">:</span> <span class="token string">"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 "</span>
<span class="token string">"Safari/537.36 "</span><span class="token punctuation">,</span>
<span class="token comment"># Anti-hotlinking: the Referer is the video page we came from</span>
<span class="token string">"Referer"</span><span class="token punctuation">:</span> url
<span class="token punctuation">}</span>
resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>videoStatusUrl<span class="token punctuation">,</span> headers<span class="token operator">=</span>headers<span class="token punctuation">)</span>
dic <span class="token operator">=</span> resp<span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span>
srcUrl <span class="token operator">=</span> dic<span class="token punctuation">[</span><span class="token string">"videoInfo"</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"videos"</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"srcUrl"</span><span class="token punctuation">]</span>
systemTime <span class="token operator">=</span> dic<span class="token punctuation">[</span><span class="token string">"systemTime"</span><span class="token punctuation">]</span>
srcUrl <span class="token operator">=</span> srcUrl<span class="token punctuation">.</span>replace<span class="token punctuation">(</span>systemTime<span class="token punctuation">,</span> <span class="token string-interpolation"><span class="token string">f"cont-</span><span class="token interpolation"><span class="token punctuation">{</span>contID<span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">)</span>
<span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"videos/a.mp4"</span><span class="token punctuation">,</span> mode<span class="token operator">=</span><span class="token string">'wb'</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span> <span class="token comment"># the videos/ directory must already exist</span>
    f<span class="token punctuation">.</span>write<span class="token punctuation">(</span>requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>srcUrl<span class="token punctuation">)</span><span class="token punctuation">.</span>content<span class="token punctuation">)</span>
<span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"success!"</span><span class="token punctuation">)</span>
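# --- Supplementary sketch (not from the video) ---
# The URL repair step above, isolated as a pure function so it can be checked
# without touching the network. fix_src_url and the sample values below are
# made up for illustration; only the replace() logic mirrors the code above.
def fix_src_url(src_url, system_time, cont_id):
    # the returned link carries systemTime where the real one has "cont-" + contId
    return src_url.replace(system_time, f"cont-{cont_id}")

sample = "https://video.example.com/mp4/1600000000000-123.mp4"
print(fix_src_url(sample, "1600000000000", "1738675"))
# prints https://video.example.com/mp4/cont-1738675-123.mp4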
</code></pre> <h3>3.4 Proxies</h3> <p>Principle: the request is sent through a third-party machine instead of directly from your own.</p> <pre><code class="prism language-python"><span class="token keyword">import</span> requests
<span class="token comment"># 36.112.139.146</span>
proxies <span class="token operator">=</span> <span class="token punctuation">{
</span>
<span class="token string">"http"</span><span class="token punctuation">:</span> <span class="token string">"http://36.112.139.146:3128"</span>
<span class="token punctuation">}</span>
resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"http://www.baidu.com"</span><span class="token punctuation">,</span> proxies<span class="token operator">=</span>proxies<span class="token punctuation">)</span>
resp<span class="token punctuation">.</span>encoding <span class="token operator">=</span> <span class="token string">"utf-8"</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>resp<span class="token punctuation">.</span>text<span class="token punctuation">)</span>
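# --- Supplementary sketch (not from the video) ---
# requests chooses the proxy by URL scheme, so the dict above only covers
# http:// URLs; an https:// request would not go through it. A minimal helper
# that registers the same proxy for both schemes, assuming the proxy itself
# is reached over plain HTTP. build_proxies is a hypothetical name.
def build_proxies(host, port):
    address = f"http://{host}:{port}"
    return {"http": address, "https": address}

# Usage: requests.get("https://www.baidu.com", proxies=build_proxies("36.112.139.146", 3128))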
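</code></pre> <h3>3.5 Comprehensive exercise: scraping NetEase Cloud Music comments</h3>
<ol>
<li>Find the unencrypted parameters</li>
<li>Encrypt the parameters (this must follow NetEase's own logic): params => encText, encSecKey => encSecKey</li>
<li>Send the request to NetEase and get the comment data</li>
</ol> <p>This scrape involves very elaborate encryption. After capturing the hot-comments request in the Network panel, you can see that its data is encrypted. The Initiator tab lists the JS calls that led to the request; click the first entry, which is the last script to run, to view its code. Set a breakpoint on that line and step until the request for the matching URL is hit; under Local in the Scope pane on the right you can see the encrypted data. To find the line where the encryption happens, walk back through the Call Stack pane, checking frame by frame whether the data under Local is already encrypted. The data turns out to be still unencrypted at the u0x.be1x step, so this piece of JS is presumably what encrypts it. Note: the variable names in the JS file change on every refresh, so the names in the listing may not match.</p> <pre><code class="prism language-js"> u9l<span class="token punctuation">.</span><span class="token function-variable function">be9V</span> <span class="token operator">=</span> <span class="token keyword">function</span><span class="token punctuation">(</span><span class="token parameter"><span class="token constant">Y9P</span><span class="token punctuation">,</span> e9f</span><span class="token punctuation">)</span> <span class="token punctuation">{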
</span>
<span class="token keyword">var</span> i9b <span class="token operator">=</span> <span class="token punctuation">{
</span><span class="token punctuation">}</span>
<span class="token punctuation">,</span> e9f <span class="token operator">=</span> <span class="token constant">NEJ</span><span class="token punctuation">.</span><span class="token constant">X</span><span class="token punctuation">(</span><span class="token punctuation">{
</span><span class="token punctuation">}</span><span class="token punctuation">,</span> e9f<span class="token punctuation">)</span>
<span class="token punctuation">,</span> mo3x <span class="token operator">=</span> <span class="token constant">Y9P</span><span class="token punctuation">.</span><span class="token function">indexOf</span><span class="token punctuation">(</span><span class="token string">"?"</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token keyword">if</span> <span class="token punctuation">(</span>window<span class="token punctuation">.</span>GEnc <span class="token operator">&&</span> <span class="token regex"><span class="token regex-delimiter">/</span><span class="token regex-source language-regex">(^|\.com)\/api</span><span class="token regex-delimiter">/</span></span><span class="token punctuation">.</span><span class="token function">test</span><span class="token punctuation">(</span><span class="token constant">Y9P</span><span class="token punctuation">)</span> <span class="token operator">&&</span> <span class="token operator">!</span><span class="token punctuation">(</span>e9f<span class="token punctuation">.</span>headers <span class="token operator">&&</span> e9f<span class="token punctuation">.</span>headers<span class="token punctuation">[</span>eu0x<span class="token punctuation">.</span>Bl8d<span class="token punctuation">]</span> <span class="token operator">==</span> eu0x<span class="token punctuation">.</span>Io0x<span class="token punctuation">)</span> <span class="token operator">&&</span> <span class="token operator">!</span>e9f<span class="token punctuation">.</span>noEnc<span class="token punctuation">)</span> <span class="token punctuation">{
</span>
<span class="token keyword">if</span> <span class="token punctuation">(</span>mo3x <span class="token operator">!=</span> <span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">)</span> <span class="token punctuation">{
</span>
i9b <span class="token operator">=</span> j9a<span class="token punctuation">.</span><span class="token function">gX1x</span><span class="token punctuation">(</span><span class="token constant">Y9P</span><span class="token punctuation">.</span><span class="token function">substring</span><span class="token punctuation">(</span>mo3x <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token constant">Y9P</span> <span class="token operator">=</span> <span class="token constant">Y9P</span><span class="token punctuation">.</span><span class="token function">substring</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> mo3x<span class="token punctuation">)</span>
<span class="token punctuation">}</span>
<span class="token keyword">if</span> <span class="token punctuation">(</span>e9f<span class="token punctuation">.</span>query<span class="token punctuation">)</span> <span class="token punctuation">{
</span>
i9b <span class="token operator">=</span> <span class="token constant">NEJ</span><span class="token punctuation">.</span><span class="token constant">X</span><span class="token punctuation">(</span>i9b<span class="token punctuation">,</span> j9a<span class="token punctuation">.</span><span class="token function">fP1x</span><span class="token punctuation">(</span>e9f<span class="token punctuation">.</span>query<span class="token punctuation">)</span> <span class="token operator">?</span> j9a<span class="token punctuation">.</span><span class="token function">gX1x</span><span class="token punctuation">(</span>e9f<span class="token punctuation">.</span>query<span class="token punctuation">)</span> <span class="token operator">:</span> e9f<span class="token punctuation">.</span>query<span class="token punctuation">)</span>
<span class="token punctuation">}</span>
<span class="token keyword">if</span> <span class="token punctuation">(</span>e9f<span class="token punctuation">.</span>data<span class="token punctuation">)</span> <span class="token punctuation">{
</span>
i9b <span class="token operator">=</span> <span class="token constant">NEJ</span><span class="token punctuation">.</span><span class="token constant">X</span><span class="token punctuation">(</span>i9b<span class="token punctuation">,</span> j9a<span class="token punctuation">.</span><span class="token function">fP1x</span><span class="token punctuation">(</span>e9f<span class="token punctuation">.</span>data<span class="token punctuation">)</span> <span class="token operator">?</span> j9a<span class="token punctuation">.</span><span class="token function">gX1x</span><span class="token punctuation">(</span>e9f<span class="token punctuation">.</span>data<span class="token punctuation">)</span> <span class="token operator">:</span> e9f<span class="token punctuation">.</span>data<span class="token punctuation">)</span>
<span class="token punctuation">}</span>
i9b<span class="token punctuation">[</span><span class="token string">"csrf_token"</span><span class="token punctuation">]</span> <span class="token operator">=</span> u9l<span class="token punctuation">.</span><span class="token function">gP1x</span><span class="token punctuation">(</span><span class="token string">"__csrf"</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token constant">Y9P</span> <span class="token operator">=</span> <span class="token constant">Y9P</span><span class="token punctuation">.</span><span class="token function">replace</span><span class="token punctuation">(</span><span class="token string">"api"</span><span class="token punctuation">,</span> <span class="token string">"weapi"</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
e9f<span class="token punctuation">.</span>method <span class="token operator">=</span> <span class="token string">"post"</span><span class="token punctuation">;</span>
<span class="token keyword">delete</span> e9f<span class="token punctuation">.</span>query<span class="token punctuation">;</span>
<span class="token keyword">var</span> bUG7z <span class="token operator">=</span> window<span class="token punctuation">.</span><span class="token function">asrsea</span><span class="token punctuation">(</span><span class="token constant">JSON</span><span class="token punctuation">.</span><span class="token function">stringify</span><span class="token punctuation">(</span>i9b<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">bsB3x</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token string">"流泪"</span><span class="token punctuation">,</span> <span class="token string">"强"</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">bsB3x</span><span class="token punctuation">(</span><span class="token constant">WU8M</span><span class="token punctuation">.</span>md<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">bsB3x</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token string">"爱心"</span><span class="token punctuation">,</span> <span class="token string">"女孩"</span><span class="token punctuation">,</span> <span class="token string">"惊恐"</span><span class="token punctuation">,</span> <span class="token string">"大笑"</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
e9f<span class="token punctuation">.</span>data <span class="token operator">=</span> j9a<span class="token punctuation">.</span><span class="token function">cs0x</span><span class="token punctuation">(</span><span class="token punctuation">{
</span>
params<span class="token operator">:</span> bUG7z<span class="token punctuation">.</span>encText<span class="token punctuation">,</span>
encSecKey<span class="token operator">:</span> bUG7z<span class="token punctuation">.</span>encSecKey
<span class="token punctuation">}</span><span class="token punctuation">)</span>
<span class="token punctuation">}</span>
<span class="token keyword">var</span> cdnHost <span class="token operator">=</span> <span class="token string">"y.music.163.com"</span><span class="token punctuation">;</span>
<span class="token keyword">var</span> apiHost <span class="token operator">=</span> <span class="token string">"interface.music.163.com"</span><span class="token punctuation">;</span>
<span class="token keyword">if</span> <span class="token punctuation">(</span>location<span class="token punctuation">.</span>host <span class="token operator">===</span> cdnHost<span class="token punctuation">)</span> <span class="token punctuation">{
</span>
<span class="token constant">Y9P</span> <span class="token operator">=</span> <span class="token constant">Y9P</span><span class="token punctuation">.</span><span class="token function">replace</span><span class="token punctuation">(</span>cdnHost<span class="token punctuation">,</span> apiHost<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token constant">Y9P</span><span class="token punctuation">.</span><span class="token function">match</span><span class="token punctuation">(</span><span class="token regex"><span class="token regex-delimiter">/</span><span class="token regex-source language-regex">^\/(we)?api</span><span class="token regex-delimiter">/</span></span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token punctuation">{
</span>
<span class="token constant">Y9P</span> <span class="token operator">=</span> <span class="token string">"//"</span> <span class="token operator">+</span> apiHost <span class="token operator">+</span> <span class="token constant">Y9P</span>
<span class="token punctuation">}</span>
e9f<span class="token punctuation">.</span>cookie <span class="token operator">=</span> <span class="token boolean">true</span>
<span class="token punctuation">}</span>
<span class="token function">cwR2x</span><span class="token punctuation">(</span><span class="token constant">Y9P</span><span class="token punctuation">,</span> e9f<span class="token punctuation">)</span>
<span class="token punctuation">}</span>
</code></pre> <p>The process is fairly involved, so it is best to follow along with the video.</p> <p>Stepping through this function line by line, we find:</p> <pre><code class="prism language-js"><span class="token keyword">var</span> bUG7z <span class="token operator">=</span> window<span class="token punctuation">.</span><span class="token function">asrsea</span><span class="token punctuation">(</span><span class="token constant">JSON</span><span class="token punctuation">.</span><span class="token function">stringify</span><span class="token punctuation">(</span>i9b<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">bsB3x</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token string">"流泪"</span><span class="token punctuation">,</span> <span class="token string">"强"</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">bsB3x</span><span class="token punctuation">(</span><span class="token constant">WU8M</span><span class="token punctuation">.</span>md<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">bsB3x</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token string">"爱心"</span><span class="token punctuation">,</span> <span class="token string">"女孩"</span><span class="token punctuation">,</span> <span class="token string">"惊恐"</span><span class="token punctuation">,</span> <span class="token string">"大笑"</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
</code></pre> <p>The encryption starts from this line. Looking closely, the request fields are filled from its result: params => encText, encSecKey => encSecKey. So the next step is to find window.asrsea(). A search shows that its value comes entirely from one statement, window.asrsea = d, and scrolling up from there reveals the definition of d:</p> <pre><code class="prism language-js"> <span class="token keyword">function</span> <span class="token function">d</span><span class="token punctuation">(</span><span class="token parameter">d<span class="token punctuation">,</span> e<span class="token punctuation">,</span> f<span class="token punctuation">,</span> g</span><span class="token punctuation">)</span> <span class="token punctuation">{
</span>
<span class="token keyword">var</span> h <span class="token operator">=</span> <span class="token punctuation">{
</span><span class="token punctuation">}</span>
<span class="token punctuation">,</span> i <span class="token operator">=</span> <span class="token function">a</span><span class="token punctuation">(</span><span class="token number">16</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token keyword">return</span> h<span class="token punctuation">.</span>encText <span class="token operator">=</span> <span class="token function">b</span><span class="token punctuation">(</span>d<span class="token punctuation">,</span> g<span class="token punctuation">)</span><span class="token punctuation">,</span>
h<span class="token punctuation">.</span>encText <span class="token operator">=</span> <span class="token function">b</span><span class="token punctuation">(</span>h<span class="token punctuation">.</span>encText<span class="token punctuation">,</span> i<span class="token punctuation">)</span><span class="token punctuation">,</span>
h<span class="token punctuation">.</span>encSecKey <span class="token operator">=</span> <span class="token function">c</span><span class="token punctuation">(</span>i<span class="token punctuation">,</span> e<span class="token punctuation">,</span> f<span class="token punctuation">)</span><span class="token punctuation">,</span>
h
<span class="token punctuation">}</span>
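</code></pre> <p>Of the four parameters of d(): d is the data; e turns out to be the fixed value 010001 after a few runs in the console; f is a very long, unreadable string; and g is also a fixed value, "0CoJUm6Qyw8W8jud".</p> <p>With these values known, we can work out what d() actually does. The rest of the analysis is not covered in detail here. In short, a() returns a random 16-character string i, b() is applied twice to produce encText (first with g, then with i), and c(i, e, f) produces encSecKey.</p> <p>I scraped the users' nicknames and comments; for the detailed steps, watch the video on Bilibili.</p> <pre><code class="prism language-python"><span class="token keyword">from</span> Crypto<span class="token punctuation">.</span>Cipher <span class="token keyword">import</span> AES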
<span class="token keyword">from</span> base64 <span class="token keyword">import</span> b64encode
<span class="token keyword">import</span> requests
<span class="token keyword">import</span> json
url <span class="token operator">=</span> <span class="token string">"https://music.163.com/weapi/comment/resource/comments/get?csrf_token="</span>
<span class="token comment"># Request method: POST</span>
data <span class="token operator">=</span> <span class="token punctuation">{
</span>
<span class="token string">"csrf_token"</span><span class="token punctuation">:</span> <span class="token string">""</span><span class="token punctuation">,</span>
<span class="token string">"cursor"</span><span class="token punctuation">:</span> <span class="token string">"-1"</span><span class="token punctuation">,</span>
<span class="token string">"offset"</span><span class="token punctuation">:</span> <span class="token string">"0"</span><span class="token punctuation">,</span>
<span class="token string">"orderType"</span><span class="token punctuation">:</span> <span class="token string">"1"</span><span class="token punctuation">,</span>
<span class="token string">"pageNo"</span><span class="token punctuation">:</span> <span class="token string">"1"</span><span class="token punctuation">,</span>
<span class="token string">"pageSize"</span><span class="token punctuation">:</span> <span class="token string">"20"</span><span class="token punctuation">,</span>
<span class="token string">"rid"</span><span class="token punctuation">:</span> <span class="token string">"R_SO_4_65538"</span><span class="token punctuation">,</span>
<span class="token string">"threadId"</span><span class="token punctuation">:</span> <span class="token string">"R_SO_4_65538"</span>
<span class="token punctuation">}</span>
<span class="token comment"># Arguments used by d()</span>
e <span class="token operator">=</span> <span class="token string">"010001"</span>
f <span class="token operator">=</span> <span class="token string">"00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e "</span>
g <span class="token operator">=</span> <span class="token string">"0CoJUm6Qyw8W8jud"</span>
i <span class="token operator">=</span> <span class="token string">"7HCsoSguhIA6SpNw"</span> <span class="token comment"># fixed by hand; random in the original function</span>
encSecKey <span class="token operator">=</span> <span class="token string">"21fb180e564113d59d37865081a91daf1f775fb67ef063dc046bda9966613ea4a384b597e11ce05c442df9dfa8538347c58aa87d9be92636fbda399b28f04bbf31e91751e25f359a05538b8d5c51999a03e1348e21cbe90fbfa54d013399c0ab240e41c73750ef463542fe5c14637db16abeffa8a2ab74027e085aa570c01395 "</span>
<span class="token comment"># Pad the data to a multiple of 16 for the encryption below</span>
<span class="token keyword">def</span> <span class="token function">to_16</span><span class="token punctuation">(</span>data<span class="token punctuation">)</span><span class="token punctuation">:</span>
    pad <span class="token operator">=</span> <span class="token number">16</span> <span class="token operator">-</span> <span class="token builtin">len</span><span class="token punctuation">(</span>data<span class="token punctuation">)</span> <span class="token operator">%</span> <span class="token number">16</span>
    data <span class="token operator">+=</span> <span class="token builtin">chr</span><span class="token punctuation">(</span>pad<span class="token punctuation">)</span> <span class="token operator">*</span> pad
    <span class="token keyword">return</span> data
<span class="token keyword">def</span> <span class="token function">get_encSecKey</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token comment"># i is fixed, so the result of c(), i.e. encSecKey, is fixed as well</span>
    <span class="token keyword">return</span> encSecKey
<span class="token keyword">def</span> <span class="token function">get_params</span><span class="token punctuation">(</span>data<span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token comment"># data is expected to be a string</span>
    first <span class="token operator">=</span> enc_params<span class="token punctuation">(</span>data<span class="token punctuation">,</span> g<span class="token punctuation">)</span>
    second <span class="token operator">=</span> enc_params<span class="token punctuation">(</span>first<span class="token punctuation">,</span> i<span class="token punctuation">)</span>
    <span class="token keyword">return</span> second <span class="token comment"># this is the params field</span>
<span class="token keyword">def</span> <span class="token function">enc_params</span><span class="token punctuation">(</span>data<span class="token punctuation">,</span> key<span class="token punctuation">)</span><span class="token punctuation">:</span> <span class="token comment"># the encryption step</span>
    <span class="token comment"># the AES module comes from a third-party package that must be installed first (Crypto.Cipher, e.g. from pycryptodome)</span>
    iv <span class="token operator">=</span> <span class="token string">"0102030405060708"</span>
    data <span class="token operator">=</span> to_16<span class="token punctuation">(</span>data<span class="token punctuation">)</span>
    aes <span class="token operator">=</span> AES<span class="token punctuation">.</span>new<span class="token punctuation">(</span>key<span class="token operator">=</span>key<span class="token punctuation">.</span>encode<span class="token punctuation">(</span><span class="token string">"utf-8"</span><span class="token punctuation">)</span><span class="token punctuation">,</span> IV<span class="token operator">=</span>iv<span class="token punctuation">.</span>encode<span class="token punctuation">(</span><span class="token string">'utf-8'</span><span class="token punctuation">)</span><span class="token punctuation">,</span> mode<span class="token operator">=</span>AES<span class="token punctuation">.</span>MODE_CBC<span class="token punctuation">)</span> <span class="token comment"># create the cipher</span>
    bs <span class="token operator">=</span> aes<span class="token punctuation">.</span>encrypt<span class="token punctuation">(</span>data<span class="token punctuation">.</span>encode<span class="token punctuation">(</span><span class="token string">'utf-8'</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token comment"># encrypt; the input length must be a multiple of 16</span>
    <span class="token keyword">return</span> <span class="token builtin">str</span><span class="token punctuation">(</span>b64encode<span class="token punctuation">(</span>bs<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string">'utf-8'</span><span class="token punctuation">)</span> <span class="token comment"># return the result as a string</span>
<span class="token comment"># How the site's JavaScript performs the encryption (kept for reference)</span>
<span class="token triple-quoted-string string">"""
function a(a) {  # returns a random 16-character string
    var d, e, b = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", c = "";
    for (d = 0; a > d; d += 1)  # loops 16 times
        e = Math.random() * b.length,  # random number
        e = Math.floor(e),  # round down
        c += b.charAt(e);  # take the character at that index
    return c
}
function b(a, b) {  # a is the content to encrypt
    var c = CryptoJS.enc.Utf8.parse(b)  # b is the key
      , d = CryptoJS.enc.Utf8.parse("0102030405060708")
      , e = CryptoJS.enc.Utf8.parse(a)  # e is the data
      , f = CryptoJS.AES.encrypt(e, c, {  # c is the encryption key
        iv: d,  # initialization vector
        mode: CryptoJS.mode.CBC  # mode: CBC
    });
    return f.toString()
}
function c(a, b, c) {
    var d, e;
    return setMaxDigits(131),
    d = new RSAKeyPair(b,"",c),
    e = encryptedString(d, a)
}
function d(d, e, f, g) {
    var h = {}  # starts out empty
      , i = a(16);  # i is a random 16-character value; we pin it to a fixed value
    return h.encText = b(d, g),  # g is the key
    h.encText = b(h.encText, i),  # this becomes params; i is the key for the second round
    h.encSecKey = c(i, e, f),  # this becomes encSecKey; e and f are constants, so a fixed i gives a fixed key
    h
}
function e(a, b, d, e) {
    var f = {};
    return f.encText = c(a + e, b, d),
    f
}
Two rounds of encryption:
data + g => b => first result + i => b => params
"""</span>
<span class="token comment"># Send the request and print the hot comments</span>
resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>post<span class="token punctuation">(</span>url<span class="token punctuation">,</span> data<span class="token operator">=</span><span class="token punctuation">{</span>
    <span class="token string">"params"</span><span class="token punctuation">:</span> get_params<span class="token punctuation">(</span>json<span class="token punctuation">.</span>dumps<span class="token punctuation">(</span>data<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
    <span class="token string">"encSecKey"</span><span class="token punctuation">:</span> get_encSecKey<span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token punctuation">}</span><span class="token punctuation">)</span>
dic <span class="token operator">=</span> resp<span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span>
hotComments <span class="token operator">=</span> dic<span class="token punctuation">[</span><span class="token string">'data'</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"hotComments"</span><span class="token punctuation">]</span>
<span class="token keyword">for</span> comment <span class="token keyword">in</span> hotComments<span class="token punctuation">:</span>  <span class="token comment"># a separate loop name avoids overwriting the key i</span>
    username <span class="token operator">=</span> comment<span class="token punctuation">[</span><span class="token string">"user"</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"nickname"</span><span class="token punctuation">]</span>
    content <span class="token operator">=</span> comment<span class="token punctuation">[</span><span class="token string">"content"</span><span class="token punctuation">]</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span>username<span class="token punctuation">,</span> <span class="token string">":"</span><span class="token punctuation">,</span> content<span class="token punctuation">)</span>
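</code></pre> <h2>Chapter 4 Async</h2> <h3>4.1 Chapter Overview</h3> <p>So far we can handle the basic scraping workflow, but throughput is still low. To raise it, we can build an asynchronous scraper with multithreading, multiprocessing, or coroutines.</p> <p>What does asynchronous mean here? Suppose we have ten thousand items to scrape: fetching them one at a time takes a long time. Working asynchronously means running several lines of work at once, so that multiple items are being fetched in the same period.</p> <p>In this chapter:</p>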
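<ol>
<li>A quick start with multithreading</li>
<li>A quick start with multiprocessing</li>
<li>Thread pools and process pools</li>
<li>Scraping all of Xinfadi</li>
<li>Coroutines</li>
<li>Running multiple tasks with asynchronous coroutines</li>
<li>The aiohttp module in detail</li>
<li>Scraping an entire novel</li>
<li>Comprehensive exercise: scraping a movie</li>
</ol> <h3>4.2 Multithreading</h3>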
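<p>As a rough sketch of why threads help a scraper, the example below compares fetching items one after another with fetching them in separate threads. <code>fake_download</code> is a hypothetical stand-in that only sleeps (it is not from the video), and the helper names and timings are assumptions chosen for illustration.</p>

```python
import time
from threading import Thread


def fake_download(name, results):
    """Hypothetical stand-in for a blocking network request."""
    time.sleep(0.2)          # the thread is blocked here, like waiting on a server
    results.append(name)     # list.append is safe to call from several threads


def run_sequential(names):
    results = []
    start = time.perf_counter()
    for name in names:
        fake_download(name, results)
    return results, time.perf_counter() - start


def run_threaded(names):
    results = []
    start = time.perf_counter()
    threads = [Thread(target=fake_download, args=(name, results)) for name in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()             # wait for every worker before reading the results
    return results, time.perf_counter() - start


if __name__ == "__main__":
    names = [f"page{i}" for i in range(5)]
    _, slow = run_sequential(names)
    _, fast = run_threaded(names)
    print(f"sequential: {slow:.2f}s, threaded: {fast:.2f}s")
```

<p>Because each worker spends its time waiting rather than computing, the waits overlap and the threaded run should take roughly as long as a single request.</p>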
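<ul>
<li>A process is the unit of resource allocation; every process contains at least one thread</li>
<li>A thread is the unit of execution</li>
</ul> <p>First approach: pass a function as the thread's target</p> <pre><code class="prism language-python"><span class="token keyword">from</span> threading <span class="token keyword">import</span> Thread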
<span class="token keyword">def</span> <span class="token function">func</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"func "</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span>

<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    t <span class="token operator">=</span> Thread<span class="token punctuation">(</span>target<span class="token operator">=</span>func<span class="token punctuation">)</span>  <span class="token comment"># create a thread and give it a task: like hiring a worker and assigning the job in parentheses</span>
    t<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span>  <span class="token comment"># mark the thread as ready to run; exactly when it runs is decided by the CPU scheduler</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"main"</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span>
</code></pre> <p>Second approach: subclass Thread and override run()</p> <pre><code class="prism language-python"><span class="token keyword">from</span> threading <span class="token keyword">import</span> Thread
<span class="token keyword">class</span> <span class="token class-name">MyThread</span><span class="token punctuation">(</span>Thread<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">def</span> <span class="token function">run</span><span class="token punctuation">(</span>self<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
            <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"child thread"</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span>

<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    t <span class="token operator">=</span> MyThread<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token comment"># t.run()  # this is a plain method call, so everything would still run in a single thread</span>
    t<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span>  <span class="token comment"># start the thread</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"main thread"</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span>
</code></pre> <h3>4.3 Multiprocessing</h3> <p>Multiprocessing code is written almost exactly like multithreading code</p> <pre><code class="prism language-python"><span class="token keyword">from</span> multiprocessing <span class="token keyword">import</span> Process
<span class="token keyword">def</span> <span class="token function">fuc</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"child process"</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span>

<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    p <span class="token operator">=</span> Process<span class="token punctuation">(</span>target<span class="token operator">=</span>fuc<span class="token punctuation">)</span>
    p<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"main process"</span><span class="token punctuation">,</span> i<span class="token punctuation">)</span>
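</code></pre> <p>How do we tell two workers apart? Pass each one a name through args. The example below uses Thread; Process accepts the same args parameter.</p> <pre><code class="prism language-python"><span class="token keyword">from</span> threading <span class="token keyword">import</span> Thread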
<span class="token keyword">def</span> <span class="token function">fuc</span><span class="token punctuation">(</span>name<span class="token punctuation">)</span><span class="token punctuation">:</span>  <span class="token comment"># prints the name it was given</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span>name<span class="token punctuation">,</span> i<span class="token punctuation">)</span>

<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    t1 <span class="token operator">=</span> Thread<span class="token punctuation">(</span>target<span class="token operator">=</span>fuc<span class="token punctuation">,</span> args<span class="token operator">=</span><span class="token punctuation">(</span><span class="token string">"周杰伦"</span><span class="token punctuation">,</span><span class="token punctuation">)</span><span class="token punctuation">)</span>  <span class="token comment"># args must be a tuple, hence the trailing comma</span>
    t1<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span>
    t2 <span class="token operator">=</span> Thread<span class="token punctuation">(</span>target<span class="token operator">=</span>fuc<span class="token punctuation">,</span> args<span class="token operator">=</span><span class="token punctuation">(</span><span class="token string">"王力宏"</span><span class="token punctuation">,</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    t2<span class="token punctuation">.</span>start<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre> <h3>4.4 Introduction to Thread Pools and Process Pools</h3> <p>Thread pool: a set of threads is created up front, and we simply submit tasks to the pool. Scheduling the tasks onto threads is handled by the pool.</p> <pre><code class="prism language-python"><span class="token keyword">from</span> concurrent<span class="token punctuation">.</span>futures <span class="token keyword">import</span> ThreadPoolExecutor<span class="token punctuation">,</span> ProcessPoolExecutor
<span class="token comment"># ThreadPoolExecutor manages threads and ProcessPoolExecutor manages processes; use whichever fits</span>
<span class="token keyword">def</span> <span class="token function">fn</span><span class="token punctuation">(</span>name<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span>name<span class="token punctuation">,</span> i<span class="token punctuation">)</span>

<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    <span class="token comment"># Create the thread pool</span>
    <span class="token keyword">with</span> ThreadPoolExecutor<span class="token punctuation">(</span><span class="token number">50</span><span class="token punctuation">)</span> <span class="token keyword">as</span> t<span class="token punctuation">:</span>  <span class="token comment"># a pool of up to 50 threads</span>
        <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">100</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
            t<span class="token punctuation">.</span>submit<span class="token punctuation">(</span>fn<span class="token punctuation">,</span> name<span class="token operator">=</span><span class="token string-interpolation"><span class="token string">f"thread</span><span class="token interpolation"><span class="token punctuation">{</span>i<span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">)</span>
    <span class="token comment"># Leaving the with block waits for every task in the pool to finish before continuing</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"Done"</span><span class="token punctuation">)</span>
</code></pre> <h3>4.5 Thread Pool Case Study: Scraping Xinfadi Vegetable Prices</h3>
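<p>The case in this section has many pool threads writing to one shared CSV writer. One possible alternative, sketched here under assumptions, is to let each worker return its rows and have only the main thread write them. <code>fetch_page</code> and <code>scrape_pages</code> are hypothetical names, and <code>fetch_page</code> fabricates placeholder rows instead of calling the real site.</p>

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def fetch_page(page):
    """Hypothetical stand-in for one page request; returns rows instead of writing them."""
    return [[f"item{page}-{n}", page, n] for n in range(3)]


def scrape_pages(pages, out, workers=8):
    """Fetch pages concurrently, then write every row from the calling thread."""
    writer = csv.writer(out)
    count = 0
    with ThreadPoolExecutor(workers) as pool:
        # map() runs fetch_page concurrently and yields results in input order
        for rows in pool.map(fetch_page, pages):
            writer.writerows(rows)   # only this thread touches the writer
            count += len(rows)
    return count


if __name__ == "__main__":
    buffer = io.StringIO()           # an open file object works the same way
    total = scrape_pages(range(1, 6), buffer)
    print(total, "rows written")
```

<p>Keeping all writes in one thread means no lock is needed, and the rows land in page order even though the requests overlap.</p>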
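<ol>
<li>Work out how to extract the data from a single page</li>
<li>Add a thread pool and fetch many pages at the same time</li>
</ol> <p>The site has since been updated: the data is no longer in the page source but is loaded as JSON, so the code here differs from the video.</p> <pre><code class="prism language-python"><span class="token keyword">import</span> requests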
<span class="token keyword">import</span> csv
<span class="token keyword">from</span> concurrent<span class="token punctuation">.</span>futures <span class="token keyword">import</span> ThreadPoolExecutor

f <span class="token operator">=</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"data.csv"</span><span class="token punctuation">,</span> mode<span class="token operator">=</span><span class="token string">"w"</span><span class="token punctuation">,</span> encoding<span class="token operator">=</span><span class="token string">"utf-8"</span><span class="token punctuation">,</span> newline<span class="token operator">=</span><span class="token string">""</span><span class="token punctuation">)</span>
csvwriter <span class="token operator">=</span> csv<span class="token punctuation">.</span>writer<span class="token punctuation">(</span>f<span class="token punctuation">)</span>

<span class="token keyword">def</span> <span class="token function">download_one_page</span><span class="token punctuation">(</span>page<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># Request one page of price data (the site returns JSON)</span>
    url <span class="token operator">=</span> <span class="token string">"http://www.xinfadi.com.cn/getPriceData.html"</span>
    data <span class="token operator">=</span> <span class="token punctuation">{</span>
        <span class="token string">"limit"</span><span class="token punctuation">:</span> <span class="token string">"20"</span><span class="token punctuation">,</span>
        <span class="token string">"current"</span><span class="token punctuation">:</span> <span class="token string-interpolation"><span class="token string">f"</span><span class="token interpolation"><span class="token punctuation">{</span>page<span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">,</span>  <span class="token comment"># page number</span>
    <span class="token punctuation">}</span>
    resp <span class="token operator">=</span> requests<span class="token punctuation">.</span>post<span class="token punctuation">(</span>url<span class="token punctuation">,</span> data<span class="token operator">=</span>data<span class="token punctuation">)</span>
    <span class="token keyword">for</span> txt <span class="token keyword">in</span> resp<span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token string">"list"</span><span class="token punctuation">]</span><span class="token punctuation">:</span>
        <span class="token comment"># Keep only the fields we need</span>
        row <span class="token operator">=</span> <span class="token punctuation">[</span>txt<span class="token punctuation">[</span><span class="token string">"prodName"</span><span class="token punctuation">]</span><span class="token punctuation">,</span> txt<span class="token punctuation">[</span><span class="token string">"prodCat"</span><span class="token punctuation">]</span><span class="token punctuation">,</span> txt<span class="token punctuation">[</span><span class="token string">"lowPrice"</span><span class="token punctuation">]</span><span class="token punctuation">,</span> txt<span class="token punctuation">[</span><span class="token string">"highPrice"</span><span class="token punctuation">]</span><span class="token punctuation">,</span> txt<span class="token punctuation">[</span><span class="token string">"place"</span><span class="token punctuation">]</span><span class="token punctuation">,</span> txt<span class="token punctuation">[</span><span class="token string">"pubDate"</span><span class="token punctuation">]</span><span class="token punctuation">]</span>
        <span class="token comment"># Write the row to the file</span>
        csvwriter<span class="token punctuation">.</span>writerow<span class="token punctuation">(</span>row<span class="token punctuation">)</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string-interpolation"><span class="token string">f"Page </span><span class="token interpolation"><span class="token punctuation">{</span>page<span class="token punctuation">}</span></span><span class="token string"> downloaded"</span></span><span class="token punctuation">)</span>

<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    <span class="token comment"># for i in range(1, 17712):  # one page at a time: extremely slow</span>
    <span class="token comment">#     download_one_page(i)</span>
    <span class="token comment"># Create the thread pool</span>
    <span class="token keyword">with</span> ThreadPoolExecutor<span class="token punctuation">(</span><span class="token number">50</span><span class="token punctuation">)</span> <span class="token keyword">as</span> t<span class="token punctuation">:</span>
        <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">,</span> <span class="token number">201</span><span class="token punctuation">)</span><span class="token punctuation">:</span>  <span class="token comment"># first 200 pages, numbered from 1 like the loop above</span>
            <span class="token comment"># Submit the download task to the pool</span>
            t<span class="token punctuation">.</span>submit<span class="token punctuation">(</span>download_one_page<span class="token punctuation">,</span> i<span class="token punctuation">)</span>
    f<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span>  <span class="token comment"># all tasks are done once the with block exits</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"All pages downloaded"</span><span class="token punctuation">)</span>
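</code></pre> <h3>4.6 Coroutines</h3> <h4>4.6.1 The Concept of Coroutines</h4> <p>When code calls time.sleep(), the current thread is blocked and the CPU is not doing any work for it.</p> <p>Likewise, a program is blocked while it waits on input().</p> <p>With requests.get(url), the program is also blocked until the network response comes back.</p> <p>In general, whenever a program is performing I/O, its thread is blocked.</p> <p><strong>Coroutines</strong>: when the program hits an I/O operation, it can choose to switch to another task instead of waiting. At the micro level, tasks are switched one at a time, and the trigger is usually an I/O operation; at the macro level, what we see is multiple tasks running together.</p> <h4>4.6.2 Running Multiple Tasks Asynchronously</h4> <pre><code class="prism language-python"><span class="token keyword">import</span> asyncio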
<span class="token keyword">import</span> time
<span class="token comment"># async def func():</span>
<span class="token comment">#     print("Hello, I am Seria")</span>
<span class="token comment">#</span>
<span class="token comment">#</span>
<span class="token comment"># if __name__ == '__main__':</span>
<span class="token comment">#     g = func()  # func is a coroutine function, so calling it returns a coroutine object</span>
<span class="token comment">#     asyncio.run(g)  # running a coroutine requires the asyncio module</span>
<span class="token comment"># async def func1():</span>
<span class="token comment">#     print("Hello, I am func1")</span>
<span class="token comment">#     # time.sleep(3)  # a synchronous call like this blocks everything, so the async behavior is lost</span>
<span class="token comment">#     await asyncio.sleep(3)  # asynchronous wait: other tasks run during this time</span>
<span class="token comment">#     print("Hello, I am func1")</span>
<span class="token comment">#</span>
<span class="token comment">#</span>
<span class="token comment"># async def func2():</span>
<span class="token comment">#     print("Hello, I am func2")</span>
<span class="token comment">#     # time.sleep(4)</span>
<span class="token comment">#     await asyncio.sleep(4)</span>
<span class="token comment">#     print("Hello, I am func2")</span>
<span class="token comment">#</span>
<span class="token comment">#</span>
<span class="token comment"># async def func3():</span>
<span class="token comment">#     print("Hello, I am func3")</span>
<span class="token comment">#     # time.sleep(2)</span>
<span class="token comment">#     await asyncio.sleep(2)</span>
<span class="token comment">#     print("Hello, I am func3")</span>
<span class="token comment">#</span>
<span class="token comment"># if __name__ == '__main__':</span>
<span class="token comment">#     f1 = func1()</span>
<span class="token comment">#     f2 = func2()</span>
<span class="token comment">#     f3 = func3()</span>
<span class="token comment">#     tasks = [</span>
<span class="token comment">#         f1, f2, f3</span>
<span class="token comment">#     ]</span>
<span class="token comment">#     t1 = time.time()</span>
<span class="token comment">#     # Start several tasks (coroutines) at once; bare coroutines in asyncio.wait() fail on Python 3.11+</span>
<span class="token comment">#     asyncio.run(asyncio.wait(tasks))</span>
<span class="token comment">#     t2 = time.time()</span>
<span class="token comment">#     print(t2-t1)</span>
<span class="token comment"># The form above is not recommended; prefer the form below, which carries over directly to scrapers</span>
<span class="token comment"># async def func1():</span>
<span class="token comment">#     print("Hello, I am func1")</span>
<span class="token comment">#     await asyncio.sleep(3)</span>
<span class="token comment">#     print("Hello, I am func1")</span>
<span class="token comment">#</span>
<span class="token comment">#</span>
<span class="token comment"># async def func2():</span>
<span class="token comment">#     print("Hello, I am func2")</span>
<span class="token comment">#     await asyncio.sleep(4)</span>
<span class="token comment">#     print("Hello, I am func2")</span>
<span class="token comment">#</span>
<span class="token comment">#</span>
<span class="token comment"># async def func3():</span>
<span class="token comment">#     print("Hello, I am func3")</span>
<span class="token comment">#     await asyncio.sleep(2)</span>
<span class="token comment">#     print("Hello, I am func3")</span>
<span class="token comment">#</span>
<span class="token comment">#</span>
<span class="token comment"># async def main():</span>
<span class="token comment">#     # Option 1</span>
<span class="token comment">#     # f1 = func1()</span>
<span class="token comment">#     # await f1  # await goes in front of a coroutine object and suspends until it finishes</span>
<span class="token comment">#     # Option 2 (recommended)</span>
<span class="token comment">#     tasks = [</span>
<span class="token comment">#         asyncio.create_task(func1()),  # from Python 3.8 on, wrap each coroutine with asyncio.create_task()</span>
<span class="token comment">#         asyncio.create_task(func2()),</span>
<span class="token comment">#         asyncio.create_task(func3())</span>
<span class="token comment">#     ]</span>
<span class="token comment">#     await asyncio.wait(tasks)</span>
<span class="token comment">#</span>
<span class="token comment">#</span>
<span class="token comment"># if __name__ == '__main__':</span>
<span class="token comment">#     asyncio.run(main())</span>
<span class="token comment"># Applying the pattern to a scraper</span>
<span class="token keyword">async</span> <span class="token keyword">def</span> <span class="token function">download</span><span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"Starting download"</span><span class="token punctuation">,</span> url<span class="token punctuation">)</span>
    <span class="token keyword">await</span> asyncio<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span>  <span class="token comment"># simulates the network request</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"Download finished"</span><span class="token punctuation">,</span> url<span class="token punctuation">)</span>

<span class="token keyword">async</span> <span class="token keyword">def</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    urls <span class="token operator">=</span> <span class="token punctuation">[</span>
        <span class="token string">"http://www.baidu.com"</span><span class="token punctuation">,</span>
        <span class="token string">"http://www.bilibili.com"</span><span class="token punctuation">,</span>
        <span class="token string">"http://www.163.com"</span>
    <span class="token punctuation">]</span>
    tasks <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
    <span class="token keyword">for</span> url <span class="token keyword">in</span> urls<span class="token punctuation">:</span>
        d <span class="token operator">=</span> asyncio<span class="token punctuation">.</span>create_task<span class="token punctuation">(</span>download<span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">)</span>  <span class="token comment"># wrap the coroutine in a Task before passing it to asyncio.wait()</span>
        tasks<span class="token punctuation">.</span>append<span class="token punctuation">(</span>d<span class="token punctuation">)</span>
    <span class="token keyword">await</span> asyncio<span class="token punctuation">.</span>wait<span class="token punctuation">(</span>tasks<span class="token punctuation">)</span>

<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    asyncio<span class="token punctuation">.</span>run<span class="token punctuation">(</span>main<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
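</code></pre> <h4>4.6.3 Deprecation Warning for Async Coroutines</h4> <p>Since Python 3.8, passing bare coroutine objects to asyncio.wait() is deprecated; each coroutine should be wrapped with asyncio.create_task(), with the coroutine call inside the parentheses. Support for bare coroutines was removed in Python 3.11, where the call raises an error.</p> <h3>4.7 Asynchronous HTTP Requests with the aiohttp Module</h3> <p>First install the module: pip install aiohttp</p> <p>requests.get() is synchronous code; aiohttp provides the asynchronous equivalent.</p> <pre><code class="prism language-python"><span class="token keyword">import</span> aiohttp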
<span class="token keyword">import</span> asyncio
urls <span class="token operator">=</span> <span class="token punctuation">{
</span>
<span class="token string">"https://img-pre.ivsky.com/img/tupian/pre/202101/31/weiershi_kejiquan.jpg"</span><span class="token punctuation">,</span>
<span class="token string">"https://img-pre.ivsky.com/img/tupian/pre/202101/31/weiershi_kejiquan-001.jpg"</span><span class="token punctuation">,</span>
<span class="token string">"https://img-pre.ivsky.com/img/tupian/pre/202101/31/weiershi_kejiquan-003.jpg"</span>
<span class="token punctuation">}</span>
<span class="token keyword">async</span> <span class="token keyword">def</span> <span class="token function">aiodownload</span><span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">:</span>
    name = url.rsplit("/", 1)[1]  # file name = last path segment of the URL
    # aiohttp.ClientSession() plays the role of requests.session();
    # session.get() / session.post() correspond to requests.get() / requests.post()
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            # Response received: write the body to a file
            with open("Wallpaper/" + name, mode="wb") as f:
                f.write(await resp.content.read())  # reading the body is asynchronous, so it must be awaited
    print(name, "done!")


async def main():
    tasks = []
    for url in urls:
        tasks.append(asyncio.create_task(aiodownload(url)))
    await asyncio.wait(tasks)


if __name__ == '__main__':
    # On the author's machine asyncio.run(main()) raised "RuntimeError: Event loop is closed";
    # driving the event loop manually as below avoided the error.
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
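
# --- Supplementary sketch (not from the original video) ---
# A minimal, offline illustration of the same fan-out pattern, written with
# asyncio.gather instead of asyncio.wait. fake_download and sample_urls are
# hypothetical stand-ins: fake_download only sleeps instead of calling aiohttp,
# so the snippet runs without network access.
import asyncio


async def fake_download(url):
    name = url.rsplit("/", 1)[1]  # same file-name rule as aiodownload above
    await asyncio.sleep(0.01)     # stands in for the awaited network I/O
    return name


async def download_all(url_list):
    # gather runs the coroutines concurrently and returns their results
    # in the same order as the inputs
    return await asyncio.gather(*(fake_download(u) for u in url_list))


sample_urls = ["http://example.com/img/a.jpg", "http://example.com/img/b.jpg"]
names = asyncio.run(download_all(sample_urls))
print(names)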
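</code></pre> <h3>4.8 Async crawler in practice: downloading an entire novel</h3>
<ol>
<li>Synchronous step: request getCatalog to obtain the cid and title of every chapter</li>
<li>Asynchronous step: request getChapterContent concurrently to download the text of every chapter</li>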
</ol> <pre><code class="prism language-python"># Chapter catalog (chapter ids and titles):
# http://dushu.baidu.com/api/pc/getCatalog?data={"book_id":"4306063500"}
# Chapter content:
# http://dushu.baidu.com/api/pc/getChapterContent?data={"book_id":"4306063500","cid":"4306063500|11348571","need_bookinfo":1}
import requests
import asyncio
import aiohttp
import aiofiles
import json


async def aiodownload(cid, b_id, title):
    data = {
        "book_id": b_id,
        "cid": f"{b_id}|{cid}",
        "need_bookinfo": 1
    }
    data = json.dumps(data)
    url = f"http://dushu.baidu.com/api/pc/getChapterContent?data={data}"
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            dic = await resp.json()
            # The output directory "西游记/" must already exist
            async with aiofiles.open("西游记/" + title + ".txt", mode="w", encoding="utf-8") as f:
                await f.write(dic["data"]["novel"]["content"])
    print(title, "success")


async def getCatalog(url):
    resp = requests.get(url)  # synchronous request for the catalog
    dic = resp.json()
    tasks = []
    for item in dic["data"]["novel"]["items"]:  # each item holds one chapter's title and cid
        title = item["title"]
        cid = item["cid"]
        # Schedule one asynchronous download task per chapter
        tasks.append(asyncio.create_task(aiodownload(cid, b_id, title)))
    await asyncio.wait(tasks)
    print("All Done")


if __name__ == '__main__':
    b_id = "4306063500"
    url = 'http://dushu.baidu.com/api/pc/getCatalog?data={"book_id":"' + b_id + '"}'
    asyncio.run(getCatalog(url))
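
# --- Supplementary sketch (not from the original video) ---
# The API takes a JSON document in the "data" query parameter. The code above
# pastes the raw JSON into the URL and leaves any escaping to the HTTP client.
# Assuming explicit escaping is preferred, the URL could be built as below;
# build_api_url is a hypothetical helper name.
import json
from urllib.parse import quote


def build_api_url(endpoint, payload):
    # Serialize to compact JSON, then percent-encode every reserved character
    data = json.dumps(payload, separators=(",", ":"))
    return f"http://dushu.baidu.com/api/pc/{endpoint}?data={quote(data, safe='')}"


catalog_url = build_api_url("getCatalog", {"book_id": "4306063500"})
print(catalog_url)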
</code></pre> <h3>4.9 Scraping video</h3> <h4>4.9.1 Comprehensive exercise: how video sites work</h4> <p>When we write a web page ourselves, a video file is normally embedded with a video tag. If a video site served its videos that way, however, every playback would amount to downloading the whole file, which would be very slow.</p> <p><strong>So how do typical video sites do it</strong>?</p> <p>User upload -&gt; transcoding (the video is processed into several qualities: 2K, 1080p, standard definition) -&gt; segmentation (the single file is split into many small files, so that when the user drags the progress bar only the corresponding segments need to be loaded)</p> <p>Because the video is cut into a great many small segments, a file is needed to record two things: 1. the playback order of the segments, and 2. where each segment is stored. This is usually an M3U file. An M3U file whose content is encoded in UTF-8 is an M3U8 file, and almost all major video platforms today use M3U8.</p> <p>Reading an M3U8 file:</p> <pre><code class="prism language-python">#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:13    maximum duration of a segment, in seconds
#EXT-X-MEDIA-SEQUENCE:0
#EXT-X-KEY:METHOD=AES-128,URI="key.key"    encryption method of the segments and the address of the key; encrypted segments must be decrypted before they can be played
#EXTINF:12.600000,    duration of the next segment, in seconds
cFN8o3436000.ts    lines that do not start with '#' are the addresses of the ts files
#EXTINF:10.000000,
cFN8o3436001.ts
#EXTINF:10.000000,
cFN8o3436002.ts
#EXTINF:10.000000,
cFN8o3436003.ts
#EXTINF:10.000000,
cFN8o3436004.ts
#EXTINF:10.000000,
cFN8o3436005.ts
#EXTINF:6.880000,
cFN8o3436006.ts
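</code></pre> <p>The workflow for scraping a video is therefore:</p>
<ol>
<li>Locate the M3U8 file (by whatever means work)</li>
<li>Download the ts files listed in the M3U8</li>
<li>Merge the ts files into a single mp4 file, by any means (not necessarily programmatic)</li>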
</ol> <h4>4.9.2 Scraping 云播TV: simple version</h4> <p>The site used in the video is no longer available, so 云播TV is used instead.</p> <p>url: https://www.yunbtv.com/</p> <pre><code class="prism language-python">import requests
import re
url = "https://video.buycar5.cn/20200813/uNqvsBhl/2000kb/hls/index.m3u8"
key_uri = "https://ts1.yuyuangewh.com:9999/20200813/uNqvsBhl/2000kb/hls/key.key"
# 1. Fetch the m3u8 file first; inspecting its content shows that the segments are encrypted
m3u8_text = requests.get(url, verify=False)
# 2. Download the m3u8 file and save it as index.m3u8
# with open("download_video/" + "index.m3u8", mode="wb") as f:
#     f.write(m3u8_text.content)
# m3u8_text.close()
# print("m3u8 success")
# 3. Download key.key and save it as key.m3u8
# key_text = requests.get(key_uri)
# with open("download_video/" + "key.m3u8", mode="wb") as f:
#     f.write(key_text.content)
# key_text.close()
# print("key success")
# 4. Parse the m3u8 file
n = 1
with open("download_video/index.m3u8", mode='r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()  # strip surrounding whitespace and the newline
        if not line or line.startswith("#"):  # skip blank lines and lines starting with '#'
            continue
        # Download the video segment
        resp2 = requests.get(line, verify=False)
        # Use a separate name so the playlist handle f is not shadowed
        with open("download_video/" + f"{n}.ts", mode="wb") as ts_file:
            ts_file.write(resp2.content)
        resp2.close()
        print(f"Segment {n} done")
        n += 1
print("All Done")
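
# --- Supplementary sketch (not from the original video) ---
# Steps 2 and 3 of the workflow without network access: collect the segment
# addresses from M3U8 text, then merge downloaded .ts files by concatenating
# them in playlist order. parse_m3u8 and merge_ts are hypothetical helper names.
# Plain concatenation assumes the segments are unencrypted (or already decrypted)
# and yields one continuous .ts stream, not a true mp4 container.
from urllib.parse import urljoin


def parse_m3u8(text, base_url):
    segments = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and tag lines
            continue
        segments.append(urljoin(base_url, line))  # resolve relative addresses
    return segments


def merge_ts(ts_paths, out_path):
    with open(out_path, "wb") as out:
        for path in ts_paths:
            with open(path, "rb") as part:
                out.write(part.read())


sample = "#EXTM3U\n#EXTINF:10.0,\nseg0.ts\n#EXTINF:10.0,\nhttps://cdn.example.com/seg1.ts\n"
segment_urls = parse_m3u8(sample, "https://example.com/hls/index.m3u8")
print(segment_urls)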
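</code></pre> <p>This version is based on material I found online; it differs from the video and has not been optimized yet.</p> <h2>Chapter 5 selenium</h2> <h3>5.1 Introducing selenium</h3> <p>selenium is an automated testing tool. It can open a browser and operate it the way a person would, and the programmer can extract all kinds of information from the page directly through selenium.</p> <p>Environment setup:</p>
<ul>
<li>pip install selenium</li>
<li>Download the browser driver from http://npm.taobao.org/mirrors/chromedriver</li>
<li>Download and unzip the driver that matches your browser version, then put chromedriver in the folder that contains the Python interpreter</li>
<li>Have selenium launch Chrome</li>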
</ul> <pre><code class="prism language-python">from selenium.webdriver import Chrome
# 1. Create the browser object
web = Chrome()
# 2. Open a URL
web.get("http://www.baidu.com")
</code></pre> <h3>5.2 selenium operations: scraping Lagou</h3> <p>This section uses selenium to scrape job listings from the Lagou recruitment site.</p> <pre><code class="prism language-python">from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
web = Chrome()
web.get("http://lagou.com")
# find_element(By.XPATH, ...) is used because the find_element_by_* helpers were removed in Selenium 4.3
# Locate an element and click it
el = web.find_element(By.XPATH, '//*[@id="changeCityBox"]/p[1]/a')
el.click()  # click event
time.sleep(1)  # give the browser a moment
# Locate the search box, type "python", then press Enter (or click the search button)
web.find_element(By.XPATH, '//*[@id="search_input"]').send_keys("python", Keys.ENTER)
time.sleep(1)
# Locate where the data lives and extract it
# Find every li on the page that holds a job entry
li_list = web.find_elements(By.XPATH, '//*[@id="s_position_list"]/ul/li')
for li in li_list:
    job_name = li.find_element(By.TAG_NAME, "h3").text
    job_price = li.find_element(By.XPATH, "./div/div/div[2]/div/span").text
    job_company = li.find_element(By.XPATH, "./div/div[2]/div/a").text
    print(job_name, job_company, job_price)
</code></pre> <h3>5.3 Operations: switching between windows</h3> <pre><code class="prism language-python">from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
web = Chrome()
web.get("http://lagou.com")
# find_element(By.XPATH, ...) is used because the find_element_by_* helpers were removed in Selenium 4.3
el = web.find_element(By.XPATH, '//*[@id="changeCityBox"]/p[1]/a')
el.click()
time.sleep(1)
web.find_element(By.XPATH, '//*[@id="search_input"]').send_keys("python", Keys.ENTER)
time.sleep(1)
web.find_element(By.XPATH, '//*[@id="s_position_list"]/ul/li[1]/div[1]/div[1]/div[1]/a/h3').click()
# selenium does not switch to a newly opened window on its own
web.switch_to.window(web.window_handles[-1])
# Extract content from the new window
job_detail = web.find_element(By.XPATH, '//*[@id="job_detail"]/dd[2]/div').text
print(job_detail)
# Close the child window
web.close()
# Move selenium's focus back to the original window
web.switch_to.window(web.window_handles[0])
print(web.find_element(By.XPATH, '//*[@id="s_position_list"]/ul/li[1]/div[1]/div[1]/div[1]/a/h3').text)
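
# --- Supplementary sketch (not from the original video) ---
# The switch / scrape / close / switch-back sequence can be wrapped in a context
# manager so the driver always returns to the window it started from.
# newest_window is a hypothetical helper name; it relies only on
# driver.current_window_handle, driver.window_handles, driver.switch_to.window()
# and driver.close().
from contextlib import contextmanager


@contextmanager
def newest_window(driver):
    original = driver.current_window_handle
    driver.switch_to.window(driver.window_handles[-1])  # focus the newest window
    try:
        yield driver
    finally:
        driver.close()                     # close the child window
        driver.switch_to.window(original)  # restore the original focus


# Usage, assuming `web` is the Chrome instance created earlier in this block:
# with newest_window(web):
#     print(web.title)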
</code></pre> <h3>5.4 selenium operations: headless browser</h3> <p>When scraping a page, you may want the browser to run silently in the background.</p> <pre><code class="prism language-python">from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select
import time
# Headless browser: prepare the options
opt = Options()
opt.add_argument("--headless")
opt.add_argument("--disable-gpu")
web = Chrome(options=opt)  # apply the options to the browser
web.get("http://www.endata.com.cn/BoxOffice/BO/Year/index.html")
# Locate the drop-down list
# find_element(By.XPATH, ...) is used because the find_element_by_* helpers were removed in Selenium 4.3
sel_el = web.find_element(By.XPATH, '//*[@id="OptionDate"]')
# Wrap the element as a drop-down menu
sel = Select(sel_el)
# Have the browser step through the options
for i in range(len(sel.options)):  # i is the index of each option in the drop-down
    sel.select_by_index(i)  # switch by index
    time.sleep(2)
    table = web.find_element(By.XPATH, '//*[@id="TableList"]/table')
    print(table.text)  # print all of the table's text
    print("=============================================")
print("All Done")
web.close()
# To get the page source as shown in the Elements panel (the HTML after data loading and JS execution):
# print(web.page_source)
</code></pre> <h3>5.5 selenium operations: handling CAPTCHAs with 超级鹰</h3>
<ol>
<li>Image recognition</li>
<li>Use a mature CAPTCHA-solving service available online</li>
</ol> <p>超级鹰 (Chaojiying) is an online CAPTCHA-recognition service. You need to register and buy credits yourself. The developer documentation on the official site provides sample code for each language, and running that sample is enough to use the service.</p> <h3>5.6 selenium: using 超级鹰 to log in to 超级鹰</h3> <p>This section uses 超级鹰 to log in to the 超级鹰 website automatically; it is mainly an exercise in calling the 超级鹰 client.</p> <pre><code class="prism language-python">from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from chaojiying import Chaojiying_Client
import time
web = Chrome()
web.get("http://www.chaojiying.com/user/login/")
# Handle the CAPTCHA
# find_element(By.XPATH, ...) is used because the find_element_by_* helpers were removed in Selenium 4.3
img = web.find_element(By.XPATH, "/html/body/div[3]/div/div[3]/div[1]/form/div/img").screenshot_as_png
chaojiying = Chaojiying_Client('Chaojiying username', 'Chaojiying password', 'ID')
verity_code = chaojiying.PostPic(img, 1902)['pic_str']
# Fill in the username, password and CAPTCHA on the page
web.find_element(By.XPATH, '/html/body/div[3]/div/div[3]/div[1]/form/p[1]/input').send_keys("Chaojiying username")
web.find_element(By.XPATH, '/html/body/div[3]/div/div[3]/div[1]/form/p[2]/input').send_keys("Chaojiying password")
web.find_element(By.XPATH, '/html/body/div[3]/div/div[3]/div[1]/form/p[3]/input').send_keys(verity_code)
time.sleep(3)
# Click the login button
web.find_element(By.XPATH, '/html/body/div[3]/div/div[3]/div[1]/form/p[4]/input').click()
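</code></pre> <h3>5.7 Selenium: solving the 12306 login problem</h3> <p>The 12306 login page no longer uses an image CAPTCHA, so this part differs from the video.</p> <p>12306 can detect whether your browser is being controlled by automated-testing software, so without a workaround the slider verification cannot be passed. The detection works like this: type <strong>window.navigator.webdriver</strong> into the browser console. In the Chrome instance driven by our test it returns true, while in a normal browser it returns false, and 12306 uses that value to decide whether you are running an automated test.</p> <p>Ways to avoid detection:</p>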
<ul>
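<li> <p>Chrome version below 88: right after the browser starts (before any page content has loaded), inject JavaScript into the page to remove the webdriver flag. In other words, inject it before the web.get() call.</p> </li>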
<li> <pre><code class="prism language-python">web<span class="token punctuation">.</span>execute_cdp_cmd<span class="token punctuation">(</span><span class="token string">"Page.addScriptToEvaluateOnNewDocument"</span><span class="token punctuation">,</span> <span class="token punctuation">{</span>
    <span class="token string">"source"</span><span class="token punctuation">:</span> <span class="token triple-quoted-string string">"""
        navigator.webdriver = undefined
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        })
    """</span>
<span class="token punctuation">}</span><span class="token punctuation">)</span>
</code></pre> </li>
<li> <p>Chrome version above 88: import the Options class and add an argument to the browser options.</p> </li>
<li> <pre><code class="prism language-python"><span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>options <span class="token keyword">import</span> Options
option <span class="token operator">=</span> Options<span class="token punctuation">(</span><span class="token punctuation">)</span>
option<span class="token punctuation">.</span>add_argument<span class="token punctuation">(</span><span class="token string">'--disable-blink-features=AutomationControlled'</span><span class="token punctuation">)</span>
web <span class="token operator">=</span> Chrome<span class="token punctuation">(</span>options<span class="token operator">=</span>option<span class="token punctuation">)</span>
</code></pre> </li>
</ul> <p>Here is my code:</p> <pre><code class="prism language-python"><span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver <span class="token keyword">import</span> Chrome
<span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver<span class="token punctuation">.</span>common<span class="token punctuation">.</span>action_chains <span class="token keyword">import</span> ActionChains
<span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver<span class="token punctuation">.</span>common<span class="token punctuation">.</span>by <span class="token keyword">import</span> By
<span class="token keyword">from</span> selenium<span class="token punctuation">.</span>webdriver<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>options <span class="token keyword">import</span> Options
<span class="token keyword">import</span> time
option <span class="token operator">=</span> Options<span class="token punctuation">(</span><span class="token punctuation">)</span>
option<span class="token punctuation">.</span>add_argument<span class="token punctuation">(</span><span class="token string">'--disable-blink-features=AutomationControlled'</span><span class="token punctuation">)</span>
web <span class="token operator">=</span> Chrome<span class="token punctuation">(</span>options<span class="token operator">=</span>option<span class="token punctuation">)</span>
web<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"https://kyfw.12306.cn/otn/resources/login.html"</span><span class="token punctuation">)</span>
time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span> <span class="token comment"># Wait for the page to respond</span>
<span class="token comment"># Switch to account login</span>
web<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>By<span class="token punctuation">.</span>XPATH<span class="token punctuation">,</span> <span class="token string">'//*[@id="toolbar_Div"]/div[2]/div[2]/ul/li[2]/a'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span>
<span class="token comment"># Fill in the account name and password</span>
web<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>By<span class="token punctuation">.</span>XPATH<span class="token punctuation">,</span> <span class="token string">'//*[@id="J-userName"]'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>send_keys<span class="token punctuation">(</span><span class="token string">"123456789"</span><span class="token punctuation">)</span>
web<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>By<span class="token punctuation">.</span>XPATH<span class="token punctuation">,</span> <span class="token string">'//*[@id="J-password"]'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>send_keys<span class="token punctuation">(</span><span class="token string">"123456789"</span><span class="token punctuation">)</span>
<span class="token comment"># Click the login button</span>
web<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>By<span class="token punctuation">.</span>XPATH<span class="token punctuation">,</span> <span class="token string">'//*[@id="J-login"]'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span>
<span class="token comment"># Slider verification: drag the handle with an action chain</span>
span_element <span class="token operator">=</span> web<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>By<span class="token punctuation">.</span>XPATH<span class="token punctuation">,</span> <span class="token string">'//*[@id="nc_1_n1z"]'</span><span class="token punctuation">)</span>
ActionChains<span class="token punctuation">(</span>web<span class="token punctuation">)</span><span class="token punctuation">.</span>drag_and_drop_by_offset<span class="token punctuation">(</span>span_element<span class="token punctuation">,</span> <span class="token number">320</span><span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">.</span>perform<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre></li>
</ul>
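<p>The navigator.webdriver check from section 5.7 can also be run from Python instead of the browser console. The following is only a minimal sketch and not part of the original notes: the names is_automation_flagged and FakeDriver are hypothetical, and it assumes the driver object offers Selenium's execute_script method. FakeDriver is a stand-in so the helper can be tried without launching a browser.</p> <pre><code class="prism language-python">def is_automation_flagged(driver):
    """Return True if the current page sees navigator.webdriver as true."""
    # execute_script runs the JavaScript in the page and hands the value back to Python.
    return driver.execute_script("return navigator.webdriver") is True


class FakeDriver:
    """Hypothetical stand-in for a WebDriver, used only for this demonstration."""

    def __init__(self, value):
        self.value = value

    def execute_script(self, script):
        # A real driver would evaluate the script; this one returns a fixed value.
        return self.value


if __name__ == "__main__":
    print(is_automation_flagged(FakeDriver(True)))   # True
    print(is_automation_flagged(FakeDriver(None)))   # False
</code></pre> <p>With a real browser, the same helper would be called as is_automation_flagged(web) after web.get(...). A False result suggests the flag is hidden for that page, although a site may well rely on other signals too.</p>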
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="container">
<div id="paradigm-article-related">
<div class="recommend-post mb30">
<ul class="widget-links">
<li><a href="/article/1719.htm"
title="【Hadoop十五】Hadoop Counter" target="_blank">【Hadoop十五】Hadoop Counter</a>
<span class="text-muted">bit1129</span>
<a class="tag" taget="_blank" href="/search/hadoop/1.htm">hadoop</a>
<div>
1. 只有Map任务的Map Reduce Job
File System Counters
FILE: Number of bytes read=3629530
FILE: Number of bytes written=98312
FILE: Number of read operations=0
FILE: Number of lar</div>
</li>
<li><a href="/article/1846.htm"
title="解决Tomcat数据连接池无法释放" target="_blank">解决Tomcat数据连接池无法释放</a>
<span class="text-muted">ronin47</span>
<a class="tag" taget="_blank" href="/search/tomcat+%E8%BF%9E%E6%8E%A5%E6%B1%A0%E3%80%80%E4%BC%98%E5%8C%96/1.htm">tomcat 连接池 优化</a>
<div>
近段时间,公司的检测中心报表系统(SMC)的开发人员时不时找到我,说用户老是出现无法登录的情况。前些日子因为手头上 有Jboss集群的测试工作,发现用户不能登录时,都是在Tomcat中将这个项目Reload一下就好了,不过只是治标而已,因为大概几个小时之后又会 再次出现无法登录的情况。
今天上午,开发人员小毛又找到我,要我协助将这个问题根治一下,拖太久用户难保不投诉。
简单分析了一</div>
</li>
<li><a href="/article/1973.htm"
title="java-75-二叉树两结点的最低共同父结点" target="_blank">java-75-二叉树两结点的最低共同父结点</a>
<span class="text-muted">bylijinnan</span>
<a class="tag" taget="_blank" href="/search/java/1.htm">java</a>
<div>
import java.util.LinkedList;
import java.util.List;
import ljn.help.*;
public class BTreeLowestParentOfTwoNodes {
public static void main(String[] args) {
/*
* node data is stored in</div>
</li>
<li><a href="/article/2100.htm"
title="行业垂直搜索引擎网页抓取项目" target="_blank">行业垂直搜索引擎网页抓取项目</a>
<span class="text-muted">carlwu</span>
<a class="tag" taget="_blank" href="/search/Lucene/1.htm">Lucene</a><a class="tag" taget="_blank" href="/search/Nutch/1.htm">Nutch</a><a class="tag" taget="_blank" href="/search/Heritrix/1.htm">Heritrix</a><a class="tag" taget="_blank" href="/search/Solr/1.htm">Solr</a>
<div>公司有一个搜索引擎项目,希望各路高人有空来帮忙指导,谢谢!
这是详细需求:
(1) 通过提供的网站地址(大概100-200个网站),网页抓取程序能不断抓取网页和其它类型的文件(如Excel、PDF、Word、ppt及zip类型),并且程序能够根据事先提供的规则,过滤掉不相干的下载内容。
(2) 程序能够搜索这些抓取的内容,并能对这些抓取文件按照油田名进行分类,然后放到服务器不同的目录中。
</div>
</li>
<li><a href="/article/2227.htm"
title="[通讯与服务]在总带宽资源没有大幅增加之前,不适宜大幅度降低资费" target="_blank">[通讯与服务]在总带宽资源没有大幅增加之前,不适宜大幅度降低资费</a>
<span class="text-muted">comsci</span>
<a class="tag" taget="_blank" href="/search/%E8%B5%84%E6%BA%90/1.htm">资源</a>
<div>
降低通讯服务资费,就意味着有更多的用户进入,就意味着通讯服务提供商要接待和服务更多的用户,在总体运维成本没有由于技术升级而大幅下降的情况下,这种降低资费的行为将导致每个用户的平均带宽不断下降,而享受到的服务质量也在下降,这对用户和服务商都是不利的。。。。。。。。
&nbs</div>
</li>
<li><a href="/article/2354.htm"
title="Java时区转换及时间格式" target="_blank">Java时区转换及时间格式</a>
<span class="text-muted">Cwind</span>
<a class="tag" taget="_blank" href="/search/java/1.htm">java</a>
<div>本文介绍Java API 中 Date, Calendar, TimeZone和DateFormat的使用,以及不同时区时间相互转化的方法和原理。
问题描述:
向处于不同时区的服务器发请求时需要考虑时区转换的问题。譬如,服务器位于东八区(北京时间,GMT+8:00),而身处东四区的用户想要查询当天的销售记录。则需把东四区的“今天”这个时间范围转换为服务器所在时区的时间范围。
</div>
</li>
<li><a href="/article/2481.htm"
title="readonly,只读,不可用" target="_blank">readonly,只读,不可用</a>
<span class="text-muted">dashuaifu</span>
<a class="tag" taget="_blank" href="/search/js/1.htm">js</a><a class="tag" taget="_blank" href="/search/jsp/1.htm">jsp</a><a class="tag" taget="_blank" href="/search/disable/1.htm">disable</a><a class="tag" taget="_blank" href="/search/readOnly/1.htm">readOnly</a><a class="tag" taget="_blank" href="/search/readOnly/1.htm">readOnly</a>
<div>readOnly 和 readonly 不同,在做js开发时一定要注意函数大小写和jsp黄线的警告!!!我就经历过这么一件事:
使用readOnly在某些浏览器或同一浏览器不同版本有的可以实现“只读”功能,有的就不行,而且函数readOnly有黄线警告!!!就这样被折磨了不短时间!!!(期间使用过disable函数,但是发现disable函数之后后台接收不到前台的的数据!!!)
</div>
</li>
<li><a href="/article/2608.htm"
title="LABjs、RequireJS、SeaJS 介绍" target="_blank">LABjs、RequireJS、SeaJS 介绍</a>
<span class="text-muted">dcj3sjt126com</span>
<a class="tag" taget="_blank" href="/search/js/1.htm">js</a><a class="tag" taget="_blank" href="/search/Web/1.htm">Web</a>
<div>LABjs 的核心是 LAB(Loading and Blocking):Loading 指异步并行加载,Blocking 是指同步等待执行。LABjs 通过优雅的语法(script 和 wait)实现了这两大特性,核心价值是性能优化。LABjs 是一个文件加载器。RequireJS 和 SeaJS 则是模块加载器,倡导的是一种模块化开发理念,核心价值是让 JavaScript 的模块化开发变得更</div>
</li>
<li><a href="/article/2735.htm"
title="[应用结构]入口脚本" target="_blank">[应用结构]入口脚本</a>
<span class="text-muted">dcj3sjt126com</span>
<a class="tag" taget="_blank" href="/search/PHP/1.htm">PHP</a><a class="tag" taget="_blank" href="/search/yii2/1.htm">yii2</a>
<div>入口脚本
入口脚本是应用启动流程中的第一环,一个应用(不管是网页应用还是控制台应用)只有一个入口脚本。终端用户的请求通过入口脚本实例化应用并将将请求转发到应用。
Web 应用的入口脚本必须放在终端用户能够访问的目录下,通常命名为 index.php,也可以使用 Web 服务器能定位到的其他名称。
控制台应用的入口脚本一般在应用根目录下命名为 yii(后缀为.php),该文</div>
</li>
<li><a href="/article/2862.htm"
title="haoop shell命令" target="_blank">haoop shell命令</a>
<span class="text-muted">eksliang</span>
<a class="tag" taget="_blank" href="/search/hadoop/1.htm">hadoop</a><a class="tag" taget="_blank" href="/search/hadoop+shell/1.htm">hadoop shell</a>
<div>
cat
chgrp
chmod
chown
copyFromLocal
copyToLocal
cp
du
dus
expunge
get
getmerge
ls
lsr
mkdir
movefromLocal
mv
put
rm
rmr
setrep
stat
tail
test
text
</div>
</li>
<li><a href="/article/2989.htm"
title="MultiStateView不同的状态下显示不同的界面" target="_blank">MultiStateView不同的状态下显示不同的界面</a>
<span class="text-muted">gundumw100</span>
<a class="tag" taget="_blank" href="/search/android/1.htm">android</a>
<div>只要将指定的view放在该控件里面,可以该view在不同的状态下显示不同的界面,这对ListView很有用,比如加载界面,空白界面,错误界面。而且这些见面由你指定布局,非常灵活。
PS:ListView虽然可以设置一个EmptyView,但使用起来不方便,不灵活,有点累赘。
<com.kennyc.view.MultiStateView xmlns:android=&qu</div>
</li>
<li><a href="/article/3116.htm"
title="jQuery实现页面内锚点平滑跳转" target="_blank">jQuery实现页面内锚点平滑跳转</a>
<span class="text-muted">ini</span>
<a class="tag" taget="_blank" href="/search/JavaScript/1.htm">JavaScript</a><a class="tag" taget="_blank" href="/search/html/1.htm">html</a><a class="tag" taget="_blank" href="/search/jquery/1.htm">jquery</a><a class="tag" taget="_blank" href="/search/html5/1.htm">html5</a><a class="tag" taget="_blank" href="/search/css/1.htm">css</a>
<div>平时我们做导航滚动到内容都是通过锚点来做,刷的一下就直接跳到内容了,没有一丝的滚动效果,而且 url 链接最后会有“小尾巴”,就像#keleyi,今天我就介绍一款 jquery 做的滚动的特效,既可以设置滚动速度,又可以在 url 链接上没有“小尾巴”。
效果体验:http://keleyi.com/keleyi/phtml/jqtexiao/37.htmHTML文件代码:
&</div>
</li>
<li><a href="/article/3243.htm"
title="kafka offset迁移" target="_blank">kafka offset迁移</a>
<span class="text-muted">kane_xie</span>
<a class="tag" taget="_blank" href="/search/kafka/1.htm">kafka</a>
<div>在早前的kafka版本中(0.8.0),offset是被存储在zookeeper中的。
到当前版本(0.8.2)为止,kafka同时支持offset存储在zookeeper和offset manager(broker)中。
从官方的说明来看,未来offset的zookeeper存储将会被弃用。因此现有的基于kafka的项目如果今后计划保持更新的话,可以考虑在合适</div>
</li>
<li><a href="/article/3370.htm"
title="android > 搭建 cordova 环境" target="_blank">android > 搭建 cordova 环境</a>
<span class="text-muted">mft8899</span>
<a class="tag" taget="_blank" href="/search/android/1.htm">android</a>
<div>
1 , 安装 node.js
http://nodejs.org
node -v 查看版本
2, 安装 npm
可以先从 https://github.com/isaacs/npm/tags 下载 源码 解压到</div>
</li>
<li><a href="/article/3497.htm"
title="java封装的比较器,比较是否全相同,获取不同字段名字" target="_blank">java封装的比较器,比较是否全相同,获取不同字段名字</a>
<span class="text-muted">qifeifei</span>
<div> 非常实用的java比较器,贴上代码:
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import net.sf.json.JSONArray;
import net.sf.json.JSONObject;
import net.sf.json.JsonConfig;
i</div>
</li>
<li><a href="/article/3624.htm"
title="记录一些函数用法" target="_blank">记录一些函数用法</a>
<span class="text-muted">.Aky.</span>
<a class="tag" taget="_blank" href="/search/%E4%BD%8D%E8%BF%90%E7%AE%97/1.htm">位运算</a><a class="tag" taget="_blank" href="/search/PHP/1.htm">PHP</a><a class="tag" taget="_blank" href="/search/%E6%95%B0%E6%8D%AE%E5%BA%93/1.htm">数据库</a><a class="tag" taget="_blank" href="/search/%E5%87%BD%E6%95%B0/1.htm">函数</a><a class="tag" taget="_blank" href="/search/IP/1.htm">IP</a>
<div>高手们照旧忽略。
想弄个全天朝IP段数据库,找了个今天最新更新的国内所有运营商IP段,copy到文件,用文件函数,字符串函数把玩下。分割出startIp和endIp这样格式写入.txt文件,直接用phpmyadmin导入.csv文件的形式导入。(生命在于折腾,也许你们觉得我傻X,直接下载人家弄好的导入不就可以,做自己的菜鸟,让别人去说吧)
当然用到了ip2long()函数把字符串转为整型数</div>
</li>
<li><a href="/article/3751.htm"
title="sublime text 3 rust" target="_blank">sublime text 3 rust</a>
<span class="text-muted">wudixiaotie</span>
<a class="tag" taget="_blank" href="/search/Sublime+Text/1.htm">Sublime Text</a>
<div>1.sublime text 3 => install package => Rust
2.cd ~/.config/sublime-text-3/Packages
3.mkdir rust
4.git clone https://github.com/sp0/rust-style
5.cd rust-style
6.cargo build --release
7.ctrl</div>
</li>
</ul>
</div>
</div>
</div>
<footer id="footer" class="mb30 mt30">
<div class="container">
<div class="copyright">Copyright © 2000-2050 IT知识库 E-COM-NET.COM, All Rights Reserved.
</div>
</div>
</footer>
<!-- Syntax highlighting -->
<script type="text/javascript" src="/static/syntaxhighlighter/scripts/shCore.js"></script>
<script type="text/javascript" src="/static/syntaxhighlighter/scripts/shLegacy.js"></script>
<script type="text/javascript" src="/static/syntaxhighlighter/scripts/shAutoloader.js"></script>
<link type="text/css" rel="stylesheet" href="/static/syntaxhighlighter/styles/shCoreDefault.css"/>
<script type="text/javascript" src="/static/syntaxhighlighter/src/my_start_1.js"></script>
</body>
</html>