In the last post, we built a Tiantian Fund (eastmoney) crawler to check the net value of the funds a user holds, since SPD Bank doesn't update until the next morning, which is just too slow.
Readers who don't remember can follow this link to the previous post.
But some of you will say: oops, I accidentally bought a whole lot of funds, and crawling them one at a time is too slow to bear (๑ŏ _ ŏ๑). So what do we do? Let's try a multithreaded crawler.
Here we use a module from the standard library:
from multiprocessing.dummy import Pool
We can also use a function from the multiprocessing package to read the number of logical CPUs on the current machine (not the number of physical cores but the number of hardware threads; some chips support hyper-threading, even 4-way SMT). Like this:
#by concyclics
# -*- coding:UTF-8 -*-
from multiprocessing import cpu_count
print(cpu_count())
#16
We can hand tasks to different threads like this:
#by concyclics
# -*- coding:UTF-8 -*-
from multiprocessing.dummy import Pool

def act(s:str):
    print('my name is thread:'+s)

pool=Pool(16)
for i in range(16):
    pool.apply_async(act,(str(i),))
pool.close()
pool.join()
Output:
my name is thread:0
my name is thread:1
my name is thread:2
my name is thread:3
my name is thread:4
my name is thread:5
my name is thread:9
my name is thread:6
my name is thread:14
my name is thread:7
my name is thread:13
my name is thread:8
my name is thread:15
my name is thread:10
my name is thread:11
my name is thread:12
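Incidentally, the scrambled order above shows the threads really did run concurrently. If you only need the return values, `pool.map` is a simpler alternative to `apply_async`; a minimal sketch, not from the original post:

```python
# pool.map submits every task, blocks until all of them finish,
# and returns the results in input order, so no explicit
# close()/join() bookkeeping is needed.
from multiprocessing.dummy import Pool  # a thread pool, despite the name

def act(s: str) -> str:
    return 'my name is thread:' + s

with Pool(4) as pool:
    results = pool.map(act, [str(i) for i in range(4)])

print(results)  # result order matches the input order
```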
But there is a problem: how do we get values back from the different threads? Here we take a shortcut and try using global inside the function to update a global variable:
#by concyclics
# -*- coding:UTF-8 -*-
from multiprocessing.dummy import Pool

v=0
def ad():
    global v
    v+=10
    print(v)

pool=Pool(16)
for i in range(16):
    pool.apply_async(ad)
pool.close()
pool.join()
print("final ",v)

Output:
10
20
50
40
60
110
120
140
90
70
130
100
80
150
160
30
final 160
We can see that this does indeed update the global variable v. (It works because multiprocessing.dummy uses threads, which share memory; but note that v+=10 is a read-modify-write and is not atomic, so this is a shortcut rather than a thread-safe pattern and can lose updates under contention.)
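To make the shared-counter approach safe regardless of how the threads are scheduled, the standard fix is a threading.Lock around the update; a small sketch along the lines of the example above:

```python
from multiprocessing.dummy import Pool
from threading import Lock

v = 0
lock = Lock()  # serializes access to the shared counter

def ad():
    global v
    with lock:  # only one thread may run the read-modify-write at a time
        v += 10

pool = Pool(16)
for _ in range(16):
    pool.apply_async(ad)
pool.close()
pool.join()
print("final", v)  # always 160
```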
The crawler function itself was explained step by step in the previous post: link
Here we reuse last time's crawler function directly, and move the body of the for loop into an act() function.
It used to look like this:
if __name__=='__main__':
    total=0
    for code in funds:
        share=funds[code]
        price=share*getfund(code)
        total+=price
        print('份额:',share,'市值:','%.2f'%price)
Now it looks like this:
def act(code:str):
    global total
    share=funds[code]
    price=share*getfund(code)
    total+=price
    print('份额:',share,'市值:','%.2f'%price)
Then, iterating over our dictionary, we submit one task per fund code to the pool (note: despite the module name, multiprocessing.dummy creates threads, not processes):
if __name__=='__main__':
    total=0
    pool=Pool(cpu_count())
    for code in funds:
        pool.apply_async(act,(code,))
    pool.close()
    pool.join()
    print('\n总计:','%.2f'%total)
Putting it all together, the complete program:

#by concyclics
# -*- coding:UTF-8 -*-
import requests
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool
from multiprocessing import cpu_count

header={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
}
funds={
    '004432':2673.06,
    '001156':739.65,
    '009265':893.87,
    '160222':2888.71,
    '009821':1000.00,
    '008903':2215.10,
    '161725':2513.26,
    '001475':1781.60,
    '161028':2571.06,
    '270002':2772.19,
    '008168':9905.49}

def getfund(code:str):
    url='http://fund.eastmoney.com/'+code+'.html'
    page=requests.get(url,headers=header)
    html=str(page.content,'utf-8')
    # re-decode the response bytes as utf-8
    soup=BeautifulSoup(html,'lxml')
    value=soup.find_all('dd',{'class':'dataNums'})[1].find('span').getText()
    name=soup.find('a',{'href':url,'target':"_self"}).getText()
    date=soup.find('dl',{'class':"dataItem02"}).find('p').getText()[6:-1]
    print("基金编号:",code,'\n基金名:',name,"\n日期:",date,"净值:",value)
    return float(value)

def act(code:str):
    global total
    share=funds[code]
    price=share*getfund(code)
    total+=price
    print('份额:',share,'市值:','%.2f'%price)

if __name__=='__main__':
    total=0
    pool=Pool(cpu_count())
    for code in funds:
        pool.apply_async(act,(code,))
    pool.close()
    pool.join()
    print('\n总计:','%.2f'%total)
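An alternative to the global total (and the race it invites) is to return each fund's market value from the worker and let pool.map collect the results. A sketch with getfund stubbed out so it runs without the network; the fund codes and the stub's net value here are placeholders:

```python
from multiprocessing.dummy import Pool

funds = {'004432': 2673.06, '001156': 739.65}  # sample shares

def getfund(code: str) -> float:
    return 1.0  # stub net value; the real version scrapes eastmoney

def act(code: str) -> float:
    # return the value instead of mutating a shared total
    return funds[code] * getfund(code)

with Pool(4) as pool:
    values = pool.map(act, funds)  # iterates over the dict's keys

total = sum(values)
print('%.2f' % total)
```

Because each worker only reads shared state and returns its result, no lock is needed.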
Let's run both versions and compare the speed:
Single-threaded: 7.30 s
Multithreaded: 2.36 s
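The post doesn't show how these timings were taken; one common way is time.perf_counter, sketched here with a stand-in for the actual crawl:

```python
import time

def crawl():
    time.sleep(0.1)  # stand-in for the actual crawling work

start = time.perf_counter()
crawl()
elapsed = time.perf_counter() - start
print('%.2f s' % elapsed)
```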