Getting Started with Web Scraping – Notes on Morvan's Tutorial
Recommended tutorials:
Morvan's tutorial – Web Scraping
Cui Qingcai – Python Web Scraping Tutorial Series
Various recommendations from Zhihu Q&A threads
Kong Miao – An Intro to Web Scraping You'll Understand at a Glance
Course outline:
Web scraping → Parsing pages → Efficient scraping → Advanced scraping libraries
# Open a web page with Python
from urllib.request import urlopen
# if the page contains Chinese characters, apply decode()
html = urlopen(
"https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')
print(html)
Scraping tutorial 1 | 莫烦Python
爬虫测试1
Note:
I was using Jupyter Notebook for interactive work, and this snippet kept raising "ImportError: No module named request", even though the same code ran fine in my IDE and the module clearly exists. It took me a while to sort out, so I am recording the fix here to avoid repeating the mistake.
First I checked whether it was an installation problem. Not having used Jupyter for a while, I had forgotten that Anaconda ships with it, so I installed it again with both pip and pip3. When the error appeared I uninstalled those copies, but the error remained, and Anaconda's bundled Jupyter was still there.
Searching Google and Stack Overflow turned up nothing, and it increasingly looked like a Python version problem. Editing the configuration file did not help either.
Thinking it through, the culprit had to be Anaconda's bundled Jupyter: its default kernel was not the python3.5 environment I was actually using. After some digging I found the solution: Jupyter manages its ipykernels through a kernel.json file, so the fix is to install the ipykernel package directly into the target environment:
$ conda install -n python35 ipykernel
$ python -m ipykernel install --user  # register this environment as a kernel
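To confirm that the new kernel is registered, Jupyter ships a kernelspec subcommand for listing the installed kernels:
$ jupyter kernelspec list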
# Match page content: simple pages can be matched with regular expressions; for anything more involved BeautifulSoup is recommended
# To find the title
import re
res = re.findall(r'<title>(.+?)</title>', html)
print('\nPage title is:', res[0])
# To find the paragraph in the middle
res = re.findall(r'<p>(.+?)</p>', html, flags=re.DOTALL)  # re.DOTALL lets . match across multiple lines
print('\nPage paragraph is:', res[0])
# Find all the links
res = re.findall(r'href="(.+?)"', html)
print('\nAll links:', res)
Page title is: Scraping tutorial 1 | 莫烦Python
Page paragraph is:
这是一个在 莫烦Python
爬虫教程 中的简单测试.
All links: ['"', '"', '"']
Scraping workflow:
1. Pick the URL you want to scrape
2. Open that URL from Python (urlopen, etc.)
3. Read the page content (with read())
4. Feed the content into BeautifulSoup
5. Use BeautifulSoup to select tags and their contents (instead of regular expressions)
$ pip install beautifulsoup4 # Python 2+
$ pip3 install beautifulsoup4 # Python 3+
$ conda install beautifulsoup4 # anaconda
BeautifulSoup 4.2.0 documentation
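To confirm the install worked (bs4 exposes a __version__ attribute):
$ python3 -c "import bs4; print(bs4.__version__)"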
# Read the page as usual
from bs4 import BeautifulSoup
from urllib.request import urlopen
# if the page contains Chinese characters, apply decode()
html = urlopen(
"https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')
print(html)
# Load the page into BeautifulSoup, using the lxml parser
soup=BeautifulSoup(html,features='lxml')
print(soup.h1)
print('\n',soup.p)
# Use find_all to collect every <a> tag; the real link lives inside the tag,
# stored as an attribute that can be read like a dictionary key
all_href=soup.find_all('a')
all_href=[l["href"] for l in all_href]
print('\n',all_href)
Scraping tutorial 1 | 莫烦Python
爬虫测试1
爬虫测试1
['https://morvanzhou.github.io/', 'https://morvanzhou.github.io/tutorials/scraping']
# Read the page first
from bs4 import BeautifulSoup
from urllib.request import urlopen
html=urlopen(
"https://morvanzhou.github.io/static/scraping/list.html"
).read().decode('utf-8')
print(html)
# Load the page into BeautifulSoup, using the lxml parser
soup=BeautifulSoup(html,features='lxml')
爬虫练习 列表 class | 莫烦 Python
列表 爬虫练习
这是一个在 莫烦 Python 的 爬虫教程
里无敌简单的网页, 所有的 code 让你一目了然, 清晰无比.
- 一月
- 一月一号
- 一月二号
- 一月三号
- 二月
- 三月
- 四月
- 五月
# Match by CSS class
month=soup.find_all('li',{"class":"month"})
for m in month:
print(m.get_text())
一月
二月
三月
四月
五月
jan=soup.find('ul',{'class':'jan'})
d_jan=jan.find_all('li')
for d in d_jan:
print(d.get_text())
一月一号
一月二号
一月三号
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
html=urlopen(
"https://morvanzhou.github.io/static/scraping/table.html"
).read().decode('utf-8')
soup=BeautifulSoup(html,features='lxml')
img_links = soup.find_all('img', {'src': re.compile(r'.*?\.jpg')})
for link in img_links:
print(link['src'])
https://morvanzhou.github.io/static/img/course_cover/tf.jpg
https://morvanzhou.github.io/static/img/course_cover/rl.jpg
https://morvanzhou.github.io/static/img/course_cover/scraping.jpg
course_links=soup.find_all('a',{'href':re.compile('https://morvan.*')})
for link in course_links:
print(link['href'])
https://morvanzhou.github.io/
https://morvanzhou.github.io/tutorials/scraping
https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/
https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/
https://morvanzhou.github.io/tutorials/data-manipulation/scraping/
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import random
# Set the start page; the /item/... sub-URLs go into `his` as a record of the pages we have visited
base_url="https://baike.baidu.com"
his=["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]
# select the last sub url in 'his', print the title and url
url=base_url+his[-1]
html=urlopen(url).read().decode('utf-8')
soup=BeautifulSoup(html,features='lxml')
print(soup.find('h1').get_text(),' url: ',his[-1])
网络爬虫 url: /item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711
# find valid urls
sub_urls=soup.find_all('a',{'target':'_blank','href':re.compile('/item/(%.{2})+$')})
if len(sub_urls)!=0:
his.append(random.sample(sub_urls,1)[0]['href'])
else:
# no valid sub link found
his.pop()
print(his)
['/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711', '/item/%E6%8E%92%E5%BA%8F%E7%AE%97%E6%B3%95']
# Put it in a for loop so the crawler keeps hopping between different pages
his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]
for i in range(20):
url = base_url + his[-1]
html = urlopen(url).read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')
print(i, soup.find('h1').get_text(), ' url: ', his[-1])
# find valid urls
sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})
if len(sub_urls) != 0:
his.append(random.sample(sub_urls, 1)[0]['href'])
else:
# no valid sub link found
his.pop()
0 网络爬虫 url: /item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711
1 搜索引擎 url: /item/%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E
2 全文索引 url: /item/%E5%85%A8%E6%96%87%E7%B4%A2%E5%BC%95
3 搜索引擎 url: /item/%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E
4 网站 url: /item/%E7%BD%91%E7%AB%99
5 通信协议 url: /item/%E9%80%9A%E4%BF%A1%E5%8D%8F%E8%AE%AE
6 数据单元 url: /item/%E6%95%B0%E6%8D%AE%E5%8D%95%E5%85%83
7 数据包 url: /item/%E6%95%B0%E6%8D%AE%E5%8C%85
8 会话层 url: /item/%E4%BC%9A%E8%AF%9D%E5%B1%82
9 数据流 url: /item/%E6%95%B0%E6%8D%AE%E6%B5%81
10 聚类分析 url: /item/%E8%81%9A%E7%B1%BB%E5%88%86%E6%9E%90
11 中数据 url: /item/%E4%B8%AD%E6%95%B0%E6%8D%AE
12 中数据 url: /item/%E4%B8%AD%E6%95%B0%E6%8D%AE
13 中数据 url: /item/%E4%B8%AD%E6%95%B0%E6%8D%AE
14 企业邮箱 url: /item/%E4%BC%81%E4%B8%9A%E9%82%AE%E7%AE%B1
15 域名 url: /item/%E5%9F%9F%E5%90%8D
16 商标权 url: /item/%E5%95%86%E6%A0%87%E6%9D%83
17 商标注册不当 url: /item/%E5%95%86%E6%A0%87%E6%B3%A8%E5%86%8C%E4%B8%8D%E5%BD%93
18 中国驰名商标 url: /item/%E9%A9%B0%E5%90%8D%E5%95%86%E6%A0%87
19 商标评审委员会 url: /item/%E5%95%86%E6%A0%87%E8%AF%84%E5%AE%A1%E5%A7%94%E5%91%98%E4%BC%9A
Note: Python error log: IndexError: list index out of range (see the small guard sketch below)
1. Case 1: the index passed to list[index] is out of range
2. Case 2: the list is empty, so even list[0] raises this error
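A minimal guard that covers both cases, reusing the title regex from earlier (this assumes `re` and `html` are still defined as in the cells above):
res = re.findall(r'<title>(.+?)</title>', html)
if res:  # an empty result list means res[0] would raise IndexError
    print('Page title is:', res[0])
else:
    print('No title found on this page')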
Two ways to request a page: GET and POST
1. POST: logging in with an account, submitting a search, uploading images or files, sending data to the server, etc.
2. GET: simply opening a page without sending data to the server
Seen from an active/passive point of view:
1. POST sends data, so it is the more active method: what you send shapes what the server returns.
2. GET fetches, so it is passive: you send no personalised information, and the server will not return a different HTML page tailored to you.
The requests module supports all of these request methods and is very convenient to use.
$ pip install requests # python 2+
$ pip3 install requests # python 3+
Requests documentation (Chinese)
# requests GET request
import requests
import webbrowser
param = {'wd': '网络爬虫'}  # the search query: "s?wd=网络爬虫" is the key piece of information in the search URL
r=requests.get('http://www.baidu.com/s',params=param)
print(r.url)
webbrowser.open(r.url)
http://www.baidu.com/s?wd=%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB
True
# requests POST request
# Use Python to submit a form and land on the page shown after submission
data={'firstname':'Morvan','lastname':'Zhou'}
r=requests.post('http://pythonscraping.com/files/processing.php',data=data)
print(r.text)
Hello there, Morvan Zhou!
# Uploading an image is also a kind of POST
file = {'uploadFile': open('./images/webwxgetmsgimg.jpeg', 'rb')}
r=requests.post('http://pythonscraping.com/files/processing2.php',files=file)
print(r.text)
The file webwxgetmsgimg.jpeg has been uploaded.
# Logging in
"""
What the browser does:
1. Logs in to the URL with a POST request
2. The POST carries the username and password from the Form data
3. Some cookies are generated
"""
payload={'username':'Morvan','password':'password'}
r=requests.post('http://pythonscraping.com/pages/cookies/welcome.php',data=payload)
print(r.cookies.get_dict())
{'username': 'Morvan', 'loggedin': '1'}
r=requests.get('http://pythonscraping.com/pages/cookies/welcome.php',cookies=r.cookies)
print(r.text)
Welcome to the Website!
You have logged in successfully!
Check out your profile!
# Log in using a Session
session=requests.Session()
payload={'username':'Morvan','password':'password'}
r=session.post('http://pythonscraping.com/pages/cookies/welcome.php',data=payload)
print(r.cookies.get_dict())
{'username': 'Morvan', 'loggedin': '1'}
r=session.get('http://pythonscraping.com/pages/cookies/welcome.php')
print(r.text)
Welcome to the Website!
You have logged in successfully!
Check out your profile!
import os
os.makedirs('./images/', exist_ok=True)  # create a folder for the downloads
IMAGE_URL='https://morvanzhou.github.io/static/img/description/learning_step_flowchart.png'
from urllib.request import urlretrieve  # urllib provides a download helper, urlretrieve
urlretrieve(IMAGE_URL,'./images/Image1.png')
('./images/Image1.png', <http.client.HTTPMessage object at 0x…>)
import requests
r=requests.get(IMAGE_URL)
with open('./images/Image2.png','wb') as f:
f.write(r.content)
r = requests.get(IMAGE_URL, stream=True) # stream loading
with open('./images/Image3.png', 'wb') as f:
for chunk in r.iter_content(chunk_size=32):
f.write(chunk)
from bs4 import BeautifulSoup
import requests
URL = "http://www.nationalgeographic.com.cn/animals/"
html = requests.get(URL).text
soup = BeautifulSoup(html, 'lxml')
img_ul = soup.find_all('ul', {"class": "img_list"})
for ul in img_ul:
imgs = ul.find_all('img')
for img in imgs:
url = img['src']
r = requests.get(url, stream=True)
image_name = url.split('/')[-1]
with open('./images/%s' % image_name, 'wb') as f:
for chunk in r.iter_content(chunk_size=128):
f.write(chunk)
print('Saved %s' % image_name)
Saved 20180319103523967.jpg
Saved 20180315040949958.jpg
Saved 20180311101349652.jpg
Saved 20180306113348212.jpg
Saved 20180306103802944.jpg
Saved 20180305104502788.jpg
import multiprocessing as mp
import time
from urllib.request import urlopen,urljoin
from bs4 import BeautifulSoup
import re
base_url = 'https://morvanzhou.github.io/'
def crawl(url):
response=urlopen(url)
time.sleep(0.1)
return response.read().decode()
def parse(html):
soup=BeautifulSoup(html,'lxml')
urls=soup.find_all('a',{'href':re.compile('^/.+?/$')})
title=soup.find('h1').get_text().strip()
page_urls=set([urljoin(base_url,url['href']) for url in urls])
url=soup.find('meta',{'property':'og:url'})['content']
return title,page_urls,url
unseen = set([base_url,])
seen = set()
if base_url != "http://127.0.0.1:4000/":
restricted_crawl = True
else:
restricted_crawl = False
count, t1 = 1, time.time()
while len(unseen) != 0: # still get some url to visit
if restricted_crawl and len(seen) > 20:
break
print('\nDistributed Crawling...')
htmls = [crawl(url) for url in unseen]
print('\nDistributed Parsing...')
results = [parse(html) for html in htmls]
print('\nAnalysing...')
seen.update(unseen) # seen the crawled
unseen.clear() # nothing unseen
for title, page_urls, url in results:
print(count, title, url)
count += 1
unseen.update(page_urls - seen) # get new url to crawl
print('Total time: %.1f s' % (time.time()-t1, )) # 53 s
Distributed Crawling...
Distributed Parsing...
Analysing...
1 教程 https://morvanzhou.github.io/
Distributed Crawling...
Distributed Parsing...
Analysing...
2 数据处理教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/
3 Matplotlib 画图教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/plt/
4 机器学习系列 https://morvanzhou.github.io/tutorials/machine-learning/
5 为了更优秀 https://morvanzhou.github.io/support/
6 高级爬虫: 高效无忧的 Scrapy 爬虫库 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/5-02-scrapy/
7 机器学习实践 https://morvanzhou.github.io/tutorials/machine-learning/ML-practice/
8 网页爬虫教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/
9 Why? https://morvanzhou.github.io/tutorials/data-manipulation/scraping/1-00-why/
10 multiprocessing 多进程教程系列 https://morvanzhou.github.io/tutorials/python-basic/multiprocessing/
11 Linux 简易教学 https://morvanzhou.github.io/tutorials/others/linux-basic/
12 关于莫烦 https://morvanzhou.github.io/about/
13 近期更新 https://morvanzhou.github.io/recent-posts/
14 高级爬虫: 让 Selenium 控制你的浏览器帮你爬 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/5-01-selenium/
15 进化算法 Evolutionary Strategies 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/evolutionary-algorithm/
16 Theano 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/theano/
17 Threading 多线程教程系列 https://morvanzhou.github.io/tutorials/python-basic/threading/
18 说吧~ https://morvanzhou.github.io/discuss/
19 有趣的机器学习系列 https://morvanzhou.github.io/tutorials/machine-learning/ML-intro/
20 Python基础 教程系列 https://morvanzhou.github.io/tutorials/python-basic/
21 Keras 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/keras/
22 Pytorch 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/torch/
23 推荐学习顺序 https://morvanzhou.github.io/learning-steps/
24 Git 版本管理 教程系列 https://morvanzhou.github.io/tutorials/others/git/
25 Tkinter GUI 教程系列 https://morvanzhou.github.io/tutorials/python-basic/tkinter/
26 其他教学系列 https://morvanzhou.github.io/tutorials/others/
27 基础教程系列 https://morvanzhou.github.io/tutorials/python-basic/basic/
28 Tensorflow 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/
29 迁移学习 Transfer Learning https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/5-16-transfer-learning/
30 Numpy & Pandas 教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/np-pd/
31 Sklearn 通用机器学习 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/sklearn/
32 强化学习 Reinforcement Learning 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/
33 迁移学习 Transfer Learning https://morvanzhou.github.io/tutorials/machine-learning/ML-intro/2-9-transfer-learning/
Total time: 19.8 s
unseen = set([base_url,])
seen = set()
pool = mp.Pool(4)
count, t1 = 1, time.time()
while len(unseen) != 0: # still get some url to visit
if restricted_crawl and len(seen) > 20:
break
print('\nDistributed Crawling...')
crawl_jobs = [pool.apply_async(crawl, args=(url,)) for url in unseen]
htmls = [j.get() for j in crawl_jobs] # request connection
print('\nDistributed Parsing...')
parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]
results = [j.get() for j in parse_jobs] # parse html
print('\nAnalysing...')
seen.update(unseen) # seen the crawled
unseen.clear() # nothing unseen
for title, page_urls, url in results:
print(count, title, url)
count += 1
unseen.update(page_urls - seen) # get new url to crawl
print('Total time: %.1f s' % (time.time()-t1, )) # 16 s !!!
Distributed Crawling...
Distributed Parsing...
Analysing...
1 教程 https://morvanzhou.github.io/
Distributed Crawling...
Distributed Parsing...
Analysing...
2 数据处理教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/
3 Matplotlib 画图教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/plt/
4 机器学习系列 https://morvanzhou.github.io/tutorials/machine-learning/
5 为了更优秀 https://morvanzhou.github.io/support/
6 高级爬虫: 高效无忧的 Scrapy 爬虫库 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/5-02-scrapy/
7 机器学习实践 https://morvanzhou.github.io/tutorials/machine-learning/ML-practice/
8 网页爬虫教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/
9 Why? https://morvanzhou.github.io/tutorials/data-manipulation/scraping/1-00-why/
10 multiprocessing 多进程教程系列 https://morvanzhou.github.io/tutorials/python-basic/multiprocessing/
11 Linux 简易教学 https://morvanzhou.github.io/tutorials/others/linux-basic/
12 关于莫烦 https://morvanzhou.github.io/about/
13 近期更新 https://morvanzhou.github.io/recent-posts/
14 高级爬虫: 让 Selenium 控制你的浏览器帮你爬 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/5-01-selenium/
15 进化算法 Evolutionary Strategies 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/evolutionary-algorithm/
16 Theano 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/theano/
17 Threading 多线程教程系列 https://morvanzhou.github.io/tutorials/python-basic/threading/
18 说吧~ https://morvanzhou.github.io/discuss/
19 有趣的机器学习系列 https://morvanzhou.github.io/tutorials/machine-learning/ML-intro/
20 Python基础 教程系列 https://morvanzhou.github.io/tutorials/python-basic/
21 Keras 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/keras/
22 Pytorch 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/torch/
23 推荐学习顺序 https://morvanzhou.github.io/learning-steps/
24 Git 版本管理 教程系列 https://morvanzhou.github.io/tutorials/others/git/
25 Tkinter GUI 教程系列 https://morvanzhou.github.io/tutorials/python-basic/tkinter/
26 其他教学系列 https://morvanzhou.github.io/tutorials/others/
27 基础教程系列 https://morvanzhou.github.io/tutorials/python-basic/basic/
28 Tensorflow 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/
29 迁移学习 Transfer Learning https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/5-16-transfer-learning/
30 Numpy & Pandas 教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/np-pd/
31 Sklearn 通用机器学习 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/sklearn/
32 强化学习 Reinforcement Learning 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/
33 迁移学习 Transfer Learning https://morvanzhou.github.io/tutorials/machine-learning/ML-intro/2-9-transfer-learning/
Total time: 7.8 s
# asyncio: asynchronous execution within a single thread; downloading pages and processing them no longer have to run back to back,
# which makes better use of the time otherwise spent waiting for downloads
import time
def job(t):
print('Start job',t)
time.sleep(t)
print('Job',t,' takes',t,'s')
def main():
[job(t) for t in range(1,3)]
t1=time.time()
main()
print('No async total time: ',time.time()-t1)
Start job 1
Job 1 takes 1 s
Start job 2
Job 2 takes 2 s
No async total time: 3.00662899017334
import asyncio
async def job(t):  # an async version of job
    print('Start job', t)
    await asyncio.sleep(t)  # wait t seconds; other tasks can run during the wait
    print('Job', t, ' takes', t, 's')
async def main(loop):  # an async version of main
    tasks = [loop.create_task(job(t)) for t in range(1, 3)]  # create the tasks without running them yet
    await asyncio.wait(tasks)  # run them and wait until every task has finished
t1 = time.time()
loop = asyncio.get_event_loop()  # get the event loop
loop.run_until_complete(main(loop))  # run main inside the loop
print("Async total time:", time.time()-t1)
Start job 1
Start job 2
Job 1 takes 1 s
Job 2 takes 2 s
Async total time: 2.0041818618774414
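On Python 3.7 or newer the same example can be driven with asyncio.run() and asyncio.gather(), which saves handling the loop object by hand. A minimal sketch (run it as a plain script; a Jupyter notebook already owns a running event loop, so asyncio.run() would complain there):
import asyncio
import time

async def job(t):
    print('Start job', t)
    await asyncio.sleep(t)  # yield control while sleeping
    print('Job', t, ' takes', t, 's')

async def main():
    await asyncio.gather(*(job(t) for t in range(1, 3)))  # schedule and await both jobs

t1 = time.time()
asyncio.run(main())  # creates, runs and closes the event loop for us
print("Async total time:", time.time() - t1)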
$ pip3 install aiohttp
import requests
URL = 'https://morvanzhou.github.io/'
def normal():
for i in range(2):
r=requests.get(URL)
url=r.url
print(url)
t1=time.time()
normal()
print('Normal total time:',time.time()-t1)
https://morvanzhou.github.io/
https://morvanzhou.github.io/
Normal total time: 0.6391615867614746
import aiohttp
async def job(session):
    response = await session.get(URL)  # await the response; switch to other tasks while waiting
    return str(response.url)
async def main(loop):
    async with aiohttp.ClientSession() as session:  # the Session form recommended by the aiohttp docs
        tasks = [loop.create_task(job(session)) for _ in range(2)]
        finished, unfinished = await asyncio.wait(tasks)
        all_results = [r.result() for r in finished]  # collect every result
        print(all_results)
t1=time.time()
loop=asyncio.get_event_loop()
loop.run_until_complete(main(loop))
loop.close()
print('Async total time:',time.time()-t1)
['https://morvanzhou.github.io/', 'https://morvanzhou.github.io/']
Async total time: 0.30881452560424805
import aiohttp
import asyncio
import time
from bs4 import BeautifulSoup
from urllib.request import urljoin
import re
import multiprocessing as mp
base_url = "https://morvanzhou.github.io/"
# DON'T OVER CRAWL THE WEBSITE OR YOU MAY NEVER VISIT AGAIN
if base_url != "http://127.0.0.1:4000/":
restricted_crawl = True
else:
restricted_crawl = False
seen = set()
unseen = set([base_url])
def parse(html):
soup = BeautifulSoup(html, 'lxml')
urls = soup.find_all('a', {"href": re.compile('^/.+?/$')})
title = soup.find('h1').get_text().strip()
page_urls = set([urljoin(base_url, url['href']) for url in urls])
url = soup.find('meta', {'property': "og:url"})['content']
return title, page_urls, url
async def crawl(url, session):
r = await session.get(url)
html = await r.text()
await asyncio.sleep(0.1) # slightly delay for downloading
return html
async def main(loop):
pool = mp.Pool(8) # slightly affected
async with aiohttp.ClientSession() as session:
count = 1
while len(unseen) != 0:
print('\nAsync Crawling...')
tasks = [loop.create_task(crawl(url, session)) for url in unseen]
finished, unfinished = await asyncio.wait(tasks)
htmls = [f.result() for f in finished]
print('\nDistributed Parsing...')
parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]
results = [j.get() for j in parse_jobs]
print('\nAnalysing...')
seen.update(unseen)
unseen.clear()
for title, page_urls, url in results:
# print(count, title, url)
unseen.update(page_urls - seen)
count += 1
if __name__ == "__main__":
t1 = time.time()
loop = asyncio.get_event_loop()
loop.run_until_complete(main(loop))
#loop.close()
print("Async total time: ", time.time() - t1)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-…> in <module>()
61 t1 = time.time()
62 loop = asyncio.get_event_loop()
---> 63 loop.run_until_complete(main(loop))
64 loop.close()
65 print("Async total time: ", time.time() - t1)
~/anaconda2/envs/python35/lib/python3.5/asyncio/base_events.py in run_until_complete(self, future)
441 Return the Future's result, or raise its exception.
442 """
--> 443 self._check_closed()
444
445 new_task = not futures.isfuture(future)
~/anaconda2/envs/python35/lib/python3.5/asyncio/base_events.py in _check_closed(self)
355 def _check_closed(self):
356 if self._closed:
--> 357 raise RuntimeError('Event loop is closed')
358
359 def _asyncgen_finalizer_hook(self, agen):
RuntimeError: Event loop is closed
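The traceback appears because the default event loop was already closed by the loop.close() call in the earlier aiohttp example; a closed loop cannot be reused, so run_until_complete() fails immediately. In a notebook the easy way out (a sketch, assuming main and t1 are defined as in the cell above) is to create and register a fresh loop before running, or simply never call loop.close():
import asyncio
loop = asyncio.new_event_loop()  # replace the closed default loop with a new one
asyncio.set_event_loop(loop)
loop.run_until_complete(main(loop))
print("Async total time: ", time.time() - t1)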
from urllib.request import urlopen, urljoin
from bs4 import BeautifulSoup
import multiprocessing as mp
import re
import time
def crawl(url):
response = urlopen(url)
time.sleep(0.1) # slightly delay for downloading
return response.read().decode()
def parse(html):
soup = BeautifulSoup(html, 'lxml')
urls = soup.find_all('a', {"href": re.compile('^/.+?/$')})
title = soup.find('h1').get_text().strip()
page_urls = set([urljoin(base_url, url['href']) for url in urls])
url = soup.find('meta', {'property': "og:url"})['content']
return title, page_urls, url
if __name__ == '__main__':
base_url = 'https://morvanzhou.github.io/'
#base_url = "http://127.0.0.1:4000/"
# DON'T OVER CRAWL THE WEBSITE OR YOU MAY NEVER VISIT AGAIN
if base_url != "http://127.0.0.1:4000/":
restricted_crawl = True
else:
restricted_crawl = False
unseen = set([base_url,])
seen = set()
pool = mp.Pool(8) # number strongly affected
count, t1 = 1, time.time()
while len(unseen) != 0: # still get some url to visit
if restricted_crawl and len(seen) > 20:
break
print('\nDistributed Crawling...')
crawl_jobs = [pool.apply_async(crawl, args=(url,)) for url in unseen]
htmls = [j.get() for j in crawl_jobs] # request connection
htmls = [h for h in htmls if h is not None] # remove None
print('\nDistributed Parsing...')
parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]
results = [j.get() for j in parse_jobs] # parse html
print('\nAnalysing...')
seen.update(unseen)
unseen.clear()
for title, page_urls, url in results:
# print(count, title, url)
count += 1
unseen.update(page_urls - seen)
print('Total time: %.1f s' % (time.time()-t1, ))
Selenium: it can drive your browser and convincingly "read" web pages the way a human does
When to use Selenium:
- the usual approaches cannot reach the content you want
- the site plays hide-and-seek with you: too much JavaScript-generated content
- you need a crawler that browses like a human
$ pip install selenium  # python 2+
$ pip3 install selenium  # python 3+
Download the driver for Linux and macOS:
After downloading, Linux and macOS users need to place the "geckodriver" file in "/usr/bin" or "/usr/local/bin", then run in a terminal:
$ sudo cp <path to your geckodriver> /usr/local/bin
$ sudo chmod +x /usr/local/bin/geckodriver
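To check that the driver is now visible on your PATH (geckodriver can report its own version):
$ geckodriver --version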
Install the Firefox extension Katalon Recorder to record repetitive browsing work.
Click the extension and start clicking around the site; each click is recorded into a log. When you are done, press the Export button to see the generated browsing code. It can export Python 2 code, from which you copy the parts you need.
import os
os.makedirs('./images/',exist_ok=True)
from selenium import webdriver
driver = webdriver.Firefox()  # open a Firefox window
# the steps recorded and exported by Katalon
driver.get("https://morvanzhou.github.io/")
driver.find_element_by_xpath(u"//img[@alt='强化学习 (Reinforcement Learning)']").click()
driver.find_element_by_link_text("About").click()
driver.find_element_by_link_text(u"赞助").click()
driver.find_element_by_link_text(u"教程 ▾").click()
driver.find_element_by_link_text(u"数据处理 ▾").click()
driver.find_element_by_link_text(u"网页爬虫").click()
# grab the page html; you can also take a screenshot
html = driver.page_source
driver.get_screenshot_as_file('./images/screenshot1.png')
driver.close()
print(html[:200])
# run selenium without popping up a browser window, so it performs these steps quietly
from selenium.webdriver.firefox.options import Options
firefox_options=Options()
firefox_options.add_argument('--headless') #define headless
driver=webdriver.Firefox(firefox_options=firefox_options)
# the steps recorded and exported by Katalon
driver.get("https://morvanzhou.github.io/")
driver.find_element_by_xpath(u"//img[@alt='强化学习 (Reinforcement Learning)']").click()
driver.find_element_by_link_text("About").click()
driver.find_element_by_link_text(u"赞助").click()
driver.find_element_by_link_text(u"教程 ▾").click()
driver.find_element_by_link_text(u"数据处理 ▾").click()
driver.find_element_by_link_text(u"网页爬虫").click()
# grab the page html; you can also take a screenshot
html = driver.page_source
driver.get_screenshot_as_file('./images/screenshot2.png')
driver.close()
print(html[:200])
Advanced scraping: Scrapy, an efficient and worry-free scraping framework
$ pip3 install scrapy
Scrapy is a scraping framework, not just a single crawler: it bundles crawling, data processing, and data storage into one end-to-end service.
import scrapy
class MofanSpider(scrapy.Spider):
name = "mofan"
start_urls = [
'https://morvanzhou.github.io/',
]
# unseen = set()
# seen = set() # we don't need these two as scrapy will deal with them automatically
def parse(self, response):
yield { # return some results
'title': response.css('h1::text').extract_first(default='Missing').strip().replace('"', ""),
'url': response.url,
}
urls = response.css('a::attr(href)').re(r'^/.+?/$') # find all sub urls
for url in urls:
yield response.follow(url, callback=self.parse) # it will filter duplication automatically
# lastly, run this in terminal
# scrapy runspider 5-2-scrapy.py -o res.json
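With the -o flag shown above, Scrapy writes each dict yielded by parse() to res.json, so after the run the file holds a JSON array of {'title': ..., 'url': ...} records, one per crawled page; duplicate requests are filtered automatically, as the comment on response.follow notes.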