Getting Started with Web Scraping -- Notes on the 莫烦 (Morvan) Tutorials

Recommended tutorials:

莫烦 (Morvan) tutorials -- Web scraping

崔庆才 (Cui Qingcai) -- Python web scraping tutorial series

Assorted recommendations from Zhihu Q&A threads

孔淼 (Kong Miao) -- A crawler introduction you can understand at a glance

Course structure:

web crawling → parsing pages → efficient crawling → advanced crawler libraries

Introduction to web crawlers

# open a web page with Python

from urllib.request import urlopen

# if the page contains Chinese, apply decode()
html = urlopen(
    "https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')
print(html)



    
Output: the raw HTML of the page; its visible text is the title "Scraping tutorial 1 | 莫烦Python", the heading "爬虫测试1", and the paragraph "这是一个在 莫烦Python 爬虫教程 中的简单测试."

Note:

I was working interactively in jupyter notebook and kept getting "ImportError: No module named request" at runtime. The same code ran fine in the IDE, and the module does exist, so I spent some time sorting this out; I am recording it here to avoid making the same mistake again.

  1. First I checked whether it was an installation problem. I had not used jupyter for a long time and forgot that anaconda ships with its own jupyter, so I installed it again with both pip and pip3. When the error appeared I uninstalled those copies, but the error remained, and I then noticed anaconda already contained jupyter.

  2. I looked for a fix on Google and Stack Overflow, without success; it felt like a Python-version problem, so I edited the configuration file, but the error persisted.

  3. Thinking it through, the problem had to be the jupyter bundled with anaconda: its default environment was not the python3.5 environment I was actually using. After some digging I found the solution. Jupyter's ipykernels are managed through a file called kernel.json, so the fix is simply to install the ipykernel package into the environment I want:

$ conda install -n python35 ipykernel
$ python -m ipykernel install --user # register this environment's kernel with jupyter
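
A quick way to double-check the result (assuming jupyter is on the PATH) is to list the registered kernels:

$ jupyter kernelspec list # prints each registered kernel name and the directory holding its kernel.json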
# Match the page content: for simple pages a regular expression is enough;
# for anything more involved BeautifulSoup is the recommended tool

# to find the title
import re
res=re.findall(r"<title>(.+?)</title>",html)
print('\nPage title is:',res[0])

# to find the paragraph in the middle
res=re.findall(r"<p>(.+?)</p>",html,flags=re.DOTALL) # re.DOTALL lets . match newlines too
print('\nPage paragraph is:',res[0])

# find all the links
# (note: the non-greedy group below captures only one character, the opening quote;
#  r'href="(.+?)"' would capture the full urls)
res=re.findall(r'href=(.+?)',html)
print('\nAll links:',res)
Page title is: Scraping tutorial 1 | 莫烦Python

Page paragraph is: 
        这是一个在 莫烦Python
        爬虫教程 中的简单测试.


All links: ['"', '"', '"']

Parsing web pages with BeautifulSoup

Crawling workflow:
1. pick the url you want to crawl
2. open that url from Python (urlopen, etc.)
3. read the page content (read())
4. feed what you read into BeautifulSoup
5. use BeautifulSoup to select tag information (instead of regular expressions); a condensed sketch follows right after this list
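
A condensed sketch of these five steps (assuming beautifulsoup4 and lxml are already installed; installation notes follow, and the fetch_soup helper name is only for illustration):

from urllib.request import urlopen
from bs4 import BeautifulSoup

def fetch_soup(url):
    # 2. open the url from Python, 3. read and decode the page, 4. hand it to BeautifulSoup
    html = urlopen(url).read().decode('utf-8')
    return BeautifulSoup(html, features='lxml')

# 1. the chosen url; 5. select tag information with BeautifulSoup instead of a regular expression
soup = fetch_soup("https://morvanzhou.github.io/static/scraping/basic-structure.html")
print(soup.title.get_text())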

Installing BeautifulSoup

$ pip install beautifulsoup4 # Python 2+ 
$ pip3 install beautifulsoup4 # Python 3+ 
$ conda install beautifulsoup4 # anaconda

BeautifulSoup 4.2.0 documentation

Parsing with BeautifulSoup: basics

# read the page the usual way
from bs4 import BeautifulSoup
from urllib.request import urlopen

# if the page contains Chinese, apply decode()
html = urlopen(
    "https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')
print(html)

# load the page into BeautifulSoup, using the lxml parser
soup=BeautifulSoup(html,features='lxml')
print(soup.h1)
print('\n',soup.p)

# use find_all to get every <a> tag, but the real link lives inside each tag's href
# href can be treated as an attribute of the tag and read with a dict-style key
all_href=soup.find_all('a')
all_href=[l["href"] for l in all_href]
print('\n',all_href)



    
Output: the raw HTML of the page again, then soup.h1 (the heading "爬虫测试1"), soup.p (the test paragraph), and finally the list of hrefs:

['https://morvanzhou.github.io/', 'https://morvanzhou.github.io/tutorials/scraping']

Parsing with BeautifulSoup: CSS

# read the page first
from bs4 import BeautifulSoup
from urllib.request import urlopen

html=urlopen(
    "https://morvanzhou.github.io/static/scraping/list.html"
).read().decode('utf-8')
print(html)

# load the page into BeautifulSoup, using the lxml parser
soup=BeautifulSoup(html,features='lxml')



    
Output: the raw HTML of the list page; its visible text is the title "爬虫练习 列表 class | 莫烦 Python", the heading "列表 爬虫练习", a short intro paragraph, and the following nested month list:

  • 一月
    • 一月一号
    • 一月二号
    • 一月三号
  • 二月
  • 三月
  • 四月
  • 五月
# match by CSS class
month=soup.find_all('li',{"class":"month"})
for m in month:
    print(m.get_text())
一月
二月
三月
四月
五月
jan=soup.find('ul',{'class':'jan'})
d_jan=jan.find_all('li')
for d in d_jan:
    print(d.get_text())
一月一号
一月二号
一月三号

Parsing with BeautifulSoup: regular expressions

from  bs4 import BeautifulSoup
from urllib.request import urlopen
import re

html=urlopen(
    "https://morvanzhou.github.io/static/scraping/table.html"
).read().decode('utf-8')

soup=BeautifulSoup(html,features='lxml')

img_links=soup.find_all('img',{'src':re.compile(r'.*?\.jpg')}) # raw string so \. stays a literal dot
for link in img_links:
    print(link['src'])
https://morvanzhou.github.io/static/img/course_cover/tf.jpg
https://morvanzhou.github.io/static/img/course_cover/rl.jpg
https://morvanzhou.github.io/static/img/course_cover/scraping.jpg
course_links=soup.find_all('a',{'href':re.compile('https://morvan.*')})
for link in course_links:
    print(link['href'])
https://morvanzhou.github.io/
https://morvanzhou.github.io/tutorials/scraping
https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/
https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/
https://morvanzhou.github.io/tutorials/data-manipulation/scraping/

Exercise: crawling Baidu Baike

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import random

# set the start page; keep the /item/... urls in his as a record of the pages we have visited
base_url="https://baike.baidu.com"
his=["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]
# select the last sub url in 'his', print the title and url
url=base_url+his[-1]
html=urlopen(url).read().decode('utf-8')

soup=BeautifulSoup(html,features='lxml')
print(soup.find('h1').get_text(),'  url: ',his[-1])
网络爬虫   url:  /item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711
# find valid sub-page urls: <a> tags that open in a new tab and whose href is /item/ followed only by percent-encoded characters
sub_urls=soup.find_all('a',{'target':'_blank','href':re.compile('/item/(%.{2})+$')})

if len(sub_urls)!=0:
    his.append(random.sample(sub_urls,1)[0]['href'])
else:
    # no valid sub link found
    his.pop()
print(his)
['/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711', '/item/%E6%8E%92%E5%BA%8F%E7%AE%97%E6%B3%95']
# put it all in a for loop so the crawler keeps hopping between different pages
his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]

for i in range(20):
    url = base_url + his[-1]

    html = urlopen(url).read().decode('utf-8')
    soup = BeautifulSoup(html, features='lxml')
    print(i, soup.find('h1').get_text(), '    url: ', his[-1])

    # find valid urls
    sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})

    if len(sub_urls) != 0:
        his.append(random.sample(sub_urls, 1)[0]['href'])
    else:
        # no valid sub link found
        his.pop()

0 网络爬虫     url:  /item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711
1 搜索引擎     url:  /item/%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E
2 全文索引     url:  /item/%E5%85%A8%E6%96%87%E7%B4%A2%E5%BC%95
3 搜索引擎     url:  /item/%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E
4 网站     url:  /item/%E7%BD%91%E7%AB%99
5 通信协议     url:  /item/%E9%80%9A%E4%BF%A1%E5%8D%8F%E8%AE%AE
6 数据单元     url:  /item/%E6%95%B0%E6%8D%AE%E5%8D%95%E5%85%83
7 数据包     url:  /item/%E6%95%B0%E6%8D%AE%E5%8C%85
8 会话层     url:  /item/%E4%BC%9A%E8%AF%9D%E5%B1%82
9 数据流     url:  /item/%E6%95%B0%E6%8D%AE%E6%B5%81
10 聚类分析     url:  /item/%E8%81%9A%E7%B1%BB%E5%88%86%E6%9E%90
11 中数据     url:  /item/%E4%B8%AD%E6%95%B0%E6%8D%AE
12 中数据     url:  /item/%E4%B8%AD%E6%95%B0%E6%8D%AE
13 中数据     url:  /item/%E4%B8%AD%E6%95%B0%E6%8D%AE
14 企业邮箱     url:  /item/%E4%BC%81%E4%B8%9A%E9%82%AE%E7%AE%B1
15 域名     url:  /item/%E5%9F%9F%E5%90%8D
16 商标权     url:  /item/%E5%95%86%E6%A0%87%E6%9D%83
17 商标注册不当     url:  /item/%E5%95%86%E6%A0%87%E6%B3%A8%E5%86%8C%E4%B8%8D%E5%BD%93
18 中国驰名商标     url:  /item/%E9%A9%B0%E5%90%8D%E5%95%86%E6%A0%87
19 商标评审委员会     url:  /item/%E5%95%86%E6%A0%87%E8%AF%84%E5%AE%A1%E5%A7%94%E5%91%98%E4%BC%9A

Note, a Python error worth recording: IndexError: list index out of range
1. Possibility 1: list[index] where index is outside the valid range
2. Possibility 2: the list is empty, so even list[0] raises the error (a tiny demonstration follows)
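
A minimal, self-contained illustration of both cases (a toy list, unrelated to the crawler above):

months = ['一月', '二月', '三月']

try:
    months[5]                 # case 1: index 5 is outside the valid range 0..2
except IndexError as err:
    print('case 1:', err)     # list index out of range

try:
    [][0]                     # case 2: the list is empty, so even index 0 is invalid
except IndexError as err:
    print('case 2:', err)     # list index out of range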

More ways to request and download

The versatile requests module

Two ways of requesting a page: get and post
1. post: logging in to an account, submitting a search, uploading an image or file, sending data to the server, etc.
2. get: simply opening a page, without sending data to the server

From an active/passive point of view:
1. post sends data and is the active one: what you send shapes the content the server returns.
2. get fetches data and is passive: you send no personalised information, so the server does not tailor the HTML it returns (though, as the search example below shows, a get request can still carry query parameters in the url).

The requests module offers all these different request methods and is very convenient to use.

Installing requests

$ pip install requests # python 2+
$ pip3 install requests # python 3+

Requests documentation (Chinese)

requests GET request

# requests GET request
import requests
import webbrowser
param={'wd':'网络爬虫'} # the search keyword; "s?wd=网络爬虫" is the query part of the search url
r=requests.get('http://www.baidu.com/s',params=param)
print(r.url)
webbrowser.open(r.url)
http://www.baidu.com/s?wd=%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB





True

requests POST request

# requests POST request
# use python to submit the form and land on the page shown after submission
data={'firstname':'Morvan','lastname':'Zhou'}
r=requests.post('http://pythonscraping.com/files/processing.php',data=data)
print(r.text)
Hello there, Morvan Zhou!

Uploading an image

# uploading an image is just another kind of POST
file={'uploadFile':open('./images/webwxgetmsgimg.jpeg','rb')}
r=requests.post('http://pythonscraping.com/files/processing2.php',files=file)
print(r.text)
The file webwxgetmsgimg.jpeg has been uploaded.

Logging in

# log in
"""
What the browser does:
1. it logs in by sending a POST request to the url
2. the POST carries the username and password from the form data
3. some cookies are generated
"""
payload={'username':'Morvan','password':'password'}
r=requests.post('http://pythonscraping.com/pages/cookies/welcome.php',data=payload)
print(r.cookies.get_dict())
{'username': 'Morvan', 'loggedin': '1'}
r=requests.get('http://pythonscraping.com/pages/cookies/welcome.php',cookies=r.cookies)
print(r.text)

Welcome to the Website!

You have logged in successfully!
Check out your profile!

Logging in with a Session

# log in using a Session
session=requests.Session()
payload={'username':'Morvan','password':'password'}
r=session.post('http://pythonscraping.com/pages/cookies/welcome.php',data=payload)
print(r.cookies.get_dict())
{'username': 'Morvan', 'loggedin': '1'}
r=session.get('http://pythonscraping.com/pages/cookies/welcome.php')
print(r.text)

Welcome to the Website!

You have logged in successfully!
Check out your profile!

Downloading files

import os
os.makedirs('./images/',exist_ok=True) # create the folder (no error if it already exists)

IMAGE_URL='https://morvanzhou.github.io/static/img/description/learning_step_flowchart.png'

Using urlretrieve

from urllib.request import urlretrieve # urllib provides a download helper, urlretrieve
urlretrieve(IMAGE_URL,'./images/Image1.png')
('./images/Image1.png', <http.client.HTTPMessage object at 0x...>)

Using requests

import requests
r=requests.get(IMAGE_URL)
with open('./images/Image2.png','wb') as f:
    f.write(r.content)

r = requests.get(IMAGE_URL, stream=True)    # stream loading

with open('./images/Image3.png', 'wb') as f:
    for chunk in r.iter_content(chunk_size=32):
        f.write(chunk)

Exercise: downloading pretty pictures

from bs4 import BeautifulSoup
import requests

URL = "http://www.nationalgeographic.com.cn/animals/"
html = requests.get(URL).text
soup = BeautifulSoup(html, 'lxml')
img_ul = soup.find_all('ul', {"class": "img_list"})
for ul in img_ul:
    imgs = ul.find_all('img')
    for img in imgs:
        url = img['src']
        r = requests.get(url, stream=True)
        image_name = url.split('/')[-1]
        with open('./images/%s' % image_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=128):
                f.write(chunk)
        print('Saved %s' % image_name)
Saved 20180319103523967.jpg
Saved 20180315040949958.jpg
Saved 20180311101349652.jpg
Saved 20180306113348212.jpg
Saved 20180306103802944.jpg
Saved 20180305104502788.jpg

Speeding up your crawler

Distributed crawling with multiprocessing

import multiprocessing as mp
import time
from urllib.request import urlopen,urljoin
from bs4 import BeautifulSoup
import re

base_url = 'https://morvanzhou.github.io/'

def crawl(url):
    response=urlopen(url)
    time.sleep(0.1)
    return response.read().decode()

def parse(html):
    soup=BeautifulSoup(html,'lxml')
    urls=soup.find_all('a',{'href':re.compile('^/.+?/$')})
    title=soup.find('h1').get_text().strip()
    page_urls=set([urljoin(base_url,url['href']) for url in urls])
    url=soup.find('meta',{'property':'og:url'})['content']
    return title,page_urls,url

Testing the plain crawl

unseen = set([base_url,])
seen = set()

if base_url != "http://127.0.0.1:4000/":
    restricted_crawl = True
else:
    restricted_crawl = False

count, t1 = 1, time.time()

while len(unseen) != 0:                 # still get some url to visit
    if restricted_crawl and len(seen) > 20:
            break

    print('\nDistributed Crawling...')
    htmls = [crawl(url) for url in unseen]

    print('\nDistributed Parsing...')
    results = [parse(html) for html in htmls]

    print('\nAnalysing...')
    seen.update(unseen)         # seen the crawled
    unseen.clear()              # nothing unseen

    for title, page_urls, url in results:
        print(count, title, url)
        count += 1
        unseen.update(page_urls - seen)     # get new url to crawl
print('Total time: %.1f s' % (time.time()-t1, ))    # 53 s
Distributed Crawling...

Distributed Parsing...

Analysing...
1 教程 https://morvanzhou.github.io/

Distributed Crawling...

Distributed Parsing...

Analysing...
2 数据处理教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/
3 Matplotlib 画图教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/plt/
4 机器学习系列 https://morvanzhou.github.io/tutorials/machine-learning/
5 为了更优秀 https://morvanzhou.github.io/support/
6 高级爬虫: 高效无忧的 Scrapy 爬虫库 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/5-02-scrapy/
7 机器学习实践 https://morvanzhou.github.io/tutorials/machine-learning/ML-practice/
8 网页爬虫教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/
9 Why? https://morvanzhou.github.io/tutorials/data-manipulation/scraping/1-00-why/
10 multiprocessing 多进程教程系列 https://morvanzhou.github.io/tutorials/python-basic/multiprocessing/
11 Linux 简易教学 https://morvanzhou.github.io/tutorials/others/linux-basic/
12 关于莫烦 https://morvanzhou.github.io/about/
13 近期更新 https://morvanzhou.github.io/recent-posts/
14 高级爬虫: 让 Selenium 控制你的浏览器帮你爬 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/5-01-selenium/
15 进化算法 Evolutionary Strategies 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/evolutionary-algorithm/
16 Theano 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/theano/
17 Threading 多线程教程系列 https://morvanzhou.github.io/tutorials/python-basic/threading/
18 说吧~ https://morvanzhou.github.io/discuss/
19 有趣的机器学习系列 https://morvanzhou.github.io/tutorials/machine-learning/ML-intro/
20 Python基础 教程系列 https://morvanzhou.github.io/tutorials/python-basic/
21 Keras 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/keras/
22 Pytorch 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/torch/
23 推荐学习顺序 https://morvanzhou.github.io/learning-steps/
24 Git 版本管理 教程系列 https://morvanzhou.github.io/tutorials/others/git/
25 Tkinter GUI 教程系列 https://morvanzhou.github.io/tutorials/python-basic/tkinter/
26 其他教学系列 https://morvanzhou.github.io/tutorials/others/
27 基础教程系列 https://morvanzhou.github.io/tutorials/python-basic/basic/
28 Tensorflow 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/
29 迁移学习 Transfer Learning https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/5-16-transfer-learning/
30 Numpy & Pandas 教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/np-pd/
31 Sklearn 通用机器学习 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/sklearn/
32 强化学习 Reinforcement Learning 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/
33 迁移学习 Transfer Learning https://morvanzhou.github.io/tutorials/machine-learning/ML-intro/2-9-transfer-learning/
Total time: 19.8 s

Testing the distributed (multiprocessing) crawl

unseen = set([base_url,])
seen = set()

pool = mp.Pool(4)                       
count, t1 = 1, time.time()
while len(unseen) != 0:                 # still get some url to visit
    if restricted_crawl and len(seen) > 20:
            break
    print('\nDistributed Crawling...')
    crawl_jobs = [pool.apply_async(crawl, args=(url,)) for url in unseen]
    htmls = [j.get() for j in crawl_jobs]                                       # request connection

    print('\nDistributed Parsing...')
    parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]
    results = [j.get() for j in parse_jobs]                                     # parse html

    print('\nAnalysing...')
    seen.update(unseen)         # seen the crawled
    unseen.clear()              # nothing unseen

    for title, page_urls, url in results:
        print(count, title, url)
        count += 1
        unseen.update(page_urls - seen)     # get new url to crawl
print('Total time: %.1f s' % (time.time()-t1, ))    # 16 s !!!

Distributed Crawling...

Distributed Parsing...

Analysing...
1 教程 https://morvanzhou.github.io/

Distributed Crawling...

Distributed Parsing...

Analysing...
2 数据处理教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/
3 Matplotlib 画图教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/plt/
4 机器学习系列 https://morvanzhou.github.io/tutorials/machine-learning/
5 为了更优秀 https://morvanzhou.github.io/support/
6 高级爬虫: 高效无忧的 Scrapy 爬虫库 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/5-02-scrapy/
7 机器学习实践 https://morvanzhou.github.io/tutorials/machine-learning/ML-practice/
8 网页爬虫教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/
9 Why? https://morvanzhou.github.io/tutorials/data-manipulation/scraping/1-00-why/
10 multiprocessing 多进程教程系列 https://morvanzhou.github.io/tutorials/python-basic/multiprocessing/
11 Linux 简易教学 https://morvanzhou.github.io/tutorials/others/linux-basic/
12 关于莫烦 https://morvanzhou.github.io/about/
13 近期更新 https://morvanzhou.github.io/recent-posts/
14 高级爬虫: 让 Selenium 控制你的浏览器帮你爬 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/5-01-selenium/
15 进化算法 Evolutionary Strategies 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/evolutionary-algorithm/
16 Theano 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/theano/
17 Threading 多线程教程系列 https://morvanzhou.github.io/tutorials/python-basic/threading/
18 说吧~ https://morvanzhou.github.io/discuss/
19 有趣的机器学习系列 https://morvanzhou.github.io/tutorials/machine-learning/ML-intro/
20 Python基础 教程系列 https://morvanzhou.github.io/tutorials/python-basic/
21 Keras 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/keras/
22 Pytorch 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/torch/
23 推荐学习顺序 https://morvanzhou.github.io/learning-steps/
24 Git 版本管理 教程系列 https://morvanzhou.github.io/tutorials/others/git/
25 Tkinter GUI 教程系列 https://morvanzhou.github.io/tutorials/python-basic/tkinter/
26 其他教学系列 https://morvanzhou.github.io/tutorials/others/
27 基础教程系列 https://morvanzhou.github.io/tutorials/python-basic/basic/
28 Tensorflow 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/
29 迁移学习 Transfer Learning https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/5-16-transfer-learning/
30 Numpy & Pandas 教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/np-pd/
31 Sklearn 通用机器学习 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/sklearn/
32 强化学习 Reinforcement Learning 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/
33 迁移学习 Transfer Learning https://morvanzhou.github.io/tutorials/machine-learning/ML-intro/2-9-transfer-learning/
Total time: 7.8 s

Asynchronous loading with asyncio

Basic usage

# asyncio: asynchronous execution in a single thread; downloading a page and processing it need not happen back to back
# the time spent waiting for downloads is put to better use
import time

def job(t):
    print('Start job',t)
    time.sleep(t)
    print('Job',t,' takes',t,'s')

def main():
    [job(t) for t in range(1,3)]

t1=time.time()
main()
print('No async total time: ',time.time()-t1)
Start job 1
Job 1  takes 1 s
Start job 2
Job 2  takes 2 s
No async total time:  3.00662899017334
import asyncio

async def job(t): # an async version of the function
    print('Start job',t)
    await asyncio.sleep(t)  # wait t seconds; other tasks can run in the meantime
    print('Job',t,' takes',t,'s')

async def main(loop): # an async main
    tasks=[loop.create_task(job(t)) for t in range(1,3)] # create the tasks (not executed yet)
    await asyncio.wait(tasks) # run them and wait until they all finish

t1=time.time()
loop=asyncio.get_event_loop() # get the event loop
loop.run_until_complete(main(loop)) # run the loop until main() completes
print("Async total time:", time.time()-t1)
Start job 1
Start job 2
Job 1  takes 1 s
Job 2  takes 2 s
Async total time: 2.0041818618774414
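
As a side note (not from the original tutorial): on Python 3.7+ the same example can be written with asyncio.run(), which creates the event loop, runs the coroutine and closes the loop for you:

import asyncio
import time

async def job(t):
    print('Start job', t)
    await asyncio.sleep(t)                                  # yield control while waiting
    print('Job', t, ' takes', t, 's')

async def main():
    await asyncio.gather(*(job(t) for t in range(1, 3)))    # run both jobs concurrently

t1 = time.time()
asyncio.run(main())                                         # sets up and tears down the loop
print("Async total time:", time.time() - t1)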

aiohttp

$ pip3 install aiohttp

import requests

URL = 'https://morvanzhou.github.io/'

def normal():
    for i in range(2):
        r=requests.get(URL)
        url=r.url
        print(url)

t1=time.time()
normal()
print('Normal total time:',time.time()-t1)
https://morvanzhou.github.io/
https://morvanzhou.github.io/
Normal total time: 0.6391615867614746
import aiohttp

async def job(session):
    response=await session.get(URL) # await the response; other tasks can run in the meantime
    return str(response.url)

async def main(loop):
    async with aiohttp.ClientSession() as session: # the official docs recommend using a ClientSession
        tasks=[loop.create_task(job(session)) for _ in range(2)]
        finished,unfinished=await asyncio.wait(tasks)
        all_results=[r.result() for r in finished] # collect all the results
        print(all_results)

t1=time.time()
loop=asyncio.get_event_loop()
loop.run_until_complete(main(loop))
loop.close()
print('Async total time:',time.time()-t1)
['https://morvanzhou.github.io/', 'https://morvanzhou.github.io/']
Async total time: 0.30881452560424805

Comparison with the multiprocessing crawler


import aiohttp
import asyncio
import time
from bs4 import BeautifulSoup
from urllib.request import urljoin
import re
import multiprocessing as mp

base_url = "https://morvanzhou.github.io/"

# DON'T OVER CRAWL THE WEBSITE OR YOU MAY NEVER VISIT AGAIN
if base_url != "http://127.0.0.1:4000/":
    restricted_crawl = True
else:
    restricted_crawl = False


seen = set()
unseen = set([base_url])


def parse(html):
    soup = BeautifulSoup(html, 'lxml')
    urls = soup.find_all('a', {"href": re.compile('^/.+?/$')})
    title = soup.find('h1').get_text().strip()
    page_urls = set([urljoin(base_url, url['href']) for url in urls])
    url = soup.find('meta', {'property': "og:url"})['content']
    return title, page_urls, url


async def crawl(url, session):
    r = await session.get(url)
    html = await r.text()
    await asyncio.sleep(0.1)        # slightly delay for downloading
    return html


async def main(loop):
    pool = mp.Pool(8)               # slightly affected
    async with aiohttp.ClientSession() as session:
        count = 1
        while len(unseen) != 0:
            print('\nAsync Crawling...')
            tasks = [loop.create_task(crawl(url, session)) for url in unseen]
            finished, unfinished = await asyncio.wait(tasks)
            htmls = [f.result() for f in finished]

            print('\nDistributed Parsing...')
            parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]
            results = [j.get() for j in parse_jobs]

            print('\nAnalysing...')
            seen.update(unseen)
            unseen.clear()
            for title, page_urls, url in results:
                # print(count, title, url)
                unseen.update(page_urls - seen)
                count += 1

if __name__ == "__main__":
    t1 = time.time()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(loop))
    #loop.close()
    print("Async total time: ", time.time() - t1)
---------------------------------------------------------------------------

RuntimeError                              Traceback (most recent call last)

<ipython-input-...> in <module>()
     61     t1 = time.time()
     62     loop = asyncio.get_event_loop()
---> 63     loop.run_until_complete(main(loop))
     64     loop.close()
     65     print("Async total time: ", time.time() - t1)


~/anaconda2/envs/python35/lib/python3.5/asyncio/base_events.py in run_until_complete(self, future)
    441         Return the Future's result, or raise its exception.
    442         """
--> 443         self._check_closed()
    444 
    445         new_task = not futures.isfuture(future)


~/anaconda2/envs/python35/lib/python3.5/asyncio/base_events.py in _check_closed(self)
    355     def _check_closed(self):
    356         if self._closed:
--> 357             raise RuntimeError('Event loop is closed')
    358 
    359     def _asyncgen_finalizer_hook(self, agen):


RuntimeError: Event loop is closed

Note: in jupyter this happens because asyncio.get_event_loop() hands back the loop that was already closed by loop.close() in the aiohttp example above, so run_until_complete() refuses to run; creating a fresh loop with asyncio.new_event_loop() (or not closing the loop) avoids the error.
from urllib.request import urlopen, urljoin
from bs4 import BeautifulSoup
import multiprocessing as mp
import re
import time


def crawl(url):
    response = urlopen(url)
    time.sleep(0.1)             # slightly delay for downloading
    return response.read().decode()


def parse(html):
    soup = BeautifulSoup(html, 'lxml')
    urls = soup.find_all('a', {"href": re.compile('^/.+?/$')})
    title = soup.find('h1').get_text().strip()
    page_urls = set([urljoin(base_url, url['href']) for url in urls])
    url = soup.find('meta', {'property': "og:url"})['content']
    return title, page_urls, url


if __name__ == '__main__':
    base_url = 'https://morvanzhou.github.io/'
    #base_url = "http://127.0.0.1:4000/"

    # DON'T OVER CRAWL THE WEBSITE OR YOU MAY NEVER VISIT AGAIN
    if base_url != "http://127.0.0.1:4000/":
        restricted_crawl = True
    else:
        restricted_crawl = False

    unseen = set([base_url,])
    seen = set()

    pool = mp.Pool(8)                       # number strongly affected
    count, t1 = 1, time.time()
    while len(unseen) != 0:                 # still get some url to visit
        if restricted_crawl and len(seen) > 20:
            break
        print('\nDistributed Crawling...')
        crawl_jobs = [pool.apply_async(crawl, args=(url,)) for url in unseen]
        htmls = [j.get() for j in crawl_jobs]                                       # request connection
        htmls = [h for h in htmls if h is not None]     # remove None

        print('\nDistributed Parsing...')
        parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]
        results = [j.get() for j in parse_jobs]                                     # parse html

        print('\nAnalysing...')
        seen.update(unseen)
        unseen.clear()

        for title, page_urls, url in results:
            # print(count, title, url)
            count += 1
            unseen.update(page_urls - seen)

    print('Total time: %.1f s' % (time.time()-t1, ))

Advanced crawlers

Advanced crawling: let Selenium drive your browser for you

Selenium: it can control your browser and convincingly imitate the way a human "reads" a web page.

When to reach for Selenium:
- the ordinary methods cannot fetch the content you want
- the site plays "hide and seek" with you, i.e. too much of the content is rendered by JavaScript
- you need a crawler that browses like a human

Installing Selenium

$ pip install selenium # python 2+

$ pip3 install selenium # python 3+

Download the driver for Linux or macOS:

  • Chrome driver
  • Edge driver
  • Firefox driver
  • Safari driver

After downloading, Linux and macOS users need to put the "geckodriver" file into "/usr/bin" or "/usr/local/bin", by running in a terminal:

$ sudo cp <path to your geckodriver> /usr/local/bin

$ sudo chmod +x /usr/local/bin/geckodriver

Install the Firefox add-on Katalon Recorder to record repetitive browsing work

Open the add-on and click through the site as usual; every click is recorded into a log. When you are done, press the Export button to see the browsing code generated for you (it can export Python 2 code) and copy whatever parts you need.

import os

os.makedirs('./images/',exist_ok=True)
from selenium import webdriver

driver=webdriver.Firefox() # launch Firefox

# the steps copied from Katalon Recorder
driver.get("https://morvanzhou.github.io/")
driver.find_element_by_xpath(u"//img[@alt='强化学习 (Reinforcement Learning)']").click()
driver.find_element_by_link_text("About").click()
driver.find_element_by_link_text(u"赞助").click()
driver.find_element_by_link_text(u"教程 ▾").click()
driver.find_element_by_link_text(u"数据处理 ▾").click()
driver.find_element_by_link_text(u"网页爬虫").click()

# grab the page html; a screenshot can be taken as well
html=driver.page_source
driver.get_screenshot_as_file('./images/screenshot1.png')
driver.close()
print(html[:200])

  
  
  

# run selenium without popping up a browser window, i.e. let it work quietly in headless mode
from selenium.webdriver.firefox.options import Options

firefox_options=Options()
firefox_options.add_argument('--headless') # enable headless mode

driver=webdriver.Firefox(firefox_options=firefox_options)

# the steps copied from Katalon Recorder
driver.get("https://morvanzhou.github.io/")
driver.find_element_by_xpath(u"//img[@alt='强化学习 (Reinforcement Learning)']").click()
driver.find_element_by_link_text("About").click()
driver.find_element_by_link_text(u"赞助").click()
driver.find_element_by_link_text(u"教程 ▾").click()
driver.find_element_by_link_text(u"数据处理 ▾").click()
driver.find_element_by_link_text(u"网页爬虫").click()

# grab the page html; a screenshot can be taken as well
html=driver.page_source
driver.get_screenshot_as_file('./images/screenshot2.png')
driver.close()
print(html[:200])

Advanced crawling: Scrapy, the efficient and worry-free crawler library

$ pip3 install scrapy

Scrapy is a crawler framework rather than just a single crawler: it bundles crawling, data processing and data storage into a one-stop service.

import scrapy

class MofanSpider(scrapy.Spider):
    name = "mofan"
    start_urls = [
        'https://morvanzhou.github.io/',
    ]
    # unseen = set()
    # seen = set()      # we don't need these two as scrapy will deal with them automatically

    def parse(self, response):
        yield {     # return some results
            'title': response.css('h1::text').extract_first(default='Missing').strip().replace('"', ""),
            'url': response.url,
        }

        urls = response.css('a::attr(href)').re(r'^/.+?/$')     # find all sub urls
        for url in urls:
            yield response.follow(url, callback=self.parse)     # it will filter duplication automatically


# lastly, run this in terminal
# scrapy runspider 5-2-scrapy.py -o res.json
