Automatically Monitoring a Target Website's Crawl Speed in Python, with Analysis of the Overall Network Environment

What it does: on Windows 7, measure how fast pages can be fetched with and without a proxy, then use the system's built-in ping and tracert commands to analyze the network environment end to end, locate the crawler's speed bottleneck, and troubleshoot network faults.

Four steps:

  • Step 1: measure the average speed of fetching the target pages without a proxy.
  • Step 2: measure the average speed of fetching the target pages through a proxy.
  • Step 3: test connectivity to the target site with the built-in ping command.
  • Step 4: inspect the connectivity of each routing hop with the built-in tracert command.

1. Environment

  • Python: 3.6.1
  • Python IDE: PyCharm
  • OS: Windows 7

2. Key factors that affect crawl speed

First, consider: which factors determine how fast a crawler can fetch pages from the target site?

  • First, local network bandwidth.
  • Second, the proxy's stability and its speed to the target site.
  • Third, the state of the target site itself.
  • Fourth, the state of the network links between the local machine, the proxy, and the target site.

3. What data should be monitored?

3.1. Speed of accessing the target site over the local network

Without a proxy, fetch several different detail pages of the target site and record the time each fetch takes, along with the success and failure rates.
For example:

# python 3.6.1
import requests
import datetime

detailUrl = "http://www.amazon.com"   # requests needs the scheme
startTime = datetime.datetime.now()
response = requests.get(url=detailUrl, timeout=30)
endTime = datetime.datetime.now()
usedTime = endTime - startTime
success_count = 0           # statusCode == 200
if response.status_code == 200:
    success_count += 1

Note:
1. When fetching over the local network, watch out for the target site's anti-crawling measures; it is best to add a delay such as time.sleep(5), as in the sketch below.
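
A minimal sketch of the measurement loop described above, assuming a small list of hand-picked detail URLs (the ASINs here are examples only):

# python 3.6.1 -- sketch: average fetch time over several detail pages
import time
import datetime
import requests

detailUrls = ["http://www.amazon.com/dp/B018A2RRG4",   # example ASINs
              "http://www.amazon.com/dp/B002GCJOC0"]
success_count = 0
totalSuccessTime = 0.0
for url in detailUrls:
    startTime = datetime.datetime.now()
    try:
        response = requests.get(url=url, timeout=30)
        usedTime = datetime.datetime.now() - startTime
        if response.status_code == 200:
            success_count += 1
            totalSuccessTime += usedTime.total_seconds()
    except Exception as e:
        print(f"Exception: url={url}, e:{e}")
    time.sleep(5)   # anti-crawling pause
if success_count:
    print(f"avg time per page: {totalSuccessTime / success_count:.3f}s, "
          f"success rate: {success_count / len(detailUrls):.0%}")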

3.2. Speed of accessing the target site through a proxy — for comparison with 3.1

Through a proxy, fetch different detail pages of the target site and record the time per page, plus the success and failure rates.
For example:

# python 3.6.1
import requests
import datetime

url = "http://www.amazon.com"               # requests needs the scheme
proxy = "http://user:password@host:port"    # placeholder; substitute your own proxy
proxies = {"http": proxy,
           "https": proxy}
header = {"User-Agent": "Mozilla/2.0 (compatible; Ask Jeeves/Teoma)"}
success_count = 0           # statusCode == 200
connectFail_count = 0       # statusCode != 200 or timeout
proxyFail_count = 0         # requests Exception
startTime = datetime.datetime.now()
try:
    s = requests.session()
    response = s.get(url=url, proxies=proxies, headers=header, timeout=30)
    endTime = datetime.datetime.now()
    usedTime = endTime - startTime
    if response.status_code == 200:
        success_count += 1
    else:
        connectFail_count += 1
except Exception as e:
    proxyFail_count += 1
    print(f"Exception: url={url}, e:{e}")

Note:
1. When going through a proxy, wrap the request in exception handling so the program does not die unexpectedly: proxies fail in all sorts of ways. A retry sketch follows.
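
One way to soften those proxy hiccups is to retry a few times before counting a URL as failed. A sketch (the retry count and backoff constants are arbitrary choices, not from the original code):

# sketch: retry a proxied GET a few times before giving up
import time
import requests

def get_via_proxy(url, proxy, retries=3, timeout=30):
    proxies = {"http": proxy, "https": proxy}
    for attempt in range(1, retries + 1):
        try:
            return requests.get(url, proxies=proxies, timeout=timeout)
        except requests.RequestException as e:
            print(f"attempt {attempt} failed: {e}")
            time.sleep(2 * attempt)   # back off a little longer each time
    return None   # every attempt failed; treat as a proxy failure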

3.3. Using ping to test connectivity to the target site

Use the system's built-in ping command to check connectivity to the target site and measure latency.
For example:

# python 3.6.1
import subprocess

ip_address = "www.amazon.com"
p = subprocess.Popen(["ping.exe", ip_address],
                     stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE,
                     stderr=subprocess.PIPE,
                     shell=True)
# On Windows the output must be decoded as gbk (Python defaults to utf-8,
# which turns the Chinese-locale output into mojibake)
cmdOut = p.stdout.read().decode('gbk')

Result: (screenshot of the ping output)

Note:
1. Rule of thumb: average latency under 100 ms indicates very good connectivity. Also keep an eye on the packet loss rate.
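
The parsing in section 4.5 below matches the Chinese-locale output of ping. On an English-locale Windows the summary lines read differently, so the extraction would look roughly like this (a sketch against the standard English summary format; cmdOut here is a hard-coded sample):

# sketch: extracting the same numbers from English-locale ping output
import re

cmdOut = ("Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),\n"
          "Minimum = 159ms, Maximum = 186ms, Average = 167ms")
received = re.search(r"Received = (\d+)", cmdOut)
lost = re.search(r"Lost = (\d+)", cmdOut)
avg = re.search(r"Average = (\d+)ms", cmdOut)
if received and lost and avg:
    print(f"received:{received.group(1)}, lost:{lost.group(1)}, "
          f"avg:{avg.group(1)}ms")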

3.4. Using tracert to inspect the route to the target site

Use the system's built-in tracert command to see how many hops it takes to reach the target site and the latency to each hop; this helps analyze the state of the whole network path.
For example:

# python 3.6.1
import subprocess

ip_address = "www.amazon.com"
p = subprocess.Popen(["tracert.exe", ip_address],
                     stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE,
                     stderr=subprocess.PIPE,
                     shell=True)
# On Windows the output must be decoded as gbk (Python defaults to utf-8,
# which turns the Chinese-locale output into mojibake)
cmdOut = p.stdout.read().decode('gbk')

Result: (screenshot of the tracert output)

Note:
- Visit http://www.ip138.com/ips138.asp?ip=52.84.239.51&action=2 to look up where an IP address is located; replace the ip value with the one you want to check.
- A hop that shows "Request timed out" does not mean the link is down; it means that router simply does not answer the probe packets, so nothing comes back.
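
Checking each hop against ip138 by hand is tedious; here is a small sketch that pulls the hop IPs out of cmdOut (the decoded tracert output from the snippet above) and prints a lookup URL for each:

# sketch: extract hop IPs from the tracert output and build lookup URLs
import re

hop_ips = []
for line in cmdOut.splitlines():
    found = re.findall(r"\d+\.\d+\.\d+\.\d+", line)
    if found:
        hop_ips.append(found[-1])   # the last match on a hop line is its address
for ip in hop_ips:
    print(f"http://www.ip138.com/ips138.asp?ip={ip}&action=2")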

What do these measurements tell us?

  • First, comparing the numbers from 3.1 and 3.2 shows where the crawler's bottleneck is. If the speeds with and without the proxy are about the same, the bottleneck is local bandwidth, and raising crawl speed means getting a faster local connection. Otherwise the proxy is dragging the crawler down, and you need a more stable, faster proxy (see the sketch after this list).
  • Second, if the crawler stalls, first check connectivity to the target site: the server itself may be down.
  • Third, if the ping results are poor and latency is high, the tracert output helps pin down which segment of the path is the problem before working out a concrete fix.
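
The first comparison can be put into code. A sketch that takes the two average per-page times (in seconds) computed by the functions in section 4; the 1.2 tolerance factor is an arbitrary assumption, not part of the original program:

# sketch: turn the proxy-vs-no-proxy comparison into a verdict
def diagnose(avg_with_proxy, avg_without_proxy, tolerance=1.2):
    if avg_with_proxy <= avg_without_proxy * tolerance:
        return "proxy adds little overhead: the bottleneck is local bandwidth"
    return "the proxy is slowing the crawler down: look for a faster proxy"

print(diagnose(4.30, 1.62))   # the averages from section 5 blame the proxy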

4. Code walkthrough

  • 4.1. Imports:
# python 3.6.1
import subprocess
import re
import time
import datetime
import functools
from ftplib import FTP
import os
import requests
import random
  • 4.2. Configuration:
# Configuration
domains = ["www.amazon.com", "www.amztracker.com"]  # domains to test
target_folder = "F:\\NetworkEnvDaily\\"             # where result files are written
ftp_url = "192.168.0.101"                           # FTP server address
ftp_port = 21                                       # FTP server port
ftpUploadFolder = "NetworkEnvup"                    # upload directory on the FTP server
ids_req = ["B018A2RRG4", "B002GCJOC0", "0071386211", "B00MHIKRIS"]  # ASINs used to build detail URLs
useragent = [
  'Mozilla/2.0 (compatible; Ask Jeeves/Teoma)',
  'Baiduspider ( http://www.baidu.com/search/spider.htm)',
  'FAST-WebCrawler/3.8 (crawler at trd dot overture dot com; http://www.alltheweb.com/help/webmaster/crawler)',
  'AdsBot-Google ( http://www.google.com/adsbot.html)',
  'Mozilla/5.0 (compatible; Googlebot/2.1;  http://www.google.com/bot.html)'
]
  • 4.3. A decorator that times function execution:
# Timing decorator
def timeDecorator(func):
    '''
    Decorator that logs when a function is entered and left, and how long it ran.
    :param func:
    :return:
    '''
    @functools.wraps(func)   # keeps the real func name in tracebacks instead of 'wrapper'
    def wrapper(*args, **kwargs):
        startTime = datetime.datetime.now()
        print(f"Enter func:{func.__name__} at {startTime}")
        res = func(*args, **kwargs)
        endTime = datetime.datetime.now()
        print(f"Leave func:{func.__name__} at {endTime}, usedTime: {endTime-startTime}")
        return res
    return wrapper
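
A quick sanity check of the decorator on a throwaway function (hypothetical; it relies on the time import from 4.1):

@timeDecorator
def slow_add(a, b):
    time.sleep(1)   # stand-in for real work
    return a + b

slow_add(1, 2)   # prints the Enter/Leave lines with the elapsed time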
  • 4.4. Log handling and URL construction helpers:
# Log helper
def myHandleLog(toSave, log):
    '''
    Append a log line to the accumulated string and print it. Since str is
    immutable, the updated string must be returned.
    :param toSave:
    :param log:
    :return:
    '''
    toSave += log
    toSave += '\n'
    print(log)
    return toSave

def getUrlsFromIds(domain, ids_lst):
    '''
    Build URLs; Amazon product detail pages have the form www.amazon.com/dp/ASIN_ID
    :param domain: domain of amazon
    :param ids_lst: batch of ASIN ids
    :return:
    '''
    urls_lst = [f"http://{domain}/dp/{ID}" for ID in ids_lst]
    return urls_lst
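
For example, with two of the ASINs from ids_req:

urls = getUrlsFromIds("www.amazon.com", ["B018A2RRG4", "0071386211"])
# -> ['http://www.amazon.com/dp/B018A2RRG4', 'http://www.amazon.com/dp/0071386211']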
  • 4.5. The ping function:
# Ping a host to test network connectivity
@timeDecorator
def get_ping_result(ip_address):
    '''
    Tests connectivity to the target site, and the local latency to it.
    :param ip_address: host to ping
    :return:
    '''
    p = subprocess.Popen(["ping.exe", ip_address],
                         stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE,
                         shell=True)
    # On Windows the output must be decoded as gbk (Python defaults to utf-8,
    # which turns the Chinese-locale output into mojibake)
    cmdOut = p.stdout.read().decode('gbk')

    # Regexes that pull the packet counts out of the (Chinese-locale) ping output
    re_receive = r"已接收 = \d"
    re_lost = r"丢失 = \d"
    re_ip = r"来自 [\d\.]+"
    match_receive = re.search(re_receive, cmdOut)
    match_lost = re.search(re_lost, cmdOut)
    match_ip = re.search(re_ip, cmdOut)

    receive_count = -1
    if match_receive:
        receive_count = int(match_receive.group()[6:])

    lost_count = -1
    if match_lost:
        lost_count = int(match_lost.group()[5:])

    reply_ip = "127.0.0.1"   # fallback if no reply address was found
    if match_ip:
        reply_ip = match_ip.group()[3:]

    # At least one reply received means the host is reachable; extract the latency
    if receive_count > 0:
        re_min_time = r'最短 = \d+ms'
        re_max_time = r'最长 = \d+ms'
        re_avg_time = r'平均 = \d+ms'

        match_min_time = re.search(re_min_time, cmdOut)
        min_time = int(match_min_time.group()[5:-2])

        match_max_time = re.search(re_max_time, cmdOut)
        max_time = int(match_max_time.group()[5:-2])

        match_avg_time = re.search(re_avg_time, cmdOut)
        avg_time = int(match_avg_time.group()[5:-2])
        return [reply_ip, receive_count, lost_count, min_time, max_time, avg_time]
    else:
        print(f"Network unreachable: cannot reach {ip_address}")
        return ["127.0.0.1", 0, 9999, 9999, 9999, 9999]
  • 4.6. The tracert function:
# tracert a host to see how each hop on the route behaves
@timeDecorator
def get_tracert_result(ip_address):
    p = subprocess.Popen(["tracert.exe", ip_address],
                         stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE,
                         shell=True)
    # On Windows the output must be decoded as gbk (Python defaults to utf-8,
    # which turns the Chinese-locale output into mojibake)
    cmdOut = p.stdout.read().decode('gbk')
    return cmdOut
  • 4.7. Measuring access speed through the Abuyun proxy:
# Build the dynamic proxy URL from the tunnel ID and secret issued by Abuyun.
# This is a paid service; the credentials shown are examples only and will not work.
def getProxyFromAbuyun(tunnel, secret):
    # proxy = f'http://AMAZONU62J941J01:[email protected]:9020'
    proxy = f'http://{tunnel}:{secret}@proxy.abuyun.com:9020'
    return proxy

# Measure how long fetching the target urls takes through a proxy
@timeDecorator
def statisticSpeedOfProxy(urls, proxy):
    '''
    Counts successes and failures when going through the proxy, and the average
    time per url for the successful fetches.
    :param urls: urls used for the test
    :param proxy: proxy to use
    :return: detailed log text
    '''
    detail_log = f"The result of use Abuyun proxy:{proxy}\n"
    proxies = {"http": proxy,
               "https": proxy}
    header = {"User-Agent": random.choice(useragent)}
    success_count = 0           # statusCode == 200
    connectFail_count = 0       # statusCode != 200 or timeout
    proxyFail_count = 0         # requests Exception
    totalSuccessTime = 0
    for url in urls:
        startTime = datetime.datetime.now()
        try:
            s = requests.session()
            response = s.get(url=url, proxies=proxies, headers=header, timeout=10)
            endTime = datetime.datetime.now()
            usedTime = endTime - startTime
            detail_log = myHandleLog(detail_log, f"request url:{url}, statusCode:"
                                                 f"{response.status_code}, usedTime:{usedTime}")
            if response.status_code == 200:
                success_count += 1
                totalSuccessTime += usedTime.total_seconds()
            else:
                connectFail_count += 1
        except Exception as e:
            proxyFail_count += 1
            detail_log = myHandleLog(detail_log, f"Exception: url={url}, e:{e}")
        time.sleep(1)   # pace the requests
    avgTime = "100000"  # sentinel value meaning "not reachable"
    if success_count != 0:
        avgTime = totalSuccessTime / success_count
    detail_log = myHandleLog(detail_log, f"Statistic_proxy, total:{len(urls)}: "
                                         f"success:{success_count}, "
                                         f"totalSuccessTime:{totalSuccessTime}, "
                                         f"avgTime:{avgTime}, "
                                         f"connectFail_count:{connectFail_count}, "
                                         f"proxyFail_count:{proxyFail_count}")
    return (len(urls), success_count, totalSuccessTime, avgTime, connectFail_count,
            proxyFail_count, detail_log)
  • 4.8. Measuring access speed without a proxy:
# Measure how long fetching the target urls takes without a proxy
@timeDecorator
def statisticSpeedWithoutProxy(urls):
    '''
    Counts successes and failures when no proxy is used, and the average time
    per url for the successful fetches. Be especially careful with the request
    interval here, or the site's anti-crawling measures may ban your IP.
    :param urls: urls used for the speed test
    :return: detailed log text
    '''
    detail_log = f"The result of not use proxy:\n"
    header = {"User-Agent": random.choice(useragent)}
    success_count = 0       # statusCode == 200
    connectFail_count = 0   # statusCode != 200 or timeout
    unknowFail_count = 0    # requests Exception
    totalSuccessTime = 0
    for url in urls:
        startTime = datetime.datetime.now()
        try:
            s = requests.session()
            response = s.get(url=url, headers=header, timeout=10)
            endTime = datetime.datetime.now()
            usedTime = endTime - startTime
            detail_log = myHandleLog(detail_log, f"request url:{url}, statusCode:"
                                                 f"{response.status_code}, usedTime:{usedTime}")
            if response.status_code == 200:
                success_count += 1
                totalSuccessTime += usedTime.total_seconds()
            else:
                connectFail_count += 1
        except Exception as e:
            unknowFail_count += 1
            detail_log = myHandleLog(detail_log, f"Exception: url={url}, e:{e}")
        time.sleep(5)  # pace the requests generously
    avgTime = "100001" # sentinel value meaning "not reachable"
    if success_count != 0:
        avgTime = totalSuccessTime / success_count
    detail_log = myHandleLog(detail_log, f"Statistic_No_proxy, total:{len(urls)}: "
                                         f"success:{success_count}, "
                                         f"totalSuccessTime:{totalSuccessTime}, "
                                         f"avgTime:{avgTime}, "
                                         f"connectFail_count:{connectFail_count}, "
                                         f"unknowFail_count:{unknowFail_count}")
    return (len(urls), success_count, totalSuccessTime, avgTime, connectFail_count,
            unknowFail_count, detail_log)
  • 4.9. Uploading the result file to the FTP server:
# Upload a file to the FTP server
@timeDecorator
def ftpUpload(filename, folder, ftp_url, ftp_port):
    '''
    :param filename: path of the file to upload
    :param folder: destination directory on the FTP server
    :param ftp_url: FTP server IP
    :param ftp_port: port number, 21 by default
    :return: status code
    '''
    startTime = datetime.datetime.now()
    print(f"Enter func ftpUpload, time:{startTime}")
    ftp = FTP()
    ftp.set_debuglevel(2)  # set debug level, detail info:2, close:0
    ftp.connect(ftp_url, ftp_port)
    ftp.login('', '')  # log in; empty strings mean anonymous login
    print(ftp.getwelcome())  # e.g. *welcome* '220 Microsoft FTP Service'
    ftp.cwd(folder)  # change to the destination directory on the FTP server
    bufsize = 1024  # buffer block size
    file_handler = open(filename, 'rb')  # open the local file for binary reading
    res = -1
    try:
        # swallow any error so the program does not abort
        res = ftp.storbinary(f"STOR {os.path.basename(filename)}", file_handler, bufsize)  # upload file
    except Exception as e:
        print(f"except: {e}, cannot upload file: {ftp_url}:{ftp_port} {filename}")
    finally:
        ftp.set_debuglevel(0)  # turn debug output off
        file_handler.close()
        ftp.quit()
    endTime = datetime.datetime.now()
    print(f"Upload done, leave func ftpUpload, time:{endTime}, usedTime:{endTime-startTime}")
    return res
  • 4.10. The main program:
if __name__ == '__main__':

    # name the output file by date (note: month/day are not zero-padded)
    MainRunTime = datetime.datetime.now()
    fileName = f"statisticNetworkEnv{MainRunTime.year}{MainRunTime.month}{MainRunTime.day}.txt"
    amazonDetailUrls = getUrlsFromIds("www.amazon.com", ids_req)
    conclusion_content = ""         # summary lines written to the file
    detail_content = ""             # detail lines written to the file

    # measure crawl speed through the proxy
    content = "\n###### Speed of get url page ######\nThe speed result of Abuyun proxy(amazon) for" \
              " amazon detail page: proxy = AMAZONU62J941J01:LUI52JRD425UFDDK\n"
    conclusion_content += content
    print(content)
    proxy = getProxyFromAbuyun("AMAZONU62J941J01", "LUI52JRD425UFDDK")
    res = statisticSpeedOfProxy(urls=amazonDetailUrls, proxy=proxy)
    # res: [len(urls), success_count, totalSuccessTime, avgTime, connectFail_count, proxyFail_count, detail_log]
    conclusion_content += f"Amazon Totalurls:{res[0]}, successCount:{res[1]}, totalSuccessTime:{res[2]}, " \
                          f"avgTime:{res[3]}, connectFailCount:{res[4]}, proxyFailCount:{res[5]}\n\n"
    detail_content += res[6]

    # measure the speed of crawling amazon detail pages without a proxy
    content = "The speed result of not use proxy for amazon detail page. \n"
    conclusion_content += content
    print(content)
    res = statisticSpeedWithoutProxy(urls=amazonDetailUrls)
    # res: [len(urls), success_count, totalSuccessTime, avgTime, connectFail_count, unknowFail_count, detail_log]
    conclusion_content += f"No_proxy Totalurls:{res[0]}, successCount:{res[1]}, totalSuccessTime:{res[2]}, " \
                          f"avgTime:{res[3]}, connectFailCount:{res[4]}, unknowFail_count:{res[5]}\n"
    detail_content += res[6]

    # ping tests
    content = f"\n##### Speed of ping ######\n"
    conclusion_content += content
    print(content)
    for domain in domains:
        content = f"The speed result of ping {domain}.\n"
        conclusion_content += content
        print(content)
        res = get_ping_result(domain)
        # res: [reply_ip, receive_count, lost_count, min_time, max_time, avg_time]
        content = f"result of ping {domain}:{res[0]}, receive_count:{res[1]}, lost_count:{res[2]}, " \
                  f"min_time:{res[3]}, max_time:{res[4]}, avg_time:{res[5]} \n\n"
        conclusion_content += content
        print(content)

    # tracert tests
    content = f"\n##### Speed of tracert ######\n"
    conclusion_content += content
    print(content)
    for domain in domains:
        content = f"The speed result of tracert {domain}. \n"
        conclusion_content += content
        print(content)
        res = get_tracert_result(domain)
        # res is the raw tracert output
        content = f"{res}\n\n"
        conclusion_content += content
        print(content)

    # print the aggregated results
    print(f"###### conclusion_content: \n {conclusion_content}")
    print(f"###### detail_content: \n {detail_content}")

    # write to file
    f = None
    try:
        f = open(f"{target_folder}{fileName}", "w")     # open in overwrite mode
        f.write(f"###### conclusion ######\n{conclusion_content}\n")
        f.write(f"###### detail ######\n{detail_content}\n")
    except Exception as e:
        print(f"Exception: {e}")
    finally:
        if f:
            f.close()
    print(f"fileName: {fileName}")

    # upload to the FTP server
    ftp_res = ftpUpload(f"{target_folder}{fileName}", ftpUploadFolder, ftp_url, ftp_port)
    print(f"ftp_res: {ftp_res}")

5. Results

5.1. Generated conclusion


###### conclusion #######

###### Speed of get url page ######
The speed result of Abuyun proxy(amazon) for amazon detail page: proxy = AMAZONU62J941J01:LUI52JRD425UFDDK
Amazon Totalurls:4, successCount:4, totalSuccessTime:17.214000000000002, avgTime:4.3035000000000005, connectFailCount:0, proxyFailCount:0

The speed result of not use proxy for amazon detail page. 
No_proxy Totalurls:4, successCount:4, totalSuccessTime:6.484999999999999, avgTime:1.6212499999999999, connectFailCount:0, proxyFailCount:0

##### Speed of ping ######
The speed result of ping www.amazon.com.
result of ping www.amazon.com:52.84.239.51, receive_count:4, lost_count:0, min_time:159, max_time:186, avg_time:167 

The speed result of ping www.amztracker.com.
result of ping www.amztracker.com:104.20.194.28, receive_count:4, lost_count:0, min_time:173, max_time:173, avg_time:173 


##### Speed of tracert ######
The speed result of tracert www.amazon.com. 

Tracing route to d3ag4hukkh62yn.cloudfront.net [52.84.239.51]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  192.168.0.1 
  2     2 ms    44 ms     2 ms  100.64.0.1 
  3     2 ms     2 ms     2 ms  113.106.42.53 
  4     3 ms     3 ms     3 ms  17.107.38.59.broad.fs.gd.dynamic.163data.com.cn [59.38.107.17] 
  5     9 ms     7 ms     8 ms  183.56.65.62 
  6     5 ms     8 ms     7 ms  202.97.33.202 
  7     *       13 ms     7 ms  202.97.94.102 
  8   220 ms   219 ms   224 ms  202.97.51.106 
  9   160 ms   159 ms   159 ms  202.97.49.106 
 10   274 ms   274 ms   274 ms  218.30.53.2 
 11   171 ms   170 ms   165 ms  54.239.103.88 
 12   182 ms   185 ms   190 ms  54.239.103.97 
 13     *        *        *     Request timed out.
 14   157 ms   158 ms   158 ms  54.239.41.97 
 15   176 ms   176 ms   177 ms  205.251.230.89 
 16     *        *        *     Request timed out.
 17     *        *        *     Request timed out.
 18     *        *        *     Request timed out.
 19   159 ms   159 ms   159 ms  server-52-84-239-51.sfo5.r.cloudfront.net [52.84.239.51] 

Trace complete.

The speed result of tracert www.amztracker.com. 


Tracing route to www.amztracker.com [104.20.194.28]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  192.168.0.1 
  2     2 ms     2 ms     2 ms  100.64.0.1 
  3     2 ms     2 ms     2 ms  119.145.222.101 
  4     3 ms     3 ms     2 ms  183.56.67.193 
  5     5 ms     8 ms     7 ms  183.56.65.58 
  6     9 ms     8 ms    10 ms  202.97.33.214 
  7     5 ms     7 ms     7 ms  202.97.91.190 
  8     *        *        *     Request timed out.
  9   167 ms   167 ms   167 ms  202.97.50.70 
 10   161 ms   162 ms   161 ms  218.30.54.70 
 11   190 ms   192 ms   199 ms  et-0-0-71-1.cr5-sjc1.ip4.gtt.net [89.149.137.2] 
 12   163 ms   167 ms   175 ms  ip4.gtt.net [173.205.51.142] 
 13   157 ms   158 ms   158 ms  104.20.194.28 

Trace complete.


################# detail #################
The result of use Abuyun proxy:http://AMAZONU62J941J01:LUI52JRD425UFDDK@proxy.abuyun.com:9020
request url:http://www.amazon.com/dp/B018A2RRG4, statusCode:200, usedTime:0:00:04.690000
request url:http://www.amazon.com/dp/B002GCJOC0, statusCode:200, usedTime:0:00:04.777000
request url:http://www.amazon.com/dp/0071386211, statusCode:200, usedTime:0:00:04.592000
request url:http://www.amazon.com/dp/B00MHIKRIS, statusCode:200, usedTime:0:00:03.155000
Statistic_proxy, total:4: success:4, totalSuccessTime:17.214000000000002, avgTime:4.3035000000000005, connectFail_count:0, proxyFail_count:0
The result of not use proxy:
request url:http://www.amazon.com/dp/B018A2RRG4, statusCode:200, usedTime:0:00:01.574000
request url:http://www.amazon.com/dp/B002GCJOC0, statusCode:200, usedTime:0:00:01.394000
request url:http://www.amazon.com/dp/0071386211, statusCode:200, usedTime:0:00:01.473000
request url:http://www.amazon.com/dp/B00MHIKRIS, statusCode:200, usedTime:0:00:02.044000
Statistic_No_proxy, total:4: success:4, totalSuccessTime:6.484999999999999, avgTime:1.6212499999999999, connectFail_count:0, proxyFail_count:0

Analysis:
Network connectivity is good; the crawler's speed bottleneck is the proxy's access speed.

5.2. Log output and performance

###### Speed of get url page ######
The speed result of Abuyun proxy(amazon) for amazon detail page: proxy = AMAZONU62J941J01:LUI52JRD425UFDDK

Enter func:statisticSpeedOfProxy at 2017-08-19 12:34:41.199200
request url:http://www.amazon.com/dp/B018A2RRG4, statusCode:200, usedTime:0:00:04.690000
request url:http://www.amazon.com/dp/B002GCJOC0, statusCode:200, usedTime:0:00:04.777000
request url:http://www.amazon.com/dp/0071386211, statusCode:200, usedTime:0:00:04.592000
request url:http://www.amazon.com/dp/B00MHIKRIS, statusCode:200, usedTime:0:00:03.155000
Statistic_proxy, total:4: success:4, totalSuccessTime:17.214000000000002, avgTime:4.3035000000000005, connectFail_count:0, proxyFail_count:0
Leave func:statisticSpeedOfProxy at 2017-08-19 12:35:18.413200, usedTime: 0:00:37.214000

Enter func:statisticSpeedWithoutProxy at 2017-08-19 12:35:55.528200
request url:http://www.amazon.com/dp/B018A2RRG4, statusCode:200, usedTime:0:00:01.574000
request url:http://www.amazon.com/dp/B002GCJOC0, statusCode:200, usedTime:0:00:01.394000
request url:http://www.amazon.com/dp/0071386211, statusCode:200, usedTime:0:00:01.473000
request url:http://www.amazon.com/dp/B00MHIKRIS, statusCode:200, usedTime:0:00:02.044000
Statistic_No_proxy, total:4: success:4, totalSuccessTime:6.484999999999999, avgTime:1.6212499999999999, connectFail_count:0, proxyFail_count:0
Leave func:statisticSpeedWithoutProxy at 2017-08-19 12:39:22.014200, usedTime: 0:03:26.486000

##### Speed of ping ######

The speed result of ping www.amazon.com.

Enter func:get_ping_result at 2017-08-19 12:39:22.014200
Leave func:get_ping_result at 2017-08-19 12:39:25.200200, usedTime: 0:00:03.186000
result of ping www.amazon.com:52.84.239.51, receive_count:4, lost_count:0, min_time:159, max_time:186, avg_time:167 


The speed result of ping www.amztracker.com.

Enter func:get_ping_result at 2017-08-19 12:39:25.200200
Leave func:get_ping_result at 2017-08-19 12:39:28.397200, usedTime: 0:00:03.197000
result of ping www.amztracker.com:104.20.194.28, receive_count:4, lost_count:0, min_time:173, max_time:173, avg_time:173 



##### Speed of tracert ######

The speed result of tracert www.amazon.com. 

Enter func:get_tracert_result at 2017-08-19 12:39:28.397200
Leave func:get_tracert_result at 2017-08-19 12:42:55.191200, usedTime: 0:03:26.794000

Tracing route to d3ag4hukkh62yn.cloudfront.net [52.84.239.51]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  192.168.0.1 
  2     2 ms    44 ms     2 ms  100.64.0.1 
  3     2 ms     2 ms     2 ms  113.106.42.53 
  4     3 ms     3 ms     3 ms  17.107.38.59.broad.fs.gd.dynamic.163data.com.cn [59.38.107.17] 
  5     9 ms     7 ms     8 ms  183.56.65.62 
  6     5 ms     8 ms     7 ms  202.97.33.202 
  7     *       13 ms     7 ms  202.97.94.102 
  8   220 ms   219 ms   224 ms  202.97.51.106 
  9   160 ms   159 ms   159 ms  202.97.49.106 
 10   274 ms   274 ms   274 ms  218.30.53.2 
 11   171 ms   170 ms   165 ms  54.239.103.88 
 12   182 ms   185 ms   190 ms  54.239.103.97 
 13     *        *        *     Request timed out.
 14   157 ms   158 ms   158 ms  54.239.41.97 
 15   176 ms   176 ms   177 ms  205.251.230.89 
 16     *        *        *     Request timed out.
 17     *        *        *     Request timed out.
 18     *        *        *     Request timed out.
 19   159 ms   159 ms   159 ms  server-52-84-239-51.sfo5.r.cloudfront.net [52.84.239.51] 

Trace complete.



The speed result of tracert www.amztracker.com. 

Enter func:get_tracert_result at 2017-08-19 12:42:55.191200
Leave func:get_tracert_result at 2017-08-19 12:45:02.005200, usedTime: 0:02:06.814000

Tracing route to www.amztracker.com [104.20.194.28]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  192.168.0.1 
  2     2 ms     2 ms     2 ms  100.64.0.1 
  3     2 ms     2 ms     2 ms  119.145.222.101 
  4     3 ms     3 ms     2 ms  183.56.67.193 
  5     5 ms     8 ms     7 ms  183.56.65.58 
  6     9 ms     8 ms    10 ms  202.97.33.214 
  7     5 ms     7 ms     7 ms  202.97.91.190 
  8     *        *        *     Request timed out.
  9   167 ms   167 ms   167 ms  202.97.50.70 
 10   161 ms   162 ms   161 ms  218.30.54.70 
 11   190 ms   192 ms   199 ms  et-0-0-71-1.cr5-sjc1.ip4.gtt.net [89.149.137.2] 
 12   163 ms   167 ms   175 ms  ip4.gtt.net [173.205.51.142] 
 13   157 ms   158 ms   158 ms  104.20.194.28 

Trace complete.



####### conclusion_content: 

###### Speed of get url page ######
The speed result of Abuyun proxy(amazon) for amazon detail page: proxy = AMAZONU62J941J01:LUI52JRD425UFDDK
Amazon Totalurls:4, successCount:4, totalSuccessTime:17.214000000000002, avgTime:4.3035000000000005, connectFailCount:0, proxyFailCount:0

The speed result of not use proxy for amazon detail page. 
No_proxy Totalurls:4, successCount:4, totalSuccessTime:6.484999999999999, avgTime:1.6212499999999999, connectFailCount:0, proxyFailCount:0

##### Speed of ping ######
The speed result of ping www.amazon.com.
result of ping www.amazon.com:52.84.239.51, receive_count:4, lost_count:0, min_time:159, max_time:186, avg_time:167 

The speed result of ping www.amztracker.com.
result of ping www.amztracker.com:104.20.194.28, receive_count:4, lost_count:0, min_time:173, max_time:173, avg_time:173 


##### Speed of tracert ######
The speed result of tracert www.amazon.com. 

Tracing route to d3ag4hukkh62yn.cloudfront.net [52.84.239.51]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  192.168.0.1 
  2     2 ms    44 ms     2 ms  100.64.0.1 
  3     2 ms     2 ms     2 ms  113.106.42.53 
  4     3 ms     3 ms     3 ms  17.107.38.59.broad.fs.gd.dynamic.163data.com.cn [59.38.107.17] 
  5     9 ms     7 ms     8 ms  183.56.65.62 
  6     5 ms     8 ms     7 ms  202.97.33.202 
  7     *       13 ms     7 ms  202.97.94.102 
  8   220 ms   219 ms   224 ms  202.97.51.106 
  9   160 ms   159 ms   159 ms  202.97.49.106 
 10   274 ms   274 ms   274 ms  218.30.53.2 
 11   171 ms   170 ms   165 ms  54.239.103.88 
 12   182 ms   185 ms   190 ms  54.239.103.97 
 13     *        *        *     Request timed out.
 14   157 ms   158 ms   158 ms  54.239.41.97 
 15   176 ms   176 ms   177 ms  205.251.230.89 
 16     *        *        *     Request timed out.
 17     *        *        *     Request timed out.
 18     *        *        *     Request timed out.
 19   159 ms   159 ms   159 ms  server-52-84-239-51.sfo5.r.cloudfront.net [52.84.239.51] 

Trace complete.
The speed result of tracert www.amztracker.com. 

Tracing route to www.amztracker.com [104.20.194.28]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  192.168.0.1 
  2     2 ms     2 ms     2 ms  100.64.0.1 
  3     2 ms     2 ms     2 ms  119.145.222.101 
  4     3 ms     3 ms     2 ms  183.56.67.193 
  5     5 ms     8 ms     7 ms  183.56.65.58 
  6     9 ms     8 ms    10 ms  202.97.33.214 
  7     5 ms     7 ms     7 ms  202.97.91.190 
  8     *        *        *     Request timed out.
  9   167 ms   167 ms   167 ms  202.97.50.70 
 10   161 ms   162 ms   161 ms  218.30.54.70 
 11   190 ms   192 ms   199 ms  et-0-0-71-1.cr5-sjc1.ip4.gtt.net [89.149.137.2] 
 12   163 ms   167 ms   175 ms  ip4.gtt.net [173.205.51.142] 
 13   157 ms   158 ms   158 ms  104.20.194.28 

Trace complete.

fileName: statisticNetworkEnv2017819.txt
Enter func:ftpUpload at 2017-08-19 12:45:02.007200
Enter func ftpUpload, time:2017-08-19 12:45:02.007200
