Python Web Crawlers: Background Research

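Web Crawlers: Background Research
1. Check the robots.txt file to learn the crawling restrictions.
Most websites define a robots.txt file that states the restrictions on crawling the site. Checking this file before crawling reveals information such as crawl-delay and request-rate limits, which helps you avoid having your IP permanently banned.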
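As a minimal sketch of this check, the standard library's robots.txt parser can be fed the file's text directly. The sample rules, the crawler names, and the example.com URLs below are made up for illustration, and the code is written for Python 3, where the parser lives in urllib.robotparser.

```python
from urllib import robotparser

# Hypothetical robots.txt content, used here instead of a real download.
SAMPLE_ROBOTS_TXT = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow: /trap
"""


def build_parser(robots_text):
    """Parse robots.txt text and return a ready-to-query parser."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp


rp = build_parser(SAMPLE_ROBOTS_TXT)
# BadCrawler is banned from the whole site; other agents only from /trap.
print(rp.can_fetch("BadCrawler", "http://example.com/index.html"))
print(rp.can_fetch("GoodCrawler", "http://example.com/index.html"))
print(rp.can_fetch("GoodCrawler", "http://example.com/trap"))
# The delay (in seconds) requested for this user agent, if any.
print(rp.crawl_delay("GoodCrawler"))
```

Against a live site, rp.set_url(...) followed by rp.read() would replace the rp.parse(...) call.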
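2. Check the Sitemap file to learn the site's link structure.
A sitemap helps the crawler locate the site's newest content without having to crawl every page.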
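One possible way to read the links out of a sitemap is sketched below in Python 3. The sample XML and its example.com URLs are invented for illustration; the namespace is the one defined by the sitemaps.org protocol, and real sitemaps may be index files that point to further sitemaps, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sitemap content, used here instead of a real download.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/view/1</loc></url>
  <url><loc>http://example.com/view/2</loc></url>
</urlset>
"""


def sitemap_links(sitemap_xml):
    """Return every URL listed in a <loc> element of the sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


for link in sitemap_links(SAMPLE_SITEMAP):
    print(link)
```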
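3. Estimate the size of the website.
Method 1: use the results of Google's crawler directly. Search with the site keyword to restrict the results to one domain, for example site:website/url, and read the size estimate from the result count.
4. Identify the technology the website uses.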
pip install builtwith
The builtwith module can check which technologies a website is built with.

import builtwith
builtwith.parse('http://....')

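The call returns the detected technologies.
Generally, if a site uses the Web2py framework and common JavaScript libraries, its content is most likely embedded in the HTML and is relatively easy to scrape. If it is built with AngularJS, the content is probably loaded dynamically. If it uses ASP.NET, crawling will require session management and form submission.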
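The rules of thumb above could be turned into a small helper, sketched here in Python 3. The function name crawl_hints and the sample dictionary are hypothetical; the dictionary only imitates the category-to-names mapping that builtwith.parse returns, and the exact technology names reported for a real site may differ, which is why the matching is done on lowercase substrings.

```python
def crawl_hints(technologies):
    """Map detected technologies to rough crawling hints.

    `technologies` is a dict of category -> list of names, in the shape
    returned by builtwith.parse (assumed here, not fetched from a site).
    """
    names = [name.lower() for group in technologies.values() for name in group]
    hints = []
    if any("web2py" in name for name in names):
        hints.append("content is likely embedded in the HTML")
    if any("angularjs" in name for name in names):
        hints.append("content is probably loaded dynamically")
    if any("asp.net" in name for name in names):
        hints.append("expect session management and form submission")
    return hints


# Illustrative input only; not the result of a real builtwith call.
sample = {
    "javascript-frameworks": ["jQuery", "Modernizr"],
    "web-frameworks": ["Web2py", "Twitter Bootstrap"],
}
print(crawl_hints(sample))
```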
5. Find out who owns the website.
Use the python-whois module.

pip install python-whois
import whois
print whois.whois('....')

The returned value contains the domain's registration information.

A First Simple Web Crawler

Basic code:

# Download a web page
import urllib2
def download(url):
    return urllib2.urlopen(url).read()

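When given a URL, this function downloads the page and returns its HTML. However, errors outside our control, such as a page that does not exist, cause urllib2 to raise an exception. To be safe, modify the code so that it catches the exception.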

import urllib2
def download(url):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        # Report the failure and signal it to the caller with None
        print 'Download error:', e.reason
        html = None
    return html

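When an error occurs, the function returns None.
In addition, an overloaded server sometimes causes temporary errors (such as 503 Service Unavailable) that a retry can resolve. A 404 Not Found, by contrast, is a permanent error, so retrying will not help.
We can therefore adjust the code to retry the download when a 5xx error occurs.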

def download(url, num_retries=2):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # Retry only on 5xx HTTP errors
                return download(url, num_retries - 1)
    return html

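To make downloads more reliable, we sometimes need to set a user agent, because some sites block the default one that urllib2 sends. Modify the code: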

def download(url, user_agent='wswp', num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        # Open the Request object so the custom header is actually sent
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # Retry only on 5xx HTTP errors, keeping the same user agent
                return download(url, user_agent, num_retries - 1)
    return html

This download function is fairly flexible and can be adapted and reused in many kinds of crawlers.
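
For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the same idea could be sketched roughly as follows. The opener parameter is an assumption added here, not part of the function above; it lets the retry logic be exercised with a stand-in instead of a real network call.

```python
import urllib.error
import urllib.request


def download(url, user_agent="wswp", num_retries=2, opener=urllib.request.urlopen):
    """Return the page body as bytes, or None if the download fails."""
    print("Downloading:", url)
    request = urllib.request.Request(url, headers={"User-agent": user_agent})
    try:
        return opener(request).read()
    except urllib.error.URLError as e:
        print("Download error:", e.reason)
        # Only HTTP errors carry a `code`; 5xx is treated as temporary.
        if num_retries > 0 and 500 <= getattr(e, "code", 0) < 600:
            return download(url, user_agent, num_retries - 1, opener)
        return None
```

Calling download('http://example.com') uses the real urlopen; passing a fake opener that raises urllib.error.HTTPError makes it possible to check that 5xx responses are retried and 4xx responses are not.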
