A Scraping Beginner's First Post: Xici Proxy (西刺代理)

[Figure: flowchart of the Xici proxy scraping process]

Background

  1. Environment: Python 3.6
  2. Modules:
    1. urllib.request (fetch the HTML)
    2. chardet (detect the HTML's encoding)
    3. bs4.BeautifulSoup (extract the proxy IPs)
  3. GitHub repo: https://github.com/tonyxinminghui/spider/blob/master/xici_spider.py

Fetching the HTML

Challenges

  1. Which module to use to fetch the HTML.
  2. Xici will not return the correct HTML unless the request headers are forged; it usually responds with a 503 instead.
  3. In Python 3, strings in memory are Unicode, but the HTML we fetch comes back as encoded bytes, so it has to be decoded.
  4. Which leads to the next question: how to detect the HTML's encoding.

Solutions

  1. Given my rather limited knowledge, I went with urllib.request. (PS: people online say requests is the module "written for humans", with very friendly method names; unfortunately I am not familiar with it yet, so I may rewrite this with requests some day.)

  2. For forging headers, the relevant interface in urllib.request is
    class urllib.request.Request(url[, data][, headers][, origin_req_host][, unverifiable])
    Here is the shape of the header dictionary; see the references for the full parameter descriptions.

    forged_header = {
        'User-Agent': XXXX,
        'Referer'   : XXXX,
        'Host'      : spider_data['host'],
        'GET'       : "url"
    }  # a forged header for a GET request
  3. req = request.Request(url, headers=forged_header)
    html = request.urlopen(req, timeout=8)
    Note that urlopen() must be given the Request object (not the bare url), otherwise the forged headers are never sent; it returns a file-like response object.
    content = html.read()
    read() returns the raw HTML content as bytes.

  4. encode_form = chardet.detect(content)['encoding'] guesses the encoding of content, and
    content.decode(encode_form) decodes the HTML into Unicode. (A sketch that puts steps 1-4 together follows this list.)
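Here is a minimal end-to-end sketch of the fetch-and-decode steps above. The User-Agent and Referer values are placeholder assumptions, not the exact strings used in the original script:

import chardet
from urllib import request

def fetch_html(url, timeout=8):
    # Forge browser-like headers; Xici tends to answer bare scripts with a 503.
    forged_header = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',  # placeholder
        'Referer': 'http://www.xicidaili.com/',                              # placeholder
        'Host': 'www.xicidaili.com',
    }
    req = request.Request(url, headers=forged_header)
    resp = request.urlopen(req, timeout=timeout)    # file-like response object
    content = resp.read()                           # raw bytes
    encoding = chardet.detect(content)['encoding']  # guess the byte encoding
    return content.decode(encoding)                 # decode to a Unicode str

html_content = fetch_html('http://www.xicidaili.com/nn/')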

Extracting the proxy IPs

How to extract the data is really the crux of any scraper; the usual options are regular expressions or bs4.
I chose bs4.BeautifulSoup here, which is probably a bit simpler.
Extraction steps:
1. Open http://www.xicidaili.com/nn/ (the Xici proxy site) in Chrome, press F12, and find where the proxy IP information we need sits in the HTML.
2. Confirm the matching BeautifulSoup tags. It turns out the IP and port we need live in the td (table cell) elements of each tr (table row).
3.
soup = BeautifulSoup(html_content, 'lxml')
ips = soup.find_all('tr')      # find the rows first, then look inside each row
td = ips[i].find_all('td')     # extract the cells holding the IP and port
A fuller sketch of this extraction loop follows below.
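A minimal sketch of the extraction loop. The column indices (IP in the second cell, port in the third) are an assumption based on Xici's table layout at the time of writing; adjust them if the page changes:

from bs4 import BeautifulSoup

def extract_proxies(html_content):
    soup = BeautifulSoup(html_content, 'lxml')
    proxies = []
    for row in soup.find_all('tr'):          # every table row
        cells = row.find_all('td')
        if len(cells) > 2:                   # skip the header row, which has no td cells
            ip = cells[1].get_text(strip=True)
            port = cells[2].get_text(strip=True)
            proxies.append('%s:%s' % (ip, port))
    return proxies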

Verifying that the proxy IPs work

I looked up the steps for adding a proxy with urllib.request (apparently this is much simpler with requests):
proxy_handler = request.ProxyHandler({'http': 'ip'})
proxy_auth_handler = request.ProxyBasicAuthHandler()
opener = request.build_opener(proxy_handler, proxy_auth_handler)
request.install_opener(opener)
Note that ProxyHandler takes a dictionary mapping protocol names to proxy URLs; the detailed definitions of these four functions are in the references.
Once the opener is installed, fetch http://ip.chinaz.com/getip.aspx or http://icanhazip.com in the same way as before, and judge whether the proxy works by whether urlopen raises an exception.
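A minimal validation sketch along those lines. Here I call opener.open() directly instead of install_opener() plus urlopen(), which the docs quoted in reference 4 note is equivalent when the opener is only needed locally; the test URL and the 8-second timeout are assumptions:

from urllib import request

def proxy_works(ip_port, test_url='http://icanhazip.com', timeout=8):
    # Route the request through the candidate proxy; any exception means it failed.
    proxy_handler = request.ProxyHandler({'http': 'http://' + ip_port})
    opener = request.build_opener(proxy_handler)
    try:
        opener.open(test_url, timeout=timeout).read()
        return True
    except OSError:     # URLError, HTTPError and socket timeouts are all subclasses
        return False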

Saving the proxy IPs

I simply save them to a local file, one proxy per line, in ip:port format.
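A tiny sketch of the save step; the file name proxies.txt is just an assumed example:

def save_proxies(proxies, path='proxies.txt'):
    # Write one proxy per line in ip:port format.
    with open(path, 'w') as f:
        for p in proxies:
            f.write(p + '\n')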

References

  1. class urllib.request.ProxyHandler([proxies])
    Cause requests to go through a proxy. If proxies is given, it must be a dictionary mapping protocol names to URLs of proxies. The default is to read the list of proxies from the environment variables <protocol>_proxy. To disable autodetected proxy pass an empty dictionary.
  2. ProxyBasicAuthHandler.http_error_407(req, fp, code, msg, hdrs)
    Retry the request with authentication information, if available.
    (HTTP 407 means proxy authentication is required.)
  3. urllib.request.build_opener([handler, …])
    Return an OpenerDirector instance, which chains the handlers in the order given. handlers can be either instances of BaseHandler, or subclasses of BaseHandler (in which case it must be possible to call the constructor without any parameters). Instances of the following classes will be in front of the handlers, unless the handlers contain them, instances of them or subclasses of them: ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor.
    If the Python installation has SSL support (i.e., if the ssl module can be imported), HTTPSHandler will also be added.
    A BaseHandler subclass may also change its handler_order member variable to modify its position in the handlers list.
  4. urllib.request.install_opener(opener)
    Install an OpenerDirector instance as the default global opener. Installing an opener is only necessary if you want urlopen to use that opener; otherwise, simply call OpenerDirector.open() instead of urlopen(). The code does not check for a real OpenerDirector, and any class with the appropriate interface will work.
  5. class urllib.request.Request(url[, data][, headers][, origin_req_host][, unverifiable])
    This class is an abstraction of a URL request.
    url should be a string containing a valid URL.
    data may be a string specifying additional data to send to the server, or None if no such data is needed. Currently HTTP requests are the only ones that use data; the HTTP request will be a POST instead of a GET when the data parameter is provided. data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format.
    headers should be a dictionary, and will be treated as if add_header() was called with each key and value as arguments. This is often used to “spoof” the User-Agent header, which is used by a browser to identify itself – some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as “Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11”, while urllib‘s default user agent string is “Python-urllib/2.6” (on Python 2.6).
    The final two arguments are only of interest for correct handling of third-party HTTP cookies:
    origin_req_host should be the request-host of the origin transaction, as defined by RFC 2965. It defaults to http.cookiejar.request_host(self). This is the host name or IP address of the original request that was initiated by the user. For example, if the request is for an image in an HTML document, this should be the request-host of the request for the page containing the image.
    unverifiable should indicate whether the request is unverifiable, as defined by RFC 2965. It defaults to False. An unverifiable request is one whose URL the user did not have the option to approve. For example, if the request is for an image in an HTML document, and the user had no option to approve the automatic fetching of the image, this should be true.
  6. https://docs.python.org/3.0/library/ (consult the official documentation often)
