Web Crawler Basics

Introduction to URLs

(Figure: URL介绍.png — the structure of a URL)
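The figure's point can be sketched in code: a URL decomposes into scheme, host, port, path, query, and fragment. This uses Python 3's urllib.parse (in Python 2 the same function lives in the urlparse module); the URL is just a made-up example:

```python
from urllib.parse import urlparse

# Decompose a sample URL into its named components.
url = 'http://www.example.com:8080/path/page.html?key=value#section'
parts = urlparse(url)

print(parts.scheme)    # protocol: 'http'
print(parts.hostname)  # host: 'www.example.com'
print(parts.port)      # port: 8080
print(parts.path)      # resource path: '/path/page.html'
print(parts.query)     # query string: 'key=value'
print(parts.fragment)  # fragment: 'section'
```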

Making Requests

How to make a request with urllib2 is shown in the figure below:


(Figure: 通过urllib2完成请求.png — making a request with urllib2)
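A request pairs a URL with headers (and optionally a body). A minimal sketch of building one and inspecting it before sending — the URL and User-Agent string here are placeholders, and the try/except import lets the same snippet run on Python 2 (urllib2) or Python 3 (urllib.request):

```python
try:
    import urllib2 as request_mod          # Python 2
except ImportError:
    import urllib.request as request_mod  # Python 3

# Build a request object; nothing is sent until urlopen() is called.
req = request_mod.Request(
    'http://www.example.com/index.html',
    headers={'User-Agent': 'my-crawler/0.1'},
)

print(req.get_method())              # 'GET' when no data/body is attached
print(req.get_full_url())            # the target URL
print(req.get_header('User-agent'))  # header keys are stored capitalized
```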

Using the HTTP PUT and DELETE methods

import urllib2
request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT'  # or 'DELETE'
response = urllib2.urlopen(request)
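The lambda trick above is needed because urllib2's Request has no method argument. In Python 3.3+, urllib.request.Request accepts a method= parameter directly — a sketch (the URL is a placeholder and nothing is actually sent):

```python
import urllib.request

# In Python 3 the HTTP verb can be passed directly to Request.
req = urllib.request.Request(
    'http://www.example.com/resource',
    data=b'payload',
    method='PUT',  # or 'DELETE'
)

print(req.get_method())  # 'PUT'
```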

Exception Handling

from urllib2 import Request, urlopen, URLError, HTTPError

req = Request('http://www.jianshu.com/users/92a1227beb27/latest_articles')

try:
    response = urlopen(req)
except URLError, e:
    if hasattr(e, 'code'):
        # HTTPError: the server responded, but with an error status.
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    elif hasattr(e, 'reason'):
        # Plain URLError: we never reached the server at all.
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason

Data Parsing

(Figure: 数据解析.png — data-parsing approaches)

A test example I wrote:

# coding:utf-8
import urllib2
from lxml import etree
import sys
print sys.getdefaultencoding()
reload(sys)
sys.setdefaultencoding('utf-8')  # Python 2 workaround to avoid UnicodeDecodeError

# The site's data is complex; no complete handling strategy yet
def oper(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
    }
    req = urllib2.Request(url=url,headers=headers)
    try:
        response = urllib2.urlopen(req)
    except urllib2.URLError, e:
        print e.reason
        return  # bail out; otherwise 'response' below would be undefined
    html = response.read()
    myparser = etree.HTMLParser(encoding="utf-8")
    selector = etree.HTML(html, parser=myparser)
    stainfos = selector.xpath('//input[@name="stainfo"]/@value')
    for stainfo in stainfos:
        print stainfo
    stanames = selector.xpath('//input[@name="staname"]/@value')
    for staname in stanames:
        print staname
    stainfodbys = selector.xpath('//input[@name="stainfodby"]/@value')
    for stainfodby in stainfodbys:
        print stainfodby


def start():
    urls = ['http://58.68.130.147/#']
    for url in urls:
        oper(url)



if __name__ == '__main__':
    start()
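The script above depends on a live site that may change or disappear. The same attribute-extraction pattern can be exercised offline on an inline snippet — sketched here with the stdlib xml.etree.ElementTree (which supports the [@name='...'] predicate used above) so it runs without lxml, on a small well-formed stand-in for the page's hidden input fields:

```python
import xml.etree.ElementTree as ET

# A well-formed stand-in for the page's hidden <input> elements.
html = """
<form>
    <input name="stainfo" value="station-001"/>
    <input name="stainfo" value="station-002"/>
    <input name="staname" value="Beijing"/>
</form>
"""

root = ET.fromstring(html)

# Equivalent of //input[@name="stainfo"]/@value in the limited
# XPath subset that ElementTree understands.
stainfos = [node.get('value') for node in root.findall(".//input[@name='stainfo']")]
stanames = [node.get('value') for node in root.findall(".//input[@name='staname']")]

print(stainfos)  # ['station-001', 'station-002']
print(stanames)  # ['Beijing']
```

lxml's selector.xpath('...//@value') returns the attribute strings directly; with ElementTree you select the elements and read .get('value') yourself.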
