day61-Spider

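I. HTTP Fundamentals

1. URI and URL
URI: Uniform Resource Identifier
URL: Uniform Resource Locator

Note: URLs are a subset of URIs. Every URL is a URI, but a URI that only names a resource without saying where to fetch it is not a URL.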
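To make the URI/URL distinction concrete, here is a small sketch using the standard library's urllib.parse module. The query string, the fragment, and the ISBN value are made up purely for illustration.

```python
from urllib.parse import urlsplit

# A hypothetical URL used only for illustration.
url = "https://www.baidu.com/s?wd=python#top"

parts = urlsplit(url)
print(parts.scheme)    # protocol used to reach the resource
print(parts.netloc)    # host (and optional port)
print(parts.path)      # location of the resource on that host
print(parts.query)     # extra parameters
print(parts.fragment)  # position inside the document

# A URN such as this ISBN is a URI but not a URL: it names a resource
# without telling a client which host to contact.
urn = urlsplit("urn:isbn:0451450523")
print(urn.scheme, repr(urn.netloc))
```

A URL always carries enough information (scheme plus host) to locate the resource, which is why it is the form a crawler works with.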

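2. HTTP and HTTPS
HTTP: Hypertext Transfer Protocol.
HTTPS: HTTP with an SSL layer added (today its successor, TLS); the transmitted content is encrypted by that layer.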
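One visible difference between the two schemes is the port a client connects to by default. The following is a minimal sketch using constants from the standard library; the helper name effective_port is hypothetical and not part of any library.

```python
from http.client import HTTP_PORT, HTTPS_PORT
from urllib.parse import urlsplit


def effective_port(url):
    """Return the port a client would connect to for an http/https URL."""
    parts = urlsplit(url)
    if parts.port is not None:
        # An explicit port in the URL overrides the scheme default.
        return parts.port
    return HTTPS_PORT if parts.scheme == "https" else HTTP_PORT


print(effective_port("http://www.baidu.com/"))
print(effective_port("https://www.baidu.com/"))
```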


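II. The requests library

1. requests is a high-level library built on top of HTTP. It has two main parts:

1. request: handles the request sent by the client

2. response: handles the response returned by the server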
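The request side can be inspected without touching the network by building a request and preparing it. This is only a sketch: the query parameter and the User-Agent value below are placeholders chosen for illustration.

```python
import requests

# Build a request without sending it (no network access needed).
request = requests.Request(
    "GET",
    "http://www.baidu.com/s",
    params={"wd": "python"},         # hypothetical query parameter
    headers={"User-Agent": "demo"},  # hypothetical header value
)
prepared = request.prepare()

print(prepared.method)
print(prepared.url)
print(prepared.headers["User-Agent"])

# Sending the prepared request is what produces the response object:
# with requests.Session() as session:
#     response = session.send(prepared, timeout=10)
```

Calling requests.get(...) performs these steps in one go; splitting them apart is mainly useful for seeing exactly what will be sent.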

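2. Getting response information

import requests

# Fetch the page content
response = requests.get('http://www.baidu.com/')
html = response.text

print(html)

# Read the response status code and headers
print(response.status_code)

# response.headers is a case-insensitive, dict-like object
print(response.headers)

print(response.headers.get("Server"))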
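The fields read above can be gathered into one small helper. The sketch below is an assumption-laden illustration: describe is a hypothetical function name, and the stand-in object with its header values exists only so the example runs offline. A real requests response exposes the same three attributes.

```python
from types import SimpleNamespace


def describe(response):
    """Collect commonly used fields from a requests-style response object."""
    return {
        "status": response.status_code,
        "ok": 200 <= response.status_code < 300,
        "server": response.headers.get("Server"),
        "content_type": response.headers.get("Content-Type"),
        "size": len(response.content),  # body length in bytes
    }


# Stand-in for a real response, with made-up values.
fake = SimpleNamespace(
    status_code=200,
    headers={"Server": "demo-server", "Content-Type": "text/html"},
    content=b"<html></html>",
)
info = describe(fake)
print(info)
```

With a live request, describe(requests.get(url, timeout=10)) would be used the same way.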

3. Setting request headers

import requests


def get_page():
    url = 'http://www.baidu.com/'
    headers = {
        "User-Agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)"
    }

    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None
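When many pages are fetched with the same headers, a requests.Session can carry them so they do not have to be passed on every call. A minimal sketch, assuming the same User-Agent as above; build_session and the Referer value are hypothetical additions for illustration.

```python
import requests

DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)"
}


def build_session(extra_headers=None):
    """Create a Session whose requests all carry the same default headers."""
    session = requests.Session()
    session.headers.update(DEFAULT_HEADERS)
    if extra_headers:
        session.headers.update(extra_headers)
    return session


session = build_session({"Referer": "http://www.baidu.com/"})
print(session.headers["User-Agent"])
# session.get(url, timeout=10) would now send both headers.
```

Passing a timeout on each call is a reasonable habit for crawlers, since requests waits indefinitely by default.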

4. Downloading images

# Fetch a binary resource
import requests


def get_resource(url):
    headers = {
        "User-Agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)"
    }

    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.content
    return None


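# Save the image
def save_pic(url):
    img_content = get_resource(url)
    if img_content is None:
        # The request failed, so there is nothing to write.
        return
    # Take the last path segment and drop any '@...' suffix.
    file_name = url.split('/')[-1].split('@')[0]
    # The ./images directory must already exist.
    with open('./images/%s' % file_name, 'wb') as f:
        f.write(img_content)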
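For larger files, one possible variation is to stream the body to disk in chunks instead of holding it all in memory, and to derive the file name from the URL path so that query strings are ignored. This is a sketch, not the approach used above: file_name_from_url, download_image, the chunk size, and the example URL are all assumptions made for illustration.

```python
import os
import posixpath
from urllib.parse import urlsplit

import requests


def file_name_from_url(url):
    """Derive a local file name from the URL path, dropping any '@...' suffix."""
    path = urlsplit(url).path  # the query string is not part of the path
    return posixpath.basename(path).split("@")[0]


def download_image(url, directory="./images", timeout=10):
    """Stream the image to disk in chunks; return the saved path or None."""
    os.makedirs(directory, exist_ok=True)  # create the folder if it is missing
    target = os.path.join(directory, file_name_from_url(url))
    with requests.get(url, stream=True, timeout=timeout) as response:
        if response.status_code != 200:
            return None
        with open(target, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
    return target


# Hypothetical URL, used only to show the file-name handling (no download).
print(file_name_from_url("http://example.com/img/cat.jpg@200w?from=list"))
```

Calling download_image(url) with a real image URL would save the file under ./images and return its path.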
