Web Scraping with the Requests Library: Worked Examples

1. Scraping a JD.com product page

import requests

url = "https://item.jd.com/100000400014.html"

try:
    r = requests.get(url, timeout=10)
    r.raise_for_status()                 # raise an exception on 4xx/5xx status codes
    r.encoding = r.apparent_encoding     # guess the real encoding from the content
    print(r.text)
except requests.RequestException:        # don't use a bare except; catch request errors only
    print("ERROR!")
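This fetch-check-decode pattern recurs in every example below, so it can be wrapped in a small helper. A minimal sketch; the function name and timeout value are my own choices, not from the original:

```python
import requests

def get_html(url, timeout=10):
    """Fetch a page and return its decoded text, or None on any request error."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()                 # raise on 4xx/5xx responses
        r.encoding = r.apparent_encoding     # guess encoding from the content
        return r.text
    except requests.RequestException:        # covers timeouts, DNS errors, bad status
        return None
```

Returning None instead of printing lets the caller decide how to handle a failed fetch.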

2. Scraping an Amazon product page

Run the following code:

import requests

url = "https://www.amazon.cn/dp/B07DFFDR1V/ref=sr_1_5?ie=UTF8&qid=1538651583&sr=8-5"
r = requests.get(url)
r.encoding = r.apparent_encoding
print(r.text)

The result is not the page source we wanted. Checking r.status_code shows 503, and r.request.headers reveals why: the request announced itself with a User-Agent of the form python-requests/<version> (alongside 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'), and Amazon rejects that client. The fix is to pass a browser-like User-Agent through the headers parameter:

import requests

url = "https://www.amazon.cn/dp/B07DFFDR1V/ref=sr_1_5?ie=UTF8&qid=1538651583&sr=8-5"
header = {'user-agent': 'Mozilla/5.0'}

try:
    r = requests.get(url, headers=header, timeout=10)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text)
except requests.RequestException:
    print("ERROR!")
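You can verify what will actually be sent without touching the network by preparing the request first. Session.prepare_request is standard Requests API; the exact default header set varies with the Requests version, so the outputs below are indicative:

```python
import requests

s = requests.Session()

# Without custom headers, the User-Agent defaults to "python-requests/<version>"
plain = s.prepare_request(requests.Request('GET', 'https://www.amazon.cn/'))
print(plain.headers['User-Agent'])     # e.g. python-requests/2.31.0

# With the headers parameter, our value replaces the default
custom = s.prepare_request(requests.Request('GET', 'https://www.amazon.cn/',
                                            headers={'user-agent': 'Mozilla/5.0'}))
print(custom.headers['User-Agent'])    # Mozilla/5.0
```

Header lookup is case-insensitive, which is why setting 'user-agent' overrides 'User-Agent'.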

3. Submitting a search keyword to Baidu

Baidu's keyword search interface:

http://www.baidu.com/s?wd=keywords

So we only need to build a URL of the same shape to run a keyword search:

>>> import requests
>>> kw = {'wd': 'Python'}
>>> r = requests.get("http://www.baidu.com/s", params=kw)
>>> r.request.url        # the URL that was actually requested
'http://www.baidu.com/s?wd=Python'
>>> len(r.text)          # length of the returned page
254672
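params also URL-encodes values for you, which matters for non-ASCII keywords. This can be checked offline by preparing the request rather than sending it (a sketch; no request actually goes out):

```python
import requests

# ASCII keyword: passed through unchanged
req = requests.Request('GET', 'http://www.baidu.com/s',
                       params={'wd': 'Python'}).prepare()
print(req.url)     # http://www.baidu.com/s?wd=Python

# Non-ASCII keyword: percent-encoded as UTF-8 automatically
req2 = requests.Request('GET', 'http://www.baidu.com/s',
                        params={'wd': '爬虫'}).prepare()
print(req2.url)    # http://www.baidu.com/s?wd=%E7%88%AC%E8%99%AB
```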

4. Downloading and saving an image from the web

A web image link has the format:

http://www.xxxx.com/xxx.jpg

For example, to save this image (http://pic107.huitu.com/res/20180709/520738_20180709220906214050_1.jpg):


A minimal version:

>>> import requests
>>> url = "http://pic107.huitu.com/res/20180709/520738_20180709220906214050_1.jpg"
>>> path = "D:/" + url.split('/')[-1]
>>> r = requests.get(url)
>>> r.status_code
200
>>> with open(path, "wb") as f:
...     f.write(r.content)
...
86931

The full version:

import os
import requests

url = "http://pic107.huitu.com/res/20180709/520738_20180709220906214050_1.jpg"
root = "D:/pics/"
path = root + url.split("/")[-1]

try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url, timeout=10)
        r.raise_for_status()
        with open(path, "wb") as f:    # "wb" is required: image bytes, not text
            f.write(r.content)         # the with block closes the file automatically
        print("File saved")
    else:
        print("File already exists")
except (requests.RequestException, OSError):
    print("Download failed")
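For larger files it is safer to stream the response than to hold r.content in memory all at once. The sketch below wraps the logic above in a reusable function; the function name, default root folder, chunk size, and timeout are my own choices, not from the original:

```python
import os
import requests

def download_image(url, root="D:/pics/"):
    """Save url under root, streaming the body; returns the local path."""
    path = os.path.join(root, url.split("/")[-1])
    os.makedirs(root, exist_ok=True)       # create the folder if it is missing
    if os.path.exists(path):
        return path                        # skip files we already have
    r = requests.get(url, stream=True, timeout=10)
    r.raise_for_status()
    with open(path, "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)                 # write the body piece by piece
    return path
```

With stream=True the body is only read as iter_content is consumed, so memory use stays bounded regardless of file size.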

5. Looking up the location of an IP address

import requests

ip = input("your ip:")
url = "http://www.ip138.com/ips138.asp?ip="

try:
    r = requests.get(url + ip, timeout=10)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text)
except requests.RequestException:
    print("ERROR!")
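Here the query string is glued on by string concatenation; passing params instead lets Requests handle the encoding, which can again be checked offline by preparing the request (a sketch; the example IP is arbitrary and nothing is sent):

```python
import requests

req = requests.Request('GET', 'http://www.ip138.com/ips138.asp',
                       params={'ip': '202.204.80.112'}).prepare()
print(req.url)    # http://www.ip138.com/ips138.asp?ip=202.204.80.112
```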
