Web Crawling and Information Extraction: The requests Library

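Contents

    • Quick installation test
    • A general framework for fetching web pages
    • Respecting the robots protocol
    • Fetching a JD.com product page
    • Amazon product page (my attempt failed)
    • Submitting search keywords to Baidu and 360
    • Fetching and saving web images
    • Automatic lookup of IP address location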

pip install requests

Quick installation test

import requests
r=requests.get("http://www.baidu.com")
print(r.status_code)

A general framework for fetching web pages

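import requests

def getHTMLText(url):
    """Return the decoded page text, or an error message if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding  # decode using the encoding guessed from the body
        return r.text
    except requests.RequestException:
        return "An exception occurred"

if __name__ == "__main__":  # double underscores on both sides
    url = "http://www.baidu.com"  # the URL must include the scheme
    print(getHTMLText(url))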
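
One detail of this framework is worth illustrating: requests rejects a URL that has no scheme (such as `www.baidu.com`) while it is preparing the request, before any connection is opened, and the resulting `MissingSchema` error is a subclass of `requests.RequestException`. The sketch below shows this; the helper name `describe_failure` is made up for illustration and is not part of requests.

```python
import requests

def describe_failure(url):
    """Return the name of the exception raised for url, or None on success.

    Hypothetical helper for illustration; not part of requests.
    """
    try:
        requests.get(url, timeout=5)
    except requests.RequestException as exc:
        return type(exc).__name__
    return None

# A scheme-less URL fails during request preparation, before any traffic is sent.
print(describe_failure("www.baidu.com"))
```

With a scheme-less URL the call prints `MissingSchema`, so a handler that catches `requests.RequestException` covers this case along with timeouts and HTTP errors.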

Respecting the robots protocol
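
The standard library's `urllib.robotparser` can check whether a path is allowed before it is fetched. The sketch below parses a made-up robots.txt inline so that it runs without network access; the rules, the agent names `MyCrawler` and `BadBot`, and the `example.com` URLs are assumptions for illustration, not any real site's policy. For a live site, `set_url()` followed by `read()` would load the actual file instead.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

# Ordinary crawlers may fetch public pages but not anything under /private/.
print(parser.can_fetch("MyCrawler", "https://example.com/index.html"))
print(parser.can_fetch("MyCrawler", "https://example.com/private/data.html"))
# BadBot is excluded from the whole site.
print(parser.can_fetch("BadBot", "https://example.com/index.html"))
```

Under these sample rules the three checks print `True`, `False`, and `False`.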

Fetching a JD.com product page

import requests

url = "https://item.jd.com/100004815031.html"
try:
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])  # show only the first 1000 characters
except requests.RequestException:
    print("Fetch failed")
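
Assigning `r.apparent_encoding` to `r.encoding` matters because requests falls back to ISO-8859-1 for text responses whose `Content-Type` header names no charset, and decoding UTF-8 bytes that way garbles Chinese text. The sketch below reproduces the effect with plain byte strings, so it needs neither requests nor a network connection; the sample string is an arbitrary choice for illustration.

```python
# Bytes as a server might send them for a UTF-8 encoded page (sample text is arbitrary).
raw = "京东商品".encode("utf-8")

# Decoding with the ISO-8859-1 fallback maps every byte to one character,
# which turns the text into mojibake.
wrong = raw.decode("iso-8859-1")

# Decoding with the encoding actually used recovers the text.
right = raw.decode("utf-8")

print(len(raw), len(wrong), len(right))
print(right == "京东商品")
```

The four characters occupy 12 bytes, so the wrongly decoded string is 12 characters long while the correct one is 4.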

Amazon product page (my attempt failed)

Step by step

>>> import requests
>>> r = requests.get("http://www.amazon.cn/gp/product/B01M8L5Z3Y")
>>> r.status_code
503
>>> r.encding
Traceback (most recent call last):
  File "", line 1, in <module>
    r.encding
AttributeError: 'Response' object has no attribute 'encding'
>>> r.encoding
'ISO-8859-1'
>>> r.encoding=r.apparent_encoding
>>> r.text
'\n\n\n\n\n\n\n\n\nAmazon CAPTCHA\n\n...'
(Output abridged. The body is Amazon's CAPTCHA page. It asks the visitor to enter the characters shown in an image and explains: "Sorry, we just want to confirm that the current visitor is not an automated program. For best results, please make sure cookies are enabled in your browser." It ends with links to the Conditions of Use and the Privacy Notice and the footer "© 1996-2015, Amazon.com, Inc. or its affiliates".)
>>> r.request.headers
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
>>> kv = {'user-agent': 'Mozilla/5.0'}
>>> url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
>>> r = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})
>>> r.status_code
200
>>> r.request.headers
{'user-agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
>>> r.text[:1000]
'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n \n\n...
(Output abridged; the remaining characters shown in the original are newlines and spaces.)
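
The session above suggests that the default `python-requests/...` User-Agent is what drew the 503 and the CAPTCHA page, since the same URL returned 200 once a browser-like value was sent. A minimal sketch of inspecting the headers that would go out, without contacting any server, is to prepare the request through a `Session`; the helper name `outgoing_headers` is an assumption for illustration and is not part of requests.

```python
import requests

def outgoing_headers(url, extra_headers=None):
    """Return the headers requests would send for a GET to url.

    Hypothetical helper for illustration; it only prepares the request
    and never sends it.
    """
    with requests.Session() as session:
        request = requests.Request("GET", url, headers=extra_headers)
        prepared = session.prepare_request(request)
        return prepared.headers  # case-insensitive mapping

url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
default = outgoing_headers(url)
custom = outgoing_headers(url, {"user-agent": "Mozilla/5.0"})

print(default["User-Agent"])  # python-requests/<installed version>
print(custom["User-Agent"])   # the overriding value
```

Header names are matched case-insensitively, which is why the lowercase `user-agent` key replaces the default `User-Agent` entry rather than adding a second one.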
