Static Web Page Scraping Basics

1 Installing Requests

pip install requests
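To confirm that the installation worked, one quick check is to import the library and print its version. This is only a sanity check; the version printed depends on what pip installed.

```python
import requests

# If the import succeeds, the library is installed; print which version was picked up.
print(requests.__version__)
```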

2 Getting the Response Content

import requests

# Fetch the page; r is a Response object
r = requests.get('http://www.santostang.com/')
print("Text encoding:", r.encoding)
print("Response status code:", r.status_code)
print("Response body as a string:", r.text)

Result:

Text encoding: UTF-8
Response status code: 200
Response body as a string:
(HTML source of the page, omitted here)
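In practice it can help to check the status code before using the body. The following is a minimal sketch, not part of the original example: the helper name fetch_html and the 10-second default timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Hypothetical helper: return the page body as text, or raise on an HTTP error."""
    r = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.exceptions.HTTPError for 4xx and 5xx responses
    r.raise_for_status()
    return r.text
```

Calling fetch_html('http://www.santostang.com/') would return the same string as r.text above when the server answers with status 200.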

3 Customizing Requests

3.1 Passing URL Parameters

import requests

# The params dict is encoded into the query string of the URL
key_dict = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('http://httpbin.org/get', params=key_dict)
print("URL correctly encoded:", r.url)
print("Response body as a string: \n", r.text)

Result:

URL correctly encoded: http://httpbin.org/get?key1=value1&key2=value2
Response body as a string: 
 {
  "args": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Cache-Control": "max-age=259200", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.14.2"
  }, 
  "origin": "203.205.141.49", 
  "url": "http://httpbin.org/get?key1=value1&key2=value2"
}
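The URL encoding can also be inspected without sending anything over the network. The sketch below builds a prepared request with requests.Request(...).prepare() and prints the resulting URL; the second request, with a list value, is an extra illustration that is not in the original.

```python
import requests

key_dict = {'key1': 'value1', 'key2': 'value2'}

# Build the request without sending it, so the encoded URL can be inspected offline.
prepared = requests.Request('GET', 'http://httpbin.org/get', params=key_dict).prepare()
print(prepared.url)  # http://httpbin.org/get?key1=value1&key2=value2

# A list value is expanded into one key=value pair per item.
multi = requests.Request(
    'GET', 'http://httpbin.org/get',
    params={'key1': 'value1', 'key2': ['a', 'b']},
).prepare()
print(multi.url)  # http://httpbin.org/get?key1=value1&key2=a&key2=b
```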

3.2 Customizing Request Headers

import requests

# Send browser-like headers instead of the default python-requests ones
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
    'Host': 'www.santostang.com'
}
r = requests.get('http://www.santostang.com/', headers=headers)
print("Response status code:", r.status_code)

Result:

Response status code: 200
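When many requests need the same headers, one option is to set them once on a requests.Session instead of passing them on every call. This is a sketch of that alternative, not the approach used above; it prepares a request without sending it so the merged headers can be checked offline.

```python
import requests

USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36')

# Headers set on the session are merged into every request made through it.
session = requests.Session()
session.headers.update({'user-agent': USER_AGENT})

# Prepare (but do not send) a request to see which headers would go out.
request = requests.Request('GET', 'http://www.santostang.com/')
prepared = session.prepare_request(request)
print(prepared.headers['User-Agent'])
```

Header names are matched case-insensitively, so 'user-agent' and 'User-Agent' refer to the same header.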

3.3 Sending a POST Request

import requests

# A dict passed as data= is sent as a form-encoded request body
key_dict = {'key1': 'value1', 'key2': 'value2'}
r = requests.post('http://httpbin.org/post', data=key_dict)
print(r.text)

Result:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Cache-Control": "max-age=259200", 
    "Connection": "close", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.14.2"
  }, 
  "json": null, 
  "origin": "203.205.141.49", 
  "url": "http://httpbin.org/post"
}
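The output above shows the dict arriving under "form" with a Content-Type of application/x-www-form-urlencoded. For comparison, the sketch below prepares the same POST twice without sending it: once with data= and once with json=, which is an addition not covered in the original.

```python
import json

import requests

key_dict = {'key1': 'value1', 'key2': 'value2'}
url = 'http://httpbin.org/post'

# data= with a dict produces a form-encoded body.
form_request = requests.Request('POST', url, data=key_dict).prepare()
print(form_request.headers['Content-Type'])  # application/x-www-form-urlencoded
print(form_request.body)                     # key1=value1&key2=value2

# json= serializes the dict to JSON and sets the matching Content-Type.
json_request = requests.Request('POST', url, json=key_dict).prepare()
print(json_request.headers['Content-Type'])  # application/json
print(json.loads(json_request.body))
```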

3.4 Timeout

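import requests

link = "http://www.santostang.com/"
# timeout is in seconds; with a value this small the request will almost
# certainly fail with a timeout exception instead of returning a response
r = requests.get(link, timeout=0.001)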
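Since a timed-out request raises an exception rather than returning a response, a crawler usually wants to catch it. A minimal sketch, assuming a helper named fetch_with_timeout (a hypothetical name) that returns None on timeout:

```python
import requests


def fetch_with_timeout(url, timeout=10):
    """Hypothetical helper: return the Response, or None if the request timed out."""
    try:
        return requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Covers both connect timeouts and read timeouts.
        return None
```

With the link above and timeout=0.001, this helper would most likely return None. Other failures, such as a DNS error, are not timeouts and would still be raised.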
