import requests
response = requests.get("https://www.baidu.com")
print(type(response)) # <class 'requests.models.Response'>
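The returned object is a requests.models.Response. A few commonly used attributes, as a quick sketch:
import requests
response = requests.get("https://www.baidu.com")
print(response.status_code) # HTTP status code, e.g. 200
print(response.headers['Content-Type']) # response headers behave like a dict
print(response.url) # final URL after any redirects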
1.2.1 The case with only a url parameter
response = requests.post("http://z.kktijian.com/Project/GetAllDepartment")
1.2.2 The case with Form Data
response = requests.post("http://z.kktijian.com/Project/GetDiagnosisList",data={"departId":"01"})
In this case the Content-Type in headers defaults to: {'Content-Type':'application/x-www-form-urlencoded'}
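You can verify this default yourself; a small sketch using http://httpbin.org/post, which simply echoes the request back:
import requests
response = requests.post("http://httpbin.org/post",data={"departId":"01"})
# the prepared request records the headers that were actually sent
print(response.request.headers['Content-Type']) # application/x-www-form-urlencoded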
When logging in, we need to keep the session after login so we can access pages that require authentication; requests.session() handles this. Usage is as follows:
# a session automatically persists cookies; usage is almost the same as the requests module
session = requests.session()
# perform the login
session.post(url, data)
# request a page that requires login
session.get(url)
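A fuller sketch of the same flow; the login URL and form fields below are hypothetical placeholders, not a real site:
import requests
session = requests.session()
# hypothetical login endpoint and credentials, for illustration only
session.post("https://example.com/login", data={"username":"user","password":"pw"})
# cookies set during login are sent automatically with later requests
response = session.get("https://example.com/profile")
print(response.status_code)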
import requests
response = requests.get("https://www.baidu.com")
text=response.text
print(text)
The result is as follows; you can see the text comes out garbled, even though the page's encoding is utf-8:
For the underlying reason, see my other blog post, a tutorial with hands-on examples for the Python scrapy crawler framework.
The correct way to read it is to set response.encoding='utf-8' first, as follows:
import requests
response = requests.get("https://www.baidu.com")
response.encoding='utf-8'
text=response.text
print(text)
The output now displays correctly, with no garbled text:
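If you do not know the page encoding in advance, requests can also guess it from the response body; a small sketch using the apparent_encoding attribute:
import requests
response = requests.get("https://www.baidu.com")
# apparent_encoding is detected from the content itself, which is more reliable
# than the header-based default when the server declares no charset
response.encoding = response.apparent_encoding
print(response.text[:200])
Next, proxies and request headers can be set via the proxies and headers parameters: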
proxies={'http':'http://27.40.108.142:36058', 'https':'https://27.40.108.142:36058'}
headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)'}
response = requests.get("http://httpbin.org/ip",proxies=proxies,headers=headers)
print(response.text)
print(response.request.headers)
The result is as follows:
{
"origin": "27.40.108.142"
}
{'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
Notes:
- The proxies dict format is 'protocol':'protocol://ip:port'.
- http://httpbin.org/ip is a site that returns the IP it sees, so you can check that the proxy is in effect.
- Proxy and header settings for requests.post(...) work the same way.
- response = requests.get("http://httpbin.org/ip", timeout = (5, 5)) sets a timeout. By default there is no timeout handling and the call blocks indefinitely; the first value of timeout is the connection timeout, the second is the response (read) timeout.
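When a timeout does trigger, requests raises an exception rather than returning; a sketch of catching it:
import requests
try:
    response = requests.get("http://httpbin.org/ip", timeout = (5, 5))
    print(response.text)
except requests.exceptions.Timeout:
    # raised when either the connection or the response exceeds its limit
    print("request timed out")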
pip install beautifulsoup4
The HTML content of the test page we use here is as follows:
<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<div id='images'>
<a href='image1.html'>Name: My image 1</a>
<a href='image1.html'>Name: My image 1</a>
<a href='image2.html'>Name: My image 2</a>
<a href='image2.html'>Name: My image 2</a>
</div>
</body>
</html>
The Python code is as follows:
from bs4 import BeautifulSoup
# html_text holds the HTML document shown above
soup = BeautifulSoup(html_text,"html.parser")
title=soup.title
print(type(title)) # <class 'bs4.element.Tag'>
print(title) # <title>Example website</title>
Notes:
- "html.parser" selects the parser; it is Python's built-in HTML parser.
- soup.title is of type bs4.element.Tag, so you can chain further calls such as title.find(...); other functions like soup.find_all(...) behave similarly.
To get only the text inside the tag, use .string:
soup = BeautifulSoup(html_text,"html.parser")
title_content=soup.title.string
print(title_content) # Example website
.string returns the content inside the tag.
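Closely related to .string is get_text(); on this simple title tag they print the same thing, but get_text() also concatenates text across nested child tags. A quick sketch:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_text,"html.parser")
print(soup.title.string) # Example website
print(soup.title.get_text()) # Example website (differs from .string when tags are nested)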
soup = BeautifulSoup(html_text,"html.parser")
tag_list=soup.find_all("a",href="image1.html")
# [<a href="image1.html">Name: My image 1</a>, <a href="image1.html">Name: My image 1</a>]
print(tag_list)
Notes:
- find_all can take one or more filter conditions and returns a list of bs4.element.Tag objects.
- The keyword filter here is href; if you filter on class instead, it clashes with the Python keyword class, so you have to write class_="xxx".
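To illustrate the class_ workaround, a small sketch; the HTML snippet below is hypothetical, since the test page above has no class attributes:
from bs4 import BeautifulSoup
html = "<div><a class='thumb' href='image1.html'>x</a></div>" # hypothetical snippet
soup = BeautifulSoup(html,"html.parser")
# class is a reserved word in Python, so BeautifulSoup accepts class_ instead
print(soup.find_all("a",class_="thumb")) # [<a class="thumb" href="image1.html">x</a>]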
soup = BeautifulSoup(html_text,"html.parser")
tag=soup.find("a",href="image1.html")
# <a href="image1.html">Name: My image 1</a>
print(tag)
Notes:
- find is similar to find_all; the difference is that it returns only the first matching element, and its return type is bs4.element.Tag.
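Putting these pieces together on the test page above, a short sketch that walks every link and reads its attributes and text:
from bs4 import BeautifulSoup
# html_text holds the HTML document shown earlier
soup = BeautifulSoup(html_text,"html.parser")
for a in soup.find_all("a"):
    # tag attributes are accessed like a dict; .string gives the inner text
    print(a["href"], a.string)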