I've been working with pandas and numpy lately, so I decided to dig into the requests library as well. It is currently the most popular HTTP client for Python and arguably the most Pythonic one, and I plan to read its source when I find the time. A while ago I also came across httpx, a third-party library with a richer API that extends beyond what requests offers; I haven't looked at it closely yet, but I intend to. This post is just a quick record of the basics of requests. (requests is built on top of urllib, so understanding the lower layers eventually means studying urllib's implementation too.)
# This one is simple; examples adapted from the official docs
# A request with no parameters
import requests
r = requests.get('https://api.github.com/events')
print(r.status_code)
# A request with URL parameters
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get("http://httpbin.org/get", params=payload)
print(r.url)
# A parameter value can also be a list
payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
r = requests.get('http://httpbin.org/get', params=payload)
print(r.url)
# result
'''
r.status_code >>> 200
r.url >>> http://httpbin.org/get?key2=value2&key1=value1
r.url >>> http://httpbin.org/get?key1=value1&key2=value2&key2=value3
'''
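Under the hood, requests encodes `params` with standard urlencode semantics, which is why the list value above expands into a repeated `key2` in the query string. A minimal sketch of that encoding using only the stdlib (no network needed; `doseq=True` is what turns a list into repeated pairs):

```python
from urllib.parse import urlencode

# A list value expands into one key=value pair per element
payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
query_string = urlencode(payload, doseq=True)
print(query_string)  # key1=value1&key2=value2&key2=value3
```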
# Also simple; examples adapted from the official docs
import requests
r = requests.post('http://httpbin.org/post', data = {'key':'value'})
print(r.status_code)
# Passing request parameters, e.g. a login form
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("http://httpbin.org/post", data=payload)
print(r.text)
# result
'''
r.status_code >>> 200
r.text >>>
{
...
"form": {
"key2": "value2",
"key1": "value1"
},
...
}
'''
# The other HTTP methods follow the same pattern
r = requests.put('http://httpbin.org/put', data={'key': 'value'})
r = requests.delete('http://httpbin.org/delete')
r = requests.head('http://httpbin.org/get')
r = requests.options('http://httpbin.org/get')
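Worth noting: `data=` sends a form-encoded body, while `json=` (available since requests 2.4.2) serializes the dict to JSON and sets the Content-Type header accordingly. A sketch of roughly what each body looks like on the wire, using only the stdlib:

```python
import json
from urllib.parse import urlencode

payload = {'key': 'value'}
form_body = urlencode(payload)   # what data=payload sends over the wire
json_body = json.dumps(payload)  # what json=payload sends over the wire
print(form_body)  # key=value
print(json_body)  # {"key": "value"}
```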
import requests
# Set proxies: requests to an http URL go through the "http" entry of the dict, requests to an https URL through the "https" entry
proxy = {"http": "http://xxx.com", "https": "https://xxx.com"}
headers = {
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
r = requests.get("http://httpbin.org/get", proxies=proxy, headers=headers)
print(r)
requests provides a class called Session to keep a conversation going between client and server.
A cookie/session pair usually maps to a single user, so hitting a site from different IPs with the same cookie is an easy way to get flagged as a crawler.
To fetch pages that sit behind a login, the request must carry the right cookies.
#######################################################################
# The Session approach
import requests
session = requests.Session()
post_url = "http://www.renren.com/PLogin.do"
headers = {
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
# Supply your own credentials here
post_data = {"email": "xxx.com", "password": "xxxxxx"}
# This POST logs in and stores the returned cookies on the session object
session.post(post_url, data=post_data, headers=headers)
# From here on the session object carries the cookies automatically; a GET as an example
r = session.get("http://www.renren.com/357363399/profile")
print(r.content.decode())
#######################################################################
# If you'd rather not do it this way, you can log in through the browser first, grab the post-login cookies, split the cookie string into a dict with a dict comprehension, and pass that in
import requests
requests.get("http://www.renren.com/PLogin.do", headers=headers, cookies=cookies)
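That dict comprehension looks like this: copy the raw `Cookie:` header value from your browser's dev tools and split it into the dict that the `cookies=` argument expects. A sketch (the header value here is a made-up example):

```python
# Raw "Cookie:" header value copied from the browser (dummy values)
raw_cookie = "anonymid=abc123; depovince=GW; jebecookies=xyz789"

# Split the "name=value" pairs into the dict that requests' cookies= expects
cookies = {
    pair.split("=", 1)[0].strip(): pair.split("=", 1)[1]
    for pair in raw_cookie.split(";")
}
print(cookies)  # {'anonymid': 'abc123', 'depovince': 'GW', 'jebecookies': 'xyz789'}
```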
# Decode a percent-encoded URL back into a human-readable one
r = requests.utils.unquote("a url full of unreadable % escapes")
print(r)
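requests.utils.unquote is simply re-exported from the stdlib, so urllib.parse behaves identically, and quote is its inverse. A quick round-trip on a non-ASCII string:

```python
from urllib.parse import quote, unquote

encoded = quote("你好")   # percent-encode the UTF-8 bytes
print(encoded)            # %E4%BD%A0%E5%A5%BD
print(unquote(encoded))   # 你好
```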
When requesting an https site you may hit a certificate error, usually because the server never obtained a CA-signed certificate or let it expire. If you are sure the address itself is trustworthy, you can pass the parameter below, which unilaterally ignores the server's certificate problem.
# verify=False skips certificate verification
requests.get("https://www.12306.cn/index/", verify=False)
# Fetch cookies from the server; r.cookies is a CookieJar object
r = requests.get(url)
# Convert the CookieJar into a plain dict
requests.utils.dict_from_cookiejar(r.cookies)
The following is out of personal interest and for technical study only; it will be taken down on request.
# The JS source Baidu Translate uses to compute the sign; it can be found in their js bundle with a bit of searching
var i = "320305.131321201"
function a(r) {
if (Array.isArray(r)) {
for (var o = 0, t = Array(r.length); o < r.length; o++) t[o] = r[o];
return t
}
return Array.from(r)
}
function n(r, o) {
for (var t = 0; t < o.length - 2; t += 3) {
var a = o.charAt(t + 2);
a = a >= "a" ? a.charCodeAt(0) - 87 : Number(a), a = "+" === o.charAt(t + 1) ? r >>> a : r << a, r = "+" === o.charAt(t) ? r + a & 4294967295 : r ^ a
}
return r
}
function e(r) {
var o = r.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g);
if (null === o) {
var t = r.length;
t > 30 && (r = "" + r.substr(0, 10) + r.substr(Math.floor(t / 2) - 5, 10) + r.substr(-10, 10))
} else {
for (var e = r.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/), C = 0, h = e.length, f = []; h > C; C++) "" !== e[C] && f.push.apply(f, a(e[C].split(""))), C !== h - 1 && f.push(o[C]);
var g = f.length;
g > 30 && (r = f.slice(0, 10).join("") + f.slice(Math.floor(g / 2) - 5, Math.floor(g / 2) + 5).join("") + f.slice(-10).join(""))
}
var u = void 0, l = "" + String.fromCharCode(103) + String.fromCharCode(116) + String.fromCharCode(107);
u = null !== i ? i : (i = window[l] || "") || "";
for (var d = u.split("."), m = Number(d[0]) || 0, s = Number(d[1]) || 0, S = [], c = 0, v = 0; v < r.length; v++) {
var A = r.charCodeAt(v);
128 > A ? S[c++] = A : (2048 > A ? S[c++] = A >> 6 | 192 : (55296 === (64512 & A) && v + 1 < r.length && 56320 === (64512 & r.charCodeAt(v + 1)) ? (A = 65536 + ((1023 & A) << 10) + (1023 & r.charCodeAt(++v)), S[c++] = A >> 18 | 240, S[c++] = A >> 12 & 63 | 128) : S[c++] = A >> 12 | 224, S[c++] = A >> 6 & 63 | 128), S[c++] = 63 & A | 128)
}
for (var p = m, F = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(97) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(54)), D = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(51) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(98)) + ("" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(102)), b = 0; b < S.length; b++) p += S[b], p = n(p, F);
return p = n(p, D), p ^= s, 0 > p && (p = (2147483647 & p) + 2147483648), p %= 1e6, p.toString() + "." + (p ^ m)
}
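To make the obfuscated helper `n` above easier to follow, here is a hedged Python port of just that function. The 32-bit masks stand in for JavaScript's int32/uint32 shift semantics, so it returns the uint32 representation of what the JS computes; the expected value below was worked through by hand for the op string `"+-a^+6"`, which is the constant the JS builds from char codes.

```python
def n(r, o):
    # Walk the op string three chars at a time:
    # o[t+2] selects a shift amount, o[t+1] the shift direction,
    # o[t] whether to add or xor the shifted value back into r.
    for t in range(0, len(o) - 2, 3):
        c = o[t + 2]
        a = ord(c) - 87 if c >= "a" else int(c)   # 'a'.. -> 10.., digits as-is
        a = (r & 0xFFFFFFFF) >> a if o[t + 1] == "+" else (r << a) & 0xFFFFFFFF
        r = (r + a) & 0xFFFFFFFF if o[t] == "+" else (r ^ a) & 0xFFFFFFFF
    return r

print(n(1, "+-a^+6"))  # 1041
```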
# I've been reading up on requests lately, so this is a practice exercise; bear with any rough edges
import requests
import urllib.parse
import json
class BaiDuTranslator:
    def __init__(self, query, origin_lang="en", translated_lang="zh"):
        """
        :param query: the text to translate
        :param origin_lang: source language, defaults to English
        :param translated_lang: target language, defaults to Chinese
        self.headers: you must supply your own Baidu Translate cookies
        self.data: the token value may need updating; as of 2020-06-28 it looks
            like this. If it changes, search the Baidu Translate page source and
            replace it. I didn't bother extracting it with bs4 here; hard-coding
            it was quicker.
        """
        self.base_url = "https://fanyi.baidu.com/v2transapi?"
        self.origin_lang = origin_lang
        self.translated_lang = translated_lang
        self.query = query
        self.headers = {
            "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36",
            "cookie": "your own Baidu Translate cookies"
        }
        self.data = {
            "from": self.origin_lang,
            "to": self.translated_lang,
            "transtype": "translang",
            "simple_means_flag": "3",
            "token": "daa6bc27173d38ecf0fd28141cf572f3",
            "domain": "common",
            "query": self.query
        }

    # The actual full request url
    @property
    def full_url(self):
        return self.base_url + urllib.parse.urlencode({"from": self.origin_lang, "to": self.translated_lang})

    # Adds the computed sign to the form data (query is already set in __init__)
    def update_data(self, value):
        self.data.update({"sign": value})

    def post_query(self, full_url):
        r = requests.post(full_url, data=self.data, headers=self.headers).content.decode()
        return r

    @staticmethod
    def parse_sign(key, js_path=None):
        # Use a third-party library to execute the JS
        import execjs
        if not js_path:
            raise FileNotFoundError("The official Baidu js file is required")
        with open(js_path) as fp:
            content = fp.read()
        sign_value = execjs.compile(content).call("e", key)
        return sign_value

    def run(self):
        full_url = self.full_url
        sign_value = self.parse_sign(self.query, "/home/chen/PycharmProjects/spider_man/baidu_translator/baidu.js")
        self.update_data(sign_value)
        resp = self.post_query(full_url)
        return json.loads(resp)


if __name__ == '__main__':
    # Defaults to English -> Chinese; other languages can be chosen via the
    # __init__ parameters
    while True:
        _str = input("input your query:")
        translator = BaiDuTranslator(_str)
        r = translator.run()
        print(r["trans_result"]["data"][0]["dst"])