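Recently I had a requirement to batch-translate text with Google Translate. The data volume is too large for simulating a browser with selenium to be realistic.
It would be great to just call the official API, but that API is paid. No way around it, so I had to do it myself and see whether the web interface could be cracked. I took a look at how Google Translate talks to its backend:
right-click in the browser and choose Inspect, and you can see that the translation goes through an endpoint: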
https://translate.google.com/translate_a/single?client=webapp&sl=auto&tl=en&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&ssel=5&tsel=5&kc=1&tk={}&q={}
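Here tk is a ticket that serves as an anti-scraping mechanism and q is the text you want to translate, so the most important thing is to know where tk comes from. If you are curious about how tk is generated, look at the JS on the translation page: tk is computed on the front end, in JS, by running the text to be translated through a series of hash-like operations. Someone has already reverse-engineered it, and the code that generates tk is pasted below. Just call get_tk() and pass in the text you want to translate.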
#!/usr/bin/env python2
# -*- coding: UTF-8 -*-
"""This module creates the TK GET parameter for Google translate.

This is just a code port to python. All credits should go to the original
creators of the code @tehmaestro and @helen5106.

For more info see: https://github.com/Stichoza/google-translate-php/issues/32

Usage:
    Call this python script from the command line.
        $ python tk_generator.py
    Use this module from another python script.
        >>> import tk_generator
        >>> tk_generator.get_tk('dog')

Attributes:
    _ENCODING (string): Default encoding to be used during the string
        encode-decode process.
"""
__all__ = ["get_tk"]

import sys
from datetime import datetime

_ENCODING = "UTF-8"
# Helper functions
def _mb_strlen(string):
    """Get the length of the encoded string."""
    return len(string.decode(_ENCODING))


def _mb_substr(string, start, length):
    """Get substring from the encoded string."""
    return string.decode(_ENCODING)[start: start + length]
##################################################
def _shr32(x, bits):
    if bits <= 0:
        return x
    if bits >= 32:
        return 0
    x_bin = bin(x)[2:]
    x_bin_length = len(x_bin)
    if x_bin_length > 32:
        x_bin = x_bin[x_bin_length - 32: x_bin_length]
    if x_bin_length < 32:
        x_bin = x_bin.zfill(32)
    return int(x_bin[:32 - bits].zfill(32), 2)


def _char_code_at(string, index):
    return ord(_mb_substr(string, index, 1))
#OLD Function
def _generateB():
    start = datetime(1970, 1, 1)
    now = datetime.now()
    diff = now - start
    return int(diff.total_seconds() / 3600)


def _TKK():
    """Replacement for _generateB function."""
    return [406604, 1836941114]
def _RL(a, b):
    for c in range(0, len(b) - 2, 3):
        d = b[c + 2]
        if d >= 'a':
            d = _char_code_at(d, 0) - 87
        else:
            d = int(d)
        if b[c + 1] == '+':
            d = _shr32(a, d)
        else:
            d = a << d
        if b[c] == '+':
            a = a + d & (pow(2, 32) - 1)
        else:
            a = a ^ d
    return a
def _TL(a):
    #b = _generateB()
    tkk = _TKK()
    b = tkk[0]
    d = []
    # A while loop is used so that f += 1 really skips the low surrogate;
    # reassigning the loop variable inside a for loop has no effect on
    # the next iteration.
    f = 0
    length = _mb_strlen(a)
    while f < length:
        g = _char_code_at(a, f)
        if g < 128:
            d.append(g)
        else:
            if g < 2048:
                d.append(g >> 6 | 192)
            else:
                if ((g & 0xfc00) == 0xd800 and
                        f + 1 < length and
                        (_char_code_at(a, f + 1) & 0xfc00) == 0xdc00):
                    f += 1
                    g = 0x10000 + ((g & 0x3ff) << 10) + (_char_code_at(a, f) & 0x3ff)
                    d.append(g >> 18 | 240)
                    d.append(g >> 12 & 63 | 128)
                else:
                    d.append(g >> 12 | 224)
                d.append(g >> 6 & 63 | 128)
            d.append(g & 63 | 128)
        f += 1
    a = b
    for e in range(0, len(d)):
        a += d[e]
        a = _RL(a, "+-a^+6")
    a = _RL(a, "+-3^+b+-f")
    a = a ^ tkk[1]
    if a < 0:
        a = (a & (pow(2, 31) - 1)) + pow(2, 31)
    a %= pow(10, 6)
    return "%d.%d" % (a, a ^ b)
def get_tk(word):
    """Returns the tk parameter for the given word."""
    if isinstance(word, unicode):
        word = word.encode(_ENCODING)
    return _TL(word)


if __name__ == '__main__':
    # if len(sys.argv) != 2:
    #     print "Usage: %s " % sys.argv[0]
    #     sys.exit(1)
    #
    # print "%s=%s" % (sys.argv[1], get_tk(sys.argv[1]))
    print get_tk("你 好")
The general flow for calling Google Translate looks like this:
# Generate tk
tk = get_tk(content)
url = "https://translate.google.com/translate_a/single?client=webapp&sl=auto&tl=en&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&ssel=5&tsel=5&kc=1&tk={}&q={}".format(tk, content)
header = {"Referer": "https://translate.google.com/",
"User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
"sec-fetch-site": "same-origin",
"sec-fetch-mode": "cors",
"x-client-data": "CJG2yQEIorbJAQjBtskBCKmdygEIqKPKAQjiqMoBCJetygEIza3KAQjyrcoBCMuuygEIyq/KAQ==",
"cookie": "ANID=AHWqTUmEUOyWZxeXZaHhE0-5lGBTHOVeMH3NP6elG0Zi4kx3VkGky6oE7hNyEZk_; _ga=GA1.3.123605499.1565587980; _gid=GA1.3.1811799555.1565587980; NID=188=nikTUMCm2LqPiIFv2Qr6mAjrzE6jONn4_LBFYRYhdDPrXaNU5KA6336tSkKY67wWHbbLAYwL86RuyqqJE1bWp6ltY-RS27-hjNxx0F2GN5pF2NKJ9mDT25hPU5XBEMjllYomR9TvdBR7u0Lig-Mra7to_bFGWWW2nq3wog9TAdA; 1P_JAR=2019-08-12-09"
}
resp = self._session.get(url, headers=header, timeout=30)
resp.raise_for_status()
if resp.status_code == 200:
    data = self._xpather.parse(url, resp.content)
    print data
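It is best to send the request with these headers. One more thing: when the content to translate is too long, use POST instead of GET and put content in the request body.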
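The snippet above pastes content straight into the URL and does not show the POST case, so here is a rough Python 3 sketch of building the request both ways. Several things in it are my assumptions rather than anything confirmed by the endpoint: the helper name build_request, the 1000-character cutoff in MAX_GET_CHARS, and sending the text as a form field named q in the POST body. The query parameters are the same ones as in the URL shown earlier.

```python
from urllib.parse import urlencode

BASE_URL = "https://translate.google.com/translate_a/single"
DT_FLAGS = ["at", "bd", "ex", "ld", "md", "qca", "rw", "rm", "ss", "t"]
MAX_GET_CHARS = 1000  # hypothetical cutoff, tune it for your own data


def build_request(tk, content, target="en"):
    """Return (method, url, body) for one translation request."""
    params = [("client", "webapp"), ("sl", "auto"), ("tl", target), ("hl", "zh-CN")]
    params += [("dt", flag) for flag in DT_FLAGS]
    params += [("ssel", "5"), ("tsel", "5"), ("kc", "1"), ("tk", tk)]
    if len(content) <= MAX_GET_CHARS:
        # Short text: percent-encode q and keep everything in the URL.
        return "GET", BASE_URL + "?" + urlencode(params + [("q", content)]), None
    # Long text: keep q out of the URL and send it as form data instead
    # (assumption: the endpoint reads q from the request body).
    return "POST", BASE_URL + "?" + urlencode(params), {"q": content}
```

With the requests library the result could then be sent along the lines of requests.request(method, url, data=body, headers=header, timeout=30). The point of urlencode here is that spaces and non-ASCII characters in content get escaped instead of being dropped raw into the query string.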
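In my tests the translation works, but this approach still falls short for batch translation in a short period: after roughly 20 consecutive translations within a short time the IP gets blocked. The block lasts about 5 minutes and is then lifted automatically. If you need to make batch calls in a short time you can try a proxy pool. I tried one as well, but, possibly because of the proxies themselves, I could not reach this endpoint through them. If you are interested, see whether you can find a better way.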
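If a proxy pool is not an option, the slow but simple alternative is to pace the requests and wait out the block when it hits. Below is a small Python 3 sketch of that idea. The function name translate_batch and the 3-second pause are my own guesses; the 300-second wait only mirrors the roughly 5-minute block described above, and translate_one stands for whatever function performs a single request and raises on failure.

```python
import time


def translate_batch(items, translate_one, pause=3.0, ban_wait=300.0,
                    max_retries=2, sleep=time.sleep):
    """Translate items one by one, pausing between calls and after failures."""
    results = []
    for item in items:
        for attempt in range(max_retries + 1):
            try:
                results.append(translate_one(item))
                break
            except Exception:
                # Real code should catch something narrower, for example the
                # HTTP error raised by resp.raise_for_status().
                if attempt == max_retries:
                    raise
                # Assume the failure means the IP is blocked and wait it out.
                sleep(ban_wait)
        # Pace the requests so the block is less likely to trigger at all.
        sleep(pause)
    return results
```

The sleep parameter is only there so the waiting can be swapped out in tests. Whether a 3-second pause is enough to stay under the limit is something you would have to find out by experiment.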