Reverse-engineering the JavaScript behind Baidu Translate for a scraper

This post walks through reverse-engineering the JavaScript that signs Baidu Translate requests.

Problem

Enter 汉语中文 in the translation box: https://fanyi.baidu.com/?aldtype=16047#zh/en/汉语中文
The page sends its request to https://fanyi.baidu.com/v2transapi . It is a POST request that carries these parameters:

from: zh
to: en
query: 汉语中文
simple_means_flag: 3
sign: 523457.204784
token: 261d93745136045b7ea6a1badc54a075

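Testing shows that sign is the only parameter that changes between queries, and that an incorrect sign changes the result.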

Analysis

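A global search for the endpoint's key string, v2transapi, turns up a single JS file, https://fanyi.bdstatic.com/static/translation/pkg/index_9b62d56.js:formatted . Searching for v2transapi again inside that file gives the result in the figure below:
[Figure 1]
This exposes the parameters assembled for the request, including sign: m(a). Setting a breakpoint on that line and stepping in leads to the following JS:
[Figure 2]
This is the code we need. It is easier to follow with some front-end background; otherwise, set breakpoints in this part of the JS and trace how the arguments are passed along. Much of the code turns out to be checks and validation. The key code, once extracted, is the boxed region in the figure above. The key part is: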

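# u = null !== i ? i : (i = window[l] || "") || "";  (var i = null; is declared further down)
# null !== i is false, so (i = window[l] || "") runs and yields 320305.131321201
# l is the string "gtk" built in the code; over many runs window[l] was always the fixed value 320305.131321201 (so substitute it directly and call the JS to get the result)
# this simplifies to u = false ? null : 320305.131321201 || "", so u = 320305.131321201, a fixed value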
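The name l and the two operation strings in the signing function are built with String.fromCharCode, so they can be decoded outside the browser. The sketch below does that in Python, which shows that window[l] is window.gtk, and then reads gtk out of page HTML with a regular expression. The helper names, the regex, and the sample HTML are illustrative assumptions: they presume the page sets the value with an inline assignment of the form window.gtk = '...', which should be confirmed against the real page source.

```python
import re


def from_char_codes(*codes):
    """Python counterpart of JavaScript's String.fromCharCode for BMP code units."""
    return "".join(chr(code) for code in codes)


# Character codes copied from the signing function.
key_name = from_char_codes(103, 116, 107)
mix_f = from_char_codes(43, 45, 97) + from_char_codes(94, 43, 54)
mix_d = (from_char_codes(43, 45, 51) + from_char_codes(94, 43, 98)
         + from_char_codes(43, 45, 102))
print(key_name, mix_f, mix_d)  # gtk +-a^+6 +-3^+b+-f

# Assumption: the page embeds gtk as an inline assignment like window.gtk = '...'.
GTK_PATTERN = re.compile(r"window\.gtk\s*=\s*['\"]([\d.]+)['\"]")


def extract_gtk(html):
    """Return the gtk string found in the HTML, or None if the pattern is absent."""
    match = GTK_PATTERN.search(html)
    return match.group(1) if match else None


# Made-up sample that mimics the assumed inline script.
sample_html = "<script>window.gtk = '320305.131321201';</script>"
print(extract_gtk(sample_html))  # 320305.131321201
```

Reading gtk from the page instead of hard-coding it would keep the signing code working if the value ever changes, at the cost of one extra request.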

Solution

Replace window[l] with 320305.131321201, move var i = null; above the function that uses it, and tidy the JS a little:

jsCode = """
    function a(r) {
        if (Array.isArray(r)) {
            for (var o = 0, t = Array(r.length); o < r.length; o++)
                t[o] = r[o];
            return t
        }
        return Array.from(r)
    }
    function n(r, o) {
        for (var t = 0; t < o.length - 2; t += 3) {
            var a = o.charAt(t + 2);
            a = a >= "a" ? a.charCodeAt(0) - 87 : Number(a),
            a = "+" === o.charAt(t + 1) ? r >>> a : r << a,
            r = "+" === o.charAt(t) ? r + a & 4294967295 : r ^ a
        }
        return r
    }
    var i = null;
    function e(r) {
        var t = r.length;
        t > 30 && (r = "" + r.substr(0, 10) + r.substr(Math.floor(t / 2) - 5, 10) + r.substr(-10, 10))

        var u = void 0, l = "" + String.fromCharCode(103) + String.fromCharCode(116) + String.fromCharCode(107);
        
        u = null !== i ? i : (i = '320305.131321201' || "") || "";
        for (var d = u.split("."), m = Number(d[0]) || 0, s = Number(d[1]) || 0, S = [], c = 0, v = 0; v < r.length; v++) {
            var A = r.charCodeAt(v);
            128 > A ? S[c++] = A : (2048 > A ? S[c++] = A >> 6 | 192 : (55296 === (64512 & A) && v + 1 < r.length && 56320 === (64512 & r.charCodeAt(v + 1)) ? (A = 65536 + ((1023 & A) << 10) + (1023 & r.charCodeAt(++v)),
            S[c++] = A >> 18 | 240,
            S[c++] = A >> 12 & 63 | 128) : S[c++] = A >> 12 | 224,
            S[c++] = A >> 6 & 63 | 128),
            S[c++] = 63 & A | 128)
        }
        for (var p = m, F = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(97) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(54)), D = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(51) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(98)) + ("" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(102)), b = 0; b < S.length; b++)
            p += S[b],
            p = n(p, F);
        return p = n(p, D),
        p ^= s,
        0 > p && (p = (2147483647 & p) + 2147483648),
        p %= 1e6,
        p.toString() + "." + (p ^ m)
    }
"""
# execjs comes from the PyExecJS package and needs a JavaScript runtime (e.g. Node.js) installed
import execjs
query = '汉语中文'
sign = execjs.compile(jsCode).call("e", query)
print(sign)

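The resulting sign is correct. Different queries produce different signs, while the same query always produces the same sign.
Pass sign into the code below (every request parameter in it is required, otherwise the request fails with an error) to get the correct translation: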

import requests
import json
data = {
    "from": "zh",
    "to": "en",
    "query":query,
    "transtype": "translang",
    "simple_means_flag": "3",
    "sign": sign,
    "token": "261d93745136045b7ea6a1badc54a075",
}
headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'Content-Length': '154',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie': ('BAIDUID=BADB5E5D72667C5F54A34848245C74AB:FG=1; BIDUPSID=BADB5E5D72667C5F54A34848245C74AB; PSTM=1534988617; H_WISE_SIDS=124614_125822_108269_124694_125945_114745_123552_125629_120162_125735_118894_118867_118845_118820_118788_107311_125006_124978_117428_125776_125652_124636_124897_124939_125487_125171_125289_125710_125285_124938_124683_124030_110085_123289_125645_125428_125875; REALTIME_TRANS_SWITCH=1; FANYI_WORD_SWITCH=1; HISTORY_SWITCH=1; SOUND_SPD_SWITCH=1; SOUND_PREFER_SWITCH=1; BDUSS=0N6UHl4VTVQR1k2d0k1YWtMRmRRaTJhLWFpQk1mLWZwS3YxU01UTW0tMEtNaEpjQVFBQUFBJCQAAAAAAAAAAAEAAACrlRIPNDM0ODkwOTc0AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAql6lsKpepbS; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1542288873,1542333802,1542365440,1542422410; __cfduid=dde3763f9db2ea63b180aa1372d14e3921542882903; BDSFRCVID=LNuOJeC62mNCkaT90LMd--Y9EeTt9t7TH6aItwTAQU9bjkQNDx7JEG0PfU8g0KubJmOPogKKKgOTHICF_2uxOjjg8UtVJeC6EG0P3J; H_BDCLCKID_SF=tR-tVC0-fCI3fP36q4co-4FjbqOH54cXHDOH0Kvzt-55OR5Jj65NW-FtMGtf2j3kyCT-hInH-RF5jDn-3MA--fF1hmcayhjMXbvXQp-2KhTCsq0x0-6le-bQypoa2b-HBIOMahvc5h7xOhIC05C-jTvBDG_eq-JQ2C7WsJjs24ThD6rnhPF3K4rbKP6-3MJO3b7zon6lJbrSSpO_5n3V3tkEKh6tb6oP-N7CohFLtC0BhK0Gj6Rjh-40b2TJK4625CoJsJOOaC7tqR7zKPnhK4t8HfJfhtJfHJue_II5JjuKhI-wjTD_D6J3eHtqqh5gW57Z0lOnMp05jt3uh6J_hKurXpKJ0JQ2aCDfhC5RLKOSVIO_e6LbejOWjHDs5-7ybCPXLn58Kb5_f5rnhPF3yMvDKP6-3MJO3b7JbpTvJbrSOUob-jrl5f63247XtbTwKRnlohFLtD8KMI-GjjRMK4_SMUoHetrK-D5XQbC8Kb7VbTC4LfnkbfJBD4-jXJJC2JvXQpOy5R5lfM7NyUOO5fI7yajK25KHbe5h2h5v3fOASIO63bbpQT8rbfDOK5Oib4j7htOnab3vOpvTXpO1yftzBN5thURB2DkO-4bCWJ5TMl5jDh05y6TXjH-ttTLDJn3fL-082R6oDn8k-PnVePF3LtnZKxtqtDDOBDJ7QI3lDbk6557ZbbDqhRJ8t4TnWncKWho1bIb_MR5cK65IhPrb2HJ405OT2j-O0KJcbRo0Hpb4hPJvyUPsXnO7BxJlXbrtXp7_2J0WStbKy4oTjxL1Db0eBjIHtnFJVI05fCvKeJbYK4oj5KCyMfca5C6JKCOa3RA8Kb7VbpR85fnkbfJBDxcw3-nKQnKj2p6eH66Zbq5N3b8B55-7yajKBR3fQg6I-CTt2JvG8lb9jU6pQT8rbR_OK5OibCrZWl7xab3vOpvTXpO1yftzBN5thURB2DkO-4bCWJ5TMl5jDh05y6ISKx-_J5te3f; MCITY=-%3A; '
               'H_PS_PSSID=1438_21081_18559_28019_26350; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; delPer=0; PSINO=5; locale=zh; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1544166294; from_lang_often=%5B%7B%22value%22%3A%22dan%22%2C%22text%22%3A%22%u4E39%u9EA6%u8BED%22%7D%2C%7B%22value%22%3A%22en%22%2C%22text%22%3A%22%u82F1%u8BED%22%7D%2C%7B%22value%22%3A%22zh%22%2C%22text%22%3A%22%u4E2D%u6587%22%7D%5D; to_lang_often=%5B%7B%22value%22%3A%22zh%22%2C%22text%22%3A%22%u4E2D%u6587%22%7D%2C%7B%22value%22%3A%22en%22%2C%22text%22%3A%22%u82F1%u8BED%22%7D%5D'),
    'Host': 'fanyi.baidu.com',
    'Origin': 'https://fanyi.baidu.com',
    'Referer': 'https://fanyi.baidu.com/?aldtype=16047',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}
url = 'https://fanyi.baidu.com/v2transapi'
res = requests.post(url=url, headers=headers, data=data)

result = json.loads(res.text).get("trans_result").get('data')[0]
item = dict()
item[query] = result.get('dst')  # key the translation by the original query text
print(item)  # {'汉语中文': 'Chinese language'}
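The parsing above assumes the request succeeded. Since a wrong sign makes the endpoint return codes such as 997 or 998 instead of a translation, a more defensive version might check for an error code first. The sketch below is one way to do that. The function name extract_translation is hypothetical, and the 'error' and 'errno' key names are assumptions about the failure payload, so inspect a real failed response before relying on them.

```python
def extract_translation(payload):
    """Return the first translated string from a decoded v2transapi response.

    Assumption: a failed request carries a numeric code under an 'error' or
    'errno' key (for example 997 or 998 when the sign is wrong).
    """
    code = payload.get("error", payload.get("errno"))
    if code:
        raise ValueError("translation request failed with code %s" % code)
    data = (payload.get("trans_result") or {}).get("data") or []
    if not data:
        raise ValueError("response contains no trans_result data")
    return data[0].get("dst")


# Made-up payloads shaped like the success case parsed above and an assumed failure.
ok_payload = {"trans_result": {"data": [{"src": "汉语中文", "dst": "Chinese language"}]}}
bad_payload = {"error": 997}

print(extract_translation(ok_payload))
try:
    extract_translation(bad_payload)
except ValueError as exc:
    print(exc)
```

Raising on an error code makes a stale sign, token, or cookie show up immediately rather than as an AttributeError on a missing trans_result.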

A Jianshu user has also analyzed this JS and obtains the same sign:

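# signCode = 'function a(r,o){for(var t=0;t<o.length-2;t+=3){var a=o.charAt(t+2);a=a>="a"?a.charCodeAt(0)-87:Number(a),a="+"===o.charAt(t+1)?r>>>a:r<<a,r="+"===o.charAt(t)?r+a&4294967295:r^a}return r}var C=null;var hash=function(r,_gtk){var o=r.length;o>30&&(r=""+r.substr(0,10)+r.substr(Math.floor(o/2)-5,10)+r.substr(-10,10));var t=void 0,t=null!==C?C:(C=_gtk||"")||"";for(var e=t.split("."),h=Number(e[0])||0,i=Number(e[1])||0,d=[],f=0,g=0;g<r.length;g++){var m=r.charCodeAt(g);128>m?d[f++]=m:(2048>m?d[f++]=m>>6|192:(55296===(64512&m)&&g+1<r.length&&56320===(64512&r.charCodeAt(g+1))?(m=65536+((1023&m)<<10)+(1023&r.charCodeAt(++g)),d[f++]=m>>18|240,d[f++]=m>>12&63|128):d[f++]=m>>12|224,d[f++]=m>>6&63|128),d[f++]=63&m|128)}for(var S=h,u="+-a^+6",l="+-3^+b+-f",s=0;s<d.length;s++)S+=d[s],S=a(S,u);return S=a(S,l),S^=i,0>S&&(S=(2147483647&S)+2147483648),S%=1e6,S.toString()+"."+(S^h)}'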
# import js2py
# call = js2py.eval_js(signCode)
# sign = call("汉语中文",'320305.131321201')
# print(sign)

Summary

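1. Error codes such as 997 and 998 returned along the way are all caused by an incorrect sign.
2. The bitwise operations inside the for loops remain the hard part to port. Porting them is not recommended; calling the JS and getting the correct result is the better option.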
Questions are welcome in the comments!
