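I previously shared "A Gentle Introduction to Crawler Encryption Algorithms" (《爬虫加密算法浅入浅出》). I recently ran into a few encrypted endpoints in practice, so here is a write-up of how I handled them.
Crawl workflow
1. Collect the streamer list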
Page: (sample link: https://market.m.taobao.com/app/mtb/personal-homepage/pages/index/index.html?disableNav=YES&userId=758644019&source=share_from_home&sourceType=other&suid=a4e502b3-abd9-45a9-a8aa-18d98a560722&ut_sk=1.X6f6tk8CoM4DAC7d3WV%2FDFqI_21646297_1632901072998.Copy.gerenzhuye_other_new&un=a5ba79c8d2b96b792040bd82ddc2e1ea&share_crt_v=1&spm=a2159r.13376460.0.0&sp_tk=YlNZUVhyZHZxblM=&cpp=1&shareurl=true&short_name=h.fdMjiEi&bxsign=scdUHPesXXRP0-kxi9rEHcwdEJcb1sscJHIFmscTQlsU0erSxTzxD9D27T2e2qzgurlbhmuhpS6M2gbT-Vefz33GCMVkj2cRz6J_sGkYogh4uQ&sm=949a0d&app=chrome)
Data API, where the "sign" parameter is a computed signature: (sample link: https://h5api.m.taobao.com/h5/mtop.taobao.maserati.guangguang.lives/1.0/?jsv=2.6.1&appKey=12574478&t=1633001589749&sign=562888cbe8e336a6b5d2acb64bb148f1&v=1.0&api=mtop.taobao.maserati.guangguang.lives&preventFallback=true&type=jsonp&dataType=jsonp&callback=mtopjsonp4&data=%7B%22source%22%3A%22share_from_home%22%2C%22type%22%3A%22h5%22%2C%22tab%22%3A%22live%22%2C%22pageSize%22%3A10%2C%22page%22%3A1%2C%22userId%22%3A%22758644019%22%2C%22userIdString%22%3A%22758644019%22%2C%22contentFilter%22%3A1%7D)
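JavaScript that computes "sign": (sample link: https://g.alicdn.com/mtb/personal-homepage/0.5.2/web/index.js). Conclusion: "sign" is the MD5 hash of the Token from the Cookie, the request time, the appKey, and the request data, joined with "&" in that order.
Core code: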
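import java.math.BigInteger;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Fragment: i (page number), userId, downloader, Request and Page come from the surrounding crawler class.
// The enclosing method must handle NoSuchAlgorithmException and UnsupportedEncodingException.
long currentTime = System.currentTimeMillis();
String appKey = "12574478";
String data = "{\"source\":\"share_from_home\",\"type\":\"h5\",\"tab\":\"live\",\"pageSize\":10,\"page\":" + i + ",\"userId\":\"" + userId + "\",\"userIdString\":\"" + userId + "\",\"contentFilter\":1}";
// sign = MD5(token & time & appKey & data)
byte[] signByte = MessageDigest.getInstance("MD5").digest(
    (TaoliveCrawlerTest.getH5Token()
        + "&" + currentTime
        + "&" + appKey
        + "&" + data).getBytes(StandardCharsets.UTF_8));
// Zero-pad to 32 hex digits: BigInteger.toString(16) alone drops leading zeros.
String sign = String.format("%032x", new BigInteger(1, signByte));
String url = "https://h5api.m.taobao.com/h5/mtop.taobao.maserati.guangguang.lives/1.0/?jsv=2.6.1&appKey=" + appKey
    + "&t=" + currentTime + "&sign=" + sign + "&v=1.0&api=mtop.taobao.maserati.guangguang.lives&preventFallback=true&type=jsonp&dataType=jsonp&callback=mtopjsonp4"
    + "&data=" + URLEncoder.encode(data, "UTF-8");
// Send the request with the logged-in Cookie; the token used in the signature comes from that same Cookie.
Request request = new Request();
request.setUrl(url);
request.setCookie(TaoliveCrawlerTest.TAOLIVE_COOKIE);
Page page = downloader.download(request);
String content = page.getRawText();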
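The fragment above depends on helpers from the surrounding project, so it cannot be run on its own. As a rough, self-contained sketch of just the signing step: the class name MtopSignSketch, its method names, and the placeholder token below are made up for illustration; only the token & time & appKey & data layout and the MD5 step are taken from the code above.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class MtopSignSketch {

    /** Lowercase hex MD5 of a UTF-8 string, zero-padded to 32 characters. */
    static String md5Hex(String input) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Joins the four parts with "&" in the order used above, then hashes the result. */
    static String sign(String token, long timestampMillis, String appKey, String data) {
        return md5Hex(token + "&" + timestampMillis + "&" + appKey + "&" + data);
    }

    public static void main(String[] args) {
        String token = "0123456789abcdef0123456789abcdef"; // placeholder, not a real token
        String data = "{\"liveId\":\"329062842940\",\"creatorId\":null}";
        System.out.println(sign(token, System.currentTimeMillis(), "12574478", data));
    }
}
```

The zero-padding matters because roughly one digest in sixteen starts with a zero nibble; without padding, such a signature would come out 31 characters long and would not match a 32-character hex digest.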
2. Collect the live stream ID
Page: (sample link: https://h5.m.taobao.com/taolive/video.html?id=329062842940&sharerId=669291096&anchorGuard=true&timestamp=1633001824843&signature=eb23c8f3af1413c99905498fb394f37d&livesource=guard&cp_origin=taobaozhibo%7Ca2141.8001249%7C%7B%22account_id%22%3A%22475445741%22%2C%22app_key%22%3A%2221646297%22%2C%22feed_id%22%3A%22329062842940%22%2C%22os%22%3A%22android%22%2C%22spm-cnt%22%3A%22a2141.8001249%22%7D&sourceType=talent&suid=73d3a8b5-787e-4143-bfab-79e6af619d04&ut_sk=1.X6f6tk8CoM4DAC7d3WV%2FDFqI_21646297_1633001828291.Copy.ShareGlobalNavigation_zhibo&un=a5ba79c8d2b96b792040bd82ddc2e1ea&share_crt_v=1&spm=a2159r.13376460.0.0&sp_tk=MTVIbVhyaUhOcmU=&cpp=1&shareurl=true&short_name=h.fWIpuE7&bxsign=scdOvH_QMZx_BYcCOu6fdM3XLiVxnc8S18i31k_zkvPEvg-TKDtxN-NONreWKXOSqVkCjbacroe77znE-AYLIv6DSyB2h6VwxxhgDddjtpsnS4&sm=79bcd2&app=chrome)
Data API, where "sign" is computed the same way as in step 1: (sample link: https://h5api.m.taobao.com/h5/mtop.mediaplatform.live.livedetail/4.0/?jsv=2.6.1&appKey=12574478&t=1633001955314&sign=9ec6d224ca253ecb5bd3182602fd2ad8&AntiFlood=true&AntiCreep=true&api=mtop.mediaplatform.live.livedetail&v=4.0&preventFallback=true&type=jsonp&dataType=jsonp&callback=mtopjsonp1&data=%7B%22liveId%22%3A%22329062842940%22%2C%22creatorId%22%3Anull%7D)
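Both data APIs are requested with type=jsonp and a callback parameter (mtopjsonp1 in the sample link above), so the response body presumably arrives wrapped in a function call rather than as bare JSON. A minimal sketch of stripping that wrapper before handing the text to a JSON parser; the class name, method name, and sample payload are illustrative and not taken from a real response.

```java
final class JsonpSketch {

    /** Returns the text between the first "(" and the last ")", or the input if it is already bare JSON. */
    static String unwrap(String body) {
        String trimmed = body.trim();
        // Bare JSON starts with an object or array; leave it untouched.
        if (trimmed.startsWith("{") || trimmed.startsWith("[")) {
            return trimmed;
        }
        int open = trimmed.indexOf('(');
        int close = trimmed.lastIndexOf(')');
        if (open < 0 || close < open) {
            return trimmed; // no recognizable wrapper
        }
        return trimmed.substring(open + 1, close).trim();
    }

    public static void main(String[] args) {
        String body = " mtopjsonp1({\"ret\":[\"SUCCESS\"],\"data\":{}})"; // made-up payload
        System.out.println(unwrap(body)); // prints {"ret":["SUCCESS"],"data":{}}
    }
}
```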
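3. Collect live bullet comments
Data API: (sample link: https://tb-live-message.alibaba.com/live/message/273191f6-87b7-4d6d-9fb0-2680d01acae9/0/1633000895). Part of the returned data ends with "=" or "==", which suggests Base64 encoding; trying a Base64 decode confirmed the guess.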
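Trailing "=" padding is only a hint, so the quickest check is to attempt a decode and see whether readable text comes out. A minimal sketch using java.util.Base64; the class name and the sample string are made up (this is not a real bullet-comment payload), and treating the decoded bytes as UTF-8 text is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Base64Sketch {

    /** Returns the decoded text, or null when the input is not valid Base64. */
    static String tryDecode(String encoded) {
        try {
            byte[] raw = Base64.getDecoder().decode(encoded);
            return new String(raw, StandardCharsets.UTF_8);
        } catch (IllegalArgumentException e) {
            // Thrown for characters outside the Base64 alphabet or for bad padding.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(tryDecode("aGVsbG8="));     // prints hello
        System.out.println(tryDecode("not base64!"));  // prints null
    }
}
```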
Page: (sample link: https://index.baidu.com/v2/main/index.html#/trend/%E8%BF%AA%E8%BF%A6?words=%E8%BF%AA%E8%BF%A6)
Data API: (sample link: https://index.baidu.com/api/FeedSearchApi/getFeedIndex?word=[[%7B%22name%22:%22%E8%BF%AA%E8%BF%A6%22,%22wordType%22:1%7D]]&area=0&days=30)
Data mapping API: (sample link: https://index.baidu.com/Interface/ptbk?uniqid=976c01e5a769b63c8ea364df1afbfda2)
JavaScript that decrypts the returned result: (link: https://index.baidu.com/v2/static/js/main.914903fe3724b8f1daa8.js)
Core code:
import java.util.HashMap;
import java.util.Map;
import com.alibaba.fastjson.JSONObject; // assumed: the parseObject/getJSONArray calls below match fastjson

// Fragment: getContent(url) and LOG are defined elsewhere in the surrounding class.
public void crawler() {
    // Step 1: fetch the obfuscated index data together with its uniqid.
    String indexUrl = "https://index.baidu.com/api/FeedSearchApi/getFeedIndex?word=[[%7B%22name%22:%22%E8%BF%AA%E8%BF%A6%22,%22wordType%22:1%7D]]&area=0&days=30";
    String indexContent = getContent(indexUrl);
    JSONObject indexJson = JSONObject.parseObject(indexContent);
    JSONObject indexOneJson = indexJson.getJSONObject("data").getJSONArray("index").getJSONObject(0);
    String startDate = indexOneJson.getString("startDate");
    String endDate = indexOneJson.getString("endDate");
    String dateIncreaseStep = indexOneJson.getString("type");
    String dataBeforeStr = indexOneJson.getString("data");
    String uniqid = indexJson.getJSONObject("data").getString("uniqid");
    // Step 2: fetch the character mapping that belongs to this uniqid.
    String uniqUrl = "https://index.baidu.com/Interface/ptbk?uniqid=" + uniqid;
    String uniqContent = getContent(uniqUrl);
    String uniqStr = JSONObject.parseObject(uniqContent).getString("data");
    LOG.info("StartDate: {}, EndDate: {}, DateIncreaseStep: {}, Data: {}.",
            startDate, endDate, dateIncreaseStep, decrypt(dataBeforeStr, uniqStr));
}

// Substitution decode: the first half of uniqStr lists the cipher characters,
// the second half lists the plain characters at the same positions.
public String decrypt(String dataBeforeStr, String uniqStr) {
    int half = uniqStr.length() / 2;
    Map<Character, Character> mapping = new HashMap<>();
    for (int i = 0; i < half; i++) {
        mapping.put(uniqStr.charAt(i), uniqStr.charAt(half + i));
    }
    StringBuilder dataAfterBuf = new StringBuilder(dataBeforeStr.length());
    for (int i = 0; i < dataBeforeStr.length(); i++) {
        Character plain = mapping.get(dataBeforeStr.charAt(i));
        if (plain == null) {
            // Fail loudly instead of silently appending the text "null".
            throw new IllegalArgumentException("No mapping for character: " + dataBeforeStr.charAt(i));
        }
        dataAfterBuf.append(plain.charValue());
    }
    return dataAfterBuf.toString();
}
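To make the substitution concrete, here is a small self-contained sketch of the same idea as decrypt above, run on a toy key. The class name, the key, and the cipher text are invented for illustration; real keys come from the ptbk interface and real cipher text from the getFeedIndex response.

```java
import java.util.HashMap;
import java.util.Map;

final class PtbkDecodeSketch {

    /** Maps each cipher character through the table built from the two halves of the key. */
    static String decode(String cipher, String key) {
        int half = key.length() / 2;
        Map<Character, Character> table = new HashMap<>();
        for (int i = 0; i < half; i++) {
            // key[i] is a cipher character, key[half + i] is its plain counterpart.
            table.put(key.charAt(i), key.charAt(half + i));
        }
        StringBuilder plain = new StringBuilder(cipher.length());
        for (int i = 0; i < cipher.length(); i++) {
            Character mapped = table.get(cipher.charAt(i));
            if (mapped == null) {
                throw new IllegalArgumentException("No mapping for character: " + cipher.charAt(i));
            }
            plain.append(mapped.charValue());
        }
        return plain.toString();
    }

    public static void main(String[] args) {
        // Toy key: a->0, b->1, c->9, d->','
        String key = "abcd019,";
        System.out.println(decode("bcdba", key)); // prints 19,10
    }
}
```

With this toy key, "bcdba" decodes to the comma-separated values 19 and 10, which is the shape one would expect for a daily index series.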
Crawlers may torment me a thousand times over, yet I still treat them like my first love!