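Crawler Encryption Algorithms in Practice (Taobao Live + Baidu Index)

Background

I previously shared "A Gentle Introduction to Crawler Encryption Algorithms". Recently I ran into a few real cases of encrypted parameters and responses, so I am sharing them here.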

Taobao Live

Collection workflow

1. Collect the streamer list

Page: (example link: https://market.m.taobao.com/app/mtb/personal-homepage/pages/index/index.html?disableNav=YES&userId=758644019&source=share_from_home&sourceType=other&suid=a4e502b3-abd9-45a9-a8aa-18d98a560722&ut_sk=1.X6f6tk8CoM4DAC7d3WV%2FDFqI_21646297_1632901072998.Copy.gerenzhuye_other_new&un=a5ba79c8d2b96b792040bd82ddc2e1ea&share_crt_v=1&spm=a2159r.13376460.0.0&sp_tk=YlNZUVhyZHZxblM=&cpp=1&shareurl=true&short_name=h.fdMjiEi&bxsign=scdUHPesXXRP0-kxi9rEHcwdEJcb1sscJHIFmscTQlsU0erSxTzxD9D27T2e2qzgurlbhmuhpS6M2gbT-Vefz33GCMVkj2cRz6J_sGkYogh4uQ&sm=949a0d&app=chrome)

Data API: (example link: https://h5api.m.taobao.com/h5/mtop.taobao.maserati.guangguang.lives/1.0/?jsv=2.6.1&appKey=12574478&t=1633001589749&sign=562888cbe8e336a6b5d2acb64bb148f1&v=1.0&api=mtop.taobao.maserati.guangguang.lives&preventFallback=true&type=jsonp&dataType=jsonp&callback=mtopjsonp4&data=%7B%22source%22%3A%22share_from_home%22%2C%22type%22%3A%22h5%22%2C%22tab%22%3A%22live%22%2C%22pageSize%22%3A10%2C%22page%22%3A1%2C%22userId%22%3A%22758644019%22%2C%22userIdString%22%3A%22758644019%22%2C%22contentFilter%22%3A1%7D). The "sign" parameter in this URL is a computed signature.

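Script that computes the "sign" parameter: (example link: https://g.alicdn.com/mtb/personal-homepage/0.5.2/web/index.js). Conclusion: "sign" is the MD5 hash of the token from the Cookie, the request time, the appKey, and the request data, joined with "&" in that order.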

Core code:

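            // Fragment from inside a paging loop: `i` (page number), `userId` and `downloader` come from the enclosing code.
            // Needs java.math.BigInteger, java.net.URLEncoder, java.nio.charset.StandardCharsets, java.security.MessageDigest.
            long currentTime = System.currentTimeMillis();
            String data = "{\"source\":\"share_from_home\",\"type\":\"h5\",\"tab\":\"live\",\"pageSize\":10,\"page\":" + i + ",\"userId\":\"" + userId + "\",\"userIdString\":\"" + userId + "\",\"contentFilter\":1}";
            // sign = MD5(token & time & appKey & data)
            byte[] signByte = MessageDigest.getInstance("MD5").digest(
                    (TaoliveCrawlerTest.getH5Token()
                            + "&" + currentTime
                            + "&" + "12574478"
                            + "&" + data).getBytes(StandardCharsets.UTF_8));
            // Pad to 32 hex digits: BigInteger.toString(16) alone drops leading zeros and would yield a short, invalid sign.
            String sign = String.format("%032x", new BigInteger(1, signByte));
            String url = "https://h5api.m.taobao.com/h5/mtop.taobao.maserati.guangguang.lives/1.0/?jsv=2.6.1&appKey=12574478"
                    + "&t=" + currentTime + "&sign=" + sign + "&v=1.0&api=mtop.taobao.maserati.guangguang.lives&preventFallback=true&type=jsonp&dataType=jsonp&callback=mtopjsonp4"
                    + "&data=" + URLEncoder.encode(data, "UTF-8");

            // Send the request with the logged-in Cookie; the token used in the sign comes from that same Cookie.
            Request request = new Request();
            request.setUrl(url);
            request.setCookie(TaoliveCrawlerTest.TAOLIVE_COOKIE);
            Page page = downloader.download(request);
            String content = page.getRawText();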
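
The signing step in the fragment above can also be pulled out into a small standalone helper, which makes it easier to test. This is only a minimal sketch: the class and method names are hypothetical, and `tokenFromCookie` rests on an assumption the article does not confirm, namely that the token is the part before the underscore in a cookie named `_m_h5_tk`.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper, not part of the original crawler. */
final class MtopSignSketch {

    /** Returns the MD5 digest of the text as exactly 32 lowercase hex characters. */
    static String md5Hex(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            // %032x left-pads with zeros, so digests that start with 0 keep their length.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /** Joins token, time, appKey and data with '&', in the same order as the fragment above. */
    static String sign(String token, long timeMillis, String appKey, String data) {
        return md5Hex(token + "&" + timeMillis + "&" + appKey + "&" + data);
    }

    /**
     * ASSUMPTION: the token is the part before '_' in the value of a cookie
     * named _m_h5_tk. Returns null when no such cookie is present.
     */
    static String tokenFromCookie(String cookieHeader) {
        for (String pair : cookieHeader.split(";")) {
            String[] kv = pair.trim().split("=", 2);
            if (kv.length == 2 && kv[0].equals("_m_h5_tk")) {
                return kv[1].split("_")[0];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up cookie value, for illustration only.
        String token = tokenFromCookie("x=1; _m_h5_tk=abc_1633001589749");
        System.out.println(sign(token, System.currentTimeMillis(), "12574478", "{\"page\":1}"));
    }
}
```

Keeping the hashing in one place also guards against the leading-zero problem: an MD5 value such as that of "a" starts with 0, and an unpadded hex string would be one character short.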

2. Collect the live stream ID

Page: (example link: https://h5.m.taobao.com/taolive/video.html?id=329062842940&sharerId=669291096&anchorGuard=true&timestamp=1633001824843&signature=eb23c8f3af1413c99905498fb394f37d&livesource=guard&cp_origin=taobaozhibo%7Ca2141.8001249%7C%7B%22account_id%22%3A%22475445741%22%2C%22app_key%22%3A%2221646297%22%2C%22feed_id%22%3A%22329062842940%22%2C%22os%22%3A%22android%22%2C%22spm-cnt%22%3A%22a2141.8001249%22%7D&sourceType=talent&suid=73d3a8b5-787e-4143-bfab-79e6af619d04&ut_sk=1.X6f6tk8CoM4DAC7d3WV%2FDFqI_21646297_1633001828291.Copy.ShareGlobalNavigation_zhibo&un=a5ba79c8d2b96b792040bd82ddc2e1ea&share_crt_v=1&spm=a2159r.13376460.0.0&sp_tk=MTVIbVhyaUhOcmU=&cpp=1&shareurl=true&short_name=h.fWIpuE7&bxsign=scdOvH_QMZx_BYcCOu6fdM3XLiVxnc8S18i31k_zkvPEvg-TKDtxN-NONreWKXOSqVkCjbacroe77znE-AYLIv6DSyB2h6VwxxhgDddjtpsnS4&sm=79bcd2&app=chrome)

Data API: (example link: https://h5api.m.taobao.com/h5/mtop.mediaplatform.live.livedetail/4.0/?jsv=2.6.1&appKey=12574478&t=1633001955314&sign=9ec6d224ca253ecb5bd3182602fd2ad8&AntiFlood=true&AntiCreep=true&api=mtop.mediaplatform.live.livedetail&v=4.0&preventFallback=true&type=jsonp&dataType=jsonp&callback=mtopjsonp1&data=%7B%22liveId%22%3A%22329062842940%22%2C%22creatorId%22%3Anull%7D). The "sign" parameter is again a computed signature, produced the same way as in step 1.
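
Both data APIs are requested with type=jsonp and a callback name (mtopjsonp4, mtopjsonp1), so the body presumably arrives wrapped in a call to that callback and has to be unwrapped before it can be parsed as JSON. The following is a minimal sketch under that assumption; the class name, method name and sample payload are hypothetical.

```java
/** Hypothetical helper, not part of the original crawler. */
final class JsonpSketch {

    /**
     * Strips a JSONP wrapper such as mtopjsonp1({...}) and returns the JSON inside.
     * Text that is not wrapped in the given callback is returned trimmed but unchanged.
     */
    static String unwrap(String body, String callback) {
        String text = body.trim();
        if (text.endsWith(";")) {
            // Tolerate an optional trailing semicolon after the call.
            text = text.substring(0, text.length() - 1).trim();
        }
        String prefix = callback + "(";
        if (text.startsWith(prefix) && text.endsWith(")")) {
            return text.substring(prefix.length(), text.length() - 1).trim();
        }
        return text;
    }

    public static void main(String[] args) {
        // Made-up payload, for illustration only.
        System.out.println(unwrap(" mtopjsonp1({\"liveId\":\"329062842940\"})", "mtopjsonp1"));
    }
}
```

Matching on the exact callback name, rather than on the first "(" in the text, avoids cutting into plain JSON whose string values happen to contain parentheses.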

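3. Collect the live stream comments (danmaku)

Data API: (example link: https://tb-live-message.alibaba.com/live/message/273191f6-87b7-4d6d-9fb0-2680d01acae9/0/1633000895). Part of the returned data ends with "=" or "==", which is a strong hint of Base64 encoding; a quick test confirmed it.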
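
Checking the Base64 guess takes only a few lines with the JDK's own `java.util.Base64`. This is a minimal sketch: the class name and the sample string are hypothetical (the sample is not a real message from the API), and it assumes the decoded bytes are UTF-8 text.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper, not part of the original crawler. */
final class Base64Sketch {

    /**
     * Decodes a Base64 string into UTF-8 text.
     * Returns null when the input is not valid Base64, so a wrong guess is easy to spot.
     */
    static String decodeOrNull(String encoded) {
        try {
            byte[] raw = Base64.getDecoder().decode(encoded);
            return new String(raw, StandardCharsets.UTF_8);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Made-up sample ending in '=', like the fields described above.
        System.out.println(decodeOrNull("aGVsbG8="));
    }
}
```

If the decoded text looks like binary noise rather than readable characters, the field is probably compressed or serialized in another format after decoding, which this sketch does not handle.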

Baidu Index

Page: (example link: https://index.baidu.com/v2/main/index.html#/trend/%E8%BF%AA%E8%BF%A6?words=%E8%BF%AA%E8%BF%A6)

Data API: (example link: https://index.baidu.com/api/FeedSearchApi/getFeedIndex?word=[[%7B%22name%22:%22%E8%BF%AA%E8%BF%A6%22,%22wordType%22:1%7D]]&area=0&days=30)

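Data mapping API, which returns the character mapping for the "uniqid" found in the data API response: (example link: https://index.baidu.com/Interface/ptbk?uniqid=976c01e5a769b63c8ea364df1afbfda2)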

Script related to the encryption of the returned result: (link: https://index.baidu.com/v2/static/js/main.914903fe3724b8f1daa8.js)

Core code:

    // Needs java.util.HashMap and java.util.Map. JSONObject.parseObject is fastjson's API;
    // LOG (slf4j-style) and getContent(url) are helpers of the enclosing class that are not shown in this post.
    public void crawler() {
        // 1) Fetch the encrypted index data.
        String indexUrl = "https://index.baidu.com/api/FeedSearchApi/getFeedIndex?word=[[%7B%22name%22:%22%E8%BF%AA%E8%BF%A6%22,%22wordType%22:1%7D]]&area=0&days=30";
        String indexContent = getContent(indexUrl);
        JSONObject indexJson = JSONObject.parseObject(indexContent);

        JSONObject indexOneJson = indexJson.getJSONObject("data").getJSONArray("index").getJSONObject(0);
        String startDate = indexOneJson.getString("startDate");
        String endDate = indexOneJson.getString("endDate");
        String dateIncreaseStep = indexOneJson.getString("type");
        String dataBeforeStr = indexOneJson.getString("data");

        // 2) Fetch the character mapping that belongs to this response's uniqid.
        String uniqid = indexJson.getJSONObject("data").getString("uniqid");
        String uniqUrl = "https://index.baidu.com/Interface/ptbk?uniqid=" + uniqid;
        String uniqContent = getContent(uniqUrl);
        String uniqStr = JSONObject.parseObject(uniqContent).getString("data");

        // 3) Decrypt and log the result.
        LOG.info("StartDate: {}, EndDate: {}, DateIncreaseStep: {}, Data: {}.",
                startDate, endDate, dateIncreaseStep, decrypt(dataBeforeStr, uniqStr));
    }

    // Substitution cipher: the first half of uniqStr lists the cipher characters,
    // the second half lists the plain characters at the same positions.
    public String decrypt(String dataBeforeStr, String uniqStr) {
        StringBuilder dataAfterBuf = new StringBuilder();

        // Build the table with charAt, because split("") yields a leading empty string before Java 8.
        int half = uniqStr.length() / 2;
        Map<Character, Character> tempMap = new HashMap<>();
        for (int i = 0; i < half; i++) {
            tempMap.put(uniqStr.charAt(i), uniqStr.charAt(half + i));
        }

        for (int i = 0; i < dataBeforeStr.length(); i++) {
            char c = dataBeforeStr.charAt(i);
            // Keep unmapped characters as-is instead of appending the text "null".
            Character mapped = tempMap.get(c);
            dataAfterBuf.append(mapped != null ? mapped : c);
        }

        return dataAfterBuf.toString();
    }
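
To see the substitution at work without calling Baidu, here is a small self-contained worked example. It is only a sketch: the class name, the mapping string and the encrypted sample are made up, and `byDay` assumes that the decrypted text is a comma-separated list with one value per day starting at an ISO-formatted `startDate`, which the article does not state.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical worked example, not part of the original crawler. */
final class PtbkSketch {

    /** Maps each character through the table formed by the two halves of the key. */
    static String decode(String encrypted, String key) {
        int half = key.length() / 2;
        Map<Character, Character> table = new HashMap<>();
        for (int i = 0; i < half; i++) {
            table.put(key.charAt(i), key.charAt(half + i));
        }
        StringBuilder plain = new StringBuilder(encrypted.length());
        for (int i = 0; i < encrypted.length(); i++) {
            char c = encrypted.charAt(i);
            Character mapped = table.get(c);
            // Characters without a mapping are kept unchanged.
            plain.append(mapped != null ? mapped : c);
        }
        return plain.toString();
    }

    /**
     * ASSUMPTION: the decoded text is comma-separated, one value per day,
     * and startDate is in yyyy-MM-dd form. Empty values are kept as empty strings.
     */
    static Map<LocalDate, String> byDay(String decoded, String startDate) {
        Map<LocalDate, String> series = new LinkedHashMap<>();
        LocalDate day = LocalDate.parse(startDate);
        for (String value : decoded.split(",", -1)) {
            series.put(day, value);
            day = day.plusDays(1);
        }
        return series;
    }

    public static void main(String[] args) {
        // Made-up key: a..j stand for 0..9 and k stands for the comma.
        String key = "abcdefghijk0123456789,";
        String decoded = decode("bcdkefg", key);
        System.out.println(decoded);
        System.out.println(byDay(decoded, "2021-09-01"));
    }
}
```

With this made-up key, "bcdkefg" decodes to "123,456", which `byDay` then assigns to 2021-09-01 and 2021-09-02. For weekly data the step would have to follow the "type" field instead of a fixed one day.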

Closing

Crawlers have tormented me a thousand times over, yet I still treat them like my first love!
