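Simulating a login to the 360 website with PHP to crawl the 360 Index

Three steps to crawl the 360 Index

  1. Obtain a token for the 360 user
  2. Simulate a login with the account name, password, and token to obtain the cookies
  3. Set the cookies and crawl the 360 Index

Code implementation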



<?php

class CrawlerOf360Data
{

    private $curl;

    function __construct()
    {
        $this->init_curl_client();
    }

    function init_curl_client() {
        $this->curl = curl_init();
        curl_setopt($this->curl, CURLOPT_URL, "https://login.360.cn/");
        curl_setopt($this->curl, CURLOPT_TIMEOUT, 60);
        curl_setopt($this->curl, CURLOPT_COOKIEJAR, dirname(__FILE__).'/cookie.txt');
        curl_setopt($this->curl, CURLOPT_COOKIEFILE, dirname(__FILE__).'/cookie.txt');
        curl_setopt($this->curl, CURLOPT_SSL_VERIFYHOST, 0);
        curl_setopt($this->curl, CURLOPT_SSL_VERIFYPEER, 0);
        curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, 1);
    }

    function get360Data($name) {
        //step1:get token
        $_tokenPostFields = array(
            'callback' => 'jQuery18309010124561026427_1524021670433',
            'src'=> 'pcw_360index',
            'from'=> 'pcw_360index',
            'charset'=> 'UTF-8',
            'requestScema'=> 'https',
            'o'=> 'sso',
            'm'=> 'getToken',
            'userName'=> 'username',
            '_'=> time() * 1000
        );
        curl_setopt($this->curl, CURLOPT_POST, true);
        curl_setopt($this->curl, CURLOPT_POSTFIELDS, $_tokenPostFields);
        $_tokenResponse = curl_exec($this->curl);
        $tokenJson = json_decode(preg_split('/[()]/', $_tokenResponse)[1], true);

        //step2:get cookies
        $_postLoginFields = array(
            'callback' => 'QiUserJsonp21670482',
            'func' => 'QiUserJsonp21670482',
            'proxy' => 'https://trends.so.com/psp_jump.html',
            'type' => 'normal',
            'src'=> 'pcw_360index',
            'from'=> 'pcw_360index',
            'charset'=> 'UTF-8',
            'requestScema'=> 'https',
            'o'=> 'sso',
            'm'=> 'login',
            'lm'=> '0',
            'userName'=> 'username',
            'account'=> 'username',
            // MD5 hash of the password, 32 lowercase hex characters, e.g. md5('your-password')
            'password'=> 'aeeb3b46c69459542c1457ea3b3a4bc0',
            'token'=> $tokenJson['token'],
            '_'=> time() * 1000
        );
        curl_setopt($this->curl, CURLOPT_POSTFIELDS, $_postLoginFields);
        curl_exec($this->curl);
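        //step3:get the 360data
        // libcurl normally writes the cookie jar only when the handle is closed,
        // so flush it explicitly before reading the file.
        curl_setopt($this->curl, CURLOPT_COOKIELIST, 'FLUSH');
        $cookieFileContents = file_get_contents(dirname(__FILE__).'/cookie.txt');
        // Build the Cookie header string from the .360.cn entries
        $cookieArr = explode("\n", $cookieFileContents);
        $cookie = '';
        foreach ($cookieArr as $data) {
            $dataArr = explode("\t", rtrim($data, "\r"));
            // Skip comments and blank lines, which have fewer than 7 fields
            if (count($dataArr) < 7) {
                continue;
            }
            if ($dataArr[0] == '.360.cn' || $dataArr[0] == '#HttpOnly_.360.cn') {
                $cookie .= "{$dataArr[5]}={$dataArr[6]};";
            }
        }
        // Query parameters must be URL-encoded
        $url = "https://trends.so.com/index/overviewJson?area=" . urlencode('全国') . '&q=' . urlencode($name);
        curl_setopt($this->curl, CURLOPT_HTTPGET, true); // switch back from POST to GET
        curl_setopt($this->curl, CURLOPT_COOKIE, $cookie);
        curl_setopt($this->curl, CURLOPT_URL, $url);
        $result = curl_exec($this->curl);
        echo $result;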
    }

}
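
The `preg_split` call in step 1 of `get360Data` assumes the token response always contains a pair of parentheses with valid JSON between them. A slightly more defensive version might look like the sketch below. `parseJsonpToken` is a hypothetical helper name that is not part of the original class, and the sketch assumes that `errno` 0 means success.

```php
<?php
/**
 * Extract the token from a JSONP response such as
 * login.setSigCallback({"errno":0,"errmsg":"","token":"638f6d593b12d10b"})
 * Returns null when the response cannot be parsed or reports an error.
 * Hypothetical helper; not part of the original class.
 */
function parseJsonpToken($response)
{
    // Capture everything between the first "(" and the last ")".
    if (!preg_match('/\((.*)\)/s', $response, $matches)) {
        return null;
    }
    $json = json_decode($matches[1], true);
    if (!is_array($json) || !isset($json['token'])) {
        return null;
    }
    // Assumption: errno 0 means success.
    if (isset($json['errno']) && (int)$json['errno'] !== 0) {
        return null;
    }
    return $json['token'];
}
```

With a helper like this, step 1 could stop early when the result is null instead of sending an empty token in the login request.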

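Supplementary notes

  • step1:
    - The response to the token request looks like the following; parse the string to extract the token: login.setSigCallback({"errno":0,"errmsg":"","token":"638f6d593b12d10b"})
  • step2:
    - In the simulated login, password is the MD5 hash of the password as a 32-character lowercase hex string.
    - After a successful login, the cookie file contains the following: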
# Netscape HTTP Cookie File
# http://curl.haxx.se/docs/http-cookies.html
# This file was generated by libcurl! Edit at your own risk.

.360.cn TRUE    /   FALSE   0   Q   u%3D%25O9%25R3%25Q6%25QQ%25PO%25P4%25P8%25SQ%26n%3D%25O9%25R3%25Q6%25QQ%25PO%25P4%25P8%25SQ%26le%3DnzyhM3IcMT9hMlH0ZQDmBGxhozI0%26m%3D%26qid%3D251535385%26im%3D1_t01923d359dad425928%26src%3Dpcw_360index%26t%3D1
#HttpOnly_.360.cn   TRUE    /   FALSE   0   T   s%3D45ebc94e99d3b53fce0c57f504b2f070%26t%3D1524064257%26lm%3D%26lf%3D4%26sk%3Dd0ad528d2a9e9ef22ee3477128691bf1%26mt%3D1524064257%26rc%3D%26v%3D2.0%26a%3D0
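  • step3:
    - In step 3 the cookie must be set; it is built by joining the last two lines of the cookie file.
    Q and T are the keys, and the field that follows each of them is the corresponding value.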
eg:Q=u%3D%25O9%25R3%25Q6%25QQ%25PO%25P4%25P8%25SQ%26n%3D%25O9%25R3%25Q6%25QQ%25PO%25P4%25P8%25SQ%26le%3DnzyhM3IcMT9hMlH0ZQDmBGxhozI0%26m%3D%26qid%3D251535385%26im%3D1_t01923d359dad425928%26src%3Dpcw_360index%26t%3D1;T=s%3D45ebc94e99d3b53fce0c57f404b2f070%26t%3D1524064257%26lm%3D%26lf%3D4%26sk%3Dd0ad528d2a9e9ef22ee3477128691bf1%26mt%3D1524064257%26rc%3D%26v%3D2.0%26a%3D0;
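
The joining described above can be sketched as a standalone function. `buildCookieHeader` is a hypothetical name that is not part of the original class, and the sample file contents below are made up for illustration. The sketch assumes the tab-separated, seven-field Netscape format that libcurl writes.

```php
<?php
/**
 * Build a Cookie header value from the contents of a Netscape-format
 * cookie file, keeping only cookies for the given domain.
 * Hypothetical helper; not part of the original class.
 */
function buildCookieHeader($cookieFileContents, $domain = '.360.cn')
{
    $pairs = array();
    foreach (preg_split('/\r\n|\n|\r/', $cookieFileContents) as $line) {
        // libcurl marks HttpOnly cookies with a "#HttpOnly_" prefix on the domain.
        if (strpos($line, '#HttpOnly_') === 0) {
            $line = substr($line, strlen('#HttpOnly_'));
        } elseif ($line === '' || $line[0] === '#') {
            continue; // skip blank lines and comments
        }
        // Fields: domain, subdomain flag, path, secure, expiry, name, value
        $fields = explode("\t", $line);
        if (count($fields) < 7 || $fields[0] !== $domain) {
            continue;
        }
        $pairs[] = $fields[5] . '=' . $fields[6];
    }
    return implode('; ', $pairs);
}

// Usage with a made-up two-cookie file:
$sampleCookies = "# Netscape HTTP Cookie File\n"
    . ".360.cn\tTRUE\t/\tFALSE\t0\tQ\tabc\n"
    . "#HttpOnly_.360.cn\tTRUE\t/\tFALSE\t0\tT\tdef\n";
echo buildCookieHeader($sampleCookies); // Q=abc; T=def
```

Checking the field count before indexing avoids notices on the comment and blank lines at the top of the file.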

- Test with the following link:
https://trends.so.com/index/overviewJson?area=全国&q=微信
On success it returns the following JSON:

{"status":0,"data":[{"query":"\u5fae\u4fe1","data":{"week_year_ratio":">1000%","month_year_ratio":">1000%","week_chain_ratio":"16.18%","month_chain_ratio":"19.46%","week_index":8674687,"month_index":8455856}}],"msg":"success"}
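
Rather than echoing the raw response, the result can be decoded and the index figures read out. The sketch below uses the sample response above. `parseOverview` is a hypothetical helper that is not part of the original class, and it assumes that `status` 0 means success, as in the sample.

```php
<?php
/**
 * Decode an overviewJson response into a map of query => index figures.
 * Returns an empty array when the response is not a success response.
 * Hypothetical helper; not part of the original class.
 */
function parseOverview($result)
{
    $json = json_decode($result, true);
    if (!is_array($json) || !isset($json['status'], $json['data']) || $json['status'] !== 0) {
        return array();
    }
    $out = array();
    foreach ($json['data'] as $item) {
        $out[$item['query']] = array(
            'week_index'  => $item['data']['week_index'],
            'month_index' => $item['data']['month_index'],
        );
    }
    return $out;
}

// The sample response shown above, kept in single quotes so that the
// \uXXXX escapes reach json_decode unchanged.
$sampleJson = '{"status":0,"data":[{"query":"\u5fae\u4fe1","data":{"week_year_ratio":">1000%","month_year_ratio":">1000%","week_chain_ratio":"16.18%","month_chain_ratio":"19.46%","week_index":8674687,"month_index":8455856}}],"msg":"success"}';

$overview = parseOverview($sampleJson);
echo $overview['微信']['week_index']; // 8674687
```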
