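I wrote the feature that scrapes job postings from 51JOB a long time ago, and it had been running without problems.
Recently the operations team reported that it had stopped working. On inspection, 51job had been redesigned, so the old scraping rules no longer matched. Worse, even when a request goes through, 51job redirects direct fetches, and what comes back is just a notice page along the lines of "新版提示信息" ("new version notice").
Scraping again from the page that came back gives the same "新版提示信息" response.
So how do we get around this?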
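Until now I had always used raw CURL for scraping. This time I switched to Snoopy, which wraps the HTTP request handling and is pleasant to use.
When scraping, cookies have to be set on the request.
These three cookies are the ones that matter:
1: search
2: guide
3: guid
First search for a company yourself in the browser: open the site in Chrome and look up the values of these cookies,
then copy them into your program: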
$ck1 = "jobarea.......";
$snoopy->cookies["search"] = $ck1;
$snoopy->cookies["guide"] = 1;
$snoopy->cookies["guid"] = "14598261697930760091";
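If you copy the whole Cookie request header from Chrome's developer tools rather than one value at a time, a small helper can split it into name/value pairs. The following is only a sketch: `parse_cookie_header` is a hypothetical helper that is not part of Snoopy, and the cookie string is a made-up placeholder, not real 51job data.

```php
<?php
// Hypothetical helper: split a raw "Cookie:" header value into name => value pairs.
function parse_cookie_header(string $header): array
{
    $cookies = array();
    foreach (explode(';', $header) as $pair) {
        $pair = trim($pair);
        if ($pair === '' || strpos($pair, '=') === false) {
            continue; // skip empty or malformed segments
        }
        // Split on the first "=" only, because a value may itself contain "=".
        list($name, $value) = explode('=', $pair, 2);
        $cookies[trim($name)] = trim($value);
    }
    return $cookies;
}

// Placeholder header value (assumption), in the shape a browser sends it.
$raw = 'guid=0000000000; guide=1; search=key1=val1&key2=val2';
$cookies = parse_cookie_header($raw);
```

Each entry of `$cookies` could then be assigned to `$snoopy->cookies[$name]` in a loop, in the same way as the three manual assignments above.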
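51job serves its pages in GBK, so the fetched content has to be converted after scraping. Otherwise it comes out garbled and your regular expressions will not match anything you are looking for.
The conversion code is as follows: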
function array_iconv($data, $output = 'utf-8')
{
    $encode_arr = array('UTF-8', 'ASCII', 'GBK', 'GB2312', 'BIG5', 'JIS', 'eucjp-win', 'sjis-win', 'EUC-JP');

    if (is_array($data)) {
        // Arrays: convert every key and value recursively.
        $result = array();
        foreach ($data as $key => $val) {
            $result[array_iconv($key, $output)] = array_iconv($val, $output);
        }
        return $result;
    }

    if (!is_string($data)) {
        return $data; // numbers, booleans and null need no conversion
    }

    // Detect the encoding of this one string; fall back to UTF-8 if detection fails.
    $encoded = mb_detect_encoding($data, $encode_arr);
    if (empty($encoded)) {
        $encoded = 'UTF-8';
    }
    return @mb_convert_encoding($data, $output, $encoded);
}
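Encoding detection can guess wrong on short or mixed input. Since the text above states that 51job pages are GBK, naming the source encoding explicitly is a simpler alternative. A minimal sketch, assuming the mbstring extension is loaded; the byte string is a stand-in for a fetched page, not actual 51job output.

```php
<?php
// The two characters "中文" as raw GBK bytes (stand-in for a fetched page).
$gbk = "\xD6\xD0\xCE\xC4";

// Convert with the source encoding stated explicitly instead of detected.
$utf8 = mb_convert_encoding($gbk, 'UTF-8', 'GBK');

// Sanity check: the result should now be valid UTF-8.
$isValid = mb_check_encoding($utf8, 'UTF-8');
```

The trade-off is that this breaks if the site ever switches to another encoding, whereas the detection-based function above would have a chance of adapting.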
The core code is as follows:
$formvars["keyword"] = "公司名称"; // the company name to search for
$formvars["fromJs"] = 1;
$formvars["keywordtype"] = 1;
$action = "http://search.51job.com/jobsearch/search_result.php";

// Cookies copied from the browser
$ck1 = "jobare...";
$snoopy->cookies["search"] = $ck1;
$snoopy->cookies["guide"] = 1;
$snoopy->cookies["guid"] = "14598261697930760091";

// Make the request look like it comes from a browser
$snoopy->agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36";
$snoopy->referer = "http://www.51job.com";
$snoopy->rawheaders["Pragma"] = "no-cache";
// Note: the standard spelling of this header is X-Forwarded-For
$snoopy->rawheaders["X_FORWARDED_FOR"] = "127.0.0.101";

$snoopy->submit($action, $formvars);
$listres = $common->array_iconv($snoopy->results);

// Collect the links to the job detail pages from the result list
preg_match_all("/http:\/\/jobs\.51job\.com\/(\w+)\/(\d+)\.html\?s=0/", $listres, $matches);
//print_r($matches);
if (empty($matches[0])) {
    break; // no job links found (this snippet presumably sits inside an outer loop that is not shown)
}

$j = 0;
foreach ($matches[0] as $durl) {
    //$durl = 'http://jobs.51job.com/shanghai/76319081.html?s=0';
    $detailres = $common->getContent($durl);
    preg_match_all('/<h1 title="(.*)">/iU', $detailres, $matchtitle); // job title
    preg_match('/(\d+)\-(\d+)\/月/iU', $detailres, $matchwage); // monthly salary range
    preg_match_all('/<span class="sp4">(.*)<\/span>/iU', $detailres, $matchoption); // options
    preg_match_all('/<p class="t2">(.*)<\/p>/iU', $detailres, $matchattr); // perks
    preg_match_all('/<div class="bmsg job_msg inbox">(.*)<\/div>/iU', $detailres, $matchcont); // job requirements
    preg_match_all('/<p class="fp">(.*)<\/p>/iU', $detailres, $matchadd); // full address
    preg_match_all('/<span class="lname">(.*)<\/span>/iU', $detailres, $matcharea);
}
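To see what the match arrays look like, the two key patterns can be tried on small hand-written strings. This is just an illustration: `$sampleList` and `$sampleDetail` are made-up fragments (assumptions), not real 51job markup, and the file is assumed to be saved as UTF-8.

```php
<?php
// Made-up list-page fragment containing two job detail links.
$sampleList = '<a href="http://jobs.51job.com/shanghai/76319081.html?s=0">PHP</a>'
            . '<a href="http://jobs.51job.com/beijing/12345678.html?s=0">Java</a>';

// Same pattern as the core code: group 1 is the area slug, group 2 the job id.
preg_match_all('/http:\/\/jobs\.51job\.com\/(\w+)\/(\d+)\.html\?s=0/', $sampleList, $links);

$urls   = $links[0]; // full URLs
$areas  = $links[1]; // area slugs
$jobIds = $links[2]; // numeric job ids

// Made-up detail-page fragment with a monthly salary range.
$sampleDetail = '<strong>6000-8000/月</strong>';

// Group 1 is the lower bound, group 2 the upper bound.
$salary = null;
if (preg_match('/(\d+)-(\d+)\/月/', $sampleDetail, $wage)) {
    $salary = array('min' => (int) $wage[1], 'max' => (int) $wage[2]);
}
```

With `preg_match_all`, index 0 holds the full matches and each further index holds one capture group, which is why the core code loops over `$matches[0]` to get the detail URLs.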