【51job Crawler】Multi-threaded, Multi-proxy Download of IT Job Postings

Target cities: Beijing, Shanghai, Guangzhou, Shenzhen, plus Wuhan

Job category: Computer Software

Output: job lists and job detail pages saved as local HTML files

Tech stack: HttpClient + Jsoup + a crawler toolkit

 

Finding the pagination API:

    ① Pick a city, e.g. Wuhan; ② choose Computer Software as the job category; ③ run the search; ④ press F12 and open the Network tab to monitor requests; ⑤ navigate to any other page of results, and the pagination API shows up in the request log.
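The captured request is simply the search URL with the city code and page number embedded in the path; the pageFormat template in the code below reproduces it. For example, page 2 of Wuhan's listings (city code 180200) can be built like this:

    // Build the list-page URL for page 2 of Wuhan (city code 180200)
    // from the pageFormat template defined in TestBigCity below.
    String url = pageFormat.replace("{CITY}", "180200").replace("{PAGE}", "2");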

(Figure: the pagination request captured in the browser's Network tab)

Code walkthrough:

1. A pageMap stores each city's code together with its total page count. To keep the crawl simple, the page counts are hard-coded here.

    With many cities and job categories you would fetch the counts dynamically instead, as in the sketch below.
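A minimal sketch of fetching the count dynamically, meant to sit inside TestBigCity. Note that the div.p_in span.td selector and the 共N页 pager text are assumptions about the 51job list page, not something confirmed by the original code, so verify both in the browser first:

    // Sketch: read the total page count from page 1 of a city's search results
    // instead of hard-coding it in pageMap.
    // ASSUMPTION: the "div.p_in span.td" selector and the "共N页" pager text are
    // guesses about the 51job list page -- check them against the live HTML.
    private static int fetchPageCount(String city) throws Exception {
        String html = HttpClientUtil.getWebPage(
                new HttpGet(pageFormat.replace("{CITY}", city).replace("{PAGE}", "1")), "GBK");
        String pagerText = Jsoup.parse(html).select("div.p_in span.td").text(); // e.g. "共541页,到第"
        java.util.regex.Matcher m = java.util.regex.Pattern.compile("\\d+").matcher(pagerText);
        return m.find() ? Integer.parseInt(m.group()) : 0; // 0 = could not determine
    }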

2. For each city, download every page of the job list and save it locally under

    F:/资源/爬虫/51job/jobs/" + city + "/list/

    As each page is saved, Jsoup extracts all of its job detail URLs and appends them to a joblist-<city>.txt file, so the detail pages can later be downloaded by multiple threads.

3. Crawl the detail page of every collected job URL, city by city.

 Enable the proxy pool: ProxyHttpClient.initProxy();

 Start the multi-threaded crawl: SpiderHttpClient.getInstance().startCrawl(new MySpiderAction() {...});

 The core logic lives in the handle method of the WuHanJobTask class; each successfully fetched page is saved to the local folder F:/资源/爬虫/51job/jobs/" + city + "/detail/

 
package job51;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.http.client.methods.HttpGet;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.crawl.proxy.ProxyHttpClient;
import com.crawl.spider.MySpiderAction;
import com.crawl.spider.SpiderHttpClient;
import com.crawl.spider.entity.Page;
import com.crawl.spider.task.AbstractPageTask;
import com.wnc.basic.BasicFileUtil;
import com.wnc.string.PatternUtil;
import com.wnc.tools.FileOp;

import common.spider.HttpClientUtil;
import common.spider.node.MyElement;

public class TestBigCity {
    static final String pageFormat = "http://search.51job.com/list/{CITY},000000,0100,00,9,99,%2B,2,{PAGE}.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=1&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=";
    // City code -> number of list pages for that city (hard-coded; 180200 is Wuhan).
    static Map<String, Integer> pageMap = new HashMap<>();
    static {
        pageMap.put("010000", 535);
        pageMap.put("020000", 720);
        pageMap.put("030000", 361);
        pageMap.put("040000", 473);
        pageMap.put("180200", 218);
    }

    public static void main(String[] args) {
        for (String city : pageMap.keySet()) {
            BasicFileUtil.makeDirectory("F:/资源/爬虫/51job/jobs/" + city + "/list/");
            BasicFileUtil.makeDirectory("F:/资源/爬虫/51job/jobs/" + city + "/detail/");

            downloadJobList(city); // step 2: single-threaded crawl of the list pages
        }
        downloadJobDetailAll(); // step 3: multi-threaded crawl of the detail pages
    }

    private static void downloadJobDetailAll() {
        // Enable the proxy pool before handing work to the spider's thread pool.
        ProxyHttpClient.initProxy();
        SpiderHttpClient.getInstance().startCrawl(new MySpiderAction() {
            public void execute() {
                for (String city : pageMap.keySet()) {
                    // Read back the detail URLs collected during the list crawl.
                    List<String> readFrom = FileOp.readFrom("joblist-" + city + ".txt");
                    for (String jobUrl : readFrom) {
                        System.out.println(jobUrl);
                        // Skip jobs whose detail page already exists on disk;
                        // otherwise submit a new crawl task to the thread pool.
                        if (!BasicFileUtil.isExistFile("F:/资源/爬虫/51job/jobs/" + city + "/detail/"
                                + PatternUtil.getLastPattern(jobUrl, "\\d+") + ".html"))
                            SpiderHttpClient.getInstance().getNetPageThreadPool()
                                    .execute(new WuHanJobTask(city, jobUrl, true));
                    }
                }
            }
        });
    }

    static class WuHanJobTask extends AbstractPageTask {
        String city;

        public WuHanJobTask(String city, String url, boolean b) {
            super(url, b);
            this.city = city;
            this.pageEncoding = "GBK"; // 51job serves GBK-encoded pages
        }

        @Override
        protected void retry() {
            // On failure, resubmit the same URL to the thread pool as a fresh task.
            SpiderHttpClient.getInstance().getNetPageThreadPool().execute(new WuHanJobTask(city, url, true));
        }

        @Override
        protected void handle(Page page) throws Exception {
            // Save the detail page as <jobId>.html (the job id is the last number
            // in the URL) and bump the global parse counter.
            BasicFileUtil.writeFileString(
                    "F:/资源/爬虫/51job/jobs/" + city + "/detail/" + PatternUtil.getLastPattern(url, "\\d+") + ".html",
                    page.getHtml(), "GBK", false);
            SpiderHttpClient.parseCount.getAndIncrement();
        }

    }

    private static void downloadJobList(String city) {
        for (int i = 1; i <= pageMap.get(city); i++) {
            // Resume support: skip list pages that were already saved.
            if (BasicFileUtil.isExistFile("F:/资源/爬虫/51job/jobs/" + city + "/list/" + i + ".html"))
                continue;

            try {
                String webPage = HttpClientUtil
                        .getWebPage(new HttpGet(pageFormat.replace("{CITY}", city).replace("{PAGE}", i + "")), "GBK");
                BasicFileUtil.writeFileString("F:/资源/爬虫/51job/jobs/" + city + "/list/" + i + ".html", webPage, "GBK",
                        false);
                Document doc = Jsoup.parse(webPage);
                // Every row after the header in #resultList is one job posting.
                Elements rows = doc.select("#resultList div.el:gt(1)");
                for (Element element : rows) {
                    // Append the job's detail URL (taken from the row's first link)
                    // to this city's joblist file for the detail crawl later.
                    BasicFileUtil.writeFileString("joblist-" + city + ".txt",
                            new MyElement(element.select("a").first()).pattern4Attr("href", "(.*\\.html)") + "\r\n",
                            null, true);
                }
                Thread.sleep(800); // throttle between list pages
            } catch (Exception e) {
                e.printStackTrace();
                // Record the failed page number so it can be retried manually.
                BasicFileUtil.writeFileString("F:/资源/爬虫/51job/jobs/" + city + "/err.txt",
                        "Page " + i + ": " + e.toString() + "\r\n", null, true);
            }

        }
    }
}
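Since individual tasks can still fail even with retries, one quick way to check completeness after a run is to compare the number of saved detail files against the URLs collected per city. A minimal JDK-only sketch (the CrawlReport class and its output format are mine, not part of the original project):

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CrawlReport {
    public static void main(String[] args) throws Exception {
        String[] cities = { "010000", "020000", "030000", "040000", "180200" };
        for (String city : cities) {
            // URLs collected during the list crawl...
            long expected = Files.lines(Paths.get("joblist-" + city + ".txt")).count();
            // ...versus detail pages actually saved to disk.
            String[] saved = new File("F:/资源/爬虫/51job/jobs/" + city + "/detail/").list();
            System.out.printf("%s: %d saved / %d expected%n",
                    city, saved == null ? 0 : saved.length, expected);
        }
    }
}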

 The code above is from September. As of now (October 23) the number of postings has grown noticeably: Wuhan's Computer Software listings jumped from 218 pages to 541.

Reposted from: https://www.cnblogs.com/crazyData/p/7719865.html
