Target cities: Beijing, Shanghai, Guangzhou, Shenzhen, plus Wuhan
Job category: computer software
Storage: save the job list pages and job detail pages as local HTML files
Technology: HttpClient + Jsoup + a crawler toolkit
Finding the pagination API:
① Pick a city, e.g. Wuhan. ② Choose computer software as the job category. ③ Run the search. ④ Press F12 and switch to the Network tab to monitor requests. ⑤ Page forward to any other result page; the pagination API request then shows up in the Network tab.
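The request captured this way is the list-page URL used as the pageFormat constant in the code below, where {CITY} is the 51job city code and {PAGE} is the page number (the rest of the long fixed query string is kept verbatim in pageFormat):

http://search.51job.com/list/{CITY},000000,0100,00,9,99,%2B,2,{PAGE}.html?lang=c&stype=1&...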
Code analysis:
1. pageMap maps each city code to its total number of list pages. To keep the crawl simple, the page count for every city is hard-coded here; once there are many cities and job categories you would have to fetch the page counts dynamically instead (a sketch of that follows below).
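As a rough illustration of that dynamic approach, the method below could be added to TestBigCity (shown further down): it requests page 1 for a city and parses the total page count out of the pager text. It reuses pageFormat and HttpClientUtil from that class; the div.p_in span.td selector and the "共N页" pager text it parses are assumptions about the 51job markup at the time and may need adjusting.

    // Sketch only, not part of the original code; needs java.util.regex.Matcher / Pattern
    // in addition to the imports already present in TestBigCity.
    private static int fetchTotalPages(String city) throws Exception {
        String url = pageFormat.replace("{CITY}", city).replace("{PAGE}", "1");
        String html = HttpClientUtil.getWebPage(new HttpGet(url), "GBK"); // same helper as downloadJobList
        Document doc = Jsoup.parse(html);
        // Assumed pager element whose text looks like "共541页,到第...页"; adjust if the markup differs.
        String pagerText = doc.select("div.p_in span.td").text();
        Matcher m = Pattern.compile("(\\d+)").matcher(pagerText);
        return m.find() ? Integer.parseInt(m.group(1)) : 1; // fall back to a single page
    }

With this helper, pageMap.put(city, fetchTotalPages(city)) could replace the hard-coded values in the static block.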
2. Download the job list page by page for each city and save it locally under the
F:/资源/爬虫/51job/jobs/" + city + "/list/ folder.
While each page is saved, Jsoup extracts every job detail URL on that page and appends it to a joblist text file, so that the detail pages can later be downloaded by multiple threads (a standalone parsing sketch follows below).
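As a minimal, self-contained illustration of that extraction step, the sketch below re-parses a list page that has already been saved to disk and prints the detail URLs that would end up in joblist-<city>.txt. It uses only Jsoup and standard Java I/O; the #resultList div.el:gt(1) selector mirrors downloadJobList in the full code, and the file path is just an example.

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class ListPageParseDemo {
        public static void main(String[] args) throws Exception {
            // An already-saved list page (example path: Wuhan = 180200, page 1).
            File listPage = new File("F:/资源/爬虫/51job/jobs/180200/list/1.html");
            Document doc = Jsoup.parse(listPage, "GBK");
            List<String> jobUrls = new ArrayList<>();
            // :gt(1) skips the first two div.el rows (the header area), matching the original selector.
            for (Element row : doc.select("#resultList div.el:gt(1)")) {
                Element link = row.select("a").first();
                if (link != null && link.attr("href").endsWith(".html")) {
                    jobUrls.add(link.attr("href"));
                }
            }
            jobUrls.forEach(System.out::println); // these are the lines written to joblist-<city>.txt
        }
    }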
3. Crawl and download every job detail URL for every city.
Enable the proxy pool: ProxyHttpClient.initProxy();
Start the multithreaded crawl: SpiderHttpClient.getInstance().startCrawl(new MySpiderAction() {...});
The core logic sits in the handle method of the WuHanJobTask class; on success each task saves the page under the local F:/资源/爬虫/51job/jobs/" + city + "/detail/ folder.
package job51;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.http.client.methods.HttpGet;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.crawl.proxy.ProxyHttpClient;
import com.crawl.spider.MySpiderAction;
import com.crawl.spider.SpiderHttpClient;
import com.crawl.spider.entity.Page;
import com.crawl.spider.task.AbstractPageTask;
import com.wnc.basic.BasicFileUtil;
import com.wnc.string.PatternUtil;
import com.wnc.tools.FileOp;

import common.spider.HttpClientUtil;
import common.spider.node.MyElement;

public class TestBigCity {
    // List-page URL template: {CITY} is the 51job city code, {PAGE} is the page number.
    static final String pageFormat = "http://search.51job.com/list/{CITY},000000,0100,00,9,99,%2B,2,{PAGE}.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=1&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=";

    // City code -> total number of list pages (hard-coded from a manual search).
    static Map<String, Integer> pageMap = new HashMap<>();
    static {
        pageMap.put("010000", 535); // Beijing
        pageMap.put("020000", 720); // Shanghai
        pageMap.put("030000", 361); // Guangzhou
        pageMap.put("040000", 473); // Shenzhen
        pageMap.put("180200", 218); // Wuhan
    }

    public static void main(String[] args) {
        for (String city : pageMap.keySet()) {
            BasicFileUtil.makeDirectory("F:/资源/爬虫/51job/jobs/" + city + "/list/");
            BasicFileUtil.makeDirectory("F:/资源/爬虫/51job/jobs/" + city + "/detail/");
            downloadJobList(city);
        }
        downloadJobDetailAll();
    }

    private static void downloadJobDetailAll() {
        ProxyHttpClient.initProxy();
        SpiderHttpClient.getInstance().startCrawl(new MySpiderAction() {
            public void execute() {
                for (String city : pageMap.keySet()) {
                    List<String> readFrom = FileOp.readFrom("joblist-" + city + ".txt");
                    for (String string : readFrom) {
                        System.out.println(string);
                        // Skip detail pages that have already been downloaded.
                        if (!BasicFileUtil.isExistFile("F:/资源/爬虫/51job/jobs/" + city + "/detail/"
                                + PatternUtil.getLastPattern(string, "\\d+") + ".html"))
                            SpiderHttpClient.getInstance().getNetPageThreadPool()
                                    .execute(new WuHanJobTask(city, string, true));
                    }
                }
            }
        });
    }

    static class WuHanJobTask extends AbstractPageTask {
        String city;

        public WuHanJobTask(String city, String url, boolean b) {
            super(url, b);
            this.city = city;
            this.pageEncoding = "GBK";
        }

        @Override
        protected void retry() {
            // Re-queue the same URL on failure.
            SpiderHttpClient.getInstance().getNetPageThreadPool().execute(new WuHanJobTask(city, url, true));
        }

        @Override
        protected void handle(Page page) throws Exception {
            // Save the detail page; the job id is the last number in the URL.
            BasicFileUtil.writeFileString(
                    "F:/资源/爬虫/51job/jobs/" + city + "/detail/" + PatternUtil.getLastPattern(url, "\\d+") + ".html",
                    page.getHtml(), "GBK", false);
            SpiderHttpClient.parseCount.getAndIncrement();
        }
    }

    private static void downloadJobList(String city) {
        for (int i = 1; i <= pageMap.get(city); i++) {
            // Skip list pages that have already been downloaded.
            if (BasicFileUtil.isExistFile("F:/资源/爬虫/51job/jobs/" + city + "/list/" + i + ".html"))
                continue;
            try {
                String webPage = HttpClientUtil.getWebPage(
                        new HttpGet(pageFormat.replace("{CITY}", city).replace("{PAGE}", i + "")), "GBK");
                BasicFileUtil.writeFileString("F:/资源/爬虫/51job/jobs/" + city + "/list/" + i + ".html",
                        webPage, "GBK", false);
                // Extract every job detail URL on this list page and append it to joblist-<city>.txt.
                Document parse = Jsoup.parse(webPage);
                Elements select = parse.select("#resultList div.el:gt(1)");
                for (Element element : select) {
                    BasicFileUtil.writeFileString("joblist-" + city + ".txt",
                            new MyElement(element.select("a").first()).pattern4Attr("href", "(.*\\.html)") + "\r\n",
                            null, true);
                }
                Thread.sleep(800); // be polite: pause between list-page requests
            } catch (Exception e) {
                e.printStackTrace();
                BasicFileUtil.writeFileString("F:/资源/爬虫/51job/jobs/" + city + "/err.txt",
                        i + "Page " + e.toString() + "\r\n", null, true);
            }
        }
    }
}
The code above is from September; as of now (October 23) the number of postings has grown noticeably. Wuhan's computer-software listings alone have jumped from 218 pages to 541.