Preface
It has been a month since my last post. I spent the whole time optimizing a feature module and leaving work late every night, so nothing new got published.
After the National Day holiday I took over several crawlers. The work is mainly writing my own crawlers to scrape job postings from the three BAT companies (Baidu, Alibaba, Tencent) and from the three major recruitment sites. The latter three were already written with webmagic, so I just maintain them, which is fairly relaxed overall. The exception is the last one, Liepin, which needs proxy IPs to scrape; I spent some time finding a fairly good free proxy IP site. Below I have pulled the webmagic workflow out on its own, so it can be reused quickly next time.
A brief introduction to webmagic
http://webmagic.io/docs/zh/ is the official Chinese documentation. It is quite thorough, so I won't re-explain it here; I just want to point out a few things:
- It suits most list-detail sites, e.g. a CSDN blog list plus the corresponding article pages
- You need some basic regular expressions and XPath (if you don't know XPath, a Chrome XPath extension is a handy alternative); see the small selector sketch after this list
- Don't rely too heavily on this framework
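To make the XPath point concrete, here is a minimal, self-contained sketch (the HTML fragment and expressions are made up for illustration) using webmagic's Html selector, the same engine that evaluates the @ExtractBy annotations below:

import us.codecraft.webmagic.selector.Html;

public class XpathDemo {
    public static void main(String[] args) {
        // A tiny stand-in for a list page; in a real crawl the spider fetches the HTML
        Html html = new Html("<html><body><h1>Java Developer</h1>"
                + "<ul><li><a href='/job/1.htm'>Job 1</a></li></ul></body></html>");
        System.out.println(html.xpath("//h1/text()").get());     // -> Java Developer
        System.out.println(html.xpath("//ul/li/a/@href").get()); // -> /job/1.htm
    }
}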
Crawling Zhilian job postings with a simple configuration
Write a model class and use the @ExtractBy annotation with XPath expressions to populate each field:
import org.apache.commons.codec.digest.DigestUtils;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.model.AfterExtractor;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.ExtractByUrl;
import us.codecraft.webmagic.model.annotation.HelpUrl;
import us.codecraft.webmagic.model.annotation.TargetUrl;

// @HelpUrl pages (search result lists) are only used to discover links;
// @TargetUrl pages (job detail pages) are the ones actually extracted
@TargetUrl({"http://jobs.zhaopin.com/*.htm?*"})
@HelpUrl({"http://sou.zhaopin.com/jobs/searchresult.ashx?*"})
public class ZhilianJobInfo implements AfterExtractor {
@ExtractBy("//h1/text()")
private String title = "";
@ExtractBy("//html/body/div[6]/div[1]/ul/li[1]/strong/text()")
private String salary = "";
@ExtractBy("//html/body/div[5]/div[1]/div[1]/h2/a/text()")
private String company = "";
@ExtractBy("//html/body/div[6]/div[1]/div[1]/div/div[1]/allText()")
private String description = "";
private String source = "zhilian.com";
@ExtractByUrl
private String url = "";
private String urlMd5 = "";
@ExtractBy("//html/body/div[6]/div[1]/ul/li[2]/strong/a/text()")
private String dizhi = "";
@ExtractBy("//html/body/div[6]/div[1]/ul/li[5]/strong/text()")
private String qualifications = "";
@ExtractBy("//html/body/div[6]/div[2]/div[1]/ul/li[3]/strong/a/text()")
private String companycategory = "";
@ExtractBy("//html/body/div[6]/div[2]/div[1]/ul/li[1]/strong/text()")
private String companyscale = "";
@ExtractBy("//html/body/div[6]/div[2]/div[1]/ul/li[2]/strong/text()")
private String companytype = "";
@ExtractBy("//html/body/div[6]/div[2]/div[1]/ul/li[4]/strong/text()")
private String companyaddress;
    public String getTitle() {
        return this.title;
    }
    public void setTitle(String title) {
        this.title = title;
    }
    public String getCompany() {
        return this.company;
    }
    public void setCompany(String company) {
        this.company = company;
    }
    public String getDescription() {
        return this.description;
    }
    public void setDescription(String description) {
        if (description != null)
            this.description = description;
    }
    public String getSource() {
        return this.source;
    }
    public void setSource(String source) {
        this.source = source;
    }
    public String getUrl() {
        return this.url;
    }
    public void setUrl(String url) {
        this.url = url;
        // cache an MD5 of the URL, useful as a deduplication key
        this.urlMd5 = DigestUtils.md5Hex(url);
    }
    public String getSalary() {
        return this.salary;
    }
    public void setSalary(String salary) {
        this.salary = salary;
    }
    public String getUrlMd5() {
        return this.urlMd5;
    }
    public void setUrlMd5(String urlMd5) {
        this.urlMd5 = urlMd5;
    }
    public String getDizhi() {
        return this.dizhi;
    }
    public void setDizhi(String dizhi) {
        this.dizhi = dizhi;
    }
    public String getQualifications() {
        return this.qualifications;
    }
    public void setQualifications(String qualifications) {
        this.qualifications = qualifications;
    }
    public String getCompanycategory() {
        return this.companycategory;
    }
    public void setCompanycategory(String companycategory) {
        this.companycategory = companycategory;
    }
    public String getCompanyscale() {
        return this.companyscale;
    }
    public void setCompanyscale(String companyscale) {
        this.companyscale = companyscale;
    }
    public String getCompanytype() {
        return this.companytype;
    }
    public void setCompanytype(String companytype) {
        this.companytype = companytype;
    }
    public String getCompanyaddress() {
        return this.companyaddress;
    }
    public void setCompanyaddress(String companyaddress) {
        this.companyaddress = companyaddress;
    }
    @Override
    public String toString() {
        return "JobInfo{title='" + this.title + '\'' + ", salary='" + this.salary + '\'' + ", company='" + this.company + '\'' + ", description='" + this.description + '\'' + ", source='" + this.source + '\'' + ", url='" + this.url + '\'' + '}';
    }
    // Hook invoked after all @ExtractBy fields have been populated;
    // left empty here (a post-processing sketch follows after the class)
    public void afterProcess(Page page) {
    }
}
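The empty afterProcess hook is where field cleanup can live. As a minimal sketch (the trimming and skip logic here are my own illustration, not part of the original crawler), the method inside ZhilianJobInfo could look like:

// Inside ZhilianJobInfo: an illustrative afterProcess implementation
public void afterProcess(Page page) {
    // Extracted text often carries stray whitespace
    if (this.title != null) {
        this.title = this.title.trim();
    }
    if (this.salary != null) {
        this.salary = this.salary.trim();
    }
    // If extraction clearly failed, skip the page instead of storing junk
    if (this.title == null || this.title.isEmpty()) {
        page.setSkip(true);
    }
}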
Next, implement the crawler's Pipeline to decide how the scraped data is stored:
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.PageModelPipeline;

public class ZhilianModelPipeline implements PageModelPipeline<ZhilianJobInfo> {
    public void process(ZhilianJobInfo zhilianJobInfo, Task task) {
        // save info to db (just printed here for the demo)
        System.out.println(zhilianJobInfo);
    }
}
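In a real run the process method would write to a database instead of printing. A rough JDBC sketch under assumed settings (the job_info table, its columns, and the MySQL connection details are all hypothetical; the original crawler does not specify a schema):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.PageModelPipeline;

public class ZhilianJdbcPipeline implements PageModelPipeline<ZhilianJobInfo> {
    // Hypothetical connection settings; adjust to your environment
    private static final String JDBC_URL = "jdbc:mysql://localhost:3306/jobs";

    public void process(ZhilianJobInfo info, Task task) {
        // INSERT IGNORE plus a unique index on url_md5 deduplicates recrawled pages (MySQL syntax)
        String sql = "INSERT IGNORE INTO job_info (url_md5, title, salary, company, url)"
                + " VALUES (?, ?, ?, ?, ?)";
        try (Connection conn = DriverManager.getConnection(JDBC_URL, "user", "password");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, info.getUrlMd5());
            ps.setString(2, info.getTitle());
            ps.setString(3, info.getSalary());
            ps.setString(4, info.getCompany());
            ps.setString(5, info.getUrl());
            ps.executeUpdate();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}

A production pipeline would reuse a connection pool rather than open a new connection per record; this is kept short for illustration.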
Finally, the crawler configuration and the entry point (I added an IP proxy pool today):
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.proxy.Proxy;
import us.codecraft.webmagic.proxy.SimpleProxyProvider;

public class Crawler {
    public static void main(String[] args) {
        // Build the proxy IP pool and plug it into the downloader
        HttpClientDownloader httpClientDownloader = new HttpClientDownloader();
        try {
            List<Proxy> proxies = buildProxyIP();
            System.out.println("Fetched proxy IPs: " + proxies);
            httpClientDownloader.setProxyProvider(new SimpleProxyProvider(proxies));
        } catch (IOException e) {
            e.printStackTrace();
        }
        OOSpider.create(Site.me()
                        .setSleepTime(5)
                        .setRetrySleepTime(10)
                        .setCycleRetryTimes(3),
                new ZhilianModelPipeline(), ZhilianJobInfo.class)
                .addUrl("http://sou.zhaopin.com/jobs/searchresult.ashx?jl=765&bj=7002000&sj=463")
                .thread(60)
                .setDownloader(httpClientDownloader)
                .run();
    }
    /**
     * A decent free proxy IP site:
     * www.89ip.cn
     *
     * @return proxies parsed from the page
     */
    private static List<Proxy> buildProxyIP() throws IOException {
        // The site returns a plain page containing "ip:port" pairs
        Document parse = Jsoup.parse(new URL("http://www.89ip.cn/tiqv.php?sxb=&tqsl=50&ports=&ktip=&xl=on&submit=%CC%E1++%C8%A1"), 5000);
        String pattern = "(\\d+)\\.(\\d+)\\.(\\d+)\\.(\\d+):(\\d+)";
        Pattern r = Pattern.compile(pattern);
        Matcher m = r.matcher(parse.toString());
        List<Proxy> proxies = new ArrayList<>();
        while (m.find()) {
            String[] group = m.group().split(":");
            int port = Integer.parseInt(group[1]);
            proxies.add(new Proxy(group[0], port));
        }
        return proxies;
    }
}
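Free proxies tend to die quickly, so it can be worth filtering the list before handing it to SimpleProxyProvider. A rough sketch of such a check (my own addition; the TCP-connect test and timeout are arbitrary choices, not part of the original crawler):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

import us.codecraft.webmagic.proxy.Proxy;

public class ProxyChecker {
    /** Keep only proxies that accept a TCP connection within the timeout. */
    public static List<Proxy> filterAlive(List<Proxy> candidates, int timeoutMs) {
        List<Proxy> alive = new ArrayList<>();
        for (Proxy p : candidates) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(p.getHost(), p.getPort()), timeoutMs);
                alive.add(p);
            } catch (IOException ignored) {
                // unreachable proxy; drop it
            }
        }
        return alive;
    }
}

In main this would slot in as new SimpleProxyProvider(ProxyChecker.filterAlive(proxies, 2000)).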
With that, crawling starts right away. The code is at https://github.com/vector4wang/webmagic-quick
That's it. Another quick throwaway post...