I put together a small project that uses a WeChat mini-program to display data scraped from a website. The backend is written in Java, using Spring Boot + WebMagic as a one-stop solution: every time the frontend refreshes, the backend starts a crawler thread and immediately returns the data to the frontend, with no persistence layer.
WebMagic is an open-source Java crawler framework; official documentation: http://webmagic.io/docs/zh/
Part One: The Crawler
1. Create a Spring Boot project and add the WebMagic dependencies to the pom (slf4j-log4j12 is excluded so that WebMagic's bundled log4j binding does not clash with Spring Boot's default logging):
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>
2. Work out the regular expressions and XPath expressions for the pages you want to crawl. Regular expressions are mainly used to match URLs, while XPath is used to pull information out of HTML tags (a small standalone sketch follows the links below).
Quick introduction to regular expressions: http://deerchao.net/tutorials/regex/regex.htm
XPath tutorial: https://www.w3school.com.cn/xpath/index.asp
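To get a feel for how an XPath expression is evaluated, WebMagic's Html selector can be run directly against a string. This is a minimal sketch, not part of the project code; the HTML fragment and the value it contains are made up for illustration:

import us.codecraft.webmagic.selector.Html;

public class XpathQuickTest {
    public static void main(String[] args) {
        // A tiny fragment shaped like the AQI link on the target page (made-up value)
        Html html = new Html("<div><a class=\"cbol_aqi_num\" href=\"/news/385.html\">57</a></div>");
        // Select the text of the <a> element whose class is "cbol_aqi_num"
        String aqi = html.xpath("//a[@class='cbol_aqi_num']/text()").toString();
        System.out.println(aqi); // should print 57
    }
}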
3. Write the Java code that crawls the page. Target page: http://www.pm25.com/city/xian.html
package com.example.reptile;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.processor.PageProcessor;

public class MyReptileDemo implements PageProcessor {

    private Site site = Site.me().setRetryTimes(3).setSleepTime(100);

    public Site getSite() {
        return site;
    }

    public void process(Page page) {
        // City-wide readings shown at the top of the page
        page.putField("西安_AQI", page.getHtml().xpath("//a[@class='cbol_aqi_num' and @href='/news/385.html']/text()").toString());
        page.putField("西安_pollutant", page.getHtml().xpath("//a[@class=\"cbol_wuranwu_num \" and @href=\"/news/387.html\"]/text()").toString());
        page.putField("西安_pm25", page.getHtml().xpath("//a[@class=\"cbol_nongdu_num \" and @href=\"/news/386.html\"]/span/text()").toString());
        page.putField("西安_level", page.getHtml().xpath("//div[@class=\"cbor_gauge\"]/span/text()").toString());

        // Per-station readings: the page lists 12 monitoring stations as <li> items
        String[] positions = {"高压开关厂", "兴城小区", "纺织城", "市人民体育场", "高新西区", "经开区",
                "长安区", "阎良区", "临潼区", "草滩", "曲江文化产业集团", "广运潭"};
        for (int i = 1; i <= 12; i++) {
            String xpath = "//ul[@class=\"pj_area_data_details\" and @style=\"display: none;\"]/li[" + String.valueOf(i) + "]";
            String aqi = positions[i - 1] + "_AQI";
            page.putField(aqi, page.getHtml().xpath(xpath + "/span[@class=\"pjadt_aqi\"]/text()").toString());
            String level = positions[i - 1] + "_空气状况";
            page.putField(level, page.getHtml().xpath(xpath + "/span[@class=\"pjadt_quality\"]/em/text()").toString());
            String pollutant = positions[i - 1] + "_污染物";
            page.putField(pollutant, page.getHtml().xpath(xpath + "/a[@class=\"pjadt_wuranwu\"]/text()").toString());
            String pm25 = positions[i - 1] + "_PM2.5";
            page.putField(pm25, page.getHtml().xpath(xpath + "/span[@class=\"pjadt_pm25\"]/text()").toString());
            String pm10 = positions[i - 1] + "_PM10";
            page.putField(pm10, page.getHtml().xpath(xpath + "/span[@class=\"pjadt_pm10\"]/text()").toString());
        }

        // Leftover field used while debugging the XPath expressions
        page.putField("test", page.getHtml().xpath("/html/body/table/tbody/tr[264]/td[2]/text()").toString());
    }

    public static void main(String[] args) {
        Spider.create(new MyReptileDemo())
                .addUrl("http://www.pm25.com/city/xian.html")
                .addPipeline(new ConsolePipeline())
                .run();
    }
}
The crawled data is printed straight to the console.
That is not what we want, though, so the next step is to customize the Pipeline interface and wire the crawler into the MVC layer.
Part Two: The Web Layer
On customizing the Pipeline interface: http://webmagic.io/docs/zh/posts/ch6-custom-componenet/pipeline.html
In short, the approach is to override the process method and save the results into a static variable; the Controller method then starts the crawler thread and returns that variable to the page as a JSON string.
Code:
package com.example.reptile;

import ...

public class MyReptileDemo implements PageProcessor {
    ... // same as above, omitted
}

@Controller
@RequestMapping("test")
class ReptilePipeline implements Pipeline {

    public ReptilePipeline() {}

    // The static keyword ensures the crawled data is kept and not reclaimed by the JVM
    private static Map<String, Object> mapResults;

    @Override
    public void process(ResultItems resultItems, Task task) {
        mapResults = resultItems.getAll();
    }

    @RequestMapping("/reptile")
    @ResponseBody // serializes the returned POJO to a JSON string for the web response
    public Map<String, Object> getReptile() {
        // Start the crawler thread from the web request; run() blocks until the crawl
        // finishes, so mapResults has been filled by process() before it is returned
        Spider.create(new MyReptileDemo())
                .addUrl("http://www.pm25.com/city/xian.html")
                .addPipeline(new ReptilePipeline())
                .run();
        return mapResults;
    }
}
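With the application running, requesting http://localhost:8080/test/reptile (the same URL the mini-program calls below) should return all of the crawled fields as a single JSON object.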
Part Three: The Frontend
Developed with WeChat DevTools; the frontend is just wxml + js. The event handler goes in the js file:
onLoad: function (e) {
    var th = this;
    wx.request({
        // The crawler endpoint exposed by the Spring Boot backend
        url: 'http://localhost:8080/test/reptile',
        method: 'GET',
        header: {
            'content-type': 'application/json'
        },
        success(result) {
            console.log(result.data)
            // Expose the returned key-value map to the wxml template
            th.setData({
                map: result.data
            })
        }
    })
}
The returned JSON is stored in map as key-value pairs. In the wxml file, a value can then be read with an expression like {{map.xian_api}}, as long as the property name matches one of the field keys put into the result map.
The other pages were not my work. The final result looks like this: