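I am not especially fond of "black box" frameworks like Spring Boot, so I generally avoid them in production projects. I happened to have an experimental crawler project in its early stage, so I tried integrating WebMagic with Spring Boot to see whether it would change my earlier impression.
I followed the blog post "Eclipse中spring boot的安装和创建简单的Web应用" (installing Spring Boot in Eclipse and creating a simple web application) and installed the Spring Boot plugin through Eclipse Marketplace.
Create a Spring Boot project with the MyBatis, MySQL, Redis, and Web dependencies selected.
Select the driver for the database you actually use at this step (MySQL in my case). Otherwise the MySQL driver will not be found later and the dependency has to be added by hand.
Create the project and add a simple Controller to test it: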
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
/**
 * @author Zln
 * @version 2018-12-24 11:12
 *
 * Sample controller
 */
@RestController
@RequestMapping("/sample")
public class SampleController {
    // Handles GET /sample/hello?name=...; "name" is bound from the query string
    @RequestMapping("/hello")
    public String hello(String name) {
        return "Hello " + name;
    }
}
Running SpiderApplication with Run As -> Spring Boot App produces this error:
***************************
APPLICATION FAILED TO START
***************************
Description:
Failed to configure a DataSource: 'url' attribute is not specified and no embedded datasource could be configured.
Reason: Failed to determine a suitable driver class
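It turned out that no database connection had been configured yet. Because the MyBatis dependency was added, Spring Boot tries to set up a data source at startup. You can either disable the database configuration for now or add the connection settings to src/main/resources/application.properties.
Since an existing project database was available locally, I chose to add the database configuration: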
spring.datasource.url=jdbc:mysql://192.168.8.10:3307/hyms?useUnicode=true&zeroDateTimeBehavior=convertToNull&autoReconnect=true
spring.datasource.username=hyms
spring.datasource.password=hyms
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
With the connection configured, run SpiderApplication again. It now starts normally. Open the following URL in a browser:
http://localhost:8080/sample/hello?name=test
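The page shows the expected output.
Spring Boot embeds Tomcat, with default port 8080 and context path "/". That conflicts with my regular projects, so change it first by adding the following to application.properties: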
server.port=8081
server.servlet.context-path=/spider
Open the page again in the browser:
http://localhost:8081/spider/sample/hello?name=test1
The page output is as expected.
WebMagic official site: http://webmagic.io
Add the Maven dependencies:
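<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
</dependency>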
Following the sample code on the official site, I implemented a PageProcessor as a test. Running it produced the following exception:
14:37:43.356 [pool-1-thread-1] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Connection released: [id: 0][route: {s}->https://github.com:443][total kept alive: 0; route allocated: 0 of 100; total allocated: 0 of 1]
14:37:43.358 [pool-1-thread-1] WARN us.codecraft.webmagic.downloader.HttpClientDownloader - download page https://github.com/code4craft error
javax.net.ssl.SSLException: Received fatal alert: protocol_version
at sun.security.ssl.Alerts.getSSLException(Alerts.java:208)
at sun.security.ssl.Alerts.getSSLException(Alerts.java:154)
at sun.security.ssl.SSLSocketImpl.recvAlert(SSLSocketImpl.java:2023)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1125)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:396)
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:355)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373)
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:394)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at us.codecraft.webmagic.downloader.HttpClientDownloader.download(HttpClientDownloader.java:85)
at us.codecraft.webmagic.Spider.processRequest(Spider.java:404)
at us.codecraft.webmagic.Spider.access$000(Spider.java:61)
at us.codecraft.webmagic.Spider$1.run(Spider.java:320)
at us.codecraft.webmagic.thread.CountableThreadPool$1.run(CountableThreadPool.java:74)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
14:37:44.363 [main] INFO us.codecraft.webmagic.Spider - Spider github.com closed! 1 pages downloaded.
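After some searching, it turns out that version 0.7.3 has a problem fetching HTTPS sites that only support TLS 1.2 (I have not tried other versions, so I do not know whether they are affected). The fix is to modify HttpClientGenerator yourself, create an HttpClientDownloader that uses the modified HttpClientGenerator, and pass this new HttpClientDownloader in when creating the Spider.
The author has already replied to the related issue on GitHub (see "Https下无法抓取只支持TLS1.2的站点"), and the source code currently on GitHub already contains the fix.
Download HttpClientDownloader.java and HttpClientGenerator.java from https://github.com/code4craft/webmagic/tree/master/webmagic-core/src/main/java/us/codecraft/webmagic/downloader, put them into the project, and specify the new HttpClientDownloader when creating the Spider. Running again, the content from the sample is read correctly.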
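Before patching the downloader, it can help to confirm what the local JVM itself supports. The following is a minimal JDK-only sketch (the class name TlsProtocolCheck is hypothetical, and nothing here touches WebMagic or HttpClient): it lists the TLS protocol versions the JVM supports and the ones it enables by default. If TLSv1.2 shows up as supported, a protocol_version alert more likely comes from how the HTTP client restricts its protocols than from the JVM.

```java
import java.security.GeneralSecurityException;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLContext;

// Hypothetical helper for illustration; not part of WebMagic.
class TlsProtocolCheck {

    // Protocol versions this JVM is able to speak.
    static List<String> supportedProtocols() {
        try {
            SSLContext ctx = SSLContext.getInstance("TLSv1.2");
            ctx.init(null, null, null); // default key managers, trust managers, and randomness
            return Arrays.asList(ctx.getSupportedSSLParameters().getProtocols());
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("TLSv1.2 context could not be created", e);
        }
    }

    // Protocol versions enabled when a client does not choose explicitly.
    static List<String> defaultEnabledProtocols() {
        try {
            return Arrays.asList(SSLContext.getDefault().getDefaultSSLParameters().getProtocols());
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("default SSLContext is unavailable", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("supported: " + supportedProtocols());
        System.out.println("enabled by default: " + defaultEnabledProtocols());
    }
}
```

The output depends on the JDK version and its security configuration, so compare it against the protocols the target site accepts.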
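When copying the code from the official site, I did not notice that the new GithubRepoPageProcessor() after Spider.create is the processor bundled with the sample. My later changes kept having no effect until I spotted it, so watch out for this: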
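Spider.create(new SampleProcessor())
        // Use the patched HttpClientDownloader that was copied into the project
        .setDownloader(new HttpClientDownloader())
        // Start crawling from "https://github.com/code4craft"
        .addUrl("https://github.com/code4craft")
        // Crawl with 1 thread
        .thread(1)
        // Start the crawler
        .run();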
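Alternatively, rebuild the project from GitHub and deploy it to your local repository.
By inspecting https://h5.ele.me/, I found the shop list API URL https://h5.ele.me/restapi/shopping/v3/restaurants?latitude=31.032697&longitude=121.216669&offset=0&limit=8&extras[]=activities&extras[]=tags&extra_filters=home&rank_id=&terminal=h5, which serves as the crawler entry point.
Ele.me uses the AMap (Gaode Map) coordinate system. The latitude and longitude of an address can be obtained through the API that AMap provides, or picked manually on the page https://lbs.amap.com/api/javascript-api/example/map/click-to-get-lnglat.
For shop details and menu information, visit https://www.ele.me and click a shop to get a URL such as https://www.ele.me/shop/E1209616406188511795. The part after shop/ is the shop ID, which can be replaced with IDs from the shop list fetched in the first step. Shop IDs may expire, and old IDs can stop working, so it is better to build the URLs from freshly fetched list data. I will not go through the JSON of these two responses in detail: fetch them in a browser and pretty-print the result, and the structure is fairly clear.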
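The URL patterns described above can be captured in a few helper methods. This is a JDK-only sketch: the class and method names are hypothetical, the query string keeps only parameters that appear in the URLs quoted in this post, and the idea that later pages are reached by advancing offset by limit is an assumption that has not been verified against the API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical URL helpers; endpoints and parameters are copied from the URLs in this post.
class ElemeUrlSketch {

    // Shop list API URL for one page of results around a coordinate.
    static String shopListUrl(String latitude, String longitude, int offset, int limit) {
        return "https://h5.ele.me/restapi/shopping/v3/restaurants?latitude=" + latitude
                + "&longitude=" + longitude
                + "&offset=" + offset + "&limit=" + limit
                + "&extras[]=activities&extras[]=tags&terminal=h5";
    }

    // Assumption: page n starts at offset n * limit.
    static List<String> shopListPages(String latitude, String longitude, int pages, int limit) {
        List<String> urls = new ArrayList<>();
        for (int page = 0; page < pages; page++) {
            urls.add(shopListUrl(latitude, longitude, page * limit, limit));
        }
        return urls;
    }

    // Shop page URL built from an ID taken from freshly fetched list data.
    static String shopPageUrl(String shopId) {
        return "https://www.ele.me/shop/" + shopId;
    }

    public static void main(String[] args) {
        shopListPages("31.032697", "121.216669", 3, 8).forEach(System.out::println);
        System.out.println(shopPageUrl("E1209616406188511795"));
    }
}
```

Building the shop URL at crawl time, rather than storing it, follows the advice above about shop IDs expiring.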
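JsonPath could not parse the JsonArray in the dish data (I do not know the exact reason), so I used FastJson, which the framework already pulls in.
Create the WebMagic Processor. Its constructor takes the longitude and latitude and a flag for whether to crawl dish information: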
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import com.alibaba.fastjson.JSONPath;
import com.zln.spider.pojo.ElemeFood;
import com.zln.spider.pojo.ElemeShop;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
/**
 * @author Zln
 * @version 2018-12-24 14:47
 *
 * Ele.me shop crawler
 */
public class ElemeShopProcessor implements PageProcessor {
    private static final Logger logger = LoggerFactory.getLogger(ElemeShopProcessor.class);

    private Site site = Site.me()
            .setDomain("ele.me")
            .setSleepTime(100)
            .setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36");
    private String longitude;
    private String latitude;
    private boolean crawlMenu = false; // whether to crawl menu information

    /*
     * Type of data produced by one process() call
     */
    public enum ProcessType {
        SHOP, FOOD
    }

    /**
     * Creates the Ele.me shop crawler
     * @param longitude longitude
     * @param latitude latitude
     * @param crawlMenu whether to crawl menus
     */
    public ElemeShopProcessor(String longitude, String latitude,
            boolean crawlMenu) {
        this.longitude = longitude;
        this.latitude = latitude;
        this.crawlMenu = crawlMenu;
    }

    @Override
    public void process(Page page) {
        // H5 shop list data
        if (page.getUrl()
                .regex("https://h5\\.ele\\.me/restapi/shopping/v3/restaurants+")
                .match()) {
            logger.info("Parsing H5 shop list data");
            logger.debug(page.getRawText());
            JSONObject joData = JSON.parseObject(page.getRawText());
            JSONArray jaRestaurant = (JSONArray) JSONPath.eval(joData, "$.items.restaurant");
            // Parse the restaurant data
            if (jaRestaurant != null) {
                List<ElemeShop> listShops = new ArrayList<>();
                for (Object oRest : jaRestaurant) {
                    // Convert each restaurant entry into an ElemeShop object
                    JSONObject joRest = (JSONObject) oRest;
                    ElemeShop objShop = JSONObject.toJavaObject(joRest, ElemeShop.class);
                    logger.debug(objShop.toString());
                    listShops.add(objShop);
                    if (crawlMenu) {
                        // Menus are wanted: queue the menu URL as a follow-up target request
                        String targetUrl = "https://www.ele.me/restapi/shopping/v2/menu?restaurant_id=" + objShop.getId() + "&terminal=web";
                        page.addTargetRequest(targetUrl);
                    }
                }
                page.putField("type", ProcessType.SHOP);
                page.putField("listShops", listShops);
            }
        } else {
            logger.info("Parsing dishes");
            logger.debug(page.getRawText());
            Set<String> setItemIds = new HashSet<>(); // item IDs seen so far, used to de-duplicate dishes
            JSONArray jaMenuGroup = JSON.parseArray(page.getRawText());
            if (jaMenuGroup != null) {
                List<ElemeFood> listFoods = new ArrayList<>();
                for (Object oMenuGroup : jaMenuGroup) {
                    JSONArray jaFoods = (JSONArray) JSONPath.eval(oMenuGroup, "$.foods");
                    if (jaFoods == null) {
                        // Menu group without a foods array: nothing to parse
                        continue;
                    }
                    for (Object oFood : jaFoods) {
                        JSONObject joFood = (JSONObject) oFood;
                        // Read the ID of each dish
                        String strItemId = (String) JSONPath.eval(joFood, "$.item_id"); // itemId
                        // Set.add returns false for an ID already seen: skip the duplicate dish
                        if (!setItemIds.add(strItemId)) {
                            continue;
                        }
                        ElemeFood objFood = JSONObject.toJavaObject(joFood, ElemeFood.class);
                        logger.debug(objFood.toString());
                        listFoods.add(objFood);
                    }
                }
                page.putField("type", ProcessType.FOOD);
                page.putField("listFoods", listFoods);
            }
        }
    }

    @Override
    public Site getSite() {
        return site;
    }

    /**
     * Builds the start URL
     * @return the shop list API URL used as the crawl entry point
     */
    public String getUrl() {
        // Shop search page https://www.ele.me/place/wtw0w37dxs0r?latitude=31.032709&longitude=121.217287
        // Shop list JSON https://h5.ele.me/restapi/shopping/v3/restaurants?latitude=31.107641&longitude=121.252976&offset=0&limit=1&extras[]=activities&extras[]=tags&terminal=h5
        // Shop page https://www.ele.me/shop/E7326872827353281855
        /*String url = "https://www.ele.me/place/";
        String geohash = "wtw0w37dxs0r";*/
        String url = "https://h5.ele.me/restapi/shopping/v3/restaurants?latitude=" + latitude + "&longitude=" + longitude + "&offset=0&limit=3&extras[]=activities&extras[]=tags&terminal=h5";
        return url;
    }
}
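The de-duplication step in the dish branch of the processor can be looked at in isolation. Below is a small JDK-only sketch of the same idea (the class and method names are hypothetical): item IDs are collected in a HashSet, and the boolean returned by Set.add tells whether an ID is new. Presumably this is needed because the same dish can be listed in more than one menu group, though the post does not say so explicitly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone version of the item_id de-duplication.
class ItemDedupSketch {

    // Keeps the first occurrence of each ID and preserves the original order.
    static List<String> firstOccurrences(List<String> itemIds) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String id : itemIds) {
            // add() returns true only when the ID was not in the set yet
            if (seen.add(id)) {
                unique.add(id);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        System.out.println(firstOccurrences(Arrays.asList("a", "b", "a", "c", "b")));
    }
}
```

Using the return value of add avoids the separate contains check followed by add, with the same result.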
The three POJO classes used for fastjson parsing:
import com.alibaba.fastjson.annotation.JSONField;
/**
* @author Zln
* @version 2018-12-28 11:14
*
* Ele.me shop object
*/
public class ElemeShop {
private String id;
private String name;
private String address;
private String latitude;
private String longitude;
private String phone;
private Double rating;
@JSONField(name = "rating_count")
private Integer ratingCount;
public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public String getAddress() {
return address;
}
public void setAddress(String address) {
this.address = address;
}
public String getLatitude() {
return latitude;
}
public void setLatitude(String latitude) {
this.latitude = latitude;
}
public String getLongitude() {
return longitude;
}
public void setLongitude(String longitude) {
this.longitude = longitude;
}
public String getPhone() {
return phone;
}
public void setPhone(String phone) {
this.phone = phone;
}
public Double getRating() {
return rating;
}
public void setRating(Double rating) {
this.rating = rating;
}
public Integer getRatingCount() {
return ratingCount;
}
public void setRatingCount(Integer ratingCount) {
this.ratingCount = ratingCount;
}
@Override
public String toString() {
return "ElemeShop [id=" + id + ", name=" + name + ", address=" + address
+ ", latitude=" + latitude + ", longitude=" + longitude
+ ", phone=" + phone + ", rating=" + rating + ", ratingCount="
+ ratingCount + "]";
}
}
import java.util.List;
import com.alibaba.fastjson.annotation.JSONField;
/**
* @author Zln
* @version 2018-12-28 14:18
*
* Ele.me dish
*/
public class ElemeFood {
@JSONField(name = "item_id")
private String itemId;
private String name;
private Double rating;
@JSONField(name = "rating_count")
private Long ratingCount;
private String description;
private List<ElemeFoodSpec> specfoods;
public String getItemId() {
return itemId;
}
public void setItemId(String itemId) {
this.itemId = itemId;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public Double getRating() {
return rating;
}
public void setRating(Double rating) {
this.rating = rating;
}
public Long getRatingCount() {
return ratingCount;
}
public void setRatingCount(Long ratingCount) {
this.ratingCount = ratingCount;
}
public String getDescription() {
return description;
}
public void setDescription(String description) {
this.description = description;
}
public List<ElemeFoodSpec> getSpecfoods() {
return specfoods;
}
public void setSpecfoods(List<ElemeFoodSpec> specfoods) {
this.specfoods = specfoods;
}
@Override
public String toString() {
return "ElemeFood [itemId=" + itemId + ", name=" + name + ", rating="
+ rating + ", ratingCount=" + ratingCount + ", description="
+ description + ", specfoods=" + specfoods + "]";
}
}
import java.math.BigDecimal;
import java.util.List;
import com.alibaba.fastjson.annotation.JSONField;
/**
* @author Zln
* @version 2018-12-28 14:25
*
* Dish specification
*/
public class ElemeFoodSpec {
private String name;
@JSONField(name = "food_id")
private String foodId;
@JSONField(name = "item_id")
private String itemId;
@JSONField(name = "original_price")
private BigDecimal originalPrice;
@JSONField(name = "packing_fee")
private BigDecimal packingFee;
private BigDecimal price;
@JSONField(name = "sku_id")
private String skuId;
@JSONField(name = "restaurant_id")
private String restaurantId;
private List<Spec> specs; // specification names
public String getFoodId() {
return foodId;
}
public void setFoodId(String foodId) {
this.foodId = foodId;
}
public String getItemId() {
return itemId;
}
public void setItemId(String itemId) {
this.itemId = itemId;
}
public BigDecimal getOriginalPrice() {
return originalPrice;
}
public void setOriginalPrice(BigDecimal originalPrice) {
this.originalPrice = originalPrice;
}
public BigDecimal getPackingFee() {
return packingFee;
}
public void setPackingFee(BigDecimal packingFee) {
this.packingFee = packingFee;
}
public BigDecimal getPrice() {
return price;
}
public void setPrice(BigDecimal price) {
this.price = price;
}
public String getSkuId() {
return skuId;
}
public void setSkuId(String skuId) {
this.skuId = skuId;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public String getRestaurantId() {
return restaurantId;
}
public void setRestaurantId(String restaurantId) {
this.restaurantId = restaurantId;
}
public List<Spec> getSpecs() {
return specs;
}
public void setSpecs(List<Spec> specs) {
this.specs = specs;
}
// Returns the value of the first specification, or null when there is none
public String getSpecName() {
if(this.specs != null && !this.specs.isEmpty()) {
return this.specs.get(0).getValue();
}
else {
return null;
}
}
@Override
public String toString() {
return "ElemeFoodSpec [name=" + name + ", foodId=" + foodId
+ ", itemId=" + itemId + ", originalPrice=" + originalPrice
+ ", packingFee=" + packingFee + ", price=" + price + ", skuId="
+ skuId + ", restaurantId=" + restaurantId + ", specs=" + specs
+ "]";
}
/**
* Specification
*
* @author Zhouluning
*
*/
public static class Spec {
private String value; // specification name
public String getValue() {
return value;
}
public void setValue(String value) {
this.value = value;
}
@Override
public String toString() {
return "Spec [value=" + value + "]";
}
}
}
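Create the Pipeline. The WebMagic documentation describes how to use a Pipeline together with Spring, but I did not fully understand it. I also found examples online that use the Pipeline directly as a Service, which does not feel quite right either, so I wrote it according to my own understanding. After each call to ElemeShopProcessor.process, the parsed data is stored with page.putField and then read back in the pipeline with ResultItems.get. The code below only covers saving shops; saving dishes is similar.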
import java.util.Date;
import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.BeanUtils;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import com.zln.spider.eleme.ElemeShopProcessor.ProcessType;
import com.zln.spider.entity.DcShopInfo;
import com.zln.spider.entity.DcTask;
import com.zln.spider.mapper.DcShopInfoMapper;
import com.zln.spider.pojo.ElemeFood;
import com.zln.spider.pojo.ElemeShop;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;
/**
 * @author Zln
 * @version 2018-12-28 13:28
 *
 * Ele.me shop information pipeline
 */
@Component("ElemeShopPipeline")
public class ElemeShopPipeline implements Pipeline {
    private static final Logger logger = LoggerFactory.getLogger(ElemeShopPipeline.class);

    @Autowired
    DcShopInfoMapper dcShopInfoMapper;

    @SuppressWarnings("unchecked")
    @Override
    public void process(ResultItems rs, Task task) {
        if (rs.get("type") != null) {
            ProcessType type = (ProcessType) rs.get("type");
            switch (type) {
            case SHOP:
                // Handle shop information
                List<ElemeShop> listShops = (List<ElemeShop>) rs.get("listShops");
                for (ElemeShop elemeShop : listShops) {
                    logger.info("save!" + elemeShop);
                    saveShopInfo(elemeShop);
                }
                break;
            case FOOD:
                // Handle dish information (only logged here; saving is omitted)
                List<ElemeFood> listFoods = (List<ElemeFood>) rs.get("listFoods");
                for (ElemeFood elemeFood : listFoods) {
                    logger.info("save!" + elemeFood);
                }
                break;
            default:
                logger.error("Unknown process type!");
                break;
            }
        } else {
            // Nothing was extracted from the page
            logger.error("No data to process was extracted!");
        }
    }

    private void saveShopInfo(ElemeShop elemeShop) {
        DcShopInfo objDcShopInfo = new DcShopInfo();
        // Copies same-named properties with compatible types; the String id is set separately below
        BeanUtils.copyProperties(elemeShop, objDcShopInfo);
        objDcShopInfo.setChannelShopId(elemeShop.getId());
        objDcShopInfo.setCreateTime(new Date());
        dcShopInfoMapper.insert(objDcShopInfo);
    }
}
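The handoff between processor and pipeline boils down to a keyed map plus a type tag. The following plain-Java model is only meant to make that flow concrete: a LinkedHashMap stands in for WebMagic's ResultItems, the class and method names are hypothetical, and none of it is WebMagic API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the putField / ResultItems.get handoff; not WebMagic code.
class ResultHandoffSketch {

    enum ProcessType { SHOP, FOOD }

    // Processor side: store the type tag and the parsed list under agreed keys.
    static Map<String, Object> processorOutput(ProcessType type, List<String> names) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("type", type);
        fields.put(type == ProcessType.SHOP ? "listShops" : "listFoods", names);
        return fields;
    }

    // Pipeline side: read the tag first, then fetch the list that belongs to it.
    static String pipelineDispatch(Map<String, Object> fields) {
        Object type = fields.get("type");
        if (!(type instanceof ProcessType)) {
            return "nothing to process";
        }
        switch ((ProcessType) type) {
            case SHOP:
                return "saved " + ((List<?>) fields.get("listShops")).size() + " shops";
            case FOOD:
                return "saved " + ((List<?>) fields.get("listFoods")).size() + " foods";
            default:
                return "unknown type";
        }
    }

    public static void main(String[] args) {
        Map<String, Object> fields = processorOutput(ProcessType.SHOP, Arrays.asList("Shop A", "Shop B"));
        System.out.println(pipelineDispatch(fields));
    }
}
```

The type tag is what lets a single pipeline serve both page kinds; an alternative design would register one pipeline per data type and have each ignore results it does not recognize.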
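Use MyBatis Generator to generate the corresponding entity and mapper. For integrating MyBatis with Spring Boot, see the post "spring boot(六):如何优雅的使用mybatis" (Spring Boot part 6: using MyBatis elegantly).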
import java.util.Date;
public class DcShopInfo {
/**
*
* DC_SHOP_INFO.ID
*
* @mbg.generated
*/
private Long id;
/**
* Task ID
* DC_SHOP_INFO.TASK_ID
*
* @mbg.generated
*/
private Long taskId;
/**
* Data channel
* 1: Ele.me
* 2: Meituan
* DC_SHOP_INFO.CHANNEL
*
* @mbg.generated
*/
private Integer channel;
/**
* Shop name
* DC_SHOP_INFO.NAME
*
* @mbg.generated
*/
private String name;
/**
* Shop address
* DC_SHOP_INFO.ADDRESS
*
* @mbg.generated
*/
private String address;
/**
* Shop coordinates: latitude
* DC_SHOP_INFO.LATITUDE
*
* @mbg.generated
*/
private String latitude;
/**
* Shop coordinates: longitude
* DC_SHOP_INFO.LONGITUDE
*
* @mbg.generated
*/
private String longitude;
/**
* Contact phone number
* DC_SHOP_INFO.PHONE
*
* @mbg.generated
*/
private String phone;
/**
* Rating
* DC_SHOP_INFO.RATING
*
* @mbg.generated
*/
private Double rating;
/**
* Number of ratings
* DC_SHOP_INFO.RATING_COUNT
*
* @mbg.generated
*/
private Integer ratingCount;
/**
* Shop ID in the source channel
* DC_SHOP_INFO.CHANNEL_SHOP_ID
*
* @mbg.generated
*/
private String channelShopId;
/**
* Creation date
* DC_SHOP_INFO.CREATE_TIME
*
* @mbg.generated
*/
private Date createTime;
/**
*
* @mbg.generated
*/
public Long getId() {
return id;
}
/**
*
* @mbg.generated
*/
public void setId(Long id) {
this.id = id;
}
/**
*
* @mbg.generated
*/
public Long getTaskId() {
return taskId;
}
/**
*
* @mbg.generated
*/
public void setTaskId(Long taskId) {
this.taskId = taskId;
}
/**
*
* @mbg.generated
*/
public Integer getChannel() {
return channel;
}
/**
*
* @mbg.generated
*/
public void setChannel(Integer channel) {
this.channel = channel;
}
/**
*
* @mbg.generated
*/
public String getName() {
return name;
}
/**
*
* @mbg.generated
*/
public void setName(String name) {
this.name = name;
}
/**
*
* @mbg.generated
*/
public String getAddress() {
return address;
}
/**
*
* @mbg.generated
*/
public void setAddress(String address) {
this.address = address;
}
/**
*
* @mbg.generated
*/
public String getLatitude() {
return latitude;
}
/**
*
* @mbg.generated
*/
public void setLatitude(String latitude) {
this.latitude = latitude;
}
/**
*
* @mbg.generated
*/
public String getLongitude() {
return longitude;
}
/**
*
* @mbg.generated
*/
public void setLongitude(String longitude) {
this.longitude = longitude;
}
/**
*
* @mbg.generated
*/
public String getPhone() {
return phone;
}
/**
*
* @mbg.generated
*/
public void setPhone(String phone) {
this.phone = phone;
}
/**
*
* @mbg.generated
*/
public Double getRating() {
return rating;
}
/**
*
* @mbg.generated
*/
public void setRating(Double rating) {
this.rating = rating;
}
/**
*
* @mbg.generated
*/
public Integer getRatingCount() {
return ratingCount;
}
/**
*
* @mbg.generated
*/
public void setRatingCount(Integer ratingCount) {
this.ratingCount = ratingCount;
}
/**
*
* @mbg.generated
*/
public String getChannelShopId() {
return channelShopId;
}
/**
*
* @mbg.generated
*/
public void setChannelShopId(String channelShopId) {
this.channelShopId = channelShopId;
}
/**
*
* @mbg.generated
*/
public Date getCreateTime() {
return createTime;
}
/**
*
* @mbg.generated
*/
public void setCreateTime(Date createTime) {
this.createTime = createTime;
}
}
import java.util.List;
import org.apache.ibatis.annotations.Delete;
import org.apache.ibatis.annotations.Insert;
import org.apache.ibatis.annotations.Result;
import org.apache.ibatis.annotations.Results;
import org.apache.ibatis.annotations.Select;
import org.apache.ibatis.annotations.Update;
import org.apache.ibatis.type.JdbcType;
import com.zln.spider.entity.DcShopInfo;
public interface DcShopInfoMapper {
/**
* This method was generated by MyBatis Generator.
* This method corresponds to the database table dc_shop_info
*
* @mbg.generated
*/
@Delete({
"delete from dc_shop_info",
"where ID = #{id,jdbcType=BIGINT}"
})
int deleteByPrimaryKey(Long id);
/**
* This method was generated by MyBatis Generator.
* This method corresponds to the database table dc_shop_info
*
* @mbg.generated
*/
@Insert({
"insert into dc_shop_info (ID, TASK_ID, ",
"CHANNEL, NAME, ADDRESS, ",
"LATITUDE, LONGITUDE, ",
"PHONE, RATING, RATING_COUNT, ",
"CHANNEL_SHOP_ID, CREATE_TIME)",
"values (#{id,jdbcType=BIGINT}, #{taskId,jdbcType=BIGINT}, ",
"#{channel,jdbcType=INTEGER}, #{name,jdbcType=VARCHAR}, #{address,jdbcType=VARCHAR}, ",
"#{latitude,jdbcType=VARCHAR}, #{longitude,jdbcType=VARCHAR}, ",
"#{phone,jdbcType=VARCHAR}, #{rating,jdbcType=DOUBLE}, #{ratingCount,jdbcType=INTEGER}, ",
"#{channelShopId,jdbcType=VARCHAR}, #{createTime,jdbcType=TIMESTAMP})"
})
int insert(DcShopInfo record);
/**
* This method was generated by MyBatis Generator.
* This method corresponds to the database table dc_shop_info
*
* @mbg.generated
*/
@Select({
"select",
"ID, TASK_ID, CHANNEL, NAME, ADDRESS, LATITUDE, LONGITUDE, PHONE, RATING, RATING_COUNT, ",
"CHANNEL_SHOP_ID, CREATE_TIME",
"from dc_shop_info",
"where ID = #{id,jdbcType=BIGINT}"
})
@Results({
@Result(column="ID", property="id", jdbcType=JdbcType.BIGINT, id=true),
@Result(column="TASK_ID", property="taskId", jdbcType=JdbcType.BIGINT),
@Result(column="CHANNEL", property="channel", jdbcType=JdbcType.INTEGER),
@Result(column="NAME", property="name", jdbcType=JdbcType.VARCHAR),
@Result(column="ADDRESS", property="address", jdbcType=JdbcType.VARCHAR),
@Result(column="LATITUDE", property="latitude", jdbcType=JdbcType.VARCHAR),
@Result(column="LONGITUDE", property="longitude", jdbcType=JdbcType.VARCHAR),
@Result(column="PHONE", property="phone", jdbcType=JdbcType.VARCHAR),
@Result(column="RATING", property="rating", jdbcType=JdbcType.DOUBLE),
@Result(column="RATING_COUNT", property="ratingCount", jdbcType=JdbcType.INTEGER),
@Result(column="CHANNEL_SHOP_ID", property="channelShopId", jdbcType=JdbcType.VARCHAR),
@Result(column="CREATE_TIME", property="createTime", jdbcType=JdbcType.TIMESTAMP)
})
DcShopInfo selectByPrimaryKey(Long id);
/**
* This method was generated by MyBatis Generator.
* This method corresponds to the database table dc_shop_info
*
* @mbg.generated
*/
@Select({
"select",
"ID, TASK_ID, CHANNEL, NAME, ADDRESS, LATITUDE, LONGITUDE, PHONE, RATING, RATING_COUNT, ",
"CHANNEL_SHOP_ID, CREATE_TIME",
"from dc_shop_info"
})
@Results({
@Result(column="ID", property="id", jdbcType=JdbcType.BIGINT, id=true),
@Result(column="TASK_ID", property="taskId", jdbcType=JdbcType.BIGINT),
@Result(column="CHANNEL", property="channel", jdbcType=JdbcType.INTEGER),
@Result(column="NAME", property="name", jdbcType=JdbcType.VARCHAR),
@Result(column="ADDRESS", property="address", jdbcType=JdbcType.VARCHAR),
@Result(column="LATITUDE", property="latitude", jdbcType=JdbcType.VARCHAR),
@Result(column="LONGITUDE", property="longitude", jdbcType=JdbcType.VARCHAR),
@Result(column="PHONE", property="phone", jdbcType=JdbcType.VARCHAR),
@Result(column="RATING", property="rating", jdbcType=JdbcType.DOUBLE),
@Result(column="RATING_COUNT", property="ratingCount", jdbcType=JdbcType.INTEGER),
@Result(column="CHANNEL_SHOP_ID", property="channelShopId", jdbcType=JdbcType.VARCHAR),
@Result(column="CREATE_TIME", property="createTime", jdbcType=JdbcType.TIMESTAMP)
})
List<DcShopInfo> selectAll();
/**
* This method was generated by MyBatis Generator.
* This method corresponds to the database table dc_shop_info
*
* @mbg.generated
*/
@Update({
"update dc_shop_info",
"set TASK_ID = #{taskId,jdbcType=BIGINT},",
"CHANNEL = #{channel,jdbcType=INTEGER},",
"NAME = #{name,jdbcType=VARCHAR},",
"ADDRESS = #{address,jdbcType=VARCHAR},",
"LATITUDE = #{latitude,jdbcType=VARCHAR},",
"LONGITUDE = #{longitude,jdbcType=VARCHAR},",
"PHONE = #{phone,jdbcType=VARCHAR},",
"RATING = #{rating,jdbcType=DOUBLE},",
"RATING_COUNT = #{ratingCount,jdbcType=INTEGER},",
"CHANNEL_SHOP_ID = #{channelShopId,jdbcType=VARCHAR},",
"CREATE_TIME = #{createTime,jdbcType=TIMESTAMP}",
"where ID = #{id,jdbcType=BIGINT}"
})
int updateByPrimaryKey(DcShopInfo record);
}
Create a test case, or add a scheduled task, to run the finished crawler:
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.junit4.SpringRunner;
import com.zln.spider.eleme.ElemeShopPipeline;
import com.zln.spider.eleme.ElemeShopProcessor;
import us.codecraft.webmagic.Spider;
@RunWith(SpringRunner.class)
@SpringBootTest
public class SpiderApplicationTests {
@Qualifier("ElemeShopPipeline")
@Autowired
ElemeShopPipeline elemeShopPipeline;
@Test
public void testElemeShopSpider() {
String latitude = "31.032697";
String longitude = "121.216669";
ElemeShopProcessor processor = new ElemeShopProcessor(longitude,latitude, false);
Spider.create(processor).addPipeline(elemeShopPipeline)
.addUrl(processor.getUrl())
// Crawl with 1 thread
.thread(1)
// Start the crawler
.run();
}
}
After it runs, the crawled shop information can be seen in the database.
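For the scheduled variant, the post does not show any code. As one possible shape, here is a JDK-only sketch built on ScheduledExecutorService rather than on Spring's scheduling support. The class name, the method name, and the idea of passing the crawl as a Runnable are all assumptions made for illustration; in the real project the Runnable would wrap the Spider call from the test above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical scheduler wrapper; the crawl itself is supplied by the caller.
class ScheduledCrawlSketch {

    // Runs the job up to `runs` times with a fixed delay and returns how many runs succeeded.
    static int runRepeatedly(Runnable crawlJob, int runs, long delayMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch remaining = new CountDownLatch(runs);
        AtomicInteger completed = new AtomicInteger();
        scheduler.scheduleWithFixedDelay(() -> {
            if (remaining.getCount() == 0) {
                return; // all requested runs are done; ignore a late trigger
            }
            try {
                crawlJob.run();
                completed.incrementAndGet();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all later runs, so report it and go on
                System.err.println("crawl run failed: " + e);
            } finally {
                remaining.countDown();
            }
        }, 0, delayMillis, TimeUnit.MILLISECONDS);
        try {
            remaining.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdownNow();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        int done = runRepeatedly(() -> System.out.println("crawl run"), 3, 100);
        System.out.println("completed runs: " + done);
    }
}
```

A fixed delay, measured from the end of one run to the start of the next, keeps runs from overlapping when a crawl takes longer than expected.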
WebMagic official site